In this example, we will construct a nearest neighbor classifier for the Iris data. First, we load the data set as before, and split it into the training (N=100) and testing (N=50) data sets.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report
# Loading data
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
# Splitting the data into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=50,
random_state=2020)
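As a quick sanity check, the sizes of the two sets and the class counts in each can be printed. The sketch below reloads the data so that it runs on its own; the names counts_train and counts_test are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=50, random_state=2020)

# An integer test_size is read as an absolute number of samples
print(X_train.shape, X_test.shape)

# Number of samples per species in each set (the split is not stratified by default)
counts_train = np.bincount(y_train, minlength=3)
counts_test = np.bincount(y_test, minlength=3)
print(counts_train, counts_test)
```

If equal class proportions in both sets are wanted, train_test_split also accepts stratify=y.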
Next, we define a nearest neighbor classifier object with KNeighborsClassifier, available in the sklearn.neighbors module. When defining the classifier object kNN, we set the number of neighbors to k=5 and use uniform weighting for the weights parameter.
kNN = KNeighborsClassifier(5, weights='uniform')
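To make the meaning of k=5 with uniform weighting concrete, here is a minimal NumPy sketch of an unweighted majority vote among the nearest training points. It assumes Euclidean distance and non-negative integer labels; the function knn_predict_uniform and the toy data are hypothetical and are not how scikit-learn implements the search internally.

```python
import numpy as np

def knn_predict_uniform(X_train, y_train, X_new, k=5):
    """Predict labels by an unweighted majority vote among the k nearest points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    preds = []
    for x in np.asarray(X_new, dtype=float):
        # Euclidean distance from x to every training point
        dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # Indices of the k closest training points
        nearest = np.argsort(dist, kind="stable")[:k]
        # Uniform weighting: every neighbor contributes exactly one vote
        votes = np.bincount(y_train[nearest])
        preds.append(votes.argmax())
    return np.array(preds)

# Toy data: two well-separated groups on a line
X_toy = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y_toy = [0, 0, 0, 1, 1, 1]
toy_pred = knn_predict_uniform(X_toy, y_toy, [[1.5], [10.5]], k=3)
print(toy_pred)
```

With weights='distance', closer neighbors would instead count more, each vote being weighted by the inverse of its distance.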
Then we train the classifier with the fit method.
kNN.fit(X_train,y_train)
The trained classifier is then used to generate predictions on the testing data.
y_pred = kNN.predict(X_test)
The confusion matrix and the classification report are generated.
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test, y_pred, target_names=target_names))
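The numbers in the classification report can be traced back to the confusion matrix. The sketch below refits the same classifier so that it runs on its own, then derives the accuracy and the per-class recall and precision by hand. The names cm, accuracy, recall, and precision are chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=50, random_state=2020)
kNN = KNeighborsClassifier(5, weights='uniform').fit(X_train, y_train)
y_pred = kNN.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()

# np.maximum(..., 1) guards against dividing by zero for an empty row or column
recall = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)     # per true class
precision = np.diag(cm) / np.maximum(cm.sum(axis=0), 1)  # per predicted class

print(accuracy)
for name, r, p in zip(iris.target_names, recall, precision):
    print(f"{name}: recall={r:.2f}, precision={p:.2f}")
```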
We use a multilayer perceptron (MLP) to classify species in the Iris data set. We define an MLP classifier object with MLPClassifier, available in the sklearn.neural_network module. When defining the classifier object mlp, we use the stochastic gradient descent solver (solver='sgd'). We use two hidden layers with 4 and 2 neurons, respectively, as set by the parameter hidden_layer_sizes=(4, 2).
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(solver='sgd',
hidden_layer_sizes=(4, 2), random_state=2020)
Then we train the network with the fit method.
mlp.fit(X_train,y_train)
The trained network is then used to generate predictions on the testing data.
y_pred = mlp.predict(X_test)
The confusion matrix and the classification report are generated.
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test, y_pred, target_names=target_names))
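Gradient-based training of a small network can be sensitive to the scale of the input features. One possible variation, which is not part of the example above, standardizes the features before they reach the network by chaining StandardScaler and MLPClassifier in a pipeline. The name mlp_scaled and the setting max_iter=2000 are assumptions made for this sketch, and no particular accuracy is implied.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=50, random_state=2020)

# The scaler is fitted on the training data only, then applied to the test data
mlp_scaled = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver='sgd', hidden_layer_sizes=(4, 2),
                  max_iter=2000, random_state=2020))
mlp_scaled.fit(X_train, y_train)

y_pred_scaled = mlp_scaled.predict(X_test)
test_accuracy = mlp_scaled.score(X_test, y_test)
print(test_accuracy)
```

Comparing this confusion matrix with the one from the unscaled network is one way to see whether scaling matters for this data set.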
Support vector machine (SVM) classifier and regression models are available in the sklearn.svm module as SVC and SVR, respectively. For the SVM classifier, we define a classifier object svc with the linear kernel (kernel='linear') and a somewhat soft margin (C=0.1).
from sklearn.svm import SVC, SVR
svc = SVC(kernel='linear', C=0.1)
Then we train the classifier on the training data from the Iris data set.
svc.fit(X_train,y_train)
And we use the trained model for prediction.
y_pred = svc.predict(X_test) # predicted class
The confusion matrix and the classification report are generated.
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test, y_pred, target_names=target_names))
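The parameter C controls how soft the margin is: smaller values tolerate more margin violations, which typically leaves more training points as support vectors. The sketch below fits the linear SVC for a few values of C and records the number of support vectors. The candidate values and the dictionary n_sv are chosen for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=50, random_state=2020)

n_sv = {}
for C in (0.1, 1.0, 10.0):
    model = SVC(kernel='linear', C=C).fit(X_train, y_train)
    # n_support_ holds the number of support vectors for each class
    n_sv[C] = int(model.n_support_.sum())
    print(f"C={C}: {n_sv[C]} support vectors, "
          f"test accuracy {model.score(X_test, y_test):.2f}")
```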
For support vector regression, we model the petal width using the other three features. First, we define a regression model object svr with the linear kernel and a soft margin (kernel='linear' and C=0.1, respectively).
svr = SVR(kernel='linear', C=0.1)
Then we train the regression model with the features and the target variable from the training data.
svr.fit(X_train[:,:3],X_train[:,3])
Then we calculate predicted values of the petal width based on the available features in the testing data.
y_pred = svr.predict(X_test[:,:3])
We assess the performance of the model by calculating the $R^2$ statistic with the r2_score function in the sklearn.metrics module.
from sklearn.metrics import r2_score
print(r2_score(X_test[:,3], y_pred))
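For reference, the statistic is $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, where $y_i$ are the observed values, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the observed values. The sketch below refits the same regression model so that it runs on its own and checks a hand computation against r2_score. The names y_obs, y_hat, ss_res, ss_tot, and r2_manual are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import r2_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=50, random_state=2020)

# Petal width (column 3) modeled from the first three features
svr = SVR(kernel='linear', C=0.1).fit(X_train[:, :3], X_train[:, 3])
y_obs = X_test[:, 3]
y_hat = svr.predict(X_test[:, :3])

ss_res = np.sum((y_obs - y_hat) ** 2)         # residual sum of squares
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_obs, y_hat))
```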
We now visualize the observed and predicted petal widths with a scatter plot against the sepal length.
# plotting observed vs predicted (sepal length on x-axis)
plt.plot(X_test[:,0], X_test[:,3],'b.', label='observed')
plt.plot(X_test[:,0], y_pred, 'r.', label='predicted')
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[3])
plt.legend()
plt.show()
As an example of ensemble methods, we train a random forest classifier and use it to predict the Iris species. A random forest classifier is available as RandomForestClassifier in the sklearn.ensemble module. We define a random forest classifier object rf with the following parameters:
criterion = 'entropy'
n_estimators = 100
min_samples_leaf = 3
max_depth = 4
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(criterion='entropy',
n_estimators = 100,
min_samples_leaf = 3,
max_depth = 4,
random_state=2020)
Then the model rf is trained on the training data.
rf.fit(X_train,y_train)
Predictions are made on the testing data.
y_pred = rf.predict(X_test)
The confusion matrix and the classification report are generated.
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test, y_pred,
target_names=target_names))
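A fitted random forest also exposes impurity-based feature importances through the feature_importances_ attribute, which can hint at which measurements drive the predictions. The sketch below refits the same forest so that it runs on its own; the name importances and the sorting are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=50, random_state=2020)

rf = RandomForestClassifier(criterion='entropy',
                            n_estimators=100,
                            min_samples_leaf=3,
                            max_depth=4,
                            random_state=2020)
rf.fit(X_train, y_train)

# One value per feature; the values are normalized to sum to 1
importances = rf.feature_importances_

# Print the features from most to least important
for name, imp in sorted(zip(iris.feature_names, importances),
                        key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {imp:.3f}")
```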