Python Code Snippets



Data Understanding in Python

Histograms

Book reference: Chapter 4.3.1, page 40

Histograms are generated with the hist function in the matplotlib library. Here, we use the Iris data set, available as part of Scikit-learn (sklearn), a library of machine learning functions.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

# loading the iris data set
iris = datasets.load_iris()
X = pd.DataFrame(iris.data)   # features data frame
X.columns = ['sepal length', 'sepal width', 'petal length', 'petal width']
y = pd.DataFrame(iris.target)  # target data frame
y.columns = ['species']
target_names = iris.target_names

The features have been read into a data frame object X, a class of object available in the pandas library, and the feature names have been assigned to X as column names. In addition, the target describing the different species is stored in another data frame y. We generate a histogram of the petal length with the following code.

# histogram of the petal length
plt.hist(X['petal length'])
plt.show()
[Figure: histogram of the petal length]

Note that you need to run plt.show() to display the figure on your screen. You may notice that the histogram includes only a small number of bars, despite the large amount of data. You can change the number of bins, that is, the number of bars in the histogram, with the second parameter.

# histogram with 20 bins
plt.hist(X['petal length'], 20)
plt.show()
[Figure: histogram with 20 bins]

Boxplots

Book reference: Chapter 4.3.1, page 43

Boxplots can be generated by the boxplot method associated with data frame objects in pandas. You can specify a particular feature with the column parameter. Here is a boxplot of the petal length from the Iris data set.

# box plot of the petal length
X.boxplot(column='petal length')
plt.show()
[Figure: box plot of the petal length]

Or, we can generate boxplots of the petal length and width.

# box plot of the petal length & width
X.boxplot(column=['petal length', 'petal width'])
plt.show()
[Figure: box plots of the petal length & width]

Or all the features in the data set.

# box plot of all the features
X.boxplot()
plt.show()
[Figure: box plots of all the features]

We can generate boxplots with notches by specifying the option notch=True.

# notched boxplots
X.boxplot(notch=True)
plt.show()
[Figure: notched boxplots of all the features]

Finally, the describe method, associated with data frame objects, produces various descriptive statistics (mean, median, quartiles, etc.) for each column.

# describing various statistics
print(X.describe())
       sepal length  sepal width  petal length  petal width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

Scatter Plots

Book reference: Chapter 4.3.1, page 45

Scatter plots can be generated by the plot.scatter method associated with data frame objects in pandas. You can specify the columns represented in the x- and y-axes with parameters x and y, respectively. As an example, we plot the petal width against the petal length.

# plotting petal width vs length (as a method)
X.plot.scatter(x='petal width', y='petal length')
plt.show()
[Figure: petal width vs. petal length, plotted with the data frame method]

Alternatively, we can produce the same plot with the scatter function in the matplotlib.pyplot library. Here, the petal width is plotted against the petal length.

# plotting petal width vs length (as a function)
plt.scatter(X['petal width'], X['petal length'])
plt.show()
[Figure: petal width vs. petal length, plotted with the matplotlib function]

All features can be plotted against each other with the plotting.scatter_matrix function in the pandas library. Here, all features in the Iris data are plotted in a scatter plot matrix.

# scatter plot matrix
pd.plotting.scatter_matrix(X)
plt.show()
[Figure: scatter plot matrix of the Iris features]

Notice that all data points are plotted in the same color. However, you may want to plot data points corresponding to different species in different colors. To do so, we provide the information on species contained in the data frame column y['species'] to the parameter c of the plotting.scatter_matrix function.

# scatter plot matrix with different colors for species
pd.plotting.scatter_matrix(X, c=y['species'])
plt.show()
[Figure: scatter plot matrix with different colors for species]

Principal Component Analysis

Book reference: Chapter 4.3.2.1, page 48

In Scikit-learn, a Python library for machine learning tools, many algorithms are implemented as objects rather than functions. In a nutshell, an object is a combination of data, a collection of functions associated with the data (referred to as methods), and the properties of the object (referred to as attributes). To demonstrate this idea, we apply a z-score transformation to, or normalize, the Iris data in preparation for a principal component analysis. In particular, we create a normalization object, available as a StandardScaler object in the sklearn.preprocessing library.

from sklearn.preprocessing import StandardScaler

# defining a normalization object
normData = StandardScaler()

Next, we fit the feature data X to this newly defined normalization transformation object normData, with the fit method.

# fitting the data to the normalization object
normData.fit(X)
StandardScaler(copy=True, with_mean=True, with_std=True)

Now the normalization object is ready, meaning that means and standard deviations have been calculated for each feature in the provided data set. We then apply the actual transformation to the data set X with the transform method. The resulting normalized features are stored in X_norm.

# applying the normalization transformation to the data
X_norm = normData.transform(X)

To perform these steps, you can also use the fit_transform method, which performs the fit and transform operations at once.

X_norm = StandardScaler().fit_transform(X)

Now we apply PCA to the normalized data X_norm. This is done by creating a PCA transformation object, available as part of the sklearn.decomposition library, and then calling the fit_transform method to calculate the principal components.

from sklearn.decomposition import PCA

# applying PCA
pca = PCA()  # creating a PCA transformation object
X_pc = pca.fit_transform(X_norm) # fit the data, and get PCs

This produces an array X_pc with 150 rows and 4 columns (corresponding to 4 PCs).

X_pc.shape
(150, 4)

The attribute explained_variance_ratio_ stores the proportion of variability in the data explained by each PC. The first PC explains about 73% of the variability, whereas the second PC explains about 23%, and so on.

# proportion of the variance explained
print(pca.explained_variance_ratio_)
[0.72962445 0.22850762 0.03668922 0.00517871]
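
As a small addition not shown in the original snippet, the cumulative proportion of variance can be obtained with numpy's cumsum function; the first two PCs together explain about 96% of the variability.

# cumulative proportion of variance explained by the first k PCs
print(np.cumsum(pca.explained_variance_ratio_))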

We plot the first 2 PCs, representing the four original features in a 2D space. The first PC (or the first column of X_pc, X_pc[:,0]) is plotted on the x-axis and the second PC (or the second column of X_pc, X_pc[:,1]) is plotted on the y-axis.

# plotting the first 2 principal components
plt.scatter(X_pc[:,0], X_pc[:,1], c=y['species'])
plt.show()
[Figure: scatter plot of the first 2 principal components, colored by species]

Multidimensional Scaling

Book reference: Chapter 4.3.2.3, page 53

Multidimensional scaling is implemented as an MDS transformation object available in the sklearn.manifold library. Here, we define a transformation object mds, then use the fit_transform method to obtain the MDS-transformed coordinates as X_mds.

from sklearn.manifold import MDS

# applying MDS
mds = MDS()
X_mds = mds.fit_transform(X_norm)

By default, the MDS transformation maps a higher-dimensional space to a 2D space. We plot the transformation results in 2D.

# plotting the MDS-transformed coordinates
plt.scatter(X_mds[:,0], X_mds[:,1], c=y['species'])
plt.show()
[Figure: scatter plot of the MDS-transformed coordinates, colored by species]
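
If a different target dimension is needed, the n_components parameter of MDS controls the dimensionality of the embedding. Here is a small, assumed variation that is not part of the original snippet.

# assumed variation: a 3-dimensional MDS embedding instead of the default 2
mds3 = MDS(n_components=3)
X_mds3 = mds3.fit_transform(X_norm)
print(X_mds3.shape)   # expected: (150, 3)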

Parallel Coordinates, Radar, and Star Plots

Book reference: Chapter 4.3.2.6-7, page 59

A parallel coordinates plot can be generated by the plotting.parallel_coordinates function available in the pandas library. This function assumes that features and labels are stored in the same data frame. Thus, we first combine the feature data frame X and the target label data frame y with the concat function in pandas. The resulting combined data frame Xy is then used in the parallel coordinates plot.

Xy = pd.concat([X,y], axis=1)
pd.plotting.parallel_coordinates(Xy,'species')
plt.show()
[Figure: parallel coordinates plot of the Iris data]
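
The book section also covers radar and star plots, which are not included in the original snippet. The following is a minimal sketch under our own assumptions (summarizing each species by its mean feature values and drawing them on a matplotlib polar axis); it is an illustration, not the book's code.

# a minimal, assumed sketch of a radar/star plot of mean feature values per species
means = Xy.groupby('species').mean()           # mean of each feature per species
angles = np.linspace(0, 2*np.pi, len(X.columns), endpoint=False)
angles = np.concatenate([angles, angles[:1]])  # repeat the first angle to close the polygon

ax = plt.subplot(polar=True)
for species, row in means.iterrows():
    values = np.concatenate([row.values, row.values[:1]])
    ax.plot(angles, values, label=target_names[species])
ax.set_xticks(angles[:-1])
ax.set_xticklabels(X.columns)
plt.legend()
plt.show()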

Correlations

Book reference: Chapter 4.4, page 62

The Pearson, Spearman, and Kendall correlations can be calculated by the pearsonr, spearmanr, and kendalltau functions, respectively, available in the scipy.stats module of the SciPy (scientific Python) library. These functions return the correlation value and the associated p-value (of a test of whether the correlation is zero).

from scipy import stats

r_pearson, p_pearson = stats.pearsonr(X['sepal length'], X['sepal width'])
r_spearman, p_spearman = stats.spearmanr(X['sepal length'], X['sepal width'])
r_kendall, p_kendall = stats.kendalltau(X['sepal length'], X['sepal width'])

The resulting correlation values are:

r_pearson
-0.11756978413300208
r_spearman
-0.166777658283235
r_kendall
-0.0769967881165167
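
If correlations among all pairs of features are needed at once, the corr method of a pandas data frame computes the full correlation matrix; its method parameter accepts 'pearson', 'spearman', or 'kendall'. This is a short, assumed addition, not part of the original snippet.

# assumed addition: full correlation matrices for all feature pairs
print(X.corr(method='pearson'))
print(X.corr(method='spearman'))
print(X.corr(method='kendall'))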

Validation in Python

Training and Testing data

Book reference: Chapter 5.5.1, page 112

In this example, we will use the Iris data as used in the previous chapter. The features and the target are loaded into arrays named X and y, respectively.

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report

# Loading the iris data
iris = datasets.load_iris()
X = iris.data  # array for the features
y = iris.target  # array for the target
feature_names = iris.feature_names   # feature names
target_names = iris.target_names   # target names

Here, we split the Iris data into the training data set (comprising 2/3 of all observations) and the testing data set (comprising the remaining 1/3 of all observations). This is done by the train_test_split function available in the sklearn.model_selection library. This function takes a feature array and a target array as input parameters, and splits them into training and testing data sets. In this example, the input features and targets, X and y respectively, are split into the training data (X_train for the features and y_train for the target) as well as the testing data (likewise X_test and y_test).

# splitting the data into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333,
                                            random_state=2020)

Here, the parameter test_size specifies the proportion of the original data to be assigned to the testing data. The order of observations is randomized in both the training and testing data.

y # original target labels
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
y_train  # training data target labels
array([0, 2, 1, 0, 1, 1, 2, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 2, 1, 2, 0,
       0, 2, 0, 2, 2, 0, 2, 0, 0, 1, 0, 0, 2, 1, 0, 2, 1, 2, 0, 2, 2, 0,
       1, 2, 0, 2, 1, 1, 2, 1, 0, 2, 1, 0, 1, 1, 1, 2, 1, 0, 2, 0, 0, 1,
       2, 2, 2, 1, 2, 1, 0, 2, 0, 1, 0, 0, 1, 1, 2, 1, 2, 0, 2, 1, 2, 2,
       1, 2, 0, 0, 1, 2, 2, 1, 2, 1, 2, 1])
y_test  # testing data target labels
array([2, 0, 1, 1, 1, 2, 2, 1, 0, 0, 2, 2, 0, 2, 2, 0, 1, 1, 2, 0, 0, 2,
       1, 0, 2, 1, 1, 1, 0, 0, 2, 0, 0, 0, 2, 0, 0, 1, 0, 2, 0, 2, 1, 0,
       1, 2, 2, 1, 1, 1])

The parameter random_state is used to seed (or initialize) the random number generator. It is not a required parameter, but if you want to re-create the same random split, then you can use the same value for random_state.
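
As a quick, assumed check (not part of the original snippet), calling train_test_split again with the same random_state reproduces exactly the same partition.

# assumed check: the same random_state yields the same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.333,
                                                        random_state=2020)
print(np.array_equal(y_test, y_test2))   # expected: True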

Classifier training and prediction

In this example, we train a naive Bayes classifier for the Iris data, and evaluate its performance. We use a Gaussian naive Bayes classifier object GaussianNB available in the sklearn.naive_bayes library. We define the classifier object as gnb. The classifier object is trained with the training data (both the features X_train and the target y_train) using the fit method. Once the classifier gnb is trained, it is used to predict target labels based on the testing data features X_test. The predicted labels are stored in y_pred.

# Gaussian naive Bayes classifier
gnb = GaussianNB()  # defining the classifier object
gnb.fit(X_train, y_train)  # training the classifier with the training set
y_pred = gnb.predict(X_test)  # generating prediction with trained classifier

Now we examine the performance of the classifier by generating a confusion matrix. This is done by the confusion_matrix function available in the sklearn.metrics library. Here, we need to provide the true target labels y_test as well as the predicted labels y_pred.

# confusion matrix
print(confusion_matrix(y_test,y_pred))
[[18  0  0]
 [ 0 14  2]
 [ 0  3 13]]

As you see, there are some misclassifications in the 2nd and 3rd classes (versicolor and virginica). We can also generate other measures of model performance with the classification_report function under the sklearn.metrics library. This function also takes arrays of the true and predicted target labels. We also provide the list of target class names stored in target_names to the parameter target_names so that the output table has row headings corresponding to the different target classes.

# classification report
print(classification_report(y_test, y_pred, target_names=target_names))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        18
  versicolor       0.82      0.88      0.85        16
   virginica       0.87      0.81      0.84        16

    accuracy                           0.90        50
   macro avg       0.90      0.90      0.90        50
weighted avg       0.90      0.90      0.90        50
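
The overall accuracy reported above can also be computed directly, either with the accuracy_score function in sklearn.metrics or with the classifier's score method; this is an assumed addition, not part of the original snippet.

# assumed addition: overall accuracy computed directly
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
print(gnb.score(X_test, y_test))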

Data Preparation in Python

Missing Values

Book reference: Chapter 6.2.2, page 137

If we can assume that values are missing completely at random (MCAR), then we can use a number of imputation strategies available in the sklearn library. To demonstrate, we create some toy data sets.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Creating toy data sets (numerical)
X_train = pd.DataFrame()
X_train['X'] = [3, 2, 1, 4, 5, np.nan, np.nan, 5, 2]
X_test = pd.DataFrame()
X_test['X'] = [3, np.nan, np.nan]

Here, we created data frames X_train and X_test, both containing a numerical data column X. There are some missing values in these data sets, defined by np.nan, the missing value marker available in the numpy library. If we examine these data frames, the missing values are indicated by NaN.

X_train
	X
0	3.0
1	2.0
2	1.0
3	4.0
4	5.0
5	NaN
6	NaN
7	5.0
8	2.0
X_test
	X
0	3.0
1	NaN
2	NaN

We also create data frames S_train and S_test with a string column S, again with some missing values indicated by np.nan.

# Creating toy data sets (categorical)
S_train = pd.DataFrame()
S_train['S'] = ['Hi', 'Med', 'Med', 'Hi', 'Low', 'Med', np.nan, 'Med', 'Hi']
S_test = pd.DataFrame()
S_test['S'] = [np.nan, np.nan, 'Low']
S_train
	S
0	Hi
1	Med
2	Med
3	Hi
4	Low
5	Med
6	NaN
7	Med
8	Hi
S_test
	S
0	NaN
1	NaN
2	Low

We impute missing values by creating a simple imputation object SimpleImputer available in the sklearn.impute library. We define a numerical imputation object imp_mean as

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

Here, the parameter missing_values defines which value is considered as missing. The parameter strategy defines the imputation method. In this example, we use 'mean', meaning that the mean of all non-missing values will be imputed. We use the fit_transform method and provide the X_train data to calculate the mean to be imputed, in addition to actually imputing missing values.

X_train_imp = imp_mean.fit_transform(X_train)
X_train_imp
array([[3.        ],
       [2.        ],
       [1.        ],
       [4.        ],
       [5.        ],
       [3.14285714],
       [3.14285714],
       [5.        ],
       [2.        ]])

As you can see, missing values are imputed by the mean. We can apply this imputation strategy (with the mean calculated on the training data) to the second data X_test by the transform method.

X_test_imp = imp_mean.transform(X_test)
X_test_imp
array([[3.        ],
       [3.14285714],
       [3.14285714]])
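
Other strategies are available in SimpleImputer as well; for example, strategy='median' imputes the median of the non-missing values. Here is a brief, assumed variation, not part of the original snippet.

# assumed variation: imputing the median instead of the mean
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
X_train_med = imp_median.fit_transform(X_train)
X_test_med = imp_median.transform(X_test)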

For categorical or string data, we can impute the most frequent value by setting the parameter strategy to 'most_frequent' in the SimpleImputer object. Here, we define the imputation object imp_mode, which imputes the most frequent category learned from the S_train data into both the S_train and S_test data sets.

# Imputing categorical data with mode
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
S_train_imp = imp_mode.fit_transform(S_train)
S_test_imp = imp_mode.transform(S_test)
S_train_imp
array([['Hi'],
       ['Med'],
       ['Med'],
       ['Hi'],
       ['Low'],
       ['Med'],
       ['Med'],
       ['Med'],
       ['Hi']], dtype=object)
S_test_imp
array([['Med'],
       ['Med'],
       ['Low']], dtype=object)

Normalization and Scaling

Book reference: Chapter 6.6.3, page 153

For this example, we shall use the Iris data set again. We load the Iris data and split it into training and testing data sets, with the testing data comprising 1/3 of all observations.

from sklearn import datasets
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# loading the Iris data
iris = datasets.load_iris()
X = iris.data  # array for the features
y = iris.target  # array for the target
feature_names = iris.feature_names   # feature names
target_names = iris.target_names   # target names

# spliting the data into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333,
                                                    random_state=2020)

We can normalize the data to Z-scores using the StandardScaler transformation object, as we have seen in a previous chapter. The transformation object is trained on the training data X_train with the fit_transform method. Then the trained transformation is applied to the testing data with the transform method.

# z-score normalization
normZ = StandardScaler()
X_train_Z = normZ.fit_transform(X_train)
X_test_Z = normZ.transform(X_test)

The resulting means and standard deviations for the normalized training data set are:

X_train_Z.mean(axis=0)
array([-2.17187379e-15,  1.35447209e-16,  1.24344979e-16,  2.19407825e-16])
X_train_Z.std(axis=0)
array([1., 1., 1., 1.])

Likewise, the means and standard deviations of the normalized testing data set are:

X_test_Z.mean(axis=0)
array([-0.17055031, -0.10396525, -0.10087131, -0.04575343])
X_test_Z.std(axis=0)
array([0.94189599, 0.8734677 , 1.00463184, 0.97844516])

To apply a min-max scaling, thus scaling all features in the [0, 1] interval, we can use the MinMaxScaler object available in the sklearn.preprocessing library.

# min-max normalization
normMinMax = MinMaxScaler()
X_train_MinMax = normMinMax.fit_transform(X_train)
X_test_MinMax = normMinMax.transform(X_test)

Let's examine the minimum and the maximum of the normalized training data.

X_train_MinMax.min(axis=0)
array([0., 0., 0., 0.])
X_train_MinMax.max(axis=0)
array([1., 1., 1., 1.])

Likewise, here are the minimum and the maximum of the normalized testing data. It should be noted that the minimum and the maximum used in the transformation were determined from the training data. Thus there is no guarantee that the transformed testing data fall within the interval [0, 1].

X_test_MinMax.min(axis=0)
array([0.02777778, 0.08333333, 0.03389831, 0.04166667])
X_test_MinMax.max(axis=0)
array([0.94444444, 0.75      , 0.96610169, 1.        ])

Finding Patterns in Python

Hierarchical Clustering

Book reference: Chapter 7.1, page 159

In this example, we apply hierarchical clustering to the Iris data. The data is first normalized with the StandardScaler object.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN

# Loading the iris data
iris = datasets.load_iris()
X = iris.data  # array for the features
y = iris.target  # array for the target
feature_names = iris.feature_names   # feature names
target_names = iris.target_names   # target names

# z-score normalization using fit_transform method
X_norm = StandardScaler().fit_transform(X)

Hierarchical clustering is implemented as an AgglomerativeClustering object available in the sklearn.cluster library. We set the number of clusters to 3 with the n_clusters parameter, and use the Ward method for the linkage calculation by setting the linkage parameter.

# Hierarchical clustering
hc = AgglomerativeClustering(n_clusters=3, linkage='ward')

Then we learn from the normalized data X_norm by using the fit method associated with the clustering object hc. The resulting cluster assignments are stored in the attribute labels_ of the transformation object hc. We store this information in the array y_clus.

hc.fit(X_norm)  # actually fitting the data
y_clus = hc.labels_   # cluster assignments from hierarchical clustering

We visualize the resulting clusters with a scatter plot, plotting the normalized petal width on the x-axis and the normalized sepal length on the y-axis. Different clusters are indicated by different colors.

# Plotting the clusters
plt.scatter(X_norm[:,3],X_norm[:,0],c=y_clus,marker='+')
plt.show()
[Figure: scatter plot of the hierarchical clustering result]

Next, we generate a dendrogram describing the distances between different clusters. To do so, we use the linkage function to calculate distances between observations, and the dendrogram function to plot the actual dendrogram. Both functions are available in the scipy.cluster.hierarchy library. In the linkage function, we provide the normalized features X_norm and the linkage method ward (for the Ward method). The resulting linkage matrix D is then used in the dendrogram function.

from scipy.cluster.hierarchy import dendrogram, linkage

D = linkage(X_norm, 'ward')
dn = dendrogram(D)
plt.show()
[Figure: dendrogram of the hierarchical clustering]
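
If flat cluster assignments are needed from the linkage matrix D, the fcluster function in scipy.cluster.hierarchy cuts the tree into a chosen number of clusters; this is an assumed addition, analogous to the AgglomerativeClustering result above.

# assumed addition: cutting the linkage tree into 3 flat clusters
from scipy.cluster.hierarchy import fcluster
y_cut = fcluster(D, t=3, criterion='maxclust')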

K-means clustering

Book reference: Chapter 7.3.2.1, page 175

K-means clustering is implemented as the KMeans transformation object available in the sklearn.cluster library. We define the clustering object km with the number of clusters n_clusters=3. We learn from the normalized data X_norm, and store the resulting cluster assignment information in the array y_clus.

# K-means clustering
km = KMeans(n_clusters=3)  # defining the clustering object
km.fit(X_norm)  # actually fitting the normalized data
y_clus = km.labels_   # clustering info resulting from K-means

As before, we plot the normalized petal width (x-axis) vs the normalized sepal length (y-axis), with different clusters indicated by different colors.

# Plotting the clusters
plt.scatter(X_norm[:,3],X_norm[:,0],c=y_clus,marker='+')
plt.show()
[Figure: scatter plot of the K-means clustering result]

As in many clustering algorithms, you have to specify the number of clusters K in the k-means clustering method. What can we do if we do not know the number of clusters beforehand? One way to determine the number of clusters is to plot the sum of squared distances from the cluster centroids (i.e., how far observations are from their centroids), also known as the inertia. We can get the inertia from k-means clustering via the attribute inertia_.

We run the K-means algorithm with different numbers of clusters, and calculate the corresponding inertia values. Then we plot the inertia against the number of clusters. The inertia decreases as the number of clusters increases. However, there is an elbow in this plot where the rate of decrease slows down. In this particular example, we calculate the inertia for up to 20 clusters.

SSE = []
for iClus in range(1,21):
    # K-means clustering
    km = KMeans(n_clusters=iClus)  # K-means with a given number of clusters
    km.fit(X_norm)  # fitting the normalized data
    SSE.append(km.inertia_) # recording the sum of square distances

In brief, we use a for loop to run k-means clustering with k = 1 to 20 clusters. For each value of k, we record the corresponding inertia and store it in the list SSE. Then we plot k (the number of clusters, x-axis) against the inertia (y-axis).

# plotting the sum of square distance (a.k.a., inertia)
plt.plot(np.arange(1,21),SSE,marker = "o")
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
[Figure: inertia plotted against the number of clusters]

In the plot, the rate of decrease seems to slow down (i.e., the elbow) at k=3. It should be noted that this elbow method is somewhat subjective; one could easily choose k=2 or k=4 as the elbow. Moreover, this method requires repeating k-means clustering many times, which may not be feasible for large data sets.

DBSCAN clustering

Book reference: Chapter 7.4, page 181

DBSCAN clustering is implemented as the DBSCAN transformation object available in the sklearn.cluster library. We define the clustering object dbscan, learn from the normalized data X_norm, and store the resulting cluster assignments in the array y_clus. In addition, the indices of the core points are available as the attribute core_sample_indices_, so we store that information in indCore.

# DBSCAN clustering
dbscan = DBSCAN()  # defining the clustering object
dbscan.fit(X_norm)  # fitting the data
y_clus = dbscan.labels_   # cluster labels
indCore = dbscan.core_sample_indices_   # indices of core points

As before, we plot the normalized petal width (x-axis) vs the normalized sepal length (y-axis), with different clusters indicated by different colors. A cluster index of -1 corresponds to a noise point, whereas all the other indices are integers corresponding to different clusters. In the plot, core points are plotted with larger markers, and noise points are plotted in red.

# plotting non-noise points
plt.scatter(X_norm[y_clus>=0,3],X_norm[y_clus>=0,0], c=y_clus[y_clus>=0],
            marker='o', s=10)
# plotting core points
plt.scatter(X_norm[indCore,3], X_norm[indCore,0],c=y_clus[indCore],
            marker='o', s=100)
# plotting noise points
plt.scatter(X_norm[y_clus==-1,3],X_norm[y_clus==-1,0], c='r',
            marker='o', s=10)
plt.show()
[Figure: DBSCAN clustering result with core points shown as larger markers and noise points in red]
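
Because DBSCAN determines the number of clusters itself, it can be handy to count the clusters and noise points it found; this is a small, assumed addition, not part of the original snippet.

# assumed addition: counting clusters (label -1 marks noise) and noise points
n_clusters = len(set(y_clus)) - (1 if -1 in y_clus else 0)
n_noise = np.sum(y_clus == -1)
print(n_clusters, n_noise)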

Finding Explanations

Decision tree

Book reference: Chapter 8.1, page 220

In this example, we will construct a decision tree for the Iris data. First, we load the data set as before, and split it into the training (N=100) and testing (N=50) data sets.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Loading data
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# splitting the data into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=50,
                                                    random_state=2020)

Next, we define a decision tree classification object DecisionTreeClassifier available under the sklearn.tree library. As we define the classifier object dt, we use the entropy criterion (criterion='entropy') to describe sample inhomogeneity at each node. We also set the minimum leaf size min_samples_leaf to 3 and the maximum tree depth max_depth to 4 in order to avoid overfitting. We seed the random number generator for this algorithm with random_state=0.

# decision tree classifier
dt = DecisionTreeClassifier(criterion='entropy',
                            min_samples_leaf = 3,
                            max_depth = 4,
                            random_state=0)

Then we train the classifier with the fit method.

dt.fit(X_train,y_train)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=4, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=3, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

The trained classifier is then used to generate prediction on the testing data.

# classification on the testing data set
y_pred = dt.predict(X_test)

The confusion matrix and the classification report are generated.

print(confusion_matrix(y_test,y_pred))
[[18  0  0]
 [ 0 14  2]
 [ 0  1 15]]
print(classification_report(y_test, y_pred,
                            target_names=target_names))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        18
  versicolor       0.93      0.88      0.90        16
   virginica       0.88      0.94      0.91        16

    accuracy                           0.94        50
   macro avg       0.94      0.94      0.94        50
weighted avg       0.94      0.94      0.94        50

The resulting decision tree can be visualized by the plot_tree function available in the sklearn.tree library. The trained classifier is passed on as the required input parameter, along with the feature names and class names.

# plotting the tree
plt.figure(figsize=[7.5,7.5])
plot_tree(dt, feature_names=feature_names, class_names=target_names)
plt.show()
[Figure: the trained decision tree]
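
The same tree can also be rendered as plain text with the export_text function in sklearn.tree; this is an assumed addition, not part of the original snippet.

# assumed addition: text rendering of the trained tree
from sklearn.tree import export_text
print(export_text(dt, feature_names=feature_names))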

Regression

Book reference: Chapter 8.3, page 241

Least squares linear regression is implemented as a LinearRegression object available in the sklearn.linear_model library. In this example, we model the petal width from the Iris data as the dependent variable, and the three other features as the regressors.

from sklearn.linear_model import LinearRegression

# Target is petal width
y = iris.data[:,3]
# All the other variables are input features
X = iris.data[:,:3]

Now we fit a regression model with the fit method. The resulting predictor object is referred to as reg.

# linear regression learner
reg = LinearRegression().fit(X,y)

The information regarding the regression model can be examined with various methods and attributes, such as the R-squared value (with the score method)

reg.score(X,y)
0.9378502736046809

as well as the regression coefficients (with the coef_ attribute) and the intercept (with the intercept_ attribute).

reg.coef_
array([-0.20726607,  0.22282854,  0.52408311])
reg.intercept_
-0.24030738911225824
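
As a sanity check (assumed, not part of the original snippet), a prediction of this model is simply the dot product of the feature values and the coefficients, plus the intercept.

# assumed check: manual prediction for the first observation vs. predict
print(X[0] @ reg.coef_ + reg.intercept_)
print(reg.predict(X[:1])[0])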

Finally, we plot the petal width against the sepal length, with the predicted regression line overlaid on the observed data.

# Observed vs predicted plot
y_pred = reg.predict(X)
plt.plot(X[:,0], y, 'b.', label='observed')
plt.plot([X[:,0].min(), X[:,0].max()], [y_pred.min(), y_pred.max()],
         'r-', label='predicted')
plt.xlabel('Sepal length')
plt.ylabel('Petal width')
plt.legend()
plt.show()
[Figure: observed petal width vs. sepal length with the fitted regression line]

Predictors

Nearest Neighbor Classifiers

Book reference: Chapter 9.1, page 275

In this example, we will construct a nearest neighbor classifier for the Iris data. First, we load the data set as before, and split it into the training (N=100) and testing (N=50) data sets.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Loading data
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# splitting the data into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=50,
                                                    random_state=2020)

Next, we define a nearest neighbor classification object KNeighborsClassifier, available under the sklearn.neighbors library. As we define the classifier object kNN, we use k=5 neighbors and uniform weighting for the parameter weights.

kNN = KNeighborsClassifier(5, weights='uniform')

Then we train the classifier with the fit method.

kNN.fit(X_train,y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

The trained classifier is then used to generate prediction on the testing data.

y_pred = kNN.predict(X_test)

The confusion matrix and the classification report are generated.

print(confusion_matrix(y_test,y_pred))
[[18  0  0]
 [ 0 14  2]
 [ 0  2 14]]
print(classification_report(y_test, y_pred, target_names=target_names))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        18
  versicolor       0.88      0.88      0.88        16
   virginica       0.88      0.88      0.88        16

    accuracy                           0.92        50
   macro avg       0.92      0.92      0.92        50
weighted avg       0.92      0.92      0.92        50
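
With uniform weighting, the class membership probabilities reported by predict_proba are simply the fractions of the k neighbors belonging to each class; this is a small, assumed addition, not part of the original snippet.

# assumed addition: class membership probabilities for the first 5 test points
print(kNN.predict_proba(X_test[:5]))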

Neural Networks

Book reference: Chapter 9.2, page 282

We use a multilayer perceptron (MLP) to classify species in the Iris data set. We define an MLP classifier object MLPClassifier available under the sklearn.neural_network library. As we define the classifier object mlp, we use the stochastic gradient descent solver (solver='sgd'). We use 2 hidden layers with 4 and 2 neurons, respectively, as defined by the parameter hidden_layer_sizes=(4, 2).

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(solver='sgd',
                    hidden_layer_sizes=(4, 2), random_state=2020)

Then we train the network with the fit method.

mlp.fit(X_train,y_train)
/home/satoru/.local/lib/python3.6/site-packages/sklearn/neural_network/_multilayer_perceptron.py:571: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(4, 2), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=2020, shuffle=True, solver='sgd',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

The trained network is then used to generate prediction on the testing data.

y_pred = mlp.predict(X_test)

The confusion matrix and the classification report are generated.

print(confusion_matrix(y_test,y_pred))
[[18  0  0]
 [ 0 14  2]
 [ 0  2 14]]
print(classification_report(y_test, y_pred, target_names=target_names))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        18
  versicolor       0.88      0.88      0.88        16
   virginica       0.88      0.88      0.88        16

    accuracy                           0.92        50
   macro avg       0.92      0.92      0.92        50
weighted avg       0.92      0.92      0.92        50
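
The convergence warning captured above indicates that the default 200 iterations were not enough for the stochastic gradient descent solver. As an assumed variation (not part of the original snippet), raising max_iter lets the optimizer run longer, which may remove the warning.

# assumed variation: allowing more iterations for the SGD solver
mlp2 = MLPClassifier(solver='sgd', hidden_layer_sizes=(4, 2),
                     max_iter=2000, random_state=2020)
mlp2.fit(X_train, y_train)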

Support Vector Machines

Book reference: Chapter 9.4, page 297

Support vector machine (SVM) classifier and regression models are available under the sklearn.svm library as SVC and SVR, respectively. For the SVM classifier, we define a classifier object svc with the linear kernel (kernel='linear') and a somewhat soft margin (C=0.1).

from sklearn.svm import SVC, SVR

svc = SVC(kernel='linear', C=0.1)

Then we train the classifier on the training data from the Iris data set.

svc.fit(X_train,y_train)
SVC(C=0.1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

And we use the trained model for prediction.

y_pred = svc.predict(X_test)   # predicted class

The confusion matrix and the classification report are generated.

print(confusion_matrix(y_test,y_pred))
[[18  0  0]
 [ 0 15  1]
 [ 0  3 13]]
print(classification_report(y_test, y_pred, target_names=target_names))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        18
  versicolor       0.83      0.94      0.88        16
   virginica       0.93      0.81      0.87        16

    accuracy                           0.92        50
   macro avg       0.92      0.92      0.92        50
weighted avg       0.92      0.92      0.92        50

For support vector regression, we try to model the petal width with all the other features. First, we define a regression model object svr with the linear kernel and a soft margin (kernel='linear' and C=0.1, respectively).

svr = SVR(kernel='linear', C=0.1)

Then we train the regression model with the features and the target variable from the training data.

svr.fit(X_train[:,:3],X_train[:,3])
SVR(C=0.1, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

Then we calculate predicted values of the petal width based on the available features in the testing data.

y_pred = svr.predict(X_test[:,:3])

We assess the performance of the model by calculating the R² statistic with the r2_score function in the sklearn.metrics library.

from sklearn.metrics import r2_score

print(r2_score(X_test[:,3], y_pred))
0.9427223321100942

We now visualize the observed and predicted target variable with a scatter plot of the petal width against the sepal length.

# plotting observed vs predicted (sepal length on x-axis)
plt.plot(X_test[:,0], X_test[:,3],'b.', label='observed')
plt.plot(X_test[:,0], y_pred, 'r.', label='predicted')
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[3])
plt.legend()
plt.show()
[Figure: observed and predicted petal width vs. sepal length]

Ensemble Methods

Book reference: Chapter 9.5, page 304

As an example of ensemble methods, we train a random forest classifier and use it to predict the Iris species. A random forest classifier is available as RandomForestClassifier in the sklearn.ensemble library. We define a random forest classifier object rf, with the following parameters:

  • Criterion: criterion = 'entropy'
  • Number of trees: n_estimators = 100
  • Minimum leaf size: min_samples_leaf = 3
  • Maximum tree depth: max_depth = 4

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(criterion='entropy',
                            n_estimators = 100,
                            min_samples_leaf = 3,
                            max_depth = 4,
                            random_state=2020)

Then the model rf is trained with the training data.

rf.fit(X_train,y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=4, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=3, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=2020,
                       verbose=0, warm_start=False)

Predictions are made on the testing data.

y_pred = rf.predict(X_test)

The confusion matrix and the classification report are generated.

print(confusion_matrix(y_test,y_pred))
[[18  0  0]
 [ 0 14  2]
 [ 0  3 13]]
print(classification_report(y_test, y_pred,
                            target_names=target_names))
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        18
  versicolor       0.82      0.88      0.85        16
   virginica       0.87      0.81      0.84        16

    accuracy                           0.90        50
   macro avg       0.90      0.90      0.90        50
weighted avg       0.90      0.90      0.90        50
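
A random forest also provides impurity-based feature importances through the feature_importances_ attribute; this is an assumed addition, not part of the original snippet.

# assumed addition: impurity-based feature importances averaged over the trees
for name, importance in zip(feature_names, rf.feature_importances_):
    print(name, round(importance, 3))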