Histograms are generated by the hist function in the matplotlib library. Here, we use the Iris data set, available as part of Scikit-learn (sklearn), a library of machine learning tools.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
# loading the iris data set
iris = datasets.load_iris()
X = pd.DataFrame(iris.data) # features data frame
X.columns = ['sepal length', 'sepal width', 'petal length', 'petal width']
y = pd.DataFrame(iris.target) # target data frame
y.columns = ['species']
target_names = iris.target_names
The data set has been read into a data frame object X, a class of object available in the pandas library, and the feature names have been assigned to X as its column names. In addition, the target describing the different species is stored in another data frame, y. We generate a histogram of the petal length with the following code.
# histogram of the petal length
plt.hist(X['petal length'])
plt.show()
Note that you need to run plt.show() to display the figure on your screen. You may notice that the histogram includes only a small number of bars, despite the large amount of data. You can change the number of bins, that is, the number of bars in the histogram, through the second parameter of hist.
# histogram with 20 bins
plt.hist(X['petal length'], 20)
plt.show()
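Since the target labels are available, we can also overlay a separate histogram for each species. The following is a small sketch beyond the original example; it uses the alpha parameter of hist for transparency and the target_names array loaded earlier.
# overlaid per-species histograms (a sketch beyond the original example)
for i, name in enumerate(target_names):
    plt.hist(X.loc[y['species'] == i, 'petal length'], 20, alpha=0.5, label=name)
plt.legend()
plt.show()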
Boxplots can be generated by the boxplot method associated with data frame objects in pandas. You can specify a particular feature with the column parameter. Here is a boxplot of the petal length from the Iris data set.
# box plot of the petal length
X.boxplot(column='petal length')
plt.show()
Or, we can generate boxplots of the petal length and width.
# box plot of the petal length & width
X.boxplot(column=['petal length', 'petal width'])
plt.show()
Or we can plot all the features in the data set at once.
# box plot of all the features
X.boxplot()
plt.show()
We can generate boxplots with notches by specifying the option notch=True.
# notched boxplots
X.boxplot(notch=True)
plt.show()
Finally, the describe method, associated with data frame objects, produces various descriptive statistics (mean, median, quartiles, etc.) for each column.
# describing various statistics
print(X.describe())
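The same method also works on a single column if you only need the statistics for one feature; for example:
# statistics for a single feature
print(X['petal length'].describe())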
Scatter plots can be generated by the plot.scatter method associated with data frame objects in pandas. You can specify the columns represented on the x- and y-axes with the parameters x and y, respectively. As an example, we plot the petal width against the petal length.
# plotting petal width vs length (as a method)
X.plot.scatter(x='petal width', y='petal length')
plt.show()
Alternatively, we can produce a scatter plot using the scatter function in the matplotlib.pyplot library. A scatter plot of the petal width vs. the petal length is produced with the following call to scatter.
# plotting petal width vs length (as a function)
plt.scatter(X['petal width'], X['petal length'])
plt.show()
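One difference worth noting: the plot.scatter method labels the axes with the column names, whereas plt.scatter leaves them blank. A small addition supplies the labels ourselves.
# the same scatter plot, with axis labels added manually
plt.scatter(X['petal width'], X['petal length'])
plt.xlabel('petal width')
plt.ylabel('petal length')
plt.show()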
All features can be plotted against each other with the scatter_matrix function in the pandas.plotting module. Here, all features in the Iris data are plotted in a scatter plot matrix.
# scatter plot matrix
pd.plotting.scatter_matrix(X)
plt.show()
Notice that all data points are plotted in the same color. However, you may want to plot data points corresponding to different species in different colors. To do so, we provide the species information contained in the data frame column y['species'] to the parameter c of the scatter_matrix function.
# scatter plot matrix with different colors for species
pd.plotting.scatter_matrix(X, c=y['species'])
plt.show()
In Scikit-learn, a Python library for machine learning tools, many algorithms are implemented as objects rather than functions. In a nutshell, an object is a combination of data, a collection of functions associated with the data (referred to as methods), and the properties of the object (referred to as attributes). To demonstrate this idea, we apply a z-score transformation to, or normalize, the Iris data in preparation for a principal component analysis. In particular, we create a normalization object, available as the StandardScaler class in the sklearn.preprocessing module.
from sklearn.preprocessing import StandardScaler
# defining a normalization object
normData = StandardScaler()
Next, we fit the feature data X to this newly defined normalization object normData, using the fit method.
# fitting the data to the normalization object
normData.fit(X)
Now the normalization object is ready, meaning that the mean and standard deviation have been calculated for each feature in the provided data set. We then apply the actual transformation to the data set X with the transform method. The resulting normalized features are stored in X_norm.
# applying the normalization transformation to the data
X_norm = normData.transform(X)
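The fitted parameters themselves can be inspected: a fitted StandardScaler stores the per-feature means in its mean_ attribute and the per-feature standard deviations in its scale_ attribute.
# inspecting the fitted normalization parameters
print(normData.mean_)   # per-feature means
print(normData.scale_)  # per-feature standard deviations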
You can also perform both steps at once with the fit_transform method, which combines the fit and transform methods.
X_norm = StandardScaler().fit_transform(X)
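As a quick sanity check (a small addition to the original), we can verify that each normalized column now has a mean of approximately 0 and a standard deviation of 1.
# sanity check: each normalized feature has mean ~0 and std ~1
print(X_norm.mean(axis=0).round(6))
print(X_norm.std(axis=0))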
Now we apply PCA to the normalized data X_norm. This is done by creating a PCA transformation object, available as part of the sklearn.decomposition module, and then calling its fit_transform method to calculate the principal components.
from sklearn.decomposition import PCA
# applying PCA
pca = PCA() # creating a PCA transformation object
X_pc = pca.fit_transform(X_norm) # fit the data, and get PCs
This produces an array X_pc with 150 rows and 4 columns (corresponding to 4 PCs).
# checking the dimensions of the PC array
print(X_pc.shape)
The attribute explained_variance_ratio_ stores the proportion of variability in the data explained by each PC. The first PC explains 73% of the variability, the second PC explains 23%, and so on.
# proportion of the variance explained
print(pca.explained_variance_ratio_)
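A common way to visualize these proportions, sketched below as an addition to the original example, is a cumulative variance plot showing how much variability is captured by the first k PCs together.
# cumulative proportion of variance explained
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')
plt.xlabel('number of principal components')
plt.ylabel('cumulative variance explained')
plt.show()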
We plot the first 2 PCs, representing the four original features in a 2D space. The first PC (the first column of X_pc, X_pc[:,0]) is plotted on the x-axis, and the second PC (the second column, X_pc[:,1]) is plotted on the y-axis.
# plotting the first 2 principal components
plt.scatter(X_pc[:,0], X_pc[:,1], c=y['species'])
plt.show()
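It is often helpful to label the axes with the proportion of variance each PC explains; the labels below are our own embellishment on the plot above.
# the same PC plot, with variance-explained axis labels
plt.scatter(X_pc[:,0], X_pc[:,1], c=y['species'])
plt.xlabel('PC1 ({:.0%} of variance)'.format(pca.explained_variance_ratio_[0]))
plt.ylabel('PC2 ({:.0%} of variance)'.format(pca.explained_variance_ratio_[1]))
plt.show()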
Multi-dimensional scaling is implemented as an MDS transformation object available in the sklearn.manifold module. Here, we define a transformation object mds, then use the fit_transform method to obtain the MDS-transformed coordinates as X_mds.
from sklearn.manifold import MDS
# applying MDS
mds = MDS()
X_mds = mds.fit_transform(X_norm)
By default, the MDS transformation maps a higher-dimensional space to a 2D space. We plot the transformation results in 2D.
# plotting the MDS-transformed coordinates
plt.scatter(X_mds[:,0], X_mds[:,1], c=y['species'])
plt.show()
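Note that MDS starts from a random initial configuration, so the coordinates, and hence the plot, will differ from run to run. If you need a reproducible embedding, you can fix the seed through the random_state parameter, as sketched here.
# a reproducible MDS embedding (the seed value 0 is an arbitrary choice)
mds = MDS(random_state=0)
X_mds = mds.fit_transform(X_norm)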
A parallel coordinates plot can be generated by the parallel_coordinates function available in the pandas.plotting module. This function assumes that the features and the labels are stored in the same data frame. Thus, we first combine the feature data frame X and the target label data frame y with the concat function in pandas. The resulting combined data frame Xy is then used in the parallel coordinates plot.
Xy = pd.concat([X,y], axis=1)
pd.plotting.parallel_coordinates(Xy,'species')
plt.show()
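If the default class colors are hard to tell apart, the function also accepts a color parameter; the particular colors below are our own choice.
# parallel coordinates with explicit per-species colors
pd.plotting.parallel_coordinates(Xy, 'species', color=['C0', 'C1', 'C2'])
plt.show()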
The Pearson, Spearman, and Kendall correlations can be calculated by the pearsonr, spearmanr, and kendalltau functions, respectively, available in the stats module of the SciPy (scientific Python) library. These functions return the correlation value and the associated p-value (for testing whether the correlation is zero).
from scipy import stats
r_pearson, p_pearson = stats.pearsonr(X['sepal length'], X['sepal width'])
r_spearman, p_spearman = stats.spearmanr(X['sepal length'], X['sepal width'])
r_kendall, p_kendall = stats.kendalltau(X['sepal length'], X['sepal width'])
The resulting correlation values can then be printed.
# printing the correlation values
print(r_pearson, r_spearman, r_kendall)
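If you need correlations between all pairs of features at once, pandas data frames also provide the corr method, whose method parameter accepts 'pearson', 'spearman', or 'kendall'. A brief sketch:
# pairwise correlation matrix for all features
print(X.corr(method='pearson'))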