Data Preparation


Missing values


If we can assume that the missing values are missing completely at random (MCAR), then we can use a number of imputation strategies available in the sklearn library. To demonstrate, we create some toy data sets.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Creating toy data sets (numerical)
X_train = pd.DataFrame()
X_train['X'] = [3, 2, 1, 4, 5, np.nan, np.nan, 5, 2]
X_test = pd.DataFrame()
X_test['X'] = [3, np.nan, np.nan]

Here, we created data frames X_train and X_test, both containing a numerical column X. Some values in these data sets are missing, denoted by np.nan, the missing-value marker provided by the numpy library. If we examine these data frames, the missing values are displayed as NaN.

X_train
X
0 3.0
1 2.0
2 1.0
3 4.0
4 5.0
5 NaN
6 NaN
7 5.0
8 2.0
X_test
X
0 3.0
1 NaN
2 NaN
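
Before imputing, it can be helpful to confirm how many values are actually missing in each column. The following is a small supplementary check, not part of the original example, using the pandas isna method.

# Counting missing values per column (supplementary check)
X_train.isna().sum()   # X_train contains 2 missing values in column X
X_test.isna().sum()    # X_test contains 2 missing values in column X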

We also create data frames S_train and S_test containing a string column S, again with some missing values indicated by np.nan.

# Creating toy data sets (categorical)
S_train = pd.DataFrame()
S_train['S'] = ['Hi', 'Med', 'Med', 'Hi', 'Low', 'Med', np.nan, 'Med', 'Hi']
S_test = pd.DataFrame()
S_test['S'] = [np.nan, np.nan, 'Low']
S_train
S
0 Hi
1 Med
2 Med
3 Hi
4 Low
5 Med
6 NaN
7 Med
8 Hi
S_test
S
0 NaN
1 NaN
2 Low

We impute missing values by creating a SimpleImputer object, available in the sklearn.impute library. We define a numerical imputation object imp_mean as

# Imputing numerical data with mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

Here, the parameter missing_values defines which value is treated as missing. The parameter strategy defines the imputation method. In this example, we use 'mean', meaning that the mean of all non-missing values will be imputed. We call the fit_transform method on the X_train data, which both calculates the mean to be imputed and imputes the missing values.

X_train_imp = imp_mean.fit_transform(X_train)
X_train_imp
array([[3.        ],
       [2.        ],
       [1.        ],
       [4.        ],
       [5.        ],
       [3.14285714],
       [3.14285714],
       [5.        ],
       [2.        ]])

As you can see, missing values are imputed with the mean. We can apply the same imputation (with the mean calculated on the training data) to the second data set X_test with the transform method.

X_test_imp = imp_mean.transform(X_test)
X_test_imp
array([[3.        ],
       [3.14285714],
       [3.14285714]])
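
As a quick sanity check, not part of the original example, we can confirm that the imputed value is indeed the mean of the non-missing training values. The fitted imputer stores this statistic in its statistics_ attribute.

# Verifying the imputed value against the training mean
X_train['X'].mean()    # pandas skips NaN by default: 22 / 7 ≈ 3.142857
imp_mean.statistics_   # the mean stored by the fitted imputer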

For categorical or string data, we can impute the most frequent value by setting the strategy parameter of the SimpleImputer object to 'most_frequent'. Here, we define the imputation object imp_mode, which imputes the most frequent category from the S_train data into both the S_train and S_test data sets.

# Imputing categorical data with mode
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
S_train_imp = imp_mode.fit_transform(S_train)
S_test_imp = imp_mode.transform(S_test)
S_train_imp
array([['Hi'],
       ['Med'],
       ['Med'],
       ['Hi'],
       ['Low'],
       ['Med'],
       ['Med'],
       ['Med'],
       ['Hi']], dtype=object)
S_test_imp
array([['Med'],
       ['Med'],
       ['Low']], dtype=object)
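
As a supplementary check, not in the original example, we can verify which category is the most frequent in S_train; the value_counts method shows that 'Med' occurs most often, which is why it is used to fill the missing entries.

# Checking the most frequent category in the training data
S_train['S'].value_counts()   # 'Med' appears 4 times, 'Hi' 3 times, 'Low' once
imp_mode.statistics_          # the category stored by the fitted imputer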

Normalization and scaling


For this example, we shall use the Iris data set again. We load the Iris data and split it into training and testing data sets, with the testing data comprising 1/3 of all observations.

from sklearn import datasets
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# loading the Iris data
iris = datasets.load_iris()
X = iris.data  # array for the features
y = iris.target  # array for the target
feature_names = iris.feature_names   # feature names
target_names = iris.target_names   # target names

# splitting the data into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333,
                                                    random_state=2020)
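
As a quick supplementary check, not shown in the original, we can inspect the shapes of the resulting arrays; with test_size=0.333, roughly one third of the 150 observations end up in the testing set.

# Checking the sizes of the training and testing sets
X_train.shape, X_test.shape   # roughly 100 training and 50 testing observations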

We can normalize the data to Z-scores using the StandardScaler transformation object, as we have seen in a previous chapter. The transformation object is trained on the training data X_train with the fit_transform method. Then the trained transformation is applied to the testing data with the transform method.

# z-score normalization
normZ = StandardScaler()
X_train_Z = normZ.fit_transform(X_train)
X_test_Z = normZ.transform(X_test)

The resulting means and standard deviations for the normalized training data set are:

X_train_Z.mean(axis=0)
array([-2.17187379e-15,  1.35447209e-16,  1.24344979e-16,  2.19407825e-16])
X_train_Z.std(axis=0)
array([1., 1., 1., 1.])

Likewise, the means and standard deviations of the normalized testing data set are:

X_test_Z.mean(axis=0)
array([-0.17055031, -0.10396525, -0.10087131, -0.04575343])
X_test_Z.std(axis=0)
array([0.94189599, 0.8734677 , 1.00463184, 0.97844516])
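
The testing means are close to, but not exactly, zero because the transformation uses the mean and standard deviation estimated from the training data, which are stored in the fitted scaler. The following supplementary snippet, not part of the original example, reproduces the transformation manually.

# The training statistics stored in the fitted scaler
normZ.mean_    # per-feature means estimated from X_train
normZ.scale_   # per-feature standard deviations estimated from X_train

# Reproducing the transformation of the testing data manually
X_test_Z_manual = (X_test - normZ.mean_) / normZ.scale_
np.allclose(X_test_Z_manual, X_test_Z)   # True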

To apply min-max scaling, which scales all features to the [0, 1] interval, we can use the MinMaxScaler object available in the sklearn.preprocessing library.

# min-max normalization
normMinMax = MinMaxScaler()
X_train_MinMax = normMinMax.fit_transform(X_train)
X_test_MinMax = normMinMax.transform(X_test)

Let's examine the minimum and the maximum of the normalized training data.

X_train_MinMax.min(axis=0)
array([0., 0., 0., 0.])
X_train_MinMax.max(axis=0)
array([1., 1., 1., 1.])

Likewise, we examine the minimum and the maximum of the normalized testing data. It should be noted that the minimum and the maximum used in the transformation were determined from the training data. Thus there is no guarantee that the transformed testing values fall within the interval [0, 1].

X_test_MinMax.min(axis=0)
array([0.02777778, 0.08333333, 0.03389831, 0.04166667])
X_test_MinMax.max(axis=0)
array([0.94444444, 0.75      , 0.96610169, 1.        ])
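
As with the z-score case, the min-max transformation relies only on statistics from the training data, stored in the fitted scaler as data_min_ and data_range_. The following supplementary snippet, not part of the original example, reproduces the transformation manually.

# The training minima and ranges stored in the fitted scaler
normMinMax.data_min_     # per-feature minima of X_train
normMinMax.data_range_   # per-feature ranges (max - min) of X_train

# Reproducing the transformation of the testing data manually
X_test_MinMax_manual = (X_test - normMinMax.data_min_) / normMinMax.data_range_
np.allclose(X_test_MinMax_manual, X_test_MinMax)   # True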