Predictors


Nearest Neighbor Classifiers


A nearest-neighbor classifier based on the Euclidean distance is implemented in the package class in R. To show how to use the nearest-neighbor classifier in R, we split the Iris data set into a training set iris.training and a test set iris.test as demonstrated in Sect. 5.6.2. The function knn requires a training set and a test set with only numerical attributes, as well as a vector containing the classifications for the training set. The parameter k determines how many nearest neighbors are considered for the classification decision.

# generating a random permutation of the row indices
n <- nrow(iris)
permut <- sample(n)

# shuffling the observations
iris.shuffled <- iris[permut,]

# splitting into training and test data
prop.train <- 2/3  # training data consists of 2/3 of the observations
n.train <- round(prop.train*n)
iris.training <- iris.shuffled[1:n.train,]
iris.test <- iris.shuffled[(n.train+1):n,]
library(class)
iris.knn <- knn(iris.training[,1:4],iris.test[,1:4],iris.training[,5],k=3)
table(iris.knn,iris.test[,5])
            
iris.knn     setosa versicolor virginica
  setosa         18          0         0
  versicolor      0         18         0
  virginica       0          1        13

The last line prints the confusion matrix.
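The accuracy on the test set can also be computed directly from the predictions; as a small sketch, continuing the code above:

```r
# proportion of correctly classified test observations
accuracy <- mean(iris.knn == iris.test[,5])
accuracy
```

With the confusion matrix above, this corresponds to (18+18+13)/50 = 0.98.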

Neural Networks


For the example of multilayer perceptrons in R, we use the same training and test data as for the nearest-neighbor classifier above. The multilayer perceptron can only process numerical values. Therefore, we first have to transform the categorical attribute Species into a numerical attribute:

x <- iris.training
x$Species <- as.numeric(x$Species)

The multilayer perceptron is constructed and trained in the following way, where the library neuralnet needs to be installed first:

library(neuralnet)
iris.nn <- neuralnet(Species + Sepal.Length ~
                     Sepal.Width + Petal.Length + Petal.Width, x,
                     hidden=c(3))

The first argument of neuralnet defines that the attributes Species and sepal length correspond to the output neurons. The other three attributes correspond to the input neurons. x specifies the training data set. The parameter hidden defines how many hidden layers the multilayer perceptron should have and how many neurons each hidden layer should contain. In the above example, there is only one hidden layer with three neurons. If we replaced c(3) by c(4,2), there would be two hidden layers, one with four and one with two neurons.
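As a sketch, this variant with two hidden layers would be constructed as follows; only the hidden argument changes, and the name iris.nn2 is chosen here simply to distinguish it from the network above:

```r
# two hidden layers: four neurons in the first, two in the second
iris.nn2 <- neuralnet(Species + Sepal.Length ~
                      Sepal.Width + Petal.Length + Petal.Width, x,
                      hidden=c(4,2))
```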

The training of the multilayer perceptron can take some time, especially for larger data sets.
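If the training takes too long, it can be limited via further parameters of neuralnet, among them threshold (the stopping criterion on the error gradient) and stepmax (the maximum number of training steps). The values below are only illustrative:

```r
# coarser stopping criterion and a lower limit on the training steps
iris.nn <- neuralnet(Species + Sepal.Length ~
                     Sepal.Width + Petal.Length + Petal.Width, x,
                     hidden=c(3), threshold=0.05, stepmax=1e4)
```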

When the training is finished, the multilayer perceptron can be visualized:

plot(iris.nn)

The visualization also includes the dummy (bias) neurons, as shown in Fig. 9.4.

The output of the multilayer perceptron for the test set can be calculated in the following way. Note that we first have to remove the output attributes from the test set:

# remove the two output attributes Sepal.Length and Species
y <- iris.test[,-c(1,5)]
y.out <- compute(iris.nn,y)

We can then compare the target outputs for the test set with the outputs computed by the multilayer perceptron. If we want to compute the squared errors for the second output neuron (the sepal length), we can do this in the following way:

y.sqerr <- (iris.test$Sepal.Length - y.out$net.result[,2])^2
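To summarize these errors in a single number, we can take their mean, i.e., the mean squared error on the test set:

```r
# mean squared error over all test observations
y.mse <- mean(y.sqerr)
```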

Support Vector Machines


For support vector machines, we use the same training and test data as for the nearest-neighbor classifier and the neural networks. A support vector machine to predict the species in the Iris data set based on the other attributes can be constructed in the following way. The package e1071 is needed and should be installed first if it has not been installed before:

library(e1071)
iris.svm <- svm(Species ~ ., data = iris.training)
table(predict(iris.svm,iris.test[1:4]),iris.test[,5])
            
             setosa versicolor virginica
  setosa         18          0         0
  versicolor      0         19         1
  virginica       0          0        12

The last line prints the confusion matrix for the test data set.
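The package e1071 also provides a grid search over the hyperparameters of the support vector machine via the function tune.svm, which evaluates each parameter combination by cross-validation. The parameter ranges below are only illustrative:

```r
# grid search over the kernel width (gamma) and the cost parameter
iris.tune <- tune.svm(Species ~ ., data = iris.training,
                      gamma = 10^(-2:1), cost = 10^(0:2))
iris.tune$best.parameters
```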

The function svm also works for support vector regression. We could, for instance, use

iris.svm <- svm(Petal.Width ~ ., data = iris.training)
sqerr <- (predict(iris.svm,iris.test[-4])-iris.test[,4])^2

to predict the numerical attribute petal width based on the other attributes and to compute the squared errors for the test set.
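As for the neural network above, the squared errors can be aggregated into the mean squared error on the test set:

```r
# mean squared error of the support vector regression on the test set
mse <- mean((predict(iris.svm,iris.test[-4]) - iris.test[,4])^2)
```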

Ensemble Methods


As an example of ensemble methods, we consider random forests with the training and test data of the Iris data set as before. The package randomForest needs to be installed first:

library(randomForest)
iris.rf <- randomForest(Species ~ ., data = iris.training)
table(predict(iris.rf,iris.test[1:4]),iris.test[,5])
            
             setosa versicolor virginica
  setosa         18          0         0
  versicolor      0         19         1
  virginica       0          0        12

In this way, a random forest is constructed to predict the species in the Iris data set based on the other attributes. The last line of the code prints the confusion matrix for the test data set.
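The randomForest package can also report how much each attribute contributes to the predictions of the forest: the function importance returns an importance measure for each attribute (for classification, by default the mean decrease in Gini impurity), and varImpPlot visualizes it:

```r
# importance of each attribute for the random forest
importance(iris.rf)
varImpPlot(iris.rf)
```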