# R Code Snippets

## Data Understanding in R

### Histograms

Book reference: Chapter 4.3.1, page 40

Histograms are generated by the function hist. The simplest way to create a histogram is to just use the corresponding attribute as an argument of the function hist, and R will automatically determine the number of bins for the histogram based on Sturge’s rule. In order to generate the histogram for the petal length of the Iris data set, the following command is sufficient:

hist(iris$Petal.Length) The partition into bins can also be specified directly. One of the parameters of hist is breaks. If the bins should cover the intervals $$[a_{0},a_{1}),[a_{1},a_{2}),...,[a_{k−1},a_{k}]$$ , then one can simply create a vector in R containing the values $$a_{i}$$ and assign it to breaks. Note that $$a_{0}$$ and $$a_{k}$$ should be the minimum and maximum values of the corresponding attribute. If we want the boundaries for the bins at 1.0, 3.0, 4.5, 4.0, 6.1, then we would use hist(iris$Petal.Length,breaks=c(1.0,3.0,4.5,4.0,6.9))

to generate the histogram. Note that in the case of bins with different length, the heights of the boxes in the histogram do not show the relative frequencies. The areas of the boxes are chosen in such a way that they are proportional to the relative frequencies.

### Boxplots

Book reference: Chapter 4.3.1, page 43

A boxplot for a single attribute is generated by

boxplot(iris$Petal.Length) yielding the boxplot for the petal length of the Iris data set. Instead of a single attribute, we can hand over more than one attribute boxplot(iris$Petal.Length,iris$Petal.Width) to show the boxplots in the same plot. We can even use the whole data set as an argument to see the boxplots of all attributes in one plot: boxplot(iris) In this case, categorical attributes will be turned into numerical attributes by coding the values of the categorical attribute as 1, 2, . . . , so that these boxplots are also shown but do not really make sense. In order to include the notches in the boxplots, we need to set the parameter notch to true: boxplot(iris,notch=TRUE) If one is interested in the precise values of the boxplot like the median, etc., one can use the print-command: print(boxplot(iris$Sepal.Width))
$stats [,1] [1,] 2.2 [2,] 2.8 [3,] 3.0 [4,] 3.3 [5,] 4.0$n
[1] 150

$conf [,1] [1,] 2.935497 [2,] 3.064503$out
[1] 4.4 4.1 4.2 2.0

$group [1] 1 1 1 1$names
[1] "1"

The first five values are the minimum, the first quartile, the median, the third quartile, and the maximum value of the attribute, respectively. $$n$$ is the number of data. Then come the boundaries for the confidence interval for the notch, followed by the list of outliers. The last values group and names only make sense when more than one boxplot is included in the same plot. Then group is needed to identify to which attribute the outliers in the list of outliers belong. names just lists the names of the attributes.

### Scatter Plots

Book reference: Chapter 4.3.1, page 45

A scatter plot of the petal width against petal length of the Iris data is obtained by

plot(iris$Petal.Width,iris$Petal.Length)

All scatter plots of each attribute against each other in one diagram are created with

plot(iris)

If symbols representing the values for some categorical attribute should be included in a scatter plot, this can be achieved by

plot(iris$Petal.Width,iris$Petal.Length,pch=as.numeric(iris$Species)) where in this example the three types of Iris are plotted with different symbols. If there are some interesting or suspicious points in a scatter plot and one wants to find out which data records these are, one can do this by  plot(iris$Petal.Width,iris$Petal.Length) identify(iris$Petal.Width,iris$Petal.Length) and then clicking on the points. The index of the corresponding records will be added to the scatter plot. To finish selecting points, press the ESCAPE-key. Jitter can be added to a scatter plot in the following way: plot(jitter(iris$Petal.Width),jitter(iris$Petal.Length)) Intensity plots and density plots with hexagonal binning, as they are shown Fig. 4.9, can be generated by plot(iris$Petal.Width,iris$Petal.Length, col=rgb(0,0,0,50,maxColorValue=255),pch=16) and library(hexbin) bin<-hexbin(iris$Petal.Width,iris$Petal.Length,xbins=50) plot(bin) respectively, where the library hexbin does not come along with the standard version of R and needs to be installed as described in the appendix on R. Note that such plots are not very useful for such a small data sets like the Iris data set. For three-dimensional scatter plots, the library scatterplots3d is needed and has to be installed first: library(scatterplot3d) scatterplot3d(iris$Sepal.Length,iris$Sepal.Width,iris$Petal.Length)

### Principal Component Analysis

Book reference: Chapter 4.3.2.1, page 48

PCA can be carried out with R in the following way:

species <- which(colnames(iris)=="Species")
iris.pca <- prcomp(iris[,-species],center=T,scale=T)
print(iris.pca)
Standard deviations (1, .., p=4):
[1] 1.7083611 0.9560494 0.3830886 0.1439265

Rotation (n x k) = (4 x 4):
PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971
summary(iris.pca)
Importance of components:
PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
plot(predict(iris.pca))

For the Iris data set, it is necessary to exclude the categorical attribute Species from PCA. This is achieved by the first line of the code and calling prcomp with iris[,-species] instead of iris.

The parameter settings center=T, scale=T, where T is just a short form of TRUE, mean that z-score standardization is carried out for each attribute before applying PCA.

The function predict can be applied in the above-described way to obtain the transformed data from which the PCA was computed. If the computed PCA transformation should be applied to another data set x, this can be achieved by

predict(iris.pca,newdata=x)

where x must have the same number of columns as the data set from which the PCA has been computed. In this case, x must have four columns which must be numerical. predict will compute the full transformation, so that the above command will also yield transformed data with four columns.

### Multidimensional Scaling

Book reference: Chapter 4.3.2.3, page 53

MDS requires the library MASS which is not included in the standard version of R and needs installing. First, a distance matrix is needed for MDS. Identical objects leading to zero distances are not admitted. Therefore, if there are identical objects in a data set, all copies of the same object except one must be removed. In the Iris data set, there is only one pair of identical objects, so that one of them needs to be removed. The Species is not a numerical attribute and will be ignored for the distance.

library(MASS)
x <- iris[-102,]
species <- which(colnames(x)=="Species")
x.dist <- dist(x[,-species])
x.sammon <- sammon(x.dist,k=2)
plot(x.sammon$points)  Initial stress : 0.00678 stress after 10 iters: 0.00404, magic = 0.500 stress after 12 iters: 0.00402 k = 2 means that MDS should reduce the original data set to two dimensions. Note that in the above example code no normalization or z-score standardization is carried out. ### Parallel Coordinates, Radar, and Star Plots Book reference: Chapter 4.3.2.6-7, page 59 Parallel coordinates need the library MASS. All attributes must be numerical. If the attribute Species should be included in the parallel coordinates, one can achieve this in the following way: library(MASS) x <- iris x$Species <- as.numeric(iris$Species) parcoord(x) Star and radar plots are obtained by the following two commands: stars(iris) stars(iris,locations=c(0,0)) ### Correlation Coefficients Book reference: Chapter 4.4, page 62 Pearson’s, Spearman’s, and Kendall’s correlation coefficients are obtained by the following three commands: cor(iris$Sepal.Length,iris$Sepal.Width) cor.test(iris$Sepal.Length,iris$Sepal.Width,method="spearman") cor.test(iris$Sepal.Length,iris$Sepal.Width,method="kendall") ### Grubbs Test for Outlier Detection Book reference: Chapter 4.5, page 65 Grubbs test for outlier detection needs the installation of the library outliers: library(outliers) grubbs.test(iris$Petal.Width)
	Grubbs test for one outlier

data:  iris$Petal.Width G = 1.70638, U = 0.98033, p-value = 1 alternative hypothesis: highest value 2.5 is an outlier ## Errors and Validation in R Book reference: Chapter 5.5.1, page 112 ### Training and Testing Data In order to apply the idea of using separate parts of a data set for training and testing, one needs to select random subsets of the data set. As a very simple example, we consider the Iris data set that we want to split into training and test sets. The size of the training set should contain 2/3 of the original data, and the test set 1/3. It would not be a good idea to take the first 100 records in the Iris data set for training purposes and the remaining 50 as a test set, since the records in the Iris data set are ordered with respect to the species.With such a split, all examples of Iris setosa and Iris versicolor would end up in the training set, but none of Iris versicolor, which would form the test set. Therefore, we need random sample from the Iris data set. If the records in the Iris data set were not systematically orderer, but in a random order, we could just take the first 100 records for training purposes and the remaining 50 as a test set. Sampling and orderings in R provide a simple way to shuffle a data set, i.e., to generate a random order of the records. First, we need to know the number n of records in our data set. Then we generate a permutation of the numbers 1, . . . , n by sampling from the vector containing the numbers 1, . . . , n, generated by the R-command c(1:n). We sample n numbers without replacement from this vector: n <- length(iris$Species)
permut <- sample(c(1:n),n,replace=F)

Then we define this permutation as an ordering in which the records of our data set should be ordered and store the shuffled data set in the object iris.shuffled:

ord <- order(permut)
iris.shuffled <- iris[ord,]

Now define how large the fraction for the training set should be—here 2/3—and take the first two thirds of the data set as a training set and the last third as a test set:

prop.train <- 2/3  # training data consists of 2/3 of observations
k <- round(prop.train*n)
iris.training <- iris.shuffled[1:k,]
iris.test <- iris.shuffled[(k+1):n,]

The R-command sample can also be used to generate bootstrap samples by setting the parameter replace to TRUE instead of F (FALSE).

## Data Preparation in R

### Missing Values

Book reference: Chapter 6.2.2, page 137

The logical constant NA (not available) is used to represent missing values in R. There are various methods in R that can handle missing values directly.

As a very simple example, we consider the mean value.We create a data set with one attribute with four missing values and try to compute the mean:

x <- c(3,2,NA,4,NA,1,NA,NA,5)
mean(x)
<NA>

The mean value is in this case also a missing value, since R has no information about the missing values and how to handle them. But if we explicitly say that missing values should simply be ignored for the computation of the mean value (na.rm=T), then R returns the mean value of all nonmissing values:

mean(x,na.rm=T)
3

Note that this computation of the mean value implicitly assumes that the values are missing completely at random (MCAR).

### Normalization and Scaling

Book reference: Chapter 6.6.3, page 153

Normalization and standardization of numeric attributes can be achieved in the following way. The function is.factor returns true if the corresponding attribute is categorical (or ordinal), so that we can ensure with this function that normalization is only applied to all numerical attributes, but not to the categorical ones. With the following R-code, z-score standardization is applied to all numerical attributes:

iris.norm <- iris

# for loop over each coloumn
for (i in c(1:length(iris.norm))){
if (!is.factor(iris.norm[,i])){
attr.mean <- mean(iris.norm[,i])
attr.sd <- sd(iris.norm[,i])
iris.norm[,i] <- (iris.norm[,i]-attr.mean)/attr.sd
}
}

Other normalization and standardization techniques can carried out in a similar manner. Of course, instead of the functions mean (for the mean value) and sd (for the standard deviation), other functions like min (for the minimum), max (for the maximum), median (for the median), or IQR (for the interquartile range) have to be used.

## Finding Patterns in R

### Hierarchical Clustering

Book reference: Chapter 7.1, page 159

As an example, we apply hierarchical clustering to the Iris data set, ignoring the categorical attribute Species. We use the normalized Iris data set iris.norm that is constructed in Sect. 6.6.2.3. We can apply hierarchical clustering after removing the categorical attribute and can plot the dendrogram afterwards:

#
# NORMALIZATION
#

# using the iris data as an example
iris.norm <- iris

# for loop over each coloumn
for (i in c(1:length(iris.norm))){
if (!is.factor(iris.norm[,i])){
attr.mean <- mean(iris.norm[,i])
attr.sd <- sd(iris.norm[,i])
iris.norm[,i] <- (iris.norm[,i]-attr.mean)/attr.sd
}
}
iris.num <- iris.norm[1:4]
iris.cl <- hclust(dist(iris.num), method="ward.D")
plot(iris.cl)

Here, the Ward method for the cluster distance aggregation function as described in Table 7.3 was chosen. For the other cluster distance aggregation functions in the table, one simply has to replace ward.D by single (for single linkage), by complete (for complete linkage), by average (for average linkage), or by centroid.

For heatmaps, the library gplots is required that needs installing first:

library(gplots)
rowv <- as.dendrogram(hclust(dist(iris.num), method="ward.D"))
colv <- as.dendrogram(hclust(dist(t(iris.num)), method="ward.D"))
heatmap.2(as.matrix(iris.num), Rowv=rowv,Colv=colv, trace="none")

### Prototype-Based Clustering

Book reference: Chapter 7.3, page 173

The R-function kmeans carries out k-means clustering.

iris.km <- kmeans(iris.num,centers=3)

The desired numbers of clusters is specified by the parameter centers. The location of the cluster centers and the assignment of the data to the clusters is obtained by the print function:

print(iris.km)
K-means clustering with 3 clusters of sizes 50, 53, 47

Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1  -1.01119138  0.85041372   -1.3006301  -1.2507035
2  -0.05005221 -0.88042696    0.3465767   0.2805873
3   1.13217737  0.08812645    0.9928284   1.0141287

Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 2 2 3 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2
[75] 2 3 3 3 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
[112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 3 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3
[149] 3 2

Within cluster sum of squares by cluster:
[1] 47.35062 44.08754 47.45019
(between_SS / total_SS =  76.7 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

For fuzzy c-means clustering, the library cluster is required. The clustering is carried out by the method fanny similar to kmeans:

library(cluster)
iris.fcm <- fanny(iris.num,3)
iris.fcm

The last line provides the necessary information on the clustering results, especially the membership degrees to the clusters.

Gaussian mixture decomposition automatically determining the number of clusters requires the library mclust to be installed first:

library(mclust)
iris.em <- mclustBIC(iris.num[,1:4])
iris.mod <- mclustModel(iris.num[,1:4],iris.em)
summary(iris.mod)

The last line lists the assignment of the data to the clusters.

Density-based clustering with DBSCAN is implemented in the library fpc which needs installation first:

library(fpc)
iris.dbscan <- dbscan(iris.num[,1:4],1.0,showplot=T)
iris.dbscan$cluster The last line will print out the assignment of the data to the clusters. Singletons or outliers are marked by the number zero. The second argument in dbscan (in the above example 1:0) is the parameter " for DBSCAN. showplot=T will generate a plot of the clustering result projected to the first two dimensions of the data set. ### Self Organizing Maps Book reference: Chapter 7.5, page 187 The library som provides methods for self organizing maps. The library som needs to be installed: library(som) iris.som <- som(iris.num,xdim=5,ydim=5) plot(iris.som) xdim and ydim define the number of nodes in the mesh in x- and y-directions, respectively. plot will show, for each node in the mesh, a representation of the values in the form of parallel coordinates. ### Association Rules Book reference: Chapter 7.9, page 189 For association rule mining, the library arules is required in which the function apriori is defined. This library does not come along with R directly and needs to be installed first. Here we use an artificial data set basket that we enter manually. The data set is a list of vectors where each vector contains the items that were bought: library(arules) baskets <- list(c("a","b","c"), c("a","d","e"), c("b","c","d"), c("a","b","c","d"), c("b","c"), c("a","b","d"), c("d","e"), c("a","b","c","d"), c("c","d","e"), c("a","b","c")) rules <- apriori(baskets,parameter = list(supp=0.1,conf=0.8,target="rules")) inspect(rules) Loading required package: Matrix Attaching package: ‘arules’ The following objects are masked from ‘package:base’: abbreviate, write Apriori Parameter specification: confidence minval smax arem aval originalSupport maxtime support minlen 0.8 0.1 1 none FALSE TRUE 5 0.1 1 maxlen target ext 10 rules TRUE Algorithmic control: filter tree heap memopt load sort verbose 0.1 TRUE TRUE FALSE TRUE 2 TRUE Absolute minimum support count: 1 set item appearances ...[0 item(s)] done [0.00s]. set transactions ...[5 item(s), 10 transaction(s)] done [0.00s]. sorting and recoding items ... [5 item(s)] done [0.00s]. creating transaction tree ... done [0.00s]. checking subsets of size 1 2 3 4 done [0.00s]. writing ... [9 rule(s)] done [0.00s]. creating S4 object ... done [0.00s]. lhs rhs support confidence coverage lift count [1] {e} => {d} 0.3 1.0000000 0.3 1.428571 3 [2] {a} => {b} 0.5 0.8333333 0.6 1.190476 5 [3] {b} => {c} 0.6 0.8571429 0.7 1.224490 6 [4] {c} => {b} 0.6 0.8571429 0.7 1.224490 6 [5] {a,e} => {d} 0.1 1.0000000 0.1 1.428571 1 [6] {c,e} => {d} 0.1 1.0000000 0.1 1.428571 1 [7] {a,b} => {c} 0.4 0.8000000 0.5 1.142857 4 [8] {a,c} => {b} 0.4 1.0000000 0.4 1.428571 4 [9] {a,c,d} => {b} 0.2 1.0000000 0.2 1.428571 2  The last command lists the rules with their support, confidence, and lift. ## Finding Explanations ### Decision tree Book reference: Chapter 8.1, page 220 R also allows for much finer control of the decision tree construction. The script below demonstrates how to create a simple tree for the Iris data set using a training set of 100 records. Then the tree is displayed, and a confusion matrix for the test set—the remaining 50 records of the Iris data set—is printed. The libraries rpart, which comes along with the standard installation of R, and rattle, that needs to be installed, are required: library(rpart) iris.train <- c(sample(1:150,50)) iris.dtree <- rpart(Species~.,data=iris,subset=iris.train) library(rattle) drawTreeNodes(iris.dtree) table(predict(iris.dtree,iris[-iris.train,],type="class"), iris[-iris.train,"Species"]) Loading required package: tibble Loading required package: bitops Rattle: A free graphical interface for data science with R. Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd. Type 'rattle()' to shake, rattle, and roll your data.  setosa versicolor virginica setosa 35 0 0 versicolor 0 27 1 virginica 0 6 31 In addition to many options related to tree construction, R also offers many ways to beautify the graphical representation. We refer to R manuals for more details. ### Naive Bayes Classifiers Book reference: Chapter 8.2, page 230 Naive Bayes classifiers use normal distributions by default for numerical attributes. The package e1071 must be installed first: library(e1071) iris.train <- c(sample(1:150,75)) iris.nbayes <- naiveBayes(Species~.,data=iris,subset=iris.train) table(predict(iris.nbayes,iris[-iris.train,],type="class"), iris[-iris.train,"Species"])  setosa versicolor virginica setosa 28 0 0 versicolor 0 22 1 virginica 0 2 22 As in the example of the decision tree, the Iris data set is split into a training and a test data set, and the confusion matrix is printed. The parameters for the normal distributions of the classes can be obtained in the following way: print(iris.nbayes) Naive Bayes Classifier for Discrete Predictors Call: naiveBayes.default(x = X, y = Y, laplace = laplace) A-priori probabilities: Y setosa versicolor virginica 0.2933333 0.3466667 0.3600000 Conditional probabilities: Sepal.Length Y [,1] [,2] setosa 4.972727 0.3781935 versicolor 6.034615 0.5059188 virginica 6.555556 0.5401804 Sepal.Width Y [,1] [,2] setosa 3.422727 0.4264268 versicolor 2.807692 0.2682135 virginica 2.937037 0.3236191 Petal.Length Y [,1] [,2] setosa 1.413636 0.1457181 versicolor 4.315385 0.4333057 virginica 5.514815 0.4888792 Petal.Width Y [,1] [,2] setosa 0.2409091 0.09081164 versicolor 1.3307692 0.18279875 virginica 2.0185185 0.30386869  If Laplace correction should be applied for categorical attribute, this can be achieved by setting the parameter laplace to the desired value when calling the function naiveBayes. ### Regression Book reference: Chapter 8.3, page 241 Least squares linear regression is implemented by the function lm (linear model). As an example, we construct a linear regression function to predict the petal width of the Iris data set based on the other numerical attributes: iris.lm <- lm(iris$Petal.Width ~ iris$Sepal.Length + iris$Sepal.Width + iris$Petal.Length) summary(iris.lm) Call: lm(formula = iris$Petal.Width ~ iris$Sepal.Length + iris$Sepal.Width +
iris$Petal.Length) Residuals: Min 1Q Median 3Q Max -0.60959 -0.10134 -0.01089 0.09825 0.60685 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.24031 0.17837 -1.347 0.18 iris$Sepal.Length -0.20727    0.04751  -4.363 2.41e-05 ***
iris$Sepal.Width 0.22283 0.04894 4.553 1.10e-05 *** iris$Petal.Length  0.52408    0.02449  21.399  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.192 on 146 degrees of freedom
Multiple R-squared:  0.9379,	Adjusted R-squared:  0.9366
F-statistic: 734.4 on 3 and 146 DF,  p-value: < 2.2e-16

The summary provides the necessary information about the regression result, including the coefficient of the regression function.

If we want to use a polynomial as the regression function, we need to protect the evaluation of the corresponding power by the function I inhibiting interpretation. As an example, we compute a regression function to predict the petal width based on a quadratic function in the petal length:

iris.lm <- lm(iris$Petal.Width ~ iris$Petal.Length +
I(iris$Petal.Length^2)) summary(iris.lm) Call: lm(formula = iris$Petal.Width ~ iris$Petal.Length + I(iris$Petal.Length^2))

Residuals:
Min       1Q   Median       3Q      Max
-0.56213 -0.12392 -0.01555  0.13547  0.64105

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)            -0.386781   0.079883  -4.842 3.22e-06 ***
iris$Petal.Length 0.433833 0.053652 8.086 2.10e-13 *** I(iris$Petal.Length^2) -0.002569   0.007501  -0.342    0.732
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2071 on 147 degrees of freedom
Multiple R-squared:  0.9272,	Adjusted R-squared:  0.9262
F-statistic: 935.7 on 2 and 147 DF,  p-value: < 2.2e-16

Robust regression requires the library MASS, which needs installation. Otherwise it is handled in the same way as least squares regression, using the function rlm instead of lm:

library(MASS)
iris.rlm <- rlm(iris$Petal.Width ~ iris$Sepal.Length
+ iris$Sepal.Width + iris$Petal.Length)
summary(iris.rlm)
Call: rlm(formula = iris$Petal.Width ~ iris$Sepal.Length + iris$Sepal.Width + iris$Petal.Length)
Residuals:
Min        1Q    Median        3Q       Max
-0.606532 -0.099003 -0.009158  0.101636  0.609940

Coefficients:
Value   Std. Error t value
(Intercept)       -0.2375  0.1771    -1.3412
iris$Sepal.Length -0.2062 0.0472 -4.3713 iris$Sepal.Width   0.2201  0.0486     4.5297
iris$Petal.Length 0.5231 0.0243 21.5126 Residual standard error: 0.1492 on 146 degrees of freedom The default method is based on Huber’s error function. If Tukey’s biweight should be used, the parameter method should be changed in the following way: # ridge regression with Tukey's biweight iris.rlm <- rlm(iris$Petal.Width ~ iris$Sepal.Length + iris$Sepal.Width + iris$Petal.Length, method="MM") summary(iris.rlm) Call: rlm(formula = iris$Petal.Width ~ iris$Sepal.Length + iris$Sepal.Width +
iris$Petal.Length, method = "MM") Residuals: Min 1Q Median 3Q Max -0.60608 -0.09975 -0.01030 0.10375 0.61963 Coefficients: Value Std. Error t value (Intercept) -0.1896 0.1718 -1.1036 iris$Sepal.Length -0.2152  0.0457    -4.7036
iris$Sepal.Width 0.2181 0.0471 4.6276 iris$Petal.Length  0.5252  0.0236    22.2689

Residual standard error: 0.1672 on 146 degrees of freedom

A plot of the computed weights can be obtained by the following command:

plot(iris.rlm$w) ## Predictors ### Nearest Neighbor Classifiers Book reference: Chapter 9.1, page 275 A nearest-neighbor classifier based on the Euclidean distance is implemented in the package class in R. To show how to use the nearest-neighbor classifier in R, we use splitting of the Iris data set into a training set iris.training and test set iris.test as it was demonstrated in Sect. 5.6.2. The function knn requires a training and a set with only numerical attributes and a vector containing the classifications for the training set. The parameter k determines how many nearest neighbors are considered for the classification decision. # generating indices for shuffling n <- length(iris$Species)
permut <- sample(c(1:n),n,replace=F)

# shuffling the observations
ord <- order(permut)
iris.shuffled <- iris[ord,]

# splitting into training and testing data
prop.train <- 2/3  # training data consists of 2/3 of observations
k <- round(prop.train*n)
iris.training <- iris.shuffled[1:k,]
iris.test <- iris.shuffled[(k+1):n,]
library(class)
iris.knn <- knn(iris.training[,1:4],iris.test[,1:4],iris.training[,5],k=3)
table(iris.knn,iris.test[,5])
iris.knn     setosa versicolor virginica
setosa         18          0         0
versicolor      0         18         0
virginica       0          1        13

The last line prints the confusion matrix.

### Neural Networks

Book reference: Chapter 9.2, page 282

For the example of multilayer perceptrons in R, we use the same training and test data as for the nearest-neighbor classifier above. The multilayer perceptron can only process numerical values. Therefore, we first have to transform the categorical attribute Species into a numerical attribute:

x <- iris.training
x$Species <- as.numeric(x$Species)

The multilayer perceptron is constructed and trained in the following way, where the library neuralnet needs to be installed first:

library(neuralnet)
iris.nn <- neuralnet(Species + Sepal.Length ~
Sepal.Width + Petal.Length + Petal.Width, x,
hidden=c(3))

The first argument of neuralnet defines that the attributes Species and sepal length correspond to the output neurons. The other three attributes correspond to the input neurons. x specifies the training data set. The parameter hidden defines how many hidden layers the multilayer perceptron should have and how many neurons in each hidden layer should be. In the above example, there is only one hidden layer with three neurons. When we replace c(3) by c(4,2), there would be two hidden layers, one with four and one with two neurons.

The training of the multilayer perceptron can take some time, especially for larger data sets.

When the training is finished, the multilayer perceptron can be visualized:

plot(iris.nn)

The visualization includes also dummy neurons as shown in Fig. 9.4.

The output of the multilayer perceptron for the test set can be calculated in the following way. Note that we first have to remove the output attributes from the test set:

y <- iris.test
y <- y[-5]
y <- y[-1]
y.out <- compute(iris.nn,y)

We can then compare the target outputs for the training set with the outputs from the multilayer perceptron. If we want to compute the squared errors for the second output neuron -— the sepal length —- we can do this in the following way:

y.sqerr <- (y[1] - y.out\$net.result[,2])^2

### Support Vector Machines

Book reference: Chapter 9.4, page 297

For support vector machine, we use the same training and test data as already for the nearest-neighbor classifier and for the neural networks. A support vector machine to predict the species in the Iris data set based on the other attributes can be constructed in the following way. The package e1071 is needed and should be installed first if it has not been installed before:

library(e1071)
iris.svm <- svm(Species ~ ., data = iris.training)
table(predict(iris.svm,iris.test[1:4]),iris.test[,5])
             setosa versicolor virginica
setosa         18          0         0
versicolor      0         19         1
virginica       0          0        12

The last line prints the confusion matrix for the test data set.

The function svm works also for support vector regression. We could, for instance, use

iris.svm <- svm(Petal.Width ~ ., data = iris.training)
sqerr <- (predict(iris.svm,iris.test[-4])-iris.test[4])^2

to predict the numerical attribute petal width based on the other attributes and to compute the squared errors for the test set.

### Ensemble Methods

Book reference: Chapter 9.5, page 304

As an example for ensemble methods, we consider random forest with the training and test data of the Iris data set as before. The package randomForest needs to be installed first:

library(randomForest)
iris.rf <- randomForest(Species ~., iris.training)
table(predict(iris.rf,iris.test[1:4]),iris.test[,5])
             setosa versicolor virginica
setosa         18          0         0
versicolor      0         19         1
virginica       0          0        12

In this way, a random forest is constructed to predict the species in the Iris data set based on the other attributes. The last line of the code prints the confusion matrix for the test data set.