# R Code Snippets

Download Code

## Data Understanding in R

### Histograms

Book reference: Chapter 4.3.1, page 40Histograms are generated by the function `hist`

. The simplest way to create a histogram is to just use the corresponding attribute as an argument of the function `hist`

, and R will automatically determine the number of bins for the histogram based on Sturge’s rule. In order to generate the histogram for the petal length of the Iris data set, the following command is sufficient:

`hist(iris$Petal.Length)`

The partition into bins can also be specified directly. One of the parameters of `hist`

is breaks. If the bins should cover the intervals \([a_{0},a_{1}),[a_{1},a_{2}),...,[a_{k−1},a_{k}]\) , then one can simply create a vector in R containing the values \(a_{i}\) and assign it to breaks. Note that \(a_{0}\) and \(a_{k}\) should be the minimum and maximum values of the corresponding attribute. If we want the boundaries for the bins at 1.0, 3.0, 4.5, 4.0, 6.1, then we would use

`hist(iris$Petal.Length,breaks=c(1.0,3.0,4.5,4.0,6.9))`

to generate the histogram. Note that in the case of bins with different length, the heights of the boxes in the histogram do not show the relative frequencies. The areas of the boxes are chosen in such a way that they are proportional to the relative frequencies.

### Boxplots

Book reference: Chapter 4.3.1, page 43A boxplot for a single attribute is generated by

`boxplot(iris$Petal.Length)`

yielding the boxplot for the petal length of the Iris data set. Instead of a single attribute, we can hand over more than one attribute

`boxplot(iris$Petal.Length,iris$Petal.Width)`

to show the boxplots in the same plot. We can even use the whole data set as an argument to see the boxplots of all attributes in one plot:

`boxplot(iris)`

In this case, categorical attributes will be turned into numerical attributes by coding the values of the categorical attribute as 1, 2, . . . , so that these boxplots are also shown but do not really make sense.

In order to include the notches in the boxplots, we need to set the parameter `notch`

to true:

`boxplot(iris,notch=TRUE)`

If one is interested in the precise values of the boxplot like the median, etc., one can use the print-command:

`print(boxplot(iris$Sepal.Width))`

$stats [,1] [1,] 2.2 [2,] 2.8 [3,] 3.0 [4,] 3.3 [5,] 4.0 $n [1] 150 $conf [,1] [1,] 2.935497 [2,] 3.064503 $out [1] 4.4 4.1 4.2 2.0 $group [1] 1 1 1 1 $names [1] "1"

The first five values are the minimum, the first quartile, the median, the third quartile, and the maximum value of the attribute, respectively. \(n\) is the number of data. Then come the boundaries for the confidence interval for the notch, followed by the list of outliers. The last values `group`

and `names`

only make sense when more than one boxplot is included in the same plot. Then `group`

is needed to identify to which attribute the outliers in the list of outliers belong. `names`

just lists the names of the attributes.

### Scatter Plots

Book reference: Chapter 4.3.1, page 45A scatter plot of the petal width against petal length of the Iris data is obtained by

`plot(iris$Petal.Width,iris$Petal.Length)`

All scatter plots of each attribute against each other in one diagram are created with

`plot(iris)`

If symbols representing the values for some categorical attribute should be included in a scatter plot, this can be achieved by

`plot(iris$Petal.Width,iris$Petal.Length,pch=as.numeric(iris$Species))`

where in this example the three types of Iris are plotted with different symbols.

If there are some interesting or suspicious points in a scatter plot and one wants to find out which data records these are, one can do this by

```
plot(iris$Petal.Width,iris$Petal.Length)
identify(iris$Petal.Width,iris$Petal.Length)
```

and then clicking on the points. The index of the corresponding records will be added to the scatter plot. To finish selecting points, press the ESCAPE-key.

Jitter can be added to a scatter plot in the following way:

`plot(jitter(iris$Petal.Width),jitter(iris$Petal.Length))`

Intensity plots and density plots with hexagonal binning, as they are shown Fig. 4.9, can be generated by

```
plot(iris$Petal.Width,iris$Petal.Length,
col=rgb(0,0,0,50,maxColorValue=255),pch=16)
```

and

```
library(hexbin)
bin<-hexbin(iris$Petal.Width,iris$Petal.Length,xbins=50)
plot(bin)
```

respectively, where the library `hexbin`

does not come along with the standard version of R and needs to be installed as described in the appendix on R. Note that such plots are not very useful for such a small data sets like the Iris data set.

For three-dimensional scatter plots, the library `scatterplots3d`

is needed and has to be installed first:

```
library(scatterplot3d)
scatterplot3d(iris$Sepal.Length,iris$Sepal.Width,iris$Petal.Length)
```

### Principal Component Analysis

Book reference: Chapter 4.3.2.1, page 48PCA can be carried out with R in the following way:

```
species <- which(colnames(iris)=="Species")
iris.pca <- prcomp(iris[,-species],center=T,scale=T)
print(iris.pca)
```

Standard deviations (1, .., p=4): [1] 1.7083611 0.9560494 0.3830886 0.1439265 Rotation (n x k) = (4 x 4): PC1 PC2 PC3 PC4 Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863 Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096 Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492 Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971

`summary(iris.pca)`

Importance of components: PC1 PC2 PC3 PC4 Standard deviation 1.7084 0.9560 0.38309 0.14393 Proportion of Variance 0.7296 0.2285 0.03669 0.00518 Cumulative Proportion 0.7296 0.9581 0.99482 1.00000

`plot(predict(iris.pca))`

For the Iris data set, it is necessary to exclude the categorical attribute `Species`

from PCA. This is achieved by the first line of the code and calling `prcomp`

with `iris[,-species]`

instead of `iris`

.

The parameter settings `center=T`

, `scale=T`

, where `T`

is just a short form of `TRUE`

, mean that z-score standardization is carried out for each attribute before applying PCA.

The function `predict`

can be applied in the above-described way to obtain the transformed data from which the PCA was computed. If the computed PCA transformation should be applied to another data set `x`

, this can be achieved by

`predict(iris.pca,newdata=x)`

where `x`

must have the same number of columns as the data set from which the PCA has been computed. In this case, `x`

must have four columns which must be numerical. `predict`

will compute the full transformation, so that the above command will also yield transformed data with four columns.

### Multidimensional Scaling

Book reference: Chapter 4.3.2.3, page 53MDS requires the library `MASS`

which is not included in the standard version of R and needs installing. First, a distance matrix is needed for MDS. Identical objects leading to zero distances are not admitted. Therefore, if there are identical objects in a data set, all copies of the same object except one must be removed. In the Iris data set, there is only one pair of identical objects, so that one of them needs to be removed. The `Species`

is not a numerical attribute and will be ignored for the distance.

```
library(MASS)
x <- iris[-102,]
species <- which(colnames(x)=="Species")
x.dist <- dist(x[,-species])
x.sammon <- sammon(x.dist,k=2)
plot(x.sammon$points)
```

Initial stress : 0.00678 stress after 10 iters: 0.00404, magic = 0.500 stress after 12 iters: 0.00402

`k = 2`

means that MDS should reduce the original data set to two dimensions.

Note that in the above example code no normalization or z-score standardization is carried out.

### Parallel Coordinates, Radar, and Star Plots

Book reference: Chapter 4.3.2.6-7, page 59Parallel coordinates need the library `MASS`

. All attributes must be numerical. If the attribute `Species`

should be included in the parallel coordinates, one can achieve this in the following way:

```
library(MASS)
x <- iris
x$Species <- as.numeric(iris$Species)
parcoord(x)
```

Star and radar plots are obtained by the following two commands:

`stars(iris)`

`stars(iris,locations=c(0,0))`

### Correlation Coefficients

Book reference: Chapter 4.4, page 62Pearson’s, Spearman’s, and Kendall’s correlation coefficients are obtained by the following three commands:

```
cor(iris$Sepal.Length,iris$Sepal.Width)
cor.test(iris$Sepal.Length,iris$Sepal.Width,method="spearman")
cor.test(iris$Sepal.Length,iris$Sepal.Width,method="kendall")
```

### Grubbs Test for Outlier Detection

Book reference: Chapter 4.5, page 65Grubbs test for outlier detection needs the installation of the library `outliers`

:

```
library(outliers)
grubbs.test(iris$Petal.Width)
```

Grubbs test for one outlier data: iris$Petal.Width G = 1.70638, U = 0.98033, p-value = 1 alternative hypothesis: highest value 2.5 is an outlier

## Errors and Validation in R

Book reference: Chapter 5.5.1, page 112### Training and Testing Data

In order to apply the idea of using separate parts of a data set for training and testing, one needs to select random subsets of the data set. As a very simple example, we consider the Iris data set that we want to split into training and test sets. The size of the training set should contain 2/3 of the original data, and the test set 1/3. It would not be a good idea to take the first 100 records in the Iris data set for training purposes and the remaining 50 as a test set, since the records in the Iris data set are ordered with respect to the species.With such a split, all examples of Iris setosa and Iris versicolor would end up in the training set, but none of Iris versicolor, which would form the test set. Therefore, we need random sample from the Iris data set. If the records in the Iris data set were not systematically orderer, but in a random order, we could just take the first 100 records for training purposes and the remaining 50 as a test set.

Sampling and orderings in R provide a simple way to shuffle a data set, i.e., to generate a random order of the records.

First, we need to know the number `n`

of records in our data set. Then we generate a permutation of the numbers `1, . . . , n`

by sampling from the vector containing the numbers `1, . . . , n`

, generated by the R-command `c(1:n)`

. We sample `n`

numbers without replacement from this vector:

```
n <- length(iris$Species)
permut <- sample(c(1:n),n,replace=F)
```

Then we define this permutation as an ordering in which the records of our data set should be ordered and store the shuffled data set in the object `iris.shuffled`

:

```
ord <- order(permut)
iris.shuffled <- iris[ord,]
```

Now define how large the fraction for the training set should be—here 2/3—and take the first two thirds of the data set as a training set and the last third as a test set:

```
prop.train <- 2/3 # training data consists of 2/3 of observations
k <- round(prop.train*n)
iris.training <- iris.shuffled[1:k,]
iris.test <- iris.shuffled[(k+1):n,]
```

The R-command `sample`

can also be used to generate bootstrap samples by setting the parameter `replace`

to `TRUE`

instead of `F`

(`FALSE`

).

## Data Preparation in R

### Missing Values

Book reference: Chapter 6.2.2, page 137The logical constant `NA`

(not available) is used to represent missing values in R. There are various methods in R that can handle missing values directly.

As a very simple example, we consider the mean value.We create a data set with one attribute with four missing values and try to compute the mean:

```
x <- c(3,2,NA,4,NA,1,NA,NA,5)
mean(x)
```

<NA>

The mean value is in this case also a missing value, since R has no information about the missing values and how to handle them. But if we explicitly say that missing values should simply be ignored for the computation of the mean value (`na.rm=T`

), then R returns the mean value of all nonmissing values:

`mean(x,na.rm=T)`

3

Note that this computation of the mean value implicitly assumes that the values are missing completely at random (MCAR).

### Normalization and Scaling

Book reference: Chapter 6.6.3, page 153Normalization and standardization of numeric attributes can be achieved in the following way. The function `is.factor`

returns true if the corresponding attribute is categorical (or ordinal), so that we can ensure with this function that normalization is only applied to all numerical attributes, but not to the categorical ones. With the following R-code, z-score standardization is applied to all numerical attributes:

```
iris.norm <- iris
# for loop over each coloumn
for (i in c(1:length(iris.norm))){
if (!is.factor(iris.norm[,i])){
attr.mean <- mean(iris.norm[,i])
attr.sd <- sd(iris.norm[,i])
iris.norm[,i] <- (iris.norm[,i]-attr.mean)/attr.sd
}
}
```

Other normalization and standardization techniques can carried out in a similar manner. Of course, instead of the functions mean (for the mean value) and `sd`

(for the standard deviation), other functions like `min`

(for the minimum), `max`

(for the maximum), `median`

(for the median), or `IQR`

(for the interquartile range) have to be used.

## Finding Patterns in R

### Hierarchical Clustering

Book reference: Chapter 7.1, page 159As an example, we apply hierarchical clustering to the Iris data set, ignoring the categorical attribute `Species`

. We use the normalized Iris data set `iris.norm`

that is constructed in Sect. 6.6.2.3. We can apply hierarchical clustering after removing the categorical attribute and can plot the dendrogram afterwards:

```
#
# NORMALIZATION
#
# using the iris data as an example
iris.norm <- iris
# for loop over each coloumn
for (i in c(1:length(iris.norm))){
if (!is.factor(iris.norm[,i])){
attr.mean <- mean(iris.norm[,i])
attr.sd <- sd(iris.norm[,i])
iris.norm[,i] <- (iris.norm[,i]-attr.mean)/attr.sd
}
}
```

```
iris.num <- iris.norm[1:4]
iris.cl <- hclust(dist(iris.num), method="ward.D")
plot(iris.cl)
```

Here, the Ward method for the cluster distance aggregation function as described in Table 7.3 was chosen. For the other cluster distance aggregation functions in the table, one simply has to replace `ward.D`

by `single`

(for single linkage), by `complete`

(for complete linkage), by `average`

(for average linkage), or by `centroid`

.

For heatmaps, the library `gplots`

is required that needs installing first:

```
library(gplots)
rowv <- as.dendrogram(hclust(dist(iris.num), method="ward.D"))
colv <- as.dendrogram(hclust(dist(t(iris.num)), method="ward.D"))
heatmap.2(as.matrix(iris.num), Rowv=rowv,Colv=colv, trace="none")
```

### Prototype-Based Clustering

Book reference: Chapter 7.3, page 173The R-function `kmeans`

carries out k-means clustering.

`iris.km <- kmeans(iris.num,centers=3)`

The desired numbers of clusters is specified by the parameter `centers`

. The location of the cluster centers and the assignment of the data to the clusters is obtained by the `print`

function:

`print(iris.km)`

K-means clustering with 3 clusters of sizes 50, 53, 47 Cluster means: Sepal.Length Sepal.Width Petal.Length Petal.Width 1 -1.01119138 0.85041372 -1.3006301 -1.2507035 2 -0.05005221 -0.88042696 0.3465767 0.2805873 3 1.13217737 0.08812645 0.9928284 1.0141287 Clustering vector: [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 2 2 3 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2 [75] 2 3 3 3 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 3 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3 [149] 3 2 Within cluster sum of squares by cluster: [1] 47.35062 44.08754 47.45019 (between_SS / total_SS = 76.7 %) Available components: [1] "cluster" "centers" "totss" "withinss" "tot.withinss" [6] "betweenss" "size" "iter" "ifault"

For fuzzy c-means clustering, the library `cluster`

is required. The clustering is carried out by the method `fanny`

similar to `kmeans`

:

```
library(cluster)
iris.fcm <- fanny(iris.num,3)
iris.fcm
```

The last line provides the necessary information on the clustering results, especially the membership degrees to the clusters.

Gaussian mixture decomposition automatically determining the number of clusters requires the library `mclust`

to be installed first:

```
library(mclust)
iris.em <- mclustBIC(iris.num[,1:4])
iris.mod <- mclustModel(iris.num[,1:4],iris.em)
summary(iris.mod)
```

The last line lists the assignment of the data to the clusters.

Density-based clustering with DBSCAN is implemented in the library `fpc`

which needs installation first:

```
library(fpc)
iris.dbscan <- dbscan(iris.num[,1:4],1.0,showplot=T)
iris.dbscan$cluster
```

The last line will print out the assignment of the data to the clusters. Singletons or outliers are marked by the number zero. The second argument in `dbscan`

(in the above example 1:0) is the parameter " for DBSCAN. `showplot=T`

will generate a plot of the clustering result projected to the first two dimensions of the data set.

### Self Organizing Maps

Book reference: Chapter 7.5, page 187The library `som`

provides methods for self organizing maps. The library `som`

needs to be installed:

```
library(som)
iris.som <- som(iris.num,xdim=5,ydim=5)
plot(iris.som)
```

`xdim`

and `ydim`

define the number of nodes in the mesh in x- and y-directions, respectively. `plot`

will show, for each node in the mesh, a representation of the values in the form of parallel coordinates.

### Association Rules

Book reference: Chapter 7.9, page 189For association rule mining, the library `arules`

is required in which the function `apriori`

is defined. This library does not come along with R directly and needs to be installed first.

Here we use an artificial data set `basket`

that we enter manually. The data set is a list of vectors where each vector contains the items that were bought:

```
library(arules)
baskets <- list(c("a","b","c"), c("a","d","e"),
c("b","c","d"), c("a","b","c","d"),
c("b","c"), c("a","b","d"),
c("d","e"), c("a","b","c","d"),
c("c","d","e"), c("a","b","c"))
rules <- apriori(baskets,parameter = list(supp=0.1,conf=0.8,target="rules"))
inspect(rules)
```

Loading required package: Matrix Attaching package: ‘arules’ The following objects are masked from ‘package:base’: abbreviate, write

Apriori Parameter specification: confidence minval smax arem aval originalSupport maxtime support minlen 0.8 0.1 1 none FALSE TRUE 5 0.1 1 maxlen target ext 10 rules TRUE Algorithmic control: filter tree heap memopt load sort verbose 0.1 TRUE TRUE FALSE TRUE 2 TRUE Absolute minimum support count: 1 set item appearances ...[0 item(s)] done [0.00s]. set transactions ...[5 item(s), 10 transaction(s)] done [0.00s]. sorting and recoding items ... [5 item(s)] done [0.00s]. creating transaction tree ... done [0.00s]. checking subsets of size 1 2 3 4 done [0.00s]. writing ... [9 rule(s)] done [0.00s]. creating S4 object ... done [0.00s]. lhs rhs support confidence coverage lift count [1] {e} => {d} 0.3 1.0000000 0.3 1.428571 3 [2] {a} => {b} 0.5 0.8333333 0.6 1.190476 5 [3] {b} => {c} 0.6 0.8571429 0.7 1.224490 6 [4] {c} => {b} 0.6 0.8571429 0.7 1.224490 6 [5] {a,e} => {d} 0.1 1.0000000 0.1 1.428571 1 [6] {c,e} => {d} 0.1 1.0000000 0.1 1.428571 1 [7] {a,b} => {c} 0.4 0.8000000 0.5 1.142857 4 [8] {a,c} => {b} 0.4 1.0000000 0.4 1.428571 4 [9] {a,c,d} => {b} 0.2 1.0000000 0.2 1.428571 2

The last command lists the rules with their support, confidence, and lift.

## Finding Explanations

### Decision tree

Book reference: Chapter 8.1, page 220R also allows for much finer control of the decision tree construction. The script below demonstrates how to create a simple tree for the Iris data set using a training set of 100 records. Then the tree is displayed, and a confusion matrix for the test set—the remaining 50 records of the Iris data set—is printed. The libraries `rpart`

, which comes along with the standard installation of R, and `rattle`

, that needs to be installed, are required:

```
library(rpart)
iris.train <- c(sample(1:150,50))
iris.dtree <- rpart(Species~.,data=iris,subset=iris.train)
```

```
library(rattle)
drawTreeNodes(iris.dtree)
table(predict(iris.dtree,iris[-iris.train,],type="class"),
iris[-iris.train,"Species"])
```

Loading required package: tibble Loading required package: bitops Rattle: A free graphical interface for data science with R. Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd. Type 'rattle()' to shake, rattle, and roll your data.

setosa versicolor virginica setosa 35 0 0 versicolor 0 27 1 virginica 0 6 31

In addition to many options related to tree construction, R also offers many ways to beautify the graphical representation. We refer to R manuals for more details.

### Naive Bayes Classifiers

Book reference: Chapter 8.2, page 230Naive Bayes classifiers use normal distributions by default for numerical attributes. The package `e1071`

must be installed first:

```
library(e1071)
iris.train <- c(sample(1:150,75))
iris.nbayes <- naiveBayes(Species~.,data=iris,subset=iris.train)
table(predict(iris.nbayes,iris[-iris.train,],type="class"),
iris[-iris.train,"Species"])
```

setosa versicolor virginica setosa 28 0 0 versicolor 0 22 1 virginica 0 2 22

As in the example of the decision tree, the Iris data set is split into a training and a test data set, and the confusion matrix is printed. The parameters for the normal distributions of the classes can be obtained in the following way:

`print(iris.nbayes)`

Naive Bayes Classifier for Discrete Predictors Call: naiveBayes.default(x = X, y = Y, laplace = laplace) A-priori probabilities: Y setosa versicolor virginica 0.2933333 0.3466667 0.3600000 Conditional probabilities: Sepal.Length Y [,1] [,2] setosa 4.972727 0.3781935 versicolor 6.034615 0.5059188 virginica 6.555556 0.5401804 Sepal.Width Y [,1] [,2] setosa 3.422727 0.4264268 versicolor 2.807692 0.2682135 virginica 2.937037 0.3236191 Petal.Length Y [,1] [,2] setosa 1.413636 0.1457181 versicolor 4.315385 0.4333057 virginica 5.514815 0.4888792 Petal.Width Y [,1] [,2] setosa 0.2409091 0.09081164 versicolor 1.3307692 0.18279875 virginica 2.0185185 0.30386869

If Laplace correction should be applied for categorical attribute, this can be achieved by setting the parameter `laplace`

to the desired value when calling the function `naiveBayes`

.

### Regression

Book reference: Chapter 8.3, page 241Least squares linear regression is implemented by the function `lm`

(linear model). As an example, we construct a linear regression function to predict the petal width of the Iris data set based on the other numerical attributes:

```
iris.lm <- lm(iris$Petal.Width ~ iris$Sepal.Length
+ iris$Sepal.Width + iris$Petal.Length)
summary(iris.lm)
```

Call: lm(formula = iris$Petal.Width ~ iris$Sepal.Length + iris$Sepal.Width + iris$Petal.Length) Residuals: Min 1Q Median 3Q Max -0.60959 -0.10134 -0.01089 0.09825 0.60685 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.24031 0.17837 -1.347 0.18 iris$Sepal.Length -0.20727 0.04751 -4.363 2.41e-05 *** iris$Sepal.Width 0.22283 0.04894 4.553 1.10e-05 *** iris$Petal.Length 0.52408 0.02449 21.399 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.192 on 146 degrees of freedom Multiple R-squared: 0.9379, Adjusted R-squared: 0.9366 F-statistic: 734.4 on 3 and 146 DF, p-value: < 2.2e-16

The `summary`

provides the necessary information about the regression result, including the coefficient of the regression function.

If we want to use a polynomial as the regression function, we need to protect the evaluation of the corresponding power by the function `I`

inhibiting interpretation. As an example, we compute a regression function to predict the petal width based on a quadratic function in the petal length:

```
iris.lm <- lm(iris$Petal.Width ~ iris$Petal.Length +
I(iris$Petal.Length^2))
summary(iris.lm)
```

Call: lm(formula = iris$Petal.Width ~ iris$Petal.Length + I(iris$Petal.Length^2)) Residuals: Min 1Q Median 3Q Max -0.56213 -0.12392 -0.01555 0.13547 0.64105 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.386781 0.079883 -4.842 3.22e-06 *** iris$Petal.Length 0.433833 0.053652 8.086 2.10e-13 *** I(iris$Petal.Length^2) -0.002569 0.007501 -0.342 0.732 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.2071 on 147 degrees of freedom Multiple R-squared: 0.9272, Adjusted R-squared: 0.9262 F-statistic: 935.7 on 2 and 147 DF, p-value: < 2.2e-16

Robust regression requires the library `MASS`

, which needs installation. Otherwise it is handled in the same way as least squares regression, using the function `rlm`

instead of `lm`

:

```
library(MASS)
iris.rlm <- rlm(iris$Petal.Width ~ iris$Sepal.Length
+ iris$Sepal.Width + iris$Petal.Length)
summary(iris.rlm)
```

Call: rlm(formula = iris$Petal.Width ~ iris$Sepal.Length + iris$Sepal.Width + iris$Petal.Length) Residuals: Min 1Q Median 3Q Max -0.606532 -0.099003 -0.009158 0.101636 0.609940 Coefficients: Value Std. Error t value (Intercept) -0.2375 0.1771 -1.3412 iris$Sepal.Length -0.2062 0.0472 -4.3713 iris$Sepal.Width 0.2201 0.0486 4.5297 iris$Petal.Length 0.5231 0.0243 21.5126 Residual standard error: 0.1492 on 146 degrees of freedom

The default method is based on Huber’s error function. If Tukey’s biweight should be used, the parameter `method`

should be changed in the following way:

```
# ridge regression with Tukey's biweight
iris.rlm <- rlm(iris$Petal.Width ~ iris$Sepal.Length
+ iris$Sepal.Width + iris$Petal.Length,
method="MM")
summary(iris.rlm)
```

Call: rlm(formula = iris$Petal.Width ~ iris$Sepal.Length + iris$Sepal.Width + iris$Petal.Length, method = "MM") Residuals: Min 1Q Median 3Q Max -0.60608 -0.09975 -0.01030 0.10375 0.61963 Coefficients: Value Std. Error t value (Intercept) -0.1896 0.1718 -1.1036 iris$Sepal.Length -0.2152 0.0457 -4.7036 iris$Sepal.Width 0.2181 0.0471 4.6276 iris$Petal.Length 0.5252 0.0236 22.2689 Residual standard error: 0.1672 on 146 degrees of freedom

A plot of the computed weights can be obtained by the following command:

`plot(iris.rlm$w)`

## Predictors

### Nearest Neighbor Classifiers

Book reference: Chapter 9.1, page 275A nearest-neighbor classifier based on the Euclidean distance is implemented in the package `class`

in R. To show how to use the nearest-neighbor classifier in R, we use splitting of the Iris data set into a training set `iris.training`

and test set `iris.test`

as it was demonstrated in Sect. 5.6.2. The function `knn`

requires a training and a set with only numerical attributes and a vector containing the classifications for the training set. The parameter `k`

determines how many nearest neighbors are considered for the classification decision.

```
# generating indices for shuffling
n <- length(iris$Species)
permut <- sample(c(1:n),n,replace=F)
# shuffling the observations
ord <- order(permut)
iris.shuffled <- iris[ord,]
# splitting into training and testing data
prop.train <- 2/3 # training data consists of 2/3 of observations
k <- round(prop.train*n)
iris.training <- iris.shuffled[1:k,]
iris.test <- iris.shuffled[(k+1):n,]
```

```
library(class)
iris.knn <- knn(iris.training[,1:4],iris.test[,1:4],iris.training[,5],k=3)
table(iris.knn,iris.test[,5])
```

iris.knn setosa versicolor virginica setosa 18 0 0 versicolor 0 18 0 virginica 0 1 13

The last line prints the confusion matrix.

### Neural Networks

Book reference: Chapter 9.2, page 282For the example of multilayer perceptrons in R, we use the same training and test data as for the nearest-neighbor classifier above. The multilayer perceptron can only process numerical values. Therefore, we first have to transform the categorical attribute `Species`

into a numerical attribute:

```
x <- iris.training
x$Species <- as.numeric(x$Species)
```

The multilayer perceptron is constructed and trained in the following way, where the library `neuralnet`

needs to be installed first:

```
library(neuralnet)
iris.nn <- neuralnet(Species + Sepal.Length ~
Sepal.Width + Petal.Length + Petal.Width, x,
hidden=c(3))
```

The first argument of `neuralnet`

defines that the attributes `Species`

and `sepal length`

correspond to the output neurons. The other three attributes correspond to the input neurons. `x`

specifies the training data set. The parameter `hidden`

defines how many hidden layers the multilayer perceptron should have and how many neurons in each hidden layer should be. In the above example, there is only one hidden layer with three neurons. When we replace `c(3)`

by `c(4,2`

), there would be two hidden layers, one with four and one with two neurons.

The training of the multilayer perceptron can take some time, especially for larger data sets.

When the training is finished, the multilayer perceptron can be visualized:

`plot(iris.nn)`

The visualization includes also dummy neurons as shown in Fig. 9.4.

The output of the multilayer perceptron for the test set can be calculated in the following way. Note that we first have to remove the output attributes from the test set:

```
y <- iris.test
y <- y[-5]
y <- y[-1]
y.out <- compute(iris.nn,y)
```

We can then compare the target outputs for the training set with the outputs from the multilayer perceptron. If we want to compute the squared errors for the second output neuron -— the `sepal length`

—- we can do this in the following way:

`y.sqerr <- (y[1] - y.out$net.result[,2])^2`

### Support Vector Machines

Book reference: Chapter 9.4, page 297For support vector machine, we use the same training and test data as already for the nearest-neighbor classifier and for the neural networks. A support vector machine to predict the `species`

in the Iris data set based on the other attributes can be constructed in the following way. The package `e1071`

is needed and should be installed first if it has not been installed before:

```
library(e1071)
iris.svm <- svm(Species ~ ., data = iris.training)
table(predict(iris.svm,iris.test[1:4]),iris.test[,5])
```

setosa versicolor virginica setosa 18 0 0 versicolor 0 19 1 virginica 0 0 12

The last line prints the confusion matrix for the test data set.

The function `svm`

works also for support vector regression. We could, for instance, use

```
iris.svm <- svm(Petal.Width ~ ., data = iris.training)
sqerr <- (predict(iris.svm,iris.test[-4])-iris.test[4])^2
```

to predict the numerical attribute `petal width`

based on the other attributes and to compute the squared errors for the test set.

### Ensemble Methods

Book reference: Chapter 9.5, page 304As an example for ensemble methods, we consider `random forest`

with the training and test data of the Iris data set as before. The package randomForest needs to be installed first:

```
library(randomForest)
iris.rf <- randomForest(Species ~., iris.training)
table(predict(iris.rf,iris.test[1:4]),iris.test[,5])
```

setosa versicolor virginica setosa 18 0 0 versicolor 0 19 1 virginica 0 0 12

In this way, a random forest is constructed to predict the `species`

in the Iris data set based on the other attributes. The last line of the code prints the confusion matrix for the test data set.