R Code Snippets
Data Understanding in R
Histograms
Book reference: Chapter 4.3.1, page 40
Histograms are generated by the function hist. The simplest way to create a histogram is to use the corresponding attribute as an argument of the function hist; R will then automatically determine the number of bins for the histogram based on Sturges' rule. In order to generate the histogram for the petal length of the Iris data set, the following command is sufficient:
hist(iris$Petal.Length)
The partition into bins can also be specified directly. One of the parameters of hist is breaks. If the bins should cover the intervals \([a_{0},a_{1}),[a_{1},a_{2}),\ldots,[a_{k-1},a_{k}]\), then one can simply create a vector in R containing the values \(a_{i}\) and assign it to breaks. Note that \(a_{0}\) and \(a_{k}\) should be the minimum and maximum values of the corresponding attribute. If we want the boundaries of the bins at 1.0, 3.0, 4.0, 4.5, 6.9, then we would use
hist(iris$Petal.Length,breaks=c(1.0,3.0,4.0,4.5,6.9))
to generate the histogram. Note that in the case of bins of different widths, the heights of the boxes in the histogram do not show the relative frequencies. Instead, the areas of the boxes are chosen in such a way that they are proportional to the relative frequencies.
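The counts, densities, and bin boundaries underlying a histogram can also be inspected without plotting, for instance to verify this relation between bar area and relative frequency. The following is a small sketch using the breaks from above:
h <- hist(iris$Petal.Length, breaks=c(1.0,3.0,4.0,4.5,6.9), plot=FALSE)
h$counts/sum(h$counts)    # relative frequency of each bin
h$density                 # bar heights (densities) used in the plot
diff(h$breaks)*h$density  # bar areas, equal to the relative frequencies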
Boxplots
Book reference: Chapter 4.3.1, page 43
A boxplot for a single attribute is generated by
boxplot(iris$Petal.Length)
yielding the boxplot for the petal length of the Iris data set. Instead of a single attribute, we can hand over more than one attribute
boxplot(iris$Petal.Length,iris$Petal.Width)
to show the boxplots in the same plot. We can even use the whole data set as an argument to see the boxplots of all attributes in one plot:
boxplot(iris)
In this case, categorical attributes will be turned into numerical attributes by coding the values of the categorical attribute as 1, 2, . . . , so that these boxplots are also shown but do not really make sense.
In order to include the notches in the boxplots, we need to set the parameter notch to true:
boxplot(iris,notch=TRUE)
If one is interested in the precise values of the boxplot like the median, etc., one can use the print-command:
print(boxplot(iris$Sepal.Width))
$stats
[,1]
[1,] 2.2
[2,] 2.8
[3,] 3.0
[4,] 3.3
[5,] 4.0
$n
[1] 150
$conf
[,1]
[1,] 2.935497
[2,] 3.064503
$out
[1] 4.4 4.1 4.2 2.0
$group
[1] 1 1 1 1
$names
[1] "1"
The first five values are the end of the lower whisker, the first quartile, the median, the third quartile, and the end of the upper whisker of the attribute, respectively. (The whisker ends coincide with the minimum and maximum only if there are no outliers; in the output above, the minimum 2.0 and the maximum 4.4 appear in the list of outliers instead.) \(n\) is the number of data records. Then come the boundaries for the confidence interval for the notch, followed by the list of outliers. The last values group and names only make sense when more than one boxplot is included in the same plot. Then group is needed to identify to which attribute the outliers in the list of outliers belong. names just lists the names of the attributes.
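If only these values are of interest and no plot is needed, boxplot can also be called with plot=FALSE, for example (a small sketch):
bp <- boxplot(iris$Sepal.Width, plot=FALSE)
bp$stats[3,1]   # the median
bp$out          # the list of outliers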
Scatter Plots
Book reference: Chapter 4.3.1, page 45
A scatter plot of the petal width against petal length of the Iris data is obtained by
plot(iris$Petal.Width,iris$Petal.Length)
A matrix of scatter plots of each attribute against each other in one diagram is created with
plot(iris)
If symbols representing the values for some categorical attribute should be included in a scatter plot, this can be achieved by
plot(iris$Petal.Width,iris$Petal.Length,pch=as.numeric(iris$Species))
where in this example the three types of Iris are plotted with different symbols.
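A legend explaining these symbols can be added after the plot command, for example as in the following sketch (the legend position is an arbitrary choice):
plot(iris$Petal.Width,iris$Petal.Length,pch=as.numeric(iris$Species))
legend("topleft",legend=levels(iris$Species),pch=c(1,2,3))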
If there are some interesting or suspicious points in a scatter plot and one wants to find out which data records these are, one can do this by
plot(iris$Petal.Width,iris$Petal.Length)
identify(iris$Petal.Width,iris$Petal.Length)
and then clicking on the points. The indices of the corresponding records will be added to the scatter plot. To finish selecting points, press the Escape key.
Jitter can be added to a scatter plot in the following way:
plot(jitter(iris$Petal.Width),jitter(iris$Petal.Length))
Intensity plots and density plots with hexagonal binning, as shown in Fig. 4.9, can be generated by
plot(iris$Petal.Width,iris$Petal.Length,
col=rgb(0,0,0,50,maxColorValue=255),pch=16)
and
library(hexbin)
bin<-hexbin(iris$Petal.Width,iris$Petal.Length,xbins=50)
plot(bin)
respectively, where the library hexbin does not come along with the standard version of R and needs to be installed as described in the appendix on R. Note that such plots are not very useful for such a small data set as the Iris data set.
For three-dimensional scatter plots, the library scatterplot3d is needed and has to be installed first:
library(scatterplot3d)
scatterplot3d(iris$Sepal.Length,iris$Sepal.Width,iris$Petal.Length)
Principal Component Analysis
Book reference: Chapter 4.3.2.1, page 48
PCA can be carried out with R in the following way:
species <- which(colnames(iris)=="Species")
iris.pca <- prcomp(iris[,-species],center=T,scale=T)
print(iris.pca)
Standard deviations (1, .., p=4):
[1] 1.7083611 0.9560494 0.3830886 0.1439265
Rotation (n x k) = (4 x 4):
PC1 PC2 PC3 PC4
Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
summary(iris.pca)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
plot(predict(iris.pca))
For the Iris data set, it is necessary to exclude the categorical attribute Species from PCA. This is achieved by the first line of the code and calling prcomp with iris[,-species] instead of iris.
The parameter settings center=T, scale=T, where T is just a short form of TRUE, mean that z-score standardization is carried out for each attribute before applying PCA.
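The same result can be obtained by standardizing the data explicitly with the function scale and then applying prcomp without further scaling, as in the following sketch:
iris.scaled <- scale(iris[,-species])   # z-score standardization
iris.pca2 <- prcomp(iris.scaled)
print(iris.pca2)                        # same standard deviations and rotation as iris.pca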
The function predict can be applied in the above-described way to obtain the transformed version of the data set from which the PCA was computed. If the computed PCA transformation should be applied to another data set x, this can be achieved by
predict(iris.pca,newdata=x)
where x must have the same number of columns as the data set from which the PCA has been computed. In this case, x must have four columns which must be numerical. predict will compute the full transformation, so that the above command will also yield transformed data with four columns.
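As a small sketch, assume that x contains two new flower measurements (the values here are made up purely for illustration); the column names must match those of the original attributes:
x <- data.frame(Sepal.Length=c(5.0,6.3), Sepal.Width=c(3.5,2.8),
                Petal.Length=c(1.4,5.1), Petal.Width=c(0.2,1.9))
predict(iris.pca,newdata=x)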
Multidimensional Scaling
Book reference: Chapter 4.3.2.3, page 53
MDS requires the library MASS, which has to be loaded first (and installed if it is not already available). First, a distance matrix is needed for MDS. Identical objects leading to zero distances are not admitted. Therefore, if there are identical objects in a data set, all copies of the same object except one must be removed. In the Iris data set, there is only one pair of identical objects, so that one of them needs to be removed. The attribute Species is not numerical and is ignored for the distance computation.
library(MASS)
x <- iris[-102,]
species <- which(colnames(x)=="Species")
x.dist <- dist(x[,-species])
x.sammon <- sammon(x.dist,k=2)
plot(x.sammon$points)
Initial stress        : 0.00678
stress after  10 iters: 0.00404, magic = 0.500
stress after  12 iters: 0.00402
k = 2 means that MDS should reduce the original data set to two dimensions.
Note that in the above example code no normalization or z-score standardization is carried out.
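If standardization is desired, it can be applied to the numerical attributes before computing the distance matrix, for instance with the function scale (a small sketch reusing x and species from above):
x.dist <- dist(scale(x[,-species]))
x.sammon <- sammon(x.dist,k=2)
plot(x.sammon$points)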
Parallel Coordinates, Radar, and Star Plots
Book reference: Chapter 4.3.2.6-7, page 59
Parallel coordinates need the library MASS. All attributes must be numerical. If the attribute Species should be included in the parallel coordinates, one can achieve this in the following way:
library(MASS)
x <- iris
x$Species <- as.numeric(iris$Species)
parcoord(x)
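Alternatively, Species can be left out of the coordinates and used only to color the lines, as in the following sketch:
parcoord(iris[,1:4],col=as.numeric(iris$Species))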
Star and radar plots are obtained by the following two commands:
stars(iris)
stars(iris,locations=c(0,0))
Correlation Coefficients
Book reference: Chapter 4.4, page 62
Pearson’s, Spearman’s, and Kendall’s correlation coefficients are obtained by the following three commands:
cor(iris$Sepal.Length,iris$Sepal.Width)
cor.test(iris$Sepal.Length,iris$Sepal.Width,method="spearman")
cor.test(iris$Sepal.Length,iris$Sepal.Width,method="kendall")
Grubbs Test for Outlier Detection
Book reference: Chapter 4.5, page 65
Grubbs test for outlier detection needs the installation of the library outliers:
library(outliers)
grubbs.test(iris$Petal.Width)
	Grubbs test for one outlier

data:  iris$Petal.Width
G = 1.70638, U = 0.98033, p-value = 1
alternative hypothesis: highest value 2.5 is an outlier
Errors and Validation in R
Book reference: Chapter 5.5.1, page 112
Training and Testing Data
In order to apply the idea of using separate parts of a data set for training and testing, one needs to select random subsets of the data set. As a very simple example, we consider the Iris data set, which we want to split into a training set containing 2/3 of the original data and a test set containing the remaining 1/3. It would not be a good idea to take the first 100 records of the Iris data set for training purposes and the remaining 50 as a test set, since the records in the Iris data set are ordered with respect to the species. With such a split, all examples of Iris setosa and Iris versicolor would end up in the training set, but none of Iris virginica, which would form the test set. Therefore, we need a random sample from the Iris data set. If the records in the Iris data set were not systematically ordered, but in a random order, we could just take the first 100 records for training purposes and the remaining 50 as a test set.
Sampling and orderings in R provide a simple way to shuffle a data set, i.e., to generate a random order of the records.
First, we need to know the number n of records in our data set. Then we generate a permutation of the numbers 1, . . . , n by sampling from the vector containing the numbers 1, . . . , n, generated by the R-command c(1:n). We sample n numbers without replacement from this vector:
n <- length(iris$Species)
permut <- sample(c(1:n),n,replace=F)
Then we define this permutation as an ordering in which the records of our data set should be ordered and store the shuffled data set in the object iris.shuffled:
ord <- order(permut)
iris.shuffled <- iris[ord,]
Now define how large the fraction for the training set should be—here 2/3—and take the first two thirds of the data set as a training set and the last third as a test set:
prop.train <- 2/3 # training data consists of 2/3 of observations
k <- round(prop.train*n)
iris.training <- iris.shuffled[1:k,]
iris.test <- iris.shuffled[(k+1):n,]
The R-command sample can also be used to generate bootstrap samples by setting the parameter replace to TRUE instead of F (FALSE).
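A small sketch of drawing a bootstrap sample of the Iris data set in this way:
n <- length(iris$Species)
boot.indices <- sample(c(1:n),n,replace=TRUE)   # sampling with replacement
iris.boot <- iris[boot.indices,]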
Data Preparation in R
Missing Values
Book reference: Chapter 6.2.2, page 137
The logical constant NA (not available) is used to represent missing values in R. There are various methods in R that can handle missing values directly.
As a very simple example, we consider the mean value. We create a data set with one attribute containing four missing values and try to compute the mean:
x <- c(3,2,NA,4,NA,1,NA,NA,5)
mean(x)
[1] NA
The mean value is in this case also a missing value, since R has no information about the missing values and how to handle them. But if we explicitly say that missing values should simply be ignored for the computation of the mean value (na.rm=T), then R returns the mean value of all nonmissing values:
mean(x,na.rm=T)
[1] 3
Note that this computation of the mean value implicitly assumes that the values are missing completely at random (MCAR).
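The number of missing values can be determined with is.na, and a very simple imputation, replacing each missing value by the mean of the nonmissing values, can be sketched as follows (again implicitly assuming MCAR):
sum(is.na(x))                              # number of missing values
x.imputed <- x
x.imputed[is.na(x.imputed)] <- mean(x,na.rm=TRUE)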
Normalization and Scaling
Book reference: Chapter 6.6.3, page 153
Normalization and standardization of numeric attributes can be achieved in the following way. The function is.factor returns TRUE if the corresponding attribute is categorical (or ordinal), so that we can use this function to ensure that normalization is applied only to the numerical attributes, but not to the categorical ones. With the following R-code, z-score standardization is applied to all numerical attributes:
iris.norm <- iris
# for loop over each column
for (i in c(1:length(iris.norm))){
if (!is.factor(iris.norm[,i])){
attr.mean <- mean(iris.norm[,i])
attr.sd <- sd(iris.norm[,i])
iris.norm[,i] <- (iris.norm[,i]-attr.mean)/attr.sd
}
}
Other normalization and standardization techniques can be carried out in a similar manner. Of course, instead of the functions mean (for the mean value) and sd (for the standard deviation), other functions like min (for the minimum), max (for the maximum), median (for the median), or IQR (for the interquartile range) have to be used.
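For example, min-max normalization to the unit interval can be sketched in the same style:
iris.mm <- iris
# for loop over each column
for (i in c(1:length(iris.mm))){
  if (!is.factor(iris.mm[,i])){
    attr.min <- min(iris.mm[,i])
    attr.max <- max(iris.mm[,i])
    iris.mm[,i] <- (iris.mm[,i]-attr.min)/(attr.max-attr.min)
  }
}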
Finding Patterns in R
Hierarchical Clustering
Book reference: Chapter 7.1, page 159
As an example, we apply hierarchical clustering to the Iris data set, ignoring the categorical attribute Species. We use the normalized Iris data set iris.norm that is constructed in Sect. 6.6.2.3 (the normalization code is repeated below). We can apply hierarchical clustering after removing the categorical attribute and can plot the dendrogram afterwards:
#
# NORMALIZATION
#
# using the iris data as an example
iris.norm <- iris
# for loop over each column
for (i in c(1:length(iris.norm))){
if (!is.factor(iris.norm[,i])){
attr.mean <- mean(iris.norm[,i])
attr.sd <- sd(iris.norm[,i])
iris.norm[,i] <- (iris.norm[,i]-attr.mean)/attr.sd
}
}
iris.num <- iris.norm[1:4]
iris.cl <- hclust(dist(iris.num), method="ward.D")
plot(iris.cl)
Here, the Ward method for the cluster distance aggregation function as described in Table 7.3 was chosen. For the other cluster distance aggregation functions in the table, one simply has to replace ward.D by single (for single linkage), by complete (for complete linkage), by average (for average linkage), or by centroid.
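A flat partition can be obtained from the dendrogram by cutting it at a chosen number of clusters with the function cutree, for instance into three clusters, which can then be compared with the known species:
iris.clusters <- cutree(iris.cl,k=3)
table(iris.norm$Species,iris.clusters)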
For heatmaps, the library gplots is required that needs installing first:
library(gplots)
rowv <- as.dendrogram(hclust(dist(iris.num), method="ward.D"))
colv <- as.dendrogram(hclust(dist(t(iris.num)), method="ward.D"))
heatmap.2(as.matrix(iris.num), Rowv=rowv,Colv=colv, trace="none")
Prototype-Based Clustering
Book reference: Chapter 7.3, page 173
The R-function kmeans carries out k-means clustering.
iris.km <- kmeans(iris.num,centers=3)
The desired number of clusters is specified by the parameter centers. The locations of the cluster centers and the assignment of the data to the clusters are obtained by the print function:
print(iris.km)
K-means clustering with 3 clusters of sizes 50, 53, 47

Cluster means:
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1  -1.01119138  0.85041372   -1.3006301  -1.2507035
2  -0.05005221 -0.88042696    0.3465767   0.2805873
3   1.13217737  0.08812645    0.9928284   1.0141287

Clustering vector:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 2 2 3 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2
 [75] 2 3 3 3 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
[112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 3 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3
[149] 3 2

Within cluster sum of squares by cluster:
[1] 47.35062 44.08754 47.45019
 (between_SS / total_SS =  76.7 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
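Since the true species are known for the Iris data set, the cluster assignment can be compared with them, for example by a simple cross-tabulation:
table(iris$Species,iris.km$cluster)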
For fuzzy c-means clustering, the library cluster is required. The clustering is carried out by the function fanny, similarly to kmeans:
library(cluster)
iris.fcm <- fanny(iris.num,3)
iris.fcm
The last line provides the necessary information on the clustering results, especially the membership degrees to the clusters.
Gaussian mixture decomposition, which can automatically determine the number of clusters, requires the library mclust, which must be installed first:
library(mclust)
iris.em <- mclustBIC(iris.num[,1:4])
iris.mod <- mclustModel(iris.num[,1:4],iris.em)
summary(iris.mod)
The last line lists the assignment of the data to the clusters.
Density-based clustering with DBSCAN is implemented in the library fpc which needs installation first:
library(fpc)
iris.dbscan <- dbscan(iris.num[,1:4],1.0,showplot=T)
iris.dbscan$cluster
The last line will print out the assignment of the data to the clusters. Singletons or outliers are marked by the number zero. The second argument of dbscan (in the above example 1.0) is the parameter ε (eps) for DBSCAN. showplot=T will generate a plot of the clustering result projected to the first two dimensions of the data set.
Self Organizing Maps
Book reference: Chapter 7.5, page 187
The library som, which needs to be installed first, provides methods for self-organizing maps:
library(som)
iris.som <- som(iris.num,xdim=5,ydim=5)
plot(iris.som)
xdim and ydim define the number of nodes in the mesh in x- and y-directions, respectively. plot will show, for each node in the mesh, a representation of the values in the form of parallel coordinates.
Association Rules
Book reference: Chapter 7.9, page 189
For association rule mining, the library arules is required, in which the function apriori is defined. This library does not come along with R directly and needs to be installed first.
Here we use an artificial data set baskets that we enter manually. The data set is a list of vectors where each vector contains the items that were bought:
library(arules)
baskets <- list(c("a","b","c"), c("a","d","e"),
c("b","c","d"), c("a","b","c","d"),
c("b","c"), c("a","b","d"),
c("d","e"), c("a","b","c","d"),
c("c","d","e"), c("a","b","c"))
rules <- apriori(baskets,parameter = list(supp=0.1,conf=0.8,target="rules"))
inspect(rules)
Loading required package: Matrix
Attaching package: ‘arules’
The following objects are masked from ‘package:base’:
abbreviate, write
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.8 0.1 1 none FALSE TRUE 5 0.1 1
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 1
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[5 item(s), 10 transaction(s)] done [0.00s].
sorting and recoding items ... [5 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [9 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
lhs rhs support confidence coverage lift count
[1] {e} => {d} 0.3 1.0000000 0.3 1.428571 3
[2] {a} => {b} 0.5 0.8333333 0.6 1.190476 5
[3] {b} => {c} 0.6 0.8571429 0.7 1.224490 6
[4] {c} => {b} 0.6 0.8571429 0.7 1.224490 6
[5] {a,e} => {d} 0.1 1.0000000 0.1 1.428571 1
[6] {c,e} => {d} 0.1 1.0000000 0.1 1.428571 1
[7] {a,b} => {c} 0.4 0.8000000 0.5 1.142857 4
[8] {a,c} => {b} 0.4 1.0000000 0.4 1.428571 4
[9] {a,c,d} => {b} 0.2 1.0000000 0.2 1.428571 2
The last command lists the rules with their support, confidence, and lift.
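The rules can also be sorted, for example by lift, before inspecting them; the following sketch shows the three rules with the highest lift:
inspect(head(sort(rules,by="lift"),3))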
Finding Explanations
Decision Trees
Book reference: Chapter 8.1, page 220
R also allows for much finer control of the decision tree construction. The script below demonstrates how to create a simple tree for the Iris data set using a training set of 50 randomly chosen records. Then the tree is displayed, and a confusion matrix for the test set (the remaining 100 records of the Iris data set) is printed. The libraries rpart, which comes along with the standard installation of R, and rattle, which needs to be installed, are required:
library(rpart)
iris.train <- c(sample(1:150,50))
iris.dtree <- rpart(Species~.,data=iris,subset=iris.train)
library(rattle)
drawTreeNodes(iris.dtree)
table(predict(iris.dtree,iris[-iris.train,],type="class"),
iris[-iris.train,"Species"])
Loading required package: tibble
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
             setosa versicolor virginica
  setosa         35          0         0
  versicolor      0         27         1
  virginica       0          6        31
In addition to many options related to tree construction, R also offers many ways to beautify the graphical representation. We refer to R manuals for more details.
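As an illustration of such options (the parameter values here are arbitrary choices for the sake of the example), the minimum number of records required to attempt a split and the complexity parameter can be set via rpart.control:
iris.dtree2 <- rpart(Species~.,data=iris,subset=iris.train,
                     control=rpart.control(minsplit=10,cp=0.02))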
Naive Bayes Classifiers
Book reference: Chapter 8.2, page 230
Naive Bayes classifiers use normal distributions by default for numerical attributes. The package e1071 must be installed first:
library(e1071)
iris.train <- c(sample(1:150,75))
iris.nbayes <- naiveBayes(Species~.,data=iris,subset=iris.train)
table(predict(iris.nbayes,iris[-iris.train,],type="class"),
iris[-iris.train,"Species"])
             setosa versicolor virginica
  setosa         28          0         0
  versicolor      0         22         1
  virginica       0          2        22
As in the example of the decision tree, the Iris data set is split into a training and a test data set, and the confusion matrix is printed. The parameters for the normal distributions of the classes can be obtained in the following way:
print(iris.nbayes)
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
setosa versicolor virginica
0.2933333 0.3466667 0.3600000
Conditional probabilities:
Sepal.Length
Y [,1] [,2]
setosa 4.972727 0.3781935
versicolor 6.034615 0.5059188
virginica 6.555556 0.5401804
Sepal.Width
Y [,1] [,2]
setosa 3.422727 0.4264268
versicolor 2.807692 0.2682135
virginica 2.937037 0.3236191
Petal.Length
Y [,1] [,2]
setosa 1.413636 0.1457181
versicolor 4.315385 0.4333057
virginica 5.514815 0.4888792
Petal.Width
Y [,1] [,2]
setosa 0.2409091 0.09081164
versicolor 1.3307692 0.18279875
virginica 2.0185185 0.30386869
If Laplace correction should be applied for categorical attributes, this can be achieved by setting the parameter laplace to the desired value when calling the function naiveBayes.
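For example, with a Laplace correction of 1 (note that the Iris data set has only numerical predictors, so this call merely illustrates how the parameter is set):
iris.nbayes.lap <- naiveBayes(Species~.,data=iris,subset=iris.train,laplace=1)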
Regression
Book reference: Chapter 8.3, page 241
Least squares linear regression is implemented by the function lm (linear model). As an example, we construct a linear regression function to predict the petal width of the Iris data set based on the other numerical attributes:
iris.lm <- lm(iris$Petal.Width ~ iris$Sepal.Length
+ iris$Sepal.Width + iris$Petal.Length)
summary(iris.lm)
Call:
lm(formula = iris$Petal.Width ~ iris$Sepal.Length + iris$Sepal.Width +
iris$Petal.Length)
Residuals:
Min 1Q Median 3Q Max
-0.60959 -0.10134 -0.01089 0.09825 0.60685
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.24031 0.17837 -1.347 0.18
iris$Sepal.Length -0.20727 0.04751 -4.363 2.41e-05 ***
iris$Sepal.Width 0.22283 0.04894 4.553 1.10e-05 ***
iris$Petal.Length 0.52408 0.02449 21.399 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.192 on 146 degrees of freedom
Multiple R-squared: 0.9379, Adjusted R-squared: 0.9366
F-statistic: 734.4 on 3 and 146 DF, p-value: < 2.2e-16
The summary provides the necessary information about the regression result, including the coefficients of the regression function.
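The coefficients and the residuals can also be extracted directly, for instance to compute the mean squared error on the data (a small sketch):
coef(iris.lm)
mean(residuals(iris.lm)^2)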
If we want to use a polynomial as the regression function, we need to protect the evaluation of the corresponding power with the function I, which inhibits interpretation. As an example, we compute a regression function to predict the petal width based on a quadratic function in the petal length:
iris.lm <- lm(iris$Petal.Width ~ iris$Petal.Length +
I(iris$Petal.Length^2))
summary(iris.lm)
Call:
lm(formula = iris$Petal.Width ~ iris$Petal.Length + I(iris$Petal.Length^2))
Residuals:
Min 1Q Median 3Q Max
-0.56213 -0.12392 -0.01555 0.13547 0.64105
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.386781 0.079883 -4.842 3.22e-06 ***
iris$Petal.Length 0.433833 0.053652 8.086 2.10e-13 ***
I(iris$Petal.Length^2) -0.002569 0.007501 -0.342 0.732
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2071 on 147 degrees of freedom
Multiple R-squared: 0.9272, Adjusted R-squared: 0.9262
F-statistic: 935.7 on 2 and 147 DF, p-value: < 2.2e-16
Robust regression requires the library MASS, which has to be loaded first. Otherwise it is handled in the same way as least squares regression, using the function rlm instead of lm:
library(MASS)
iris.rlm <- rlm(iris$Petal.Width ~ iris$Sepal.Length
+ iris$Sepal.Width + iris$Petal.Length)
summary(iris.rlm)
Call: rlm(formula = iris$Petal.Width ~ iris$Sepal.Length + iris$Sepal.Width +
iris$Petal.Length)
Residuals:
Min 1Q Median 3Q Max
-0.606532 -0.099003 -0.009158 0.101636 0.609940
Coefficients:
Value Std. Error t value
(Intercept) -0.2375 0.1771 -1.3412
iris$Sepal.Length -0.2062 0.0472 -4.3713
iris$Sepal.Width 0.2201 0.0486 4.5297
iris$Petal.Length 0.5231 0.0243 21.5126
Residual standard error: 0.1492 on 146 degrees of freedom
The default method is based on Huber’s error function. If Tukey’s biweight should be used, the parameter method should be changed in the following way:
# robust regression with Tukey's biweight (MM-estimation)
iris.rlm <- rlm(iris$Petal.Width ~ iris$Sepal.Length
+ iris$Sepal.Width + iris$Petal.Length,
method="MM")
summary(iris.rlm)
Call: rlm(formula = iris$Petal.Width ~ iris$Sepal.Length + iris$Sepal.Width +
iris$Petal.Length, method = "MM")
Residuals:
Min 1Q Median 3Q Max
-0.60608 -0.09975 -0.01030 0.10375 0.61963
Coefficients:
Value Std. Error t value
(Intercept) -0.1896 0.1718 -1.1036
iris$Sepal.Length -0.2152 0.0457 -4.7036
iris$Sepal.Width 0.2181 0.0471 4.6276
iris$Petal.Length 0.5252 0.0236 22.2689
Residual standard error: 0.1672 on 146 degrees of freedom
A plot of the computed weights can be obtained by the following command:
plot(iris.rlm$w)
Predictors
Nearest Neighbor Classifiers
Book reference: Chapter 9.1, page 275
A nearest-neighbor classifier based on the Euclidean distance is implemented in the package class in R. To show how to use the nearest-neighbor classifier in R, we use the splitting of the Iris data set into a training set iris.training and a test set iris.test as it was demonstrated in Sect. 5.6.2. The function knn requires a training and a test set with only numerical attributes and a vector containing the classifications for the training set. The parameter k determines how many nearest neighbors are considered for the classification decision.
# generating indices for shuffling
n <- length(iris$Species)
permut <- sample(c(1:n),n,replace=F)
# shuffling the observations
ord <- order(permut)
iris.shuffled <- iris[ord,]
# splitting into training and testing data
prop.train <- 2/3 # training data consists of 2/3 of observations
k <- round(prop.train*n)
iris.training <- iris.shuffled[1:k,]
iris.test <- iris.shuffled[(k+1):n,]
library(class)
iris.knn <- knn(iris.training[,1:4],iris.test[,1:4],iris.training[,5],k=3)
table(iris.knn,iris.test[,5])
iris.knn     setosa versicolor virginica
  setosa         18          0         0
  versicolor      0         18         0
  virginica       0          1        13
The last line prints the confusion matrix.
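The accuracy on the test set can be computed directly from the predictions, for example:
mean(iris.knn==iris.test[,5])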
Neural Networks
Book reference: Chapter 9.2, page 282
For the example of multilayer perceptrons in R, we use the same training and test data as for the nearest-neighbor classifier above. The multilayer perceptron can only process numerical values. Therefore, we first have to transform the categorical attribute Species into a numerical attribute:
x <- iris.training
x$Species <- as.numeric(x$Species)
The multilayer perceptron is constructed and trained in the following way, where the library neuralnet needs to be installed first:
library(neuralnet)
iris.nn <- neuralnet(Species + Sepal.Length ~
Sepal.Width + Petal.Length + Petal.Width, x,
hidden=c(3))
The first argument of neuralnet defines that the attributes Species and sepal length correspond to the output neurons. The other three attributes correspond to the input neurons. x specifies the training data set. The parameter hidden defines how many hidden layers the multilayer perceptron should have and how many neurons each hidden layer should contain. In the above example, there is only one hidden layer with three neurons. If we replaced c(3) by c(4,2), there would be two hidden layers, one with four and one with two neurons.
The training of the multilayer perceptron can take some time, especially for larger data sets.
When the training is finished, the multilayer perceptron can be visualized:
plot(iris.nn)
The visualization also includes dummy neurons, as shown in Fig. 9.4.
The output of the multilayer perceptron for the test set can be calculated in the following way. Note that we first have to remove the output attributes from the test set:
y <- iris.test
y <- y[-5]
y <- y[-1]
y.out <- compute(iris.nn,y)
We can then compare the target outputs for the test set with the outputs computed by the multilayer perceptron. If we want to compute the squared errors for the second output neuron (the sepal length), we can do this in the following way:
y.sqerr <- (iris.test$Sepal.Length - y.out$net.result[,2])^2
Support Vector Machines
Book reference: Chapter 9.4, page 297
For support vector machines, we use the same training and test data as already used for the nearest-neighbor classifier and for the neural network. A support vector machine to predict the species in the Iris data set based on the other attributes can be constructed in the following way. The package e1071 is needed and should be installed first if it has not been installed before:
library(e1071)
iris.svm <- svm(Species ~ ., data = iris.training)
table(predict(iris.svm,iris.test[1:4]),iris.test[,5])
             setosa versicolor virginica
  setosa         18          0         0
  versicolor      0         19         1
  virginica       0          0        12
The last line prints the confusion matrix for the test data set.
The function svm also works for support vector regression. We could, for instance, use
iris.svm <- svm(Petal.Width ~ ., data = iris.training)
sqerr <- (predict(iris.svm,iris.test[-4])-iris.test[4])^2
to predict the numerical attribute petal width based on the other attributes and to compute the squared errors for the test set.
Ensemble Methods
Book reference: Chapter 9.5, page 304
As an example of ensemble methods, we consider random forests with the training and test data of the Iris data set as before. The package randomForest needs to be installed first:
library(randomForest)
iris.rf <- randomForest(Species ~., iris.training)
table(predict(iris.rf,iris.test[1:4]),iris.test[,5])
             setosa versicolor virginica
  setosa         18          0         0
  versicolor      0         19         1
  virginica       0          0        12
In this way, a random forest is constructed to predict the species in the Iris data set based on the other attributes. The last line of the code prints the confusion matrix for the test data set.
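The importance of the individual attributes as estimated by the random forest (by default the mean decrease of the Gini index) can be inspected with:
importance(iris.rf)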