To apply the idea of using separate parts of a data set for training and testing, one needs to select random subsets of the data set. As a very simple example, we consider the Iris data set, which we want to split into a training set and a test set. The training set should contain 2/3 of the original data and the test set 1/3. It would not be a good idea to take the first 100 records in the Iris data set for training purposes and the remaining 50 as a test set, since the records in the Iris data set are ordered with respect to the species. With such a split, all examples of Iris setosa and Iris versicolor would end up in the training set, but none of Iris virginica, which would form the test set. Therefore, we need a random sample from the Iris data set. If the records in the Iris data set were not systematically ordered, but in a random order, we could just take the first 100 records for training purposes and the remaining 50 as a test set.
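The effect of the ordering can be checked directly in R. The following is a minimal sketch, assuming the built-in iris data frame with its 150 records in the original order; the object names naive.training and naive.test are introduced only for this illustration.

```r
# Naive split: first 100 records for training, remaining 50 for testing
naive.training <- iris[1:100, ]
naive.test <- iris[101:150, ]

# Count how often each species occurs in the two parts
table(naive.training$Species)
table(naive.test$Species)
```

The first table should show no records of Iris virginica, while the second should contain records of Iris virginica only, which is why a random sample is needed.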
Sampling and orderings in R provide a simple way to shuffle a data set, i.e., to generate a random order of the records.
First, we need to know the number n of records in our data set. Then we generate a permutation of the numbers 1, ..., n by sampling from the vector containing the numbers 1, ..., n, generated by the R-command c(1:n). We sample n numbers without replacement from this vector:
n <- length(iris$Species)                      # number of records
permut <- sample(c(1:n), n, replace = FALSE)   # random permutation of 1, ..., n
Then we use this permutation to define the order of the records in our data set and store the shuffled data set in the object iris.shuffled:
ord <- order(permut)
iris.shuffled <- iris[ord,]
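Since sample draws random numbers, each run produces a different shuffle. The following sketch, which is an addition to the steps above and not part of the original procedure, fixes the random number generator with set.seed so that the shuffle can be reproduced. The seed value 1 and the name is.permutation are arbitrary choices made for this illustration. The last step checks that every original record appears exactly once in the shuffled data set.

```r
set.seed(1)  # arbitrary seed, assumed here only to make the result reproducible

n <- nrow(iris)                              # number of records
permut <- sample(1:n, n, replace = FALSE)    # random permutation of 1, ..., n
iris.shuffled <- iris[order(permut), ]       # reorder the records

# The row names still carry the original record numbers, so sorting them
# must give back 1, ..., n if no record was lost or duplicated
is.permutation <- all(sort(as.integer(rownames(iris.shuffled))) == 1:n)
is.permutation
```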
Now define how large the fraction for the training set should be (here 2/3) and take the first two thirds of the shuffled data set as the training set and the last third as the test set:
prop.train <- 2/3 # training data consists of 2/3 of observations
k <- round(prop.train*n)
iris.training <- iris.shuffled[1:k,]
iris.test <- iris.shuffled[(k+1):n,]
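It can be useful to verify the result of such a split. The sketch below repeats the steps in a self-contained form and then inspects the sizes and the species counts of both parts. As a simplification assumed for this illustration, it indexes the data frame directly with the sampled permutation, which also yields a random order; the seed value is again arbitrary.

```r
set.seed(1)  # arbitrary seed, assumed only for reproducibility

n <- nrow(iris)
iris.shuffled <- iris[sample(1:n, n, replace = FALSE), ]

prop.train <- 2/3                  # fraction of records used for training
k <- round(prop.train * n)         # number of training records
iris.training <- iris.shuffled[1:k, ]
iris.test <- iris.shuffled[(k + 1):n, ]

# Sizes of the two parts
nrow(iris.training)
nrow(iris.test)

# Species counts; all three species should now occur in both parts
table(iris.training$Species)
table(iris.test$Species)
```

For the Iris data set with 150 records, the training set contains 100 records and the test set 50. The species counts vary with the random shuffle.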
The R-command sample can also be used to generate bootstrap samples by setting the parameter replace to TRUE instead of FALSE.
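A bootstrap sample of the Iris data set could be drawn as in the following sketch. The object names boot.idx, iris.boot and n.distinct, as well as the seed value, are assumptions made for this illustration. Because records are drawn with replacement, some records typically occur several times and others not at all.

```r
set.seed(1)  # arbitrary seed, assumed only for reproducibility

n <- nrow(iris)

# Draw n record numbers with replacement
boot.idx <- sample(1:n, n, replace = TRUE)
iris.boot <- iris[boot.idx, ]

# Number of distinct original records contained in the bootstrap sample
n.distinct <- length(unique(boot.idx))
n.distinct
```

The bootstrap sample has the same number of records as the original data set, while n.distinct is usually smaller than n.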