We will use the Khan data set, which consists of tissue samples corresponding to four distinct types of small round blue cell tumors. Both training data and test data are provided.
# Examine the dimensions of the data
library(ISLR)
names(Khan)
## [1] "xtrain" "xtest" "ytrain" "ytest"
dim(Khan$xtrain)
## [1] 63 2308
dim(Khan$xtest)
## [1] 20 2308
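Before fitting anything, it can help to check that the label vectors line up with the feature matrices and to see how many samples each tumor class has. The following is a minimal sketch using only base R and the `Khan` list loaded above; the variable names `train.counts` and `test.counts` are illustrative and not part of the original lab.

```r
library(ISLR)

# The number of labels should match the number of rows in each feature matrix
length(Khan$ytrain) == nrow(Khan$xtrain)
length(Khan$ytest) == nrow(Khan$xtest)

# Count the samples per tumor class in each split
train.counts = table(Khan$ytrain)
test.counts = table(Khan$ytest)
train.counts
test.counts
```

The class counts are uneven, which is worth keeping in mind when reading the confusion matrices later on.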
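We will fit an SVM with a cost of 10. Note that there are far more features (2308) than observations (63), so a linear kernel should be sufficient: in such a high-dimensional space the classes are typically easy to separate with a hyperplane, and the extra flexibility of a polynomial or radial kernel is unnecessary.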
# Make the Khan data frame
dat=data.frame(x=Khan$xtrain, y=as.factor(Khan$ytrain))
# Load svm library
library(e1071)
# Fit SVM model with cost parameter 10
out=svm(y ~., data=dat, kernel="linear",cost=10)
# Print the summary
summary(out)
##
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
## gamma: 0.0004332756
##
## Number of Support Vectors: 58
##
## ( 20 20 11 7 )
##
##
## Number of Classes: 4
##
## Levels:
## 1 2 3 4
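The summary reports the number of support vectors, and the same figures can be read directly from the fitted object. The sketch below assumes the `tot.nSV` and `nSV` components that `e1071` stores on an `svm` fit; it rebuilds `dat` and `out` only so that it runs on its own, and `sv.fraction` is an illustrative name.

```r
library(ISLR)
library(e1071)

# Rebuild the training frame and model from above so the snippet runs standalone
dat = data.frame(x=Khan$xtrain, y=as.factor(Khan$ytrain))
out = svm(y ~ ., data=dat, kernel="linear", cost=10)

# Total number of support vectors and the count per class
out$tot.nSV
out$nSV

# Fraction of the training samples that act as support vectors
sv.fraction = out$tot.nSV / nrow(dat)
sv.fraction
```

A large fraction here is expected when there are many more features than observations, since most training points end up on or near the margin.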
# Make the confusion matrix
table(out$fitted , dat$y)
##
##      1  2  3  4
##   1  8  0  0  0
##   2  0 23  0  0
##   3  0  0 12  0
##   4  0  0  0 20
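The training confusion matrix above can also be summarized as a single error rate. This is a small sketch rather than part of the original lab; `train.err` is an illustrative name, and the setup lines simply recreate `dat` and `out` so the snippet is self-contained.

```r
library(ISLR)
library(e1071)

# Recreate the training frame and fitted model from above
dat = data.frame(x=Khan$xtrain, y=as.factor(Khan$ytrain))
out = svm(y ~ ., data=dat, kernel="linear", cost=10)

# Proportion of training samples whose fitted class differs from the true class
train.err = mean(out$fitted != dat$y)
train.err
```

Since every off-diagonal entry of the table is zero, this proportion should be 0.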
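We see that there are no training errors. This is not surprising, because the large number of features relative to the number of observations makes it easy to find hyperplanes that fully separate the classes. We now use the fitted model to make predictions on the test data.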
# Make test data frame
dat.te=data.frame(x=Khan$xtest, y=as.factor(Khan$ytest))
# Make predictions for the test data
pred.te=predict(out, newdata=dat.te)
# Make the confusion matrix table
table(pred.te, dat.te$y)
##
## pred.te 1 2 3 4
##       1 3 0 0 0
##       2 0 6 2 0
##       3 0 0 4 0
##       4 0 0 0 5
The only errors are for the class labeled 3: two of its six test samples are misclassified as class 2, so 18 of the 20 test samples are classified correctly.
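The test confusion matrix above can be reduced to an overall error rate and a per-class accuracy. The following is a hedged sketch, not part of the original lab: `tab`, `test.err` and `class.acc` are illustrative names, and the first lines only recreate the objects defined earlier so the snippet runs on its own.

```r
library(ISLR)
library(e1071)

# Recreate the model and the test predictions from above
dat = data.frame(x=Khan$xtrain, y=as.factor(Khan$ytrain))
out = svm(y ~ ., data=dat, kernel="linear", cost=10)
dat.te = data.frame(x=Khan$xtest, y=as.factor(Khan$ytest))
pred.te = predict(out, newdata=dat.te)

# Rows are predicted classes, columns are true classes
tab = table(pred.te, dat.te$y)

# Overall proportion of misclassified test samples
test.err = mean(pred.te != dat.te$y)
test.err

# Fraction of each true class that was predicted correctly
class.acc = diag(tab) / colSums(tab)
class.acc
```

With the table shown above, the overall error rate works out to 2/20 = 0.1, and class 3 is the only class whose accuracy falls below 1.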
An Introduction to Statistical Learning by Gareth James et al. Chapter 9: Lab exercises.