Support Vector Machines

Application to Gene Expression Data

We will use the Khan data set from the ISLR package, which consists of gene expression measurements for tissue samples corresponding to four distinct types of small round blue cell tumors. The data set comes already split into training data and test data.

  # Examine the dimension of the data
  library(ISLR)
  names(Khan)

## [1] "xtrain" "xtest"  "ytrain" "ytest"

  dim(Khan$xtrain)

## [1]   63 2308

  dim(Khan$xtest)

## [1]   20 2308

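We will fit a support vector classifier with cost = 10. There are far more features (2308) than training observations (63), which makes it easy to find a hyperplane that separates the classes, so a linear kernel should be sufficient and the extra flexibility of a polynomial or radial kernel is unnecessary.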

  # Make the Khan data frame
  dat=data.frame(x=Khan$xtrain, y=as.factor(Khan$ytrain))
  # Load svm library
  library(e1071)
  # Fit the SVM model with cost parameter 10
  out=svm(y ~ ., data=dat, kernel="linear", cost=10)
  # Print the summary
  summary(out)

## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
##       gamma:  0.0004332756 
## 
## Number of Support Vectors:  58
## 
##  ( 20 20 11 7 )
## 
## 
## Number of Classes:  4 
## 
## Levels: 
##  1 2 3 4
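
The remark that a linear kernel should suffice when features outnumber observations can be checked on synthetic data. The following is a minimal sketch in base R, not part of the original lab: the sizes `n` and `p`, the random matrix `X`, and the labels `y` are all made up for illustration. It does not compute the maximum-margin SVM solution; it only shows that some separating hyperplane exists, even for arbitrary labels.

```r
# Synthetic illustration (not the Khan data): with more features than
# observations, even arbitrary labels can be separated by a hyperplane.
set.seed(1)
n <- 10    # number of observations (hypothetical)
p <- 50    # number of features, chosen so that p > n
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- sample(c(-1, 1), n, replace = TRUE)   # arbitrary class labels

# Minimum-norm solution of X w = y. It exists whenever X has full row
# rank, which holds with probability 1 for random Gaussian data.
w <- t(X) %*% solve(X %*% t(X), y)

# Check that every observation lies on the correct side of x'w = 0.
separable <- all(sign(X %*% w) == y)
print(separable)
```

Because a perfect linear separation is always available in this regime, zero training errors on the Khan data would not be surprising, and it says little about how well the classifier generalizes. That is what the test set is for.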

  # Make the confusion matrix
  table(out$fitted , dat$y)

##    
##      1  2  3  4
##   1  8  0  0  0
##   2  0 23  0  0
##   3  0  0 12  0
##   4  0  0  0 20

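We see that there are no training errors. This is not surprising, because with so many features relative to observations it is easy to find hyperplanes that fully separate the classes. What matters more is performance on new data, so we now use the fitted model to predict the test observations.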

  # Make test data frame
  dat.te=data.frame(x=Khan$xtest, y=as.factor(Khan$ytest))
  # Make predictions for the test data
  pred.te=predict(out, newdata=dat.te)
  # Make the confusion matrix table
  table(pred.te, dat.te$y)

##        
## pred.te 1 2 3 4
##       1 3 0 0 0
##       2 0 6 2 0
##       3 0 0 4 0
##       4 0 0 0 5

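Rows are predicted classes and columns are true classes. The only errors involve class 3: two test observations whose true class is 3 are predicted as class 2. That is 2 misclassifications out of 20, a test error rate of 10%.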
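
The error rate can be computed directly from the confusion matrix. The sketch below is an illustration in base R that retypes the test confusion matrix printed above, so it runs without the ISLR or e1071 packages; the names `conf`, `error_rate`, and `recall` are chosen here and do not appear in the original lab.

```r
# Test confusion matrix copied from the output above
# (rows: predicted class, columns: true class).
conf <- matrix(c(3, 0, 0, 0,
                 0, 6, 2, 0,
                 0, 0, 4, 0,
                 0, 0, 0, 5),
               nrow = 4, byrow = TRUE,
               dimnames = list(pred  = as.character(1:4),
                               truth = as.character(1:4)))

n_test     <- sum(conf)              # total number of test observations
n_correct  <- sum(diag(conf))        # correct predictions lie on the diagonal
error_rate <- 1 - n_correct / n_test # overall test error rate

# Per-class recall: correct predictions divided by the true class count.
recall <- diag(conf) / colSums(conf)

print(error_rate)
print(round(recall, 2))
```

With the objects from the lab still in the workspace, `mean(pred.te != dat.te$y)` gives the same overall error rate without retyping the table.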

Selected materials and references

James, G., Witten, D., Hastie, T., and Tibshirani, R. An Introduction to Statistical Learning. Chapter 9: Lab exercises.