When the data are not linearly separable, we add a slack term and trade off minimizing the misclassification of training samples against maximizing the margin. This makes the algorithm more tolerant: the training set need not be classified perfectly, which guards against overfitting. There are three common ways to add the slack term; written as optimization problems, they are summarized in the figure above:
In the figure above, e is the column vector of all ones, Q_ij = y_i y_j K(x_i, x_j), and K(x_i, x_j) = phi(x_i) · phi(x_j), where phi(·) is the feature mapping and K(·,·) is the kernel function, i.e. the inner product of the mapped features.
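For reference, the dual problem of the standard C-classification SVM (a textbook formulation, standing in for the figure) can be written with the e and Q defined above as:

```latex
\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2}\,\alpha^{\top} Q\,\alpha - e^{\top}\alpha \\
\text{s.t.}\quad   & y^{\top}\alpha = 0, \\
                   & 0 \le \alpha_i \le C, \quad i = 1,\dots,n
\end{aligned}
```

The nu-classification variant replaces the penalty C with the parameter nu, which bounds the fraction of margin errors.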
Let us now look at how the svm() function is used.
```r
## S3 method for class 'formula'
svm(formula, data = NULL, ..., subset, na.action = na.omit, scale = TRUE)

## Default S3 method:
svm(x, y = NULL, scale = TRUE, type = NULL, kernel = "radial",
    degree = 3, gamma = if (is.vector(x)) 1 else 1 / ncol(x),
    coef0 = 0, cost = 1, nu = 0.5,
    class.weights = NULL, cachesize = 40, tolerance = 0.001,
    epsilon = 0.1, shrinking = TRUE, cross = 0, probability = FALSE,
    fitted = TRUE, seed = 1L, ..., subset, na.action = na.omit)
```
Main parameters:
formula: the model formula; the x and y of the second (default) method can be read as y ~ x.
data: the data set.
subset: an optional subset of the data to use as the training set.
na.action: how to handle missing values; the default na.omit drops incomplete rows.
scale: whether to standardize the data (center to mean 0, scale to variance 1); TRUE by default.
type: which SVM formulation to use (see the formulations above). The options are C-classification, nu-classification, one-classification (for novelty detection), eps-regression and nu-regression; the last two are for SVM regression and are not covered here. The default is the C-classifier; the nu-classifier tends to give a somewhat smoother decision boundary. One-classification assumes all training data come from a single class; the SVM then learns a boundary separating the region that class occupies in feature space from the rest of the space.
kernel: when the data are not linearly separable, we introduce a kernel function to make them separable. The kernels R provides are:
- linear: u'*v
- polynomial: (gamma*u'*v + coef0)^degree
- radial basis (Gaussian): exp(-gamma*|u-v|^2)
- sigmoid: tanh(gamma*u'*v + coef0)
The default is the Gaussian (RBF) kernel. The libSVM authors give the following advice on choosing a kernel: "In general we suggest you to try the RBF kernel first. A recent result by Keerthi and Lin shows that if RBF is used with model selection, then there is no need to consider the linear kernel. The kernel matrix using sigmoid may not be positive definite and in general its accuracy is not better than RBF (see the paper by Lin and Lin). Polynomial kernels are ok but if a high degree is used, numerical difficulties tend to happen (think of the d-th power: values below 1 go to 0 and values above 1 go to infinity)."
As an aside, the kernlab package lets you define your own kernel functions.
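As a quick sanity check on the formulas above, the four kernels are easy to evaluate in base R; the values of gamma, coef0 and degree below are purely illustrative:

```r
u <- c(1, 2)
v <- c(3, 4)
gamma <- 0.5; coef0 <- 1; degree <- 3

linear <- sum(u * v)                           # u'*v
poly   <- (gamma * sum(u * v) + coef0)^degree  # (gamma*u'*v + coef0)^degree
rbf    <- exp(-gamma * sum((u - v)^2))         # exp(-gamma*|u-v|^2)
sigm   <- tanh(gamma * sum(u * v) + coef0)     # tanh(gamma*u'*v + coef0)

c(linear = linear, polynomial = poly, radial = rbf, sigmoid = sigm)
```

This is exactly the quantity svm() computes between pairs of observations when building the kernel matrix.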
degree: degree of the polynomial kernel; defaults to 3.
gamma: parameter of every kernel except the linear one; defaults to 1 / (data dimension).
coef0: parameter of the polynomial and sigmoid kernels; defaults to 0.
cost: the value of the penalty term C in C-classification.
nu: the value of nu in nu-classification and one-classification.
cross: perform k-fold cross-validation and report the classification accuracy.
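Putting these parameters together, here is a minimal sketch on the built-in iris data (assuming the e1071 package is installed; the gamma and cost values are just illustrative choices, not tuned):

```r
library(e1071)

set.seed(1)
# C-classification with an RBF kernel and 5-fold cross-validation
m <- svm(Species ~ ., data = iris,
         type = "C-classification", kernel = "radial",
         gamma = 0.25, cost = 1, cross = 5)

summary(m)  # includes the cross-validation accuracy
train.acc <- mean(fitted(m) == iris$Species)
train.acc
```

e1071 also ships tune.svm(), which grid-searches gamma and cost by cross-validation, if you would rather not pick them by hand.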
Since programming an SVM from scratch is genuinely complicated and involves a fair amount of optimization theory, all the classification in the second part is done with the svm() function (a small shortcut). Part of the R code is shown below:
The function that generates dataSim:
```r
simData = function(radius, width, distance, sample_size)
{
  # upper half-moon: points at radius rad (within the band width) and angle theta
  aa1 = runif(sample_size / 2)
  aa2 = runif(sample_size / 2)
  rad = (radius - width / 2) + width * aa1
  theta = pi * aa2
  x = rad * cos(theta)
  y = rad * sin(theta)
  label = rep(1, length(x))

  # lower half-moon: mirrored, shifted right by rad and down by distance
  x1 = rad * cos(-theta) + rad
  y1 = rad * sin(-theta) - distance
  label1 = -1 * rep(1, length(x1))

  n_row = length(x) + length(x1)
  data = matrix(rep(0, 3 * n_row), nrow = n_row, ncol = 3)
  data[, 1] = c(x, x1)
  data[, 2] = c(y, y1)
  data[, 3] = c(label, label1)

  data
}

dataSim = simData(radius = 10, width = 6, distance = -6, sample_size = 3000)
colnames(dataSim) <- c("x", "y", "label")
dataSim <- as.data.frame(dataSim)
```
Classification and prediction with the sigmoid kernel:
```r
m1 <- svm(label ~ x + y, data = dataSim, cross = 10,
          type = "C-classification", kernel = "sigmoid")
m1
summary(m1)
pred1 <- fitted(m1)
table(pred1, dataSim[, 3])
```
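Reading overall accuracy off such a confusion table is just the diagonal over the total; a small base-R illustration with made-up counts:

```r
# made-up confusion matrix for a 3000-sample two-class problem
tab <- matrix(c(1400, 100, 80, 1420), nrow = 2,
              dimnames = list(pred = c("-1", "1"), truth = c("-1", "1")))
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # fraction of samples on the diagonal
```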
The code behind the plots in the kernel-function section:
```r
linear.svm.fit <- svm(label ~ x + y, data = dataSim, kernel = 'linear')
with(dataSim, mean(label == ifelse(predict(linear.svm.fit) > 0, 1, -1)))

polynomial.svm.fit <- svm(label ~ x + y, data = dataSim, kernel = 'polynomial')
with(dataSim, mean(label == ifelse(predict(polynomial.svm.fit) > 0, 1, -1)))

radial.svm.fit <- svm(label ~ x + y, data = dataSim, kernel = 'radial')
with(dataSim, mean(label == ifelse(predict(radial.svm.fit) > 0, 1, -1)))

sigmoid.svm.fit <- svm(label ~ x + y, data = dataSim, kernel = 'sigmoid')
with(dataSim, mean(label == ifelse(predict(sigmoid.svm.fit) > 0, 1, -1)))

df <- cbind(dataSim,
            data.frame(LinearSVM = ifelse(predict(linear.svm.fit) > 0, 1, -1),
                       PolynomialSVM = ifelse(predict(polynomial.svm.fit) > 0, 1, -1),
                       RadialSVM = ifelse(predict(radial.svm.fit) > 0, 1, -1),
                       SigmoidSVM = ifelse(predict(sigmoid.svm.fit) > 0, 1, -1)))
library("reshape")
predictions <- melt(df, id.vars = c('x', 'y'))
library('ggplot2')
ggplot(predictions, aes(x = x, y = y, color = factor(value))) +
  geom_point() +
  facet_grid(variable ~ .)
```
Finally, let us return to the handwritten-digit case from the very beginning and redo it with a support vector machine.
The code:
```r
# read the training digits: each 32x32 file becomes a length-1024 vector
setwd("D:/R/data/digits/trainingDigits")
names <- list.files("D:/R/data/digits/trainingDigits")
data <- paste("train", 1:1934, sep = "")
for (i in 1:length(names))
  assign(data[i], as.vector(as.matrix(read.fwf(names[i], widths = rep(1, 32)))))
label <- rep(0:9, c(189, 198, 195, 199, 186, 187, 195, 201, 180, 204))

data1 <- get(data[1])
for (i in 2:length(names))
  data1 <- rbind(data1, get(data[i]))

m <- svm(data1, label, cross = 10, type = "C-classification")
m
summary(m)
pred <- fitted(m)
table(pred, label)

# read the test digits the same way (note: this overwrites the trainN vectors)
setwd("D:/R/data/digits/testDigits")
names <- list.files("D:/R/data/digits/testDigits")
data <- paste("train", 1:1934, sep = "")
for (i in 1:length(names))
  assign(data[i], as.vector(as.matrix(read.fwf(names[i], widths = rep(1, 32)))))
data2 <- get(data[1])
for (i in 2:length(names))
  data2 <- rbind(data2, get(data[i]))
pred <- predict(m, data2)
labeltest <- rep(0:9, c(87, 97, 92, 85, 114, 108, 87, 96, 91, 89))
table(pred, labeltest)
```
Model summary:
```
Call:
svm.default(x = data1, y = label, type = "C-classification", cross = 10)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.0009765625

Number of Support Vectors:  1139  ( 78 130 101 124 109 122 87 93 135 160 )

Number of Classes:  10

Levels:
 0 1 2 3 4 5 6 7 8 9

10-fold cross-validation on training data:

Total Accuracy: 96.7425
Single Accuracies:
 97.40933 98.96373 91.75258 99.48187 94.84536 94.30052 97.40933 96.90722 98.96373 97.42268
```
We can also inspect the fitted model directly: m$SV holds the support vectors, m$index their row indices in the training data, and m$rho the intercept term used in the decision function.
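To see how these pieces fit together, here is a small sketch (on a two-class subset of the built-in iris data, assuming e1071 is installed) that rebuilds the decision values from m$coefs, m$SV and m$rho for a linear kernel:

```r
library(e1071)

d <- subset(iris, Species != "setosa")
d$Species <- factor(d$Species)
# linear kernel without scaling, so the algebra stays direct
m <- svm(Species ~ Petal.Length + Petal.Width, data = d,
         kernel = "linear", scale = FALSE)

# decision value f(x) = sum_i coef_i * <sv_i, x> - rho = w'x - rho
w <- t(m$coefs) %*% m$SV               # 1 x 2 weight vector
X <- as.matrix(d[, c("Petal.Length", "Petal.Width")])
manual <- as.vector(X %*% t(w)) - m$rho

# should match the decision values predict() reports
dv <- attr(predict(m, d, decision.values = TRUE), "decision.values")
all.equal(as.vector(dv), manual)
```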
Classification results on the training set: