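sklearn's SVM functions do not scale the input data, while the corresponding function in the e1071 package scales it by default. In R you therefore need to specify scale=FALSE to get results similar to sklearn's.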
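To see why scaling can change an RBF-kernel SVM at all, it helps to look at the kernel value itself, exp(-gamma * ||a - b||^2), which depends directly on the units of each feature. The following is a minimal sketch on made-up numbers: the array `X_toy` and the helper names `standardize` and `rbf_kernel_value` are assumptions for illustration, not part of either library. It standardizes with the sample standard deviation, which is what R's scale() does.

```python
import numpy as np


def standardize(X):
    """Center each column to mean 0 and divide by its sample standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def rbf_kernel_value(a, b, gamma):
    """Gaussian (RBF) kernel between two vectors: exp(-gamma * ||a - b||^2)."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))


# Hypothetical toy features (not rows from the letter dataset).
X_toy = np.array([
    [2.0, 8.0, 3.0],
    [5.0, 12.0, 3.5],
    [4.0, 11.0, 6.0],
    [7.0, 11.0, 6.5],
])

gamma = 1.0 / X_toy.shape[1]  # 1 / n_features
X_scaled = standardize(X_toy)

# Same pair of rows, same gamma, different kernel value once the data is scaled.
k_raw = rbf_kernel_value(X_toy[0], X_toy[1], gamma)
k_scaled = rbf_kernel_value(X_scaled[0], X_scaled[1], gamma)
print("raw:", k_raw, "scaled:", k_scaled)
```

Because the kernel matrix differs, the fitted support vectors and predictions can differ too, even with identical hyperparameters.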
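The example here is the letter recognition task from the book Machine Learning with R. The dataset is also available in the UCI repository (UCI letter recognition); for reproducibility, the UCI copy is used here.
First, in Python, read the data with pandas, put the first 16,000 rows into the training set and the last 4,000 rows into the test set, which is used to evaluate the SVM's predictive performance.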
import pandas as pd
letter_reco_path = "https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data"
colnames = [
"letter", "xbox", "ybox", "width", "height", "onpix", "xbar", "ybar", "x2bar", "y2bar",
"xybar", "x2ybar", "xy2bar", "xedge", "xedgey", "yedge", "yedgex"
]
letter_data = pd.read_csv(letter_reco_path, header = None, names = colnames)
training = letter_data.iloc[0:16000,]
testing = letter_data.iloc[16000:, ]
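# Column 0 is the label; columns 1 onward are the 16 numeric features.
# .ix was removed in pandas 1.0, so positional indexing uses .iloc.
X_train, y_train = training.iloc[:, 1:].values, training.iloc[:, 0].values
X_test, y_test = testing.iloc[:, 1:].values, testing.iloc[:, 0].values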
Next, use sklearn's SVC to fit an SVM classifier with a Gaussian (RBF) kernel.
from sklearn.svm import SVC
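# gamma="auto" (1 / n_features) was the SVC default before scikit-learn 0.22 and matches e1071's default gamma.
svm_model = SVC(kernel="rbf", gamma="auto", random_state=1071).fit(X_train, y_train)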
Then predict on the test set; the resulting accuracy is 0.9722.
svm_pred = svm_model.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, svm_pred)
0.97224999999999995
Likewise, in R, read the same UCI data, put the first 16,000 rows into the training set and the rest into the test set.
letter_reco_path <- "https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data"
colnames <- c("letter", "xbox", "ybox", "width", "height", "onpix", "xbar", "ybar", "x2bar", "y2bar", "xybar", "x2ybar", "xy2bar", "xedge", "xedgey", "yedge", "yedgex")
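# stringsAsFactors = TRUE keeps letter as a factor (the default changed to FALSE in R 4.0).
letter_data <- read.csv(letter_reco_path, header = FALSE, col.names = colnames, stringsAsFactors = TRUE)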
training_index <- seq.int(1, 16000)
training <- letter_data[training_index, ]
testing <- letter_data[-training_index, ]
Train the corresponding model with the svm function from e1071, using a Gaussian (radial) kernel and without scaling the data, that is, scale=FALSE.
library(e1071)
svm_model2 <- svm(
  letter ~ .,
  data = training,
  kernel = "radial",
  type = "C-classification",
  scale = FALSE
)
Then predict on the test set with predict; the accuracy is 0.9725, close to sklearn's.
svm_pred2 <- predict(svm_model2, newdata = testing)
prop.table(table(svm_pred2 == testing$letter))
FALSE TRUE
0.0275 0.9725
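The TRUE proportion in the table above is the share of test rows where the prediction equals the label, which is the same quantity accuracy_score returns in Python. A minimal sketch on made-up labels (the arrays `y_true_toy` and `y_pred_toy` are illustrative assumptions, not taken from the dataset):

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only.
y_true_toy = np.array(["A", "B", "C", "D"])
y_pred_toy = np.array(["A", "B", "C", "O"])

# Element-wise comparison gives a boolean vector, like svm_pred2 == testing$letter in R.
matches = y_pred_toy == y_true_toy

# Proportions of FALSE and TRUE, mirroring prop.table(table(...)).
proportions = {False: float(np.mean(~matches)), True: float(np.mean(matches))}

# The TRUE share is the accuracy.
accuracy = proportions[True]
print(proportions, accuracy)
```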
According to the documentation, sklearn's SVC does not scale the data, whereas e1071's svm scales it by default. Keep this difference in mind in practice.
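Going the other way, one could approximate e1071's default scale=TRUE behaviour from the Python side by putting a scaler in front of SVC. The sketch below is only an illustration under stated assumptions: it uses synthetic integer features rather than the letter data, the variable names are made up, and StandardScaler divides by the population standard deviation while R's scale() uses the sample standard deviation, so the two would be close but not identical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 16 integer features in 0..15, binary label (assumption).
rng = np.random.default_rng(0)
X_syn = rng.integers(0, 16, size=(200, 16)).astype(float)
y_syn = (X_syn[:, 0] + X_syn[:, 1] > 15).astype(int)

# Positional split, in the same spirit as the first-N / remaining-rows split above.
X_tr, y_tr = X_syn[:150], y_syn[:150]
X_te, y_te = X_syn[150:], y_syn[150:]

# Unscaled model: what SVC does on its own.
raw_model = SVC(kernel="rbf", gamma="auto").fit(X_tr, y_tr)

# Scaled model: standardize with training-set statistics, then fit the same SVC.
scaled_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="auto"))
scaled_model.fit(X_tr, y_tr)

raw_acc = raw_model.score(X_te, y_te)
scaled_acc = scaled_model.score(X_te, y_te)
print("unscaled accuracy:", raw_acc)
print("scaled accuracy:", scaled_acc)
```

Wrapping the scaler in a pipeline means the test rows are transformed with the training-set mean and standard deviation, which avoids leaking test statistics into the model.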