In this experiment, support vector machines (SVMs) are fitted to solve a binary
classification task on the Spam Database dataset. The aim of the experiment is
to compare different kinds of SVM defined by different kernel functions (linear, polynomial, etc.). The classification
accuracy of the resulting classifiers is determined, and we review the interpretation of the statistics and graphs
related to support vector machines. The model fitting is carried out by the
Spambase [UCI MLR]
Before fitting the models, the dataset is partitioned by the
Data Partition operator
into training, validation, and test datasets in a 60/20/20 ratio.
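The 60/20/20 split can be illustrated outside the tool with a minimal Python sketch. It assumes simple random sampling without stratification, which may differ from what the Data Partition operator actually does. The function name `partition_indices`, the seed, and the row count of 4601 (the size of the UCI Spambase dataset) are choices made for this illustration only.

```python
import random

def partition_indices(n_rows, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle row indices and split them into train/validation/test parts."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_rows))
    # A fixed seed (an arbitrary choice here) makes the split reproducible.
    random.Random(seed).shuffle(indices)
    n_train = int(round(fractions[0] * n_rows))
    n_valid = int(round(fractions[1] * n_rows))
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test

# 4601 is the number of rows in the UCI Spambase dataset.
train, valid, test = partition_indices(4601)
print(len(train), len(valid), len(test))
```

Each row index ends up in exactly one of the three parts, so the validation and test sets stay unseen during training.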
First, a support vector machine is fitted using a linear kernel. The goodness of fit of the resulting model can
be checked by standard statistics (e.g. the misclassification rate and the number of incorrectly classified cases) and
graphs (e.g. response and lift curves). These tools are discussed later, in the comparison of the two models.
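As a small illustration of the two statistics named above, the following Python sketch counts the incorrectly classified cases and derives the misclassification rate from them. The labels are made-up toy values (1 = spam, 0 = not spam), not results from the experiment, and the function name is hypothetical.

```python
def misclassification_rate(actual, predicted):
    """Return (number of misclassified cases, misclassification rate)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors, errors / len(actual)

# Toy labels for illustration only: 1 = spam, 0 = not spam.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

errors, rate = misclassification_rate(actual, predicted)
print(errors, rate)  # 2 0.2
```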
Besides these results, the
SVM operator provides additional statistics, such as goodness-of-fit
measures and the list of support vectors, which are meaningful only for support vector machines.
Second, a polynomial kernel SVM is fitted to the dataset and compared with the previous model to determine which one
is better. The parametrization of the
SVM operator can be seen below.
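Independently of the operator's settings, the difference between the two kernels can be sketched in a few lines of Python. The polynomial form used here, (x·z + c)^d, is the common textbook definition. The degree `degree=2` and the offset `coef0=1.0` are illustrative assumptions and are not the parameter values used in this experiment.

```python
def linear_kernel(x, z):
    """Linear kernel: the ordinary dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel (x.z + coef0) ** degree; the defaults are illustrative."""
    return (linear_kernel(x, z) + coef0) ** degree

# Two toy feature vectors, not taken from the dataset.
x = [1.0, 2.0]
z = [3.0, 0.5]

print(linear_kernel(x, z))      # 4.0
print(polynomial_kernel(x, z))  # 25.0
```

With a degree above 1, the polynomial kernel implicitly includes products of the input features, which allows a curved decision boundary, whereas the linear kernel can only produce a hyperplane in the original feature space.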
The support vector machines with two different kernels (linear and polynomial) can be compared by the usual statistical and graphical tools.
The above figures and statistics clearly show that the polynomial kernel support vector machine fits better than the linear kernel one. The misclassification rate is improved by 2 per cent, and the lift and ROC curves also show a significant improvement. The cumulative lift curve indicates a better model from the second to the third decile, while the ROC curve also shows an improvement where the specificity is very close to 1.
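To make the quantities behind the lift and ROC curves concrete, the following Python sketch computes the cumulative lift at a given depth and a single ROC point (sensitivity and specificity) at a given score threshold. The scores and labels are toy values invented for this illustration, and the function names are hypothetical. None of this reproduces the figures of the experiment.

```python
def cumulative_lift(actual, scores, depth):
    """Lift among the top `depth` fraction of cases ranked by score."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(depth * len(ranked))))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(actual) / len(actual)
    return top_rate / base_rate

def roc_point(actual, scores, threshold):
    """Return (sensitivity, specificity) when score >= threshold means spam."""
    tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s >= threshold)
    fn = sum(1 for a, s in zip(actual, scores) if a == 1 and s < threshold)
    tn = sum(1 for a, s in zip(actual, scores) if a == 0 and s < threshold)
    fp = sum(1 for a, s in zip(actual, scores) if a == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Toy data for illustration only: 1 = spam, 0 = not spam.
actual = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

lift_20 = cumulative_lift(actual, scores, 0.2)  # top two deciles
sensitivity, specificity = roc_point(actual, scores, 0.65)
print(lift_20, sensitivity, specificity)
```

A lift above 1 at a given depth means that the top-ranked cases contain proportionally more spam than the dataset as a whole. Sweeping the threshold in `roc_point` traces out the ROC curve one point at a time.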