Classification is a fundamental task in machine learning: given a data point, assign it to one of several classes. A classic example is classifying emails as spam or ham (useful email); another is flower classification. Classification is based on certain attributes of the data. In machine learning parlance, these attributes are called features, which is akin to dimensions in linear algebra parlance. There is no single correct answer when deciding which features to use for a classification task.
In this post, we are going to look at supervised classification. In supervised classification, the classifier is first given a correctly labelled dataset. From these examples, it builds an internal representation of the data. Statisticians call this internal representation a model, and a separate algorithm uses this representation to classify new, unseen data.
If the model performs exceedingly well on the given labelled examples but poorly on unseen examples, we have the problem of overfitting. Overfitting is relevant to many machine learning methods and is not limited to support vector classifiers.
The input is usually converted into a numerical representation to facilitate mathematical analysis of the dataset. A feature could be text, a country's currency, or a Unix timestamp, and we must represent it efficiently as numerical data for machine learning. This is done after data cleansing.
There are many classification techniques, and tree-based methods are among the most widely used. In this post, we look at support vector classification, a very useful method for classifying small datasets.
Support vector classification is a classification technique which has its roots in statistical learning theory. It has shown promising results in various practical scenarios and is particularly useful when the dataset is small. It also works very well with high dimensional data for reasons we will discuss soon.
The figure shows a 2D plot of a linearly separable dataset. Two sets of points are linearly separable if there exists at least one line in the plane with all the green points on one side of the line and all the red points on the other. This line becomes a hyperplane when we generalize the idea to higher-dimensional Euclidean spaces.
A support vector classifier’s decision boundary is associated with two hyperplanes, obtained by moving the boundary parallel to itself until it meets the closest red and green points. The distance between these two hyperplanes is called the margin. Now, draw a hyperplane parallel to and equidistant from these two hyperplanes. This is called the maximum-margin hyperplane.
A hyperplane with maximum margin generalizes the dataset well: classifiers with small margins are susceptible to overfitting and tend to generalize poorly to unseen examples.
The data points which are closest to the maximum margin hyperplane are called the support vectors. Hence, this classification method is known as support vector classification.
In practice, it is not always possible to find a strongly separating hyperplane. Earlier, we noted that SVMs perform well on high-dimensional data. This is explained by Cover's theorem, which roughly states that a dataset cast into a sufficiently high-dimensional space is more likely to be linearly separable. Because of this, we use a mapping function which transforms the given space into another, usually much higher-dimensional, space. The function that computes inner products between points in that mapped space is called a kernel.
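As a toy illustration of such a mapping (not the kernel trick itself, which avoids computing the map explicitly), consider a hypothetical 1-D dataset that is not linearly separable but becomes separable after the feature map x → (x, x²):

```python
import numpy as np

# Outer points (label +1) vs inner points (label -1): no single threshold
# on x separates them, so the 1-D data is not linearly separable.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([1, -1, -1, -1, 1])

# Explicit feature map x -> (x, x^2); in the mapped 2-D space the
# horizontal line x2 = 2 separates the two classes.
phi = np.column_stack([x, x ** 2])
pred = np.where(phi[:, 1] > 2, 1, -1)
print((pred == y).all())  # True
```

An RBF or polynomial kernel plays the same role, but without ever materializing the mapped coordinates.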
To quantify how well the model performs, we use a function called the loss function. For classification, we use either the cross-entropy or the hinge loss function. We will not cover the details of why we use them, because this post is already long and we are yet to discuss the use case.
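For a flavour of it, the hinge loss for a single example is max(0, 1 − y·f(x)), with labels y in {−1, +1} and f(x) the classifier's score; a minimal sketch:

```python
# Hinge loss for one example: zero penalty outside the margin,
# growing penalty inside the margin and for misclassifications.
def hinge(y, score):
    return max(0.0, 1.0 - y * score)

print(hinge(+1, 2.0))   # 0.0 — correct and outside the margin
print(hinge(+1, 0.5))   # 0.5 — correct but inside the margin
print(hinge(-1, 0.5))   # 1.5 — misclassified, larger penalty
```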
The dataset we use is the Car Evaluation dataset from the UCI Machine Learning Repository. The first step in building a predictive model is specifying the problem statement and generating hypotheses. The problem statement is as follows:
Given various features of a car, the aim is to build a predictive model that determines the level of acceptance of the car. There are four levels of acceptance: unacceptable, acceptable, good, and very good.
Alternate hypothesis (H1): There is a significant impact of the variables (below) on the dependent variable.
Null hypothesis (H0): There is no significant impact of the variables (below) on the dependent variable.
After generating the hypotheses, we explore the data. First, we look at the severity of class imbalance in our data.
This visualization was generated using Tableau Public. It’s a very good tool for data visualization.
The class imbalance is huge. If we made a naïve classifier that always outputs unacceptable, we would be correct about 70% of the time according to the training dataset. A good predictive algorithm's performance depends on how well it can predict the minority classes.
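That 70% figure is just the majority-class frequency; a small sketch with hypothetical labels mimicking such a split:

```python
import pandas as pd

# Hypothetical labels with a 70/20/10 split, mimicking the imbalance above
labels = pd.Series(['unacc'] * 70 + ['acc'] * 20 + ['good'] * 10)

# The accuracy of a classifier that always predicts the majority class
# equals the majority class's relative frequency.
majority = labels.value_counts(normalize=True).iloc[0]
print(majority)  # 0.7
```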
The dataset is read using pandas.
import pandas as pd

df = pd.read_csv('car.data', names=['buying','maint','doors','persons','lug_boot','safety','acc_level'])
df.head()
We note that there are a lot of categorical variables. Before we proceed with SVC, we must give them a numerical representation. It is done by using
for cols in df.columns:
    # factorize returns (codes, uniques); we keep only the integer codes
    df[cols] = pd.factorize(df[cols])[0]
df.head()
Of course, this is not the only, let alone the best, numerical representation; we are just proceeding with it for now.
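One common alternative is one-hot encoding, which avoids imposing an arbitrary order on the categories; a minimal sketch on a hypothetical column:

```python
import pandas as pd

# One-hot encode a small hypothetical categorical column: each category
# becomes its own binary indicator column, so no spurious ordering is implied.
s = pd.DataFrame({'buying': ['low', 'vhigh', 'med', 'high']})
onehot = pd.get_dummies(s, columns=['buying'])
print(sorted(onehot.columns.tolist()))
# ['buying_high', 'buying_low', 'buying_med', 'buying_vhigh']
```

The trade-off is a wider feature matrix, which matters less for SVMs than for some other methods.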
Let’s check the correlation between each independent variable and the dependent variable to validate our hypothesis that all the independent variables have a significant impact on the dependent variable.
for col in df.columns:
    if col != 'acc_level':
        print('Correlation between acc_level and %s is %0.2f' % (col, df['acc_level'].corr(df[col])))
Correlation between acc_level and buying is 0.29
Correlation between acc_level and maint is 0.25
Correlation between acc_level and doors is 0.06
Correlation between acc_level and persons is 0.34
Correlation between acc_level and lug_boot is 0.12
Correlation between acc_level and safety is 0.40
Why does buying cost have a positive correlation with acceptability? Do people like expensive cars? Absolutely not. The answer lies in our encoding: we encoded the label vhigh with value 0, high with value 1, and so on. Hence, we observe a positive correlation with acceptability, whereas in reality the correlation is negative. This is why proper encoding of the data is important.
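A toy illustration of how the choice of encoding flips the sign of the correlation (the numbers here are made up, not taken from the dataset):

```python
import pandas as pd

# Acceptability rises as the real price falls
accept = pd.Series([0, 0, 1, 2])

# Same four price labels under two encodings:
factorized = pd.Series([0, 1, 2, 3])  # vhigh=0 ... low=3, as factorize assigned
ordinal = pd.Series([3, 2, 1, 0])     # vhigh=3 ... low=0, true price order

print(accept.corr(factorized))  # positive under the arbitrary encoding
print(accept.corr(ordinal))     # negative under the true price order
```

Same data, opposite sign: the correlation reflects the encoding as much as the underlying relationship.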
Now, we do the train, validation and test split.
train = df.sample(frac=0.8)
val = train.sample(frac=0.1)
test = df.drop(train.index)
train = train.drop(val.index)
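As a sanity check of this split logic, here it is applied to a hypothetical 1000-row frame (random_state added only for reproducibility of the sketch):

```python
import numpy as np
import pandas as pd

# 1000 dummy rows to verify the resulting split sizes
df = pd.DataFrame({'x': np.arange(1000)})

train = df.sample(frac=0.8, random_state=0)   # 80% of all rows
val = train.sample(frac=0.1, random_state=0)  # 10% of the training rows
test = df.drop(train.index)                   # the remaining 20%
train = train.drop(val.index)                 # training rows minus validation

print(len(train), len(val), len(test))  # 720 80 200
```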
At this point, we are ready to train our SVC.
from sklearn import svm

svc = svm.SVC()
svc.fit(train[['buying','maint','doors','persons','lug_boot','safety']], train['acc_level'])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
At this point, we should check our classification accuracy on the validation dataset. We should never check accuracy on the test dataset unless we are sure the model in hand is going to be the final model.
from sklearn.metrics import accuracy_score

val_pred = svc.predict(val[['buying','maint','doors','persons','lug_boot','safety']])
trueval_values = val['acc_level']
print(accuracy_score(trueval_values, val_pred))
Pretty impressive accuracy considering our naïve classifier scores 0.70. However, this accuracy may be biased: it is possible that our classifier has learned only the unacceptable class and the validation data is full of it, giving us the illusion of a high score. Hence, we are going to use cross-validation.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

df_train = pd.concat([train, val], ignore_index=True, axis=0)
kfold = StratifiedKFold(n_splits=3)
print(np.mean(cross_val_score(svc, df_train[['buying','maint','doors','persons','lug_boot','safety']], df_train['acc_level'], cv=kfold, n_jobs=-1)))
We used stratified K-fold because our dataset is heavily imbalanced. If a model is trained on data containing very few examples of the minority classes, we cannot expect it to predict those rarer classes effectively. Stratified cross-validation still randomizes the folds, but also ensures each fold preserves the class proportions of the full dataset.
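To see the stratification at work, here is a small sketch on hypothetical 70/30 labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 100 hypothetical samples, 70% class 0 and 30% class 1
y = np.array([0] * 70 + [1] * 30)
X = np.zeros((100, 1))

# Each of the 5 test folds keeps the same 70/30 class ratio (14 vs 6)
skf = StratifiedKFold(n_splits=5)
counts = []
for _, test_idx in skf.split(X, y):
    counts.append(np.bincount(y[test_idx]).tolist())
print(counts)  # [[14, 6], [14, 6], [14, 6], [14, 6], [14, 6]]
```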
How can we further increase the accuracy? For this, we are going to use a technique called Grid Search.
The support vector classifier parameters we are tuning are C and gamma. We could tune many things, but for now we tune just these two. Grid search exhaustively walks through this parameter space and selects the parameters for which the score is maximal.
from sklearn.model_selection import GridSearchCV

g_s = np.logspace(-10, 5, 10)
c_s = np.array([1, 10, 100, 1000])
clf = GridSearchCV(estimator=svc, param_grid=dict(gamma=g_s, C=c_s), n_jobs=-1)
clf.fit(df_train[['buying','maint','doors','persons','lug_boot','safety']], df_train['acc_level'])
print('The best C is %0.2f and best gamma is %f for which best accuracy is %f' % (clf.best_estimator_.C, clf.best_estimator_.gamma, clf.best_score_ * 100))
The best C is 1000.00 and best gamma is 0.021544 for which best accuracy is 96.671491
This is a pretty decent improvement. Having found the best parameters in the given logspace, we can initialize another instance of the support vector classifier with them:

clfbest = svm.SVC(C=clf.best_estimator_.C, gamma=clf.best_estimator_.gamma)
clfbest.fit(df_train[['buying','maint','doors','persons','lug_boot','safety']], df_train['acc_level'])
Right now, we are not going to tinker with the model further, so we proceed to check the test accuracy.
test_pred = clfbest.predict(test[['buying','maint','doors','persons','lug_boot','safety']])
accuracy_score(test['acc_level'], test_pred)
We get an accuracy of 96.53%, which is far better than the naïve 70%. This is how we approach classification problems in scikit-learn using support vector classification.
Here are some tips and tricks to increase the model score even more:
- Use a proper ordinal encoding (or one-hot encoding) for the categorical features instead of pd.factorize.
- Pass class_weight='balanced' to SVC to compensate for the class imbalance.
- Widen the search grid over C and gamma, and try other kernels such as linear or polynomial.
Please leave feedback and questions in the comments.