Ensemble learning is a machine learning paradigm where multiple learners are trained to solve the same problem. It draws inference from set of hypothesis instead of a single hypothesis from the training data. Ensemble learning is used very widely, because it can boost weak learners which predict slightly better than random guess to strong learners which can make accurate predictions.
Base learners solve the same problem, but they can be different from each other because of modelling techniques. population being studied and various other factors.
In practice, construction of very good ensembles is possible because statistically, when the training data available is too small, single classifier can easily select different hypothesis in the hypothesis space which gives the same accuracy on the same data. Therefore, the risk of choosing wrong classifier is high. Ensemble can take a cumulative decision and can reduce the risk of choosing wrong classifier.
Another reason why ensemble works is representational, by forming weighted sums of hypothesis in the considered finite hypothesis space, we can expand the hypothesis space and hence, it can provide a better approximation of the true hypothesis.
So, does that mean including many base learners in the ensemble will lead to better performance? The answer is no as experiments show that selecting a subset of base learners instead of using all also lead to better choice. The process of selecting optimal subsets of base learners among the available base learners is called ensemble pruning and that is a topic for another blog post.
In case of classification, majority voting is used to choose the ensemble prediction. In majority voting, each model makes a prediction and the output which receives more than half of the votes is chosen as the ensemble prediction.
This could be improved a lot, especially by assigning different weights to different voters (base learners) so that they influence the outcome of the ensemble. The weights can be assigned based on their competencies.
In case of binary classification, weights are calculated from the validation accuracy in the following manner.
Weights are simply the log odds of the classifier validation accuracy. Let us numerically encode the classifier labels as and and let denote the prediction by the th base learner. Now the ensemble decision can be taken as follows.
Let where is the number of base learners considered. If , the decision taken by the ensemble is else . can take values between and so, we take the threshold as just average between these values. This threshold can be appropriately tuned depending upon the sensitivity of the misclassification but right now, we would consider the average.
The main idea of local accuracy estimate is to change weights each time when a new test sample is taken instead of keeping it rigid. The weights are now depending on the conditional probability given test sample instead of validation accuracy.
To maximize the performance of the ensemble, we need to give more weight to that member which has highest accuracy in the neighborhood of the test point. This requires the classifiers to be independent which can be assured by performing bootstrap aggregation or the classifier can simply be assumed as independent.
The independence assumption is necessary to maintain the diversity of the ensemble. Moderate diversity is recommended for an efficient ensemble.
Now, let’s check the performance of this algorithm on a simple 2 class classification dataset. Since the dataset is simple, very few training examples are needed. We knowingly train the individual classifiers on very few points to show how ensembles perform well.
First step is data preprocessing.
import pandas as pd import numpy as np columns = ['code_num','thickness','uofcsize','uofcshape','adhesion','secsize','bnuclei','chromatinb','nnucleoi','mitoses','output'] data = pd.read_csv('breast-cancer-wisconsin.data',names=columns) data.drop(['code_num'],1,inplace=True) data.replace('?',-99999, inplace=True) data = data.astype(int) X = np.array(data.drop(['output'], 1)) y = np.array(data['output'])
Second, we train 3 base learners, KNN, SVM and logistic regression on very few training points to satisfy the independent assumption.
from sklearn import preprocessing,neighbors,svm from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9) X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.9) clf1 = neighbors.KNeighborsClassifier() clf2 = svm.SVC() clf3 = LogisticRegression() clf1.fit(X_train, y_train) clf2.fit(X_train, y_train) clf3.fit(X_train,y_train) accuracy1 = clf1.score(X_test, y_test) accuracy2 = clf2.score(X_test, y_test) accuracy3 = clf3.score(X_test,y_test) print(accuracy1,accuracy2,accuracy3) #Out: 0.652380952381 0.652380952381 0.703174603175
Next we define two utility functions, one which returns the log odd given the probability, another one returns the neighborhood of a test point in the validation data space.
def get_weights(p): p[p==1.0] = 0.99 #avoid inf error odds = (p)/(1-p) return np.log(odds) from sklearn.neighbors import NearestNeighbors neigh = NearestNeighbors(n_neighbors=3) neigh.fit(X_val) def get_local_weights(test_point,n_neigh): nearest_indices = neigh.kneighbors(test_point,n_neighbors=n_neigh,return_distance=False) X_verify = X_val[nearest_indices] y_verify = y_val[nearest_indices] score_pred1 = clf1.score(X_verify,y_verify) score_pred2 = clf2.score(X_verify,y_verify) score_pred3 = clf3.score(X_verify,y_verify) acc_vector = np.array([score_pred1,score_pred2,score_pred3]) weights=get_weights(acc_vector) return weights
Lastly, we write the code for weighted prediction function.
def get_weighted_prediction(sample_point): weights=get_local_weights(sample_point,4) prediction=np.array([clf1.predict([sample_point]),clf2.predict([sample_point]),clf3.predict([sample_point])]) quota_weight = 0.0 for _ in range(len(prediction)): if prediction[_] == 4: quota_weight = quota_weight + weights[_] if quota_weight >= np.average(weights): return 4 else: return 2
So, let’s check the gain in accuracy.
import warnings warnings.filterwarnings('ignore') ensemble_pred= for _ in range(len(X_test)): ensemble_pred.append(get_weighted_prediction(X_test[_])) ensemble_pred=np.array(ensemble_pred).reshape(y_test.shape) from sklearn.metrics import accuracy_score print(accuracy_score(y_test,ensemble_pred)) #Out: 0.95873015873
That’s some serious performance gain. Keep in mind that we trained these algorithms on very few training samples so, the performance gain from local accuracy estimates is really good.
In the next few blog posts, I would be discussing about ensemble pruning and variety of other concepts related to ensemble learning, so stay tuned for more posts!