Supervised Machine Learning Algorithms

Posted by Alan Barr on Sun 30 October 2016

My biggest takeaway from this machine learning course so far is that one has to experiment with different algorithms. There isn't one that will always give you the best result, but there are some that might give you a pretty good result with minimal work. I think with enough experience it becomes easier to say, "I have this type of data with features X, Y, Z, and I know that algorithms N will work well for it and algorithms O are not a good match".

These algorithms are in the scikit-learn Python package. In the class we do some calculations by hand, but for the most part the algorithms do the math for you. With supervised learning we have inputs and the outputs we expect to receive: features are our inputs and labels are the outputs. The test set of features and labels is a subset of the data that we have separated out so we can verify how well our algorithm performs.
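
To make that concrete, here is a minimal sketch of the split using scikit-learn's train_test_split. The make_moons data is just a synthetic stand-in for the course's terrain set; the variable names match the snippets further down.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Synthetic two-feature, two-class data standing in for the terrain set
features, labels = make_moons(n_samples=1000, noise=0.3, random_state=42)

# Hold back 25% of the points as the test set we score against later
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.25, random_state=42)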

In the UD120 course we work with terrain data and use three algorithms: k-nearest neighbors, the random forest classifier, and AdaBoost. AdaBoost and random forest are ensemble methods that use decision trees to reach their predictions. A benefit of the random forest classifier is that it tends to give a good result quickly and is not as prone to overfitting as other decision tree methods are. Overfitting occurs when a model fits its training data too closely, noise and all, so it scores very well on the data it was trained on but poorly on new data. If your training accuracy is nearly perfect while your test accuracy lags well behind, it most likely means something is wrong.
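
As a rough illustration of that warning sign (reusing the split sketched above), a single fully grown decision tree will usually score close to 1.0 on its own training data while doing noticeably worse on the held-out test data, and the random forest usually holds up better on the test set.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# A single, fully grown tree tends to memorize the training points
tree = DecisionTreeClassifier(random_state=42)
tree.fit(features_train, labels_train)
print("tree   train: %.3f  test: %.3f" % (
    tree.score(features_train, labels_train),
    tree.score(features_test, labels_test)))

# Averaging many randomized trees usually generalizes better to the test set
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(features_train, labels_train)
print("forest train: %.3f  test: %.3f" % (
    forest.score(features_train, labels_train),
    forest.score(features_test, labels_test)))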

The UD120 exercises ask you to run these algorithms against that data set, look at the results, and then tweak them for better ones.

from sklearn import ensemble
from sklearn import neighbors

# Random forest: an ensemble of randomized decision trees
classifier = ensemble.RandomForestClassifier()
classifier.fit(features_train, labels_train)
print(classifier.score(features_test, labels_test))

# AdaBoost: boosted shallow decision trees
classifier = ensemble.AdaBoostClassifier()
classifier.fit(features_train, labels_train)
print(classifier.score(features_test, labels_test))

# K-nearest neighbors with n_neighbors=4
classifier = neighbors.KNeighborsClassifier(4)
classifier.fit(features_train, labels_train)
print(classifier.score(features_test, labels_test))

Running these gave me 0.924, 0.924, and 0.94 for each respective algorithm.

I think the biggest things I have had to wrap my mind around with supervised learning are what exactly train and test data are, and how I can tell whether my algorithm is accurate. The flow appears to be: I create a classifier, then give it the features I want it to look at along with the labels for that data. For example, given a data set of photos that I want to classify by color, "blue" and "red" would be my labels. If instead my labels were something continuous rather than discrete, say the ages of the people in the photos, then I would be predicting a range of numbers rather than categories.
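
As a sketch of that distinction, scikit-learn ships both a classifier and a regressor version of k-nearest neighbors; the handful of points below are made up purely for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Made-up two-feature points, purely illustrative
X = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1],
              [0.8, 0.9], [0.15, 0.25], [0.85, 0.75]])

# Discrete labels -> classification ("blue" vs "red")
colors = np.array(["blue", "red", "blue", "red", "blue", "red"])
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, colors)
print(clf.predict([[0.12, 0.22]]))  # -> ['blue']

# Continuous labels -> regression (predicting an age)
ages = np.array([23.0, 67.0, 25.0, 63.0, 22.0, 70.0])
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, ages)
print(reg.predict([[0.12, 0.22]]))  # -> the mean age of the three nearest points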

I am enjoying the course so far but do have some criticisms. I wish the code were in Python 3. The course also seems aimed at people who already have the math background, so it can be a bit jarring when a problem jumps straight into "OK, now calculate this entropy value." On top of that, Python 2 handles division differently than Python 3 does, so you might spend a while stuck and googling for answers. Sometimes the slides also do not include what the question actually is, so if you come back to the course later it is hard to understand exactly what they're asking for.
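
Concretely, the division gotcha looks like this; the same expression gives different answers depending on the interpreter:

print(3 / 2)    # 1 in Python 2 (integer division), 1.5 in Python 3 (true division)
print(3 // 2)   # 1 in both: // is explicit floor division
print(3 / 2.0)  # 1.5 in both: a float operand forces true division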

According to the authors, k-nearest neighbors is the simplest of these algorithms, and the scikit-learn documentation says it works for both supervised and unsupervised learning. From what I have read, it is simple because when you create the classifier you define how many "neighbors", or nearby points, take part in the majority vote that assigns a class. For each point the algorithm essentially asks, "Of the k training points closest to you, which label does each one carry? You get whatever label the majority of your nearest neighbors decide."
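
The number of voters is just a parameter on the classifier. A quick sketch of turning that knob, reusing the earlier split (the particular values of k are arbitrary): a small k lets the decision boundary chase individual points, while a larger k smooths the vote out.

from sklearn.neighbors import KNeighborsClassifier

# features_train etc. come from the train/test split sketched earlier
for k in (1, 4, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(features_train, labels_train)
    print("k=%d  test accuracy: %.3f" % (k, knn.score(features_test, labels_test)))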

With a random forest classifier I can get a decent result because each tree in the forest is built from a random sample of the data and a random subset of its features. Any individual tree makes plenty of mistakes, but because different trees make different mistakes, the majority vote across the forest generally comes out with a good classification. From reading Kaggle articles and notebooks, it seems you can often reach for this method and get decent results.
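
Those two sources of randomness show up directly as parameters in scikit-learn; the values below are arbitrary choices for the sketch, again reusing the earlier split.

from sklearn.ensemble import RandomForestClassifier

# n_estimators: how many randomized trees to grow and vote over
# max_features: how many features each split is allowed to consider
# bootstrap=True: each tree trains on a random sample drawn with replacement
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                bootstrap=True, random_state=42)
forest.fit(features_train, labels_train)
print(forest.score(features_test, labels_test))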

AdaBoost is another ensemble classifier built on decision trees. It trains its weak classifiers in sequence: the earlier ones reveal which points are hard to predict, and the later ones focus on getting those hard points right.
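
One way to watch that sequential behaviour is scikit-learn's staged_score, which reports the score after each boosting round; this is only a sketch on the earlier split.

from sklearn.ensemble import AdaBoostClassifier

# By default each round fits a shallow tree (a stump) to the points the
# previous rounds found hardest, by reweighting the training data
boost = AdaBoostClassifier(n_estimators=50, random_state=42)
boost.fit(features_train, labels_train)

# staged_score yields the test accuracy after each boosting round
for i, score in enumerate(boost.staged_score(features_test, labels_test), start=1):
    if i % 10 == 0:
        print("after %d rounds: %.3f" % (i, score))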

Ensemble methods combine different classifiers to come up with what is ideally a more accurate prediction of which label each point belongs to.
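
The combined classifiers do not even have to be tree-based. As a closing sketch, scikit-learn's VotingClassifier takes a plain majority vote across whatever classifiers you hand it, here the three from this post, again on the earlier split.

from sklearn.ensemble import VotingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier

# Three different classifiers voting on each point's label
vote = VotingClassifier(estimators=[
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("knn", KNeighborsClassifier(n_neighbors=4)),
    ("boost", AdaBoostClassifier(n_estimators=50, random_state=42)),
])
vote.fit(features_train, labels_train)
print(vote.score(features_test, labels_test))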