Introduction to Machine Learning

Posted by Alan Barr on Fri 28 October 2016

Over the past few years the term "machine learning" has been appearing in new places in my life. Some of the products I use that rely on it include recommendation engines, spam filters, and natural language processing, among many other areas. I want to put these techniques to use in some manner, and I would especially like to use this technology to generate art. November is NaNoWriMo (National Novel Writing Month) and also National Novel Generating Month, and I would like to generate my own novel. However, my machine learning skills are not quite up to par for a strict machine learning solution. While I am intrigued by the neuralsnap poetry generator, I haven't found a simple way to integrate it with the TensorFlow image captioning model as of yet.

For now I am attempting to make my way through various machine learning courses. I have dabbled in a course on Kadenze, and now I am trying Udacity's UD120 Machine Learning course. The first few examples use the Naive Bayes and support vector machine algorithms to make predictions from data.

The nice thing about the Naive Bayes algorithm is that it is good at classifying large amounts of textual data. Spam filters usually use this type of algorithm to classify an email as valid or not. A big point of the course so far has been to try out different methods on your data: one algorithm might give you better accuracy, while another might have speed implications. From what I have gleaned from the course and online, because Naive Bayes is a simple algorithm it can do well at classifying data into categories even when you do not have much data, while other algorithms may take more work to implement. The course authors warned, though, that using Naive Bayes for some prediction tasks would not make much sense.
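To make the spam-filtering idea concrete, here is a minimal sketch of Naive Bayes text classification with scikit-learn. The tiny "email" dataset and labels are made up for illustration; a real filter would train on thousands of messages.

```python
# A toy Naive Bayes spam filter, in the spirit of the course examples.
# The emails and labels below are invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: 1 = spam, 0 = not spam
emails = [
    "win a free prize now",
    "claim your free money today",
    "meeting agenda for tomorrow",
    "project update and notes",
]
labels = [1, 1, 0, 0]

# Turn raw text into word-count feature vectors
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(emails)

# Fit the classifier on the counts
clf = MultinomialNB()
clf.fit(features, labels)

# Classify a new message (vectorize it with the same vocabulary)
print(clf.predict(vectorizer.transform(["free prize money"])))  # → [1]
```

The `CountVectorizer` step is what makes this work on text: Naive Bayes itself just sees word counts per class, which is why it scales so easily to large bodies of text.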

In this course we use scikit-learn to do the computation, and the Naive Bayes algorithm seemed to work fine on a set of several thousand emails; the accuracy seemed pretty good. One gotcha the course does not explain, but the forums do, is that the svm module's SVC with a linear kernel is really slow, and the faster option is svm.LinearSVC(). While I am still exploring the use cases of SVMs, they do appear to be great for detecting boundaries around data, provided there is not too much overlap. So far it seems that a support vector machine can be a great utility if your classes can be separated along a line. One of the features of an SVM is that it attempts to maximize the margin between the separate data points, establishing a clear boundary, provided your data is a good fit for that.
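The two classes mentioned above can be sketched side by side. The six 2D points below are made up: both classifiers learn the same kind of maximum-margin linear boundary on well-separated data, but LinearSVC (backed by liblinear) is the one that stays fast on datasets the size of the course's email set.

```python
# Comparing the slow and fast linear SVM classes in scikit-learn.
# The toy 2D points are invented; class 0 sits near the origin,
# class 1 sits far away, so a clear linear margin exists between them.
from sklearn.svm import SVC, LinearSVC

X = [[0, 0], [1, 1], [0, 1], [8, 8], [9, 9], [8, 9]]
y = [0, 0, 0, 1, 1, 1]

slow = SVC(kernel="linear")  # kernel-based; slow on large datasets
fast = LinearSVC()           # liblinear-based; much faster at scale

slow.fit(X, y)
fast.fit(X, y)

# Both should place new points on the correct side of the margin
print(slow.predict([[1, 0]]), fast.predict([[9, 8]]))  # → [0] [1]
```

On data this small the speed difference is invisible, but on tens of thousands of text vectors the SVC version can take minutes while LinearSVC finishes in seconds, which is exactly the gotcha from the forums.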