I have been sitting on the MNIST data set for a while now. The MNIST database is a large collection of handwritten digits, and these are provided in the Kaggle knowledge competition Digit Recognizer. In fact, I have been sitting on this data set for so long that the last thing I wrote for it dates back to last August: a Python script that takes the training data and creates a BMP image file for each data point. You end up with a folder of 42000 images of 28 by 28 pixels each (about 74.5 MB on disk). I have uploaded it as a Gist here for those interested.
This weekend, I used the linear support vector classification implemented in scikit-learn to create a simple model that determines the digit from the given pixel data, with an accuracy of 84% on the test data in the Kaggle competition. My implementation is based on this example of using an SVM to recognize handwritten digits.
What I present here isn’t the script I used for the Kaggle submission, but the one I ran on the training data to measure the accuracy of the model. The advantage of using only the training data is that I have the correct label for every data point and can therefore display a confusion matrix and other metrics for evaluating the quality of the model.
Linear support vector machines try to find a hyperplane that separates the training data into two classes with a maximum margin. In our case, the class of a data point is the digit it represents. We want to maximize the margin between the hyperplane and the two classes to minimize the error of incorrectly recognizing a digit. The hyperplane then divides the data so that everything on one side of it belongs to one class and everything on the other side belongs to the other class. Each pixel value of the 28 by 28 image gets its own dimension, meaning that an image is a point in a space with 28 * 28 = 784 dimensions, and the hyperplane divides the data points into two classes in this 784-dimensional space.
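To make the "which side of the hyperplane" idea concrete, here is a minimal sketch in two dimensions instead of 784. The weights, the bias and the function names are made up for illustration and are not taken from a trained model; a real linear SVM learns the weights and the bias from the training data.

```python
def decision_value(weights, bias, point):
    """Signed score w . x + b; its sign tells on which side of the hyperplane the point lies."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(weights, bias, point):
    """Return +1 for points on the positive side of the hyperplane, -1 otherwise."""
    return 1 if decision_value(weights, bias, point) >= 0 else -1


# Hypothetical hyperplane x0 + x1 - 1 = 0 (a line in 2D).
weights = [1.0, 1.0]
bias = -1.0

above = classify(weights, bias, [2.0, 2.0])  # score 3.0, positive side
below = classify(weights, bias, [0.0, 0.0])  # score -1.0, negative side
print(above, below)
```

For an MNIST image, the point would simply be the list of its 784 pixel values, and the weight list would have 784 entries as well.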
An additional aspect to consider is that sorting images into the digits 0 to 9 is a multiclass classification problem. My description of support vector machines in the previous paragraph only involves one hyperplane, which can only separate two classes. That is a real problem when we have more than two classes, as in this case. The solution is to train multiple support vector machines that each answer a question of the form “Is this digit a 3 or not a 3?”. Now we are solving a binary classification problem again, with the two classes “is a 3” and “is not a 3”. In our case we have one support vector machine for each digit, giving us a total of ten. We take the digit whose classifier reports the highest confidence score as the prediction.
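The "highest confidence score wins" rule can be sketched in a few lines. This is only an illustration of the one-vs-rest idea, not scikit-learn's internal code; the three classifiers, their weights and the function name are hypothetical, and the points have two features instead of 784.

```python
def one_vs_rest_predict(classifiers, point):
    """classifiers maps a digit to (weights, bias); the digit with the highest score wins."""
    scores = {}
    for digit, (weights, bias) in classifiers.items():
        scores[digit] = sum(w * x for w, x in zip(weights, point)) + bias
    best_digit = max(scores, key=scores.get)
    return best_digit, scores


# Hypothetical binary classifiers: "is a 3?", "is a 7?", "is an 8?".
classifiers = {
    3: ([1.0, 0.0], -0.5),
    7: ([0.0, 1.0], -0.5),
    8: ([-1.0, -1.0], 0.5),
}

predicted_digit, scores = one_vs_rest_predict(classifiers, [0.2, 0.9])
print(predicted_digit, scores)
```

Only the "is a 7?" classifier returns a positive score for this point, so 7 is chosen. With ten digits, the same loop would run over ten classifiers.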
So here is what “train.csv” looks like, to make sense of the indexing in the code:
label, pixel0, pixel1, pixel2, ..., pixel783
1,     0,      0,      0,      ..., 0
4,     0,      0,      0,      ..., 0
⋮,     ⋮,      ⋮,      ⋮,      ⋮,   ⋮
The pixel values lie in the range [0, 255], where 255 is black and 0 is white.
And this is what my code looks like:
from sklearn import svm, metrics
from numpy import genfromtxt
import numpy as np

# load train.csv; [1:] drops the header row
dataset = genfromtxt('train.csv', delimiter=",", dtype=np.dtype('>i4'))[1:]
labels = [x[0] for x in dataset]   # first column: the digit
data = [x[1:] for x in dataset]    # remaining 784 columns: the pixel values

n_samples = len(labels)
n_features = len(data[0])
print("Number of samples: " + str(n_samples) + ", number of features: " + str(n_features))

# a support vector classifier
classifier = svm.LinearSVC()

# using two thirds for training
# and one third for testing
split_point = int(n_samples * 0.66)
labels_learn = labels[:split_point]
data_learn = data[:split_point]
labels_test = labels[split_point:]
data_test = data[split_point:]
print("Training: " + str(len(labels_learn)) + " Test: " + str(len(labels_test)))

# Learning Phase
classifier.fit(data_learn, labels_learn)

# Predict Test Set
predicted = classifier.predict(data_test)

# classification report
print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(labels_test, predicted)))

# confusion matrix
print("Confusion matrix:\n%s" % metrics.confusion_matrix(labels_test, predicted))
The cool thing about the metrics module is that you can easily print output like this to judge how well your model is performing:
Number of samples: 42000, number of features: 784
Training: 27720 Test: 14280
Classification report for classifier LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0):
             precision    recall  f1-score   support

          0       0.96      0.93      0.94      1442
          1       0.93      0.97      0.95      1613
          2       0.92      0.79      0.85      1376
          3       0.88      0.86      0.87      1468
          4       0.84      0.92      0.88      1339
          5       0.67      0.84      0.75      1296
          6       0.84      0.96      0.90      1388
          7       0.94      0.85      0.89      1504
          8       0.81      0.73      0.77      1401
          9       0.87      0.78      0.82      1453

avg / total       0.87      0.86      0.86     14280

Confusion matrix:
[[1334    0    8    9    2   36   44    1    8    0]
 [   0 1570    5    4    2    9    2    2   18    1]
 [   8   25 1087   33   14   58   77   17   52    5]
 [   4   14   27 1260    2   89   20   11   32    9]
 [   4   12    9    2 1230   13   23    2   21   23]
 [  13    6    5   50   23 1083   48    1   54   13]
 [   7    5    7    0    5   23 1336    0    4    1]
 [   6    6   11   10   33   34    2 1274   23  105]
 [   5   44   12   30   16  213   30    6 1027   18]
 [   7   10    8   32  138   52    2   41   32 1131]]
The most interesting metric for our digit recognition is probably the confusion matrix. Each row stands for the actual digit and each column for the predicted digit. An entry in the matrix gives the number of times the digit of its row was recognized as the digit of its column. So the very first entry, 1334, says that a digit 0 was recognized as a 0 in 1334 cases. The number 8 two columns to the right means that 8 times a digit 0 in the held-out part of the training data was recognized as a 2 by our SVM, and so on and so forth. Naturally, for a well-working prediction model the entries on the diagonal should be substantially larger than the other values in the same row, which is the case with this linear SVM.
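The numbers in the classification report can be recomputed from the confusion matrix by hand, which is a good way to check how to read it. The following sketch uses the matrix printed above; the helper names recall and precision are my own and are not scikit-learn functions.

```python
# Confusion matrix from the output above: rows are actual digits, columns are predictions.
confusion = [
    [1334, 0, 8, 9, 2, 36, 44, 1, 8, 0],
    [0, 1570, 5, 4, 2, 9, 2, 2, 18, 1],
    [8, 25, 1087, 33, 14, 58, 77, 17, 52, 5],
    [4, 14, 27, 1260, 2, 89, 20, 11, 32, 9],
    [4, 12, 9, 2, 1230, 13, 23, 2, 21, 23],
    [13, 6, 5, 50, 23, 1083, 48, 1, 54, 13],
    [7, 5, 7, 0, 5, 23, 1336, 0, 4, 1],
    [6, 6, 11, 10, 33, 34, 2, 1274, 23, 105],
    [5, 44, 12, 30, 16, 213, 30, 6, 1027, 18],
    [7, 10, 8, 32, 138, 52, 2, 41, 32, 1131],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total


def recall(digit):
    """Share of the actual `digit` images that were recognized correctly (row-wise)."""
    return confusion[digit][digit] / sum(confusion[digit])


def precision(digit):
    """Share of the `digit` predictions that were actually correct (column-wise)."""
    return confusion[digit][digit] / sum(row[digit] for row in confusion)


print("accuracy: %.2f" % accuracy)
for digit in range(10):
    print("%d  precision %.2f  recall %.2f" % (digit, precision(digit), recall(digit)))
```

The low precision for the digit 5 comes from the large off-diagonal entries in its column, most notably the 213 images of an 8 that were recognized as a 5.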
Applying the SVM to my own digits
Now that I have a reasonably accurate working model for the MNIST handwritten digits, I would also like to see how well it works on my own digits. So I created a few BMP images using Paint.
I extract the grey value of each pixel from each BMP image and feed these values to my SVM as test data for prediction. I use joblib to load my previously saved classifier, so I don’t have to train my model from scratch every time. Here is the script I used:
from PIL import Image
import numpy as np
import sys
# originally "from sklearn.externals import joblib", which newer
# scikit-learn releases no longer provide
import joblib

# argv[1] - path to input image
if len(sys.argv) != 2:
    print("Incorrect number of arguments, add a BMP file as cmd line argument.\n")
    sys.exit()

# loading the grey values from the image
custom_IM = Image.open(sys.argv[1])
custom_pixels = list(custom_IM.getdata())
corr_pixels = []

# convert pixel data to fit training data format (swap grey values);
# each pixel is expected to be an (R, G, B) tuple, the first channel is used
for row in custom_pixels:
    new_row = 255 - row[0]
    corr_pixels.append(new_row)

if len(corr_pixels) != 784:
    print("Incorrect Image Dimensions (needs to be 784)\n")
    sys.exit()

# convert to a numpy array with a single row, since predict() expects
# one row per sample
test_set = np.array(corr_pixels).reshape(1, -1)
classifier = joblib.load("../classifier/kaggle_digit_recognizer.pkl")

# Predict Test Set
predicted = classifier.predict(test_set)

# prints the predicted number
print(predicted)
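The script above loads a classifier that was saved earlier, but the saving step itself is not part of this post. A minimal sketch of how it could look with joblib.dump follows. The tiny training set and the temporary output path are stand-ins I made up so that the snippet runs on its own; in the real setup you would fit on the MNIST training data and write to the path that the prediction script loads.

```python
import os
import tempfile

import joblib
from sklearn import svm

# Hypothetical stand-in data: four "images" with two pixels each.
data_learn = [[0, 0], [0, 1], [10, 10], [10, 9]]
labels_learn = [0, 0, 1, 1]

classifier = svm.LinearSVC()
classifier.fit(data_learn, labels_learn)

# Save the fitted classifier to disk (a temporary folder is used here).
model_path = os.path.join(tempfile.mkdtemp(), "kaggle_digit_recognizer.pkl")
joblib.dump(classifier, model_path)

# Loading it again gives a classifier that predicts the same labels.
restored = joblib.load(model_path)
same_predictions = list(restored.predict(data_learn)) == list(classifier.predict(data_learn))
print(model_path, same_predictions)
```

A saved model should be loaded with the same scikit-learn version that was used to save it, since the pickled format is not guaranteed to stay compatible across versions.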
I created only one image for each digit, and here are my results:
label, prediction
0, 3
1, 1
2, 7
3, 3
4, 1
5, 8
6, 4
7, 2
8, 8
9, 8

Accuracy: 40%
So these aren’t very good results at all for my self-made images. Since my model performed far better on the actual test set, I guess that the circumstances under which the digits are created matter a great deal. It could be that the brushes I used in Paint don’t resemble the type of writing in the original data set.
Overall, support vector machines are a powerful and widely used machine learning method. But you can also see how badly such a simple model performs on images that were created in a different way.
As always, please comment with corrections and suggestions on how to easily improve the code and, in this case, also the prediction model.
Thank you for reading!
The post Using Support Vector Machines for Digit Recognition appeared first on Rather Read.