Using Support Vector Machines for Digit Recognition

I have been sitting on the MNIST data set for a while now. MNIST is a large database of handwritten digits, and it is the data provided in the Kaggle Knowledge Competition Digit Recognizer. I have been sitting on this data set for so long, in fact, that the last thing I wrote for it was last August: a Python script that takes the training data and creates a BMP image file for each data point. You end up with a folder of 42000 28 by 28 pixel images (about 74.5 MB of disk space). I have uploaded it as a Gist here for those interested.
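The Gist has the details, but a minimal sketch of the idea (using PIL; the output folder and file naming scheme here are just examples) looks roughly like this:

import numpy as np
from PIL import Image

# read the Kaggle training data, skipping the header row
dataset = np.genfromtxt('train.csv', delimiter=',', skip_header=1, dtype=np.uint8)

for i, row in enumerate(dataset):
    label, pixels = row[0], row[1:]
    # the data stores ink as high values, so invert to get dark digits on white
    image = Image.fromarray(255 - pixels.reshape(28, 28))
    # assumes the output folder "digits" already exists
    image.save("digits/%05d_label_%d.bmp" % (i, label))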


Digits from the MNIST data set

What I did this weekend was use the linear support vector classification implemented in the scikit-learn module to build a simple model that determines the digit from the given pixel data, with an accuracy of 84% on the test data of the Kaggle competition. My implementation is based on this example of using an SVM to recognize handwritten digits.

What I will present here isn't the script I used for the Kaggle submission, but the one I used on the training data to measure the accuracy of the model. The advantage of using only the training data is that I have the correct label for every data point and can therefore display a confusion matrix and other metrics for evaluating the quality of the model.

Linear support vector machines try to find a hyperplane that separates the training data into two classes with a maximum margin. In our case the class of a data point is the digit it represents. We want to maximize the margin between the hyperplane and the two classes to minimize the risk of recognizing a digit incorrectly. The hyperplane divides the data so that everything on one side belongs to one class and everything on the other side belongs to the other class. Each pixel of the 28 by 28 image gets its own dimension, meaning an image is a point in a space with 28 * 28 = 784 dimensions, and the hyperplane divides the data points into two classes in this 784-dimensional space.
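In scikit-learn terms each such hyperplane is just a weight vector plus an intercept. A minimal sketch of that decision rule, assuming a fitted LinearSVC called classifier and a single 784-element pixel vector x (both hypothetical names here), would be:

import numpy as np

# classifier.coef_[0] is the 784-dimensional normal vector of one hyperplane,
# classifier.intercept_[0] is its offset from the origin
score = np.dot(classifier.coef_[0], x) + classifier.intercept_[0]

# a positive score puts x on one side of the hyperplane, a negative score on the other
print(score)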

An additional aspect to consider is that dividing images into the digits 0 through 9 is a multiclass classification problem. The description in the previous paragraph involves only one hyperplane, which can separate only two classes, and that is a problem when we have more than two classes, as we do here. The solution is to train multiple support vector machines, each of which solves a problem of the form "Is this digit a 3 or not a 3?". That way we are back to binary classification with the two classes "is a 3" and "is not a 3". In our case we have one support vector machine per digit, giving us a total of ten, and we take the solution with the highest confidence score as the predicted digit.
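LinearSVC handles this one-vs-rest scheme internally, but conceptually the prediction boils down to something like this sketch (again assuming a fitted classifier and a single 784-element sample x):

import numpy as np

# one confidence score per digit: the k-th binary classifier answers "is this a k or not?"
scores = classifier.decision_function([x])[0]

# the digit whose binary classifier is most confident wins
predicted_digit = int(np.argmax(scores))
print(predicted_digit)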

So here is what "train.csv" looks like, to make sense of the indexing in the code:

label,pixel0,pixel1,pixel2,...,pixel783
1 , 0 , 0 , 0 , ... , 0
4 , 0 , 0 , 0 , ... , 0
⋮ , ⋮ , ⋮ , ⋮ , ⋮ , ⋮  

The pixel data can take values in the range [0, 255], where 255 is black and 0 is white.

And this is what my code looks like:

from sklearn import svm, metrics
from numpy import genfromtxt
import numpy as np

dataset = genfromtxt('train.csv', delimiter=",", dtype=np.dtype('>i4'))[1:]
labels = [x[0] for x in dataset]
data = [x[1:] for x in dataset]

n_samples = len(labels)
n_features = len(data[0])

print("Number of samples: " + str(n_samples) + ", number of features: "+ str(n_features))

# a support vector classifier
classifier = svm.LinearSVC()

split_point = int(n_samples * 0.66)

# using two thirds for training
# and one third for testing

labels_learn = labels[:split_point]
data_learn = data[:split_point]

labels_test = labels[split_point:]
data_test = data[split_point:]

print("Training: " + str(len(labels_learn)) + " Test: " + str(len(labels_test)))

# Learning Phase
classifier.fit(data_learn, labels_learn)

# Predict Test Set
predicted = classifier.predict(data_test)

# classification report
print("Classification report for classifier %s:\n%s\n" % (classifier, metrics.classification_report(labels_test, predicted)))

# confusion matrix
print("Confusion matrix:\n%s" % metrics.confusion_matrix(labels_test, predicted))

The nice thing about metrics is that you can easily print output like this to judge how well your model is performing:

Number of samples: 42000, number of features: 784
Training: 27720 Test: 14280
Classification report for classifier LinearSVC(C=1.0, class_weight=None, dual=True,
     fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0):
             precision    recall  f1-score   support

          0       0.96      0.93      0.94      1442
          1       0.93      0.97      0.95      1613
          2       0.92      0.79      0.85      1376
          3       0.88      0.86      0.87      1468
          4       0.84      0.92      0.88      1339
          5       0.67      0.84      0.75      1296
          6       0.84      0.96      0.90      1388
          7       0.94      0.85      0.89      1504
          8       0.81      0.73      0.77      1401
          9       0.87      0.78      0.82      1453

avg / total       0.87      0.86      0.86     14280


Confusion matrix:
[[1334    0    8    9    2   36   44    1    8    0]
 [   0 1570    5    4    2    9    2    2   18    1]
 [   8   25 1087   33   14   58   77   17   52    5]
 [   4   14   27 1260    2   89   20   11   32    9]
 [   4   12    9    2 1230   13   23    2   21   23]
 [  13    6    5   50   23 1083   48    1   54   13]
 [   7    5    7    0    5   23 1336    0    4    1]
 [   6    6   11   10   33   34    2 1274   23  105]
 [   5   44   12   30   16  213   30    6 1027   18]
 [   7   10    8   32  138   52    2   41   32 1131]]

The most interesting analysis metric for our digit recognition is probably the confusion matrix. Each row represents one digit and each column also represents one digit. An entry in the matrix is the number of times the digit of that row was recognized as the digit of that column. So the very first entry, 1334, says that a 0 was recognized as a 0 1334 times. The 8 two columns to the right means that 8 times a 0 in the test data was recognized as a 2 by our SVM, and so on and so forth. Naturally, for a well-working prediction model the entries on the diagonal should be substantially larger than the other values in their row, which is the case with this linear SVM.
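To put a number on that observation per digit, the recall of each digit can be read straight from the matrix by dividing the diagonal entry by its row sum. A small sketch, reusing labels_test and predicted from the script above:

import numpy as np
from sklearn import metrics

cm = metrics.confusion_matrix(labels_test, predicted)

# diagonal entries are the correctly recognized digits,
# row sums are how often each digit appears in the test set
per_digit_recall = cm.diagonal() / cm.sum(axis=1).astype(float)

for digit, recall in enumerate(per_digit_recall):
    print("digit %d: recall %.2f" % (digit, recall))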

Applying the SVM to my own digits

Now that I have a reasonably accurate model for the MNIST handwritten digits, I would also like to see how well it works on my own digits. So I created a few BMP images using Paint.

My digits made in Paint

I extract the grey value of each pixel from each BMP image and feed it as test data to my SVM's predict. I use joblib to load my previously saved classifier, so I don't have to train the model from scratch every time. Here is the script I used:

from PIL import Image
import numpy as np
import sys
from sklearn.externals import joblib

# argv[1] - path to input image
if len(sys.argv) != 2:
    print("Incorrect number of arguments, add a BMP file as cmd line argument.\n")
    sys.exit()

# loading the grey values from the image
custom_IM = Image.open(sys.argv[1])
custom_pixels = list(custom_IM.getdata())
corr_pixels = []

# convert pixel data to fit training data format (swap grey values)
for row in custom_pixels:
    new_row = 255 - row[0]
    corr_pixels.append(new_row)

if len(corr_pixels) != 784:
    print("Incorrect Image Dimensions (needs to be 784)\n")
    sys.exit()

# convert to a numpy array with shape (1, 784), since predict expects a 2D array
test_set = np.array(corr_pixels).reshape(1, -1)

classifier = joblib.load("../classifier/kaggle_digit_recognizer.pkl")

# Predict Test Set
predicted = classifier.predict(test_set)

# prints the predicted number
print(predicted)
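The pickled classifier loaded above was created right after training, in a step not shown in the first script; assuming the same file path, it amounts to roughly this:

from sklearn.externals import joblib

# persist the fitted LinearSVC so it can be reused without retraining
joblib.dump(classifier, "../classifier/kaggle_digit_recognizer.pkl")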

I have only created one image for each digit and here are my results:

label, prediction
0, 3
1, 1
2, 7
3, 3
4, 1
5, 8
6, 4
7, 2
8, 8
9, 8
Accuracy: 40%

So these aren't very good results at all for my self-made images. Since the model performed much better on the actual test set, I guess that the circumstances under which the digits were drawn matter a great deal. It could be that the brushes I used in Paint don't produce the same kind of strokes as the writing in the original data set.

Overall, support vector machines are a powerful prediction method and a widely used machine learning algorithm. But you can also see how badly such a simple model performs on images created under different conditions.

As always, please comment with corrections and suggestions on how to improve the code and, in this case, also the prediction model.

Thank you for reading!
