MNIST is a dataset developed by Yann LeCun, Corinna Cortes and Christopher Burges for evaluating machine learning models on the problem of handwritten digit classification. The dataset was built from several scanned document collections available from the National Institute of Standards and Technology (NIST). Images of digits were taken from a variety of scanned documents, normalized in size and centered. This makes it an excellent dataset for assessing models, allowing the developer to focus on the learning mechanism with very little data cleaning or preparation required.

Each image is a 28 by 28 pixel square (784 pixels in total). A set of 60,000 images is used to train the model, and a separate set of 10,000 images is used to test it. The task is to recognize the 10 digits (0 to 9), that is, classification into 10 classes.

This tutorial considers handwritten digit recognition using a multilayer perceptron and Keras. The Keras deep learning library provides a convenient mnist.load_data() function for loading the MNIST dataset. The dataset is downloaded automatically on the first call of this function and saved in your home directory at ~/.keras/datasets/mnist.pkl.gz as a 15 MB file. This is very convenient for developing and testing deep learning models.

To demonstrate how easy it is to load the MNIST dataset, we first write a small script to load and render the first four images in the training set.

from keras.datasets import mnist
import matplotlib.pyplot as plt
# load (download if needed) the MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# plot the first 4 images in gray scale on a 2 x 2 grid
for i in range(4):
	plt.subplot(2, 2, i + 1)
	plt.imshow(X_train[i], cmap=plt.get_cmap('gray'))
# show the plot
plt.show()

By running the example above, you should see the image below.

To find out whether we really need a complex model such as a convolutional neural network, we first try a very simple neural network with one hidden layer. We will use this network as a baseline for comparison with more complex convolutional neural network models. Let's start by importing the classes and functions that we need.

import numpy
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.utils import np_utils

The training dataset is structured as a three-dimensional array (image index, image height, image width). To prepare the data, we first flatten each image into a one-dimensional array, since we treat each pixel as a separate input feature. In this case, the 28 × 28 images are converted to arrays of 784 elements. We can do this conversion using the reshape() function of the NumPy library. To reduce memory consumption compared with the default 64-bit floats, we also convert the pixel values to 32-bit floating point precision.

(X_tr, y_tr), (X_tst, y_tst) = mnist.load_data()
npix = X_tr.shape[1] * X_tr.shape[2]
X_tr = X_tr.reshape(X_tr.shape[0], npix).astype('float32')
X_tst = X_tst.reshape(X_tst.shape[0], npix).astype('float32')
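
As a quick check of the reshaping step, the sketch below applies the same reshape to a small synthetic array standing in for the MNIST images. The name fake_images and the sample count of 5 are illustrative assumptions, not part of the tutorial's data.

```python
import numpy as np

# Synthetic stand-in for X_tr (assumption): 5 grayscale "images" of 28 x 28 pixels,
# stored as unsigned 8-bit integers like the real MNIST arrays
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each 28 x 28 image into one row of 784 features and cast to float32
n_pixels = fake_images.shape[1] * fake_images.shape[2]
flat = fake_images.reshape(fake_images.shape[0], n_pixels).astype('float32')

print(flat.shape, flat.dtype)
```

Each row of flat holds the pixels of one image in row-major order, so the first 28 values are the top row of the image.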

Pixel values are grayscale intensities from 0 to 255. For effective training of neural networks, it is almost always recommended to scale the input values. We can normalize the pixel values to the range 0 to 1 by dividing each value by the maximum value of 255.

X_tr = X_tr / 255
X_tst = X_tst / 255
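
To see the effect of this scaling in isolation, here is a minimal sketch on a few hand-picked pixel values. The array pixels is a hypothetical example, not MNIST data.

```python
import numpy as np

# Hypothetical pixel values covering the full grayscale range (assumption)
pixels = np.array([0, 64, 128, 255], dtype='float32')

# Same operation as in the tutorial: divide by the maximum value 255
scaled = pixels / 255

# The smallest value maps to 0 and the largest to 1; the dtype stays float32
print(scaled.min(), scaled.max(), scaled.dtype)
```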

The output variable is an integer from 0 to 9, since this is a multi-class classification task. It is good practice to use one-hot encoding of the class values, converting the vector of class integers into a binary matrix. We can easily do this using the built-in helper function np_utils.to_categorical() in Keras.

y_tr = np_utils.to_categorical(y_tr)
y_tst = np_utils.to_categorical(y_tst)
num_classes = y_tst.shape[1]
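
To make the encoding concrete, the following is a minimal NumPy sketch of one-hot encoding. The function one_hot is a hypothetical helper written for illustration; it is not the Keras implementation, only a small function intended to produce the same kind of binary matrix.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype='float32')
    # Put a 1 in the column given by each label
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Three example labels (assumption): the digits 5, 0 and 4
example = one_hot([5, 0, 4], num_classes=10)
print(example)
```

The original integer labels can be recovered from the binary matrix with an argmax over each row.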

Now we will create our simple neural network model with a single hidden layer and define it as a function.

def create_model():
	# create model
	m = Sequential()
	m.add(Dense(npix,input_dim=npix, kernel_initializer='normal', activation='relu'))
	m.add(Dense(num_classes,kernel_initializer='normal', activation='softmax'))
	# Compile model
	m.compile(loss='categorical_crossentropy',optimizer='adam', metrics=['accuracy'])
	return m

The model is a simple neural network with one hidden layer that has the same number of neurons as there are inputs (784). The hidden layer uses the rectified linear unit (ReLU) activation function.
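
As a rough size check, the sketch below counts the trainable parameters of this architecture, assuming each fully connected layer has one weight per input-output pair plus one bias per neuron. It also shows the ReLU function as a plain Python helper; the names relu, hidden_params and output_params are illustrative.

```python
def relu(x):
    """Rectified linear unit: passes positive values through, clips negatives to 0."""
    return max(0.0, x)

n_inputs = 784   # one input per pixel
n_hidden = 784   # hidden layer has as many neurons as inputs
n_classes = 10   # one output per digit

# Weights plus biases for each fully connected layer
hidden_params = n_inputs * n_hidden + n_hidden
output_params = n_hidden * n_classes + n_classes
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)  # 615440 7850 623290
print(relu(-2.5), relu(3.0))                       # 0.0 3.0
```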

At the output layer, the softmax activation function is used to convert the outputs to probability-like values, which allows one class out of 10 to be selected as the model's prediction.
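
The softmax step can be sketched in NumPy as follows. The input values in logits are made-up numbers chosen for illustration, and subtracting the maximum before exponentiating is a common numerical-stability choice rather than something the tutorial specifies.

```python
import numpy as np

def softmax(logits):
    """Convert raw layer outputs into values that are positive and sum to 1."""
    # Subtracting the row maximum does not change the result but avoids overflow
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# Hypothetical raw outputs for one image, one value per digit class
logits = np.array([[2.0, 1.0, 0.1, -1.0, 0.0, 0.5, 3.0, -0.5, 1.5, 0.2]])
probs = softmax(logits)

# The predicted class is the index with the highest probability
predicted = int(np.argmax(probs, axis=-1)[0])
print(probs.sum(), predicted)
```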

Now we just have to specify the loss function, the optimization algorithm and the metrics that we will collect. In probabilistic classification problems it is best to use cross-entropy rather than squared error as the loss function. Cross-entropy suits models with a logistic or softmax output layer because it is designed to maximize the model's confidence in the correct class and does not depend on how the remaining probability is distributed among the other classes.

The optimization algorithm is a form of gradient descent; the variants differ mainly in how the learning rate is chosen. In our case, we will use the Adam optimizer, which usually performs well. Since our classes are roughly balanced (each digit has a similar number of examples), accuracy, the proportion of inputs assigned to the correct class, is a suitable metric. Now we can train the model and evaluate the quality of the training.
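
To illustrate the difference between the loss and the metric, here is a minimal NumPy sketch of categorical cross-entropy and accuracy. The helper functions and the two sets of predictions are hypothetical, and three classes are used instead of ten for brevity.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-probability assigned to the correct class."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class is the correct one."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))

# One-hot targets for three samples (assumption: three classes for brevity)
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype='float32')

# Two hypothetical models: both pick the right class, with different confidence
confident = np.array([[0.90, 0.05, 0.05], [0.10, 0.80, 0.10], [0.05, 0.05, 0.90]])
unsure = np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]])

print(accuracy(y_true, confident), accuracy(y_true, unsure))
print(categorical_cross_entropy(y_true, confident),
      categorical_cross_entropy(y_true, unsure))
```

Both sets of predictions reach the same accuracy, but the confident one has the lower cross-entropy, which is why the loss gives a more informative training signal than accuracy alone.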

# build the model
model_perc = create_model()
# Fit the model
model_perc.fit(X_tr, y_tr, validation_data=(X_tst, y_tst), epochs=10, batch_size=200, verbose=2)
# Final evaluation of the model
scores = model_perc.evaluate(X_tst, y_tst, verbose=0)
print("Baseline Error: %.2f%%" % (100-scores[1]*100))

The model is fit for 10 epochs, and each weight update uses a batch of 200 images. The test data is used as the validation dataset, which lets you see the recognition quality of the model as it trains. The value verbose=2 reduces the output to one line per training epoch. Finally, the test dataset is used to evaluate the model, and the classification error is printed.

Train on 60000 samples, validate on 10000 samples
Epoch 1/10  - 21s - loss: 0.2781 - acc: 0.9213 - val_loss: 0.1443 - val_acc: 0.9585
Epoch 2/10  - 21s - loss: 0.1100 - acc: 0.9686 - val_loss: 0.0943 - val_acc: 0.9709
Epoch 3/10 - 18s - loss: 0.0709 - acc: 0.9798 - val_loss: 0.0809 - val_acc: 0.9739
Epoch 4/10  - 18s - loss: 0.0511 - acc: 0.9855 - val_loss: 0.0679 - val_acc: 0.9781
Epoch 5/10  - 18s - loss: 0.0361 - acc: 0.9898 - val_loss: 0.0650 - val_acc: 0.9801
Epoch 6/10  - 18s - loss: 0.0265 - acc: 0.9936 - val_loss: 0.0640 - val_acc: 0.9790
Epoch 7/10  - 18s - loss: 0.0191 - acc: 0.9953 - val_loss: 0.0624 - val_acc: 0.9810
Epoch 8/10  - 18s - loss: 0.0145 - acc: 0.9965 - val_loss: 0.0592 - val_acc: 0.9822
Epoch 9/10  - 18s - loss: 0.0109 - acc: 0.9977 - val_loss: 0.0554 - val_acc: 0.9827
Epoch 10/10  - 18s - loss: 0.0079 - acc: 0.9986 - val_loss: 0.0596 - val_acc: 0.9814
Baseline Error: 1.86%

As you can see, our model achieves an accuracy of about 98.14% and an error of 1.86% on the test dataset, which is good for such a simple model.
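
The arithmetic behind the training log above can be checked with a short sketch. The variable names are illustrative, and the accuracy value is copied from the final epoch of the log.

```python
n_train = 60000    # training images
batch_size = 200   # images per weight update
epochs = 10

# Number of weight updates in one epoch and over the whole run
updates_per_epoch = n_train // batch_size
total_updates = updates_per_epoch * epochs

# Classification error as printed by the tutorial: 100 minus accuracy in percent
val_acc = 0.9814
error_percent = 100 - val_acc * 100

print(updates_per_epoch, total_updates)           # 300 3000
print("Baseline Error: %.2f%%" % error_percent)   # Baseline Error: 1.86%
```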

The complete source code for this tutorial: