Now that we have covered the basics of neural networks, let’s do something with them. Identifying landmarks on images of faces is an extremely useful task. It is commonly used to determine facial expressions and focus of attention, and can even be used to identify people in images (ahem, Facebook). While the method we will be using is not the most modern or advanced, it is great for working through the process of creating, training and optimising a neural network. Before we get started, make sure you have grabbed the code from the GitHub repo.
Don’t forget to grab the Jupyter notebook for this article from GitHub here or by cloning:
The dataset we are going to use for this tutorial is a very small subset of that provided during the recently closed Kaggle facial landmarking competition. To simplify access to the data and reduce the download time, I have added the file training.csv to the GitHub repo; it contains all of the data we will need for this exercise. This file contains 1000 grayscale images, each with 15 manually marked facial points; each image is 96 x 96 pixels in size.
Looking at these training samples, we can see that there is variation in the expressions of the subjects, the lighting, and the consistency of the position of the ground truth landmarks. These variations, and others, are what often make it difficult to predict the correct position of the landmarks. The Jupyter notebook contains all of the code required to extract this data from the csv file.
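As a rough idea of what that extraction involves, here is a minimal sketch. It assumes the csv follows the Kaggle layout, with one column per landmark coordinate, an `Image` column of space-separated pixel values, and no missing values. The function name `load_landmark_csv` is made up for illustration and the notebook may well do this differently.

```python
import csv
import io

import numpy as np


def load_landmark_csv(source, image_column="Image"):
    """Read landmark coordinates and pixel data from an open csv file.

    Assumes one column per landmark coordinate plus a column (named by
    `image_column`) holding space-separated pixel values for a square image.
    """
    reader = csv.DictReader(source)
    images, landmarks = [], []
    for row in reader:
        pixels = np.array(row.pop(image_column).split(), dtype=np.float64)
        side = int(np.sqrt(pixels.size))  # images are assumed to be square
        images.append(pixels.reshape(side, side))
        # Every remaining column is treated as a landmark coordinate.
        landmarks.append([float(value) for value in row.values()])
    return np.array(images), np.array(landmarks)


# Tiny in-memory stand-in for training.csv: two 2 x 2 "images", one landmark.
sample = io.StringIO(
    "nose_tip_x,nose_tip_y,Image\n"
    "0.5,1.5,0 64 128 255\n"
    "1.0,0.5,255 128 64 0\n"
)
images, landmarks = load_landmark_csv(sample)
print(images.shape, landmarks.shape)
```

For the real file you would pass an open file handle instead, e.g. `with open("training.csv", newline="") as f: images, landmarks = load_landmark_csv(f)`.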
If you are interested in a larger version of this dataset, you can obtain the complete and original set in the publicly available BioID database.
We are going to construct a similar neural network to that described in the previous posts http://www.tinhatben.com/2016/09/neural-networks-and-backpropagation-part-one/ and http://www.tinhatben.com/2016/09/neural-networks-and-backpropagation-part-two/; a simple network with a single hidden layer.
The Input Layer
Each of the images in the dataset has a shape of 96 x 96 pixels; as they are grayscale images and not RGB or CMYK, we do not need to worry about any additional channels. In order to pass each image into the neural network we will need to flatten it into a single vector. Each of our 96 x 96 images will become a vector of 9216 values, giving a 1000 x 9216 input matrix for training.
So, our input layer has 9216 input units, where each unit corresponds to a pixel in each of the sample images.
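The flattening step is a one-liner in NumPy. The sketch below uses an array of zeros as a stand-in for the real images; the variable names are only for illustration.

```python
import numpy as np

# Stand-in for the 1000 grayscale images of 96 x 96 pixels.
images = np.zeros((1000, 96, 96))

# Keep the sample axis and merge the two pixel axes into one.
X = images.reshape(images.shape[0], -1)
print(X.shape)  # (1000, 9216)
```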
The Hidden Layer
It is in the hidden layer that we have the most flexibility to modify the performance of this particular network. We can specify and adjust:
- The number of units in the hidden layer. I would encourage anyone interested to change this value and see the effect on network performance. We will be starting with 100.
- The addition of a bias unit (we will be using one).
- The type of activation function used to provide the hidden layer with non-linearity. In this example we are using tanh, but you can try sigmoid and relu and see the effect on the performance.
- The initialisation of the layer weights; this can be varied but needs to be done with the non-linearity in mind. The wrong choice of initialisation may lead to poor network performance if the outputs of the layer get stuck at the extremes, i.e. -1 or 1. In this example we will use the trusted method of selecting the initial weight values from a random, normal distribution whose limits are set by the number of input units (9216) and the number of hidden units (100).
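To make range-limited random initialisation concrete, the sketch below swaps in one widely used rule for tanh layers: Glorot (Xavier) uniform sampling within plus or minus sqrt(6 / (n_in + n_out)). That rule, the zero-initialised bias and the helper name `init_weights` are all assumptions made for illustration, and they may not match the limits used in the notebook.

```python
import numpy as np


def init_weights(n_in, n_out, rng):
    """Glorot/Xavier uniform initialisation (an assumed choice)."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))


rng = np.random.default_rng(0)
W1 = init_weights(9216, 100, rng)  # input -> hidden weights
b1 = np.zeros(100)                 # hidden layer bias, started at zero (assumption)
print(W1.shape, b1.shape)
```

Keeping the initial weights small like this means the weighted sums start near zero, where tanh is close to linear, rather than out at the flat extremes.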
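So the activation of the hidden layer, $a_1$, is computed from the input $X$, the hidden layer weights $W_1$ and the bias $b_1$ as:
$$a_1 = \tanh(X W_1 + b_1)$$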
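Here is a small sketch of that computation with the non-linearity made swappable, so you can try tanh, sigmoid and relu. The names `ACTIVATIONS` and `hidden_activation`, and the random stand-in data, are assumptions for illustration only.

```python
import numpy as np

ACTIVATIONS = {
    "tanh": np.tanh,
    "sigmoid": lambda z: 1.0 / (1.0 + np.exp(-z)),
    "relu": lambda z: np.maximum(z, 0.0),
}


def hidden_activation(X, W1, b1, kind="tanh"):
    """Weighted sum of the inputs plus the bias, passed through a non-linearity."""
    return ACTIVATIONS[kind](X @ W1 + b1)


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 9216))                   # four fake flattened images
W1 = rng.uniform(-0.01, 0.01, size=(9216, 100))  # placeholder weights
b1 = np.zeros(100)

a1 = hidden_activation(X, W1, b1)                # tanh by default
a1_relu = hidden_activation(X, W1, b1, kind="relu")
a1_sigmoid = hidden_activation(X, W1, b1, kind="sigmoid")
print(a1.shape)
```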
The Output Layer
The aim of the network is to predict the location of 15 (x, y) landmarks, i.e. 30 values, so the output layer must have 30 output units. As with the hidden layer, the output layer contains weights that require initialisation; we will use a similar method to that above, selecting the initial weight values from a random, normal distribution whose limits are set by the number of hidden units (100) and the number of output units (30).
As this is a regression problem and not a classification one (i.e. we are not trying to label the images), we will not attach a non-linearity to the output. So the activation of the output layer is simply the weighted sum of the hidden layer activations.
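Putting the two layers together, a forward pass might look like the sketch below. The function name `forward`, the placeholder weight ranges and the output bias `b2` are assumptions; the post only states explicitly that the hidden layer has a bias unit.

```python
import numpy as np


def forward(X, W1, b1, W2, b2):
    """Single-hidden-layer forward pass: tanh hidden layer, linear output."""
    a1 = np.tanh(X @ W1 + b1)  # hidden activations, shape (n_samples, 100)
    y_hat = a1 @ W2 + b2       # no non-linearity on the output (regression)
    return a1, y_hat


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 9216))                   # four fake flattened images
W1 = rng.uniform(-0.01, 0.01, size=(9216, 100))  # placeholder weights
b1 = np.zeros(100)
W2 = rng.uniform(-0.1, 0.1, size=(100, 30))      # placeholder weights
b2 = np.zeros(30)                                # output bias (assumption)

a1, y_hat = forward(X, W1, b1, W2, b2)
print(a1.shape, y_hat.shape)
```

Each row of `y_hat` holds the 30 predicted values for one image.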
Preparing the Data for Training
Preparing the data correctly is key if the training process is going to be successful. The preparation can vary slightly depending on the network architecture and the non-linearities used. We will employ a common method of shifting the data to have a mean of zero and a standard deviation of approximately 1. This transform will be applied to both the input images and the output coordinates prior to training.
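One way to sketch this transform is below. It assumes a single mean and standard deviation computed over every value in the array, which may differ from the exact scaling used in the notebook; the helper names and fake data are made up for illustration.

```python
import numpy as np


def fit_scaler(data):
    """Return the mean and standard deviation over every value in `data`."""
    return float(data.mean()), float(data.std())


def normalise(data, mean, std):
    """Shift to zero mean and scale to unit standard deviation."""
    return (data - mean) / std


rng = np.random.default_rng(0)
X = rng.uniform(0, 255, size=(10, 9216))  # fake pixel intensities
y = rng.uniform(0, 96, size=(10, 30))     # fake landmark coordinates

x_mean, x_std = fit_scaler(X)
y_mean, y_std = fit_scaler(y)
X_norm = normalise(X, x_mean, x_std)
y_norm = normalise(y, y_mean, y_std)
print(round(float(X_norm.mean()), 6), round(float(X_norm.std()), 6))
```

Hang on to the statistics for the output coordinates, as they are needed later to map predictions back into pixel positions.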
Now we need to randomly split our data into the training set and cross validation set. The training set will be used to set the weights of the model, while the cross validation set will provide an indicator of performance and when to terminate training. We are going to randomly split the data so 70% is used for training and 30% for validation.
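A random 70/30 split can be sketched by shuffling the sample indices and slicing them. The function name, the fixed seed and the stand-in data below are assumptions for illustration.

```python
import numpy as np


def split_train_validation(X, y, train_fraction=0.7, seed=0):
    """Shuffle the samples, then slice them into training and validation sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[0])
    n_train = int(round(train_fraction * X.shape[0]))
    train_idx, val_idx = order[:n_train], order[n_train:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]


# Stand-in data: the sample index is stored in both X and y so that the
# pairing between inputs and targets can be checked after shuffling.
X = np.arange(1000, dtype=np.float64).reshape(-1, 1)
y = np.arange(1000, dtype=np.float64).reshape(-1, 1)

X_train, y_train, X_val, y_val = split_train_validation(X, y)
print(X_train.shape, X_val.shape)  # (700, 1) (300, 1)
```

Using the same index arrays for `X` and `y` keeps each image paired with its own landmarks.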
The cost function provides an indication of how close (or far) the network’s predictions are from the given data set. There are a number of different cost functions which can be used, and they vary between regression problems such as this and classification tasks. We are going to use ol’ faithful mean square error:
$$J = \frac{1}{N} \sum_{i=1}^{N} \lVert \hat{y}_i - y_i \rVert^2$$
Where $N$ is the number of samples in the training or validation set, $\hat{y}_i$ is the predicted landmarks from the model for sample $i$ and $y_i$ is the ground truth landmarks provided in the training / validation set.
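In code, one form of the mean square error sums the squared differences over the output values of each sample and then averages over the samples. Some implementations also average over the outputs or include a factor of one half, so treat the exact scaling here as an assumption.

```python
import numpy as np


def mean_squared_error(y_hat, y):
    """Squared error summed over the outputs, averaged over the samples."""
    return float(np.mean(np.sum((y_hat - y) ** 2, axis=1)))


y_hat = np.array([[1.0, 2.0], [3.0, 4.0]])  # toy predictions
y = np.array([[1.0, 0.0], [0.0, 4.0]])      # toy ground truth
error = mean_squared_error(y_hat, y)
print(error)  # 6.5
```

The per-sample squared errors here are 4 and 9, giving a mean of 6.5.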
Sticking with simplicity, we are going to use standard gradient descent with a learning rate $\alpha$; the learning rate is another parameter that can be modified. Setting $\alpha$ small may allow the model to find the global minimum of the error function, producing the best result, but it may also take a very long time to train. Conversely, setting $\alpha$ too large may prevent the model from finding the global minimum as it jumps over it, preventing convergence. As per the previous posts on neural networks, we will update the weights on each iteration using:
$$W \leftarrow W - \alpha \frac{\partial J}{\partial W}$$
where $J$ is the cost function and $W$ stands for each weight matrix in turn.
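To show how the pieces fit together, here is a compact sketch of full-batch gradient descent with backpropagation for a network with a single tanh hidden layer and a linear output. Several details are assumptions rather than the notebook’s code: Glorot uniform initialisation, zero-initialised biases (including an output bias), a default learning rate of 0.01, a cost that sums squared errors over outputs and averages over samples, and the small synthetic data set used to exercise it.

```python
import numpy as np


def init_weights(n_in, n_out, rng):
    limit = np.sqrt(6.0 / (n_in + n_out))  # assumed Glorot uniform range
    return rng.uniform(-limit, limit, size=(n_in, n_out))


def forward(X, params):
    W1, b1, W2, b2 = params
    a1 = np.tanh(X @ W1 + b1)  # hidden layer
    return a1, a1 @ W2 + b2    # linear output layer


def cost(y_hat, y):
    return float(np.mean(np.sum((y_hat - y) ** 2, axis=1)))


def train(X_train, y_train, X_val, y_val, n_hidden=100, alpha=0.01,
          n_iterations=750, seed=0):
    rng = np.random.default_rng(seed)
    n_in, n_out = X_train.shape[1], y_train.shape[1]
    W1, b1 = init_weights(n_in, n_hidden, rng), np.zeros(n_hidden)
    W2, b2 = init_weights(n_hidden, n_out, rng), np.zeros(n_out)
    train_errors, val_errors = [], []
    for _ in range(n_iterations):
        a1, y_hat = forward(X_train, (W1, b1, W2, b2))
        train_errors.append(cost(y_hat, y_train))
        val_errors.append(cost(forward(X_val, (W1, b1, W2, b2))[1], y_val))
        # Backpropagation: gradient of the cost w.r.t. each layer's input.
        delta2 = 2.0 * (y_hat - y_train) / X_train.shape[0]
        delta1 = (delta2 @ W2.T) * (1.0 - a1 ** 2)  # derivative of tanh
        # Gradient descent step on every parameter.
        W2 -= alpha * (a1.T @ delta2)
        b2 -= alpha * delta2.sum(axis=0)
        W1 -= alpha * (X_train.T @ delta1)
        b1 -= alpha * delta1.sum(axis=0)
    return (W1, b1, W2, b2), train_errors, val_errors


# Small synthetic regression problem, just to exercise the loop.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 20))
y = 0.1 * X @ rng.normal(size=(20, 3))
params, train_errors, val_errors = train(
    X[:60], y[:60], X[60:], y[60:], n_hidden=8, alpha=0.05, n_iterations=200)
print(train_errors[0], train_errors[-1])
```

Both error lists are recorded before each update, so entry `k` is the error of the model after `k` weight updates.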
Now let’s train the network for 750 iterations; what is the best (or smallest) validation error produced?
Minimum validation error 3.124496E-03 @ 749
Training error @ 749: 2.727863E-03
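A summary in this format can be produced from the recorded error curves by locating the iteration with the smallest validation error. The helper below is a hypothetical sketch, and the error values in it are toy numbers, not results from the network.

```python
import numpy as np


def report(train_errors, val_errors):
    """Format the best validation error and the training error at that point."""
    best = int(np.argmin(val_errors))
    return (
        "Minimum validation error %E @ %d" % (val_errors[best], best),
        "Training error @ %d: %E" % (best, train_errors[best]),
    )


# Toy error curves, purely illustrative.
lines = report([0.9, 0.5, 0.4], [1.0, 0.6, 0.7])
for line in lines:
    print(line)
```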
The error curve looks pretty good, with a consistent reduction in both training and validation error. We can see that the training error starts to reduce faster than the validation error at around 700 epochs. This growing gap between the two is a sign of overfitting. As we only have 100 hidden units, overfitting should be limited. We can also see that the error may continue to decrease if we increased the number of training epochs. Again, try increasing or decreasing the number of hidden units and the number of training epochs. What happens to the error curve?
What about landmark predictions?
So we have trained our model… what does this look like in terms of landmark predictions? Using the validation set, we compute the predicted landmarks. The previously determined mean and maximum values are re-applied to shift the data from -1 to 1 space back into image space. The predicted landmarks are then plotted on the images alongside the ground truth landmarks for the validation set.
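Mapping predictions back into image space is just the inverse of the normalisation. The sketch below assumes the coordinates were normalised as (value - mean) / scale and that the 30 outputs are stored as interleaved x, y pairs; the function name and the fake coordinates are for illustration only.

```python
import numpy as np


def to_image_space(y_normalised, mean, scale):
    """Undo (value - mean) / scale and group the outputs into (x, y) pairs."""
    coords = y_normalised * scale + mean
    return coords.reshape(-1, 15, 2)  # assumes x, y values are interleaved


rng = np.random.default_rng(0)
coords = rng.uniform(0, 96, size=(5, 30))  # fake landmark coordinates
mean = coords.mean()
scale = np.abs(coords - mean).max()        # maps the data into [-1, 1]
y_norm = (coords - mean) / scale

points = to_image_space(y_norm, mean, scale)
print(points.shape)  # (5, 15, 2)
```

With matplotlib, `points[i, :, 0]` and `points[i, :, 1]` can then be passed to `plt.scatter` on top of image `i`.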
So… the predictions are OK, but not great; some are close, such as the mouth in Sample 4, and some are totally wrong, e.g. the mouth in Sample 2. Again, try modifying some of the parameters to see if you can improve the predictions. Note that this tutorial is an extremely simplified attempt at solving this problem. We have limited the training data, the complexity of the model and the training methods, so it isn’t a surprise that performance is limited. In future posts we will look at using more advanced methods such as convolutional neural networks as well as different training regimes.
I hope you found this post useful!