This past summer I interned at Flipboard in Palo Alto, California. I
worked on machine learning based problems, one of which was Image
Upscaling. This post will show some preliminary results, discuss our
model and its possible applications to Flipboard’s products.
High quality and a print-like finish play a key role in Flipboard’s
design language. We want users to enjoy a consistent and beautiful
experience throughout all of Flipboard’s content, as if they had a
custom print magazine in hand. Providing this experience consistently is
difficult. Different factors, such as image quality, deeply affect the
overall quality of the presented content. Image quality varies greatly
depending on the image’s source. This varying image quality is
especially apparent in magazines that display images across the whole
page in a full bleed format.
When we display images on either the web or mobile devices they must be above a certain threshold to display well. If we receive a large image on our web product we can create breathtaking full bleed sections.
Full bleed High Quality Image
Full bleed Low Quality Image
Before we go any further I would like to give a high level introduction to both the traditional and convolutional flavours of Neural Networks. If you have a good grasp of them feel free to skip ahead to the next section. Following the introduction to Neural Networks there is a preliminary results section, followed by a discussion of the model architectures, design decisions, and applications.
Note: Smaller nuances of Neural Networks will not be covered in the introduction.
Neural Networks
Neural Networks are an amazing type of model that is able to learn from given data. They have a large breadth of applications and have enjoyed a recent resurgence in popularity in many domains such as computer vision, audio recognition, and natural language processing. Some recent feats include captioning images, playing Atari, aiding self-driving cars, and language translation.
Neural Networks exist in different configurations, such as convolutional and recurrent, that are each good at different types of tasks. Learning ‘modes’ also exist: supervised and unsupervised; we will only focus on supervised learning.
In supervised learning a network is trained on pairs of inputs and their known outputs. This ‘mode’ is used to predict outputs for new inputs. An example of this would be training a network on thousands of pictures of cats and dogs which have been manually labelled and then asking, on a new picture, is this a cat or a dog?
On a structural level a Neural Network is a feedforward graph where each of the nodes, known as units, performs a nonlinear operation on the incoming inputs. Each of the inputs has a weight that the network is able to learn through an algorithm known as backpropagation.
Basic Neural Network
Source: Wikipedia
As mentioned previously the units within a network perform a mathematical operation on the inputs. We can take a closer look by calculating a simple numerical example involving a single unit with a handful of inputs.
Our simple model
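So how do we calculate an output value? To do so we need to compute the weighted sum of the inputs, add the bias, and pass the result through the unit's nonlinear activation function $f$: $\hat{y} = f(\mathbf{w} \cdot \mathbf{x} + b)$.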
To make this example a little more concrete we will use some random numbers. Say we have the following vectors corresponding to our weights, inputs, and scalar bias value:
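As a minimal sketch of this calculation in Python: the numbers below are made up for illustration and the sigmoid activation is an assumption; neither is taken from the figure above.

```python
import math

def sigmoid(a):
    """Logistic activation; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def unit_output(weights, inputs, bias, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    a = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(a)

# Hypothetical values, chosen only so the arithmetic is easy to follow.
w = [0.5, -0.25, 1.0]   # weights
x = [2.0, 4.0, 1.0]     # inputs
b = -1.0                # scalar bias

# Weighted sum: 0.5*2 - 0.25*4 + 1.0*1 = 1.0, plus the bias gives 0.0.
z = unit_output(w, x, b)
print(z)  # sigmoid(0.0) = 0.5
```

Changing any weight or the bias shifts the weighted sum and therefore the output, which is exactly what training will later exploit.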
Why is this exciting? Well, in terms of a single unit, it is not too exciting. As it stands, we can tweak the weights and bias value to model only the most basic of functions. Our little example lacks ‘expressive power’. In order to increase our ‘expressive power’ we can chain and link units together to form larger networks as seen below.
Our bigger network
To update the weights and biases of our network we will use an algorithm known as backpropagation. We will focus on a supervised approach where our network is given pairs of data: input x and the desired output y. In order to use backpropagation in this supervised manner we need to quantify our network’s performance. We must define an error that compares the result of forward propagation through our network, $\hat{y}$, with the desired output y.
This comparison between values is formally known as a cost function. It can be defined however you want but for this post we will use the mean squared error function:
$$E_{MSE}(x, y) = \frac{1}{n} \left| \hat{y}(x) - y \right|^2$$
Backpropagation takes this error and propagates it from the back of the network to the front, adjusting the weights and biases along the way. The amount of adjustment to each weight and bias can be thought of as its contribution to the error, given by the gradient of the error with respect to that weight or bias. Gradient descent then uses these gradients to minimize the error function above by changing the weights and biases.
The general process to train a Neural Network, given an input vector x and the expected output y, is as follows:
- Perform a step of forward propagation through the network with input vector x. We will calculate an output $\hat{y}$.
- Calculate the error with our error function, $E_{MSE}(x, y) = \frac{1}{n} \left| \hat{y}(x) - y \right|^2$.
- Backpropagate the errors through our network updating the weights and biases.
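The three steps above can be sketched for the smallest possible case: a single linear unit with a scalar output, so that n = 1 in the error function. The toy data, learning rate, and function names are illustrative assumptions; a real network would repeat the same chain rule step layer by layer.

```python
def forward(w, b, x):
    """Forward propagation for one linear unit: y_hat = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def squared_error(y_hat, y):
    """Mean squared error for a single scalar output (n = 1)."""
    return (y_hat - y) ** 2

def train_step(w, b, x, y, lr=0.1):
    """One forward pass, one error calculation, one gradient descent update."""
    y_hat = forward(w, b, x)
    error = squared_error(y_hat, y)
    # dE/dy_hat = 2 * (y_hat - y); the chain rule then gives
    # dE/dw_i = dE/dy_hat * x_i and dE/db = dE/dy_hat.
    grad = 2.0 * (y_hat - y)
    new_w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
    new_b = b - lr * grad
    return new_w, new_b, error

# Hypothetical toy data generated from y = 2*x1 - x2 + 0.5.
data = [([0.0, 0.0], 0.5), ([1.0, 0.0], 2.5),
        ([0.0, 1.0], -0.5), ([1.0, 1.0], 1.5)]

w, b = [0.0, 0.0], 0.0
errors = []
for epoch in range(200):
    total = 0.0
    for x, y in data:
        w, b, e = train_step(w, b, x, y)
        total += e
    errors.append(total / len(data))

print(w, b, errors[-1])
```

After training, the weights and bias should sit very close to the values used to generate the data, and the average error per epoch should have fallen steadily.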
Convolutional Neural Networks
If you have been paying attention to recent tech articles you will most likely have heard of Neural Networks beating the state of the art in several domains. These breakthroughs are due in no small part to convolutional Neural Networks.
Convolutional Neural Networks (convnets) are a slightly different flavour of the typical feed-forward Neural Network. Convnets take some biological inspiration from the visual cortex, which contains small regions of cells that are sensitive to subregions of the visual field. Such a subregion is referred to as a receptive field. We can mimic this small subfield by learning weights in the form of matrices, referred to as ‘kernels’, which, like their biologically inspired counterparts, are sensitive to similar subregions of an image. We now require a way to express the similarity between our kernel and the subregion. Since the convolution operation essentially returns a measure of ‘similarity’ between two signals we can happily pass a learned kernel, along with a subregion of the image, through this operation and have a measure of similarity returned!
Below is an animation of a kernel, in yellow, being convolved over an image, in green, with the result of the operation on the right in red.
Animation of the Convolution Operation.
Source: Feature extraction using convolution from Stanford Deep Learning
Animation of the Convolution Operation.
Source: Edge Detection with Matrix Convolution
This allows us to extract information from the right image above. The kernels inform us of the presence and location of directional edges. When used in a Neural Network we can learn the appropriate weight settings of the kernels to extract basic features such as edges, gradients and blurs. If the network is deep, with enough convolutional layers, it will start learning combinations of the features from the previous layers. The simple building blocks of edges, gradients, and blurs become things like eyes, noses, and hair in later layers!
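To make the edge detection idea concrete, here is a minimal sketch of a kernel being slid over an image. The tiny image is hypothetical and the kernel is a fixed Sobel filter rather than a learned one; like most convnet libraries, the function does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the
    elementwise products at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Hypothetical 4x6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Sobel kernel that responds to left-to-right brightness changes,
# i.e. vertical edges.
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

response = convolve2d_valid(image, sobel_x)
for row in response:
    print(row)
# [0, 4, 4, 0]
# [0, 4, 4, 0]
```

The response is zero over the flat regions and large only where the window straddles the dark-to-bright boundary, which is the "presence and location" information described above.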
Kernels building high level representations from earlier layers.
Source: Yann Lecun “ICML 2013 tutorial on Deep Learning”
Image Scaling using Convolutional Neural Networks
Below is a collection of preliminary results that were produced from the model. The left image is the original ‘high’ resolution image. This is the ground truth and what we would hope to get with perfect reconstruction. We scale this original down by a factor of 2 and then scale it back up using either a bicubic scaling algorithm or the model. The results are in the center and right positions respectively.
[Five sets of comparison images, each showing Original, Bicubic, and Model side by side]
Architecture
Below is one of the architectures used. The primary goal is to double the number of pixels taken in from the image. The architecture is an 8 layer Neural Network composed of three convolutional layers, each shown as stacked pinkish blocks, and four fully connected layers colored in blue. Each layer uses the rectified linear activation function. There is a final dense layer with linear Gaussian units which is not shown below.
No pooling operations are used after any of the convolutional layers. While pooling is useful for classification tasks, where invariance to input is important, here the location of the features detected by each kernel matters. Pooling also discards too much useful information, which is the opposite of what is needed for this use case.
The weights were all initialized using the Xavier initialization suggested by Glorot & Bengio, which was then slightly tweaked during hyperparameter optimization. This is defined as drawing the weights of a layer with variance $\mathrm{Var}(W) = \frac{2}{n_{in} + n_{out}}$, where $n_{in}$ and $n_{out}$ are the number of inputs and outputs of the layer.
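A minimal sketch of the standard uniform form of this initializer follows. The function name and layer sizes are hypothetical, and the tweaks made during hyperparameter optimization are not specified, so they are not reflected here.

```python
import math
import random

def glorot_uniform(n_in, n_out, rng):
    """Return an n_in x n_out weight matrix drawn from U(-limit, limit) with
    limit = sqrt(6 / (n_in + n_out)), which gives each weight a variance
    of 2 / (n_in + n_out)."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_in, n_out = 256, 128          # hypothetical layer sizes
weights = glorot_uniform(n_in, n_out, rng)

# Empirical check: the sample variance should be close to 2 / (n_in + n_out).
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
print(variance, 2.0 / (n_in + n_out))
```

Scaling the range by the layer's fan-in and fan-out keeps the size of the activations and gradients roughly constant from layer to layer at the start of training.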
Dataset
The network was trained on a large dataset of approximately 3 million samples. The dataset consisted of natural images, including those of animals and outdoor scenes. Some images needed to be filtered out of this set as they included illustrations of animals or text. As the images varied in size and quality, a constraint was added to keep only images with a total pixel count of 640,000 and up.
Each sample within the dataset is a low and high resolution image pair. The low resolution image, the ‘x’ input, was created by downscaling a high resolution image by a certain factor, while the desired output, ‘y’, was the original high resolution image. Very mild noise and distortions were added to the input data. The data was normalized to have zero mean by subtracting the global mean and to unit variance by dividing through by the standard deviation of the dataset.
The dataset was divided into subsets of training, testing, and validation, following an 80%, 10%, and 10% split respectively.
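The data preparation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline that was actually used: the 2x2 box filter, the tiny random grayscale "images", normalizing only the inputs, and all function names are hypothetical, and the noise and distortion step is left out.

```python
import random

def downscale_2x(image):
    """Halve each dimension by averaging non-overlapping 2x2 blocks
    (an assumed filter; the downscaling method is not specified)."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * r][2 * c] + image[2 * r][2 * c + 1]
              + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(w)] for r in range(h)]

def global_mean_std(images):
    """Mean and standard deviation over every pixel in the dataset."""
    values = [v for img in images for row in img for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var ** 0.5

def normalize(image, mean, std):
    """Shift to zero mean and scale to unit variance."""
    return [[(v - mean) / std for v in row] for row in image]

def split_dataset(samples, rng, train_frac=0.8, test_frac=0.1):
    """Shuffle, then cut into training, testing, and validation subsets."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])

rng = random.Random(0)
# Stand-ins for real photographs: 20 random 4x4 grayscale images.
high_res = [[[rng.random() for _ in range(4)] for _ in range(4)]
            for _ in range(20)]
low_res = [downscale_2x(img) for img in high_res]
mean, std = global_mean_std(low_res)
pairs = [(normalize(small, mean, std), big)
         for small, big in zip(low_res, high_res)]
train, test, validation = split_dataset(pairs, rng)
print(len(train), len(test), len(validation))  # 16 2 2
```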
Regularization
Max Norm
Max norm constraints enforce an absolute upper bound on the magnitude of the weight vector of every unit in a layer, which stops the network’s weights from ‘exploding’ if the gradient update is too large. Max norm constraints are used in all layers except the final linear Gaussian layer. An aggressive magnitude was used in all of the convolutional layers while the other layers’ magnitudes were much more lax.
L2
L2 regularization penalizes the network for using large weight vectors by adding a term proportional to the squared magnitude of the weights to the cost function.
Dropout
Dropout randomly ‘drops’ units from a layer on each training step, creating ‘sub-architectures’ within the model. It can be viewed as a type of sampling of a smaller network within a larger network. Only the weights of the included units are updated, which makes sure the network does not become too reliant on a few units. This process of removing units happens over and over between runs, so the units being included change. The convolutional layers all have a high inclusion probability of almost 1.0 while the last two fully connected layers include about half the units.
Training
The model was trained using stochastic gradient descent in batches of 250 over the entire training set for ~250 epochs. A fairly large batch size was used to smooth out the updates and make better use of the GPUs while still getting some benefit from the perturbations of smaller batches.
The network is trained to minimize a mean squared error function. A learning rate scale was used on all weights and biases. The formula for the weights (per layer) was
Amazon g2.2xlarge EC2 instances were used to train the network with NVIDIA’s cuDNN library added in to speed up training. Training the final model took approximately 19 hours.
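The regularizers from the previous section and a single weight update can be sketched in a few lines. This is a minimal illustration, not the training code that was used: the function names, the constants, and the choice of "inverted" dropout (rescaling the kept units during training) are all assumptions.

```python
import math
import random

def apply_max_norm(weights, max_norm):
    """Rescale one unit's weight vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm > max_norm:
        scale = max_norm / norm
        return [w * scale for w in weights]
    return list(weights)

def l2_penalty(weights, lam):
    """L2 term added to the cost: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def dropout_mask(n_units, include_prob, rng):
    """Keep each unit with probability include_prob. Kept units are scaled
    by 1 / include_prob so the expected activation is unchanged (assumed
    'inverted dropout' variant)."""
    return [1.0 / include_prob if rng.random() < include_prob else 0.0
            for _ in range(n_units)]

def sgd_step(weights, grads, lr, lam, max_norm):
    """One SGD update with the L2 gradient (2 * lam * w) added to the data
    gradient, followed by the max norm constraint."""
    updated = [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, grads)]
    return apply_max_norm(updated, max_norm)

rng = random.Random(0)

# A weight vector of norm 5 is pulled back onto the max norm ball of radius 2.
clipped = apply_max_norm([3.0, 4.0], max_norm=2.0)

# With zero data gradient, the L2 term alone shrinks the weights by 10%.
decayed = sgd_step([3.0, 4.0], grads=[0.0, 0.0], lr=0.1, lam=0.5, max_norm=10.0)

# An inclusion probability of 0.5, as used for the last fully connected layers.
mask = dropout_mask(1000, include_prob=0.5, rng=rng)
kept = sum(1 for m in mask if m > 0.0)
print(clipped, decayed, kept)
```

In a full training loop the gradients passed to `sgd_step` would be averaged over a batch of samples, and the dropout mask would be redrawn on every step.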
Hyperparameters
The majority of the hyperparameters were selected using an in-house hyperparameter optimization library that works over clusters of Amazon g2.2xlarge instances. This was performed using a portion of the training dataset and the validation dataset. The process took roughly 4 weeks and evaluated ~500 different configurations.
Variations
Some things that did not work out well while working on this problem:
- Used a larger batch size of 1000. This worked well but ran up against local minima quickly. The jitter provided by a smaller batch was useful for bouncing out of these minima.
- Used a small convolutional network. This was alright but did not generalize as well as the larger convolutional network.
- Tried to use the weight initialization formula suggested by He et al.: $\frac{2}{n}$. Unfortunately this caused the network to sputter around and it failed to learn. This might be down to this specific configuration, as many people have used it successfully.
- Used the same amount of L2 regularization on all layers. It worked much better to vary the L2 regularization based on which layers started saturating or were clamped against their max norm constraints.
- Used pooling on the layers. Too much information was lost between layers, and the images turned out grainy and poor looking.
Applications
Our goal was not to remove or replace the need for other upscaling algorithms, such as bicubic upscaling, but to try to improve quality using different technology and avenues. Our primary use case was to scale lower resolution images up when no higher resolution images are available. This happens occasionally across our platforms.
Besides the primary use case of still images this technique can be applied to different media formats such as GIFs. The GIF could be split into its separate frames, then scaled up, and then repackaged.
The final use case that we thought of was saving bandwidth. A smaller image could be sent to the client, which would run a client side version of this model to produce a larger image. This could be accomplished using a custom solution or one of the JavaScript implementations of neural networks available, such as ConvNetJS.
Further steps
We feel this problem space has a lot of potential and there are many things to try, including some wild ideas, such as:
- Larger filter sizes in the convolutional layers.
- Try more layers and data.
- Try different color channel formats instead of RGB.
- Try using hundreds of filters in the first convolutional layer, sample from them using dropout with a very small inclusion probability and try tweaking the learning rate of the layer.
- Ditch the fully connected layers and try using all convolutional layers.
- Curious if distillation would work with this problem. Might help create a lighter version to run on client devices easily.
- Look into how small/large we can make the network before the quality starts degrading.
Conclusion
Pursuing high fidelity presentation is difficult. As with any endeavor, it takes an exceeding amount of effort to squeeze out those final few percentage points of quality. We are constantly reflecting on our product to see where those percentage points can come from, even if they don’t seem obvious or possible at first. While this won’t have its place everywhere within our product, we feel it was a good first step toward improving quality.
I hope you enjoyed reading through this post and have taken something interesting away from it. I would like to thank everyone at Flipboard for an outstanding internship experience. I have learnt a lot, met many awesome people, and gained invaluable experience in the process.
If machine learning, large datasets, and working with great people on interesting projects excite you then feel free to apply, we are hiring!