This blog has not had any posts for a while. The main reasons are that I have been busy reading a lot of stuff (but did not have any additional comments on the books I read to blog about) and I have been working on a side project to get some experience building a machine learning framework from scratch. In the process of building it, I have learned a lot of things that I feel could be useful to someone trying a similar endeavor. So I have decided to blog about it.
The framework is called minerva (after the Roman goddess of wisdom). The source code can be found on my GitHub page. My husband and I were looking for a project idea with the following attributes:
- A generic framework for a classification system that could handle various types of data
- A high performance framework
- Easily scalable
- Implements some ideas that have gained interest in the field of ‘deep learning’, a.k.a. automatic feature selection
We decided to implement a configurable neural network library from scratch in C++. We wanted this neural network to be able to perform automatic feature selection. After reading some research papers on automatic feature selection, the sparse autoencoder technique seemed promising enough to try out. The dataset we originally used for this project came from a Kaggle competition on multi-modal gesture recognition from videos. We used this dataset to set up and pipe-clean the flow (which supports image and video inputs). We tried our best to design a system with clean interfaces, so that supporting different libraries (matrix, image, etc.) would be seamless. This came in handy when we decided to add support for running minerva on GPUs: we simply plugged in support for GPU matrix libraries.
So, how do you use minerva? Well, it isn’t as easy as pressing a button or running an executable or script (although I plan to add a wrapper script in the near future to make this simpler). For now, there are four main steps.
- Creation of the neural network – In this step, we create an initial model of a neural network.
- Unsupervised Learning – In this step, we run the model from the previous step on a lot of input data in ‘unsupervised learning’ mode. The result of this step is a feature selector neural network. More details follow.
- Supervised Learning – In this step, we run the feature selector model generated in the previous step in ‘supervised learning’ mode on labeled training data. The result of this step is a classifier neural network.
- Classify (Test) – In this step, we run test data through the classifier neural network and generate output labels for the test data.
1. Creation of the neural network model
The neural network is created from a configuration specified on the command line. It is then serialized and written to disk as a compressed tgz file. The archive contains a JSON file describing attributes of the network, such as the number of layers and the number of neurons, along with a set of binary files containing the randomly initialized matrices that together form the neural network.
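To make the idea of "randomly initialized matrices" concrete, here is a minimal sketch of building one weight matrix per pair of adjacent layers. The `Matrix` and `NetworkConfig` types and the `createInitialNetwork` function are hypothetical names made up for this illustration; they are not minerva's actual classes, and the fan-in-scaled uniform range is just one common initialization heuristic, assumed here rather than taken from the library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not minerva's real classes.
struct Matrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<float> data; // row-major storage, rows * cols entries
};

struct NetworkConfig {
    std::vector<std::size_t> layerSizes; // neurons per layer, input layer first
    unsigned seed = 0;                   // seed for reproducible initialization
};

// Builds one randomly initialized weight matrix per pair of adjacent layers.
std::vector<Matrix> createInitialNetwork(const NetworkConfig& config) {
    assert(config.layerSizes.size() >= 2 && "need at least an input and an output layer");

    std::mt19937 generator(config.seed);
    std::vector<Matrix> weights;

    for (std::size_t layer = 0; layer + 1 < config.layerSizes.size(); ++layer) {
        const std::size_t inputs = config.layerSizes[layer];
        const std::size_t outputs = config.layerSizes[layer + 1];
        assert(inputs > 0 && outputs > 0);

        // Assumed heuristic: small symmetric range scaled by the fan-in.
        const float limit = 1.0f / std::sqrt(static_cast<float>(inputs));
        std::uniform_real_distribution<float> distribution(-limit, limit);

        Matrix matrix;
        matrix.rows = outputs;
        matrix.cols = inputs;
        matrix.data.resize(outputs * inputs);
        for (float& value : matrix.data) {
            value = distribution(generator);
        }
        weights.push_back(std::move(matrix));
    }
    return weights;
}
```

With layer sizes of 64, 16 and 4, this produces two matrices of shape 16x64 and 4x16. Serializing them alongside a description of the layer sizes would correspond to the archive described above.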
2. Unsupervised Learning
Unsupervised learning (or ‘deep learning’) is the process of automatically discovering patterns in large data sets without human interaction. Without unsupervised learning, we would need to manually select features that indicate how a particular input should be classified. Doing so requires manually labeling thousands or millions of inputs (video frames) in order to perform classification (identify what an image contains). Additionally, the result would need to be robust with respect to variations in color, alignment, and other noise. It is not feasible for most projects to devote the resources required to gather such a large number of labeled inputs.
In order to deal with this problem, many projects rely on building domain-specific feature selection systems. Such feature selectors preprocess the input data and produce a set of features that capture the essential information from the input data set in a significantly smaller representation. Feature selection reduces the dependence on labeled data by simplifying the problem: the class of an input is determined using only the most relevant information instead of all of it. SIFT is an example of a widely used feature selection system. Most manually crafted feature selection systems have a fatal flaw: they are designed by developers and tailored to specific problems. As a consequence, they are often brittle (because they are designed using developer or domain expert intuition) and require a complete redesign when moving from one type of input data to another (e.g. from video to text).
We decided to implement a sparse autoencoder technique for our unsupervised learning step. This technique attempts to address the shortcomings of feature selection systems by providing a framework to automatically generate features. At a high level, it uses an artificial neural network that is configured in a specific topology and trained with unlabeled data. The topology guides the inner layers of the network to respond to patterns in the input data such that the majority of the information in the data is captured (the essence of the data, so to speak). After training on unlabeled data, the inner layers of this network can be used directly as features; in other words, the network itself becomes a feature selection system.
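For readers who have not seen one before, here is a minimal sketch of the kind of cost function a sparse autoencoder minimizes: a reconstruction error plus a penalty that pushes the mean activation of each hidden unit towards a small target value. This uses the commonly taught KL-divergence form of the sparsity penalty, which is an assumption on my part for illustration and not necessarily the exact formulation minerva uses. All function and parameter names (`klDivergence`, `rho`, `beta`, and so on) are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// KL divergence between a target mean activation rho and an observed mean
// activation rhoHat, both treated as Bernoulli parameters in (0, 1).
double klDivergence(double rho, double rhoHat) {
    assert(rho > 0.0 && rho < 1.0);
    assert(rhoHat > 0.0 && rhoHat < 1.0);
    return rho * std::log(rho / rhoHat)
         + (1.0 - rho) * std::log((1.0 - rho) / (1.0 - rhoHat));
}

// Sum of the sparsity penalty over all hidden units, where
// meanActivations[j] is the mean activation of hidden unit j over a batch.
double sparsityPenalty(const std::vector<double>& meanActivations, double rho) {
    double penalty = 0.0;
    for (double rhoHat : meanActivations) {
        penalty += klDivergence(rho, rhoHat);
    }
    return penalty;
}

// Half the squared reconstruction error, averaged over the batch.
double reconstructionError(const std::vector<std::vector<double>>& inputs,
                           const std::vector<std::vector<double>>& reconstructions) {
    assert(!inputs.empty() && inputs.size() == reconstructions.size());
    double total = 0.0;
    for (std::size_t n = 0; n < inputs.size(); ++n) {
        assert(inputs[n].size() == reconstructions[n].size());
        for (std::size_t i = 0; i < inputs[n].size(); ++i) {
            const double difference = reconstructions[n][i] - inputs[n][i];
            total += 0.5 * difference * difference;
        }
    }
    return total / static_cast<double>(inputs.size());
}

// Total cost: reconstruction error plus beta times the sparsity penalty.
double sparseAutoencoderCost(const std::vector<std::vector<double>>& inputs,
                             const std::vector<std::vector<double>>& reconstructions,
                             const std::vector<double>& meanActivations,
                             double rho, double beta) {
    return reconstructionError(inputs, reconstructions)
         + beta * sparsityPenalty(meanActivations, rho);
}
```

The penalty is zero when every hidden unit's mean activation equals the target `rho` and grows as units become active more often than that, which is what encourages each inner-layer neuron to respond to only a few patterns.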
The following image includes a visualization of some of the low-level features (inputs that a pooling layer neuron maximally responds to) that were learned by minerva after being presented with 10,000 random images.
These compare favorably to the Gabor filter function responses in the top-left of the following figure, which have been found in the first layers of the visual cortex in mammals and perform well at visual classification tasks:
The accompanying figure on the top-right shows the corresponding spatial domain representation (impulse response) of each of the Gabor filters.
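As a point of reference for what such an impulse response looks like numerically, here is a minimal sketch that samples the standard real-valued Gabor function (a Gaussian envelope multiplied by a cosine carrier) on a square pixel grid. The `GaborParameters` struct and `makeGaborKernel` function are hypothetical names for this illustration and are not part of minerva.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle for illustration only.
struct GaborParameters {
    double wavelength;  // lambda: wavelength of the cosine carrier, in pixels
    double orientation; // theta: rotation of the filter, in radians
    double phase;       // psi: phase offset of the carrier, in radians
    double sigma;       // standard deviation of the Gaussian envelope, in pixels
    double aspectRatio; // gamma: ellipticity of the envelope
};

// Samples the real part of a Gabor function on a (2 * radius + 1)^2 grid
// centered on the origin. The result is stored in row-major order.
std::vector<double> makeGaborKernel(int radius, const GaborParameters& p) {
    assert(radius >= 0);
    assert(p.wavelength > 0.0 && p.sigma > 0.0);

    const double pi = std::acos(-1.0);
    const int size = 2 * radius + 1;
    std::vector<double> kernel(static_cast<std::size_t>(size) * static_cast<std::size_t>(size));

    for (int y = -radius; y <= radius; ++y) {
        for (int x = -radius; x <= radius; ++x) {
            // Rotate the coordinates by the filter orientation.
            const double xr = x * std::cos(p.orientation) + y * std::sin(p.orientation);
            const double yr = -x * std::sin(p.orientation) + y * std::cos(p.orientation);

            const double envelope = std::exp(
                -(xr * xr + p.aspectRatio * p.aspectRatio * yr * yr)
                / (2.0 * p.sigma * p.sigma));
            const double carrier = std::cos(2.0 * pi * xr / p.wavelength + p.phase);

            const std::size_t index =
                static_cast<std::size_t>(y + radius) * static_cast<std::size_t>(size)
                + static_cast<std::size_t>(x + radius);
            kernel[index] = envelope * carrier;
        }
    }
    return kernel;
}
```

Varying the orientation and wavelength yields a bank of edge-like filters, and it is these oriented patterns that the learned low-level features resemble.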
We mainly referred to the following resources on deep learning during the course of this project:
3. Supervised Learning
In this step, we use the neural network from the earlier unsupervised learning stage as the starting point (instead of a randomly initialized neural network). At the start, this neural network is already capable of selecting features. This training step uses labeled data: images (which are snapshots of successive video frames) tagged with gesture names. We feed the labeled data into the network and calibrate it using backpropagation against the expected output labels. Once the neural network output is within the specified threshold (tolerance), the resulting classifier neural network is written out to a file. It is now ready to be used for testing with images.
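The "train until the output is within tolerance" loop can be sketched in a heavily simplified form. The example below trains a single sigmoid neuron with gradient descent, starting from weights passed in by the caller (standing in for the pre-trained network), and stops once the largest output error drops below a tolerance. For a single neuron, backpropagation reduces to the one-line delta computation shown. Everything here (`Sample`, `TrainingResult`, `trainUntilTolerance`, the cross-entropy-style update) is a hypothetical illustration under those assumptions, not minerva's training code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only.
struct Sample {
    std::vector<double> features;
    double label; // expected output: 0.0 or 1.0
};

struct TrainingResult {
    std::vector<double> weights; // one weight per feature, bias stored last
    double error = 0.0;          // largest absolute output error over the samples
    std::size_t iterations = 0;
    bool converged = false;
};

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Output of a single sigmoid neuron; the last weight is the bias.
double predict(const std::vector<double>& weights, const std::vector<double>& features) {
    assert(weights.size() == features.size() + 1);
    double z = weights.back();
    for (std::size_t i = 0; i < features.size(); ++i) {
        z += weights[i] * features[i];
    }
    return sigmoid(z);
}

double largestError(const std::vector<double>& weights, const std::vector<Sample>& samples) {
    double worst = 0.0;
    for (const Sample& sample : samples) {
        worst = std::max(worst, std::fabs(predict(weights, sample.features) - sample.label));
    }
    return worst;
}

// Gradient descent from the given starting weights until every output is
// within `tolerance` of its label, or until maxIterations is reached.
TrainingResult trainUntilTolerance(std::vector<double> startingWeights,
                                   const std::vector<Sample>& samples,
                                   double learningRate, double tolerance,
                                   std::size_t maxIterations) {
    assert(!samples.empty());
    TrainingResult result;
    result.weights = std::move(startingWeights);
    result.error = largestError(result.weights, samples);

    while (result.error > tolerance && result.iterations < maxIterations) {
        std::vector<double> gradient(result.weights.size(), 0.0);
        for (const Sample& sample : samples) {
            // Delta for a sigmoid output with cross-entropy loss (assumed here).
            const double delta = predict(result.weights, sample.features) - sample.label;
            for (std::size_t i = 0; i < sample.features.size(); ++i) {
                gradient[i] += delta * sample.features[i];
            }
            gradient.back() += delta;
        }
        for (std::size_t i = 0; i < result.weights.size(); ++i) {
            result.weights[i] -= learningRate * gradient[i] / static_cast<double>(samples.size());
        }
        result.error = largestError(result.weights, samples);
        ++result.iterations;
    }
    result.converged = result.error <= tolerance;
    return result;
}
```

In the real flow the starting weights would come from the feature selector network rather than from zeros, which is the whole point of the unsupervised step: the supervised stage only has to adjust a network that already responds to useful patterns.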
The following image shows the neural network input that produces the maximum response for a neuron trained using minerva to recognize images of cats.
4. Classify (Test)
In this step, we use both neural networks, namely the feature selector from the unsupervised learning step and the classifier neural network from the supervised learning step, to classify test images.
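Conceptually, classification chains the two networks: the feature selector turns raw input into features, the classifier turns features into one score per label, and the label with the highest score wins. Here is a minimal sketch of that chaining. The `Network` alias is a hypothetical stand-in (any callable mapping a vector to a vector) and does not reflect minerva's real interfaces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

using Vector = std::vector<double>;

// Hypothetical stand-in for a trained network: any callable that maps an
// input vector to an output vector.
using Network = std::function<Vector(const Vector&)>;

// Runs the feature selector, feeds its features to the classifier, and
// returns the index of the label with the strongest output.
std::size_t classify(const Network& featureSelector,
                     const Network& classifier,
                     const Vector& input) {
    const Vector features = featureSelector(input);
    const Vector scores = classifier(features);
    assert(!scores.empty());

    const auto best = std::max_element(scores.begin(), scores.end());
    return static_cast<std::size_t>(std::distance(scores.begin(), best));
}
```

Keeping the two stages behind the same small interface means either one can be swapped out (for example, for a GPU-backed implementation) without touching the code that drives the test run.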
In the next few posts I plan to explain the details and rationale behind the design decisions we made to accomplish the four goals I mentioned at the beginning of this post. Here is how I tentatively plan to partition the topics:
- Sparsity for unsupervised learning
- Support for configurable convolutional neural networks (tiling the pixels to capture spatial locality)
- Various principles to ensure that we truly built a high performance architecture from the start
- Evaluation of various optimizers for minimizing the cost function (required when calibrating the network with backpropagation)
- The various libraries supporting the framework, namely those for linear algebra, video, images, and serialization