Serving Up Chicago-Style Deep Learning.

Latest post:

The building blocks of Deep Learning 21 Nov 2015

© 2015. All rights reserved.

Hinton's Dark Knowledge

On Thursday, October 2, 2014 Geoffrey Hinton gave a talk (slides, video) on what he calls “dark knowledge” which he claims is most of what deep learning methods actually learn. The talk presented an idea that had been introduced in (Caruana, 2006) where a more complex model is used to train a simpler, compressed model. The main point of the talk introduces the idea that classifiers built from a softmax function have a great deal more information contained in them than just a classifier; the correlations in the softmax outputs are very informative. For example, when building a computer vision system to detect cats,dogs, and boats the output entries for cat and dog in a softmax classifier will always have more correlation than cat and boat since cats look similar to dogs.

Dark knowledge was used by Hinton in two different contexts:

  • Model compression: using a simpler model with fewer parameters to match the performance of a larger model.
  • Specialist Networks: training models specialized to disambiguate between a small number of easily confuseable classes.


A deep neural network typically maps an input vector \( \mathbf{x}\in\mathbb{R}^{D _ {in}} \) to a set of scores \(f(\mathbf{x}) \in \mathbb{R}^C \) for each of \( C \) classes. These scores are then interpreted as a posterior distribution over the labels using a softmax function

The parameters of the entire network are collected in \( \boldsymbol{\Theta} \). The goal of the learning algorithm is to estimate \( \boldsymbol{\Theta} \). Usually, the parameters are learned by minimizing the log loss for all training samples

which is the negative of the log-likelihood of the data under the logistic regression model. The parameters \( \boldsymbol{\Theta} \) are estimated with iterative algorithms since there is no closed-form solution.

This loss function may be viewed as a cross entropy between an empirical posterior distribution and a predicted posterior distribution given by the model. In the case above, the empirical posterior distribution is simply a 1-hot distribution that puts all its mass at the ground truth label. This cross-entropy view motivates the dark knowledge training paradigm, which can be used to do model compression.

Model compression

Instead of training the cross entropy against the labeled data one could train it against the posteriors of a previously trained model. In Hinton’s narrative, this previous model is an ensemble method, which may contain many large deep networks of similar or various architectures. Ensemble methods have been shown to consistently achieve strong performance on a variety of tasks for deep neural networks. However, these networks have a large number of parameters, which makes it computationally demanding to do inference on new samples. To alleviate this, after training the ensemble and the error rate is sufficiently low, we use the softmax outputs from the ensemble method to construct training targets for the smaller, simpler model.

In particular, for each data point \( \mathbf{x} _ n \), our first bigger ensemble network may make the prediction

The idea is to train the smaller network using this output distribution rather than the true labels. However, since the posterior estimates are typically low entropy, the dark knowledge is largely indiscernible without a log transform. To get around this, Hinton increases the entropy of the posteriors by using a transform that “raises the temperature” as

where \( T \) is a temperature parameter that when raised increases the entropy. We now set our target distributions as

The loss function becomes

Hinton mentioned that the best results are achieved by combining the two loss functions. At first, we thought he meant alternating between them, as in train one batch with \( L ^ \mathrm{(hard)} \) and the other with \( L ^ \mathrm{(soft)} \). However, after a discussion with a professor that also attended the talk, it seems as though Hinton took a convex combination of the two loss functions

where \( \alpha \) is a parameter. This professor had the impression that an appropriate value was \( \alpha = 0.9 \) after asking Hinton about it.

One of the main settings for where this is useful is in the context of speech recognition. Here an ensemble phone recognizer may achieve a low phone error rate, but it may be too slow to process user input on the fly. A simpler model replicating the ensemble method, however, can bring some of the classification gains of large-scale ensemble deep network models to practical speech systems.

Specialist networks

Specialist networks are a way of using dark knowledge to improve the performance of deep network models regardless of their underlying complexity. They are used in the setting where there are many different classes. As before, deep network is trained over the data and each data point is assigned a target that corresponds to the temperature adjusted softmax output. These softmax outputs are then clustered multiple times using k-means and the resultant clusters indicate easily confuseable data points that come from a subset of classes. Specialist networks are then trained only on the data in these clusters using a restricted number of classes. They treat all classes not contained in the cluster as coming from a single “other” class. These specialist networks are then trained using alternating one-hot, temperature-adjusted technique. The ensemble method constructed by combining the various specialist networks creates benefits for the overall network.

One technical hiccup created by the specialist network is that the specialist networks are trained using different classes than the full network so combining the softmax outputs from multiple networks requires a combination trick. Essentially there is an optimization problem to solve: ensure that the catchall “dustbin” classes for the specialist networks match a sum of your softmax outputs. So that if you have cars and cheetahs grouped together in one class for your dog detector you combine that network with your cars versus cheetahs network by ensuring the output probabilities for cars and cheetahs sum to a probability similar to the catch-all output of the dog detector.

GNU Parallel

I was reading the ImageNet tutorial for Caffe (a deep learning framework), in which they need to resize a large number of images. It struck me that they might not be aware of GNU Parallel, since it is a great tool for this task. I recommend it to any data scientist out there since it is so simple to use and like many other GNU tools, with good chance already installed on your computer. If not, run apt-get install parallel on Debian. It might suggest that you to install moreutils to get parallel, but this installs the wrong software (explanation).

In the writeup, it says that the author used his own MapReduce framework to do it, but it can also be done sequentially as:

for name in *.jpeg; do
    convert -resize 256x256\! $name $name

Instead of this sequential approach, you can run it in parallel with even less typing:

parallel convert -resize 256x256\! {} {} ::: *.jpeg

GNU Parallel will insert each filename at {} to form a command. Multiple commands will execute concurrently if you have a multicore computer.

If you have ever been tempted to do this kind of parallelization by adding & at the end of each command in the for loop, then Parallel is definetely for you. Adding & introduces two problems that Parallel solves: (1) you don’t know when all of them are done and there is no easy way to join them, and (2) it will start a process for each command all at once, while Parallel will schedule your tasks and execute only as many in parallel as your computer can handle.


Parallel can also take input from the pipe, in which case it is similar to xargs:

ls *.jpeg | parallel mv {} {.}-old.jpeg

This command inserts -old into the filenames of all the JPEG files in the directory. The {.} is similar to {}, except it removes the extension. There are many replacement strings like this:

parallel convert -resize 256x256\! {} resized/{/} ::: images/*.jpeg

This resizes all the JPEG files inside the folder images and places the output in the folder resized. The replacement string {/} extracts the filename and is thus similar to the command basename. For this example we went back to the ::: style input, which in many cases is preferable. For instance, it can be used several times to form a product of the input:

parallel "echo {1}: {2}" ::: A B C D ::: {1..8}

Note how we now used {1} and {2} to refer to the input. We also quoted the command, which is optional and might make things clearer (if you want to use pipes inside your command, it is required). Using multiple inputs is great for doing grid searches of parameters. However, let’s say we don’t want to do all combinations of the product and instead want to specify each pair of input manually. This behavior is easily achieved using the --xapply option:

prallel --xapply "echo {1}: {2}" ::: A B C D ::: {1..8}

Note how the letters will wrap around.

In some settings, you might find it easier to create a file,, with all the commands written out:

./experiment 10.0 1.5 > exp1.txt
./experiment 20.0 1.5 --extra-param 3.0 > exp2.txt

Now run them in parallel by:

parallel <          # OR
parallel ::::

The latter is a newer syntax (note that it has four colons), which again I prefer since it can be stringed together multiple times and you can freely mix ::: and ::::.

Multiple computers using SSH

Parallel can also be used to parallelize between multiple computers. Let’s say you have SSH access to the hostnames or SSH aliases node1 and node2 without prompting for password. Now you can tell Parallel to distribute the job across both nodes using the -S option:

parallel -S node1,node2 -j8 convert -resize 256x256\! {} {} ::: *.jpeg

You can refer to the local computer as : (e.g. do -S :,node1,node2 to include the current computer). I also added -j8 to specify that I want each node to run 8 jobs concurrently. You can try leaving this out, but Parallel could have a hard time automatically determining how many jobs to use for each node.

We assumed in this example that the files existed on the other nodes (for instance through NSF). However, Parallel can also transfer the files to the worker nodes and transfer the results back by adding --trc {}.

More information

For more information I recommend:

Thanks to Ole Tange, the author of GNU Parallel, for pointing out errors in this post.

Deep PCA Nets

Tsung-Han Chan and colleagues recently uploaded to ArXiv an interesting paper proposing a simple but effective baseline for deep learning. They propose a novel two-layer architecture where each layer convolves the image with a filterbank, followed by binary hasing, and finally block histogramming for indexing and pooling. The filters in the filterbank are learned using simple algorithms such as random projections (RandNet), principal component analysis (PCANet), and linear discriminant analysis (LDANet). They report results competitive with those obtained by other deep learning methods and scattering networks (introduced by Stéphane Mallat) on a variety of task: face recognition, face verification, hand-written digits recognition, texture discrimination, and object recognition:

Dataset Task Accuracy
Extended Yale B Face Recognition 99.58%
AR Face Recognition 95.00%
FERET (average) Face Recognition 97.25%
MNIST Digit Recognition 99.38%
CUReT Texture Recognition 99.61%
CIFAR10 Object Recognition 78.67%

The authors achieve state-of-the-art results on several of the MNIST Variations tasks. The method compares favorably to hand-designed features, wavelet-derived featues, and deep-network learned features.

PCA-Net Algorithm

The main algorithm is cascades two filterbank convolutions with an intermediate mean normalization step, followed by a binary hashing step and a final histogramming step. Training involves estimating the filterbanks used for the convolutions, and estimating the classifier to be used on top of the ultimate histogram-derived features.

Filterbank Convolutions

The filterbanks are estimated by performing principal components analysis (PCA) over patches. We extract all of the \( 7\times 7 \) patches from all of the images and vectorize them so that each patch is a flat 49-entry vector: where \( \mathbf{v} \) is an image patch in the picture, e.g.:

Image Patch Picture

For each patch vector we take the mean of the entries (the DC-component) and then subtract that mean from each entry of the vector so that all of our patches are now zero mean. We perform PCA over these zero-mean patch vectors and retain the top eight components \( W\in\mathbb{R}^{49\times 8} \). Each principle component (a column of \( W \)) is a filter and may be converted into a \( 7\times 7 \) kernel which is convolved with the input images. The input images are zero-padded for the convolution so that the output has the same dimension as the image itself. So, using the eight columns of \( W \) we take each input image \( \mathcal{I}\) and convert it into eight output images \( \mathcal{I} _ l \) where \( 1\leq l\leq 8 \).

Second Layer

The second layer is constructed by iterating the algorithm from the first layer over each of the eight output images. For each output image \( \mathcal{I} _ l \) we take the dense set of flattened patch vectors, remove the DC-component. The patches produced by the different filters are then concatenated together and we estimate another PCA filterbank (again with eight filters). Each filter \( w _ {2,k} \) from the layer-2 filterbank is convolved with \( \mathcal{I} _ l \) to produce a new image \( \mathcal{I} _ {l,k} \). Repeating this process for each filter in the filterbanks produces \( 64=8\times 8 \) images.

Hashing and Histogramming

The 64 images have the same size as the original image thus we may view the filter outputs as producing a three-dimensional array \( \mathcal{J}\in\mathbb{R}^{H\times W\times 64} \) where \( H\times W \) are the dimensions of the input image. Each of the 64 images is produced from a layer one filter \( l _ 1 \) and a layer two filter \( l _ 2 \) so we denote the associated image as \( \mathcal{J} _ {l _ 1,l _ 2} \). Each pixel \( (x,y) \) from the image has an associated 8-dimensional feature vector \( \mathcal{J}(x,y)\in\mathbb{R}^{64} \). These feature vectors are converted into integers by using a Heaviside step function \( H \) sum:

We note that we produce a hashed image such as \( \mathcal{K} _ l \) for each filter \( l \) in the layer one filterbank so this means that we have eight images after the hashing operation and the images are all integers.


We then take \( 7\times 7 \) blocks of the hashed images \( \mathcal{K} \) and compute a histogram with \(2^{64} \) bins over the values observed. These blocks can be disjoint (used for face recognition) or they can be overlapping (useful for digit recognition). The histograms formed from these blocks and from the several images are all concatenated into a feature vector. Classification is then performed using this feature vector.


The authors estimate a multiclass linear SVM to operate on the estimated feature vector for each image. The same setup was used for all input data. The particular SVM implementation was Liblinear. The specific algorithm used was \( l _ 2 \)-regularized \( l _ 2 \)-loss support vector one-against-rest support vector classification and a cost ( the C parameter) of 1. The call to liblinear may be written

liblinear -s 1 -c 1.0

Author’s Implementation

Code for the paper is here and it has implementations for cifar10 and MNIST basic (a subset of MNIST). With a little extra work one can also make it suitable for testing on the whole MNIST data set.

I tested this implementation on the MNIST basic dataset distributed with their implementation code and obtained a \( 1.31\% \) error rate using \( 12,000 \) training examples and requiring \( 700 \) seconds of training time. This is a somewhat higher error-rate than the \( 1.02\% \) reported in the author’s paper. It is possible that the author ran a more optimized SVM training routine that was not indicated in the posted codes.

The filters learned in the first layer were: first layer PCA filters

The filters learned in the second layer were: second layer PCA filters

We can see that the different image filters are somewhat similar to edge filters and that the seventh and eighth filters (in the lower-right hand corner) have less clear structure than the others. Often, when one uses PCA the first few components have a somewhat clear meaning and the rest of the components look like random noise–this is consistent with a model where the latent dimensionality of the patches is less than eight.


I was intrigued by this paper because of the simplicity of the network and the strong reported results. When I ran my simple experiment I was not able to reach the results as reported in the paper using the codes provided.

In the future I will try further experiments using PCA to ininitialize filters for the deep network. Autoencoders are often used for initializing deep-network filters and PCA is a sort of poor-man’s autoencoder. Mean-normalizing the output layer before moving to the next layer is a simple way to organize multi-layer networks and I think that has promise as a baseline. I am less enthusiastic about the histogramming and hashing steps. The authors mention that the histogramming and hashing produce translation invariance, and I wonder whether translation invariance could be achieved more simply by using max-pooling.

Overall the paper gave me some interesting questions to think about but I think it could serve as an excellent baseline for other deep network systems when the publicly available codes are more mature.