28 Oct 2014
Mark Stoehr, Gustav Larsson
On Thursday, October 2, 2014 Geoffrey Hinton gave a talk (slides, video) on what he calls “dark knowledge” which
he claims is most of what deep learning methods
actually learn. The talk presented an idea that had been introduced in (Caruana, 2006), where a more complex model is used to train a simpler, compressed model.
The main point of the talk is that classifiers built from a softmax function have
a great deal more information contained in them than just a classifier; the
correlations in the softmax outputs are very informative. For example, when
building a computer vision system to detect cats, dogs, and boats, the output
entries for cat and dog in a softmax classifier will always have more
correlation than cat and boat, since cats look similar to dogs.
Dark knowledge was used by Hinton in two different contexts:
- Model compression: using a simpler model with fewer parameters to match the performance of a larger model.
- Specialist Networks: training models specialized to disambiguate between a small number of easily confusable classes.
Preliminaries
A deep neural network typically maps an input vector \( \mathbf{x}\in\mathbb{R}^{D _ {in}} \) to a set of scores \(f(\mathbf{x}) \in \mathbb{R}^C \) for each of \( C \) classes. These scores are then interpreted as a posterior distribution over the labels using a softmax function:

\[ P(y = c \mid \mathbf{x}) = \frac{e^{f _ c(\mathbf{x})}}{\sum _ {c'=1}^{C} e^{f _ {c'}(\mathbf{x})}}. \]
The parameters of the entire network are collected in \( \boldsymbol{\Theta}
\). The goal of the learning algorithm is to estimate \( \boldsymbol{\Theta}
\). Usually, the parameters are learned by minimizing the log loss over all
training samples,

\[ L ^ {\mathrm{(hard)}}(\boldsymbol{\Theta}) = -\sum _ {n=1}^{N} \log P(y _ n \mid \mathbf{x} _ n; \boldsymbol{\Theta}), \]

which is the negative of the log-likelihood of the data under the logistic
regression model. The parameters \( \boldsymbol{\Theta} \) are estimated with
iterative algorithms since there is no closed-form solution.
This loss function may be viewed as a cross entropy between an empirical
posterior distribution and a predicted posterior distribution given by the
model. In the case above, the empirical posterior distribution is simply a
1-hot distribution that puts all its mass at the ground truth label. This
cross-entropy view motivates the dark knowledge training paradigm, which can be
used to do model compression.
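To make the cross-entropy view concrete, here is a minimal numpy sketch (our own illustration, not from the talk) of the softmax posterior and the log loss against a one-hot target, with made-up scores:

import numpy as np

def softmax(f):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(f - np.max(f))
    return e / e.sum()

# Scores f(x) for C = 3 classes and a ground-truth label.
f = np.array([2.0, 1.0, 0.1])
y = 0  # index of the true class

p = softmax(f)                            # predicted posterior over the classes
one_hot = np.eye(len(f))[y]               # empirical posterior: all mass on y
log_loss = -np.sum(one_hot * np.log(p))   # cross entropy reduces to -log p[y]
print(p, log_loss)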
Model compression
Instead of computing the cross entropy against the labeled data, one could compute
it against the posteriors of a previously trained model. In Hinton’s narrative,
this previous model is an ensemble method, which may contain many large deep
networks of similar or various architectures. Ensemble methods have been
shown to consistently achieve strong performance on a variety of tasks for deep
neural networks. However, these networks have a large number of parameters,
which makes it computationally demanding to do inference on new samples. To
alleviate this, once the ensemble is trained and its error rate is sufficiently
low, we use the softmax outputs from the ensemble method to construct training
targets for the smaller, simpler model.
In particular, for each data point \( \mathbf{x} _ n \), our first, bigger
ensemble network (with parameters \( \boldsymbol{\Theta} ^ {\mathrm{(ens)}} \) and scores \( f ^ {\mathrm{(ens)}} \)) may make the prediction

\[ P(y = c \mid \mathbf{x} _ n; \boldsymbol{\Theta} ^ {\mathrm{(ens)}}) = \frac{e^{f _ c ^ {\mathrm{(ens)}}(\mathbf{x} _ n)}}{\sum _ {c'=1}^{C} e^{f _ {c'} ^ {\mathrm{(ens)}}(\mathbf{x} _ n)}}. \]
The idea is to train the smaller network using this output distribution rather
than the true labels. However, since the posterior estimates are typically low
entropy, the dark knowledge is largely indiscernible without a log transform.
To get around this, Hinton increases the entropy of the posteriors by using a
transform that “raises the temperature”: for a network with scores \( f(\mathbf{x}) \) and parameters \( \boldsymbol{\Theta} \),

\[ P _ T(y = c \mid \mathbf{x}; \boldsymbol{\Theta}) = \frac{e^{f _ c(\mathbf{x})/T}}{\sum _ {c'=1}^{C} e^{f _ {c'}(\mathbf{x})/T}}, \]

where \( T \) is a temperature parameter that, when raised, increases the entropy.
We now set our target distributions as

\[ q _ {n,c} = P _ T(y = c \mid \mathbf{x} _ n; \boldsymbol{\Theta} ^ {\mathrm{(ens)}}). \]
The loss function becomes

\[ L ^ {\mathrm{(soft)}}(\boldsymbol{\Theta}) = -\sum _ {n=1}^{N} \sum _ {c=1}^{C} q _ {n,c} \log P _ T(y = c \mid \mathbf{x} _ n; \boldsymbol{\Theta}). \]
Hinton mentioned that the best results are achieved by combining the two loss functions. At
first, we thought he meant alternating between them, as in train one batch with
\( L ^ \mathrm{(hard)} \) and the other with \( L ^ \mathrm{(soft)} \).
However, after a discussion with a professor who also attended the talk, it
seems as though Hinton took a convex combination of the two loss functions,

\[ L(\boldsymbol{\Theta}) = \alpha L ^ {\mathrm{(soft)}}(\boldsymbol{\Theta}) + (1 - \alpha) L ^ {\mathrm{(hard)}}(\boldsymbol{\Theta}), \]

where \( \alpha \) is a parameter. This professor had the impression
that an appropriate value was \( \alpha = 0.9 \) after asking Hinton about it.
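As an unofficial sketch of this training setup, the following numpy snippet computes temperature-raised soft targets from made-up ensemble scores and forms the convex combination written above; the value \( T = 3 \) and the toy score vectors are our own choices for illustration:

import numpy as np

def softmax(f, T=1.0):
    # Softmax at temperature T; a larger T gives a higher-entropy distribution.
    e = np.exp((f - np.max(f)) / T)
    return e / e.sum()

T, alpha = 3.0, 0.9               # illustrative values; alpha = 0.9 as mentioned above
g = np.array([6.0, 2.5, -1.0])    # ensemble scores for one data point (made up)
f = np.array([4.0, 3.0, -0.5])    # smaller network's scores (made up)
y = 0                             # ground-truth label

q = softmax(g, T)                                # soft target from the ensemble
soft_loss = -np.sum(q * np.log(softmax(f, T)))   # cross entropy at temperature T
hard_loss = -np.log(softmax(f)[y])               # usual log loss on the true label
loss = alpha * soft_loss + (1 - alpha) * hard_loss
print(q, loss)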
One of the main settings where this is useful is speech recognition. Here an
ensemble phone recognizer may achieve a low phone
error rate, but it may be too slow to process user input on the fly. A simpler
model replicating the ensemble method, however, can bring some of the
classification gains of large-scale ensemble deep network models to practical
speech systems.
Specialist networks
Specialist networks are a way of using dark knowledge to improve the performance of deep network models regardless of their underlying
complexity. They are used in settings where there are many different classes. As before, a deep network is trained over the data
and each data point is assigned a target that corresponds to the temperature-adjusted softmax output. These softmax outputs
are then clustered multiple times using k-means, and the resulting clusters indicate easily confusable data points that
come from a subset of classes. Specialist networks
are then trained only on the data in these clusters, using a restricted number of classes: all classes not contained in
the cluster are treated as coming from a single “other” class. These specialist networks are trained using the same mix of one-hot and temperature-adjusted targets described above.
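As a rough sketch of how such clusters might be found (our own reading of the procedure, not code from the talk), one can run k-means on the temperature-adjusted softmax outputs and read off which classes dominate each cluster; the random placeholder targets and scikit-learn's KMeans below are used purely for illustration:

import numpy as np
from sklearn.cluster import KMeans

# soft_targets: one temperature-adjusted softmax vector per data point,
# shape (num_points, num_classes); random placeholder data here.
rng = np.random.RandomState(0)
soft_targets = rng.dirichlet(np.ones(10), size=1000)

# Cluster the soft targets; each cluster groups easily confused data points.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(soft_targets)

# For each cluster, keep the classes that receive appreciable average mass;
# these would form the class subset handled by one specialist network.
for k in range(kmeans.n_clusters):
    mean_posterior = soft_targets[kmeans.labels_ == k].mean(axis=0)
    specialist_classes = np.where(mean_posterior > 1.0 / len(mean_posterior))[0]
    print(k, specialist_classes)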
Combining the various specialist networks with the full network into an ensemble improves the overall system.
One technical hiccup is that the specialist networks are trained on different class sets than the full
network, so combining the softmax outputs from multiple networks requires a combination trick. Essentially, there is an optimization problem
to solve: find a single distribution over all classes such that, for each specialist, the mass assigned to the classes lumped into its catch-all “dustbin” class matches the dustbin probability that specialist outputs. So if you have
cars and cheetahs grouped together in one class for your dog detector, you combine that network with your cars-versus-cheetahs network by ensuring
the combined probabilities for cars and cheetahs sum to a probability similar to the catch-all output of the dog detector.
15 Sep 2014
Gustav Larsson
I was reading the ImageNet
tutorial for
Caffe (a deep learning framework), in which
they need to resize a large number of images. It struck me that they might not
be aware of GNU Parallel, since it
is a great tool for this task. I recommend it to any data scientist out there
since it is so simple to use and, like many other GNU tools, has a good chance
of already being installed on your computer. If not, run apt-get install parallel on
Debian. You might be told to install moreutils to get parallel, but
that installs the wrong software (explanation).
In the writeup, it says that the author used his own MapReduce framework to do
it, but it can also be done sequentially as:
for name in *.jpeg; do
    convert -resize 256x256\! $name $name
done
Instead of this sequential approach, you can run it in parallel with even less
typing:
parallel convert -resize 256x256\! {} {} ::: *.jpeg
GNU Parallel will insert each filename at {}
to form a command. Multiple
commands will execute concurrently if you have a multicore computer.
If you have ever been tempted to do this kind of parallelization by adding &
at the end of each command in the for loop, then Parallel is definitely for
you. Adding &
introduces two problems that Parallel solves: (1) you don’t
know when all of them are done and there is no easy way to join them, and (2)
it will start a process for each command all at once, while Parallel will
schedule your tasks and execute only as many in parallel as your computer
can handle.
Basics
Parallel can also take input from the pipe, in which case it is similar to xargs:
ls *.jpeg | parallel mv {} {.}-old.jpeg
This command inserts -old into the filenames of all the JPEG files in the
directory. The {.} is similar to {}, except it removes the extension. There
are many replacement strings like this:
parallel convert -resize 256x256\! {} resized/{/} ::: images/*.jpeg
This resizes all the JPEG files inside the folder images and places the
output in the folder resized. The replacement string {/} extracts the
filename and is thus similar to the command basename. For this example we
went back to the ::: style input, which in many cases is preferable. For
instance, it can be used several times to form a product of the input:
parallel "echo {1}: {2}" ::: A B C D ::: {1..8}
Note how we now used {1} and {2} to refer to the input. We also quoted the
command, which is optional and might make things clearer (if you want to use
pipes inside your command, it is required). Using multiple inputs is great for
doing grid searches of parameters. However, let’s say we don’t want to do all
combinations of the product and instead want to specify each pair of inputs
manually. This behavior is easily achieved using the --xapply option:
parallel --xapply "echo {1}: {2}" ::: A B C D ::: {1..8}
Note how the letters will wrap around.
In some settings, you might find it easier to create a file, commands.sh,
with all the commands written out:
./experiment 10.0 1.5 > exp1.txt
./experiment 20.0 1.5 --extra-param 3.0 > exp2.txt
Now run them in parallel by:
parallel < commands.sh # OR
parallel :::: commands.sh
The latter is a newer syntax (note that it has four colons), which again I
prefer since it can be strung together multiple times and you can freely mix
::: and ::::.
Multiple computers using SSH
Parallel can also be used to parallelize between multiple computers. Let’s say
you have SSH access to the hostnames or SSH aliases node1 and node2 without
being prompted for a password. Now you can tell Parallel to distribute the job
across both nodes using the -S option:
parallel -S node1,node2 -j8 convert -resize 256x256\! {} {} ::: *.jpeg
You can refer to the local computer as : (e.g. do -S :,node1,node2 to
include the current computer). I also added -j8 to specify that I want each
node to run 8 jobs concurrently. You can try leaving this out, but Parallel
could have a hard time automatically determining how many jobs to use for each
node.
We assumed in this example that the files existed on the other nodes (for
instance through NFS). However, Parallel can also transfer the files to the
worker nodes and transfer the results back by adding --trc {}.
For more information, I recommend the GNU Parallel documentation and its tutorial.
Thanks to Ole Tange, the author of GNU Parallel, for pointing out errors in this post.
31 Aug 2014
Mark Stoehr
Tsung-Han Chan and colleagues recently uploaded to arXiv an interesting paper proposing a simple but effective baseline for deep learning. They propose a novel two-layer architecture where
each layer convolves the image with a filterbank, followed by binary hashing, and finally block histogramming for indexing and pooling. The filters in the filterbank are learned using simple algorithms such as random projections (RandNet),
principal component analysis (PCANet), and linear discriminant analysis (LDANet). They report results competitive with those obtained
by other deep learning methods
and scattering networks (introduced by Stéphane Mallat) on a variety of tasks: face recognition, face verification, handwritten digit recognition, texture discrimination, and object recognition:
Dataset | Task | Accuracy
--- | --- | ---
Extended Yale B | Face Recognition | 99.58%
AR | Face Recognition | 95.00%
FERET (average) | Face Recognition | 97.25%
MNIST | Digit Recognition | 99.38%
CUReT | Texture Recognition | 99.61%
CIFAR10 | Object Recognition | 78.67%
The authors achieve state-of-the-art results on several of the MNIST Variations tasks. The method compares favorably to hand-designed features, wavelet-derived features, and deep-network-learned features.
PCANet Algorithm
The main algorithm cascades two filterbank convolutions
with an intermediate mean-normalization step,
followed by
a binary hashing step and a final histogramming step. Training
involves estimating the filterbanks used for the convolutions,
and estimating the classifier to be used on top of the final histogram-derived features.
Filterbank Convolutions
The filterbanks are estimated by performing principal component
analysis (PCA) over patches. We extract all of the \( 7\times 7 \)
patches from all of the images and vectorize them so that each patch
is a flat 49-entry vector:

\[ \mathbf{x} = \mathrm{vec}(\mathbf{v}) \in \mathbb{R}^{49}, \]

where \( \mathbf{v} \in \mathbb{R}^{7\times 7} \) is an image patch from the picture, e.g.:

For each patch vector we take the mean
of the entries (the DC-component) and then subtract that mean
from each entry of the vector so that all of our patches
are now zero mean. We perform PCA over these zero-mean
patch vectors and retain
the top eight components \( W\in\mathbb{R}^{49\times 8} \). Each
principal component (a column of \( W \)) is a filter and may be
converted into a \( 7\times 7 \) kernel which is convolved with
the input images. The input images are zero-padded for the
convolution so that the output has the same dimension as the
image itself. So, using the eight columns of \( W \)
we take each input image \( \mathcal{I}\) and convert it
into eight output images \( \mathcal{I} _ l \) where \( 1\leq l\leq 8 \).
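Here is a minimal numpy sketch of this first stage (our own illustration, not the authors' code); the helper name pca_filterbank and the random placeholder images are made up for the example:

import numpy as np

def pca_filterbank(images, patch_size=7, num_filters=8):
    # Collect every patch_size x patch_size patch from every image as a flat vector.
    patches = []
    for img in images:
        H, W = img.shape
        for i in range(H - patch_size + 1):
            for j in range(W - patch_size + 1):
                patches.append(img[i:i + patch_size, j:j + patch_size].ravel())
    X = np.array(patches, dtype=float)

    # Remove the DC component (mean of each patch vector).
    X -= X.mean(axis=1, keepdims=True)

    # PCA: eigenvectors of X^T X with the largest eigenvalues are the top components.
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    W_pca = eigvecs[:, ::-1][:, :num_filters]   # 49 x 8 matrix of filters

    # Reshape each column into a 7 x 7 convolution kernel.
    return [W_pca[:, k].reshape(patch_size, patch_size) for k in range(num_filters)]

images = [np.random.rand(28, 28) for _ in range(10)]   # placeholder data
filters = pca_filterbank(images)
print(len(filters), filters[0].shape)

Each returned kernel could then be applied with, for example, scipy.signal.convolve2d(image, kernel, mode='same'), which zero-pads the image so that the output has the same size as the input.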
Second Layer
The second layer is constructed by iterating the algorithm from
the first layer over each of the eight output images. For each
output image \( \mathcal{I} _ l \) we take the dense set
of flattened patch vectors and remove the DC component. The patches produced by
the different filters are then concatenated together and
we estimate another PCA filterbank (again with eight filters). Each filter
\( w _ {2,k} \) from the layer-2 filterbank is convolved with
\( \mathcal{I} _ l \) to produce a new image \( \mathcal{I} _ {l,k} \). Repeating
this process for each filter in the filterbanks produces \( 64=8\times 8 \)
images.
Hashing and Histogramming
The 64 images have the same size as the original image, so we
may view the filter outputs as producing a three-dimensional
array \( \mathcal{J}\in\mathbb{R}^{H\times W\times 64} \)
where \( H\times W \) are the dimensions of the input image. Each
of the 64 images is produced from a layer one filter \( l _ 1 \)
and a layer two filter \( l _ 2 \) so we denote the associated
image as \( \mathcal{J} _ {l _ 1,l _ 2} \). Each
pixel \( (x,y) \) of the image has an associated
64-dimensional feature vector \( \mathcal{J}(x,y)\in\mathbb{R}^{64} \). For each layer-one filter \( l _ 1 \), the corresponding eight entries (one per layer-two filter) are converted into an integer using a
Heaviside step function \( H \) and a weighted sum:

\[ \mathcal{K} _ {l _ 1}(x,y) = \sum _ {l _ 2=1}^{8} 2^{\,l _ 2-1}\, H\big(\mathcal{J} _ {l _ 1,l _ 2}(x,y)\big). \]
Note that this produces one hashed image \( \mathcal{K} _ l \)
for each filter \( l \) in the layer-one filterbank, so after the
hashing operation we have eight integer-valued images.
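Continuing the illustration (again our own sketch, not the paper's code), the hashing step for one first-layer filter might look like this in numpy, where second_layer_outputs stands for the eight images \( \mathcal{J} _ {l _ 1,1},\ldots,\mathcal{J} _ {l _ 1,8} \):

import numpy as np

def binary_hash(second_layer_outputs):
    # second_layer_outputs has shape (8, H, W): the eight second-layer images
    # that share one first-layer filter.
    bits = (second_layer_outputs > 0).astype(np.int64)   # Heaviside step H
    weights = 2 ** np.arange(bits.shape[0])              # 1, 2, 4, ..., 128
    # Weighted sum over the filter axis gives an integer in {0, ..., 255} per pixel.
    return np.tensordot(weights, bits, axes=(0, 0))

hashed = binary_hash(np.random.randn(8, 28, 28))
print(hashed.shape, hashed.min(), hashed.max())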
Histogramming
We then take \( 7\times 7 \) blocks of the hashed images
\( \mathcal{K} \) and compute a histogram with \( 2^8 = 256 \)
bins over the values observed (each hashed pixel is an integer
between 0 and 255). These blocks can be disjoint
(used for face recognition) or they can be overlapping (useful
for digit recognition). The histograms formed from these blocks
and from the several images are all concatenated into a feature
vector. Classification is then performed using this feature
vector.
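The block histograms can then be sketched as follows (disjoint \( 7\times 7 \) blocks and 256 bins; block_histograms is our hypothetical helper, not the authors' code):

import numpy as np

def block_histograms(hashed, block=7, bins=256):
    # hashed: integer-valued image produced by the hashing step.
    H, W = hashed.shape
    feats = []
    for i in range(0, H - block + 1, block):       # disjoint blocks
        for j in range(0, W - block + 1, block):
            patch = hashed[i:i + block, j:j + block]
            feats.append(np.bincount(patch.ravel(), minlength=bins))
    return np.concatenate(feats)

feature_vector = block_histograms(np.random.randint(0, 256, size=(28, 28)))
print(feature_vector.shape)   # 16 blocks x 256 bins = 4096 entries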
Classification
The authors train a multiclass linear SVM that operates
on the estimated feature vector for each image. The same
setup was used for all input data. The particular SVM implementation
was Liblinear.
The specific algorithm used was \( l _ 2 \)-regularized
\( l _ 2 \)-loss one-against-rest support vector classification
with a cost (the C parameter) of 1. The
call to liblinear may be written
liblinear -s 1 -c 1.0
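For readers who prefer scikit-learn (which wraps liblinear) over the liblinear command line, a rough equivalent of this configuration, as we understand it, would be:

import numpy as np
from sklearn.svm import LinearSVC

# l2-regularized, squared-hinge (l2) loss linear SVM with C = 1.0;
# scikit-learn trains it one-vs-rest by default, roughly matching the
# liblinear options described above.
clf = LinearSVC(penalty='l2', loss='squared_hinge', C=1.0)

# Tiny random stand-in for the block-histogram feature vectors.
X = np.random.rand(40, 4096)
y = np.random.randint(0, 10, size=40)
clf.fit(X, y)
print(clf.predict(X[:3]))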
Author’s Implementation
Code for the paper is here
and it has implementations for cifar10 and MNIST basic (a subset of MNIST). With a little extra
work one can also make it suitable for testing on the whole MNIST data set.
I tested this implementation on the MNIST basic dataset distributed with their implementation
code and obtained a \( 1.31\% \) error rate using \( 12,000 \) training examples
and \( 700 \) seconds of training time. This is a somewhat higher error rate than the \( 1.02\% \)
reported in the authors’ paper. It is possible that the authors ran a more optimized SVM training routine
that was not indicated in the posted code.
The filters learned in the first layer were:

The filters learned in the second layer were:

We can see that the different image filters are somewhat similar to edge filters and that the seventh
and eighth filters (in the lower right-hand corner) have less clear structure than the others. Often,
when one uses PCA, the first few components have a somewhat clear meaning and the rest of the components
look like random noise; this is consistent with a model where the latent dimensionality of the patches is less than
eight.
Conclusion
I was intrigued by this paper because of the simplicity of the network and the strong
reported results. When I ran my simple experiment I was not able to reach the results reported
in the paper using the code provided.
In the future I will try further experiments using PCA to initialize filters for the deep network.
Autoencoders are often used for initializing deep-network filters, and PCA is a sort of poor man’s autoencoder.
Mean-normalizing the output layer before moving to the next layer is a simple way to organize multi-layer networks
and I think that has promise as a baseline. I am less enthusiastic about the histogramming and hashing steps.
The authors mention that the histogramming and hashing produce translation invariance, and I wonder whether
translation invariance could be achieved more simply by using max-pooling.
Overall, the paper gave me some interesting questions to think about, but I think it could serve as an excellent
baseline for other deep network systems once the publicly available code is more mature.