# Hinton's Dark Knowledge

28 Oct 2014 Mark Stoehr, Gustav Larsson

On Thursday, October 2, 2014, Geoffrey Hinton gave a talk (slides, video) on what he calls “dark knowledge”, which he claims is most of what deep learning methods actually learn. The talk built on an idea introduced in (Caruana, 2006), where a more complex model is used to train a simpler, compressed model. The main point of the talk is that classifiers built from a softmax function contain a great deal more information than just the predicted class; the correlations in the softmax outputs are very informative. For example, when building a computer vision system to detect `cats`, `dogs`, and `boats`, the output entries for `cat` and `dog` in a softmax classifier will typically be more correlated than those for `cat` and `boat`, since `cats` look similar to `dogs`.

Dark knowledge was used by Hinton in two different contexts:

- Model compression: using a simpler model with fewer parameters to match the performance of a larger model.
- Specialist networks: training models specialized to disambiguate between a small number of easily confusable classes.

## Preliminaries

A deep neural network typically maps an input vector \( \mathbf{x}\in\mathbb{R}^{D _ {in}} \) to a vector of scores \(f(\mathbf{x}) \in \mathbb{R}^C \), one for each of \( C \) classes. These scores are then interpreted as a posterior distribution over the labels using a softmax function

\[ P(y = c \mid \mathbf{x}; \boldsymbol{\Theta}) = \frac{\exp(f _ c(\mathbf{x}))}{\sum _ {j=1} ^ C \exp(f _ j(\mathbf{x}))}. \]

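The parameters of the entire network are collected in \( \boldsymbol{\Theta} \). The goal of the learning algorithm is to estimate \( \boldsymbol{\Theta} \). Usually, the parameters are learned by minimizing the log loss of the model posterior \( P(y \mid \mathbf{x}; \boldsymbol{\Theta}) \) over all \( N \) training samples \( (\mathbf{x} _ n, y _ n) \)

\[ L ^ \mathrm{(hard)} = - \sum _ {n=1} ^ N \log P(y _ n \mid \mathbf{x} _ n; \boldsymbol{\Theta}), \]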

which is the negative of the log-likelihood of the data under the logistic regression model. The parameters \( \boldsymbol{\Theta} \) are estimated with iterative algorithms since there is no closed-form solution.

This loss function may be viewed as a cross entropy between an empirical posterior distribution and a predicted posterior distribution given by the model. In the case above, the empirical posterior distribution is simply a 1-hot distribution that puts all its mass at the ground truth label. This cross-entropy view motivates the dark knowledge training paradigm, which can be used to do model compression.
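The cross-entropy view can be checked numerically. The sketch below is illustrative only: the scores and the helper names `softmax`, `cross_entropy`, and `log_loss` are assumptions made for this example, not code from the talk. It shows that the log loss for a single sample equals the cross entropy between a one-hot target and the softmax posterior.

```python
import math

def softmax(scores):
    """Turn a list of class scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """Cross entropy -sum_c target[c] * log(predicted[c])."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def log_loss(label, predicted):
    """Negative log-probability assigned to the ground-truth label."""
    return -math.log(predicted[label])

# Hypothetical scores for the classes (cat, dog, boat).
scores = [2.0, 1.0, -1.0]
posterior = softmax(scores)
one_hot = [1.0, 0.0, 0.0]  # ground truth: cat

hard_loss = log_loss(0, posterior)
same_loss = cross_entropy(one_hot, posterior)
```

Because `cross_entropy` accepts any target distribution, the one-hot vector can later be swapped for a softer target without changing the loss code.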

## Model compression

Instead of minimizing the cross entropy against the labeled data, one could minimize it against the posteriors of a previously trained model. In Hinton’s narrative, this previous model is an ensemble, which may contain many large deep networks of similar or various architectures. Ensemble methods have been shown to consistently achieve strong performance on a variety of tasks for deep neural networks. However, these networks have a large number of parameters, which makes it computationally demanding to do inference on new samples. To alleviate this, once the ensemble has been trained and its error rate is sufficiently low, we use the softmax outputs from the ensemble to construct training targets for the smaller, simpler model.

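In particular, for each data point \( \mathbf{x} _ n \), our first bigger ensemble network, with parameters \( \boldsymbol{\Theta} ^ \mathrm{(big)} \), may make the prediction

\[ \hat{\mathbf{y}} _ n = P(\cdot \mid \mathbf{x} _ n; \boldsymbol{\Theta} ^ \mathrm{(big)}), \]

a vector of \( C \) class probabilities. The idea is to train the smaller network using this output distribution rather than the true labels. However, since the posterior estimates are typically low entropy, the dark knowledge is largely indiscernible without a log transform. To get around this, Hinton increases the entropy of the posteriors by using a transform that “raises the temperature” as

\[ [g(\mathbf{y}; T)] _ c = \frac{y _ c ^ {1/T}}{\sum _ {j=1} ^ C y _ j ^ {1/T}}, \]

where \( T \) is a temperature parameter that when raised increases the entropy. Applied to a softmax output, this is the same as dividing the scores by \( T \) before the softmax. We now set our target distributions as

\[ \mathbf{t} _ n = g(\hat{\mathbf{y}} _ n; T). \]

The loss function for the smaller network, with parameters \( \boldsymbol{\Theta} ^ \mathrm{(small)} \), becomes

\[ L ^ \mathrm{(soft)} = - \sum _ {n=1} ^ N \sum _ {c=1} ^ C [\mathbf{t} _ n] _ c \log P(c \mid \mathbf{x} _ n; \boldsymbol{\Theta} ^ \mathrm{(small)}). \]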

Hinton mentioned that the best results are achieved by combining the two loss functions. At first, we thought he meant alternating between them, as in train one batch with \( L ^ \mathrm{(hard)} \) and the other with \( L ^ \mathrm{(soft)} \). However, after a discussion with a professor who also attended the talk, it seems as though Hinton took a convex combination of the two loss functions

\[ L = \alpha L ^ \mathrm{(soft)} + (1 - \alpha) L ^ \mathrm{(hard)}, \]

where \( \alpha \in [0, 1] \) is a mixing parameter. This professor had the impression that an appropriate value was \( \alpha = 0.9 \) after asking Hinton about it.
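A minimal sketch of the soft-target training signal for a single example follows. Several things here are assumptions made for illustration: the temperature transform is taken to raise each probability to the power \( 1/T \) and renormalize, \( \alpha \) is taken to weight the soft loss, and the scores, the value `T=3.0`, and all function names are hypothetical.

```python
import math

def softmax(scores, T=1.0):
    """Softmax of scores / T; a larger T gives a flatter distribution."""
    scaled = [s / T for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def raise_temperature(probs, T):
    """Raise each probability to the power 1/T and renormalize."""
    powered = [p ** (1.0 / T) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def cross_entropy(target, predicted):
    return -sum(t * math.log(q) for t, q in zip(target, predicted) if t > 0)

def combined_loss(big_probs, small_probs, label, T=3.0, alpha=0.9):
    """alpha * soft loss + (1 - alpha) * hard loss for one example."""
    soft_target = raise_temperature(big_probs, T)
    soft = cross_entropy(soft_target, small_probs)
    hard = -math.log(small_probs[label])
    return alpha * soft + (1.0 - alpha) * hard

# Hypothetical outputs for (cat, dog, boat); the true label is cat.
big_probs = softmax([6.0, 3.0, -2.0])    # confident ensemble
small_probs = softmax([2.0, 1.5, 0.0])   # smaller model being trained
soft_target = raise_temperature(big_probs, T=3.0)
loss = combined_loss(big_probs, small_probs, label=0)
```

Raising the temperature leaves the ranking of the classes unchanged but moves probability mass toward the less likely ones, so the small model sees that `dog` is a far more plausible mistake than `boat`.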

One of the main settings where this is useful is speech recognition. Here an ensemble phone recognizer may achieve a low phone error rate, but it may be too slow to process user input on the fly. A simpler model replicating the ensemble, however, can bring some of the classification gains of large-scale ensemble deep network models to practical speech systems.

## Specialist networks

Specialist networks are a way of using dark knowledge to improve the performance of deep network models regardless of their underlying complexity. They are used in settings with many different classes. As before, a deep network is trained over the data and each data point is assigned a target that corresponds to the temperature-adjusted softmax output. These softmax outputs are then clustered multiple times using k-means, and the resulting clusters indicate easily confusable data points that come from a subset of classes. Specialist networks are then trained only on the data in these clusters using a restricted number of classes; they treat all classes not contained in the cluster as coming from a single “other” class. These specialist networks are trained using the same mix of one-hot and temperature-adjusted targets described above. Combining the various specialist networks into an ensemble then improves the performance of the overall network.
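The “other” class can be illustrated with a short sketch. It assumes the clustering step has already produced a subset of confusable classes (the k-means step itself is not shown), and the class list, the probabilities, and the helper names are hypothetical.

```python
def specialist_target(probs, specialist_classes):
    """Keep the specialist's classes and merge all remaining classes
    into one trailing "other" entry."""
    kept = [probs[c] for c in specialist_classes]
    other = sum(p for c, p in enumerate(probs) if c not in specialist_classes)
    return kept + [other]

def specialist_label(label, specialist_classes):
    """Map a full-network label to the specialist's label space."""
    if label in specialist_classes:
        return specialist_classes.index(label)
    return len(specialist_classes)  # index of the "other" class

# Hypothetical full-network soft target over five classes.
classes = ["cat", "dog", "boat", "car", "cheetah"]
full_probs = [0.55, 0.35, 0.02, 0.03, 0.05]

cat_dog = [0, 1]  # indices of the classes this specialist handles
target = specialist_target(full_probs, cat_dog)  # (cat, dog, other)
```

The specialist then has only three outputs, and both its soft targets and its one-hot labels live in that smaller label space.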

One technical hiccup is that the specialist networks are trained on different classes than the full network, so combining the softmax outputs from multiple networks requires a combination trick. Essentially, there is an optimization problem to solve: ensure that the catch-all “dustbin” class of each specialist network matches the sum of the softmax outputs for the classes it absorbs. For example, if cars and cheetahs are grouped together in the dustbin class of your dog detector, you combine that network with your cars-versus-cheetahs network by ensuring that the output probabilities for cars and cheetahs sum to a probability similar to the catch-all output of the dog detector.
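One plausible way to pose this optimization problem is sketched below. It is an assumed formulation rather than the one from the talk. We look for a single full distribution `q` that minimizes the KL divergence to the full network's output, plus the KL divergence between each specialist's output and `q` collapsed into that specialist's label space. The plain gradient descent on the logits of `q`, the step size, and all the numbers are illustrative choices.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def collapse(q, classes):
    """View a full distribution q through a specialist's label space:
    its own classes, followed by one dustbin entry for everything else."""
    kept = [q[c] for c in classes]
    dustbin = sum(p for c, p in enumerate(q) if c not in classes)
    return kept + [dustbin]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(q, general, specialists):
    total = kl(general, q)
    for classes, p in specialists:
        total += kl(p, collapse(q, classes))
    return total

def combine(general, specialists, steps=1000, lr=0.2):
    """Gradient descent on the logits of q, starting from the full network."""
    z = [math.log(p) for p in general]
    n = len(general)
    for _ in range(steps):
        q = softmax(z)
        grad = [q[c] - general[c] for c in range(n)]  # from KL(general || q)
        for classes, p in specialists:
            merged = collapse(q, classes)
            for c in range(n):
                # k is the specialist output that full class c falls into
                k = classes.index(c) if c in classes else len(classes)
                grad[c] += q[c] - p[k] * q[c] / merged[k]
        z = [v - lr * g for v, g in zip(z, grad)]
    return softmax(z)

# Hypothetical outputs over the classes (dog, car, cheetah).
general = [0.5, 0.3, 0.2]
specialists = [
    ([0], [0.3, 0.7]),          # dog detector: (dog, dustbin)
    ([1, 2], [0.6, 0.1, 0.3]),  # car vs cheetah: (car, cheetah, dustbin)
]
combined = combine(general, specialists)
```

In this toy case the combined probabilities for car and cheetah sum to a value between the full network's 0.5 and the dog detector's dustbin probability of 0.7, which is the kind of agreement the text describes.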