<h1 id="building-blocks-of-deep-learning">The building blocks of Deep Learning</h1>
<p><em>2015-11-21</em></p>
<p>A feed-forward network is built up of nodes that form a directed acyclic graph (DAG). This post focuses on how a single node works and what we need to implement to define one. It is aimed at people who generally know how deep networks work, but can still be confused about exactly what gradients need to be computed for each node (i.e. myself, all the time).</p>
<h1 id="network">Network</h1>
<p>In our network, we will have three different types of nodes (often called <em>layers</em> as well):</p>
<ul>
<li>Static data (data and labels)</li>
<li>Dynamic data (parameters)</li>
<li>Functions</li>
</ul>
<p>This is a bit different from the traditional take on nodes, since we are not allowing nodes to have any
internal parameters. Instead, parameters will be fed into function nodes as dynamic data. A network with
two fully connected layers may look like this:</p>
<p><img src="/public/images/building-block/example.svg" alt="node" /></p>
<p>The static data nodes are light blue and the dynamic data nodes (parameter nodes) are orange.</p>
<p>To train this network, all we need is the derivative of the loss, \( L \),
with respect to each of the parameter nodes. For this, we need to consider the
canonical building block, the function node:</p>
<p><img src="/public/images/building-block/block.svg" alt="node" /></p>
<p>It takes any number of inputs and produces an output (that eventually leads to the loss). The node itself makes no distinction between static and dynamic data, which makes things both simpler and more flexible. What we need from this building block is a way to compute \(\mathbf{z}\) and the derivative of \( L \) with respect to each of the inputs. First of all, we need a function that computes</p>
<p>\begin{equation}
\mathbf{z} = \mathrm{forward}((\mathbf{x}^1, \dots, \mathbf{x}^n)).
\end{equation}</p>
<p>This is the simple part and should be trivial once you have decided what you want the node to do.</p>
<p>Next, the derivative of \( L \) with respect to a single element of one of the inputs looks like (superscript omitted):</p>
<p>\begin{equation}
\frac{ \partial L }{ \partial x _ i } = \sum _ j \frac{ \partial L }{ \partial z _ j } \frac{ \partial z _ j }{\partial x _ i }
\end{equation}</p>
<p>We broke the derivative up using the multivariable chain rule (also known as the <a href="https://en.wikipedia.org/wiki/Total_derivative">total derivative</a>). It can also be written as</p>
<p>\begin{equation}
\frac{ \partial L }{ \partial \mathbf{x} } = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{x} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} } \quad\quad \left[ \mathbb{R}^ { A \times 1} = \mathbb{R} ^ {A \times B} \mathbb{R} ^ {B \times 1} \right]
\end{equation}</p>
<p>This assumes that the input size is \( A \) and the output size is \( B \). The derivative \( \frac{ \partial L }{ \partial \mathbf{z} } \in \mathbb{R} ^ {B} \) is something that needs to be given to the building block from the outside (this is the gradient being back-propagated). The Jacobian \( \frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{x} } \in \mathbb{R} ^ {B \times A} \) on the other hand needs to be defined by the node. However, we do not necessarily need to explicitly compute it or store it. All we need is to define the function</p>
<p>\begin{equation}
\frac{ \partial L }{ \partial \mathbf{x} } = \mathrm{backward}\left(\mathbf{x}, \mathbf{z}, \frac{ \partial L }{ \partial \mathbf{z} }\right)
\end{equation}</p>
<p>This would need to be done for each input separately. Since they sometimes share computations, frameworks like Caffe use a single function for the entire node’s backward computation. In our code examples, we will adopt this as well, meaning we will be defining:</p>
<p>\begin{equation}
\left(\frac{ \partial L }{ \partial \mathbf{x}^1 }, \dots, \frac{ \partial L }{ \partial \mathbf{x}^n }\right) = \mathrm{backward}\left((\mathbf{x}^1, \dots, \mathbf{x}^n), \mathbf{z}, \frac{ \partial L }{ \partial \mathbf{z} }\right)
\end{equation}</p>
<p>It is also common to support multiple outputs; however, for simplicity (and without loss of generality) we will assume there is only one.</p>
<h2 id="functions">Functions</h2>
<p>So, the functions that we need to define for a single node are, first, the forward pass:</p>
<p><img src="/public/images/building-block/forward.svg" alt="forward" /></p>
<p>The input data refers to all the inputs, so it will for instance be a list of arrays.</p>
<p>Next, the backward pass:</p>
<p><img src="/public/images/building-block/backward.svg" alt="backward" /></p>
<p>It takes three inputs as described above and returns the gradient of the loss with respect to the input. It does not need to take the output data, since it can be computed from the input data. However, if it is needed, we might as well pass it in since we will have computed it already.</p>
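<p>The two functions above define the complete node interface. A minimal sketch in Python (the class and method names here are my own, not from any particular framework):</p>

```python
import numpy as np

class Node:
    """A function node: computes z from its inputs, and the gradient
    of the loss with respect to each input given dL/dz."""

    def forward(self, inputs):
        # inputs: list of arrays (x^1, ..., x^n) -> output array z
        raise NotImplementedError

    def backward(self, inputs, output, output_diff):
        # output_diff is dL/dz; returns [dL/dx^1, ..., dL/dx^n]
        raise NotImplementedError
```

<p>Every concrete node (ReLU, Dense, the loss, ...) then only has to fill in these two methods.</p>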
<h2 id="forward--backward-pass">Forward / Backward pass</h2>
<p>A DAG describes a <a href="https://en.wikipedia.org/wiki/Partially_ordered_set">partial ordering</a>. First, we topologically sort our nodes so that they do not violate the partial order. There will usually be several valid orderings, but we can pick one arbitrarily.</p>
<p>Once we have this ordering, we call forward on the list from the first node to the last. The order will guarantee that the dependencies of a node have been computed when we get to it. The Loss should be the last node. This is called a <em>forward pass</em>.</p>
<p>Then, we call backward on this list in reverse. This means that we start with the Loss node. Since we do not have any output diff at this point, we simply set it to an array of all ones. We proceed until we are done with the first in the list. This is called a <em>backward pass</em>.</p>
<p>Once the forward and the backward pass have been performed, we take the gradients that have arrived at each parameter node and perform a gradient descent update in the opposite direction.</p>
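<p>The two passes can be sketched as follows (a toy driver, assuming each node knows the indices of the nodes feeding it; all names are illustrative):</p>

```python
import numpy as np

def forward_backward(nodes, deps, input_values):
    """One forward and one backward pass over a topologically sorted DAG.

    nodes:        list of function nodes (None for data/parameter nodes),
                  in an order consistent with the partial order
    deps:         deps[i] = indices of the nodes feeding node i
    input_values: index -> array, for the data/parameter nodes
    """
    outputs = dict(input_values)
    for i, node in enumerate(nodes):
        if node is not None:
            outputs[i] = node.forward([outputs[j] for j in deps[i]])

    # Seed the last node (the loss) with ones, then walk the list in reverse.
    grads = {i: np.zeros_like(np.asarray(outputs[i], dtype=float)) for i in outputs}
    last = len(nodes) - 1
    grads[last] = np.ones_like(np.asarray(outputs[last], dtype=float))
    for i in reversed(range(len(nodes))):
        node = nodes[i]
        if node is None:
            continue
        ins = [outputs[j] for j in deps[i]]
        for j, g in zip(deps[i], node.backward(ins, outputs[i], grads[i])):
            grads[j] = grads[j] + g  # accumulate: shared nodes sum their gradients
    return outputs, grads
```

<p>After this returns, each parameter node's entry in <code class="highlighter-rouge">grads</code> is ready for the gradient descent update.</p>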
<h2 id="weight-sharing">Weight sharing</h2>
<p>Externalizing the parameters makes parameter sharing conceptually easy to deal with. For instance, if we wanted to share weights (but not biases), we could do:</p>
<p><img src="/public/images/building-block/shared.svg" alt="shared" /></p>
<p>In this case, \( \mathbf{W} \) would receive <em>two</em> gradient arrays, in which case the sum is taken before performing the update step.</p>
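<p>Concretely, the update for a shared \( \mathbf{W} \) would look something like this (the gradient values here are made up for illustration):</p>

```python
import numpy as np

# Gradients arriving at the shared W from its two consumer nodes:
grad_from_layer1 = np.array([[0.1, -0.2],
                             [0.3,  0.0]])
grad_from_layer2 = np.array([[0.05, 0.1],
                             [-0.1, 0.2]])

# Sum them before the gradient descent step:
total_grad = grad_from_layer1 + grad_from_layer2

W = np.ones((2, 2))
learning_rate = 0.1
W = W - learning_rate * total_grad
```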
<h2 id="relu">ReLU</h2>
<p>As an example, the ReLU has a single input of the same size as the output, so \( A = B \). The output is computed elementwise as</p>
<p>\begin{equation}
z _ i = \max(0, x _ i)
\end{equation}</p>
<p>which could translate to something like this in Python (who uses pseudo-code anymore?):</p>
<div class="highlighter-rouge"><pre class="highlight"><code>import numpy as np

def forward(inputs):
    return np.maximum(inputs[0], 0)
</code></pre>
</div>
<p>For the backward pass, the Jacobian will be a diagonal matrix, with entries</p>
<p>\begin{equation}
\frac{\partial z _ i}{\partial x _ i} = 1 _ {\{ x _ i > 0 \}},
\end{equation}</p>
<p>where \( 1 _ {\{P\}} \) is 1 if the predicate \( P\) is true, and zero otherwise (see <a href="https://en.wikipedia.org/wiki/Iverson_bracket">Iverson bracket</a>). We can now write the gradient of the loss as</p>
<p>\begin{equation}
\frac{ \partial L }{ \partial \mathbf{x} } = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{x} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} } = \mathbf{1} _ {\{ \mathbf{x} > \mathbf{0} \} } \odot \frac{ \partial L }{ \partial \mathbf{z} },
\end{equation}</p>
<p>where \( \odot \) denotes an <a href="https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29">elementwise product</a>.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>def backward(inputs, output, output_diff):
    return [(inputs[0] > 0) * output_diff]
</code></pre>
</div>
<p>Note that we have to return a list, since we could have multiple inputs.</p>
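<p>A quick sanity check of this backward function (a standard finite-difference trick, not part of the original recipe) compares it against numerical derivatives of a made-up loss \( L = \sum_j c_j z_j \):</p>

```python
import numpy as np

def forward(inputs):
    return np.maximum(inputs[0], 0)

def backward(inputs, output, output_diff):
    return [(inputs[0] > 0) * output_diff]

rng = np.random.RandomState(0)
x = rng.randn(5)
c = rng.randn(5)  # pretend this is dL/dz coming from above

analytic = backward([x], forward([x]), c)[0]

# Central differences on L(x) = sum(c * forward(x))
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(len(x)):
    xp, xm = x.copy(), x.copy()
    xp[i] += eps
    xm[i] -= eps
    numeric[i] = (np.sum(c * forward([xp])) - np.sum(c * forward([xm]))) / (2 * eps)
```

<p>The analytic and numeric gradients should agree to several decimal places (away from the kink at zero, where ReLU is not differentiable).</p>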
<h2 id="dense">Dense</h2>
<p>Moving on to the dense (fully connected) layer where</p>
<p>\begin{equation}
\mathbf{z} = \mathbf{W} ^ \intercal \mathbf{x} + \mathbf{b} \quad\quad (\mathbb{R}^{B \times 1} = \mathbb{R}^{B \times A} \mathbb{R}^{A \times 1} + \mathbb{R}^{B \times 1})
\end{equation}</p>
<p>However, remember that we make no distinction between static and dynamic input, and from the point of view of our Dense node it simply looks like:</p>
<p>\begin{equation}
\mathbf{z} = (\mathbf{x} ^ 2) ^ \intercal \mathbf{x} ^ 1 + \mathbf{x} ^ 3
\end{equation}</p>
<p>Which might translate to:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>def forward(inputs):
    x, W, b = inputs
    return W.T @ x + b
</code></pre>
</div>
<p>For the backward pass, we need to compute all three Jacobians and multiply them by the gradient coming in from above. Let’s start with \( \mathbf{x} \):</p>
<p>\begin{equation}
\frac{\mathrm{d} \mathbf{z} }{\mathrm{d} \mathbf{x}} = \mathbf{W} ^ \intercal \in \mathbb{R} ^ {B \times A}
\end{equation}</p>
<p>which gives us</p>
<p>\begin{equation}
\text{Gradient #1 }\rightarrow \quad\quad
\frac{\partial L}{\partial \mathbf{x}} = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{x} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} }= \mathbf{W} \frac{ \partial L }{ \partial \mathbf{z} }
\quad\quad \leftarrow\text{ Gradient #1}
\end{equation}</p>
<p>Moving on. Since \( \mathbf{W} \in \mathbb{R} ^ {A \times B} \), it means its Jacobian should have the dimensions \( B \times (A \times B) \). We know the bias will drop off, so we can write the output that we will be taking the Jacobian of as:</p>
<p>\begin{equation}
\mathbf{z}' = \left( \sum _ {j = 1} ^ A W _ {j, 1} x _ j, \dots, \sum _ {j = 1} ^ A W _ {j, B} x _ j \right)
\end{equation}</p>
<p>Now, let’s compute the derivative of \( z' _ i \) (and thus \(z _ i \)) with respect to \( W _ {j, k} \):</p>
<p>\begin{equation}
\frac{\partial z _ i}{\partial W _ {j, k}} =
\left\{
\begin{array}{ll}
x _ j & \mbox{if } i = k \\
0 & \mbox{otherwise}
\end{array}
\right.
\end{equation}</p>
<p>With a bit of collapsing things together (Einstein notation is great for this, but the steps are omitted here), we get an outer product of two vectors
\begin{equation}
\text{Gradient #2 }\rightarrow \quad\quad
\frac{\partial L}{\partial \mathbf{W}} = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{W} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} } = \mathbf{x} \left( \frac{ \partial L }{ \partial \mathbf{z} } \right)^\intercal
\quad\quad \leftarrow\text{ Gradient #2}
\end{equation}</p>
<p>The final Jacobian is simply an identity matrix</p>
<p>\begin{equation}
\frac{\mathrm{d} \mathbf{z} }{\mathrm{d} \mathbf{b}} = I \in \mathbb{R} ^ {B \times B}
\end{equation}</p>
<p>so the derivative of the loss with respect to the bias is just the gradient coming in from above, unchanged</p>
<p>\begin{equation}
\text{Gradient #3}\rightarrow \quad\quad
\frac{\partial L}{\partial \mathbf{b}} = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{b} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} }= \frac{ \partial L }{ \partial \mathbf{z} }
\quad\quad \leftarrow\text{ Gradient #3}
\end{equation}</p>
<p>We thus have all three gradients (with no regard as to which ones are parameters). This might translate in code to:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>def backward(inputs, output, output_diff):
    x, W, b = inputs
    return [
        W @ output_diff,
        np.outer(x, output_diff),
        output_diff,
    ]
</code></pre>
</div>
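<p>As with the ReLU, we can sanity-check Gradient #2 numerically (the same finite-difference trick; the sizes and the surrogate loss \( L = \sum_j c_j z_j \) are made up):</p>

```python
import numpy as np

def forward(inputs):
    x, W, b = inputs
    return W.T @ x + b

def backward(inputs, output, output_diff):
    x, W, b = inputs
    return [W @ output_diff,            # Gradient #1
            np.outer(x, output_diff),   # Gradient #2
            output_diff]                # Gradient #3

rng = np.random.RandomState(1)
A, B = 4, 3
x, W, b = rng.randn(A), rng.randn(A, B), rng.randn(B)
c = rng.randn(B)  # pretend dL/dz

grad_W = backward([x, W, b], forward([x, W, b]), c)[1]

# Central differences on L(W) = sum(c * forward(x, W, b))
eps = 1e-6
numeric_W = np.zeros_like(W)
for j in range(A):
    for k in range(B):
        Wp, Wm = W.copy(), W.copy()
        Wp[j, k] += eps
        Wm[j, k] -= eps
        numeric_W[j, k] = (np.sum(c * forward([x, Wp, b])) -
                           np.sum(c * forward([x, Wm, b]))) / (2 * eps)
```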
<p>Now, the frameworks that I know of do not externalize the parameters, so instead
of returning the two last gradients, they would be applied to the internal
parameters through some other means. However, the main ideas and certainly the
math will be exactly the same.</p>
<h2 id="loss">Loss</h2>
<p>You should get the idea by now. The final note is that when we do this for the
Loss layer, we still need to pretend the node has been placed in the middle of
a network with an actual loss at the end of it. The Loss node should not be
different in any way, except that its output is a scalar. However, a good
loss node should theoretically be usable in the middle of a network, so
it should still query <code class="highlighter-rouge">output_diff</code> and use it correctly (even though it will be all ones when
used in the final position).</p>
<h2 id="summary">Summary</h2>
<p>In summary, the usual steps when constructing a new node/layer are:</p>
<ul>
<li>Compute the forward pass</li>
<li>Calculate the Jacobian for all your inputs (static and dynamic alike)</li>
<li>Multiply them with the gradient coming in from above. At this point, we will often realize that we do not have to ever store the entire Jacobian.</li>
</ul>
<h1 id="creating-lmdb-in-python">Creating an LMDB database in Python</h1>
<p><em>2015-04-28</em></p>
<p>LMDB is the database of choice when using <a href="http://caffe.berkeleyvision.org/">Caffe</a> with large datasets. This is a tutorial on how to create an LMDB database from Python. First, let’s look at the pros and cons of using LMDB over HDF5.</p>
<p>Reasons to use HDF5:</p>
<ul>
<li>Simple format to read/write.</li>
</ul>
<p>Reasons to use LMDB:</p>
<ul>
<li>LMDB uses <a href="http://en.wikipedia.org/wiki/Memory-mapped_file">memory-mapped files</a>, giving much better I/O performance.</li>
<li>Works well with really large datasets. The HDF5 files are always read entirely into memory, so you can’t have any HDF5 file exceed your memory capacity. You can easily split your data into several HDF5 files though (just put several paths to <code class="highlighter-rouge">h5</code> files in your text file). Then again, compared to LMDB’s page caching the I/O performance won’t be nearly as good.</li>
</ul>
<h2 id="lmdb-from-python">LMDB from Python</h2>
<p>You will need the Python package <a href="https://lmdb.readthedocs.org/en/release/">lmdb</a> as well as Caffe’s python package (<code class="highlighter-rouge">make pycaffe</code> in Caffe). LMDB provides key-value storage, where each &lt;key, value&gt; pair will be a sample in our dataset. The key will simply be a string version of an ID value, and the value will be a serialized version of the <code class="highlighter-rouge">Datum</code> class in Caffe (which is built using <a href="https://github.com/google/protobuf">protobuf</a>).</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">lmdb</span>
<span class="kn">import</span> <span class="nn">caffe</span>
<span class="n">N</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="c"># Let's pretend this is interesting data</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">N</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">32</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">int64</span><span class="p">)</span>
<span class="c"># We need to prepare the database for the size. We'll set it 10 times</span>
<span class="c"># greater than what we theoretically need. There is little drawback to</span>
<span class="c"># setting this too big. If you still run into problems after raising</span>
<span class="c"># this, you might want to try saving fewer entries in a single</span>
<span class="c"># transaction.</span>
<span class="n">map_size</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">nbytes</span> <span class="o">*</span> <span class="mi">10</span>
<span class="n">env</span> <span class="o">=</span> <span class="n">lmdb</span><span class="o">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'mylmdb'</span><span class="p">,</span> <span class="n">map_size</span><span class="o">=</span><span class="n">map_size</span><span class="p">)</span>
<span class="k">with</span> <span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">(</span><span class="n">write</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
<span class="c"># txn is a Transaction object</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N</span><span class="p">):</span>
<span class="n">datum</span> <span class="o">=</span> <span class="n">caffe</span><span class="o">.</span><span class="n">proto</span><span class="o">.</span><span class="n">caffe_pb2</span><span class="o">.</span><span class="n">Datum</span><span class="p">()</span>
<span class="n">datum</span><span class="o">.</span><span class="n">channels</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">datum</span><span class="o">.</span><span class="n">height</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
<span class="n">datum</span><span class="o">.</span><span class="n">width</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>
<span class="n">datum</span><span class="o">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">tobytes</span><span class="p">()</span> <span class="c"># or .tostring() if numpy < 1.9</span>
<span class="n">datum</span><span class="o">.</span><span class="n">label</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
<span class="n">str_id</span> <span class="o">=</span> <span class="s">'{:08}'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="c"># The encode is only essential in Python 3</span>
<span class="n">txn</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">str_id</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">),</span> <span class="n">datum</span><span class="o">.</span><span class="n">SerializeToString</span><span class="p">())</span>
</code></pre>
</div>
<p>You can also open up and inspect an existing LMDB database from Python:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">lmdb</span>
<span class="kn">import</span> <span class="nn">caffe</span>
<span class="n">env</span> <span class="o">=</span> <span class="n">lmdb</span><span class="o">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'mylmdb'</span><span class="p">,</span> <span class="n">readonly</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">with</span> <span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">()</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
<span class="n">raw_datum</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">b</span><span class="s">'00000000'</span><span class="p">)</span>
<span class="n">datum</span> <span class="o">=</span> <span class="n">caffe</span><span class="o">.</span><span class="n">proto</span><span class="o">.</span><span class="n">caffe_pb2</span><span class="o">.</span><span class="n">Datum</span><span class="p">()</span>
<span class="n">datum</span><span class="o">.</span><span class="n">ParseFromString</span><span class="p">(</span><span class="n">raw_datum</span><span class="p">)</span>
<span class="n">flat_x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">frombuffer</span><span class="p">(</span><span class="n">datum</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span> <span class="c"># np.fromstring on old numpy</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">flat_x</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">datum</span><span class="o">.</span><span class="n">channels</span><span class="p">,</span> <span class="n">datum</span><span class="o">.</span><span class="n">height</span><span class="p">,</span> <span class="n">datum</span><span class="o">.</span><span class="n">width</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">datum</span><span class="o">.</span><span class="n">label</span>
</code></pre>
</div>
<p>Iterating over &lt;key, value&gt; pairs is also easy:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">with</span> <span class="n">env</span><span class="o">.</span><span class="n">begin</span><span class="p">()</span> <span class="k">as</span> <span class="n">txn</span><span class="p">:</span>
<span class="n">cursor</span> <span class="o">=</span> <span class="n">txn</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span>
<span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">cursor</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span>
</code></pre>
</div>
<h1 id="network-initialization">Initialization of deep networks</h1>
<p><em>2015-02-24</em></p>
<p>As we all know, the solution found by a non-convex optimization algorithm (like
stochastic gradient descent) depends on the initial values of the parameters.
This post is about choosing initialization parameters for deep networks and how
this choice affects convergence. We will also discuss the related topic of vanishing
gradients.</p>
<p>First, let’s go back to the time of sigmoidal activation functions and
initialization of parameters using IID Gaussian or uniform distributions with fairly
arbitrarily set variances. Building deep networks was difficult because of
exploding or vanishing activations and gradients. Let’s take activations first:
If all your parameters are too small, the variance of your activations will
drop in each layer. This is a problem if your activation function is sigmoidal,
since it is approximately linear close to 0. That is, you gradually lose your
non-linearity, which means there is no benefit to having multiple layers. If,
on the other hand, your activations become larger and larger, then your
activations will saturate and become meaningless, with gradients approaching 0.</p>
<p><img src="/public/images/activation-functions.svg" alt="Activation functions" /></p>
<p>Let us consider one layer and forget about the bias. Note that the following analysis
and conclusions are taken from Glorot and Bengio [1]. Consider a weight matrix
\( W \in \mathbb{R}^{m \times n} \), where each element was drawn from an IID
Gaussian with variance \( \mathrm{Var}(W) \). Note that we are a bit abusive with notation,
letting \( W \) denote both a matrix and a univariate random variable. We
also assume there is no correlation between our input and our weights, and both
are zero-mean. If we consider one filter (row) in \( W \), say \( \mathbf{w}
\) (a random vector), then the variance of the output signal over the input signal is:</p>
<script type="math/tex; mode=display">\frac{ \mathrm{Var}(\mathbf{w}^T \mathbf{x}) }{ \mathrm{Var}(X) } = \frac{\sum _ {i=1} ^ n \mathrm{Var}(w _ i x _ i)}{\mathrm{Var}(X)} = \frac{n \mathrm{Var}(W) \mathrm{Var}(X)}{\mathrm{Var}(X)}= n\mathrm{Var}(W)</script>
<p>As we build a deep network, we want the variance of the signal going forward in
the network to remain the same, thus it would be advantageous if \( n \mathrm{Var}(W)
= 1. \) The same argument can be made for the gradients, the signal going
backward in the network, and the conclusion is that we would also like \( m
\mathrm{Var}(W) = 1. \) Unless \( n = m, \) it is impossible to satisfy both
of these conditions. In practice, it works well if both are approximately
satisfied. One thing that has never been clear to me is why it is only
necessary to satisfy these conditions when picking the initialization values of
\( W. \) It would seem that we have no guarantee that the conditions will
remain true as the network is trained.</p>
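<p>The effect of the forward condition is easy to see empirically. The following sketch (linear layers only, ignoring the nonlinearity; the sizes and layer count are arbitrary) propagates a unit-variance signal through a stack of square layers, with and without the \( n \mathrm{Var}(W) = 1 \) scaling:</p>

```python
import numpy as np

rng = np.random.RandomState(0)
n = 256                   # fan-in == fan-out for every layer
x = rng.randn(n, 5000)    # unit-variance input signals

h_scaled, h_naive = x, x
for _ in range(10):
    # n * Var(W) = 1: the signal variance is roughly preserved layer to layer
    h_scaled = (rng.randn(n, n) * np.sqrt(1.0 / n)) @ h_scaled
    # Var(W) = 1: the variance grows by roughly a factor n per layer
    h_naive = rng.randn(n, n) @ h_naive

var_scaled = h_scaled.var()  # stays on the order of 1
var_naive = h_naive.var()    # explodes
```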
<p>Nevertheless, this <em>Xavier initialization</em> (after Glorot’s first name) is a neat
trick that works well in practice. However, along came rectified linear units
(ReLU), a non-linearity that is scale-invariant around 0 <em>and</em> does not
saturate at large input values. This seemingly solved both of the problems the
sigmoid function had; or were they just alleviated? I am unsure of how widely
used Xavier initialization is, but if it is not, perhaps it is because ReLU
seemingly eliminated this problem.</p>
<p>However, take one of the most competitive networks as of late, VGG [2]. They do not
use this kind of initialization, although they report that it was tricky to get
their networks to converge. They say that they first trained their most shallow
architecture and then used that to help initialize the second one, and so
forth. They presented 6 networks, so it seems like an awfully complicated
training process to get to the deepest one.</p>
<p>A recent paper by He et al. [3] presents a pretty straightforward generalization
of ReLU and Leaky ReLU. What is more interesting is their emphasis on the
benefits of Xavier initialization even for ReLU. They re-did the derivations
for ReLUs and discovered that the conditions were the same up to a factor of 2.
The difficulty Simonyan and Zisserman had training VGG is apparently avoidable,
simply by using Xavier initialization (or better yet, the ReLU-adjusted version).
Using this technique, He et al. reportedly trained a whopping 30-layer deep
network to convergence in one go.</p>
<p>Another recent paper tackling the signal scaling problem is by Ioffe and
Szegedy[4]. They call the change in scale <em>internal covariate shift</em> and claim
this forces learning rates to be unnecessarily small. They suggest that if all
layers have the same scale and remain so throughout training, a much higher
learning rate becomes practically viable. You cannot just standardize the
signals, since you would lose expressive power (the bias disappears and in the
case of sigmoids we would be constrained to the linear regime). They solve this
by re-introducing two parameters per layer, scaling and bias, added again after
standardization. The training reportedly becomes about 6 times faster and they
present state-of-the-art results on ImageNet. However, I’m not certain this is
the solution that will stick.</p>
<p>I reckon we will see a lot more work on this frontier in the next few years.
Especially since it also relates to the – right now wildly popular –
Recurrent Neural Network (RNN), which connects output signals back as inputs.
The way you train such a network is that you unroll the time axis, treating the
result as an extremely deep feedforward network. This greatly exacerbates the
vanishing gradient problem. A popular solution, called Long Short-Term Memory
(LSTM), is to introduce memory cells, which are a type of teleport that allows
a signal to jump ahead many time steps. This means that the gradient is
retained for all those time steps and can be propagated back to a much earlier
time without vanishing.</p>
<p>This area is far from solved, and until then I think I will be sticking to
Xavier initialization. If you are using Caffe, the one take-away of this post
is to use the following on all your layers:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>weight_filler {
  type: "xavier"
}
</code></pre>
</div>
<h3 id="references">References</h3>
<ol>
<li>
<p>X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International conference on artificial intelligence and statistics, 2010, pp. 249–256.</p>
</li>
<li>
<p>K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. [<a href="http://arxiv.org/pdf/1409.1556v5">pdf</a>]</p>
</li>
<li>
<p>K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” arXiv:1502.01852 [cs], Feb. 2015. [<a href="http://arxiv.org/pdf/1502.01852v1">pdf</a>]</p>
</li>
<li>
<p>S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” arXiv:1502.03167 [cs], Feb. 2015. [<a href="http://arxiv.org/pdf/1502.03167v2">pdf</a>]</p>
</li>
</ol>
<h1 id="local-torch-installation">Local Torch installation</h1>
<p><em>2015-02-20</em></p>
<p>This post describes how to do a local Torch7 installation while ignoring a
potentially conflicting global installation in <code class="highlighter-rouge">/usr/local/share</code>.</p>
<p>Doing a local Torch7 installation is easily done using
<a href="https://github.com/torch/distro">torch/distro</a>. However, when running
<code class="highlighter-rouge">install.sh</code>, I ran into the following error:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>/usr/larsson/torch/install/bin/luajit: /tmp/luarocks_cutorch-scm-1-5301/cutorch/TensorMath.lua:184: attempt to call method 'registerDefaultArgument' (a nil value)
stack traceback:
/tmp/luarocks_cutorch-scm-1-5301/cutorch/TensorMath.lua:184: in main chunk
[C]: at 0x00405330
make[2]: *** [TensorMath.c] Error 1
make[1]: *** [CMakeFiles/cutorch.dir/all] Error 2
make: *** [all] Error 2
</code></pre>
</div>
<p>This issue is documented <a href="https://github.com/torch/cutorch/issues/106">here</a>
and the solution is to remove the global installation in <code class="highlighter-rouge">/usr/local/share</code>.
This was not an option for me. This is what I did.</p>
<p>I cloned <code class="highlighter-rouge">torch/distro</code> as you would, let us say to <code class="highlighter-rouge">~/torch</code>:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>git clone git@github.com:torch/distro.git ~/torch --recursive
</code></pre>
</div>
<p>I went into <code class="highlighter-rouge">~/torch</code> and ran <code class="highlighter-rouge">install.sh</code>, which failed. Even so, the base Torch
installation still succeeded for me, despite some packages failing. Check that this is the case by
running <code class="highlighter-rouge">which th</code> and <code class="highlighter-rouge">which luarocks</code> - both should point into <code class="highlighter-rouge">~/torch</code>. If
so, run <code class="highlighter-rouge">th</code> and type in:</p>
<div class="language-lua highlighter-rouge"><pre class="highlight"><code><span class="o">></span> <span class="nb">print</span><span class="p">(</span><span class="nb">package.path</span><span class="p">)</span>
<span class="o">></span> <span class="nb">print</span><span class="p">(</span><span class="nb">package.cpath</span><span class="p">)</span>
</code></pre>
</div>
<p>Copy these strings to your <code class="highlighter-rouge">LUA_PATH</code> and <code class="highlighter-rouge">LUA_CPATH</code>, respectively. Leave out
any references to <code class="highlighter-rouge">/usr/local/share</code>! This might look something like this in
your <code class="highlighter-rouge">~/.bashrc</code>:</p>
<div class="language-bash highlighter-rouge"><pre class="highlight"><code><span class="nb">export </span><span class="nv">TORCH_DIR</span><span class="o">=</span><span class="nv">$HOME</span>/torch
<span class="nb">export </span><span class="nv">LUA_PATH</span><span class="o">=</span><span class="s2">"</span><span class="nv">$TORCH_DIR</span><span class="s2">/install/share/lua/5.1/?.lua;</span><span class="nv">$TORCH_DIR</span><span class="s2">/install/share/lua/5.1/?/init.lua;</span><span class="nv">$TORCH_DIR</span><span class="s2">/install/share/luajit-2.1.0-alpha/?.lua"</span>
<span class="nb">export </span><span class="nv">LUA_CPATH</span><span class="o">=</span><span class="s2">"</span><span class="nv">$TORCH_DIR</span><span class="s2">/install/lib/lua/5.1/?.so"</span>
</code></pre>
</div>
<p>Note that you have to quote the strings since their usage of <code class="highlighter-rouge">;</code> as a delimiter
does not play well with bash. Once saved, refresh your shell by running <code class="highlighter-rouge">source ~/.bashrc</code> and try
installing the packages that failed. I did</p>
<div class="highlighter-rouge"><pre class="highlight"><code>luarocks install cutorch
luarocks install cunn
</code></pre>
</div>
<p>This time around it worked and I was good to go.</p>
Python dictionary to HDF52014-11-11T00:00:00+00:00http://deepdish.io//2014/11/11/python-dictionary-to-hdf5<p>I used to be a big fan of Numpy’s
<a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.savez.html">savez</a>
and
<a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html">load</a>,
since you can throw any Python structure in there that you want to save.
However, these files are not compatible between Python 2 and 3, so they do not
fit my needs anymore since I have computers running both versions. I took the
matter to
<a href="http://stackoverflow.com/questions/18071075/saving-dictionaries-to-file-numpy-and-python-2-3-friendly">Stackoverflow</a>,
but a clear winner did not emerge.</p>
<p>Finally, I decided to write my own alternative to <code class="highlighter-rouge">savez</code> based on HDF5 using
<a href="http://www.pytables.org/">PyTables</a>. The result can be found in our
<a href="https://github.com/uchicago-cs/deepdish">deepdish</a> project (in
<a href="https://github.com/uchicago-cs/deepdish/blob/master/deepdish/io/hdf5io.py">hdf5io.py</a>).
It also doubles as a general-purpose HDF5 saver/loader. First, an example of
how to write a <a href="http://caffe.berkeleyvision.org">Caffe</a>-compatible data file:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">deepdish</span> <span class="kn">as</span> <span class="nn">dd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">100</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">32</span><span class="p">))</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span>
<span class="n">dd</span><span class="o">.</span><span class="n">io</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">'test.h5'</span><span class="p">,</span> <span class="p">{</span><span class="s">'data'</span><span class="p">:</span> <span class="n">X</span><span class="p">,</span> <span class="s">'label'</span><span class="p">:</span> <span class="n">y</span><span class="p">},</span> <span class="n">compression</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
</code></pre>
</div>
<p>Note that Caffe does not like the compressed version, so we are turning off
compression. Let’s take a look at it:</p>
<div class="language-bash highlighter-rouge"><pre class="highlight"><code><span class="gp">$ </span>h5ls test.h5
data Dataset <span class="o">{</span>100, 3, 32, 32<span class="o">}</span>
label Dataset <span class="o">{</span>100<span class="o">}</span>
</code></pre>
</div>
<p>It will load into a dictionary with <code class="highlighter-rouge">dd.io.load</code>:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">dd</span><span class="o">.</span><span class="n">io</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">'test.h5'</span><span class="p">)</span>
</code></pre>
</div>
<p>Now, it does much more than that. It can save numbers, lists, strings,
dictionaries and numpy arrays. It will try its best to store things natively in
HDF5, so that it could be read by other programs as well. Another example:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="o">>>></span> <span class="n">x</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">3</span><span class="p">),</span> <span class="p">{</span><span class="s">'d'</span><span class="p">:</span> <span class="mi">100</span><span class="p">,</span> <span class="s">'e'</span><span class="p">:</span> <span class="s">'hello'</span><span class="p">}]</span>
<span class="o">>>></span> <span class="n">dd</span><span class="o">.</span><span class="n">io</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">'test.h5'</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">dd</span><span class="o">.</span><span class="n">io</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s">'test.h5'</span><span class="p">)</span>
<span class="p">[</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]),</span> <span class="p">{</span><span class="s">'e'</span><span class="p">:</span> <span class="s">'hello'</span><span class="p">,</span> <span class="s">'d'</span><span class="p">:</span> <span class="mi">100</span><span class="p">}]</span>
</code></pre>
</div>
<p>If it doesn’t know how to save a particular data type, it will fall back to
pickling. This means it will still work, but you will lose the compatibility
across Python 2 and 3.</p>
Caffe with weighted samples2014-11-04T00:00:00+00:00http://deepdish.io//2014/11/04/caffe-with-weighted-samples<p><a href="http://caffe.berkeleyvision.org/">Caffe</a> is a great framework for training and
running deep learning networks. However, it does not support weighted samples,
where you assign an importance to each sample. A weight (importance)
of 2 for a sample should have the same effect as including a duplicate of
that sample.</p>
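<p>The duplicate-sample semantics are easy to check outside of Caffe. The sketch below is a plain-Python illustration (not code from the fork): with a weighted log loss, a weight of 2 yields exactly the same total loss as repeating the sample.</p>

```python
import math

def weighted_log_loss(true_class_probs, weights):
    # Negative log-likelihood where each sample's term is scaled by its weight.
    return -sum(w * math.log(p) for p, w in zip(true_class_probs, weights))

# Predicted probability of the true class for two samples.
probs = [0.9, 0.6]

# Weighting the second sample by 2 ...
loss_weighted = weighted_log_loss(probs, [1.0, 2.0])

# ... is the same as duplicating it with unit weights.
loss_duplicated = weighted_log_loss([0.9, 0.6, 0.6], [1.0, 1.0, 1.0])

assert abs(loss_weighted - loss_duplicated) < 1e-12
```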
<p>I created an experimental fork of Caffe that supports this:</p>
<ul>
<li><a href="https://github.com/gustavla/caffe-weighted-samples">gustavla/caffe-weighted-samples</a></li>
</ul>
<p>This modification is so far rough around the edges and likely easy to break. I
have also not implemented support for it in all the loss layers, but only a
select few.</p>
<p>It works by adding the blob <code class="highlighter-rouge">sample_weight</code> to the dataset, alongside <code class="highlighter-rouge">data</code>
and <code class="highlighter-rouge">label</code>. The easiest way is to save the data as HDF5, which can easily be
done through Python:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">h5py</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="c"># X should have shape (samples, color channel, width, height)</span>
<span class="c"># y should have shape (samples,)</span>
<span class="c"># w should have shape (samples,)</span>
<span class="c"># They should have dtype np.float32, even label</span>
<span class="c"># DIR is an absolute path (important!)</span>
<span class="n">h5_fn</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">DIR</span><span class="p">,</span> <span class="s">'data.h5'</span><span class="p">)</span>
<span class="k">with</span> <span class="n">h5py</span><span class="o">.</span><span class="n">File</span><span class="p">(</span><span class="n">h5_fn</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">f</span><span class="p">[</span><span class="s">'data'</span><span class="p">]</span> <span class="o">=</span> <span class="n">X</span>
<span class="n">f</span><span class="p">[</span><span class="s">'label'</span><span class="p">]</span> <span class="o">=</span> <span class="n">y</span>
<span class="n">f</span><span class="p">[</span><span class="s">'sample_weight'</span><span class="p">]</span> <span class="o">=</span> <span class="n">w</span>
<span class="n">text_fn</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">DIR</span><span class="p">,</span> <span class="s">'data.txt'</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">text_fn</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">h5_fn</span><span class="p">,</span> <span class="nb">file</span><span class="o">=</span><span class="n">f</span><span class="p">)</span>
</code></pre>
</div>
<p>Or, if you have our <a href="https://github.com/uchicago-cs/deepdish">deepdish</a> package
installed, saving the HDF5 can be done as follows (also see <a href="/2014/11/11/python-dictionary-to-hdf5/">this post</a>):</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">dd</span><span class="o">.</span><span class="n">io</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">h5_fn</span><span class="p">,</span> <span class="nb">dict</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">X</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="n">y</span><span class="p">,</span> <span class="n">sample_weight</span><span class="o">=</span><span class="n">w</span><span class="p">))</span>
</code></pre>
</div>
<p>Now, load the <code class="highlighter-rouge">sample_weight</code> in your data layer:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>layers {
  name: "example"
  type: HDF5_DATA
  top: "data"
  top: "label"
  top: "sample_weight"  # <-- add this
  hdf5_data_param {
    source: "/path/to/data.txt"
    batch_size: 100
  }
}
</code></pre>
</div>
<p>The file <code class="highlighter-rouge">data.txt</code> should contain a single line with the absolute path to
<code class="highlighter-rouge">h5_fn</code>, for instance <code class="highlighter-rouge">/path/to/data.h5</code>. Next, hook it up to the softmax layer
as:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>layers {
  name: "loss"
  type: SOFTMAX_LOSS
  bottom: "last_layer"
  bottom: "label"
  bottom: "sample_weight"  # <-- add this
  top: "loss"
}
</code></pre>
</div>
<p>The layer <code class="highlighter-rouge">SOFTMAX_LOSS</code> is one of the few layers that have been
adapted to use <code class="highlighter-rouge">sample_weight</code>. If you want to use one that has not been
implemented yet, take inspiration from
<a href="https://github.com/gustavla/caffe-weighted-samples/blob/master/src/caffe/layers/softmax_loss_layer.cpp">src/caffe/layers/softmax_loss_layer.cpp</a>.
Remember to also update <code class="highlighter-rouge">hpp</code> and <code class="highlighter-rouge">cu</code> files where needed. If you end up doing this, pull requests are welcome.</p>
Hinton's Dark Knowledge2014-10-28T00:00:00+00:00http://deepdish.io//2014/10/28/hintons-dark-knowledge<p>On Thursday, October 2, 2014 <a href="http://www.cs.toronto.edu/~hinton/">Geoffrey Hinton</a> gave a talk (<a href="http://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/geoff_hinton_dark14.pdf">slides</a>, <a href="https://www.youtube.com/watch?v=phmEyJa4I7o">video</a>) on what he calls “dark knowledge” which
he claims is most of what <a href="http://deeplearning.net/">deep learning methods</a>
actually learn. The talk presented an idea that had been introduced in
<a href="http://www.cs.cornell.edu/~caruana/compression.kdd06.pdf">(Caruana, 2006)</a>
where a more complex model is used to train a simpler, compressed model.
The main point of the talk introduces the idea that classifiers built from
a <a href="http://en.wikipedia.org/wiki/Softmax_function">softmax function</a> have
a great deal more information contained in them than just a classifier; the
correlations in the softmax outputs are very informative. For example, when
building a computer vision system to detect <code class="highlighter-rouge">cats</code>, <code class="highlighter-rouge">dogs</code>, and <code class="highlighter-rouge">boats</code>, the output
entries for <code class="highlighter-rouge">cat</code> and <code class="highlighter-rouge">dog</code> in a softmax classifier will always have more
correlation than <code class="highlighter-rouge">cat</code> and <code class="highlighter-rouge">boat</code> since <code class="highlighter-rouge">cats</code> look similar to <code class="highlighter-rouge">dogs</code>.</p>
<p>Dark knowledge was used by Hinton in two different contexts:</p>
<ul>
<li>Model compression: using a simpler model with fewer parameters to match the performance of a larger model.</li>
<li>Specialist networks: training models specialized to disambiguate between a small number of easily confusable classes.</li>
</ul>
<h2 id="preliminaries">Preliminaries</h2>
<p>A deep neural network typically maps an input vector \( \mathbf{x}\in\mathbb{R}^{D _ {in}} \) to a set of scores \(f(\mathbf{x}) \in \mathbb{R}^C \) for each of \( C \) classes. These scores are then interpreted as a posterior distribution over the labels using a <a href="http://en.wikipedia.org/wiki/Softmax_function">softmax function</a></p>
<script type="math/tex; mode=display">\hat{\mathbb{P}}(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\Theta}) = \mathrm{softmax}(f(\mathbf{x}; \boldsymbol{\Theta})).</script>
<p>The parameters of the entire network are collected in \( \boldsymbol{\Theta}
\). The goal of the learning algorithm is to estimate \( \boldsymbol{\Theta}
\). Usually, the parameters are learned by minimizing the log loss for all
training samples</p>
<script type="math/tex; mode=display">L ^ \mathrm{(hard)} = \sum _ {n=1}^N L(\mathbf{x} _ n, y _ n; \boldsymbol{\Theta})
= -\sum _ {n=1}^N \sum _ {c=1}^C 1 _ {\{y _ n = c\}} \log \hat{\mathbb{P}}(y _ c \mid \mathbf{x} _ n ; \boldsymbol{\Theta} ),</script>
<p>which is the negative of the log-likelihood of the data under the logistic
regression model. The parameters \( \boldsymbol{\Theta} \) are estimated with
iterative algorithms since there is no closed-form solution.</p>
<p>This loss function may be viewed as a cross entropy between an empirical
posterior distribution and a predicted posterior distribution given by the
model. In the case above, the empirical posterior distribution is simply a
1-hot distribution that puts all its mass at the ground truth label. This
cross-entropy view motivates the dark knowledge training paradigm, which can be
used to do model compression.</p>
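<p>For concreteness, here is a minimal pure-Python sketch of the softmax and the hard-target cross entropy described above (real frameworks operate on batches and add further numerical-stability tricks):</p>

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability; the result sums to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_cross_entropy(scores, label):
    # Log loss against a 1-hot empirical posterior: -log p(true class).
    return -math.log(softmax(scores)[label])

scores = [2.0, 1.0, 0.1]
p = softmax(scores)
loss = hard_cross_entropy(scores, 0)
```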
<h2 id="model-compression">Model compression</h2>
<p>Instead of training the cross entropy against the labeled data one could train
it against the posteriors of a previously trained model. In Hinton’s narrative,
this previous model is an ensemble method, which may contain many large deep
networks of similar or various architectures. Ensemble methods have been
shown to consistently achieve strong performance on a variety of tasks for deep
neural networks. However, these networks have a large number of parameters,
which makes it computationally demanding to do inference on new samples. To
alleviate this, once the ensemble is trained and its error rate is sufficiently
low, we use the softmax outputs from the ensemble method to construct training
targets for the smaller, simpler model.</p>
<p>In particular, for each data point \( \mathbf{x} _ n \), our first bigger
ensemble network may make the prediction</p>
<script type="math/tex; mode=display">\mathbf{\hat y} _ n ^ \mathrm{(big)} = \hat{\mathbb{P}} (\mathbf{y} \mid \mathbf{x} _ n; \boldsymbol{\Theta} ^ \mathrm{(big)}).</script>
<p>The idea is to train the smaller network using this output distribution rather
than the true labels. However, since the posterior estimates are typically low
entropy, the dark knowledge is largely indiscernible without a log transform.
To get around this, Hinton increases the entropy of the posteriors by using a
transform that “raises the temperature” as</p>
<script type="math/tex; mode=display">[g(\mathbf{y}; T)] _ k = \frac{y _ k^{1/T} }{\sum _ {k'} y _ {k'} ^ {1/T}},</script>
<p>where \( T \) is a temperature parameter that when raised increases the entropy.
We now set our target distributions as</p>
<script type="math/tex; mode=display">\mathbf{y} ^ \mathrm{(target)} _ n = g(\mathbf{y} ^ \mathrm{(big)} _ n; T).</script>
<p>The loss function becomes</p>
<script type="math/tex; mode=display">L ^ \mathrm{(soft)} = \sum _ {n=1}^NL(\mathbf{x} _ n,y _ n;\boldsymbol{\Theta}^\mathrm{(small)}) = -\sum _ {n=1}^N \sum _ {c=1}^C y ^ \mathrm{(target)} _ {n, c} \log \hat{\mathbb{P}}(y _ c \mid \mathbf{x} _ n ; \boldsymbol{\Theta} ^ \mathrm{(small)}).</script>
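<p>The temperature transform \( g \) above is easy to verify numerically. In this illustrative pure-Python sketch, raising \( T \) above 1 flattens a typical low-entropy softmax output and increases its entropy, while preserving the ranking of the classes:</p>

```python
import math

def raise_temperature(y, T):
    # [g(y; T)]_k = y_k^(1/T) / sum_k' y_k'^(1/T)
    powered = [p ** (1.0 / T) for p in y]
    total = sum(powered)
    return [p / total for p in powered]

def entropy(y):
    return -sum(p * math.log(p) for p in y if p > 0)

y = [0.95, 0.04, 0.01]       # a confident (low-entropy) softmax output
soft = raise_temperature(y, 3.0)

# The soft targets are flatter, but the class ordering is unchanged.
assert entropy(soft) > entropy(y)
```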
<p>Hinton mentioned that the best results are achieved by combining the two loss functions. At
first, we thought he meant alternating between them, as in train one batch with
\( L ^ \mathrm{(hard)} \) and the other with \( L ^ \mathrm{(soft)} \).
However, after a discussion with a professor who also attended the talk, it
seems as though Hinton took a convex combination of the two loss functions</p>
<script type="math/tex; mode=display">L = \alpha L ^ \mathrm{(soft)} + (1 - \alpha) L ^ \mathrm{(hard)},</script>
<p>where \( \alpha \) is a mixing parameter. After asking Hinton about it, this
professor had the impression that an appropriate value is \( \alpha = 0.9 \).</p>
<p>One of the main settings for where this is useful is in the context of
speech recognition. Here an ensemble phone recognizer may achieve a low phone
error rate, but it may be too slow to process user input on the fly. A simpler
model replicating the ensemble method, however, can bring some of the
classification gains of large-scale ensemble deep network models to practical
speech systems.</p>
<h2 id="specialist-networks">Specialist networks</h2>
<p>Specialist networks are a way of using dark knowledge to improve the performance of deep network models regardless of their underlying
complexity. They are used in the setting where there are many different classes. As before, a deep network is trained on the data
and each data point is assigned a target that corresponds to the temperature-adjusted softmax output. These softmax outputs
are then clustered multiple times using k-means, and the resulting clusters indicate easily confusable data points that
come from a subset of classes. Specialist networks
are then trained only on the data in these clusters, using a restricted set of classes: all classes not contained in
the cluster are treated as coming from a single “other” class. Each specialist network is then trained by alternating between the one-hot targets and the temperature-adjusted soft targets.
Combining the various specialist networks into an ensemble improves the performance of the overall system.</p>
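<p>The clustering step can be sketched as follows. This toy version is our own illustration (not Hinton's code): it runs a naive k-means over per-class average softmax outputs, so classes whose predicted probabilities co-vary, such as cat and dog, land in the same specialist's cluster:</p>

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Naive k-means on a list of equal-length vectors.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [[sum(col) / len(c) for col in zip(*c)] if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

# Each row describes one class by its average softmax output over the data:
# cat/dog co-vary and car/truck co-vary, giving two natural specialist groups.
class_profiles = [
    [0.70, 0.25, 0.03, 0.02],  # cat
    [0.28, 0.65, 0.04, 0.03],  # dog
    [0.02, 0.03, 0.60, 0.35],  # car
    [0.03, 0.02, 0.33, 0.62],  # truck
]
clusters = kmeans(class_profiles, 2)
```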
<p>One technical hiccup is that the specialist networks are trained on different classes than the full
network, so combining the softmax outputs from multiple networks requires a combination trick. Essentially, there is an optimization problem
to solve: the probability assigned to a specialist’s catch-all “dustbin” class must match the summed softmax outputs of the classes it absorbs. For example, if
cars and cheetahs are grouped together into one class for your dog detector, you combine that network with your cars-versus-cheetahs network by ensuring
that the output probabilities for cars and cheetahs sum to a probability similar to the catch-all output of the dog detector.</p>
GNU Parallel2014-09-15T00:00:00+00:00http://deepdish.io//2014/09/15/gnu-parallel<p>I was reading the <a href="http://caffe.berkeleyvision.org/gathered/examples/imagenet.html">ImageNet
tutorial</a> for
<a href="http://caffe.berkeleyvision.org/">Caffe</a> (a deep learning framework), in which
they need to resize a large number of images. It struck me that they might not
be aware of <a href="http://www.gnu.org/software/parallel/">GNU Parallel</a>, since it
is a great tool for this task. I recommend it to any data scientist out there
since it is so simple to use and, like many other GNU tools, there is a good chance it is
already installed on your computer. If not, run <code class="highlighter-rouge">apt-get install parallel</code> on
Debian. It might suggest that you install <code class="highlighter-rouge">moreutils</code> to get parallel, but
this installs the wrong software (<a href="http://www.gnu.org/software/parallel/history.html">explanation</a>).</p>
<p>In the writeup, it says that the author used his own MapReduce framework to do
it, but it can also be done sequentially as:</p>
<div class="language-sh highlighter-rouge"><pre class="highlight"><code><span class="k">for </span>name <span class="k">in</span> <span class="k">*</span>.jpeg; <span class="k">do
</span>convert -resize 256x256<span class="se">\!</span> <span class="nv">$name</span> <span class="nv">$name</span>
<span class="k">done</span>
</code></pre>
</div>
<p>Instead of this sequential approach, you can run it in parallel with even less
typing:</p>
<div class="language-sh highlighter-rouge"><pre class="highlight"><code>parallel convert -resize 256x256<span class="se">\!</span> <span class="o">{}</span> <span class="o">{}</span> ::: <span class="k">*</span>.jpeg
</code></pre>
</div>
<p>GNU Parallel will insert each filename at <code class="highlighter-rouge"><span class="p">{}</span></code> to form a command. Multiple
commands will execute concurrently if you have a multicore computer.</p>
<p>If you have ever been tempted to do this kind of parallelization by adding <code class="highlighter-rouge">&</code>
at the end of each command in the for loop, then Parallel is definitely for
you. Adding <code class="highlighter-rouge">&</code> introduces two problems that Parallel solves: (1) you don’t
know when all of them are done and there is no easy way to <em>join</em> them, and (2)
it will start a process for each command all at once, while Parallel will
schedule your tasks and execute only as many in parallel as your computer
can handle.</p>
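<p>For comparison, the same two fixes that Parallel provides, joining all jobs and capping concurrency, look like this in Python with a worker pool (an illustrative sketch; a real task would shell out to <code class="highlighter-rouge">convert</code> instead of calling a stub):</p>

```python
from concurrent.futures import ThreadPoolExecutor

def resize(name):
    # Stub standing in for: convert -resize 256x256! name name
    return "resized-" + name

names = ["a.jpeg", "b.jpeg", "c.jpeg"]

# The pool runs at most max_workers jobs at a time, and map() joins them all
# before the with-block exits -- the two things a bare `&` does not give you.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(resize, names))
```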
<h2 id="basics">Basics</h2>
<p>Parallel can also take input from the pipe, in which case it is similar to xargs:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>ls *.jpeg | parallel mv {} {.}-old.jpeg
</code></pre>
</div>
<p>This command inserts <code class="highlighter-rouge">-old</code> into the filenames of all the JPEG files in the
directory. The <code class="highlighter-rouge"><span class="p">{</span><span class="err">.</span><span class="p">}</span></code> is similar to <code class="highlighter-rouge"><span class="p">{}</span></code>, except it removes the extension. There
are many replacement strings like this:</p>
<div class="language-sh highlighter-rouge"><pre class="highlight"><code>parallel convert -resize 256x256<span class="se">\!</span> <span class="o">{}</span> resized/<span class="o">{</span>/<span class="o">}</span> ::: images/<span class="k">*</span>.jpeg
</code></pre>
</div>
<p>This resizes all the JPEG files inside the folder <code class="highlighter-rouge">images</code> and places the
output in the folder <code class="highlighter-rouge">resized</code>. The replacement string <code class="highlighter-rouge"><span class="p">{</span><span class="err">/</span><span class="p">}</span></code> extracts the
filename and is thus similar to the command <code class="highlighter-rouge">basename</code>. For this example we
went back to the <code class="highlighter-rouge">:::</code> style input, which in many cases is preferable. For
instance, it can be used several times to form a product of the input:</p>
<div class="language-bash highlighter-rouge"><pre class="highlight"><code>parallel <span class="s2">"echo {1}: {2}"</span> ::: A B C D ::: <span class="o">{</span>1..8<span class="o">}</span>
</code></pre>
</div>
<p>Note how we now used <code class="highlighter-rouge"><span class="p">{</span><span class="err">1</span><span class="p">}</span></code> and <code class="highlighter-rouge"><span class="p">{</span><span class="err">2</span><span class="p">}</span></code> to refer to the input. We also quoted the
command, which is optional and might make things clearer (if you want to use
pipes inside your command, it is required). Using multiple inputs is great for
doing grid searches of parameters. However, let’s say we don’t want to do all
combinations of the product and instead want to specify each pair of inputs
manually. This behavior is easily achieved using the <code class="highlighter-rouge">--xapply</code> option:</p>
<div class="language-bash highlighter-rouge"><pre class="highlight"><code>parallel --xapply <span class="s2">"echo {1}: {2}"</span> ::: A B C D ::: <span class="o">{</span>1..8<span class="o">}</span>
</code></pre>
</div>
<p>Note how the letters will wrap around.</p>
<p>In some settings, you might find it easier to create a file, <code class="highlighter-rouge">commands.sh</code>,
with all the commands written out:</p>
<div class="language-sh highlighter-rouge"><pre class="highlight"><code>./experiment 10.0 1.5 > exp1.txt
./experiment 20.0 1.5 --extra-param 3.0 > exp2.txt
</code></pre>
</div>
<p>Now run them in parallel by:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>parallel < commands.sh # OR
parallel :::: commands.sh
</code></pre>
</div>
<p>The latter is a newer syntax (note that it has <em>four</em> colons), which again I
prefer since it can be strung together multiple times and you can freely mix
<code class="highlighter-rouge">:::</code> and <code class="highlighter-rouge">::::</code>.</p>
<h2 id="multiple-computers-using-ssh">Multiple computers using SSH</h2>
<p>Parallel can also be used to parallelize between multiple computers. Let’s say
you have SSH access to the hostnames or SSH aliases <code class="highlighter-rouge">node1</code> and <code class="highlighter-rouge">node2</code> without
prompting for password. Now you can tell Parallel to distribute the job across
both nodes using the <code class="highlighter-rouge">-S</code> option:</p>
<div class="language-sh highlighter-rouge"><pre class="highlight"><code>parallel -S node1,node2 -j8 convert -resize 256x256<span class="se">\!</span> <span class="o">{}</span> <span class="o">{}</span> ::: <span class="k">*</span>.jpeg
</code></pre>
</div>
<p>You can refer to the local computer as <code class="highlighter-rouge">:</code> (e.g. do <code class="highlighter-rouge">-S :,node1,node2</code> to
include the current computer). I also added <code class="highlighter-rouge">-j8</code> to specify that I want each
node to run 8 jobs concurrently. You can try leaving this out, but Parallel
could have a hard time automatically determining how many jobs to use for each
node.</p>
<p>We assumed in this example that the files existed on the other nodes (for
instance through NFS). However, Parallel can also transfer the files to the
worker nodes and transfer the results back by adding <code class="highlighter-rouge">--trc {}</code>.</p>
<h2 id="more-information">More information</h2>
<p>For more information I recommend:</p>
<ul>
<li><a href="http://www.gnu.org/software/parallel/parallel_tutorial.html">GNU Parallel Tutorial</a> - Very readable with lots of information</li>
<li><a href="https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1">GNU Parallel Videos</a> - Screencasts by the author of Parallel</li>
<li><a href="http://docs.rcc.uchicago.edu/software/scheduler/parallel/README.html">Parallel Batch Job Submission</a> - How to use Parallel on a SLURM cluster</li>
</ul>
<p><em>Thanks to Ole Tange, the author of GNU Parallel, for pointing out errors in this post.</em></p>
<!---
## Multiple computers using a cluster
This will depenend a bit on your cluster and its scheduler. However, as an
example, we run things on University of Chicago's [RCC
cluster](http://rcc.uchicago.edu/) which uses the scheduler SLURM. Batch jobs
are submitted using `sbatch`, but sub-jobs can be submitted inside your batch
job using `srun`. So, in order to use Parallel across the cluster, we can submit
a batch job that looks like this:
```sh
#SBATCH --ntasks 64
#SBATCH --exclusive
parallel -n500 --delay 0.2 -j4 "srun --exclusive -N1 ./batch-resize.sh" ::: *.jpeg
```
To avoid sending too many `srun`, we have added a small delay and split it up
into batches of 500. The `-n500` means it will send 500 filenames to each
command, so one call to `srun` will process 500 images. For `srun`, we specify
`-N1` in order to send it to one node only. The `batch-resize.sh` takes any
number of parameters and performs a resize on all of them, so it might look like:
```sh
parallel convert -resize 256x256\! {} {} ::: $@
```
This may not be relevant to your cluster, but the idea could be similar. You
might also be able to do the SSH solution if you know the hostnames of your
worker nodes. Check with your cluster's staff since chances are they know about
GNU Parallel and how to deploy it onto the cluster.
-->
<h1 id="deep-pca-nets">Deep PCA Nets</h1>
<p><em>2014-08-31</em></p>
<p>Tsung-Han Chan and colleagues recently uploaded to <a href="http://arxiv.org">ArXiv</a> an <a href="http://arxiv.org/abs/1404.3606">interesting paper</a> proposing a simple but effective baseline for deep learning. They propose a novel two-layer architecture in which
each layer convolves the image with a filterbank, followed by binary hashing, and finally block histogramming for indexing and pooling. The filters in the filterbank are learned using simple algorithms such as random projections (RandNet),
<a href="http://en.wikipedia.org/wiki/Principal_component_analysis">principal component analysis</a> (PCANet), and linear discriminant analysis (LDANet). They report results competitive with those obtained
by other deep learning methods
and <a href="http://www.di.ens.fr/data/scattering">scattering networks</a> (introduced by Stéphane Mallat) on a variety of tasks: face recognition, face verification, handwritten digit recognition, texture discrimination, and object recognition:</p>
<table>
<thead>
<tr>
<th>Dataset</th>
<th>Task</th>
<th>Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html">Extended Yale B</a></td>
<td>Face Recognition</td>
<td>99.58%</td>
</tr>
<tr>
<td><a href="http://www2.ece.ohio-state.edu/~aleix/ARdatabase.html">AR</a></td>
<td>Face Recognition</td>
<td>95.00%</td>
</tr>
<tr>
<td><a href="http://www.itl.nist.gov/iad/humanid/feret/feret_master.html">FERET</a> (average)</td>
<td>Face Recognition</td>
<td>97.25%</td>
</tr>
<tr>
<td><a href="http://yann.lecun.com/exdb/mnist/">MNIST</a></td>
<td>Digit Recognition</td>
<td>99.38%</td>
</tr>
<tr>
<td><a href="http://www1.cs.columbia.edu/CAVE//exclude/curet/.index.html">CUReT</a></td>
<td>Texture Recognition</td>
<td>99.61%</td>
</tr>
<tr>
<td><a href="http://www.cs.toronto.edu/~kriz/cifar.html">CIFAR10</a></td>
<td>Object Recognition</td>
<td>78.67%</td>
</tr>
</tbody>
</table>
<p>The authors achieve state-of-the-art results on several of the <a href="http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations">MNIST Variations</a> tasks. The method compares favorably to hand-designed features, wavelet-derived features, and deep-network learned features.</p>
<h2 id="pca-net-algorithm">PCA-Net Algorithm</h2>
<p>The main algorithm cascades two filterbank convolutions
with an intermediate mean normalization step,
followed by
a binary hashing step and a final histogramming step. Training
involves estimating the filterbanks used for the convolutions,
and estimating the classifier to be used on top of the ultimate histogram-derived features.</p>
<h3 id="filterbank-convolutions">Filterbank Convolutions</h3>
<p>The filterbanks are estimated by performing principal components
analysis (PCA) over patches. We extract all of the \( 7\times 7 \)
patches from all of the images and vectorize them so that each patch
is a flat 49-entry vector: <script type="math/tex">\mathbf{v}\in\mathbb{R}^{7\times 7} \to \operatorname{vec}\mathbf{v}\in\mathbb{R}^{49}</script>
where \( \mathbf{v} \) is an image patch in the picture, e.g.:</p>
<p><img src="/public/images/PCANet_mnist5_patch.png" alt="Image Patch Picture" /></p>
<p>For each patch vector we take the mean
of the entries (the DC-component) and then subtract that mean
from each entry of the vector so that all of our patches
are now zero mean. We perform PCA over these zero-mean
patch vectors and retain
the top eight components \( W\in\mathbb{R}^{49\times 8} \). Each
principal component (a column of \( W \)) is a filter and may be
converted into a \( 7\times 7 \) kernel which is convolved with
the input images. The input images are zero-padded for the
convolution so that the output has the same dimension as the
image itself. So, using the eight columns of \( W \)
we take each input image \( \mathcal{I}\) and convert it
into eight output images \( \mathcal{I} _ l \) where \( 1\leq l\leq 8 \).</p>
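The patch-extraction and PCA steps above can be sketched in a few lines of NumPy. This is my own illustrative version, not the authors' code; the function name and the brute-force patch loop are assumptions for clarity:

```python
import numpy as np

def pca_filterbank(images, k=7, n_filters=8):
    """Estimate a PCA filterbank from all k-by-k patches of the images.

    `images` is a list of 2-D grayscale arrays. Each patch is flattened,
    its mean (DC component) is subtracted, and PCA is run over the
    collection; the top `n_filters` principal components become kernels.
    """
    patches = []
    for img in images:
        h, w = img.shape
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                p = img[i:i + k, j:j + k].ravel().astype(float)
                patches.append(p - p.mean())  # remove the DC component
    X = np.stack(patches)                     # shape (n_patches, k * k)
    # The principal components are the top right-singular vectors of X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n_filters].reshape(n_filters, k, k)
```

Each returned \( 7\times 7 \) kernel would then be convolved with the zero-padded input images (e.g. with `scipy.signal.convolve2d(img, f, mode='same')`) to produce the eight output images.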
<h4 id="second-layer">Second Layer</h4>
<p>The second layer is constructed by iterating the algorithm from
the first layer over each of the eight output images. For each
output image \( \mathcal{I} _ l \) we take the dense set
of flattened patch vectors and remove the DC-component from each. The patches produced by
the different filters are then concatenated together and
we estimate another PCA filterbank (again with eight filters). Each filter
\( w _ {2,k} \) from the layer-2 filterbank is convolved with
\( \mathcal{I} _ l \) to produce a new image \( \mathcal{I} _ {l,k} \). Repeating
this process for each filter in the filterbanks produces \( 64=8\times 8 \)
images.</p>
<h3 id="hashing-and-histogramming">Hashing and Histogramming</h3>
<p>The 64 images have the same size as the original image thus we
may view the filter outputs as producing a three-dimensional
array \( \mathcal{J}\in\mathbb{R}^{H\times W\times 64} \)
where \( H\times W \) are the dimensions of the input image. Each
of the 64 images is produced from a layer one filter \( l _ 1 \)
and a layer two filter \( l _ 2 \) so we denote the associated
image as \( \mathcal{J} _ {l _ 1,l _ 2} \). Each
pixel \( (x,y) \) from the image has an associated
64-dimensional feature vector \( \mathcal{J}(x,y)\in\mathbb{R}^{64} \). For each layer-one filter \( l _ 1 \), the corresponding 8-dimensional slice of this vector is converted into an integer by using a
<a href="http://en.wikipedia.org/wiki/Heaviside_step_function">Heaviside step function</a> \( H \) sum:
<script type="math/tex">\mathcal{K} _ {l _ 1}(x,y)=\sum _ {z=1}^{8} 2^{z-1}\cdot H(\mathcal{J} _ {l _ 1,z}(x,y)).</script></p>
<p>We note that we produce a hashed image such as \( \mathcal{K} _ l \)
for each filter \( l \) in the layer one filterbank so this means
that we have eight images after the hashing operation and the images
are all integers.</p>
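The hashing sum above amounts to reading the signs of the eight layer-two responses as the bits of an 8-bit integer. A minimal NumPy sketch, assuming the responses for one layer-one filter are stacked into an array of shape (8, H, W) (the arrangement and names are my own):

```python
import numpy as np

def binary_hash(layer2_maps):
    """Collapse eight layer-two filter responses into one integer image.

    `layer2_maps` has shape (8, H, W): the responses J_{l1,z} for z = 1..8.
    Each pixel's eight signs become the bits of an integer in [0, 255].
    """
    bits = (layer2_maps > 0).astype(np.int64)   # Heaviside step per pixel
    weights = 2 ** np.arange(bits.shape[0])     # 1, 2, 4, ..., 128
    # Weighted sum over the filter axis yields the (H, W) hashed image.
    return np.tensordot(weights, bits, axes=1)
```

Applying this once per layer-one filter yields the eight integer-valued hashed images \( \mathcal{K} _ l \).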
<h4 id="histogramming">Histogramming</h4>
<p>We then take \( 7\times 7 \) blocks of the hashed images
\( \mathcal{K} \) and compute a histogram with \( 2^{8}=256 \)
bins over the values observed. These blocks can be disjoint
(used for face recognition) or they can be overlapping (useful
for digit recognition). The histograms formed from these blocks
and from the several images are all concatenated into a feature
vector. Classification is then performed using this feature
vector.</p>
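The block-histogram step might be sketched as follows, assuming 8-bit hash values and a stride parameter to switch between disjoint and overlapping blocks (the function name and defaults are my own, not from the authors' code):

```python
import numpy as np

def block_histograms(hashed, block=7, step=7, n_bins=256):
    """Concatenate per-block histograms of a hashed (integer) image.

    `step == block` gives disjoint blocks; `step < block` gives
    overlapping blocks. Values are assumed to lie in [0, n_bins).
    """
    h, w = hashed.shape
    feats = []
    for i in range(0, h - block + 1, step):
        for j in range(0, w - block + 1, step):
            patch = hashed[i:i + block, j:j + block]
            feats.append(np.bincount(patch.ravel(), minlength=n_bins))
    return np.concatenate(feats)
```

The final feature vector for one image is the concatenation of these histograms over all blocks and all eight hashed images.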
<h3 id="classification">Classification</h3>
<p>The authors estimate a multiclass linear SVM to operate
on the estimated feature vector for each image. The same
setup was used for all input data. The particular SVM implementation
was <a href="http://www.csie.ntu.edu.tw/~cjlin/liblinear/">Liblinear</a>.
The specific algorithm used was \( l _ 2 \)-regularized
\( l _ 2 \)-loss one-against-rest support vector
classification with a cost (the <code class="highlighter-rouge">C</code> parameter) of <code class="highlighter-rouge">1</code>. The
call to liblinear may be written</p>
<p><code class="highlighter-rouge">liblinear -s 1 -c 1.0</code></p>
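For readers without Liblinear, a roughly equivalent setup in scikit-learn would be the following sketch. The data here are random stand-ins for the block-histogram features, not the authors' pipeline:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy stand-ins for the block-histogram feature vectors (hypothetical data).
rng = np.random.default_rng(0)
X = rng.random((40, 16))
y = np.repeat([0, 1], 20)
X[y == 1] += 1.0  # shift one class so the problem is linearly separable

# L2-regularized, squared-hinge (L2) loss, one-vs-rest linear SVM with C = 1,
# matching the Liblinear configuration described above.
clf = LinearSVC(penalty="l2", loss="squared_hinge", C=1.0)
clf.fit(X, y)
```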
<h2 id="authors-implementation">Author’s Implementation</h2>
<p>Code for the paper is <a href="http://mx.nthu.edu.tw/~tsunghan/download/PCANet_demo.zip">here</a>
and it has implementations for cifar10 and MNIST basic (a subset of MNIST). With a little extra
work one can also make it suitable for testing on the whole MNIST data set.</p>
<p>I tested this implementation on the MNIST basic dataset distributed with their implementation
code and obtained a \( 1.31\% \)
error rate using \( 12,000 \) training examples
and requiring
\( 700 \) seconds of training time. This is a somewhat higher error-rate than the \( 1.02\% \)
reported in the authors’ paper. It is possible that the authors ran a more optimized SVM training routine
that was not indicated in the posted code.</p>
<p>The filters learned in the first layer were:
<img src="/public/images/PCANet_V1.png" alt="first layer PCA filters" title="Layer 1 PCA filters" /></p>
<p>The filters learned in the second layer were:
<img src="/public/images/PCANet_V2.png" alt="second layer PCA filters" title="Layer 2 PCA filters" /></p>
<p>We can see that the different image filters are somewhat similar to edge filters and that the seventh
and eighth filters (in the lower-right hand corner) have less clear structure than the others. Often,
when one uses PCA the first few components have a somewhat clear meaning and the rest of the components
look like random noise; this is consistent with a model in which the latent dimensionality of the patches is less than
eight.</p>
<h1 id="conclusion">Conclusion</h1>
<p>I was intrigued by this paper because of the simplicity of the network and the strong
reported results. When I ran my simple experiment I was not able to reach the results as reported
in the paper using the code provided.</p>
<p>In the future I will try further experiments using PCA to initialize filters for the deep network.
Autoencoders are often used for initializing deep-network filters and PCA is a sort of poor-man’s autoencoder.
Mean-normalizing the output layer before moving to the next layer is a simple way to organize multi-layer networks
and I think that has promise as a baseline. I am less enthusiastic about the histogramming and hashing steps.
The authors mention that the histogramming and hashing produce translation invariance, and I wonder whether
translation invariance could be achieved more simply by using max-pooling.</p>
<p>Overall the paper gave me some interesting questions to think about but I think it could serve as an excellent
baseline for other deep network systems when the publicly available code is more mature.</p>