Serving Up Chicago-Style Deep Learning.

The building blocks of Deep Learning 21 Nov 2015

The building blocks of Deep Learning

A feed-forward network is built up of nodes that make a directed acyclic graph (DAG). This post will focus on how a single node works and what we need to implement if we want to define one. It is aimed at people who generally know how deep networks work, but can still be confused about exactly what gradients need to be computed for each node (i.e. myself, all the time).


In our network, we will have three different types of nodes (often called layers as well):

  • Static data (data and labels)
  • Dynamic data (parameters)
  • Functions

This is a bit different from the traditional take on nodes, since we are not allowing nodes to have any internal parameters. Instead, parameters will be fed into function nodes as dynamic data. A network with two fully connected layers may look like this:


The static data nodes are light blue and the dynamic data nodes (parameter nodes) are orange.

To train this network, all we need is the derivative of the loss, \( L \), with respect to each of the parameter nodes. For this, we need to consider the canonical building block, the function node:


It takes any number of inputs and produces an output (that eventually leads to the loss). The node itself makes no distinction between static and dynamic data, which makes things both simpler and more flexible. What we need from this building block is a way to compute \(\mathbf{z}\) and the derivative of \( L \) with respect to each of the inputs. First of all, we need a function that computes

\begin{equation} \mathbf{z} = \mathrm{forward}((\mathbf{x}^1, \dots, \mathbf{x}^n)). \end{equation}

This is the simple part and should be trivial once you have decided what you want the node to do.

Next, computing the derivative of the loss with respect to a single element of one of the inputs may look like (superscript omitted):

\begin{equation} \frac{ \partial L }{ \partial x _ i } = \sum _ j \frac{ \partial L }{ \partial z _ j } \frac{ \partial z _ j }{\partial x _ i } \end{equation}

We broke the derivative up using the multivariable chain rule (also known as the total derivative). It can also be written as

\begin{equation} \frac{ \partial L }{ \partial \mathbf{x} } = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{x} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} } \quad\quad \left[ \mathbb{R}^ { A \times 1} = \mathbb{R} ^ {A \times B} \mathbb{R} ^ {B \times 1} \right] \end{equation}

This assumes that the input size is \( A \) and the output size is \( B \). The derivative \( \frac{ \partial L }{ \partial \mathbf{z} } \in \mathbb{R} ^ {B} \) is something that needs to be given to the building block from the outside (this is the gradient being back-propagated). The Jacobian \( \frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{x} } \in \mathbb{R} ^ {B \times A} \) on the other hand needs to be defined by the node. However, we do not necessarily need to explicitly compute it or store it. All we need is to define the function

\begin{equation} \frac{ \partial L }{ \partial \mathbf{x} } = \mathrm{backward}\left(\mathbf{x}, \mathbf{z}, \frac{ \partial L }{ \partial \mathbf{z} }\right) \end{equation}

This would need to be done for each input separately. Since they sometimes share computations, frameworks like Caffe use a single function for the entire node's backward computation. In our code examples, we will adopt this as well, meaning we will be defining:

\begin{equation} \left(\frac{ \partial L }{ \partial \mathbf{x}^1 }, \dots, \frac{ \partial L }{ \partial \mathbf{x}^n }\right) = \mathrm{backward}\left((\mathbf{x}^1, \dots, \mathbf{x}^n), \mathbf{z}, \frac{ \partial L }{ \partial \mathbf{z} }\right) \end{equation}

It is also common to support multiple outputs; however, for simplicity (and without loss of generality) we will assume there is only one.


So, the functions that we need to define for a single node are first the forward pass:


The input data refers to all the inputs, so it will for instance be a list of arrays.

Next, the backward pass:


It takes three inputs as described above and returns the gradient of the loss with respect to the input. It does not need to take the output data, since it can be computed from the input data. However, if it is needed, we might as well pass it in since we will have computed it already.
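As a minimal sketch of these two signatures, here is a hypothetical node that simply adds two inputs elementwise. The node itself is an assumption made for illustration; only the shape of the interface (a list of inputs in, one gradient per input out) follows the description above.

```python
import numpy as np

# Hypothetical example node: elementwise sum of two inputs, z = x1 + x2.

def forward(inputs):
    """Compute the node's output from a list of input arrays."""
    x1, x2 = inputs
    return x1 + x2

def backward(inputs, output, output_diff):
    """Return one gradient array per input, in the same order as inputs."""
    # Both Jacobians are identity matrices, so the incoming gradient
    # passes through to each input unchanged.
    return [output_diff.copy(), output_diff.copy()]

inputs = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
output = forward(inputs)
input_diffs = backward(inputs, output, np.ones_like(output))
```

Here `output` is unused inside `backward`, which is the case mentioned above where passing it in is merely convenient.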

Forward / Backward pass

A DAG describes a partial ordering. First, we need to sort our nodes so that they do not violate the partial order. There will probably be several solutions to this, but we can pick one arbitrarily.

Once we have this ordering, we call forward on the list from the first node to the last. The order will guarantee that the dependencies of a node have been computed when we get to it. The Loss should be the last node. This is called a forward pass.

Then, we call backward on this list in reverse. This means that we start with the Loss node. Since we do not have any output diff at this point, we simply set it to an array of all ones. We proceed until we are done with the first in the list. This is called a backward pass.

Once the forward and the backward pass have been performed, we take the gradients that have arrived at each parameter node and perform a gradient descent update in the opposite direction.
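The whole procedure can be sketched in a few lines. This is only an illustration under stated assumptions: the graph format (a dictionary from node name to forward function, backward function and input names), the node names and the tiny three-node network are all made up for this example, and the ordering comes from `graphlib.TopologicalSorter` in the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
import numpy as np

def mul_forward(inputs):
    return inputs[0] * inputs[1]

def mul_backward(inputs, output, output_diff):
    return [inputs[1] * output_diff, inputs[0] * output_diff]

def relu_forward(inputs):
    return np.maximum(inputs[0], 0)

def relu_backward(inputs, output, output_diff):
    return [(inputs[0] > 0) * output_diff]

def sum_forward(inputs):
    return np.sum(inputs[0])

def sum_backward(inputs, output, output_diff):
    return [np.ones_like(inputs[0]) * output_diff]

# Function nodes: name -> (forward, backward, names of the inputs)
functions = {
    "mul": (mul_forward, mul_backward, ["x", "w"]),
    "relu": (relu_forward, relu_backward, ["mul"]),
    "loss": (sum_forward, sum_backward, ["relu"]),
}
# Data nodes: "x" is static data, "w" is a parameter
values = {"x": np.array([1.0, -2.0, 3.0]), "w": np.array([0.5, 0.5, -1.0])}

# Any order that respects the dependencies will do; keep function nodes only
dependencies = {name: spec[2] for name, spec in functions.items()}
order = [n for n in TopologicalSorter(dependencies).static_order() if n in functions]

# Forward pass: first node to last
for name in order:
    fwd, _, deps = functions[name]
    values[name] = fwd([values[d] for d in deps])

# Backward pass: last node to first, starting from an output diff of ones
diffs = {"loss": np.ones_like(values["loss"])}
for name in reversed(order):
    _, bwd, deps = functions[name]
    input_diffs = bwd([values[d] for d in deps], values[name], diffs[name])
    for dep, grad in zip(deps, input_diffs):
        diffs[dep] = diffs.get(dep, 0) + grad  # sum if a node is used more than once

# Gradient descent update on the parameter node
learning_rate = 0.1
values["w"] = values["w"] - learning_rate * diffs["w"]
```

Note that a gradient also arrives at the static node `x`; it is simply never used for an update.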

Weight sharing

Externalizing the parameters makes parameter sharing conceptually easy to deal with. For instance, if we wanted to share weights (but not biases), we could do:


In this case, \( \mathbf{W} \) would receive two gradient arrays, which are summed before performing the update step.
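A small numerical sketch of this summation, assuming two dense nodes of the form \( \mathbf{z} = \mathbf{W}^\intercal \mathbf{x} + \mathbf{b} \) stacked on top of each other and a made-up loss \( L = \sum_i z_i \). The sizes and variable names are hypothetical, and \( \mathbf{W} \) has to be square here so that it fits both layers.

```python
import numpy as np

def forward(inputs):
    x, W, b = inputs
    return W.T @ x + b

def backward(inputs, output, output_diff):
    x, W, b = inputs
    return [W @ output_diff, np.outer(x, output_diff), output_diff]

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(3, 3))  # shared by both layers
b1, b2 = rng.normal(size=3), rng.normal(size=3)

# Forward: two dense nodes using the same W but different biases
h = forward([x, W, b1])
z = forward([h, W, b2])

# Backward, pretending the loss is L = sum(z)
dz = np.ones_like(z)
dh, dW_second, db2 = backward([h, W, b2], z, dz)
dx, dW_first, db1 = backward([x, W, b1], h, dh)

# W receives one gradient array per use; the update uses their sum
dW = dW_first + dW_second
```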


As an example, the ReLU has a single input of the same size as the output, so \( A = B \). The output is computed elementwise as

\begin{equation} z _ i = \max(0, x _ i) \end{equation}

which could translate to something like this in Python (who uses pseudo-code anymore?):

import numpy as np

def forward(inputs):
    return np.maximum(inputs[0], 0)

For the backward pass, the Jacobian will be a diagonal matrix, with entries

\begin{equation} \frac{\partial z _ i}{\partial x _ i} = 1 _ {\{ x _ i > 0 \}}, \end{equation}

where \( 1 _ {\{P\}} \) is 1 if the predicate \( P\) is true, and zero otherwise (see Iverson bracket). We can now write the gradient of the loss as

\begin{equation} \frac{ \partial L }{ \partial \mathbf{x} } = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{x} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} } = \mathbf{1} _ {\{ \mathbf{x} > \mathbf{0} \} } \odot \frac{ \partial L }{ \partial \mathbf{z} }, \end{equation}

where \( \odot \) denotes an elementwise product.

def backward(inputs, output, output_diff):
    return [(inputs[0] > 0) * output_diff]

Note that we have to return a list, since we could have multiple inputs.


Moving on to the dense (fully connected) layer where

\begin{equation} \mathbf{z} = \mathbf{W} ^ \intercal \mathbf{x} + \mathbf{b} \quad\quad (\mathbb{R}^{B \times 1} = \mathbb{R}^{B \times A} \mathbb{R}^{A \times 1} + \mathbb{R}^{B \times 1}) \end{equation}

However, remember that we make no distinction between static and dynamic input, and from the point of view of our Dense node it simply looks like:

\begin{equation} \mathbf{z} = \left( \mathbf{x} ^ 2 \right) ^ \intercal \mathbf{x} ^ 1 + \mathbf{x} ^ 3 \end{equation}

Which might translate to:

def forward(inputs):
    x, W, b = inputs
    return W.T @ x + b

For the backward pass, we need to compute all three Jacobians and multiply them by the gradient coming in from above. Let's start with \( \mathbf{x} \):

\begin{equation} \frac{\mathrm{d} \mathbf{z} }{\mathrm{d} \mathbf{x}} = \mathbf{W} ^ \intercal \in \mathbb{R} ^ {B \times A} \end{equation}

which gives us

\begin{equation} \text{Gradient #1 }\rightarrow \quad\quad\ \frac{\partial L}{\partial \mathbf{x}} = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{x} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} }= \mathbf{W} \frac{ \partial L }{ \partial \mathbf{z} } \quad\quad \leftarrow\text{ Gradient #1} \end{equation}

Moving on. Since \( \mathbf{W} \in \mathbb{R} ^ {A \times B} \), it means its Jacobian should have the dimensions \( B \times (A \times B) \). We know the bias will drop off, so we can write the output that we will be taking the Jacobian of as:

\begin{equation} \mathbf{z}' = \left( \sum _ {j = 1} ^ A W _ {j, 1} x _ j, \dots, \sum _ {j = 1} ^ A W _ {j, B} x _ j \right) \end{equation}

Now, let's compute the derivative of \( z' _ i \) (and thus \(z _ i \)) with respect to \( W _ {j, k} \):

\begin{equation} \frac{\partial z _ i}{\partial W _ {j, k}} = \left\{ \begin{array}{ll} x _ j & \mbox{if } i = k \\ 0 & \mbox{otherwise} \end{array} \right. \end{equation}

With a bit of collapsing things together (Einstein notation is great for this, but the steps are omitted here), we get an outer product of two vectors

\begin{equation} \text{Gradient #2 }\rightarrow \quad\quad\ \frac{\partial L}{\partial \mathbf{W}} = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{W} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} } = \mathbf{x} \left( \frac{ \partial L }{ \partial \mathbf{z} } \right)^\intercal \quad\quad \leftarrow\text{ Gradient #2} \end{equation}
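For reference, the collapsing step can be sketched with explicit sums rather than Einstein notation. Inserting the element-wise derivative of \( z _ i \) with respect to \( W _ {j, k} \) into the chain rule, only the term with \( i = k \) survives:

\begin{equation} \frac{\partial L}{\partial W _ {j, k}} = \sum _ {i = 1} ^ B \frac{\partial L}{\partial z _ i} \frac{\partial z _ i}{\partial W _ {j, k}} = \sum _ {i = 1} ^ B \frac{\partial L}{\partial z _ i} \, x _ j \, 1 _ {\{ i = k \}} = x _ j \frac{\partial L}{\partial z _ k} \end{equation}

The right-hand side is element \( (j, k) \) of the \( A \times B \) outer product \( \mathbf{x} \left( \frac{ \partial L }{ \partial \mathbf{z} } \right)^\intercal \), which has the same shape as \( \mathbf{W} \).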

The final Jacobian is simply an identity matrix

\begin{equation} \frac{\mathrm{d} \mathbf{z} }{\mathrm{d} \mathbf{b}} = I \in \mathbb{R} ^ {B \times B} \end{equation}

so the Loss derivative with respect to the bias is just the gradients coming in from above unchanged

\begin{equation} \text{Gradient #3}\rightarrow \quad\quad\ \frac{\partial L}{\partial \mathbf{b}} = \left(\frac{ \mathrm{d} \mathbf{z} }{ \mathrm{d} \mathbf{b} }\right) ^ \intercal \frac{ \partial L }{ \partial \mathbf{z} }= \frac{ \partial L }{ \partial \mathbf{z} } \quad\quad \leftarrow\text{ Gradient #3} \end{equation}

We thus have all three gradients (with no regard as to which ones are parameters). This might translate in code to:

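def backward(inputs, output, output_diff):
    x, W, b = inputs
    return [
        W @ output_diff,           # Gradient #1: dL/dx, shape (A,)
        np.outer(x, output_diff),  # Gradient #2: dL/dW, shape (A, B)
        output_diff,               # Gradient #3: dL/db, shape (B,)
    ]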

Now, the frameworks that I know do not externalize the parameters, so instead of returning the last two gradients, they would be applied to the internal parameters through some other means. However, the main ideas and certainly the math will be exactly the same.


You should get the idea by now. The final note is that when we do this for the Loss layer, we still need to pretend the node has been placed in the middle of a network with an actual loss at the end of it. The Loss node should not be different in any way, except that its output size is scalar. However, a good loss node should theoretically be able to be used in the middle of a network, so it should still query output_diff and use it correctly (even though it will be all ones when used in the final position).
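As an illustration, here is a sketch of a loss node that respects output_diff. The choice of a sum-of-squares loss is an assumption made for this example and is not taken from the discussion above.

```python
import numpy as np

# Hypothetical loss node: z = 0.5 * sum((prediction - target)^2)

def forward(inputs):
    prediction, target = inputs
    return 0.5 * np.sum((prediction - target) ** 2)

def backward(inputs, output, output_diff):
    prediction, target = inputs
    residual = prediction - target
    # The output is a scalar, so output_diff simply scales both gradients.
    return [residual * output_diff, -residual * output_diff]

inputs = [np.array([1.0, 2.0, 3.0]), np.array([1.0, 0.0, 5.0])]
loss = forward(inputs)

# Used as the final node: the output diff is one
grads_last = backward(inputs, loss, 1.0)
# Used in the middle of a network: some other gradient arrives from above
grads_middle = backward(inputs, loss, 0.5)
```

The target receives a gradient too, since the node makes no distinction between its inputs.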


In summary, the usual steps when constructing a new node/layer are:

  • Compute the forward pass
  • Calculate the Jacobian for all your inputs (static and dynamic alike)
  • Multiply them with the gradient coming in from above. At this point, we will often realize that we do not have to ever store the entire Jacobian.
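One way to gain confidence in a new node after following these steps is to compare its backward function against finite differences. The check below is a sketch of that idea rather than part of the recipe above; the helper `numerical_gradients` and the sizes are hypothetical, and the dense node is used as the example.

```python
import numpy as np

def forward(inputs):
    x, W, b = inputs
    return W.T @ x + b

def backward(inputs, output, output_diff):
    x, W, b = inputs
    return [W @ output_diff, np.outer(x, output_diff), output_diff]

def numerical_gradients(inputs, output_diff, eps=1e-6):
    """Central differences of the scalar L = output_diff . forward(inputs)."""
    grads = []
    for k in range(len(inputs)):
        grad = np.zeros_like(inputs[k])
        for idx in np.ndindex(*inputs[k].shape):
            plus = [a.copy() for a in inputs]
            minus = [a.copy() for a in inputs]
            plus[k][idx] += eps
            minus[k][idx] -= eps
            diff = forward(plus) - forward(minus)
            grad[idx] = output_diff @ diff / (2 * eps)
        grads.append(grad)
    return grads

rng = np.random.default_rng(0)
A, B = 4, 3
inputs = [rng.normal(size=A), rng.normal(size=(A, B)), rng.normal(size=B)]
output_diff = rng.normal(size=B)  # stands in for the gradient from above

analytic = backward(inputs, forward(inputs), output_diff)
numeric = numerical_gradients(inputs, output_diff)
max_error = max(np.abs(a - n).max() for a, n in zip(analytic, numeric))
```

A small `max_error` suggests that the three gradients agree with the forward function, without ever forming a full Jacobian.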

Creating an LMDB database in Python

LMDB is the database of choice when using Caffe with large datasets. This is a tutorial on how to create an LMDB database from Python. First, let's look at the pros and cons of using LMDB over HDF5.

Reasons to use HDF5:

  • Simple format to read/write.

Reasons to use LMDB:

  • LMDB uses memory-mapped files, giving much better I/O performance.
  • Works well with really large datasets. The HDF5 files are always read entirely into memory, so you can't have any HDF5 file exceed your memory capacity. You can easily split your data into several HDF5 files though (just put several paths to h5 files in your text file). Then again, compared to LMDB's page caching the I/O performance won't be nearly as good.

LMDB from Python

You will need the Python package lmdb as well as Caffe's python package (make pycaffe in Caffe). LMDB provides key-value storage, where each <key, value> pair will be a sample in our dataset. The key will simply be a string version of an ID value, and the value will be a serialized version of the Datum class in Caffe (which is built using protobuf).

import numpy as np
import lmdb
import caffe

N = 1000

# Let's pretend this is interesting data
X = np.zeros((N, 3, 32, 32), dtype=np.uint8)
y = np.zeros(N, dtype=np.int64)

# We need to prepare the database for the size. We'll set it 10 times
# greater than what we theoretically need. There is little drawback to
# setting this too big. If you still run into problems after raising
# this, you might want to try saving fewer entries in a single
# transaction.
map_size = X.nbytes * 10

env = lmdb.open('mylmdb', map_size=map_size)

with env.begin(write=True) as txn:
    # txn is a Transaction object
    for i in range(N):
        datum = caffe.proto.caffe_pb2.Datum()
        datum.channels = X.shape[1]
        datum.height = X.shape[2]
        datum.width = X.shape[3]
        datum.data = X[i].tobytes()  # or .tostring() if numpy < 1.9
        datum.label = int(y[i])
        str_id = '{:08}'.format(i)

        # The encode is only essential in Python 3
        txn.put(str_id.encode('ascii'), datum.SerializeToString())
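The zero padding in str_id is worth a short note. By default LMDB keeps keys sorted in lexicographic byte order, so padded IDs come back from a cursor in numeric order while unpadded ones would not. The sketch below uses plain Python sorting to stand in for that ordering (an assumption made to keep the example free of any database) and a made-up list of IDs.

```python
# Keys as they would be written with and without zero padding
ids = [0, 1, 2, 10, 100]
padded = ['{:08}'.format(i).encode('ascii') for i in ids]
unpadded = [str(i).encode('ascii') for i in ids]

# Byte-wise sorting, standing in for the order a cursor would follow
padded_order = sorted(padded)
unpadded_order = sorted(unpadded)

print(padded_order == padded)  # True: numeric order is preserved
print(unpadded_order)          # [b'0', b'1', b'10', b'100', b'2']
```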

You can also open up and inspect an existing LMDB database from Python:

import numpy as np
import lmdb
import caffe

env = lmdb.open('mylmdb', readonly=True)
with env.begin() as txn:
    raw_datum = txn.get(b'00000000')

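datum = caffe.proto.caffe_pb2.Datum()
datum.ParseFromString(raw_datum)

# np.frombuffer replaces the deprecated binary mode of np.fromstring
flat_x = np.frombuffer(datum.data, dtype=np.uint8)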
x = flat_x.reshape(datum.channels, datum.height, datum.width)
y = datum.label

Iterating <key, value> pairs is also easy:

with env.begin() as txn:
    cursor = txn.cursor()
    for key, value in cursor:
        print(key, value)

Initialization of deep networks

As we all know, the solution that an algorithm like stochastic gradient descent finds for a non-convex optimization problem depends on the initial values of the parameters. This post is about choosing initial parameter values for deep networks and how this affects convergence. We will also discuss the related topic of vanishing gradients.

First, let's go back to the time of sigmoidal activation functions and initialization of parameters using IID Gaussian or uniform distributions with fairly arbitrarily set variances. Building deep networks was difficult because of exploding or vanishing activations and gradients. Let's take activations first: If all your parameters are too small, the variance of your activations will drop in each layer. This is a problem if your activation function is sigmoidal, since it is approximately linear close to 0. That is, you gradually lose your non-linearity, which means there is no benefit to having multiple layers. If, on the other hand, your activations become larger and larger, then your activations will saturate and become meaningless, with gradients approaching 0.

Activation functions

Let us consider one layer and forget about the bias. Note that the following analysis and conclusion are taken from Glorot and Bengio[1]. Consider a weight matrix \( W \in \mathbb{R}^{m \times n} \), where each element was drawn from an IID Gaussian with variance \( \mathrm{Var}(W) \). Note that we are a bit abusive with notation letting \( W \) denote both a matrix and a univariate random variable. We also assume there is no correlation between our input and our weights and both are zero-mean. If we consider one filter (row) in \( W \), say \( \mathbf{w} \) (a random vector), then the variance of the output signal over the input signal is:

$$ \frac{ \mathrm{Var}(\mathbf{w}^T \mathbf{x}) }{ \mathrm{Var}(X) } = \frac{\sum _ {i = 1} ^ n \mathrm{Var}(w _ i x _ i)}{\mathrm{Var}(X)} = \frac{n \mathrm{Var}(W) \mathrm{Var}(X)}{\mathrm{Var}(X)} = n\mathrm{Var}(W) $$

As we build a deep network, we want the variance of the signal going forward in the network to remain the same, thus it would be advantageous if \( n \mathrm{Var}(W) = 1. \) The same argument can be made for the gradients, the signal going backward in the network, and the conclusion is that we would also like \( m \mathrm{Var}(W) = 1. \) Unless \( n = m, \) it is impossible to satisfy both of these conditions. In practice, it works well if both are approximately satisfied. One thing that has never been clear to me is why it is only necessary to satisfy these conditions when picking the initialization values of \( W. \) It would seem that we have no guarantee that the conditions will remain true as the network is trained.
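The forward condition is easy to check numerically. The sketch below is only an illustration with made-up layer sizes and Gaussian inputs. It also tries \( \mathrm{Var}(W) = 2 / (n + m) \), which as far as I know is the compromise between the two conditions suggested in [1]; treat that attribution as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 512, 256  # hypothetical layer sizes: n inputs, m outputs
num_samples = 2000

# Zero-mean inputs with an arbitrary variance (here 9)
X = rng.normal(0.0, 3.0, size=(n, num_samples))

def variance_ratio(var_w):
    """Var(output) / Var(input) for W with IID zero-mean Gaussian entries."""
    W = rng.normal(0.0, np.sqrt(var_w), size=(m, n))
    return (W @ X).var() / X.var()

# n Var(W) = 1, so the ratio should be close to 1
forward_ratio = variance_ratio(1.0 / n)

# n Var(W) = 2n / (n + m), so the ratio should be close to 4/3 for these sizes
compromise_ratio = variance_ratio(2.0 / (n + m))
```

With the compromise, the forward signal grows a little and the backward signal shrinks a little (by \( 2m / (n + m) = 2/3 \) here), which is the sense in which both conditions are only approximately satisfied.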

Nevertheless, this Xavier initialization (after Glorot's first name) is a neat trick that works well in practice. However, along came rectified linear units (ReLU), a non-linearity that is scale-invariant around 0 and does not saturate at large input values. This seemingly solved both of the problems the sigmoid function had; or were they just alleviated? I am unsure of how widely used Xavier initialization is, but if it is not, perhaps it is because ReLU seemingly eliminated this problem.

However, take the most competitive network as of recently, VGG[2]. They do not use this kind of initialization, although they report that it was tricky to get their networks to converge. They say that they first trained their most shallow architecture and then used that to help initialize the second one, and so forth. They presented 6 networks, so it seems like an awfully complicated training process to get to the deepest one.

A recent paper by He et al.[3] presents a pretty straightforward generalization of ReLU and Leaky ReLU. What is more interesting is their emphasis on the benefits of Xavier initialization even for ReLU. They re-did the derivations for ReLUs and discovered that the conditions were the same up to a factor of 2. The difficulty Simonyan and Zisserman had training VGG is apparently avoidable, simply by using Xavier initialization (or better yet the ReLU adjusted version). Using this technique, He et al. reportedly trained a whopping 30-layer deep network to convergence in one go.
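The factor of 2 can be seen in a small simulation. This is a sketch under simplifying assumptions (fully connected layers of equal width, no biases, Gaussian inputs and weights, forward signal only), not a reproduction of the experiments in [3]. The function name and sizes are hypothetical.

```python
import numpy as np

def final_mean_square(n_var_w, depth=30, width=256, num_samples=500, seed=0):
    """Mean squared activation after `depth` ReLU layers with n Var(W) = n_var_w."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(width, num_samples))
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(n_var_w / width), size=(width, width))
        x = np.maximum(W @ x, 0)  # ReLU
    return np.mean(x ** 2)

xavier = final_mean_square(1.0)  # n Var(W) = 1
adjusted = final_mean_square(2.0)  # n Var(W) = 2, the ReLU adjusted version

# Same seed, so the adjusted weights are sqrt(2) times the Xavier weights and
# the ratio between the two results is 2^30 up to rounding.
ratio = adjusted / xavier
```

Since a ReLU zeroes roughly half of a zero-mean signal, the mean square is about halved in every layer when \( n \mathrm{Var}(W) = 1 \), and after 30 layers very little signal is left. With the extra factor of 2 it stays within a modest factor of where it started.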

Another recent paper tackling the signal scaling problem is by Ioffe and Szegedy[4]. They call the change in scale internal covariate shift and claim this forces learning rates to be unnecessarily small. They suggest that if all layers have the same scale and remain so throughout training, a much higher learning rate becomes practically viable. You cannot just standardize the signals, since you would lose expressive power (the bias disappears and in the case of sigmoids we would be constrained to the linear regime). They solve this by re-introducing two parameters per layer, scaling and bias, added again after standardization. The training reportedly becomes about 6 times faster and they present state-of-the-art results on ImageNet. However, I'm not certain this is the solution that will stick.
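A minimal sketch of that idea, assuming inputs of shape (batch, features) and covering only the training-time forward computation. The running averages used at test time and the backward pass are left out, and the names `gamma`, `beta` and `eps` are choices made for this example.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the batch, then rescale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta  # the two re-introduced parameters

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 4))  # badly scaled input signal
gamma = np.full(4, 2.0)  # learned scale (fixed here)
beta = np.full(4, -1.0)  # learned bias (fixed here)

out = batch_norm_forward(x, gamma, beta)
```

Whatever the scale of `x`, each feature of `out` has a mean of `beta` and a standard deviation of approximately `gamma`, so the scale of the signal is controlled by these parameters rather than by the layers below.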

I reckon we will see a lot more work on this frontier in the next few years. Especially since it also relates to the -- right now wildly popular -- Recurrent Neural Network (RNN), which connects output signals back as inputs. The way you train such a network is that you unroll the time axis, treating the result as an extremely deep feedforward network. This greatly exacerbates the vanishing gradient problem. A popular solution, called Long Short-Term Memory (LSTM), is to introduce memory cells, which are a type of teleport that allows a signal to jump ahead many time steps. This means that the gradient is retained for all those time steps and can be propagated back to a much earlier time without vanishing.

This area is far from solved, and until then I think I will be sticking to Xavier initialization. If you are using Caffe, the one take-away of this post is to use the following on all your layers:

weight_filler {
    type: "xavier"
}


  1. X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International conference on artificial intelligence and statistics, 2010, pp. 249–256.

  2. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

  3. K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” arXiv:1502.01852 [cs], Feb. 2015.

  4. S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” arXiv:1502.03167 [cs], Feb. 2015.