Note to the reader: by “my way,” I don’t intend to purport that any of the thoughts in this post are original or unique. I did, however, write it off the top of my head, so any similarity to other resources is incidental. Also, I’ve obviously read other books, papers, and articles on the subject of modern neural network architectures, so I don’t wish to lay claim to any of those ideas either. I also believe that each individual should search for his or her own intuition of any concept, so what worked for me may not work for you (but it may aid you in that search).

Disclaimer #2: This post is currently a draft, and I would like to write more at some point about the nitty-gritty bits of implementing a neural network (proper usage of storage buffers, GPU-warp shared memory, compute dispatch patterns, etc). It’s sort of a loose jumble of thoughts intended to eventually be something organized, so reader beware!

Motivation

So why am I writing this? It’s because I think most of the explanations and derivations of the standard backpropagation algorithm aren’t very good. Now you are free to disagree, but at least for me, when I was teaching myself the algorithm for the purposes of actually finishing a toy GPU-accelerated implementation, I found myself getting lost in a sea of indices and notation. Also, when trying to wrap my head around how to actually implement something like a convolutional neural network, it felt like the explosion in pure bookkeeping put a lot of strain on my intuition. And so I found a new intuition and I thought I would jot it down here.

If you aren’t trying to build a neural network framework or implementation from scratch, you may not need to delve into this topic much (frameworks like TensorFlow, Keras, Theano, etc do a fair bit for you), but I think it’s still helpful to know, if only to get an intuition about algorithm performance characteristics (assuming you at least have a rough conceptual grasp of what’s happening).

My previous thought model

Most explanations you’ll find of neural networks and backpropagation are extremely index- and vector-heavy, and this shaped my own intuition pretty heavily as well. It might seem like a natural approach, but I think it is fraught with danger: the underlying concepts get lost (in other words, the act of bookkeeping and the algorithmic structure are conflated). There is also a huge burden in keeping track of specific functions (sigmoids, weighted sums, derivatives, etc). Backpropagation is also generally taught in terms of one very exact structure (multi-layer fully connected networks with linear weighted combinations followed by some non-linear activation function). All of these things can be independently motivated easily enough (maybe I’ll write about this later), but the act of learning itself does not require any of this baggage. Furthermore, when an actual implementation is required, many networks don’t follow this canned network topology, adding more “exotic” elements like different activation functions, pooling layers, normalization, and more. Without a good basis for understanding, it’s easy to get lost in the details when trying to implement such a network (which should be your goal, after all).

Rethinking things… with recursion

As computer scientists (and mathematicians), we like simple base cases and simple inductions. When I was starting to teach myself neural network architectures, it was tempting to try a “two-neuron network” or a “two-layer network” to simplify the understanding. However, even this is too complicated. When coming up with a recursive algorithm, it’s important to reframe things in terms of a base case and an inductive step. For the inductive step, a single abstract layer is sufficient.

Conceptually, a layer is just some simple function

$$f(p_0, p_1, \ldots, p_{n-1};\ x_0, x_1, \ldots, x_{m-1})$$

with $n$ parameters and $m$ inputs (I am deliberately avoiding words like “weights” and “activations”). Let’s build our inductive step now. Remembering that the ultimate goal is to find $\mathbf{\nabla}C$ for some cost function $C$, suppose someone told this specific layer its differential contribution to that cost function. In other words, we are given

$$\frac{\partial C}{\partial f}$$

as an actual numerical quantity. This is pretty useful! We can pretty much directly compute $\frac{\partial C}{\partial p_i}$ for any of the $n$ parameters like so:

$$\frac{\partial C}{\partial p_i} = \frac{\partial C}{\partial f}\,\frac{\partial f}{\partial p_i}$$

It’s really important to remember, looking at these equations, that they are evaluated at a particular point in hyperspace (the coordinates of that point corresponding to the parameters of the model and the input vector). Thus, $\frac{\partial f}{\partial p_i}$ should be a simple numerical computation, and we end up with a number expressing the component of the gradient for $p_i$.
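To make “a simple numerical computation” concrete, here’s a tiny Python sketch (the two-parameter layer $f$, its values, and the helper names are all made up for illustration) that evaluates an analytic partial at one specific point and checks it against a central finite difference:

```python
# A hypothetical layer f(p; x) = p0*x0 + p1*x1, evaluated at one concrete
# point in "hyperspace" (fixed parameters and inputs).
def f(p, x):
    return p[0] * x[0] + p[1] * x[1]

def analytic_partial_wrt_p0(p, x):
    # For this particular f, df/dp0 is just x0.
    return x[0]

def numeric_partial_wrt_p0(p, x, eps=1e-6):
    # Central finite difference: perturb only p0, holding everything else fixed.
    hi = f([p[0] + eps, p[1]], x)
    lo = f([p[0] - eps, p[1]], x)
    return (hi - lo) / (2 * eps)

p, x = [3.0, -1.0], [4.0, 2.0]
print(analytic_partial_wrt_p0(p, x))   # 4.0
print(numeric_partial_wrt_p0(p, x))    # ≈ 4.0
```

Finite differences like this are also a handy sanity check for any hand-derived partial in a from-scratch implementation.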

For example, if our layer function was something silly like

$$f(p_0;\ x_0) = p_0\, x_0$$

and an input of $x_0 = 4$ was given with $p_0$ initialized to $3$, and we were given $\frac{\partial C}{\partial f} = -2$, then from the equation above, we can compute

$$\frac{\partial C}{\partial p_0} = \frac{\partial C}{\partial f}\,\frac{\partial f}{\partial p_0} = (-2)(4) = -8, \qquad \frac{\partial C}{\partial x_0} = \frac{\partial C}{\partial f}\,\frac{\partial f}{\partial x_0} = (-2)(3) = -6.$$
Conveniently, we can treat the input $x_0$ of this dumb multiplication layer as a parameter if we wanted (aka treat it like another variable).
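Here’s that toy multiplication layer as a few lines of Python (the function name is my own invention), just to show that the backward step really is this small:

```python
# The toy multiplication layer: f(p0; x0) = p0 * x0.
# Given dC/df from downstream, the chain rule yields both partials directly.
def mul_layer_backward(p0, x0, dC_df):
    dC_dp0 = dC_df * x0   # because df/dp0 = x0
    dC_dx0 = dC_df * p0   # because df/dx0 = p0
    return dC_dp0, dC_dx0

dC_dp0, dC_dx0 = mul_layer_backward(p0=3.0, x0=4.0, dC_df=-2.0)
print(dC_dp0, dC_dx0)   # -8.0 -6.0
```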

Intuitively, the relative sizes of the partial derivatives indicate that $p_0$ needs to change more to correct for the naughty behavior of $x_0$. Simultaneously, $x_0$ should also be corrected somehow, but not by as large an amount (since $p_0$ misbehaved less).

An important observation: it’s easy to think this result feels backwards. After all, the $x_0$ initialized to $4$ was more incorrect. Shouldn’t $x_0$ be the one that needs to change more? Remember though, that a partial derivative fixes all the other variables and measures the change with respect to a single particular variable. With $p_0$ fixed at $3$, nudging $x_0$ changes the output at a rate of $3$; with $x_0$ fixed at $4$, nudging $p_0$ changes the output at a rate of $4$. The output (and hence the cost) is simply more sensitive to $p_0$, which is exactly what the two partials report.

Unlike the parameters though, we can’t directly use the partial derivatives with respect to layer inputs since they aren’t directly incorporated in the model (they are either intermediate values of the model, or input features).

Suppose, however, that there was another function $g$, also parameterized, which produced the $x_0$ in this example. It should be clear now that the $\frac{\partial C}{\partial x_0}$ we just computed plays exactly the role for $g$ that $\frac{\partial C}{\partial f}$ played in this step, so we now have our inductive step.
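A quick sketch of the inductive step in Python, chaining a made-up addition layer $g$ into the multiplication layer from before (all names and values here are illustrative):

```python
# Two chained "layers": g(q; u) = q + u feeds the multiplication layer
# f(p; x) = p * x, i.e. x0 = g(q0; u0). Both functions are made up.
def f_backward(p, x, dC_df):
    return dC_df * x, dC_df * p        # (dC/dp, dC/dx)

def g_backward(q, u, dC_dg):
    return dC_dg * 1.0, dC_dg * 1.0    # dg/dq = dg/du = 1

q0, u0, p0 = 1.0, 3.0, 3.0
x0 = q0 + u0                            # forward through g gives x0 = 4
dC_dp0, dC_dx0 = f_backward(p0, x0, dC_df=-2.0)
# The inductive step: the dC/dx0 we just computed IS the "dC/dg"
# handed to the layer that produced x0.
dC_dq0, dC_du0 = g_backward(q0, u0, dC_dg=dC_dx0)
print(dC_dq0)   # -6.0
```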

In practice, the choice of $f$ is a combination of weights or filters, functions, and all sorts of gizmos. Regardless of how complicated it gets though, you should be able to implement it successfully if you keep revisiting the primary induction you are trying to achieve.

At this point, we have an embarrassingly general inductive step encapsulating the execution of an entire layer. However, this hypothetical layer only produced one output (although it successfully consumed multiple inputs). We can make it even more general by assuming (the inductive hypothesis) that we are given a partial derivative for each of its outputs. If the outputs of the function are labelled $y_0, y_1, \ldots, y_s$, this might be written as the tuple

$$\left(\frac{\partial C}{\partial y_0}, \frac{\partial C}{\partial y_1}, \ldots, \frac{\partial C}{\partial y_s}\right)$$

Now, the computation of a gradient component like $\frac{\partial C}{\partial p_i}$ must be treated carefully, since $p_i$ may have contributed to many of the $y$ outputs. This is where things start to get murky. After all, since we are given a vector quantity, we might expect $\frac{\partial C}{\partial p_i}$ to also be a vector. If we rush into things, we might be tempted to write some weird gradient-projection-vector-quantity-thing like

$$\left(\frac{\partial C}{\partial y_0}\frac{\partial y_0}{\partial p_0}, \frac{\partial C}{\partial y_1}\frac{\partial y_1}{\partial p_0}, \ldots, \frac{\partial C}{\partial y_s}\frac{\partial y_s}{\partial p_0}\right)$$

as some sort of intermediate computation to try and recover $\frac{\partial C}{\partial p_0}$ as in our first example. How can we recover a single scalar from this? Should we average these partial derivatives? Take the square magnitude? Something else entirely? What do these quantities even mean?

The answer, of course, is to add them. Remember the meaning of a partial derivative: we are measuring the rate of change of some output with everything fixed except for a single variable. The fact that a particular parameter is a dependency of several of these outputs just means that if we perturb the parameter, from a differential perspective, all of the $y$ quantities will be perturbed as well, and we can combine their effects. Thus, we can write

$$\frac{\partial C}{\partial p_i} = \sum_{j=0}^{s} \frac{\partial C}{\partial y_j}\,\frac{\partial y_j}{\partial p_i}$$

with a clear conscience. As another point, if changing $p_i$ perturbs $y_4$ more, say, this is already captured in the partial $\frac{\partial y_4}{\partial p_i}$. Anecdotally, I always find that the hardest thing to do when developing an inductive algorithm is to trust the induction hypothesis.
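In code, the whole multi-output case collapses to a single sum. A minimal Python sketch (names and numbers made up), with no averaging anywhere:

```python
# A parameter p_i that feeds several outputs y_j: its gradient component is
# the plain sum of the per-output chain-rule contributions.
def param_gradient(dC_dy, dy_dp):
    # dC_dy[j] = dC/dy_j (given by the induction hypothesis)
    # dy_dp[j] = dy_j/dp_i (local to this layer)
    return sum(c * l for c, l in zip(dC_dy, dy_dp))

print(param_gradient(dC_dy=[1.0, -2.0, 0.5], dy_dp=[2.0, 1.0, 4.0]))  # 2.0
```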

Quick Recap

Let’s quickly summarize what we have so far:

1. We have some gradient we are trying to compute
2. We have a function with inputs and parameters that we’re trying to optimize
3. The inputs might have been outputs from some other function/layer/whatever
4. We treat all inputs and parameters the same, the only difference being that partial derivatives with respect to variables internal to a particular layer do not participate in the recurrence relation
5. The recurrence relation is built on the chain rule
6. We can build all this intuition without indices except for the inputs, outputs, and parameters

From an implementation standpoint then, we can expect to do a bit of work changing the $\frac{\partial y_i}{\partial p_j}$ partials for various algorithms (activation functions, weighted sums, etc).
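To tie the recap together, here’s a hypothetical minimal “layer protocol” in Python, using a silly scalar layer of my own invention. The point is only the shape of the recurrence: forward caches what backward needs, backward accumulates parameter partials internally and returns input partials for the upstream layer.

```python
# A toy layer y = p * x with a single scalar parameter p.
class ScaleLayer:
    def __init__(self, p):
        self.p = p
        self.grad_p = 0.0

    def forward(self, x):
        self.x = x                      # saved for the backward pass
        return self.p * x

    def backward(self, dC_dy):
        self.grad_p += dC_dy * self.x   # parameter partial: stays in the layer
        return dC_dy * self.p           # input partial: recursed upstream

layers = [ScaleLayer(2.0), ScaleLayer(-1.0)]
h = 3.0
for layer in layers:
    h = layer.forward(h)       # 3 -> 6 -> -6
dC = 1.0                       # pretend dC/d(final output) = 1
for layer in reversed(layers):
    dC = layer.backward(dC)
print(layers[0].grad_p, layers[1].grad_p, dC)   # -3.0 6.0 -2.0
```

Real layers swap in different forward/backward bodies, but the recurrence itself never changes.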

However, there are still a few holes in this understanding.

1. Why is there no “averaging” going on when “distributing” the error across the various parameters?
2. Why are the previous outputs of any given layer important when performing backprop?

The first question is based on the intuition that the chain rule acts sort of as an “assigner of responsibility” to each parameter of the model. This is a pretty good way of thinking about it, but note that there is no division by the sum of weights or anything of the kind in the above formulation. The reason is that, as we compute the gradient numerically, the direction of the gradient already encodes the “relative responsibility” of each individual parameter. That is to say, if we trust the calculus to do its thing, the relative sizes of the components will match the distribution of responsibility, with or without any sort of normalization. Furthermore, when we apply an optimizer to the gradient, the magnitude of the gradient gets modulated anyway to control the learning rate (a fixed multiple, annealing over time, momentum, etc.).
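As a sketch of that last point, here is a bare-bones gradient descent step in Python (the learning rate and numbers are made up). The gradient components are used as-is, with no normalization; only the overall scale is modulated:

```python
# Vanilla SGD: subtract a scaled gradient. The direction (relative sizes of
# the components) is untouched; lr only modulates the magnitude.
def sgd_step(params, grads, lr=0.1):
    return [p - lr * g for p, g in zip(params, grads)]

params = [3.0, -1.0]
grads = [-8.0, 2.0]        # note: no averaging or normalization of components
print(sgd_step(params, grads))   # [3.8, -1.2]
```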

The second question leads to a continuation of this formulation with a more exact form for the dependence of $y$ on the layer parameters. Suppose we impose the following restriction on some given output $y$:

$$y = w_0\, f(p_0, p_1, \ldots)$$

where $f$ is any nonlinear function. Now, we can write

$$\frac{\partial y}{\partial p_i} = w_0\, \frac{\partial f}{\partial p_i}$$

The weight $w_0$ is known at each particular step of the algorithm, but $\frac{\partial f}{\partial p_i}$ needs to be supplied. For most choices of the activation function in a neural network, this partial will depend on the activation value itself, which is why the results of the feed-forward operation need to be saved.
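The sigmoid is the classic example of an activation whose derivative can be read directly off the cached activation, since $f' = f\,(1 - f)$. A small Python sanity check (assuming the standard logistic sigmoid):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.5
a = sigmoid(z)                        # cached during the forward pass
deriv_from_cache = a * (1.0 - a)      # derivative without recomputing sigmoid

# Sanity check against a central finite difference:
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(abs(deriv_from_cache - numeric) < 1e-6)   # True
```

This is precisely why a from-scratch implementation stores activations during the forward pass instead of recomputing them during backprop.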

Random Intuition

Here are a few more random things to consider:

1. When implementing a neural network from scratch, imagine that the activations coming from the previous layer are just more weights and compute the partials with respect to those values like anything else. Of course, instead of modifying them in the update step, you’ll want to propagate them recursively.
2. With all these partials of the cost function, we could compute the Hessian matrix as well (second order derivatives). This would require a fair bit more memory, but considering how you would implement this is a great test of understanding.
3. You might ask: how do we know we’re actually finding a minimum of the cost function and not a maximum? The answer is that cost functions are generally chosen to be very well-behaved (mean squared error, minimization of the KL divergence, and so on). The fact that the output of the final layer is produced by some heavily parameterized, weird-looking surface is irrelevant once you plug it into one of these “bowl-shaped” loss functions.
4. We are doing very crude numerical derivatives here using just the first moments. This seems to work well enough for people, but maybe you can do better!
5. Between the first and last layer, there’s a ton of floating point math, and when there are that many computations, you should be sensitive to floating point error accumulation. We get around this with regularization, keeping weight sizes and activations all relatively small. Inputs that are unreasonably sized should probably be normalized, and exploding error functions should be measured on a log scale.
6. Functions like $\textbf{ReLU}$ and max-pooling don’t have well-defined derivatives everywhere. Backpropagation still works because all these derivatives are being evaluated numerically, so if you want, imagine that there’s some sort of hidden “smoothing” going on when certain partial derivatives just go to zero because a unit wasn’t activated and whatnot. After all, we’re just computing the first moments anyways…
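For that last item, here is a minimal Python sketch of how a ReLU backward pass typically handles the kink by just gating gradients (assigning $0$ at exactly zero is a common convention, not the only one):

```python
# ReLU has no derivative at exactly zero; in practice the backward pass just
# picks a value (here 0), and gradients only flow where the unit fired.
def relu_forward(x):
    return [max(0.0, v) for v in x]

def relu_backward(x, dC_dy):
    # Pass the upstream gradient through only where the input was positive.
    return [g if v > 0.0 else 0.0 for v, g in zip(x, dC_dy)]

x = [-1.5, 0.0, 2.0]
print(relu_forward(x))                     # [0.0, 0.0, 2.0]
print(relu_backward(x, [1.0, 1.0, 1.0]))   # [0.0, 0.0, 1.0]
```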

Application

Let’s “derive” backpropagation formulae now for a simple fully connected network.

First, the base case. Given $n$ outputs labeled $y_0, y_1, \ldots, y_{n-1}$, we can compute the partial derivatives of some cost function $C$ with respect to each output:

$$\frac{\partial C}{\partial y_i}$$

The specific values of these derivatives will depend on your choice of cost function and how the layer outputs were fed into it. Assuming that each output is the affine linear combination $y_j = \sum_i w_{ij}\, a_i + b_j$ of the previous layer’s activations, we can now compute, for any weight $w_{ij}$, the component of the gradient $\mathbf{\nabla}C$ projected on $w_{ij}$:

$$\frac{\partial C}{\partial w_{ij}} = a_i\,\frac{\partial C}{\partial y_j}$$

Here, $a_i$ is the input arriving from the $i$th node of the previous layer (and we are assuming a linear activation function, aka a dumb one). This should match our intuition: each individual weight touches exactly one output $y_j$, so no sum appears, and its contribution is proportional to $a_i$ (if the input flowing through the weight is greater, the weight needs to adjust more to compensate for any error).

The equation for the gradient projected on the bias $b_j$ is similar, but without the $a_i$ factor:

$$\frac{\partial C}{\partial b_j} = \frac{\partial C}{\partial y_j}$$

This formula probably looks odd to beginners, but don’t be put off by it. Imagine, for example, that the bias functioned just like a weight whose input activation always happened to be $1$. If the activation feeding a weight were also $1$, the error contribution would be split 50-50 between the bias and the weight; if the weight’s activation were more or less than that, we’d get the correct split accordingly.

Note that these equations look different from those in other texts, like Nielsen’s! That’s because in this wonky formulation I developed (just to get my own intuition), I built my recurrence relation on the partial derivatives of the cost with respect to the layer’s outputs, not the node’s internal value prior to applying the weights. The two are completely equivalent, however, and it should be easy to move from this notation to that one and vice versa. Personally, I found it easier to reason about this formulation (framing things in terms of layer outputs) because it adapts well to weird layers with other exotic functions, weight sharing (as in the case of a convolutional neural network), or network recurrence (GRU, LSTM, standard RNN, etc). Skip connections fit nicely into this framework as well.

The last step, of course, is to compute the partials with respect to the inputs $a_i$ of any given layer. The $i$th input fans out to every output (the layer being fully connected), so this time a sum over all of the output errors appears, each term weighted by the corresponding $w_{ij}$:

$$\frac{\partial C}{\partial a_i} = \sum_j w_{ij}\,\frac{\partial C}{\partial y_j}$$

Note the symmetry with the weight equation: for a single edge, the roles of $a_i$ and $w_{ij}$ are simply swapped (although the two will evaluate differently). And that’s it! Everything necessary to recurse back to the starting layer is done.
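Putting the fully connected equations together, here’s a rough sketch of one layer’s backward pass in plain Python with explicit double indices (a tiny 2-input, 2-output toy layer with made-up numbers), computing the weight, bias, and input partials in one go:

```python
# One fully connected layer with y_j = sum_i w[i][j] * a[i] + b[j]
# (linear activation, as in the derivation above).
def fc_backward(w, a, dC_dy):
    n_in, n_out = len(w), len(w[0])
    # Each weight touches one output: no sum.
    dC_dw = [[a[i] * dC_dy[j] for j in range(n_out)] for i in range(n_in)]
    # The bias acts like a weight whose input activation is always 1.
    dC_db = list(dC_dy)
    # Each input fans out to every output: sum over output errors,
    # recursed to the previous layer as its dC/dy.
    dC_da = [sum(w[i][j] * dC_dy[j] for j in range(n_out))
             for i in range(n_in)]
    return dC_dw, dC_db, dC_da

w = [[1.0, 2.0],
     [0.5, -1.0]]        # w[i][j]: input i -> output j
a = [3.0, 4.0]
dC_dy = [1.0, -2.0]
dC_dw, dC_db, dC_da = fc_backward(w, a, dC_dy)
print(dC_dw)   # [[3.0, -6.0], [4.0, -8.0]]
print(dC_db)   # [1.0, -2.0]
print(dC_da)   # [-3.0, 2.5]
```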

As a side-note, remember that the recursion terminates at the network’s input: if $v$ is a component of the input vector, it is fixed training data, so there is nothing to update and no further layer to propagate $\frac{\partial C}{\partial v}$ to.

Conclusion

The primary benefit of intuition regarding neural networks and the math powering it is to enable you to experiment with wonky ideas and structures, without being tied down to a specific formulation. Using weights nonlinearly, adding weird connections, and more should all feel relatively straightforward to reason about. Furthermore, by decoupling your understanding from exact index-heavy notation, you can more adequately find a formulation that performs better for the architecture you are targeting. Finally, gaining a deeper understanding of a topic, even from a hand-wavey sense is liberating in the sense that the intuition is now your own, and you can do whatever you like with that!