Folding Autoencoders

It’s nice when different ways of seeing things come together. This often occurs when comparing models of computation in the brain and machine learning architectures.

Many of you will be familiar with the standard autoencoder architecture. This takes an input \mathbf{X}, which may be an image (in a 2D or flattened 1D form) or another input array. It applies one or more neural network layers that progressively reduce the dimensionality of the output. These one or more neural network layers are sometimes called an “encoder”. This forms a “bottleneck” to “compress” the input (in a lossy manner) to form a latent representation. This latent representation is sometimes referred to as a hidden layer. It may be seen as an encoded or compressed form of the input.

A standard autoencoder may have a 1D latent representation with a dimensionality that is much less than the input dimensionality. A variational autoencoder may seek to learn values for a set of parameters that represent a latent distribution, such as a probability distribution for the input.

The autoencoder also has another set of one or more neural network layers that receive the latent representation as an input. These one or more neural network layers are sometimes called a “decoder”. The layers generate an output \mathbf{X'}, which is a reconstruction of the input \mathbf{X}.

The whole shebang is illustrated in the figure below.

Standard Autoencoder
Standard Autoencoder

The parameters of the neural network layers, the “encoder” and the “decoder”, are learnt during training. For a set of inputs, the output of the autoencoder \mathbf{X'} is compared with the input \mathbf{X}, and an error is back-propagated through the neural network layers. Using gradient descent to minimise the errors, an optimal set of parameters may be learnt.

As autoencoders do not require a labelled “ground-truth” output to compare with the autoencoder output during training, they provide a form of unsupervised learning. Most are static, i.e. they operate on a fixed instance of the input to generate a fixed instance of the output, and there are no temporal dynamics.

Now, I’m quite partial to unsupervised learning methods. They are hard, but they also are much better at reflecting “natural” intelligence; most animals (outside of school) do not learn based on pairs of inputs and desired outputs. Models of the visual processing pathway in primates, which have been developed over the last 50 years or so, all indicate that some form of unsupervised learning is used.

In the brain, there are several pathways through the cortex that appear similar to the “encoder” neural network layers. With a liberal interpretation, we can see the latent representation of the autoencoder as an association representation formed in the “higher” levels of the cortex. (In reality, latent representations in the brain are not found at a “single” flat level but are distributed over different levels of abstraction.)

If the brain implements some form of unsupervised learning, and seems to “encode” incoming sensory information, this leads to the question: where is the “decoder”?

This is where predictive coding models can help. In predictive coding models, predictions are fed back through the cortex to attempt to reconstruct input sensory information. This seems similar to the “decoder” layers of the autoencoder. However, in this case, the “encoder” and the “decoder” appear related. In fact, one way we can model this is to see the “encoder” and the “decoder” as parts of a common reversible architecture. This looks at things in a similar way to recent work on reversible neural networks, such as the “Glow” model proposed by OpenAI. The “encoder” represents a forward pass through a set of neural network layers, while the “decoder” represents a backward pass through the same set of layers. The “decoder” function thus represents the inverse or “reverse” of the “encoder” function.

This can be illustrated as follows:

Folding the Autoencoder
Folding the Autoencoder

In this variation of the autoencoder, we effectively fold the model in half, and stick together the “encoder” and “decoder” neural network layers to form a single “reversible neural network”.

In fact, the brain is even cooler than this. If we extend across modalities to consider sensory input and motor output, the brain appears to replicate the folded autoencoder shown above, resulting in something that resembles again our original standard autoencoder:

The brain as a form of autoencoder.
The brain as a form of autoencoder.

Here, we have a lower input “encoder” that represents the encoding of sensory input \mathbf{X} into a latent “associative” representation. A forward pass provides the encoding, while a backward pass seeks to predict the sensory input \mathbf{X} by generating a reconstruction \mathbf{X'}. An error between the sensory input \mathbf{X} and the reconstruction \mathbf{X'} is used to “train” the “encoder”.

We also have an upper output “decoder” that begins with the latent “associative” representation and generates a motor output \mathbf{Y}. A forward pass decodes the latent “associative” representation to generate muscle commands.

The “backward” pass of the upper layer is more uncertain. I need to research the muscle movement side – there is quite a lot based on the pathology of Parkinson’s disease and the action of the basal ganglia. (See the link for some great video lectures from Khan Academy.)

In one model, somasensory input may be received as input \mathbf{Y'}, which represents the muscle movements as actuated based on the muscle commands. The backward pass thus seeks to reconstruct the latent “associative” representation from the input \mathbf{Y'}. The “error” may be computed at one or more of the input level (e.g. looking at the difference between \mathbf{Y} and \mathbf{Y'}) and the latent “associative” representation level (e.g. between the “sensory” input encoding and the “motor” input encoding). In any case there seem to be enough error signals to drive optimisation and learning.

Of course, this model leaves out one vital component: time. Our sensory input and muscle output is constantly changing. This means that the model should be indexing all the variables by time (e.g. \mathbf{X(t)}, \mathbf{X'(t)}, \mathbf{Y(t)}, \mathbf{Y'(t)}, and the latent “associative” representations). Also the forward and backward passes will occur at different times (the forward pass being available earlier than the backward pass). This means that our errors are also errors in time, e.g. between \mathbf{X(t=1)} and \mathbf{X'(t=2)} for a common discrete time basis.

Many of the machine learning papers I read nowadays feature a bewildering array of architectures and names (ResNet, UNet, Glow, BERT, Transformer XL etc. etc.). However, most appear to be using similar underlying principles of architecture design. The way I deal with the confusion and noise is to try to look for these underlying principles and to link these to neurological and cognitive principles (if only to reduce the number of things I need to remember and make things simple enough for my addled brain!). Autoencoder origami is one way of making sense of things.


An Introduction to the Predictive Brain

Through a variety of sources, including Sam Harris’ discussion with Anil Seth and Lisa Feldman Barrett’s How Emotions Are Made, I’ve been hearing a lot recently about the “Predictive Brain”. This is a theory of cognition that has rapidly gained ground over the last couple of decades.

Talk of a “predictive brain”, in my reading, can be broken down into theories in several key areas:

  • the “Bayesian” brain, or the application of work on Bayesian probabilities to cognition;
  • predictive coding, a specific framework for modelling information flow between cortical areas, e.g. developed from work on the visual system; and
  • feedback circuitry within the brain, or the ongoing discovery of general patterns of feedback within the cortex and mid-brain structures.

The Bayesian Brain

Let’s start with the Bayesian brain.

Many of us will know Bayes Theorem:


The Bayesian brain hypothesis is that the brain is performing some form of computation that may be modelled using Bayesian probability frameworks.

Within the context of the brain, we can treat ‘X’ as our sensory input (e.g. signals from the retina or cochlea). This is typically in the form of the firing output of groups of neurons.

The ‘Y’ varies depending on how Bayes Theorem is being applied. In many cases, it appears to be applied in a relatively general manner. For example, in this Tutorial Introduction to Bayesian Models of Cognitive Development by Amy Perfors et al, ‘Y’ is taken to refer to a “hypothesis” (h_i). Bayes Theorem thus provides a way to compare the probabilities of different hypotheses given (the ‘|’ symbol) our sensory input (‘X’). If we calculate P(h_1|X), P(h_2|X), \dots P(h_n|X), we can choose the hypothesis with the highest probability. This then becomes the “explanation” for our data. The idea is that the brain is (somehow) performing an equivalent comparison.

As you can imagine, this first approach is fairly high level. It considers a “hypotheses” as synonymous with human “reasons” for the data. In reality, the brain may be performing hundreds of thousands of low-level inferences that are difficult to put into words. In these cases, our “hypotheses” may relate to feature components such as possible orientations of observed lines or a pronounced phoneme.

However, I have also seen Bayes Theorem used to model activity in lower level neural circuits (normally in the cortex and in the visual areas). In these cases:

  • P(Y|X) can be seen as the probability of some form of neural process, i.e. the output of a neural circuit, given a particular sensory input, e.g. a context at a particular time. This is our “prediction” in a model of the “predictive brain”. In probability terms, it is known as the “posterior”.
  • P(X) is the probability of our sensory input per se. This can be thought as a measure of how likely the sensory input is, outside of any particular context, e.g. how often has the neural circuit experienced this particular pattern of input firing. In probability terms, it is known as the “evidence” or “marginal likelihood”. It acts as a normalising factor in Bayes Theorem, i.e. acts so that P(Y|X) is a true probability with a value of between 0 and 1.
  • P(Y) is the probability of the output of the neural circuit. Many neural circuits implement some form of function on their inputs, such as acting as integrators. ‘Y’ may be considered a pattern of firing that arises from the neural circuit, and so P(Y) indicates a measure of how often this output pattern of firing is experienced. In probability terms, this is know as the “prior”. It encapsulates what is known about the output before experiencing the sensory input at a particular time.
  • P(X|Y) is part of the “magic sauce” of Bayes Theorem. It is a probability of the sensory input given or assuming a particular neural circuit output. For example, for all the instances where you see a given pattern of output firing for a particular neural circuit, how common is the sensory input that is experienced? It is known as the “likelihood”.

Some neuroscientists are thus looking at ways Bayesian models of probabilistic computation are implemented by neural circuits. Questions arise such as:

  • How do the terms of the Bayesian model relate to structures in the brain, such as cortical columns, neurons, cortical layers and mid-brain structures?
  • How is the Bayesian model applied by the brain? The evidence appears to be steering towards the presence of a hierarchical model of inference, e.g. there are large numbers of neural circuits performing computations in parallel that may each be approximated using Bayesian models.
  • How do we relate the way data is encoded and communicated in the brain to numeric values? Neurons have axons, dendrites and synapses and come in a variety of flavours. Neurons fire, and they fire at different rates depending on the sensory input and the results of computations. Synapses are modulated chemically, and different types of synapse may be present for a single neuron, where each type of synapse may have different chemical and temporal properties.
  • How do we relate our high-level, top-down probabilistic models of computation (e.g. which “hypothesis” is more likely) to low-level, bottom-up probabilistic models of neural circuits?

Bayesian models are useful as they provide a framework to make predictions in a mathematical manner. They are useful as they decompose the prediction into a number probabilistic components, where the components may be easier to measure and/or compute.

Predictive Coding

Predictive coding is a theory that may model the activity of lower level neural circuitry. It was originally presented in the context of visual processing performed by the cortex (* short prayer interlude for those brave monkeys, cats and mice *).

Predictive coding models cortical sensory regions as containing functionally distinct sub-populations of neurons:

  • a population that attempts to predict an input based on a current hypothesis; and
  • a population that determines an error between the actual inputs and the predicted inputs.

At a high level, predictive coding is based around the idea that your brain issues a storm of predictions, simulates the consequences as if they were present, and checks and corrects those predictions against actually sensory input. The book How Emotions are Made brilliantly explains some of the high level thinking.

Again, we can ask the question: what is a “hypothesis” for a cortical sensory region?

I have seen “hypotheses” explained at both high and low levels. As before, at a high level, a “hypothesis” may be something like “Tiger?”. At a lower level, a “hypothesis” may be something like “face?”. And at a very low level, a “hypothesis” may be something like “line at 45 degrees in a small part of my upper right visual field?” or “tone change from 5kHz to 3kHz?”.

Predictive coding is a theory that is routed to the cortex of the brain, the wrinkly table-cloth-sized sheet of pink-grey matter that most people visualise when they think of the brain.

It has been know for a while that sensory processing in the cortex of the brain is configured hierarchically. For example, there are areas of the visual cortex that receive input from the retina (via the thalamus – important to note for later), perform cortical “computation” and pass “data” via patterns of firing to different areas of the visual cortex (many of them neighbouring areas). In the Figure below, visual input arrives at V1 and then is processed from left to right.

From the excellent paper by Barbara Finlay and Ryutaro Uchiyama – Developmental mechanisms channeling cortical evolution

The processing of the cortex is a processing hierarchy as “higher” cortical regions have a larger receptive field (e.g. working up to the full visual or auditory field) and receive complex inputs, e.g. inputs from many abstract “lower” cortical regions. These “lower” cortical regions have smaller, more specific receptive fields. At the bottom of the hierarchy you have either sensory input or motor output.

In fact, the cortex has multiple hierarchies: an input hierarchy for sensory modalities and an output hierarchy for motor outputs, where the connecting “pinch point” of cortex is fairly wide and deep and contains the abstractions of the associative areas.

It has also been known for a while that the cortex has a layered structure. This was discovered through early staining experiments that showed different bands of neuronal density.

By Henry Vandyke CarterHenry Gray (1918) Anatomy of the Human Body

Most sensory areas of the cortex have around six layers. Each layer has been found to have a different function.

The functional neuron populations required for predictive coding may be found in this layered structure of the cortex. Layers 2 and 3 provide an output from a cortical column, this may be seen as a feed-forward output that is received by layer 4 of a subsequent (higher level) cortical column. In predictive coding models this is seen as a “prediction error”. Layers 5 and 6 also provide an output from a cortical column, this may be seen as a feed-back output that is received by layer 1 of (many) previous (lower) cortical columns. In predictive coding models this is seen as a “prediction”. In this manner, a processing hierarchy is generated.

More detail of the configuration of these neural circuits may be found in the also excellent paper by AM Bastos et al – Canonical Microcircuits for Predictive Coding. Another great paper for explaining predictive coding in the context of the visual cortex is Rajesh PN Rao’s (semi-famous) paper – Predictive Coding in the Visual Cortex. Rao’s paper sets out a rather nice model of predictive coding applied to images that I will have to try to implement.

Predictive coding theories have some nice properties. If a neural circuit can successfully predict the input it should be expecting, it does not output a prediction error. This is efficient – populations of neurons only expend energy when a sensory input cannot be predicted. This also provides a model for how cascades of activity can pass through the hierarchical areas of the brain – prediction errors are passed “upwards” until they meet a neural circuit that can successfully predict its input, and this then leads to a feedback cascade with the successful prediction being passed down the hierarchy.

Let’s try to explain this in words with an image processing example.

Let’s have two layers of cortex that receive an input image. Before receipt of the image our prediction from our top layer (2) is a null or resting prediction (say 0). This is passed to the first layer (1). The first layer (1) receives the image (I) and applies a function to it to generate what Rao calls a set of “causes” for the input. As we have a non-zero input, these causes will be non-zero. A “prediction” error is calculated between the causes as generated by the first layer (1) and the prediction (which may be said to be a prediction of the causes). This error is then passed to the top layer (2). This error will be non-zero as our initial prediction is zero and our causes in the first layer (1) are non-zero: what we expect from the top down at the start is not what we see from the bottom up. The top layer (2) receives the error and applies a function to determine a set of second order causes (or layer 2 causes). These second order causes are then sent to the first layer (1) as the prediction from the top layer (2). The first layer (1) thus receives a modified prediction from the top layer (2) and a new prediction error is generated. The process can repeat over time until the system stabilises.

There are some gaps in my understanding of predictive coding. I need to play around with some actual models to see how information flows up and down the processing hierarchy. A good place to start is the Predictive Coding Networks described here. This video lecture by David Cox is also great.

Feedback in the Brain

The theories of predictive coding are built upon the levels of feedback that are observed in the visual cortex. From several decades of research we now know that there are multiple levels of feedback that are occurring in the brain. This feedback provides the basis for many theories of “prediction” as they represent pathways for information to flow in a top-to-bottom manner (in addition to conventional feed-forward bottom-to-top manner). Here “top” can be seen as more complex integrated representations and “bottom” can be seen as closer to raw sensory input.

At a first level, we have feedback within cortical layers. This is explained nicely in Bastos’ Canonical Microcircuits for Predictive Coding. Within cortical layers there appear to be recurrent connections (for example within layers 2 and 3 and layers 5 and 6).

At a next level, we have feedback within cortical columns. For example, neurons in layers 2 and 3 excite neurons in layers 5 and 6 and neurons in many of the layers both excite and inhibit neurons in upper layers.

We then have feedback between cortical areas. This resembles the feedback modelled with predictive coding.

As well as feedback within the cortex, there are also loops that extend between sub-cortical structures such as the thalamus and the basal ganglia. Sensory input arrives at the thalamus from sense organs and is projected to the cortex. However, there appear to be 10-100 times more connections from the cortex to the thalamus as from the thalamus to the cortex. This suggests that the thalamus applies some kind of sensory gating and/or attention based on cortical feedback. The basal ganglia appears to be an adapted early muscle control centre that is important for action selection and error identification. The basal ganglia receives cortical inputs and also projects to the thalamus.

These various levels of feedback may embody error signals that indicate differences between what is being perceived and what is being predicted. For example, connections from the cortex to the thalamus may gate sensory input in the case where we have successful prediction.

Summing Up

In this post we have looked at some of the ways the brain may be said to “predict” the outside world.

The brain may be modelled using Bayesian approaches. Predictive coding provides a way to understand perception. And the brain has many feedback and recurrent couplings that appear to pass information from higher processing areas to lower processing areas.

The challenge now is to use this knowledge to start building intelligent systems.