Bio-Inspired Robotics for Beginners

There are several well-known problems with modern “deep” learning approaches. These include the need for large quantities of training data and a lack of robustness, and the two are related. Neural network architectures are trained by computing an error between some “ground truth” values in the training data and the architecture’s prediction of those same values, given a set of associated input data. The error is then propagated back through the architecture using the derivatives of each component. To me, this feels like doing everything in reverse.

Animal brains offer an alternative view of intelligence. Animals are not taught complex behaviours by providing millions of labelled training examples. Instead, they learn the natural (local) structure of their environment over time.


What Do We Know that Might Help?

So, what do we know about animal brains that could help us build artificial intelligent systems?

Quite a lot it turns out.

Mammals have evolved a clever solution to the problem of predicting their environment. They have multiple sense organs that provide information in the form of electrical signals. These electrical signals are provided to neural structures. There are roughly three tiers: the brainstem, the midbrain and the cortex. These represent different levels of processing. They also show us the pathway of evolution: newer structures have been built on top of older structures, and older structures have been co-opted to provide important support functions for the newer structures.

In the mammalian brain there are some commonalities in how information is received and processed. The cortex appears to be the main computational device. Different sensory modalities appear to be processed using similar cortical processes. Pre-processing and feedback are controlled by midbrain structures such as the thalamus and basal ganglia.

In terms of knowledge of sensory processing, we know most about the visual system. We then know similar amounts about auditory and motor systems, including perception of the muscles and skin (somatosensory). We know the least about smell and interoception, our sensing of internal signals relating to organs and keeping our body in balance. All the sensory systems use nerve fibres to communicate information in the form of electrical signals.

Cortex I/O

In mammalian brains, the patterns of sensory input to the brain and motor output are fairly well conserved across species. All input and output (apart from smell) is routed via the thalamus. The basal ganglia appears to be a support structure for at least motor control. The brain is split into two halves, and each half receives input from one side of the body. Most mammals have the following:

  • V1 – the primary visual cortex – an area of the cortex that receives an input from the eye (the retina) via the thalamus;
  • A1 – the primary auditory cortex – an area of the cortex that receives an input from the ear via the thalamus;
  • S1 – the primary somatosensory cortex – an area of the cortex that receives an input from touch, position, pain and temperature sensors that are positioned over the skin and within muscle structures (again via the thalamus);
  • Insula – an area of the cortex that receives an input from the body, indicating its internal state, from the thalamus; and
  • M1 – the primary motor cortex – an area of the cortex that provides an output to the muscles (it receives feedback from the thalamus and basal ganglia).

For a robotic device, we normally have access to the following:

  • Image data – for video, frames (two-dimensional matrices) at a certain resolution (e.g. 640 by 480) with a certain frame rate (e.g. 30-60 frames per second) and a certain number of channels (e.g. 3 for RGB). The cortex receives image information via the lateral geniculate nucleus (LGN) of the thalamus. The image information is split down the centre, so each hemisphere receives one half of the visual field. The LGN has been shown to perform a Difference of Gaussians (DoG) computation to provide something similar to an edge image. The image that is formed in V1 of the cortex is also mapped to a polar representation, with axes representing an angle of rotation and a radius (or visual degree from the horizontal).
  • Audio – in one form, a one-dimensional array of intensities or amplitudes (typically 44100 samples per second) with two channels (e.g. left and right). The cochlea of the ear actually outputs frequency information as opposed to raw sound (e.g. pressure) information. This can be approximated in a robotic device by taking the Fast Fourier Transform to get a one-dimensional array of frequencies and amplitudes. 
  • Touch sensors / motor positions / capacitive or resistive touch – this is less standardised and comes in a variety of formats. We can normally pre-process to a position (in 2 or 3 dimensions) and an intensity. This could be multi-channel, e.g. at each position we could have a temperature reading and a pressure reading. On a LEGO Mindstorms EV3 robot, we have a Motor class that provides information such as the current position of a motor in pulses of the rotary encoder, the current motor speed, whether the motor is running and certain motor parameters.
  • Computing device information – this is again more of a jump. An equivalent of interoception could be seen as information on running processes and system utilisation. If the robotic device has a battery this could also include battery voltages and currents. In Python, we can use something like psutil to get some of this information. On a LEGO Mindstorms EV3 robot, we have PowerSupply classes that provide information on the battery power.
  • Motor commands – robotic devices may have one or more linear or rotary motors. Typically, these are controlled by sending motor commands. This may be to move to a particular relative position, to move at a given speed for a given time and/or to rotate by a given number of rotations.
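
As a rough illustration of the audio bullet above, the sketch below turns one frame of raw samples into an array of frequencies and amplitudes using NumPy's real FFT. The frame length of 1024 samples, the Hann window, the test tone and the function name are all assumptions made for this example; they are not taken from any particular robot API.

```python
import numpy as np

SAMPLE_RATE = 44100  # samples per second, as quoted in the text


def audio_frame_to_spectrum(frame, sample_rate=SAMPLE_RATE):
    """Return (frequencies in Hz, amplitudes) for one mono audio frame."""
    frame = np.asarray(frame, dtype=float)
    # A Hann window reduces leakage from cutting the signal into frames
    window = np.hanning(len(frame))
    spectrum = np.fft.rfft(frame * window)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return freqs, np.abs(spectrum)


# Hypothetical test input: a 1 kHz tone lasting one 1024-sample frame
t = np.arange(1024) / SAMPLE_RATE
frame = np.sin(2 * np.pi * 1000.0 * t)
freqs, amps = audio_frame_to_spectrum(frame)
peak_hz = freqs[np.argmax(amps)]
```

With a 1024-sample frame each frequency bin is about 43 Hz wide, so peak_hz lands on the bin nearest to 1 kHz rather than on 1 kHz exactly.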

We will leave smell for now (although it’s a large part of many mammals’ sensory repertoire). It probably slots in best as part of interoception. One day sensors may provide an equivalent data stream.

We can summarise this rough mapping as follows:

Brain pathway | Robot equivalent
Eyes / Retina > V1 | Video Camera > Split L/R > Polar Mapping
Ears / Cochlea > A1 | Microphone > Channel Split L/R > FFT
Outer body + muscle > S1 | Multisensor > Position + value
Interoception > Insula | Device measurements > numeric array
M1 > muscles | Numeric commands > motors

One advantage of the “deep” learning movement is that it is now conventional to represent all information as multidimensional arrays, such as NumPy arrays within Python. Indeed, we can consider “intelligence” as a series of transformations between different multidimensional arrays.
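
To make the idea of “transformations between arrays” a bit more concrete, here is a minimal sketch of two of the visual steps mentioned earlier: a Difference of Gaussians filter and a left/right split of the frame. The sigma values, the synthetic test frame and the function names are assumptions for illustration, and the blur is a plain separable convolution rather than any particular library routine.

```python
import numpy as np


def gaussian_kernel(sigma):
    """One-dimensional Gaussian kernel, normalised to sum to 1."""
    radius = int(3 * sigma) + 1
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def blur(image, sigma):
    """Separable Gaussian blur: filter along rows, then along columns."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(np.convolve, 1, image, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")


def difference_of_gaussians(image, sigma_small=1.0, sigma_large=2.0):
    """Subtract a wide blur from a narrow blur to emphasise edges."""
    return blur(image, sigma_small) - blur(image, sigma_large)


def split_left_right(image):
    """Split a frame down the centre into left and right halves."""
    mid = image.shape[1] // 2
    return image[:, :mid], image[:, mid:]


# Hypothetical test frame: dark on the left, light on the right
frame = np.zeros((48, 64))
frame[:, 32:] = 1.0
dog = difference_of_gaussians(frame)
left, right = split_left_right(dog)
```

In flat regions the two blurs cancel, so the output is close to zero except near the vertical edge (and near the image border, where np.convolve pads with zeros).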

Cortex Properties 

The mammalian cortex is a two-dimensional sheet. This provides a strong constraint on the computational architecture. Human brains appear wrinkly because they are trapped within our skulls and need to maximise their surface area. Mouse brains are quite smooth.

The cortex is a layered structure, and the layers give the sheet its thickness of a few millimetres in total. There are between 4 and 6 layers, depending on the area of the cortex. The layers contain a large number of neurons, which provide a combination of excitatory and inhibitory connections. Different layers receive different inputs and provide different outputs:

  • feedback appears to be supplied over a large area via layer 1;
  • layer 2 provides feedback to a neighbouring cortical area;
  • layer 3 provides a feed-forward output to other cortical areas (over a wider range);
  • input from other parts of the cortex is received at layers 3 and 4;
  • input from the thalamus is received at layer 4; and
  • layers 5 and 6 provide feedback to the thalamus, with layer 5 also providing a feed-forward output to a neighbouring cortical area.

Computation occurs vertically through the layers and information is passed within the plane of the cortical sheet.

The two-dimensional cortical sheet of many mammals appears to have a common general topology. The input and output areas appear reasonably fixed, and are likely genetically determined. The visual, somatosensory and motor areas appear aligned. This may be what creates “embodied” intelligence; we think conceptually using a common co-ordinate system. For example, the half-image from the eyes is aligned bottom-to-top within the visual processing areas, which is aligned with the feet-to-head axis of the body as felt and the feet-to-head axis of the body as acted upon (i.e. V1, S1 and M1 are aligned). This makes sense – it is more efficient to arrange our maps in this way.

From Finlay et al

The cortex also appears to have neuronal groupings that have certain functional roles. This is best understood in the visual processing areas, where different cortical columns are found to relate to different portions of the visual field; each column has a receptive field equivalent to a small group of pixels (say somewhere around 1000). Outside of the visual areas the evidence is shakier.

The cortex of higher mammals also appears to have a uniform volume but a differing neuronal density. This neuronal density appears similar to a diffusion gradient. Within the visual areas towards the back of the brain there are a large number of neurons per square millimetre; towards the front of the brain there are fewer neurons per square millimetre. The baboon has a 4:1 ratio. However, because the volume of the cortical sheet is reasonably constant, the neurons towards the front of the brain are more densely connected (as there is room). A simple gradient in neuron number may provide an information bottleneck that forces a compression of neural representations, leading to greater abstraction.

Considering a two-dimensional computing sheet gives us an insight into two cortical pathways that have been described for vision. A first processing pathway (the dorsal stream) is the line drawn from V1 to S1 and M1. This pathway is swayed towards motion, i.e. information that is useful for muscle control. A second pathway (the ventral stream) is the line drawn from V1 to A1. This pathway is swayed towards object recognition, which makes sense because we can correlate audio and visual representations of objects to identify them. What is also interesting is that the lower visual field abuts the first processing pathway and the upper visual field abuts the second processing pathway – this may reflect the fact that the body is orientated below the horizontal line of vision and so it makes sense to map the lower visual field to the body in a more one-to-one manner.

The cortical sheet also gives us an insight into implementing efficient motor control. You will see that S1, the primary somatosensory area, is adjacent to M1, the primary motor cortex. This means that activation of the motor cortex, e.g. to move muscles, will activate the somatosensory representations regardless of any somatosensory input from the thalamus. We thus have two ways of activating the body representations, one being generated from the motor commands and one being generated by the input from the body. This gives us the possibility to compare and integrate these signals in time to provide control. For example, if the signal received from the body did not match the proposed representation from the motor commands then either the motor commands may be modified or our sensing of our body is modified (or both).
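
The comparison described above can be sketched for a single rotary motor. This is only a toy illustration: the function names, the proportional gain and the numbers are all hypothetical, and a real controller would need proper units and tuning.

```python
def predict_position(position, command_speed, dt):
    """Forward model: where the motor command says the joint should end up."""
    return position + command_speed * dt


def control_step(sensed, predicted, command_speed, gain=0.5):
    """Compare sensed and predicted positions and adjust the command.

    Returns the mismatch and a corrected speed. The gain is an
    assumed tuning value, not something derived from the text.
    """
    mismatch = sensed - predicted
    # If the body lags behind the prediction, push a little harder
    corrected_speed = command_speed - gain * mismatch
    return mismatch, corrected_speed


# Hypothetical numbers: command 10 degrees/s for 0.1 s, starting at 0 degrees
predicted = predict_position(0.0, 10.0, 0.1)
# The encoder reports less movement than expected
mismatch, corrected = control_step(sensed=0.8, predicted=predicted,
                                   command_speed=10.0)
```

Here the mismatch is used to modify the motor command. The other option mentioned in the text, updating the body estimate instead, would simply swap which of the two values is trusted.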

So to recap:

  • the main computing machinery of the brain is a two-dimensional sheet;
  • it has an embodied organisation – the relative placement of processing areas has functional relevance;
  • it has a density gradient that is aligned with the embodied organisation; and
  • it appears to use common, repeated computing units.

Cortex to Robot

This suggests a plan for organising computation for a robotic device.
Computers typically work according to a one-dimensional representation: data in memory. If we are creating a bio-inspired robotic device, we need to extend this to two dimensions, where the two dimensions indicate an ordering of processing units. It might be easier to imagine a football field of interconnected electronic calculators.

We can then align our input arrays with the processing units. At input and output areas of any computing architecture there could be a one-to-one organisation of array values to processing units. The number of processing units may decrease along an axis of the two-dimensional representation, while their interconnections increase. In fact, a general pattern is that connectivity is local at points of input or output but becomes global in between these points. These spaces in between store more abstract representations.

This architecture of the cortex reminds us of another architecture that has developed with “deep” learning: the autoencoder.

Standard Autoencoder

Indeed, this is not by accident – the structure of the visual processing areas of the brain has been the inspiration for these kinds of structures. Taking a two-dimensional view of computation also allows us to visualise connectivity using a two-dimensional graph structure.


So we can start our bio-inspired robot by constructing a setup where different sensory modalities are converted into NumPy arrays and then processed so that we can compare them in a two-dimensional structure. The picture becomes murkier once we look at how associations are formed and how the thalamus comes into the equation. However, by looking at possible forms of our signals in a common processing environment we can begin to join the dots.



Predicting the Future

The rise of machine learning and developments in neuroscience hint that prediction is key to how brains navigate the world. But how could this work in practice?

Let’s ignore neuroscience for a minute and just think about how we would manually predict the future. Off the top of my head there appear to be three approaches, which we will look at below.

Instantaneous Prediction Using Rates of Change


In the physical world, one way to predict the future is to remember our secondary school physics (or university kinematics). For example, if we wanted to predict the position of an object moving along one dimension, we would attempt to find out its position, speed and acceleration, and use the equations of motion. In one dimension, speed is just a measure of a change of position over time (a first-order differential with respect to time). Acceleration is a measure of change of speed over time (a second-order differential with respect to time, or a change in a change).

In fact, it turns out the familiar equations of motion are actually just a specific instance of a more general mathematical pattern. James Gregory, Brook Taylor and Colin Maclaurin formalised this approach in the 17th and 18th centuries, but thinking on this issue goes back at least to Zeno, Democritus and Aristotle in ancient Greece, as well as Chinese, Indian and Middle Eastern mathematicians. In modern times, we generally refer to the pattern as a Taylor Series: a function may be represented as an infinite sum of differentials about a point.
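
For reference, the general pattern can be written out as below. The symbols f, t_0 and \Delta t are introduced here purely for illustration: f is the quantity being predicted, t_0 is the point about which we expand and \Delta t is how far ahead we look.

```latex
f(t_0 + \Delta t) = f(t_0) + f'(t_0)\,\Delta t
  + \frac{1}{2!} f''(t_0)\,\Delta t^2
  + \frac{1}{3!} f'''(t_0)\,\Delta t^3 + \dots
```

If f is position, then f' is velocity and f'' is acceleration. When the acceleration is constant, every term beyond the second-order one vanishes and the series stops there.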

For example, our normal equation for distance travelled in one dimension is d(t) = d_0 + v*t + \frac{1}{2}*a*t^2, where d is distance, v is velocity and a is acceleration. In this case, velocity is our first differential – a first-order rate of change of distance with time – and acceleration is our second differential – a second-order rate of change of distance with time (or a rate of change of velocity). We know from school that if we have constant velocity, a is zero and we just have d(t) = d_0 + v*t. What isn’t stressed as much is that the equations learnt by every school kid are for a single dimension with, at worst, constant acceleration. It is not until university that the veil begins to be removed and we realise that if we have changing acceleration (third-order changes), we need to add another term, and that we can continue this ad nauseam.
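
As a small worked example, the function below adds one term per known rate of change, so the constant-velocity and constant-acceleration equations are just the first two cases. The function name and the numbers are made up for illustration.

```python
from math import factorial


def taylor_predict(value, derivatives, dt):
    """Predict a value dt ahead from the current value and its rates of change.

    derivatives[0] is the first-order rate (e.g. velocity),
    derivatives[1] the second-order rate (e.g. acceleration), and so on.
    """
    prediction = value
    for order, rate in enumerate(derivatives, start=1):
        prediction += rate * dt ** order / factorial(order)
    return prediction


# Start at d_0 = 0 with v = 2 and look ahead t = 3
d_const_velocity = taylor_predict(0.0, [2.0], 3.0)         # 0 + 2*3
# Add a constant acceleration a = 1
d_const_accel = taylor_predict(0.0, [2.0, 1.0], 3.0)       # ... + 0.5*1*3^2
# Add a third-order term (a changing acceleration) of 0.5
d_with_jerk = taylor_predict(0.0, [2.0, 1.0, 0.5], 3.0)    # ... + 0.5*3^3/6
```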

As you move into more advanced engineering and physics classes, you are also taught how to extend the equations of motion into two or three dimensions. When we have movement in three-dimensional space, we can model points in spacetime using four-dimensional vectors ([x, y, z, t]). As we move into the multidimensional case, we can look at how each dimension changes with respect to each other dimension for the point. For example, changes in an x-dimension with time (t) may resemble our one-dimensional speed. However, in the multivariate case we now can determine changes in the y and z direction with time. Hence, our velocity becomes a vector indicating how the x, y and z dimensions of the point change over time. Because the directions of modelled Cartesian space are orthogonal, we can analyse them separately.

We will stop here for now but we can also go further. If we can guess at the mass of an object, we can predict an acceleration using F=ma. Hence, our brains can begin to build a suite of approximated functions that predict the rates of change from additional data or latent variables discerned from the data.

Neural Units

So let’s return to thinking about brains. Populations of neurons in our brains do not have god-like powers to view an objective reference frame that depicts spacetime. Instead, they are fixed in space, and travel through time. What they can be is connected. But even this must be limited by practical reality, such as space and energy.

In fact, we can think about “neural units”, which may be a single neuron or a population of neurons. What makes a neural unit “a unit” is that it operates on a discrete set of information. In an image processing case this may be a pixel; in an audio processing example this may be a sample in time, or a frequency measurement.

Now the operation of a neural unit begins to resemble the assumptions for the Taylor Series: it is the point around which our function is evaluated, and all we need is local information relating to our derivatives. We’ll ignore for now the fact that our functions may not be infinitely differentiable about our unit, as it turns out approximations often seem to work fine.

So bringing this all together, we see that a neural unit may be able to (approximately) predict future activity, either of itself in time, or its neighbours in space, by determining local rates of change of different orders.

For example, if we consider the image intensity of a single pixel, we can see that a neural unit may be able to predict the intensity of that pixel at a future time, if there are patterns in the rates of change. For example, if the pixel intensity is increasing at a constant rate, r, then the intensity at a future time, t, may be determined like our velocity above: I(t) = I_0 + r*t.

Linear Approximations

Another way of viewing the same thing is to think about linear approximations.

What are linear approximations? They are just functions where the terms are linear, i.e. are not raised to a power. In a Taylor Series this means chopping off everything past a first order differential. Now, if a car is accelerating towards you, assuming a constant velocity is going to be a very costly mistake. But what is surprising is that a fair bit of engineering is built upon linear approximations. In fact, even now some engineers pull a funny face and start sweating when you move away from linear models. It turns out that a significant portion of the world we live in has first order patterns.

If you go up a little way to include second order patterns, you find that another large chunk of the world can be approximated there. This is most visible in the equations of motion. Why can we stop with second order acceleration terms? Because the acceleration due to gravity is roughly constant near the Earth’s surface, at about 9.8 m/s². Until recently, the main things that displayed changing accelerations were animals.

So can we just throw away higher order terms? Not exactly. One issue with a power series is that as we try to predict further from our point or neural unit, higher order differentials become more important. For example, if we look forward beyond the first few immediate units, the higher terms will dominate the prediction. This is often compounded by sensory inaccuracy: the rates of change will never be exact, and errors in measurement are multiplied by the large high-order terms.
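
A quick numerical check of this, using sin(t) as a stand-in signal (my choice for illustration, as its Taylor series about zero is well known). The first function truncates the series, and the second shows how a fixed error in a measured rate of change is scaled up as we look further ahead.

```python
import math


def taylor_sin(t, order):
    """Taylor approximation of sin(t) about 0, truncated at `order`."""
    total = 0.0
    for n in range(1, order + 1, 2):  # sin only has odd-order terms
        sign = (-1) ** (n // 2)
        total += sign * t ** n / math.factorial(n)
    return total


def truncation_error(t, order):
    """Absolute error of the truncated series at distance t."""
    return abs(math.sin(t) - taylor_sin(t, order))


def measurement_error_effect(rate_error, t, order):
    """Prediction error caused by an error in the rate of a given order."""
    return rate_error * t ** order / math.factorial(order)


near_linear = truncation_error(0.1, 1)   # linear model, close to the point
far_linear = truncation_error(2.0, 1)    # linear model, far from the point
far_cubic = truncation_error(2.0, 3)     # adding the third-order term helps

near_noise = measurement_error_effect(0.1, 1.0, 2)
far_noise = measurement_error_effect(0.1, 10.0, 2)
```

Close to the point the linear model is almost exact. Further away it is badly wrong unless higher-order terms are added, and the same error in a second-order rate costs a hundred times more at ten steps than at one.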

So what have we learnt? If we are predicting locally, either in space or in time, in a world with patterns in space-time, we can make good approximations using a Taylor series. However, these predictions become less useful in a rapidly changing world where we need to predict over longer distances in space and time.

Prediction Using Cycles

Image by Dirk Rabe from Pixabay 

Many of the patterns in the natural world are cyclical. These include the patterns of day and night caused by the rotation of the Earth upon its (roughly) north-south axis, the lunar months and tides caused by the orbit of the moon around the Earth, and the seasons caused by the Earth’s orbit around the sun (together with the tilt of its axis). These are “deep” patterns – they have existed for the whole of our evolutionary history, and so are modelled at least chemically at a low level in our DNA.

There are then patterns that are based on these patterns: patterns of sleep and rest, of meal times, of food harvest, of migration, or of our need for shelter. Interestingly, many of these patterns are interoceptive patterns, e.g. relating to an inner state of our bodies representing how we feel.

The physical world also has patterns. Oscillations in time generate sound. Biological repetition and feedback cycles generate cyclical patterns, such as the vertical lines of light-and-dark observed looking into a forest or the stripes on a zebra.

How do we make predictions using cycles?

Often we have reference patterns that we apply at different rates. In engineering, this forms the basis of Fourier analysis. The rate of repetition we refer to as “frequency”. We can then build up complex functions and signals by the addition of simple periodic or repeating functions with different magnitudes and phases. In mathematics the simple periodic functions are typically the sine or cosine functions.

Image by David Zydd from Pixabay

So if we can use a base set of reference patterns, how does working in the frequency domain help prediction?

In one dimension the answer is that we can make predictions of values before or after a particular point based on our knowledge of the reference functions, and estimates for magnitudes and phases. For a signal that extends in space or time, we only need to know a general reference pattern for a short patch of space or time, stretch or shift it and repeat it, rather than trying to predict each point separately and individually.

When our brain attempts to predict sounds, it can thus attempt to predict frequencies and phases as opposed to complex sound waveforms. In space, things are less intuitive but apply similarly. For example, repeating patterns of intensity in space, such as stretches of light and dark lines (the stripes on a zebra), may be approximated using a reference pattern of one light and one dark line, and then repeating the pattern at an estimated scale, strength and phase. Many textures can be efficiently represented in this way (think of the patterns on plants and animals).

Thinking about neural units, we can see how hierarchies of units may be useful to implement predictions of periodic sequences. We need a unit or population of units to replicate the reference pattern, and to somehow represent an amplitude and phase.
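
As a rough sketch of prediction using cycles, the code below estimates the frequency, amplitude and phase of the strongest cycle in a window of samples and then extends that single reference pattern into the future. It assumes a signal dominated by one sinusoid that completes a whole number of cycles in the window; the function names and the test signal are made up for illustration.

```python
import numpy as np


def fit_dominant_cycle(signal):
    """Estimate (frequency, amplitude, phase, offset) of the strongest cycle.

    Frequency is in cycles per sample. Assumes the cycle fits the
    window a whole number of times.
    """
    n = len(signal)
    spectrum = np.fft.rfft(signal)
    # Skip bin 0, which only holds the constant offset
    k = int(np.argmax(np.abs(spectrum[1:])) + 1)
    amplitude = 2.0 * np.abs(spectrum[k]) / n
    phase = np.angle(spectrum[k])
    offset = spectrum[0].real / n
    return k / n, amplitude, phase, offset


def predict_cycle(params, sample_indices):
    """Evaluate the fitted cycle at any sample indices, past or future."""
    freq, amplitude, phase, offset = params
    idx = np.asarray(sample_indices, dtype=float)
    return offset + amplitude * np.cos(2 * np.pi * freq * idx + phase)


# Hypothetical signal: 4 cycles across a 64-sample window
n = np.arange(64)
signal = 1.0 + 3.0 * np.sin(2 * np.pi * 4 * n / 64)
params = fit_dominant_cycle(signal)

# Predict the next 8 samples without having seen them
future = np.arange(64, 72)
predicted = predict_cycle(params, future)
expected = 1.0 + 3.0 * np.sin(2 * np.pi * 4 * future / 64)
```

Four numbers are enough to predict every future sample of this signal, which is the economy the text is pointing at.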

Statistical Prediction

A third way to make predictions is using statistics and probability.


Statistics is all about large numbers of measurements (“big data”, when that was trendy). If we have large numbers of measurements we can look for patterns in those numbers.

Roll a six-sided dice a few times and you will record what look like random outcomes. We might have three “4s”, and two “1s”. Roll a dice a few million times and you will see that each of the six numbers occurs in more-or-less similar proportions: each number occurs 1/6th of the time. The probability of rolling each number may then be represented as “1/6”.

Rates of change are fairly useless here. This is because we are dealing with discrete outcomes that are often independent. These “discrete outcomes” are also typically complex high-order events (try explaining “roll a dice” to an alien). If you were to measure the change in rolled number (e.g. “4” on roll 1 minus “1” on roll 0 = 3), this wouldn’t be very useful. Similarly, there are no repeating patterns in time or space that make Fourier analysis immediately useful for prediction.

Thinking about a neural unit, we can see that probability may be another way to predict the future. If a neural unit received an intensity for a pixel associated with the centre of a dice, it could learn that the intensity could be 0 or 1 with a roughly 50% likelihood (e.g. numbers 1, 3 and 5 having a central dot, which is absent from numbers 2, 4 and 6). If it got an intensity of 0.5, something strange has happened.

Probability, at its heart, is simply a normalised weight for a likelihood of an outcome. We use a value between 0 and 1 (or 0 and 100%) so that we can compare different events, such as rolling a dice or determining if a cow is going to charge us. In a discrete case, we have a set of defined outcomes. In a continuous case, we have a defined range of outcome values.
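
The dice example is easy to simulate. The sketch below uses Python's standard random module with a fixed seed so that runs are repeatable; the function name and the roll counts are my own choices.

```python
import random
from collections import Counter


def estimate_probabilities(n_rolls, seed=0):
    """Roll a simulated six-sided dice n_rolls times and return proportions."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    # Normalise the counts so that the proportions sum to 1
    return {face: counts[face] / n_rolls for face in range(1, 7)}


few = estimate_probabilities(10)       # looks fairly random
many = estimate_probabilities(60000)   # each face close to 1/6
```

The normalisation step is what turns raw counts into the “normalised weight” described above.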

How Do Rates of Change and Probabilities Fit Together?

Imagine a set of neural units that relate to a pixel in an image. For example, we might look at the pixel nearest to the centre of a webcam image.

In this case, each neural unit may have one associated variable: an intensity or amplitude. Say we have an 8-bit image processing system, so the neural unit can receive a value between 0 and 255 representing a measured image characteristic. This could be a channel measurement, e.g. an intensity for lightness (say 0 is black and 255 is white) or for “Red” (say 0 is not red and 255 is the most red) or an opponent colour space (say 0 is green and 255 is red).

Now nature is lazy. And thinking is hard work. Our neural units want to minimise any effort or activity.

One way to minimise effort is to make local predictions of sensory inputs, and to only pass on a signal when those predictions fail, i.e. to output a prediction error.

A neural unit could predict its own intensity at a future time, I(t_0 + t_{interval}), or the intensity of one of its neighbours in space, e.g. I(x_{i+1}, y_{j+1}). If a neural unit receives a sensory intensity I_{sensory}, it can compute an overall intensity prediction I_{prediction} based on its time and space predictions, and then determine the error between them: e = I_{prediction} - I_{sensory}.

One way to approximate a rate of change is to simply compare neighbouring units in space, or current and past values in time. To compute higher orders, we just repeat this comparison on previously computed rates of change.
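
Here is a minimal sketch of that idea for a single unit looking at its own past values. The function names are hypothetical; the error follows the sign convention used above (prediction minus sensory input).

```python
def differences(values):
    """Differences between neighbouring values (a first-order rate of change)."""
    return [b - a for a, b in zip(values, values[1:])]


def predict_next(history, max_order=2):
    """Extrapolate one step ahead using differences up to max_order."""
    prediction = history[-1]
    level = list(history)
    for _ in range(max_order):
        # Repeat the comparison on the previously computed rates
        level = differences(level)
        if not level:
            break  # not enough history for this order
        prediction += level[-1]
    return prediction


def prediction_error(history, sensory, max_order=2):
    """e = I_prediction - I_sensory; zero means nothing needs passing on."""
    return predict_next(history, max_order) - sensory


steady_rise = predict_next([10, 12, 14])          # constant rate of 2
curved_rise = predict_next([1, 4, 9])             # constant second-order rate
quiet = prediction_error([10, 12, 14], 16)        # input matches prediction
surprised = prediction_error([10, 12, 14], 20)    # input is brighter than expected
```

The same functions work across space if the list holds the intensities of neighbouring units rather than past values of one unit.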

If they are arranged in multiple layers, our neural units could begin to predict cyclical patterns. Over time repeated patterns of activity could be represented by the activity of a single set of neural units and a reference to the underlying units that show this activity, e.g. as scaled or shifted. This would be lazier – we could just copy or communicate the activity of the single unit to the lower neural units.

Probability may come into play when looking at a default level of activity for a given context. For example, consider an “at rest” case. In many animals the top of the visual field is generally lighter than the bottom of the visual field. Why is this? Because the sky is above and the ground is below. Of course, this won’t always be the case, but it will be a general average over time. Hence, if it has no other information, a neural unit in the upper visual field would do well to assume a base level of intensity that is higher (i.e. lighter) than that of a neural unit in the lower visual field. This also allows laziness in the brain. A non-light intensity signal received by the neural unit in the upper visual field is more informative than a light intensity signal, as it is less likely. Hence, if there is a finite amount of energy, the neural unit in the upper visual field wants to use more energy to provide a signal in the case of a received non-light intensity signal than in the case of a received light intensity signal. Some of you will have spotted that we are now moving into the realms of (Shannon) entropy.
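
To put rough numbers on this, the information carried by a signal can be measured as its surprisal, -log2(p). The 90% figure below is an assumed value chosen only to illustrate the point, not a measurement.

```python
import math


def surprisal_bits(probability):
    """Information content, in bits, of an outcome with the given probability."""
    return -math.log2(probability)


# Assumed: a unit in the upper visual field sees "light" 90% of the time
p_light = 0.9
p_dark = 1.0 - p_light

light_bits = surprisal_bits(p_light)  # expected signal, little information
dark_bits = surprisal_bits(p_dark)    # unexpected signal, much more information

# Shannon entropy: the average surprisal over both outcomes
entropy_bits = p_light * light_bits + p_dark * dark_bits
```

Under this assumption the rare dark signal carries over twenty times the information of the common light signal, which is why it is the one worth spending energy on.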

In the brain then, it is likely that all these approaches for prediction are applied simultaneously. Indeed, it is probable that the separate functions are condensed into common non-linear predictive functions. It is also likely that modern multi-layer neural networks are able to learn these functions from available data (or at least rough approximations based on the nature of the training data and the high-level error representation).

Playing Around with Retinal-Cortex Mappings

Here is a little notebook where I play around with converting images from a polar representation to a Cartesian representation. This is similar to the way our bodies map information from the retina onto the early visual areas.

Mapping from the visual field (A) to the thalamus (B) to the cortex (C)

These ideas are based on information we have about how the visual field is mapped to the cortex. As can be seen in the above figures, we view the world in a polar sense and this is mapped to a two-dimensional grid of values in the lower cortex.

You can play around with mappings between polar and Cartesian space at this website.

To develop some methods in Python I’ve leaned heavily on this great blogpost by Amnon Owed. This gives us some methods in Processing that I have adapted for my purposes.

Amnon suggests using a look-up table to speed up the mapping. In this way we build a look-up table that maps co-ordinates in polar space to an equivalent co-ordinate in Cartesian space. We then look up each mapping in this table and use it to transform the image data.

import math
import numpy as np
import matplotlib.pyplot as plt

def calculateLUT(radius):
    """Precalculate a lookup table with the image maths."""
    LUT = np.zeros((radius, 360, 2), dtype=np.int16)
    # Iterate around angles of field of view
    for angle in range(0, 360):
        # Iterate over radius
        for r in range(0, radius):
            theta = math.radians(angle)
            # Take angles from the vertical
            col = math.floor(r*math.sin(theta))
            row = math.floor(r*math.cos(theta))
            # rows and cols will be +ve and -ve representing
            # at offset from an origin
            LUT[r, angle] = [row, col]
    return LUT

def convert_image(img, LUT):
    """
    Convert image from cartesian to polar co-ordinates.

    img is a numpy 2D array having shape (height, width)
    LUT is a numpy array having shape (radius, 360, 2)
    storing [row, col] offsets corresponding to [r, angle]
    """
    # Use centre of image as origin
    centre_row = img.shape[0] // 2
    centre_col = img.shape[1] // 2
    # Determine the largest radius that fits inside the image
    if centre_row > centre_col:
        radius = centre_col
    else:
        radius = centre_row
    output_image = np.zeros(shape=(radius, 360))
    # Iterate around angles of field of view
    for angle in range(0, 360):
        # Iterate over radius
        for r in range(0, radius):
            # Get mapped x, y
            (row, col) = tuple(LUT[r, angle])
            # Translate origin to centre
            m_row = centre_row - row
            m_col = col+centre_col
            output_image[r, angle] = img[m_row, m_col]
    return output_image

def calculatebackLUT(max_radius):
    """Precalculate a lookup table for mapping from x,y to polar."""
    LUT = np.zeros((max_radius*2, max_radius*2, 2), dtype=np.int16)
    # Iterate around x and y
    for row in range(0, max_radius*2):
        for col in range(0, max_radius*2):
            # Translate to centre
            m_row = max_radius - row
            m_col = col - max_radius
            # Calculate angle w.r.t. y axis
            angle = math.atan2(m_col, m_row)
            # Convert to degrees
            degrees = math.degrees(angle)
            # Calculate radius
            radius = math.sqrt(m_row*m_row+m_col*m_col)
            # print(angle, radius)
            LUT[row, col] = [int(radius), int(degrees)]
    return LUT

def build_mask(img, backLUT, ticks=20):
    """Build a mask showing polar co-ord system."""
    overlay = np.zeros(shape=img.shape, dtype=bool)
    # We need to set origin backLUT has origin at radius, radius
    row_adjust = backLUT.shape[0]//2 - img.shape[0]//2
    col_adjust = backLUT.shape[1]//2 - img.shape[1]//2
    for row in range(0, img.shape[0]):
        for col in range(0, img.shape[1]):
            m_row = row + row_adjust
            m_col = col + col_adjust
            (r, theta) = backLUT[m_row, m_col]
            if (r % ticks) == 0 or (theta % ticks) == 0:
                overlay[row, col] = 1
    masked = np.ma.masked_where(overlay == 0, overlay)
    return masked

First build the backwards and forwards look-up tables. We’ll set a max radius of 300 pixels, allowing us to map images of up to 600 by 600 pixels.

backLUT = calculatebackLUT(300)
forwardLUT = calculateLUT(300)
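
The nested loops above are easy to follow but slow for large radii. As a rough sketch (not taken from the original notebook), the same forward table and mapping can be written with NumPy broadcasting and array indexing. The names `calculate_lut_vectorised` and `convert_image_vectorised` are hypothetical, and the radius is additionally clipped to the size of the table, which is an assumption of this sketch.

```python
import numpy as np

def calculate_lut_vectorised(radius):
    """Build a (radius, 360, 2) table of [row, col] offsets without loops."""
    r = np.arange(radius).reshape(-1, 1)                 # shape (radius, 1)
    theta = np.radians(np.arange(360)).reshape(1, -1)    # shape (1, 360)
    # Angles are taken from the vertical, as in calculateLUT
    rows = np.floor(r * np.cos(theta))
    cols = np.floor(r * np.sin(theta))
    return np.stack([rows, cols], axis=-1).astype(np.int16)

def convert_image_vectorised(img, lut):
    """Map a 2D image to a (radius, 360) polar image using the table."""
    centre_row = img.shape[0] // 2
    centre_col = img.shape[1] // 2
    # Largest radius that fits in both the image and the table
    radius = min(centre_row, centre_col, lut.shape[0])
    offsets = lut[:radius].astype(np.intp)
    # Translate the origin to the centre of the image
    m_rows = centre_row - offsets[..., 0]
    m_cols = centre_col + offsets[..., 1]
    # Integer-array indexing gathers every mapped pixel in one step
    return img[m_rows, m_cols]
```

The indexing step replaces the two inner loops of `convert_image`; the output has radius on the rows and angle on the columns, as before.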

Now we’ll try this out with some test images from skimage. We’ll take the first colour channel of each and normalise it to a range of 0 to 1.

from skimage.data import chelsea, astronaut, coffee

img = chelsea()[...,0] / 255.

masked = build_mask(img, backLUT, ticks=50)
out_image = convert_image(img, forwardLUT)
fig, ax = plt.subplots(2, 1, figsize=(6,8))
ax[0].imshow(img, cmap='gray', interpolation='bicubic')  # greyscale colormap assumed

ax[0].imshow(masked, alpha=0.5)

ax[1].imshow(out_image, cmap='gray', interpolation='bicubic')

img = astronaut()[...,0] / 255.

masked = build_mask(img, backLUT, ticks=50)
out_image = convert_image(img, forwardLUT)
fig, ax = plt.subplots(2, 1, figsize=(6,8))
ax[0].imshow(img, cmap='gray', interpolation='bicubic')  # greyscale colormap assumed

ax[0].imshow(masked, alpha=0.5)

ax[1].imshow(out_image, cmap='gray', interpolation='bicubic')

img = coffee()[...,0] / 255.

masked = build_mask(img, backLUT, ticks=50)
out_image = convert_image(img, forwardLUT)
fig, ax = plt.subplots(2, 1, figsize=(6,8))
ax[0].imshow(img, cmap='gray', interpolation='bicubic')  # greyscale colormap assumed

ax[0].imshow(masked, alpha=0.5)

ax[1].imshow(out_image, cmap='gray', interpolation='bicubic')

In these methods, the positive y axis is the reference for the angle, which extends clockwise.

Now, within the brain the visual field is actually divided in two. As such, each hemisphere gets half of the bottom image (0-180 to the right hemisphere and 180-360 to the left hemisphere).
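
As a small illustration of this split, the polar image (radius on the rows, angle on the columns) can simply be sliced down the middle. This is only a sketch; `split_hemispheres` is a hypothetical helper and the assignment of halves follows the angle ranges given above.

```python
import numpy as np

def split_hemispheres(polar_img):
    """Split a (radius, 360) polar image into two (radius, 180) halves."""
    right_hemisphere = polar_img[:, :180]   # angles 0 to 179 degrees
    left_hemisphere = polar_img[:, 180:]    # angles 180 to 359 degrees
    return right_hemisphere, left_hemisphere

# Toy polar image where each value encodes its own position
polar = np.arange(4 * 360).reshape(4, 360)
right, left = split_hemispheres(polar)
```

Both halves are views onto the original array, so no pixel data is copied.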

Also within the brain, the map on the cortex is rotated clockwise by 90 degrees, such that angle from the horizontal eye line is on the x-axis. The brain receives information from the fovea at a high resolution and information from the periphery at a lower resolution.

The short Jupyter Notebook can be found here.

Extra: proof this occurs in the human brain!

Folding Autoencoders

It’s nice when different ways of seeing things come together. This often occurs when comparing models of computation in the brain and machine learning architectures.

Many of you will be familiar with the standard autoencoder architecture. This takes an input \mathbf{X}, which may be an image (in a 2D or flattened 1D form) or another input array. It applies one or more neural network layers that progressively reduce the dimensionality of the output. These one or more neural network layers are sometimes called an “encoder”. This forms a “bottleneck” to “compress” the input (in a lossy manner) to form a latent representation. This latent representation is sometimes referred to as a hidden layer. It may be seen as an encoded or compressed form of the input.

A standard autoencoder may have a 1D latent representation with a dimensionality that is much less than the input dimensionality. A variational autoencoder may seek to learn values for a set of parameters that represent a latent distribution, such as a probability distribution for the input.

The autoencoder also has another set of one or more neural network layers that receive the latent representation as an input. These one or more neural network layers are sometimes called a “decoder”. The layers generate an output \mathbf{X'}, which is a reconstruction of the input \mathbf{X}.

The whole shebang is illustrated in the figure below.

Standard Autoencoder

The parameters of the neural network layers, the “encoder” and the “decoder”, are learnt during training. For a set of inputs, the output of the autoencoder \mathbf{X'} is compared with the input \mathbf{X}, and an error is back-propagated through the neural network layers. Using gradient descent to minimise the errors, an optimal set of parameters may be learnt.
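
To make the training loop concrete, here is a minimal sketch of a linear autoencoder trained with plain gradient descent in NumPy. It is a toy under stated assumptions: one linear layer each for the “encoder” and “decoder”, random toy data, and hand-written gradients; `train_autoencoder` and its parameters are hypothetical names, not part of any library.

```python
import numpy as np

def train_autoencoder(X, latent_dim=3, lr=0.05, epochs=500, seed=0):
    """Train a one-layer linear encoder and decoder to reconstruct X."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    W_enc = rng.normal(scale=0.1, size=(n_features, latent_dim))
    W_dec = rng.normal(scale=0.1, size=(latent_dim, n_features))
    losses = []
    for _ in range(epochs):
        Z = X @ W_enc              # encoder: input -> latent representation
        X_rec = Z @ W_dec          # decoder: latent -> reconstruction X'
        err = X_rec - X            # reconstruction error
        losses.append(0.5 * float(np.sum(err ** 2)) / n_samples)
        # Back-propagate the error through the decoder and then the encoder
        grad_dec = Z.T @ err / n_samples
        grad_enc = X.T @ (err @ W_dec.T) / n_samples
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, W_dec, losses

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))      # 200 toy inputs with 8 dimensions each
W_enc, W_dec, losses = train_autoencoder(X)
```

Note that no labels appear anywhere: the input itself is the training target, which is what makes this a form of unsupervised learning.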

As autoencoders do not require a labelled “ground-truth” output to compare with the autoencoder output during training, they provide a form of unsupervised learning. Most are static, i.e. they operate on a fixed instance of the input to generate a fixed instance of the output, and there are no temporal dynamics.

Now, I’m quite partial to unsupervised learning methods. They are hard, but they also are much better at reflecting “natural” intelligence; most animals (outside of school) do not learn based on pairs of inputs and desired outputs. Models of the visual processing pathway in primates, which have been developed over the last 50 years or so, all indicate that some form of unsupervised learning is used.

In the brain, there are several pathways through the cortex that appear similar to the “encoder” neural network layers. With a liberal interpretation, we can see the latent representation of the autoencoder as an association representation formed in the “higher” levels of the cortex. (In reality, latent representations in the brain are not found at a “single” flat level but are distributed over different levels of abstraction.)

If the brain implements some form of unsupervised learning, and seems to “encode” incoming sensory information, this leads to the question: where is the “decoder”?

This is where predictive coding models can help. In predictive coding models, predictions are fed back through the cortex to attempt to reconstruct input sensory information. This seems similar to the “decoder” layers of the autoencoder. However, in this case, the “encoder” and the “decoder” appear related. In fact, one way we can model this is to see the “encoder” and the “decoder” as parts of a common reversible architecture. This looks at things in a similar way to recent work on reversible neural networks, such as the “Glow” model proposed by OpenAI. The “encoder” represents a forward pass through a set of neural network layers, while the “decoder” represents a backward pass through the same set of layers. The “decoder” function thus represents the inverse or “reverse” of the “encoder” function.

This can be illustrated as follows:

Folding the Autoencoder

In this variation of the autoencoder, we effectively fold the model in half, and stick together the “encoder” and “decoder” neural network layers to form a single “reversible neural network”.
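
One very simple way to play with this idea is to tie the weights, so that the backward pass uses the transpose of the forward weights. The sketch below does this with a single linear layer whose columns are orthonormal. This is an assumption made for illustration: it is not how Glow works (Glow uses exactly invertible layers with no bottleneck), and `FoldedAutoencoder` is a hypothetical class name.

```python
import numpy as np

class FoldedAutoencoder:
    """One set of weights used in both directions (tied weights)."""

    def __init__(self, n_features, latent_dim, seed=0):
        rng = np.random.default_rng(seed)
        # QR decomposition gives a matrix with orthonormal columns
        q, _ = np.linalg.qr(rng.normal(size=(n_features, latent_dim)))
        self.W = q                      # shape (n_features, latent_dim)

    def forward(self, x):
        """'Encoder': forward pass through the layer."""
        return x @ self.W

    def backward(self, z):
        """'Decoder': backward pass through the same layer."""
        return z @ self.W.T

model = FoldedAutoencoder(n_features=8, latent_dim=3)
x = np.random.default_rng(1).normal(size=(5, 8))
z = model.forward(x)            # latent representation
x_rec = model.backward(z)       # reconstruction X'
```

Because of the bottleneck, `x_rec` is only a projection of `x`, but encoding the reconstruction again returns the same latent representation, which is the sense in which the two passes mirror each other here.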

In fact, the brain is even cooler than this. If we extend across modalities to consider sensory input and motor output, the brain appears to replicate the folded autoencoder shown above, resulting in something that resembles again our original standard autoencoder:

The brain as a form of autoencoder.

Here, we have a lower input “encoder” that represents the encoding of sensory input \mathbf{X} into a latent “associative” representation. A forward pass provides the encoding, while a backward pass seeks to predict the sensory input \mathbf{X} by generating a reconstruction \mathbf{X'}. An error between the sensory input \mathbf{X} and the reconstruction \mathbf{X'} is used to “train” the “encoder”.

We also have an upper output “decoder” that begins with the latent “associative” representation and generates a motor output \mathbf{Y}. A forward pass decodes the latent “associative” representation to generate muscle commands.

The “backward” pass of the upper layer is more uncertain. I need to research the muscle movement side – there is quite a lot based on the pathology of Parkinson’s disease and the action of the basal ganglia. (See the link for some great video lectures from Khan Academy.)

In one model, somatosensory input may be received as input \mathbf{Y'}, which represents the muscle movements as actuated based on the muscle commands. The backward pass thus seeks to reconstruct the latent “associative” representation from the input \mathbf{Y'}. The “error” may be computed at one or more of the input level (e.g. looking at the difference between \mathbf{Y} and \mathbf{Y'}) and the latent “associative” representation level (e.g. between the “sensory” input encoding and the “motor” input encoding). In any case there seem to be enough error signals to drive optimisation and learning.

Of course, this model leaves out one vital component: time. Our sensory input and muscle output are constantly changing. This means that the model should be indexing all the variables by time (e.g. \mathbf{X(t)}, \mathbf{X'(t)}, \mathbf{Y(t)}, \mathbf{Y'(t)}, and the latent “associative” representations). Also, the forward and backward passes will occur at different times (the forward pass being available earlier than the backward pass). This means that our errors are also errors in time, e.g. between \mathbf{X(t=1)} and \mathbf{X'(t=2)} for a common discrete time basis.

Many of the machine learning papers I read nowadays feature a bewildering array of architectures and names (ResNet, UNet, Glow, BERT, Transformer XL etc. etc.). However, most appear to be using similar underlying principles of architecture design. The way I deal with the confusion and noise is to try to look for these underlying principles and to link these to neurological and cognitive principles (if only to reduce the number of things I need to remember and make things simple enough for my addled brain!). Autoencoder origami is one way of making sense of things.

An Introduction to the Predictive Brain

Through a variety of sources, including Sam Harris’ discussion with Anil Seth and Lisa Feldman Barrett’s How Emotions Are Made, I’ve been hearing a lot recently about the “Predictive Brain”. This is a theory of cognition that has rapidly gained ground over the last couple of decades.

Talk of a “predictive brain”, in my reading, can be broken down into theories in several key areas:

  • the “Bayesian” brain, or the application of work on Bayesian probabilities to cognition;
  • predictive coding, a specific framework for modelling information flow between cortical areas, e.g. developed from work on the visual system; and
  • feedback circuitry within the brain, or the ongoing discovery of general patterns of feedback within the cortex and mid-brain structures.

The Bayesian Brain

Let’s start with the Bayesian brain.

Many of us will know Bayes Theorem:

P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)}


The Bayesian brain hypothesis is that the brain is performing some form of computation that may be modelled using Bayesian probability frameworks.

Within the context of the brain, we can treat ‘X’ as our sensory input (e.g. signals from the retina or cochlea). This is typically in the form of the firing output of groups of neurons.

The ‘Y’ varies depending on how Bayes Theorem is being applied. In many cases, it appears to be applied in a relatively general manner. For example, in this Tutorial Introduction to Bayesian Models of Cognitive Development by Amy Perfors et al, ‘Y’ is taken to refer to a “hypothesis” (h_i). Bayes Theorem thus provides a way to compare the probabilities of different hypotheses given (the ‘|’ symbol) our sensory input (‘X’). If we calculate P(h_1|X), P(h_2|X), \dots P(h_n|X), we can choose the hypothesis with the highest probability. This then becomes the “explanation” for our data. The idea is that the brain is (somehow) performing an equivalent comparison.
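
The comparison of hypotheses can be sketched in a few lines of Python. The hypotheses and all of the numbers below are invented purely for illustration, and `posterior` is a hypothetical helper.

```python
def posterior(priors, likelihoods):
    """Return P(h|X) for each hypothesis h.

    priors maps h -> P(h); likelihoods maps h -> P(X|h).
    """
    unnormalised = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(unnormalised.values())        # P(X)
    return {h: value / evidence for h, value in unnormalised.items()}

# Made-up numbers: how likely is each hypothesis before seeing anything,
# and how likely is the sensory input X under each hypothesis?
priors = {"tiger": 0.01, "cat": 0.29, "bush": 0.70}
likelihoods = {"tiger": 0.9, "cat": 0.6, "bush": 0.05}

post = posterior(priors, likelihoods)
best = max(post, key=post.get)   # hypothesis with the highest posterior
```

With these numbers the input fits “tiger” best, but the low prior for tigers means “cat” ends up as the preferred explanation.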

As you can imagine, this first approach is fairly high level. It treats “hypotheses” as synonymous with human “reasons” for the data. In reality, the brain may be performing hundreds of thousands of low-level inferences that are difficult to put into words. In these cases, our “hypotheses” may relate to feature components such as possible orientations of observed lines or a pronounced phoneme.

However, I have also seen Bayes Theorem used to model activity in lower level neural circuits (normally in the cortex and in the visual areas). In these cases:

  • P(Y|X) can be seen as the probability of some form of neural process, i.e. the output of a neural circuit, given a particular sensory input, e.g. a context at a particular time. This is our “prediction” in a model of the “predictive brain”. In probability terms, it is known as the “posterior”.
  • P(X) is the probability of our sensory input per se. This can be thought of as a measure of how likely the sensory input is, outside of any particular context, e.g. how often has the neural circuit experienced this particular pattern of input firing. In probability terms, it is known as the “evidence” or “marginal likelihood”. It acts as a normalising factor in Bayes Theorem, i.e. acts so that P(Y|X) is a true probability with a value between 0 and 1.
  • P(Y) is the probability of the output of the neural circuit. Many neural circuits implement some form of function on their inputs, such as acting as integrators. ‘Y’ may be considered a pattern of firing that arises from the neural circuit, and so P(Y) indicates a measure of how often this output pattern of firing is experienced. In probability terms, this is known as the “prior”. It encapsulates what is known about the output before experiencing the sensory input at a particular time.
  • P(X|Y) is part of the “magic sauce” of Bayes Theorem. It is a probability of the sensory input given or assuming a particular neural circuit output. For example, for all the instances where you see a given pattern of output firing for a particular neural circuit, how common is the sensory input that is experienced? It is known as the “likelihood”.
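
The four terms above can be estimated from simple counts. The sketch below uses an invented log of ten paired observations of an input pattern and a circuit output; the labels (“edge”, “blank”, “fire”, “silent”) and the counts are assumptions made up for the example.

```python
from collections import Counter

# Invented record of (sensory input X, circuit output Y) pairs
observations = (
    [("edge", "fire")] * 4
    + [("edge", "silent")] * 1
    + [("blank", "fire")] * 1
    + [("blank", "silent")] * 4
)
n = len(observations)
joint = Counter(observations)
x_counts = Counter(x for x, _ in observations)
y_counts = Counter(y for _, y in observations)

evidence = x_counts["edge"] / n                               # P(X=edge)
prior = y_counts["fire"] / n                                  # P(Y=fire)
likelihood = joint[("edge", "fire")] / y_counts["fire"]       # P(X=edge|Y=fire)
posterior = likelihood * prior / evidence                     # P(Y=fire|X=edge)

# The same value read directly from the counts, as a sanity check
direct = joint[("edge", "fire")] / x_counts["edge"]
```

Both routes give the same answer, which is all Bayes Theorem promises; its value lies in cases where the likelihood and prior are easier to obtain than the posterior.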

Some neuroscientists are thus looking at ways Bayesian models of probabilistic computation are implemented by neural circuits. Questions arise such as:

  • How do the terms of the Bayesian model relate to structures in the brain, such as cortical columns, neurons, cortical layers and mid-brain structures?
  • How is the Bayesian model applied by the brain? The evidence appears to be steering towards the presence of a hierarchical model of inference, e.g. there are large numbers of neural circuits performing computations in parallel that may each be approximated using Bayesian models.
  • How do we relate the way data is encoded and communicated in the brain to numeric values? Neurons have axons, dendrites and synapses and come in a variety of flavours. Neurons fire, and they fire at different rates depending on the sensory input and the results of computations. Synapses are modulated chemically, and different types of synapse may be present for a single neuron, where each type of synapse may have different chemical and temporal properties.
  • How do we relate our high-level, top-down probabilistic models of computation (e.g. which “hypothesis” is more likely) to low-level, bottom-up probabilistic models of neural circuits?

Bayesian models are useful as they provide a framework to make predictions in a mathematical manner. They are also useful as they decompose the prediction into a number of probabilistic components, where the components may be easier to measure and/or compute.

Predictive Coding

Predictive coding is a theory that may model the activity of lower level neural circuitry. It was originally presented in the context of visual processing performed by the cortex (* short prayer interlude for those brave monkeys, cats and mice *).

Predictive coding models cortical sensory regions as containing functionally distinct sub-populations of neurons:

  • a population that attempts to predict an input based on a current hypothesis; and
  • a population that determines an error between the actual inputs and the predicted inputs.

At a high level, predictive coding is based around the idea that your brain issues a storm of predictions, simulates the consequences as if they were present, and checks and corrects those predictions against actual sensory input. The book How Emotions Are Made brilliantly explains some of the high level thinking.

Again, we can ask the question: what is a “hypothesis” for a cortical sensory region?

I have seen “hypotheses” explained at both high and low levels. As before, at a high level, a “hypothesis” may be something like “Tiger?”. At a lower level, a “hypothesis” may be something like “face?”. And at a very low level, a “hypothesis” may be something like “line at 45 degrees in a small part of my upper right visual field?” or “tone change from 5kHz to 3kHz?”.

Predictive coding is a theory that is rooted in the cortex of the brain, the wrinkly table-cloth-sized sheet of pink-grey matter that most people visualise when they think of the brain.

It has been known for a while that sensory processing in the cortex of the brain is configured hierarchically. For example, there are areas of the visual cortex that receive input from the retina (via the thalamus – important to note for later), perform cortical “computation” and pass “data” via patterns of firing to different areas of the visual cortex (many of them neighbouring areas). In the Figure below, visual input arrives at V1 and then is processed from left to right.

From the excellent paper by Barbara Finlay and Ryutaro Uchiyama – Developmental mechanisms channeling cortical evolution

The processing of the cortex is a processing hierarchy as “higher” cortical regions have a larger receptive field (e.g. working up to the full visual or auditory field) and receive complex inputs, e.g. inputs from many abstract “lower” cortical regions. These “lower” cortical regions have smaller, more specific receptive fields. At the bottom of the hierarchy you have either sensory input or motor output.

In fact, the cortex has multiple hierarchies: an input hierarchy for sensory modalities and an output hierarchy for motor outputs, where the connecting “pinch point” of cortex is fairly wide and deep and contains the abstractions of the associative areas.

It has also been known for a while that the cortex has a layered structure. This was discovered through early staining experiments that showed different bands of neuronal density.

By Henry Vandyke Carter, from Henry Gray (1918) Anatomy of the Human Body

Most sensory areas of the cortex have around six layers. Each layer has been found to have a different function.

The functional neuron populations required for predictive coding may be found in this layered structure of the cortex. Layers 2 and 3 provide an output from a cortical column; this may be seen as a feed-forward output that is received by layer 4 of a subsequent (higher level) cortical column. In predictive coding models this is seen as a “prediction error”. Layers 5 and 6 also provide an output from a cortical column; this may be seen as a feed-back output that is received by layer 1 of (many) previous (lower) cortical columns. In predictive coding models this is seen as a “prediction”. In this manner, a processing hierarchy is generated.

More detail of the configuration of these neural circuits may be found in the also excellent paper by AM Bastos et al – Canonical Microcircuits for Predictive Coding. Another great paper for explaining predictive coding in the context of the visual cortex is Rajesh PN Rao’s (semi-famous) paper – Predictive Coding in the Visual Cortex. Rao’s paper sets out a rather nice model of predictive coding applied to images that I will have to try to implement.

Predictive coding theories have some nice properties. If a neural circuit can successfully predict the input it should be expecting, it does not output a prediction error. This is efficient – populations of neurons only expend energy when a sensory input cannot be predicted. This also provides a model for how cascades of activity can pass through the hierarchical areas of the brain – prediction errors are passed “upwards” until they meet a neural circuit that can successfully predict its input, and this then leads to a feedback cascade with the successful prediction being passed down the hierarchy.

Let’s try to explain this in words with an image processing example.

Let’s have two layers of cortex that receive an input image. Before receipt of the image our prediction from our top layer (2) is a null or resting prediction (say 0). This is passed to the first layer (1). The first layer (1) receives the image (I) and applies a function to it to generate what Rao calls a set of “causes” for the input. As we have a non-zero input, these causes will be non-zero. A “prediction” error is calculated between the causes as generated by the first layer (1) and the prediction (which may be said to be a prediction of the causes). This error is then passed to the top layer (2). This error will be non-zero as our initial prediction is zero and our causes in the first layer (1) are non-zero: what we expect from the top down at the start is not what we see from the bottom up. The top layer (2) receives the error and applies a function to determine a set of second order causes (or layer 2 causes). These second order causes are then sent to the first layer (1) as the prediction from the top layer (2). The first layer (1) thus receives a modified prediction from the top layer (2) and a new prediction error is generated. The process can repeat over time until the system stabilises.
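
The loop described above can be sketched numerically. This is a heavily simplified toy and not Rao’s model: the weights are random and fixed rather than learnt, both layers are linear, and the step size is picked so that the error cannot grow. The names `run_predictive_coding`, `W1` and `U2` are hypothetical.

```python
import numpy as np

def run_predictive_coding(image, W1, U2, steps=50, lr=None):
    """Iterate a two-layer prediction / error loop on one flattened image."""
    if lr is None:
        # Step size based on the largest singular value of U2 (an assumption
        # of this sketch) so that each update cannot increase the error
        lr = 1.0 / np.linalg.norm(U2, 2) ** 2
    causes_1 = W1 @ image                  # layer 1 "causes" for the input
    causes_2 = np.zeros(U2.shape[1])       # layer 2 starts at rest (all zeros)
    error_norms = []
    for _ in range(steps):
        prediction = U2 @ causes_2         # top-down prediction of the causes
        error = causes_1 - prediction      # prediction error passed upwards
        error_norms.append(float(np.linalg.norm(error)))
        # Layer 2 adjusts its causes to better explain the layer 1 causes
        causes_2 = causes_2 + lr * (U2.T @ error)
    return causes_2, error_norms

rng = np.random.default_rng(0)
image = rng.random(16)                     # flattened 4x4 toy "image"
W1 = rng.normal(size=(6, 16))              # layer 1 weights (fixed, random)
U2 = rng.normal(size=(6, 3))               # layer 2 generative weights
causes_2, error_norms = run_predictive_coding(image, W1, U2)
```

The first entry of `error_norms` is large because the initial prediction is zero; later entries shrink as the top layer settles on causes that explain as much of the first layer’s activity as it can.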

There are some gaps in my understanding of predictive coding. I need to play around with some actual models to see how information flows up and down the processing hierarchy. A good place to start is the Predictive Coding Networks described here. This video lecture by David Cox is also great.

Feedback in the Brain

The theories of predictive coding are built upon the levels of feedback that are observed in the visual cortex. From several decades of research we now know that there are multiple levels of feedback occurring in the brain. This feedback provides the basis for many theories of “prediction”, as it represents pathways for information to flow in a top-to-bottom manner (in addition to the conventional feed-forward, bottom-to-top manner). Here “top” can be seen as more complex integrated representations and “bottom” can be seen as closer to raw sensory input.

At a first level, we have feedback within cortical layers. This is explained nicely in Bastos’ Canonical Microcircuits for Predictive Coding. Within cortical layers there appear to be recurrent connections (for example within layers 2 and 3 and layers 5 and 6).

At a next level, we have feedback within cortical columns. For example, neurons in layers 2 and 3 excite neurons in layers 5 and 6 and neurons in many of the layers both excite and inhibit neurons in upper layers.

We then have feedback between cortical areas. This resembles the feedback modelled with predictive coding.

As well as feedback within the cortex, there are also loops that extend between sub-cortical structures such as the thalamus and the basal ganglia. Sensory input arrives at the thalamus from sense organs and is projected to the cortex. However, there appear to be 10-100 times more connections from the cortex to the thalamus than from the thalamus to the cortex. This suggests that the thalamus applies some kind of sensory gating and/or attention based on cortical feedback. The basal ganglia appears to be an adapted early muscle control centre that is important for action selection and error identification. The basal ganglia receives cortical inputs and also projects to the thalamus.

These various levels of feedback may embody error signals that indicate differences between what is being perceived and what is being predicted. For example, connections from the cortex to the thalamus may gate sensory input in the case where we have successful prediction.

Summing Up

In this post we have looked at some of the ways the brain may be said to “predict” the outside world.

The brain may be modelled using Bayesian approaches. Predictive coding provides a way to understand perception. And the brain has many feedback and recurrent couplings that appear to pass information from higher processing areas to lower processing areas.

The challenge now is to use this knowledge to start building intelligent systems.

Natural Language Processing & The Brain: What Can They Teach Each Other?

Part I of II – What Recent Advances in Natural Language Processing Can Teach Us About the Brain

I recently worked out that the day of my birth will be closer to the end of the Second World War than the present day. This means I am living in the future, hooray!

Over the years I’ve been tracking (and experimenting with) various concepts in natural language processing, as well as reading general texts on the brain. To me both streams of research have been running in parallel; in the last 5 years, natural language processing has found a new lease of engineering life via deep learning architectures, and the empirical sciences have been slowly chipping away at cognitive function. Both areas appear to be groping different parts of the same elephant. This piece provides an outlet for the cross-talk in my own head. With advances in natural language processing coming thick and fast, it also provides an opportunity for me to reflect on the important undercurrents, and to try to feel for more general principles.

The post will be in two halves. This half looks at what recent advances in natural language processing and deep learning could teach us about intelligence and the functioning of the human brain. The next half will look at what the brain could teach natural language processing.


I’ll say here that the heavy lifting has been performed by better and brighter folk. I do not claim credit for any of the outlines or summaries provided here; my effort is to try to write things down in a way that makes sense to my addled brain, in the hope that things may also make sense to others. I also do not come from a research background, and so may take a few liberties for a general audience.

In natural language processing, these are the areas that have stayed with me:

  • Grammars,
  • Language Models,
  • Distributed Representations,
  • Neural Networks,
  • Attention,
  • Ontologies, and
  • Language is Hard.

In the next section, we’ll run through these (at a negligent speed), looking in particular at what they teach their respective sister fields. If you want to dig deeper, I recommend as a first step the Wikipedia entry on the respective concept, or any of the links set out in this piece.

Let’s get started. Hold on.


Grammars

Mention the term grammar to most people and they’ll wince, remembering the pain inflicted in English or foreign language lessons. A grammar relates to the rules of language. While we don’t always know what the rules are, we can normally tell when they are being broken.

I would say that a majority of people view grammar like god (indeed Baptists can very nearly equate the two). There is one true Grammar, it is eternal and unchanging, and woe betide you if you break the rules. Peek behind the curtain though and you realise that linguists have proposed over a dozen different models for language, and all of them fail in some way. 

So what does this mean? Are we stuck in a post-modern relativist malaise? No! Luckily, there are some general points we can make.

Most grammars indicate that language is not a string of pearls (as the surface form of words seems to suggest) but has some underlying or latent structure. Many grammars indicate recursion and fractal patterns of self-similarity, nested over hierarchical structures. You can see this here:

  • The ball.
  • The ball was thrown.
  • The red ball was thrown over the wall.
  • In the depths of the game, the red ball was thrown over the wall, becoming a metaphor for the collapse of social morality following the fall of Communism.
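
The nesting in the first three examples can be made explicit by writing each sentence as a tree. The sketch below uses nested tuples with toy phrase labels (NP, VP, PP, S); the labels and bracketing are illustrative assumptions and do not follow any particular grammar formalism.

```python
def depth(node):
    """Depth of nesting: a bare word has depth 0."""
    if isinstance(node, str):
        return 0
    _label, *children = node
    return 1 + max(depth(child) for child in children)

def leaves(node):
    """Flatten a tree back into its surface 'string of pearls'."""
    if isinstance(node, str):
        return [node]
    _label, *children = node
    return [word for child in children for word in leaves(child)]

ball = ("NP", "the", "ball")
thrown = ("S", ("NP", "the", "ball"), ("VP", "was", "thrown"))
over_wall = (
    "S",
    ("NP", "the", "red", "ball"),
    ("VP", "was", "thrown", ("PP", "over", ("NP", "the", "wall"))),
)

sentence = " ".join(leaves(over_wall))
depths = [depth(ball), depth(thrown), depth(over_wall)]
```

The same two recursive functions handle every sentence, however deeply phrases are nested inside other phrases, which is the self-similarity that the surface string hides.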

Also, the absence of “one grammar to rule them all” teaches us that our rulesets are messy, incomplete and inconsistent. There is chaos with constraint. This hints that maybe language cannot be definitively defined using language. This hints further at Gödel and Church. This doesn’t necessarily rule out building machines that parse and generate language, but it does indicate that these machines may not be able to follow conventional rule-based deterministic processing.

With the resurgence of neural approaches, grammars have gone out of fashion, representing the “conventional rule-based deterministic processing” that “does not work”. But we should not ignore their lessons. Many modern architectures do not seem to accurately capture the recursion and self-similarity, and it appears difficult to train different layers to capture the natural hierarchy. For example, a preferred neural network approach, which we will discuss in more detail below, is the recurrent neural network. But this performs gated repeated multiplication over the sequence of words, which means that each sentence above is treated quite differently, even though they share a common nested structure. This seems to miss the point above. While attention has helped, this seems to be a band-aid as opposed to a solution.

Language Models

A language model is a probabilistic model that seeks to predict a next word given one or more previous words. The “probabilistic” aspect basically means that we are given a list of probabilities associated with a list of candidate words. A word with a probability of 1 would indicate that a word was definitely next (or, if you are a Bayesian, that you are sure this is the next word). A word with a probability of 0 would indicate that the word was definitely not next. Probabilities over all the possible outcomes must add up to one, so the values across our candidate words need to do this.

In the early 2000s, big strides were made using so-called ‘n-gram’ approaches. Translated, ‘n-gram’ approaches count different sequences of words. The “n” refers to a number of words in the sequence. If n=3, we count different sequences of three words and use their frequency of occurrence to generate a probability. Here are some examples:

  • the cat sat (fairly high count)
  • he said that (fairly high count)
  • sat cat the (count near 0)
  • said garbage tree (count near 0)
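
Counting of this kind takes only a few lines of Python. The sketch below counts three-word sequences in a tiny made-up text and turns the counts into next-word probabilities; `ngram_counts` and `next_word_probs` are hypothetical helper names, and a real model would need far more data plus some handling of unseen sequences.

```python
from collections import Counter

def ngram_counts(tokens, n=3):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def next_word_probs(counts, context):
    """Estimate P(next word | context) from n-gram counts.

    context is a tuple of the n-1 preceding words.
    """
    matches = {gram[-1]: c for gram, c in counts.items()
               if gram[:-1] == tuple(context)}
    total = sum(matches.values())
    return {word: c / total for word, c in matches.items()} if total else {}

text = "the cat sat on the mat and the cat sat on the sofa"
tokens = text.split()
counts = ngram_counts(tokens, n=3)

seen = counts[("the", "cat", "sat")]        # occurs twice in the toy text
unseen = counts[("sat", "cat", "the")]      # never occurs, so the count is 0
probs = next_word_probs(counts, ("on", "the"))
```

Passing a list of characters instead of words (e.g. `list(text)`) gives character-level counts with the same functions.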

If we have enough digital data, say by scanning all the books, then we can get count data indicating the probabilities of millions of sequences. This can help with things such as spell checking, encoding, optical character recognition and speech recognition.

We can also scale up and down our ‘n-gram’ models to do things like count sequences of characters or sequences of phonemes instead of words.

Language models were useful as they introduced statistical techniques that laid the groundwork for later neural network approaches. They offered a different perspective from rule-based grammars, and were well suited to real-world data that was “messy, incomplete and inconsistent”. They showed that just because a sentence fails the rules of a particular grammar, it does not mean it will not occur in practice. They were good for classification and search: it turns out that there were regular patterns behind language that could enable us to apply topic labels, group documents or search.

Modern language models tend to be built not from n-grams but using recurrent neural networks, such as uni- or bi-directional Long Short-Term Memory networks (LSTMs). In theory, the approaches are not dissimilar: the LSTMs are in effect counting word sequences and storing weights that reflect regular patterns within text. They are just adding a sprinkling of non-linearity.

Like all models, language models have their downsides. A big one is that people consistently fail to understand what they can and cannot do. They show the general patterns of use, and show where the soft boundaries in language use lie. It turns out that we are more predictable than we think. However, they are not able, on their own, to help us with language generation. If you want to say something new, then this is by its nature going to be of low probability. They do not provide help for semantics, the layer of meaning below the surface form of language. This is why LSTMs can produce text that at first glance seems sensible, with punctuation, grammatically correct endings and what seem like correct spellings. But look closely and you will see that the text is meaningless gibberish.

Quite commonly the answer to the current failings of recurrent neural networks has been to add more layers. This does seem to help a little, as seen with deeper models such as BERT (which, strictly speaking, stacks attention layers rather than recurrent ones). But just adding more layers doesn’t seem to provide a magic bullet to the problems of meaning or text generation. Outside of the artificial training sets these models still fail in an embarrassing manner.

It is instructive to compare the failures of grammars and language models, as they both fail in different ways. Grammars show that our thoughts and speech have non-linear patterns of structure, that there is something behind language. Language models show that our thoughts and speech do not follow well-defined rules, but do show statistical regularity, normally to an extent that surprises us “free” human agents.

Distributed Representations

Distributed representations are what Geoff Hinton has been banging on about for years and for me are one of the most important principles to emerge from recent advances in machine learning. I’m lying a little when I link them to natural language processing as they originally came to prominence in vision research. Indeed, much of the work on initial neural networks for image recognition was inspired by the earlier neuroscience of Nobel Prize Winners Hubel and Wiesel.

Distributed representations mean that our representations of “things” or “concepts” are shared among multiple components or sub-components, where each component or sub-component forms part of numerous different “things” or “concepts”.

Put another way, it’s a form of reductionist recycling. Imagine you had a box of Lego bricks. You can build different models from the bricks, where the model is something more than the underlying bricks (a car is more than the 2×8 plank, the wheels, those little wheel arch things etc.). So far, so reductionist. The Greeks knew this several millennia ago. However, now imagine that each Lego brick of a particular type (e.g. red 2×1 block, the 2×8 plank, each white “oner”) is the same brick. So all your models that use red 2×1 blocks use the same red 2×1 block. This tends to turn your head inside out. Of course, in reality you can’t be in two places at the same time, but you can imagine your brain assembling different Lego models really quickly in sequence as we think about “things” (or even not “things”, like abstractions or actions or feelings).

This is most easily understood when thinking about images. This image from Wei et al at MIT might help:

Kindly reproduced from Wei et al here.

In this convolutional neural network, the weights of each layer are trained such that different segments of each layer end up representing different aspects of a complex object. These segments form the “Lego bricks” that are combined to represent the complex object. In effect, the segments reflect different regular patterns in the external environment, and different objects are represented via different combinations of low-level features. As we move up the layers our representations become more independent of the actual sensory input, e.g. they are activated even if lighting conditions change, or if the object moves in our visual field.

Knowing this, several things come to mind with regard to language:

  • It is likely that the representations that relate to words are going to follow a similar pattern to visual objects. Indeed, many speech recognition pipelines use convolutional neural networks to decipher audio signals and convert this to text. This form of representation also fits with the findings from studying grammars: we reuse semantic and syntactic structures and the things we describe can be somewhat invariant of the way we describe them.
  • Our components are going to be hard to imagine. Language seems to come to us fully-formed as discrete units. Even Plato got confused and thought there was some magical free-floating “tree” that existed in an independent reality. We are going to have to become comfortable describing the nuts and bolts of sentences, paragraphs and documents using words to describe things that may not be words.
  • For images, convolutional neural networks are very good at building these distributed representations across the weights of the layer. This is because the convolution and aggregation is good at abstracting over two-dimensional space. But words, sentences, paragraphs and documents are going to need a different architecture; they do not exist in two-dimensional space. Even convolutional neural networks struggle when we move beyond two dimensions into real-world space and time.

Neural Networks

Neural networks are back in fashion! Neural networks have been around since the 1950s but it is only recently we have got them to work in a useful manner. This is due to a number of factors:

  • We now have hugely fast computers and vastly greater memory sizes.
  • We worked out how to practically perform automatic differentiation and build compute graphs.
  • We began to have access to huge datasets to use for training data.

The best way to think of neural networks is as differentiable function approximators. Given a set of (data-X, label-Y) pairs, neural networks perform a form of non-linear line fitting that maps X → Y.
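
As a rough illustration of "non-linear line fitting", the sketch below trains a tiny one-hidden-layer network with plain gradient descent in NumPy. Everything specific here is my own assumption for the example: the sine-curve target, the 16 tanh hidden units, the learning rate and the number of steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy (X, Y) pairs: points on a sine curve (an arbitrary non-linear target).
X = np.linspace(-3.0, 3.0, 64).reshape(-1, 1)
Y = np.sin(X)

# One hidden layer of 16 tanh units; the sizes are arbitrary choices.
params = {
    "W1": rng.normal(0.0, 0.5, size=(1, 16)),
    "b1": np.zeros(16),
    "W2": rng.normal(0.0, 0.5, size=(16, 1)),
    "b2": np.zeros(1),
}


def forward(p, X):
    """Return the hidden activations and the network's prediction for X."""
    H = np.tanh(X @ p["W1"] + p["b1"])
    return H, H @ p["W2"] + p["b2"]


def mse(pred, target):
    """Mean squared error between prediction and target."""
    return float(np.mean((pred - target) ** 2))


initial_loss = mse(forward(params, X)[1], Y)

learning_rate = 0.01
for _ in range(5000):
    H, pred = forward(params, X)
    # Differentiate the error with respect to each component (backpropagation).
    d_pred = 2.0 * (pred - Y) / len(X)
    d_H = (d_pred @ params["W2"].T) * (1.0 - H ** 2)
    grads = {
        "W2": H.T @ d_pred,
        "b2": d_pred.sum(axis=0),
        "W1": X.T @ d_H,
        "b1": d_H.sum(axis=0),
    }
    # Nudge every parameter a small step against its gradient.
    for name in params:
        params[name] -= learning_rate * grads[name]

final_loss = mse(forward(params, X)[1], Y)
print(initial_loss, final_loss)
```

The only "learning" is the repeated feeding back of small errors; the mapping X → Y is never written down explicitly.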

Within natural language processing, neural networks have outperformed comparable approaches in many areas, including:

  • speech processing (text-to-speech and speech-to-text);
  • machine translation;
  • language modelling;
  • question-answering (e.g. simple multiple choice logic problems);
  • summarisation; and
  • image captioning.

In the field of image processing, as set out above, convolutional neural networks rule. In natural language processing, the weapon of choice is the recurrent neural network, especially the LSTM or Gated Recurrent Unit (GRU). Often recurrent neural networks are applied as part of a sequence-to-sequence model. In this model an “encoder” receives a sequence of tokens and generates a fixed-size numeric vector. This vector is then supplied to a “decoder”, which outputs another sequence of tokens. Both the encoder and the decoder are implemented using recurrent neural networks. This is one way machine translation may be performed.

Neural networks do not work like the brain. But they show that a crude model can approximate some aspects of cortical function. They show that it is possible to build models of the world by feeding back small errors between our expectations and reality. No magic is needed.

The limitations of neural networks also show us that we are missing fairly large chunks of the brain puzzle – intelligence is not just sections of cortex. Most of the progress in the field of machine learning has resulted from greater architectural complexity, rather than any changes to the way neural networks are trained, or defined. At the moment things resemble the wild west, with architectures growing based on hunches and hacking. This kind of shifts the problem: the architectures still need to be explicitly defined by human beings. We could do with some theoretical scaffolding for architectures, and a component-based system of parts.

Most state of the art neural network models include some form of attention mechanism. In an over-simplified way, attention involves weighting components of an input sequence for every element in an output sequence.

In a traditional sequence-to-sequence system, such as those used for machine translation, you have an encoder, which encodes tokens in an input sentence (say words in English), and a decoder, which takes an encoded vector from the encoder and generates an output sentence (say words in Chinese). Attention models sought to weight different encoder hidden states, e.g. after each word in the sentence, when producing each decoder state (e.g. each output word).

A nice way to think of attention is in the form of values, keys and queries (as explained here).

In a value/key/query attention model:

  • the query is the decoder state at the time just before generating the next word (e.g. a given embedding vector for a word);
  • the keys are items that together with the query are used to determine the attention weights (e.g. these can be the encoder hidden states); and
  • the values are the items that may be weighted using attention (e.g. these can also be the encoder hidden states).

In the paper “Attention is All You Need”, the authors performed a nifty trick by leading with the attention mechanism and ditching some of the sequence-to-sequence model. They used an attention function that used a scaled dot-product to compute the weighted output. If you want to play around with attention in Keras this repository from Philippe Rémy is a good place to start.
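
For reference, scaled dot-product attention can be sketched in a few lines of NumPy. The function below follows the formula from the paper, softmax(QKᵀ/√d_k)V; the matrix sizes and random inputs are placeholders I have chosen for illustration, and this is a single-head sketch rather than the full Transformer.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Weight the values V using the similarity between queries Q and keys K."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # one score per (query, key) pair
    weights = softmax(scores, axis=-1)  # each row of weights sums to one
    return weights @ V, weights


rng = np.random.default_rng(0)
K = rng.normal(size=(5, 8))  # e.g. five encoder hidden states, 8-D each
V = K                        # keys and values are the same in this example
Q = rng.normal(size=(2, 8))  # e.g. two decoder states acting as queries

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Each row of `weights` says how much each encoder state contributes to the output for one query.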

Adam Kosiorek provides a good introduction to attention in this blogpost. In his equation (1), the feature vector z is equivalent to the values and the keys are the input vector x. A value/key/query attention model expands the parameterised attention network f(…) to be a function of two inputs: the hidden state of the decoder (the query Q) and the hidden states of the encoder (the keys K) – a = f(Q, K). The query here changes how the attention weights are computed based on the output we wish to produce.

Now I have to say that many of the blogposts I have read that try to explain attention fail. What makes attention difficult to explain?

Attention is weighting of an input. This is easy to understand in a simple case: a single, low-dimensionality feature vector x, where the attention weights are calculated using x and are in turn applied to x. Here we have:

a = f(x)

g = a ⊙ x.

g is the result of applying attention, in this simple example a weighted set of inputs. The element-wise multiplication ⊙ is simple to understand – the kth element of g is computed as g_k = a_k*x_k (i.e. an element of a is used to weight a corresponding element of x). So attention, put like this, is just the usual weighting of an input, where the weights are generated as a function of the input.
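
The simple case is small enough to write out directly. In this sketch the attention function f is my own stand-in (a softmax over a linear map of x, with hypothetical parameters W); the text only requires that the weights a are some function of x.

```python
import numpy as np


def f(x, W):
    """Hypothetical attention function: softmax over a linear map of the input."""
    scores = W @ x
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()


x = np.array([1.0, 2.0, 3.0])  # a single, low-dimensionality feature vector
W = np.eye(3)                  # placeholder parameters; normally learnt in training

a = f(x, W)  # attention weights, computed from x
g = a * x    # element-wise product: g_k = a_k * x_k

print(a)
print(g)
```

With the identity matrix as W, larger elements of x receive larger weights, so the biggest input is emphasised the most.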

Now attention becomes more difficult to understand as we move away from this simple case.

  1. Our keys and values may be different (i.e. x and z may be different sizes with different values). In this case, I find it best to consider the values (i.e. z) as the input we are weighting with the attention weights (i.e. a). However, in this case, we have another input – the keys (or x) – that are used to generate our attention weights, i.e. a=f(x). In sequence-to-sequence examples the keys and values are often the same, but they are sometimes different. This means that some explanations conflate the two, whereas others separate them out, leading to bifurcated explanations.
  2. In a sequence-to-sequence model our keys and/or values are often the hidden states of the encoder. Each hidden state of the encoder may be considered an element of the keys and/or values (e.g. an element of x and/or z). However, encoders are often recurrent neural networks with hidden dimensions of between 200 and 300 values (200-300D), as opposed to single dimension elements (e.g. 1D elements – [x_1, x_2, x_3, …]). Our keys and values are thus matrices rather than arrays. Each element of the keys and/or values thus becomes a vector in itself (e.g. x_1 = [x_11, x_12, x_13, …]). This opens up another degree of freedom when applying the attention weights. Now many systems treat each hidden state as a single unit and multiply all the elements of the hidden state by the attention weight for the hidden state, i.e. g_k = a_k*[x_k1, x_k2, x_k3, …] = [a_k*x_k1, a_k*x_k2, a_k*x_k3, …]. However, it is also possible to apply attention in a multi-dimensional manner, e.g. where each attention weight is a vector that is multiplied by the hidden state vector. In this case, you can apply attention weights across the different dimensions of the encoder hidden state as well as apply attention to the encoder hidden state per se. This is what I believe creates much of the confusion.
  3. Sequence-to-sequence models often generate the attention weights as a function of both the encoder hidden states and the decoder hidden states. Hence, as in the “Attention is All You Need” paper the attention weights are computed as a function of a set of keys and a query, where the query represents a hidden state of a decoder at a time that a next token is being generated. Hence, a = f(Q, K). Like the encoder hidden state, the decoder hidden state is often of high dimensionality (e.g. 200-300D). The function may thus be a function of a vector Q and a matrix K. “Attention is All You Need” further confuses matters by operating on a matrix Q, representing hidden states for multiple tokens in the decoder. For example, you could generate your attention weights as a function of all previous decoder hidden states.
  4. Many papers and code resources optimise mathematical operations for efficiency. For example, most operations will be represented as matrix multiplications to exploit graphical processing unit (GPU) speed-ups. However, this often acts to lose some logical coherence as it is difficult to unpick the separate computations that are involved.
  5. Attention in a sequence-to-sequence model operates over time via the backdoor. Visual attention models are easier to conceptually understand, as you can think of them as a focus over a particular area of a 2D image. However, sentences represent words that are uttered or read at different times, i.e. they are a sequence where each element represents a successive time. Now, I haven’t seen many visual attention models that operate over time as well as 2D (this is in effect a form of object segmentation in time). The query in the attention model above thus has a temporal aspect, it changes the attention weights based on a particular output element, and the hidden state of the decoder will change as outputs are generated. However, “t” doesn’t explicitly occur anywhere.
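
The difference between the two weighting schemes in point 2 is easier to see in code. This is a sketch with invented numbers: four hidden states of six dimensions stand in for the 200-300D states of a real encoder, the weights are fixed or random rather than learnt, and summing the weighted states into a single context vector is my own (common) choice of how to combine them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four encoder hidden states, each a 6-D vector: the keys/values form a matrix.
hidden = rng.normal(size=(4, 6))

# Scheme 1: one scalar weight per hidden state, applied to the whole vector.
a_scalar = np.array([0.1, 0.2, 0.3, 0.4])
g_scalar = a_scalar[:, None] * hidden  # g_k = a_k * [x_k1, x_k2, ...]
context_scalar = g_scalar.sum(axis=0)

# Scheme 2: one weight per hidden state AND per dimension.
a_vector = rng.random(size=(4, 6))
a_vector /= a_vector.sum(axis=0, keepdims=True)  # normalise over states, per dimension
g_vector = a_vector * hidden                     # element-wise across both axes
context_vector = g_vector.sum(axis=0)

print(context_scalar.shape, context_vector.shape)
```

Both schemes return a vector of the same size, which is partly why explanations that do not say which one they mean are hard to follow.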

Unpacking the confusion also helps us see why attention is powerful and leads to better results:

  • In a sequence-to-sequence model, attention changes with each output token. We are thus using different data to condition our output at each time step.
  • Attention teaches us that thinking of cognition as a one-way system leads to bottlenecks and worse results. Attention was developed to overcome the constraints imposed by trying to compress the meaning of a sentence into a single fixed-length representation.
  • In certain sequence-to-sequence models, attention also represents a form of feedback mechanism – if we generate our attention weights based on past output states we are modifying our input based on our output. We are getting closer to a dynamic system – the system is being applied iteratively with time as an implicit variable.
  • Visualisations of attention from papers such as “Show, Attend and Tell” and machine translation models seem to match our intuitive notions of cognition. When a system is generating the word “dog” the attention weights emphasise the image areas that feature a dog. When a system is translating a compound noun phrase it tends to attend jointly to all the words of the phrase. Attention can thus be seen as a form of filtering: it helps narrow the input to conditionally weigh an output.
  • Attention is fascinating because it suggests that we can learn a mechanism for attention separately from our mapping function. Our attention function f(…) is a parameterised function where we learn the parameters during training. However, these parameters are often separate from the parameters that implement the encoder and decoder.

Attention appears to have functional overlaps with areas of the thalamus and basal ganglia which form a feedback loop between incoming sensory inputs and the cortex. Knowing about how attention works in deep learning architectures may provide insight into mechanisms that could be implemented in the brain.


In the philosophical sense, an ontology is the study of “being”, i.e. what “things” or “entities” there are in the world, how they exist, what they are and how they relate to each other.

In a computer science sense, the term “ontology” has also been used to describe a method of organising data to describe things. I like to think of it representing something like a database schema on steroids. Over the last few decades, one popular form of an “ontology” has been the knowledge graph, a graph of things and relationships represented by triples, two “things” connected by a “relationship”, where the “things” and the “relationship” form part of the ontology.
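
A knowledge graph of triples needs very little machinery. The sketch below stores a handful of made-up facts as (thing, relationship, thing) tuples and matches them against a pattern; the `match` helper and the example vocabulary are hypothetical, not taken from RDF or any particular library.

```python
# Each fact is a (subject, relationship, object) triple; the facts are invented.
triples = {
    ("cat", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
    ("cat", "chases", "mouse"),
    ("mouse", "is_a", "mammal"),
}


def match(triples, subject=None, relationship=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (relationship is None or t[1] == relationship)
        and (obj is None or t[2] == obj)
    )


# Which things are mammals?
mammals = [s for s, _, _ in match(triples, relationship="is_a", obj="mammal")]
print(mammals)

# Everything we know about the cat.
print(match(triples, subject="cat"))
```

The set of allowed "things" and "relationships" is the ontology; the triples are the data that conform to it.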

Ontologies are another area that has faded out of fashion with the resurgence of deep neural networks. In the early 2000s there was a lot of hype surrounding the “semantic web” and other attempts to make data on the Internet more machine interpretable. Projects like DBpedia and standardisation drives around RDF and OWL offered to lead us to a brave new world of intelligent devices. As with many things they didn’t quite get there.

What happened? The common problem of overreach was one. Turns out organising human knowledge is hard. Another problem was one shared with grammars: human beings were trying to develop rule-sets, conventions and standards for something that was huge and statistical in nature. Another was that we ended up with a load of Java and an adapted form of SQL (SPARQL), while the Internet and research, being stubborn, decided to use hacky REST APIs and Python.

However, like grammars, ontologies got some things right, and we could do with saving some of the baby from the bathwater:

  • Thinking about things in terms of graphs and networks seems intuitively right. The fact that ontologies are a useful way to represent data says something in itself about how we think about the world.
  • It turns out that representing graph data as sets of triples works fairly well. This may be useful for further natural language processing engineering. This appears to reflect some of the fractal nature of grammars, and the self-similarity seen in language.
  • Ontologies failed in a similar way to grammars. Neural networks have taught us that hand-crafting features is “not the way to go”. We want to somehow combine the computing and representational aspects of ontologies with learnt representations from the data. We need our ontologies to be messier. No one has quite got there yet. There have been graph convolutional networks, but the maths is harder and so they form a niche area that is relatively unknown.
  • The “thing”, “relationship/property” way of thinking seems to (and was likely chosen to) reflect common noun/verb language patterns, and seems to reflect an underlying way of organising information in our brains, e.g. similar to the “what” and “where” pathways in vision or the “what” and “how” pathways in speech and motor control.

Language is Hard

To end, it is interesting to note that the recent advances in deep learning started with vision, in particular image processing. Many attempted to port across techniques that had been successful in vision to work on text. Most of the time this failed. Linguists laughed.

For example, compare the output of recent Generative Adversarial Networks (GANs) with that of generative text systems. There are now many high-resolution GAN architectures but generative text systems struggle with one or two coherent sentences and collapse completely over a paragraph. This strongly suggests that language is an emergent system that operates on top of vision and other sensory modalities (such as speech recognition and generation). One reason why deep learning architectures struggle with language is that they are seeking to indirectly replicate a very complex stack using only the surface form of the stack output.

Take object persistence as another example. Natural language processing systems currently struggle with entity co-reference that a five-year-old can easily grasp, e.g. knowing that a cat at the start of the story is the same cat at the end of a story. Object persistence in the brain is likely based on at least low-level vision and motor representations. Can we model these independently of the low-level representations?

The current trend in natural language processing is towards bigger and more complex architectures that excel at beating benchmarks but generally fail miserably on real-world data. Are we now over-fitting in architecture space? Maybe one solution is to take a more modular approach where we can slot in different sub-systems that all feed into the priors for word selection.

In part two, we will look at things from the other side of the fence. We review some of the key findings in neuro- and cognitive science, and have a look at what these could teach machine learning research.

Free Will: Do We Have It?

The more we design intelligent systems the more we creep up against the concepts of free will and determinism. These concepts underlie the stories we tell ourselves and underpin our legal systems. But what does free will mean? How does it influence our actions? And can we get rid of it?


The approach of this piece is as follows:

  • First, we will take a look at our starting assumptions when we use the term “free will”.
  • Then we will look in more detail at what the “will” could be.
  • We will then turn to the term “free” and have a look at the “freedom” this could entail.
  • Lastly, we will look at choice and probability, before trying to tie it all up at some kind of conclusion.

Free Will

Ask people whether they believe in free will and most will say yes. Ask them what “free will” is and they will say something like:

the ability to choose, think, and act voluntarily.

To believe in free will is to believe that human beings can be the authors of their own actions and to reject the idea that human actions are determined by external conditions or fate.

Free will is often contrasted with determinism:

the doctrine that all events, including human action, are ultimately determined by causes regarded as external to the will.

Indeed, this definition rather circularly refers back to the “will”, which is the thing we are saying is “free”. As determinism requires a will, maybe we can make progress by breaking down the term into its components: “free” and “will”.

What is this “will”?

Will, generally, is that faculty of the mind which selects, at the moment of decision, the strongest desire from among the various desires present.

So “will” is a “faculty” and it is “free”. It is the “thing” that selects. It is part of our “mind”, but may or may not be part of our “brains”. It is possibly the “I” that makes a decision, or at least a part of that “I”. Looking at the “will” separately from its freedom is useful. What the “will” is, and how it relates to us as human beings appears to be debatable separately from its property of being “free”.

What does it mean to be “free”?

Freedom can be defined positively or negatively: i.e. from Aristotle it can be the power to do something or the power not to do something.

Fascinatingly, the English term “free” allegedly has its root, via old German, in an Indo-European word meaning “to love”. Its modern meaning seems to have followed the logic that what is loved is not in bondage, or that what is loved belongs to those that love.

Much of the debate surrounding “free will” concerns the amount of freedom that is implied by “free”. If we are “free” can we choose to do anything? Or in its limit is to be “free” to have a choice, e.g. be able to choose something or something else?

We will come to these points later.

The Will

Who Is Driving?

Let’s start by looking in more detail at the “will”. There are some presumptions we can make. You may not agree to all of them. But setting them out helps our thinking.

Human beings are just matter. I am a materialist: matter is all we need to explain how human beings operate in the world. This denies the existence of any spiritual component that is independent of the matter of our bodies. We do not require a ghost to inhabit the machine, whether that be a religious soul or quantum shenanigans.

Human beings are embodied. Each of us is delineated within the world by pointing to a separable collection of matter: our bodies.

It is worth noting that human beings are not a constant set of matter: our cells are generated, act and then die. Different cells have different lifespans, and many cells only survive for days or weeks. Our bodies are themselves an abstraction, a la Heraclitus. However, the longest living cells, those of our brain and nervous system, also seem to be the cells that help define us.

The primary organ of control is the brain. To live and to be able to act we need certain parts of our bodies more than others. We can lose limbs and certain organs but not our head or heart. Although our sensory and control apparatus are distributed throughout our bodies, the root of intelligent control is the brain inside the head (you can have a heart transplant but not a brain transplant).

Even to position control within the brain is semi-controversial. Try this thought experiment: did Mike the headless chicken have any choice over its actions? It apparently “attempted to preen, peck for food and crow” (“though with limited success”). Observing Mike we ask: did he choose to preen at that time rather than peck?

Pulling these together, we can say that if the “will” is that which makes decisions and controls our actions, it is primarily located within the matter of our brains within our bodies.

The Will and Our Frontal Lobes

We have seen how the “will” may be primarily located within the brain. Neuroscience allows us to go one further: many of the functions we ascribe to the “will” may be located within the frontal lobes of the brain.

Sebastian023 [CC BY-SA 3.0 ], via Wikimedia Commons
Quick caveat: when we look at the frontal lobes we have to beware the reductionist quest for the homunculus. When we as human beings think and act we use the whole brain (not 10%!), with different aspects of our thoughts being represented in neuronal structures in different areas that are all tied together in time by more neuronal structures. However, brain damage and brain imaging do show correlation between particular areas and certain functions. How we think is also heavily influenced by our bodily state.

Damage to the frontal lobes is known to result in issues with acting in the world. Planning and control may suffer. We may struggle with the appropriate response to a particular situation. We may struggle to control our emotions, or provide an appropriate emotional response. We may lose the feeling of “trueness” that is associated with people and places that we know. All these functions appear to relate to some form of “will”.

There is also a (possibly loose) correlation between the relatively large size of our frontal lobes and our capacities as human beings. The frontal lobes are best seen as a facilitator: they are connected to many other brain areas and help synchronise firing activity in the brain. It is likely the frontal lobes store some of our more abstract neuronal representations. They act in a feedback loop that covers the whole brain (e.g. the fronto-cingulo-parietal network is associated with executive functions). The frontal lobes help to coordinate sequences of mental representations that form the basis for conscious movement and language. In particular, they may help to steer these sequences, while the basal ganglia assemble them in the context of changing sensory signals.

Or as set out here:

the prefrontal cortex, in all of its presumed functions, is neither sensory nor motor, but supports those processes that convey information in the central nervous system in a direction opposite to the classical one: not from input to output but conversely, by corollary discharges that modulate sensory systems in anticipation of future change.

We can see how the prefrontal cortex could be associated with the “will” – it provides a level of control that operates “internal” > “external”, as opposed to the more traditionally deterministic “external” > “internal”. Similarly, this paper argues that a key function of the frontal lobes is:

shaping of behaviour by activation of action requirements or goals specified at multiple levels of abstraction.

To be successful in controlling our actions, we need to represent complex tasks within our brains and carry out multiple goals and subgoals in appropriate sequence. This involves breaking down complex problems into manageable chunks. The frontal lobes play an active role in this control.

The frontal lobes have some interesting properties. You need your frontal lobes to perform advanced “human” acts, but we cannot seem to pin down functions to any one particular area (“there is no unitary executive function”). We seem to need many areas working cooperatively to perform complex tasks. There is also a large level of variation between individuals, but many traits are heritable.

So while we cannot say that “will” is another name for the brain matter of the frontal lobes, the frontal lobes or prefrontal cortex appear necessary but not sufficient for the action of the “will”.


To recap:

  • free will is the thought that we can control our choices and actions;
  • we can look at it in terms of a “will” that is “free”;
  • if we assume that human beings are embodied and material, the “will” can be primarily located within the brain; and
  • the “will” is likely associated with the functioning of the frontal lobes.

The Ambiguity of “Free”

I don’t think it is controversial to place many of the problems with “free will” at the feet of the term “free”. While consideration of the “will” raises spectres such as the non-unity of a static “self”, it is hard to disagree that our bodies encase some form of delineated object that can act in the world. “Free” on the other hand is a philosophical can of worms.

You can discern that we might run into problems by just glancing at a dictionary definition. “Free” can be used as an adjective, an adverb or a verb. As an adjective it can have multiple meanings, including:

  • able to act or be done as one wishes; not under the control of another.
  • not or no longer confined or imprisoned.
  • not subject to engagements or obligations.
  • not subject to or affected by (something undesirable).
  • given or available without charge.
  • using or expending something without restraint; lavish.
  • not observing the normal conventions of style or form.

We find that freedom is often defined more by an absence of something as opposed to a positive property in itself. All of these definitions include “not” or “without”. Without constraint or restraint are we “free”?

Freedom also appears to be defined with reference to the outside: other people or external conditions. Is there such a thing as internal mental freedom?

These definitions also hint at a darker side to freedom. We can think of situations where acting without restraint or undesirable restrictions can be bad for us.

The freedom of the “will”

In “free will”, “free” is used as a modifier for the “will”. Being a modifier, it at first appears to provide limitations to the will. But actually, we are using the modifier to stress that the will is not constrained or limited. “Free” is thus used as a negation: it stresses that the will is not “not free”. By its use, we learn that multiple different “wills” exist, and that an “unfree will” is possible, rather than freedom being an implied property of will.

This is interesting because “free” makes more sense as meaning “not totally constrained and controlled” than it makes as meaning “completely unconstrained”.

One way in which the “will” seems “free” is that two different people experiencing the same set of external conditions may have quite different internal worlds. For example, they could have a differing internal voice, this in turn being implemented by different brain configurations. This is the ethos behind much of Stoic philosophy and more recent reflection.

This position requires that our internal realm, our internal voice of reason, has a causal effect on our external circumstances. If it does not, then our external circumstances will define our actions, and the two different people will act in the same way. This thinking has been used over the years to nicely kill some forms of dualism. If the “will” is rooted in matter, and our internal realm results from the operation of our brains, then there appears to be no reason why our actions cannot be driven by processes that have few external cues. This appears consistent with common sense: even if what we think does not always change how we act, we feel that what we think has at least some causal effect on how we act.

However, now we seem to have converted the issue of “freedom” into a question of time and space.

Free from External Constraint

Saying that the “will” is free of external conditions seems to be one way forward in analysing free will. Even if our external conditions are completely constrained, our internal mental world can act without constraint. However, this is not entirely true. The conditions inside the human body are not independent of those outside the human body: if we are starving, ill or on fire this will surely have a causal effect on our internal processes. There is also variation between individuals: spiritual mastery is often associated with a looser coupling between external and internal conditions (e.g. the Zen monk who lives a rich inner life despite stark outer conditions), while political thought on the left often assumes a strong coupling between these conditions.

The problems tend to arrive once we denounce dualism and see the will as matter. If our brains are matter that is configured by genetics and environment, then the “external” produces the “internal”. We can say that the two different people experiencing the same set of external conditions at one point in time have different internal worlds because they have experienced different external conditions at preceding points in time.

Twin studies provide a useful tool to investigate these questions. They often show that a certain proportion of our inner world is genetic (say 50% for certain factors). A common environment growing up constrains another proportion of variation (say 25%). The remaining variation (say 25%) is based on different experiences within the environment over time. Hence, while identical twins raised in the same family may not share the same thoughts, they are likely to have more correlated inner worlds than non-siblings raised in different environments. “Freedom” appears a matter of degree.
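To make the "matter of degree" point concrete, here is a rough Python sketch. It uses the illustrative 50% / 25% / 25% split from the paragraph above, but the simple additive model, the function names and the sample size are my own assumptions for illustration, not something taken from any particular twin study.

```python
import math
import random

# Illustrative variance shares from the "say 50% / 25% / 25%" figures in the
# text; the additive model itself is an assumption made for this sketch.
GENETIC, SHARED_ENV, UNIQUE_ENV = 0.50, 0.25, 0.25


def trait(genes, shared_env, unique_env):
    """Combine three standard-normal factors into one trait score."""
    return (math.sqrt(GENETIC) * genes
            + math.sqrt(SHARED_ENV) * shared_env
            + math.sqrt(UNIQUE_ENV) * unique_env)


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_pairs(n_pairs, share_genes, share_env, seed=0):
    """Return the trait correlation across n_pairs simulated pairs of people."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    first, second = [], []
    for _ in range(n_pairs):
        genes_a, env_a = rng.gauss(0, 1), rng.gauss(0, 1)
        # A shared factor is copied; an unshared one is drawn afresh.
        genes_b = genes_a if share_genes else rng.gauss(0, 1)
        env_b = env_a if share_env else rng.gauss(0, 1)
        # Individual experience is always drawn separately for each person.
        first.append(trait(genes_a, env_a, rng.gauss(0, 1)))
        second.append(trait(genes_b, env_b, rng.gauss(0, 1)))
    return pearson(first, second)


twins_together = simulate_pairs(20_000, share_genes=True, share_env=True)
strangers = simulate_pairs(20_000, share_genes=False, share_env=False)
print(f"identical twins, same family: {twins_together:.2f}")
print(f"unrelated, different families: {strangers:.2f}")
```

Under these assumptions the twins' scores should correlate at roughly 0.75 (the shared 50% plus 25%), while the unrelated pairs should sit near zero. Neither pair is fully predictable from the other, which is the sense in which freedom looks graded rather than binary.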

Degrees of Freedom

Often the “free” in “free will” is interpreted as a binary proposition: we are either free or we are not, we are either in control of our actions or we are not. However, this does not seem consistent with twin studies, or an examination of how external conditions construct our internal realms.

It is maybe better to say that our choices are constrained, but not entirely predictable. Let’s look at some examples:

  • We can and cannot choose to starve ourselves. People have died from undertaking hunger strikes, but deaths from voluntary starvation are likely to be in the thousands as compared to billions of people who “choose” to eat every day. We appear to believe that we have the freedom of will to starve ourselves, but we also give in to the demands of hunger in 99% of cases.
  • Can you choose actions that determine your physical state? We cannot choose to be tall when we are short. We cannot choose the colour of our skin. We cannot change the basic blueprint of our body, but we may be able to exercise and work out to change muscle mass and fitness. But the effects of exercise may vary depending on our genetic configurations. We may work at jobs that require physical exertion and so do not require “leisure” training. Our upbringing may or may not value sports. We may have good levels of co-ordination or bad. We may or may not have commitments that take up our time.
  • Can you choose not to eat junk food? Our bodies are designed to crave high-calorie foods that were rare in our ancient lives: high fat and high sugar foods are ideal if you are starving. The problem is that they are not great over the long term when we are not starving. If your body is working against you to naturally pick junk food, is there freedom? If you can afford healthier options, and frequent establishments and social circles that stigmatise junk food, have you really chosen to avoid it?
  • Religion appears something that we can freely choose (in modern liberal democracies). But many people gain their religion from the environment of their upbringing, and even those that “rebel” against a particular religious upbringing need something to rebel against.
  • Compare a novel and a biography: in the former, a character has no choice, their actions are determined by the author; in the latter, at the time of their action, there subjectively seems to be a choice, but in the book everything is determined. Indeed, as we abstract over people in history subjective choice appears to disappear.
  • If we have two identical twins, who are brought up in the same environment by their genetic parents, we will likely see that their choices, over time, show correlations. With relation to each other are their choices less free than two entirely separate people?

Choice when explored appears fractal. Causal factors extend and unravel like twine. As you unpick one causal chain, you find others. Looking back with hindsight, actions often appear more determined than our subjective experience of them at the time.

So Maybe Not So Free

Our discussions above seem to suggest that we may not be as free as we feel. The fractal nature of causality, which expands as you look into it, seems to add constraint as we know more and go back further in time. Would human beings appear free to an omniscient being?

We can ask: why does this matter? As an answer I’d argue that the thought that we lack freedom has throughout history been seen as extremely dangerous. If we have no freedom, then we argue we are not in control of our actions. Does this enable us to do what we want? Does social order and society break down?

Let’s park these difficulties for a moment while we look in more detail as to how “free will” can manifest itself in the world. Maybe this can help us restore our freedom.

Free Will and Probability

Let’s regroup:

  • free will is the thought that we can control our choices and actions;
  • we can look at it in terms of a “will” that is “free”;
  • if we assume that human beings are embodied and material, the “will” can be primarily located within the brain;
  • the “will” is likely associated with the functioning of the frontal lobes;
  • freedom seems to relate to a lack of constraint, where the constraint relates to conditions outside of our brains;
  • however, the conditions outside of our brains contribute to the construction of our brains and have a causal effect on what we think; and
  • when we look closely at the interactions between our external and internal worlds, we do not seem to be entirely without constraint.

Another way we can look at “freedom” is through the lens of probability. Probability may be thought of as a way of representing uncertainty when attempting to predict events.

Toss a coin but don’t look at it. It has landed one of two ways (ignore the edge). Probability involves thinking about what outcomes an event can take. In the coin case we have two: heads or tails. As there are two outcomes, and a fair coin favours neither, we can say that each outcome has a 50% chance of occurring.

So how does probability relate to freedom?

If there is more than one outcome for an event we have a form of freedom. The event is not constrained to result in a single outcome. When predicting the coin toss we have a “choice” in which outcome to bet on.

However, probability can still provide a form of constraint. In our coin case we are constrained to heads or tails. We don’t posit that the coin will explode, disappear into a void or turn into a cat.

Human beings struggle with probability. Our struggles with probability are similar to our struggles with freedom. Using probability we can model outcomes of events where we cannot predict the outcome but we can predict the pattern of outcomes that may occur. Does the freedom of will work in a similar way?
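The coin case can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions (a fair coin, a seeded pseudo-random generator and a hypothetical `toss_coins` helper): no single toss can be called in advance, yet the set of possible outcomes is fixed and the overall pattern is predictable.

```python
import random


def toss_coins(n_tosses, seed=0):
    """Simulate n_tosses of a fair coin and return the list of outcomes."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [rng.choice(["heads", "tails"]) for _ in range(n_tosses)]


tosses = toss_coins(10_000)

# The constraint: only two outcomes ever appear.
outcomes = set(tosses)

# The pattern: the share of heads settles close to one half.
heads_fraction = tosses.count("heads") / len(tosses)

print(sorted(outcomes))
print(f"fraction of heads: {heads_fraction:.3f}")
```

Each toss remains a "choice" between two outcomes, but over many tosses the fraction of heads stays close to 0.5 and nothing other than heads or tails ever turns up.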

Both the freedom of free will and probability require there to be multiple possible outcomes. In an entirely deterministic world we have no free will and no probability. A more sophisticated understanding of probability is that it represents a model of what we do not know. It sets limits on our knowledge. This is a more Bayesian approach: we say that the probability of 0.5 for a head in a coin toss indicates that we can be 50% confident in a head or that we can be 50% confident that we will not get a head. The reality of our lives is that we will always have less than perfect knowledge. As we exist physically in time, and make our measurements physically in time, there will always be something we cannot measure. Indeed, the balance is very heavily weighted: we are only able to measure a tiny fraction of our current or past environments, and those measurements will always have some possible error.

Coming back to free will with this interpretation of probability, we see that our brains can never with 100% confidence predict the future states of ourselves or our environment, let alone those of others. Indeed, we cannot with 100% confidence know the equivalent current or past states of ourselves. Also, if we see our “will” as “free” in that it has a choice, this choice requires multiple outcomes in a similar way to probability theory. So both free will and probability involve multiple outcomes that we cannot predict with 100% confidence.


What does it mean to make a choice?

A choice exists at a time before action relating to the choice occurs. In certain cases, having a choice requires multiple outcomes that are not certain, i.e. there appears to be a possibility of each occurring. Does making a choice then consist of differentiated actions towards those outcomes? Does this definition hold for all aspects of choice? Does it require an external observer? Do these “outcomes” even exist in reality?

It would appear that a choice may involve only mental actions: most people would say you can choose to think of a situation in a particular way (the famous “glass half full”). Here an action may be as simple as holding a thought.

Can you have a choice without time? If a choice requires action then action would appear to require time. Also a choice would appear to require a particular point in time before any outcome occurs. We can read about a choice in the past, and the possibility of a choice can extend for a period of time, but once action is taken, or an outcome occurs, the choice appears to no longer exist. Does this mean that we are only possibly “free” if we exist in time?

Then there are studies like the Libet experiments. The “choice” in these experiments was when to move a finger. The outcomes of this choice were a set of times spread over a continuous time range. The choice ended at least when brain activity was first detected, where a series of actions were put in motion that led to the finger moving.

There is also an overlap in the way we use “free” to both refer to the “will” and a “choice”: we can have a “free” choice as well as a “free” will. In a free choice we can select any of the multiple outcomes. While it appears to suggest a uniform probability distribution (each outcome has an equal probability), this would only be the appearance to a totally ignorant observer. The more you knew about the situation, say the preferences of the chooser or the patterns of choice taken by a larger population, the more you can weight the different probabilities.
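One way to put a number on how "free" a choice looks to a given observer is Shannon entropy, which measures the observer's average uncertainty in bits. Using entropy here is my own choice of measure for illustration, and the four-way choice and the probabilities below are made up. It is a sketch of the idea rather than a definition of freedom.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits: the observer's average uncertainty."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)


# A choice between four outcomes, as seen by three hypothetical observers.
ignorant = [0.25, 0.25, 0.25, 0.25]  # knows nothing: uniform weights
informed = [0.70, 0.20, 0.05, 0.05]  # knows the chooser's preferences
certain = [1.00, 0.00, 0.00, 0.00]   # limiting case: knows the outcome

for name, beliefs in [("ignorant", ignorant),
                      ("informed", informed),
                      ("certain", certain)]:
    print(f"{name}: {entropy_bits(beliefs):.2f} bits")
```

The uniform observer faces the maximum uncertainty for four outcomes (2 bits), the informed observer faces less, and the observer who is certain faces none. The chooser has not changed at all; only the observer's knowledge has.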

This can be carried across to our use of “free” when applied to the will. The will is maybe only totally free when observed by a completely ignorant observer. This is the opposite of an omniscient god or God. But as the observer knows more, and their ignorance decreases, we become less free. To an omniscient god or God we are not free. Harking back to the old Germanic root, for a god or God to truly love humanity, he’d need to be an ignorant fool.


We are not gods. This gives us our freedom.

Beings rooted in time and space will always have uncertainty. Uncertainty means we can never truly know how someone will act. But we are also not idiots. We can use knowledge to weight our beliefs about different outcomes. And we will be right some of the time. Just not all of the time.

Even a little uncertainty means that human beings have the benefit of doubt in their actions. We cannot know for certain whether any one person will commit a crime. But we can allocate resources to reduce crime based on our knowledge. We also cannot know for certain why any crime was committed. But we can have hunches and theories. And if we need to select a reason, we can pick our best one. We should remember that our baseline is a blind guess. And the likelihood of a blind guess being right depends on the number of outcomes; the more complex our system, the more we need some form of knowledge.

So: Do We Have Free Will?

Let’s regroup:

  • free will is the thought that we can control our choices and actions;
  • we can look at it in terms of a “will” that is “free”;
  • if we assume that human beings are embodied and material, the “will” can be primarily located within the brain;
  • the “will” is likely associated with the functioning of the frontal lobes;
  • freedom relates to the knowledge of an observer;
  • we can be an observer of ourselves; and
  • we can never know anything with certainty.

Like quantum physics, the answer to the question of whether the will is free seems to depend on the observer. To an omniscient being that exists outside of time we are not free. On the other hand as human beings, we can never know for certain how things are going to turn out. It thus seems like the question regarding determinism and free will is not a helpful one. We are constrained by external circumstances in all we do. But there is also a universe of possibilities within our own minds. We can never know for certain how these will interact, or what the outcome will be. That does not mean that we cannot say what may be more or less likely.

So to say someone has “free will” is maybe to say that their brain acts in a way where, even if we try to constrain or control factors outside of their body, there is still a large amount of uncertainty in our predictions of their future actions. This uncertainty is such that we cannot judge or take actions based on one predicted outcome. Our freedom lies in our necessary ignorance.