Predicting the Future

The rise of machine learning and developments in neuroscience hint that prediction is key to how brains navigate the world. But how could this work in practice?

Let’s ignore neuroscience a minute and just think how we would manually predict the future. Off the top of my head there appear to be three approaches, which we will look at below.

Instantaneous Prediction Using Rates of Change

Photo by Pixabay on

In the physical world, one way to predict the future is to remember our secondary school physics (or university kinematics). For example, if we wanted to predict a position of an object moving along one dimension, we would attempt to find out its position, speed and acceleration, and use equations of motion. In one-dimension, speed is just a measure of a change of position over time (a first-order differential with respect to time). Acceleration is a measure of change of speed over time (a second order differential with respect to time, or a change in a change).

In fact, it turns out the familiar equations of motion are actually just a specific instance of a more general mathematical pattern. James Gregory, Brook Taylor and Colin Maclaurin formalised this approach in the 17th and 18th centuries, but thinking on this issue goes back at least to Zeno, Democritus and Aristotle in ancient Greece, as well as Chinese, Indian and Middle Eastern mathematicians. In modern times, we generally refer to the pattern as a Taylor Series: a function may be represented as an infinite sum of differentials about a point.

For example, our normal equation for distance travelled in one dimension is d(t) = d_0 + v*t + \frac{1}{2}*a*t^2, where d is distance, v is velocity and a is acceleration. In this case, velocity is our first differential – a first-order rate of change of distance with time – and acceleration is our second differential – a second-order rate of change of distance with time (or a rate of change of velocity). We know from school that if we have constant velocity, a is zero and we just have d(t) = d_0 + v*t. What isn’t stressed as much is that the equations learnt by every school kid are for a single dimension with at worst constant acceleration. It is not until university that the veil begins to be removed and we realise that if we have changing acceleration (third-order changes), we need to add another term, and that we can continue this ad-nausem.

As you move into more advanced engineering and physics classes, you are also taught how to extend the equations of motion into two or three dimensions. When we have movement in three-dimensional space, we can model points in spacetime using four-dimensional vectors ([x, y, z, t]). As we move into the multidimensional case, we can look at how each dimension changes with respect to each other dimension for the point. For example, changes in an x-dimension with time (t) may resemble our one-dimensional speed. However, in the multivariate case we now can determine changes in the y and z direction with time. Hence, our velocity becomes a vector indicating how the x, y and z dimensions of the point change over time. Because the directions of modelled Cartesian space are orthogonal, we can analyse them separately.

We will stop here for now but we can also go further. If we can guess at the mass of an object, we can predict an acceleration using F=ma. Hence, our brains can begin to build a suite of approximated functions that predict the rates of change from additional data or latent variables discerned from the data.

Neural Units

So let’s return to thinking about brains. Populations of neurons in our brains do not have god-like powers to view an objective reference frame that depicts spacetime. Instead, they are fixed in space, and travel through time. What they can be is connected. But even this must be limited by practical reality, such as space and energy.

In fact, we can think about “neural units”, which may be a single neuron or a population of neurons. What makes a neural unit “a unit” is that it operates on a discrete set of information. In an image processing case this may be a pixel, in a audio processing example this may be a sample in time, or a frequency measurement.

Now the operation of a neural unit begins to resemble the assumptions for the Taylor Series: it is the point around which our function is evaluated, and all we need is local information relating to our derivatives. We’ll ignore for now the fact that our functions may not be infinitely differentiable about our unit, as it turns out approximations often seem to work fine.

So bringing this all together, we see that a neural unit may be able to (approximately) predict future activity, either of itself in time, or its neighbours in space, by determining local rates of change of different orders.

For example, if we consider the image intensity of a single pixel, we can see that a neural unit may be able to predict the intensity of that pixel at a future time, if there are patterns in the rates of change. For example, if the pixel intensity is increasing at a constant rate, r, then the intensity at a future time, t, may be determined like our velocity above: I(t) = I_0 + r*t.

Linear Approximations

Another way of viewing the same thing is to think about linear approximations.

What are linear approximations? They are just functions where the terms are linear, i.e. are not a power. In a Taylor Series this means chopping off everything past a first order differential. Now, if a car is accelerating towards you, assuming a constant velocity is going to be a very costly mistake. But what is surprising is that a fair bit of engineering is built upon linear approximations. In fact, even now some engineers pull a funny face and start sweating when you move away from linear models. It turns out that a significant portion of the world we live in has first order patterns.

If you go up a little way to include second order patterns, you find that another large chunk of the world can be approximated there. This is most visible in the equations of motion. Why can we stop with second order acceleration terms? Because gravity is constant at 9.8m/s. Until recently, the main things that displayed changing accelerations were animals.

So can we just throw away higher order terms? Not exactly. One issue with a power series is that as we try to predict further from our point or neural unit, higher order differentials become more important. For example, looking forward beyond the first few immediate units and the higher terms will dominate the prediction. Often this is compounded by sensory inaccuracy, the rates of change will never be exact, and errors in measurement are multiplied by the large high-order terms.

So what have we learnt? If we are predicting locally, either in space or in time, in a world with patterns in space-time, we can make good approximations using a Taylor series. However, these predictions become less useful in a rapidly changing world where we need to predict over longer distances in space and time.

Prediction Using Cycles

Image by Dirk Rabe from Pixabay 

Many of the patterns in the natural world are cyclical. These include the patterns of day and night caused by the rotation of the Earth upon its (roughly) north-south axis, the lunar months and tides caused by the rotation of the moon around the Earth, and the seasons caused by the rotation of the Earth around the sun. These are “deep” patterns – they have existed for the whole of our evolutionary history, and so our modelled at least chemically at a low-level in our DNA.

There are then patterns that our based on these patterns. Patterns of sleep and rest, of meal times, of food harvest, of migration, or our need for shelter. Interestingly many of these patterns are interoceptive patterns, e.g. relating to an inner state of our bodies representing how we feel.

The physical world also has patterns. Oscillations in time generate sound. Biological repetition and feedback cycles generates cyclical patterns, such as the vertical lines of light-and-dark observed looking into a forest or the stripes on a zebra.

How do we make predictions using cycles?

Often we have reference patterns that we apply at different rates. In engineering, this forms the basis of Fourier analysis. The rate of repetition we refer to as “frequency”. We can then build up complex functions and signals by the addition of simple periodic or repeating functions with different magnitudes and phases. In mathematics the simple periodic functions are typically the sine or cosine functions.

Image by David Zydd from Pixabay

So if we can use a base set of reference patterns, how does working in the frequency domain help prediction?

In one dimension the answer is that we can make predictions of values before or after a particular point based on our knowledge of the reference functions, and estimates for magnitudes and phases. For a signal that extends in space or time, we only need to know a general reference pattern for a short patch of space or time, stretch or shift it and repeat it, rather than trying to predict each point separately and individually.

When our brain attempts to predict sounds, it can thus attempt to predict frequencies and phases as opposed to complex sound waveforms. In space, things are less intuitive but apply similarly. For example, repeating patterns of intensity in space, such as stretches of light and dark lines (the stripes on a zebra) may be approximated using a reference pattern of one light and one dark line, and then repeating the pattern at an estimate scale, strength and phase. Many textures can be efficiently represented in this way (think of the patterns on plants and animals).

Thinking about neural units, we can see how hierarchies of units may be useful to implement predictions of periodic sequences. We need a unit or population of units to replicate the reference pattern, and to somehow represent an amplitude and phase.

Statistical Prediction

A third way to make predictions is using statistics and probability.

Photo by Balázs Utasi on

Statistics is all about large numbers of measurements (“big data”, when that was trendy). If we have large numbers of measurement we can look for patterns in those numbers.

Roll a six-sided dice a few times and you will record what look like random outcomes. We might have three “4s”, and two “1s”. Roll a dice a few million times and you will see that each of the six numbers occur in more-or-less similar proportions: each number occurs 1/6th of the time. The probability of rolling each number may then be represented as “1/6”.

Rates of change are fairly useless here. This is because we are dealing with discrete outcomes that are often independent. These “discrete outcomes” are also typically complex high-order events (try explaining “roll a dice” to an alien). If you were to measure the change in rolled number (e.g. “4” on roll 1 minus “1” on roll 0 = 3), this wouldn’t be very useful. Similarly, there are no repeating patterns in time or space that make Fourier analysis immediately useful for prediction.

Thinking about a neural unit, we can see that probability may be another way to predict the future. If a neural unit received an intensity for a pixel associated with the centre of a dice, it could learn that the intensity could be 0 or 1 with a roughly 50% likelihood (e.g. numbers 1, 3 and 5 having a central dot, which is absent from numbers 2, 4 and 6). If it got an intensity of 0.5, something strange has happened.

Probability, at its heart, is simply a normalised weight for a likelihood of an outcome. We use a value between 0 and 1 (or 0 and 100%) so that we can compare different events, such as rolling a dice or determining if a cow is going to charge us. In a discrete case, we have a set of defined outcomes. In a continuous case, we have a defined range of outcome values.

How Do Rates of Change and Probabilities Fit Together?

Imagine a set of neural units relate to a pixel in an image. For example, we might look at a nearest pixel to a centre of a webcam image.

In this case, each neural unit may have one associated variable: an intensity or amplitude. Say we have an 8-bit image processing system, so the neural unit can receive a value between 0 and 255 representing a measured image characteristic. This could be a channel measurement, e.g. an intensity for lightness (say 0 is black and 255 is white) or for “Red” (say 0 is not red and 255 is the most red) or an opponent colour space (say 0 is green and 255 is red).

Now nature is lazy. And thinking is hard work. Our neural units want to minimise any effort or activity.

One way to minimise effort is to make local predictions of sensory inputs, and to only pass on a signal when those predictions fail, i.e. to output a prediction error.

A neural unit could predict its own intensity at a future time I(t_0 + t_{interval}) or the intensity of one of its neighbours, e.g. I(x_i + x_{i+1}, y_j + y_{j+1}) in space. If a neural unit receives an intensity in I_{sensory}, it can compute an overall intensity prediction based on time and space prediction I_{prediction} and then determine an error between them e = I_{prediction} - I_{sensory}.

One way to approximate a rate of change is to simply compare neighbouring units in space, or current and past values in time. To compute higher orders, we just repeat this comparison on previously computed rates of change.

If they are arranged in multiple layers, our neural units could begin to predict cyclical patterns. Over time repeated patterns of activity could be represented by the activity of a single set of neural units and a reference to the underlying units that show this activity, e.g. as scaled or shifted. This would be lazier – we could just copy or communicate the activity of the single unit to the lower neural units.

Probability may come into play when looking at a default level of activity for a given context. For example, consider an “at rest” case. In many animals the top of the visual field is generally lighter than the bottom of the visual field. Why is this? Because the sky is above and the ground is below. Of course, this won’t always be the case, but it will be a general average over time. Hence, if you have no other information, a neural unit in the upper visual field would do wise to err on a base level of intensity that is higher (e.g. lighter) that a neural unit in the lower visual field. This also allows laziness in the brain. A non-light intensity signal received by the neural unit in the upper visual field is more informative than a light intensity signal as it is more unlikely. Hence, if there is a finite amount of energy, the neural unit in the upper visual field wants to use more energy to provide a signal in the case of a received non-light intensity signal than in the case of a received light intensity signal. Some of you would spot that we are now moving into the realms of (Shannon) entropy.

In the brain then, it is likely that all these approaches for prediction are applied simultaneously. Indeed, it is probable that the separate functions are condensed into common non-linear predictive functions. It is also likely that modern multi-layer neural networks are able to learn these functions from available data (or at least rough approximations based on the nature of the training data and the high-level error representation).


Folding Autoencoders

It’s nice when different ways of seeing things come together. This often occurs when comparing models of computation in the brain and machine learning architectures.

Many of you will be familiar with the standard autoencoder architecture. This takes an input \mathbf{X}, which may be an image (in a 2D or flattened 1D form) or another input array. It applies one or more neural network layers that progressively reduce the dimensionality of the output. These one or more neural network layers are sometimes called an “encoder”. This forms a “bottleneck” to “compress” the input (in a lossy manner) to form a latent representation. This latent representation is sometimes referred to as a hidden layer. It may be seen as an encoded or compressed form of the input.

A standard autoencoder may have a 1D latent representation with a dimensionality that is much less than the input dimensionality. A variational autoencoder may seek to learn values for a set of parameters that represent a latent distribution, such as a probability distribution for the input.

The autoencoder also has another set of one or more neural network layers that receive the latent representation as an input. These one or more neural network layers are sometimes called a “decoder”. The layers generate an output \mathbf{X'}, which is a reconstruction of the input \mathbf{X}.

The whole shebang is illustrated in the figure below.

Standard Autoencoder
Standard Autoencoder

The parameters of the neural network layers, the “encoder” and the “decoder”, are learnt during training. For a set of inputs, the output of the autoencoder \mathbf{X'} is compared with the input \mathbf{X}, and an error is back-propagated through the neural network layers. Using gradient descent to minimise the errors, an optimal set of parameters may be learnt.

As autoencoders do not require a labelled “ground-truth” output to compare with the autoencoder output during training, they provide a form of unsupervised learning. Most are static, i.e. they operate on a fixed instance of the input to generate a fixed instance of the output, and there are no temporal dynamics.

Now, I’m quite partial to unsupervised learning methods. They are hard, but they also are much better at reflecting “natural” intelligence; most animals (outside of school) do not learn based on pairs of inputs and desired outputs. Models of the visual processing pathway in primates, which have been developed over the last 50 years or so, all indicate that some form of unsupervised learning is used.

In the brain, there are several pathways through the cortex that appear similar to the “encoder” neural network layers. With a liberal interpretation, we can see the latent representation of the autoencoder as an association representation formed in the “higher” levels of the cortex. (In reality, latent representations in the brain are not found at a “single” flat level but are distributed over different levels of abstraction.)

If the brain implements some form of unsupervised learning, and seems to “encode” incoming sensory information, this leads to the question: where is the “decoder”?

This is where predictive coding models can help. In predictive coding models, predictions are fed back through the cortex to attempt to reconstruct input sensory information. This seems similar to the “decoder” layers of the autoencoder. However, in this case, the “encoder” and the “decoder” appear related. In fact, one way we can model this is to see the “encoder” and the “decoder” as parts of a common reversible architecture. This looks at things in a similar way to recent work on reversible neural networks, such as the “Glow” model proposed by OpenAI. The “encoder” represents a forward pass through a set of neural network layers, while the “decoder” represents a backward pass through the same set of layers. The “decoder” function thus represents the inverse or “reverse” of the “encoder” function.

This can be illustrated as follows:

Folding the Autoencoder
Folding the Autoencoder

In this variation of the autoencoder, we effectively fold the model in half, and stick together the “encoder” and “decoder” neural network layers to form a single “reversible neural network”.

In fact, the brain is even cooler than this. If we extend across modalities to consider sensory input and motor output, the brain appears to replicate the folded autoencoder shown above, resulting in something that resembles again our original standard autoencoder:

The brain as a form of autoencoder.
The brain as a form of autoencoder.

Here, we have a lower input “encoder” that represents the encoding of sensory input \mathbf{X} into a latent “associative” representation. A forward pass provides the encoding, while a backward pass seeks to predict the sensory input \mathbf{X} by generating a reconstruction \mathbf{X'}. An error between the sensory input \mathbf{X} and the reconstruction \mathbf{X'} is used to “train” the “encoder”.

We also have an upper output “decoder” that begins with the latent “associative” representation and generates a motor output \mathbf{Y}. A forward pass decodes the latent “associative” representation to generate muscle commands.

The “backward” pass of the upper layer is more uncertain. I need to research the muscle movement side – there is quite a lot based on the pathology of Parkinson’s disease and the action of the basal ganglia. (See the link for some great video lectures from Khan Academy.)

In one model, somasensory input may be received as input \mathbf{Y'}, which represents the muscle movements as actuated based on the muscle commands. The backward pass thus seeks to reconstruct the latent “associative” representation from the input \mathbf{Y'}. The “error” may be computed at one or more of the input level (e.g. looking at the difference between \mathbf{Y} and \mathbf{Y'}) and the latent “associative” representation level (e.g. between the “sensory” input encoding and the “motor” input encoding). In any case there seem to be enough error signals to drive optimisation and learning.

Of course, this model leaves out one vital component: time. Our sensory input and muscle output is constantly changing. This means that the model should be indexing all the variables by time (e.g. \mathbf{X(t)}, \mathbf{X'(t)}, \mathbf{Y(t)}, \mathbf{Y'(t)}, and the latent “associative” representations). Also the forward and backward passes will occur at different times (the forward pass being available earlier than the backward pass). This means that our errors are also errors in time, e.g. between \mathbf{X(t=1)} and \mathbf{X'(t=2)} for a common discrete time basis.

Many of the machine learning papers I read nowadays feature a bewildering array of architectures and names (ResNet, UNet, Glow, BERT, Transformer XL etc. etc.). However, most appear to be using similar underlying principles of architecture design. The way I deal with the confusion and noise is to try to look for these underlying principles and to link these to neurological and cognitive principles (if only to reduce the number of things I need to remember and make things simple enough for my addled brain!). Autoencoder origami is one way of making sense of things.

Natural Language Processing & The Brain: What Can They Teach Each Other?

Part I of II – What Recent Advances in Natural Language Processing Can Teach Us About the Brain

I recently worked out that the day of my birth will be closer to the end of the Second World War than the present day. This means I am living in the future, hooray!

Over the years I’ve been tracking (and experimenting with) various concepts in natural language processing, as well as reading general texts on the brain. To me both streams of research have been running in parallel; in the last 5 years, natural language processing has found a new lease of engineering life via deep learning architectures, and the empirical sciences have been slowly chipping away at cognitive function. Both areas appear to be groping different parts of the same elephant. This piece provides an outlet for the cross-talk in my own head. With advances in natural language processing coming thick and fast, it also provides an opportunity for me to reflect on the important undercurrents, and to try to feel for more general principles.

The post will be in two halves. This half looks at what recent advances in natural language processing and deep learning could teach us about intelligence and the functioning of the human brain. The next half will look at what the brain could teach natural language processing.


I’ll say here that the heavy lifting has been performed by better and brighter folk. I do not claim credit for any of the outlines or summaries provided here; my effort is to try to write things down in a way that make sense to my addled brain, in the hope that things may also make sense to others. I also do not come from a research background, and so may take a few liberties for a general audience.

In natural language processing, these are the areas that have stayed with me:

  • Grammars,
  • Language Models,
  • Distributed Representations,
  • Neural Networks,
  • Attention,
  • Ontologies, and
  • Language is Hard.

In the next section, we’ll run through these (at a negligent speed), looking in particular at what they teach their respective sister fields. If you want to dig deeper, I recommend as a first step the Wikipedia entry on the respective concept, or any of the links set out in this piece.

Let’s get started. Hold on.


Mention the term grammar to most people and they’ll wince, remembering the pain inflicted in English or foreign language lessons. A grammar relates to the rules of language. While we don’t always know what the rules are, we can normally tell when they are being broken.

I would say that a majority of people view grammar like god (indeed Baptists can very nearly equate the two). There is one true Grammar, it is eternal and unchanging, and woe betide you if you break the rules. Peek behind the curtain though and you realise that linguists have proposed over a dozen different models for language, and all of them fail in some way. 

So what does this mean? Are we stuck in a post-modern relativist malaise? No! Luckily, there are some general points we can make.

Most grammars indicate that language is not a string of pearls (as the surface form of words seems to suggest) but has some underlying or latent structure. Many grammars indicate recursion and fractal patterns of self-similarity, nested over hierarchical structures. You can see this here:

  • The ball.
  • The ball was thrown.
  • The red ball was thrown over the wall.
  • In the depths of the game, the red ball was thrown over the wall, becoming a metaphor for the collapse of social morality following the fall of Communism.

Also the absence of “one grammar to rule them all”, teaches us that our rulesets are messy, incomplete and inconsistent. There is chaos with constraint. This hints that maybe language cannot be definitively defined using language. This hints further at Gödel and Church. This doesn’t necessarily rule out building machines that parse and generate language, but it does indicate that these machines may not be able to follow conventional rule-based deterministic processing.

With the resurgence of neural approaches, grammars have gone out of fashion, representing the “conventional rule-based deterministic processing” that “does not work”. But we should not ignore their lessons. Many modern architectures do not seem to accurately capture the recursion and self-similarity, and it appears difficult to train different layers to capture the natural hierarchy. For example, a preferred neural network approach, which we will discuss in more detail below is the recurrent neural network. But this is performing gated repeated multiplication. This means that each sentence above is treated quite differently. This seems to miss the point above. While attention has helped, this seems to be a band-aid as opposed to a solution.

Language Models

A language model is a probabilistic model that seeks to predict a next word given one or more previous words. The “probabilistic” aspect basically means that we are given a list of probabilities associated with a list of candidate words. A word with a probability of 1 would indicate that a word was definitely next (if you are a Bayesian that you are sure this is the next word). A word with a probability of 0 would indicate that the word was definitely not next. A probability is a probability if all the possible outcomes add up to one, so all our probability values across our words need to do this.

In the early 2000s, big strides were made using so-called ‘n-gram‘ approaches. Translated, ‘n-gram’ approaches count different sequences of words. The “n” refers to a number of words in the sequence. If n=3, we count different sequences of three words and use their frequency of occurrence to generate a probability. Here are some examples:

  • the cat sat (fairly high count)
  • he said that (fairly high count)
  • sat cat the (count near 0)
  • said garbage tree (count near 0)

If we have enough digital data, say by scanning all the books, then we can get count data indicating the probabilities of millions of sequences. This can help with things such as spell checking, encoding, optical character recognition and speech recognition.

We can also scale up and down our ‘n-gram’ models to do things like count sequences of characters or sequences of phonemes instead of words.

Language models were useful as they introduced statistical techniques that laid the groundwork for later neural network approaches. They offered a different perspective from rule-based grammars, and were well suited to real-world data that was “messy, incomplete and inconsistent”. They showed that just because a sentence fails the rules of a particular grammar, it does not mean it will not occur in practice. They were good for classification and search: it turns out that there were regular patterns behind language that could enable us to apply topic labels, group documents or search.

Modern language models tend to be built not from n-grams but using recurrent neural networks, such as one or bi- directional Long Short Term Memories (LSTMs). In theory, the approaches are not dissimilar, the LSTMs are in effect counting word sequences and storing weights that reflect regular patterns within text. There are just adding a sprinkling of non-linearity.

Like all models, language models have their downsides. A big one is that people consistently fail to understand what they can and cannot do. They show the general patterns of use, and show where the soft boundaries in language use lie. It turns out that we are more predictable than we think. However, they are not able, on their own, to help us with language generation. If you want to say something new, then this is by its nature going to be of low probability. They do not provide help for semantics, the layer of meaning below the surface form of language. This is why LSTMs can produce text that at first glance seems sensible, with punctuation, grammatically correct endings and what seem like correct spellings. But look closely and you will see that the text is meaningless gibberish.

Quite commonly the answer to the current failings of recurrent neural networks has been to add more layers. This does seem to help a little, as seen with models such as BERT. But just adding more layers doesn’t seem to provide a magic bullet to the problems of meaning or text generation. Outside of the artificial training sets these models still fail in an embarrassing manner.

It is instructive to compare the failures of grammars and language models, as they both fail in different ways. Grammars show that our thoughts and speech have non-linear patterns of structure, that there is something behind language. Language models show that our thoughts and speech do not follow well-defined rules, but do show statistical regularity, normally to an extent that surprises us “free” human agents.

Distributed Representations

Distributed representations are what Geoff Hinton has been banging on about for years and for me are one of the most important principles to emerge from recent advances in machine learning. I’m lying a little when I link them to natural language processing as they originally came to prominence in vision research. Indeed, much of the work on initial neural networks for image recognition was inspired by the earlier neuroscience of Nobel Prize Winners Hubel and Wiesel.

Distributed representations mean that our representations of “things” or “concepts” are shared among multiple components or sub-components, where each component or sub-component forms part of numerous different “things” or “concepts”.

Put another way, it’s a form of reductionist recycling. Imagine you had a box of Lego bricks. You can build different models from the bricks, where the model is something more than the underlying bricks (a car is more than the 2×8 plank, the wheels, those little wheel arch things etc.). So far, so reductionist. The Greeks knew this several millennia ago. However, now imagine that each Lego brick of a particular type (e.g. red 2×1 block, the 2×8 plank, each white “oner”) is the same brick. So all your models that have use red 2×1 blocks use the same red 2×1 block. This tend to turn your head inside out. Of course, in reality you can’t be in two places at the same time, but you can imagine your brain assembling different Lego models really quickly in sequence as we think about “things” (or even not “things”, like abstractions or actions or feelings).

This is most easily understood when thinking about images. This image from Wei et al at MIT might help:

Kindly reproduced from Wei et al here.

In this convolutional neural network, the weights of each layer are trained such that different segments of each layer end up representing different aspects of a complex object. These segments form the “Lego bricks” that are combined to represent the complex object. In effect, the segments reflect different regular patterns in the external environment, and different objects are represented via different combinations of low-level features. As we move up the layers our representations become more independent of the actual sensory input, e.g. they are activated even if lighting conditions change, or if the object moves in our visual field.

Knowing this, several things come to mind with regard to language:

  • It is likely that the representations that relate to words are going to follow a similar pattern to visual objects. Indeed, many speech recognition pipelines use convolutional neural networks to decipher audio signals and convert this to text. This form of representation also fits with the findings from studying grammars: we reuse semantic and syntactic structures and the things we describe can be somewhat invariant of the way we describe them.
  • Our components are going to be hard to imagine. Language seems to come to us fully-formed as discrete units. Even Plato got confused and thought there was some magical free-floating “tree” that existed in an independent reality. We are going to have to become comfortable describing the nuts and bolts of sentences, paragraphs and documents using words to describe things that may not be words.
  • For images, convolutional neural networks are very good at building these distributed representations across the weights of the layer. This is because the convolution and aggregation is good at abstracting over two-dimensional space. But words, sentences, paragraphs and documents are going to need a different architecture; they do not exist in two-dimensional space. Even convolutional neural networks struggle when we move beyond two dimensions into real-world space and time.

Neural Networks

Neural networks are back in fashion! Neural networks have been around since the 1950s but it is only recently we have got them to work in a useful manner. This is due to a number of factors:

  • We now have hugely fast computers and vastly greater memory sizes.
  • We worked out how to practically perform automatic differentiation and build compute graphs.
  • We began to have access to huge datasets to use for training data.

The best way to think of neural network is that they implement differentiable function approximators. Given a set of (data-X, label-Y) pairs neural networks perform a form of non-linear line fitting that maps X>Y.

Within natural language processing, neural networks have out performed comparative approaches in many areas, including:

  • speech processing (text-to-speech and speech-to-text);
  • machine translation;
  • language modelling;
  • question-answering (e.g. simple multiple choice logic problems);
  • summarisation; and
  • image captioning.

In the field of image processing, as set out above, convolutional neural networks rule. In natural language processing, the weapon of choice is the recurrent neural network, especially the LSTM or Gated Recurrent Unit (GRU). Often recurrent neural networks are applied as part of a sequence-to-sequence model. In this model an “encoder” receives a sequence of tokens and generates a fixed-size numeric vector. This vector is then supplied to a “decoder”, which outputs another sequence of tokens. Both the encoder and the decoder are implemented using recurrent neural networks. This is one way machine translation may be performed.

Neural networks do not work like the brain. But they show that a crude model can approximate some aspects of cortical function. They show that it is possible to build models of the world by feeding back small errors between our expectations and reality. No magic is needed.

The limitations of neural networks also show us that we are missing fairly large chunks of the brain puzzle – intelligence is not just sections of cortex. Most of the progress in the field of machine learning has resulted from greater architectural complexity, rather than any changes to the way neural networks are trained, or defined. At the moment things resemble the wild west, with architectures growing based on hunches and hacking. This kind of shifts the problem: the architectures still need to be explicitly defined by human beings. We could do with some theoretical scaffolding for architectures, and a component-based system of parts.

Most state of the art neural network models include some form of attention mechanism. In an over-simplified way, attention involves weighting components of an input sequence for every element in an output sequence.

In a traditional sequence-to-sequence system, such as those used for machine translation, you have an encoder, which encodes tokens in an input sentence (say words in English), and a decoder, which takes an encoded vector from the encoder and generates an output sentence (say words in Chinese). Attention models sought to weight different encoder hidden states, e.g. after each word in the sentence, when producing each decoder state (e.g. each output word).

A nice way to think of attention is in the form of values, keys and queries (as explained here).

In a value/key/query attention model:

  • the query is the decoder state at the time just before replicating the next word (e.g. a a given embedding vector for a word);
  • the keys are items that together with the query are used to determine the attention weights (e.g. these can be the encoder hidden states); and
  • the values are the items that may be weighted using attention (e.g. these can also be the encoder hidden states).

In the paper “Attention is All You Need”, the authors performed a nifty trick by leading with the attention mechanism and ditching some of the sequence-to-sequence model. They used an attention function that used a scaled dot-product to compute the weighted output. If you want to play around with attention in Keras this repository from Philippe Rémy is a good place to start.

Adam Kosiorek provides a good introduction to attention in this blogpost. In his equation (1), the feature vector z is equivalent to the values and the keys are the input vector x. A value/key/query attention model expands the parameterised attention network f(…) to be a function of both two input vectors: the hidden state of the decoder (the query Q) and the hidden states of the encoder (the keys K) – a = f(Q, K). The query here changes how the attention weights are computed based on the output we wish to produce.

Now I have to say that many of the blogposts I have read that try to explain attention fail. What makes attention difficult to explain?

Attention is weighting of an input. This is easy to understand in a simple case: a single, low-dimensionality feature vector x, where the attention weights are calculated using x and are in turn applied to x. Here we have:

a = f(x)


g = ax.

g is the result of applying attention, in this simple example a weighted set of inputs. The element-wise multiplication ⊙ is simple to understand – the kth element of g is computed as g_k = a_k*x_k (i.e. an element of a is used to weight a corresponding element of x). So attention, put like this, is just the usual weighting of an input, where the weights are generated as a function of the input.

Now attention becomes more difficult to understand as we move away from this simple case.

  1. Our keys and values may be different (i.e. x and z may be different sizes with different values). In this case, I find it best to consider the values (i.e. z) as the input we are weighting with the attention weights (i.e. a). However, in this case, we have another input – the keys (or x) – that are used to generate our attention weights, i.e. a=f(x). In sequence-to-sequence examples the keys and values are often the same, but they are sometimes different. This means that some explanations conflate the two, whereas others separate them out, leading to bifurcated explanations.
  2. In a sequence-to-sequence model our keys and/or values are often the hidden states of the encoder. Each hidden state of the encoder may be considered an element of the keys and/or values (e.g. an element of x and/or z). However, encoders are often recurrent neural networks with hidden dimensions of between 200 and 300 values (200-300D), as opposed to single dimension elements (e.g. 1D elements – [x1, x2, x3…]). Our keys and values are thus matrices rather than arrays. Each element of the keys and/or values thus becomes a vector in itself (e.g. x1 = [x11, x12, x13…]). This opens up another degree of freedom when applying the attention weights. Now many systems treat each hidden state as a single unit and multiply all the elements of the hidden state by the attention weight for the hidden state, i.e. gk = ak*[xk1, xk2, xk3…] = [akxk1, akxk2, akxk3…..]. However, it is also possible to apply attention in a multi-dimensional manner, e.g. where each attention weight is a vector that is multiplied by the hidden state vector. In this case, you can apply attention weights across the different dimensions of the encoder hidden state as well as apply attention to the encoder hidden state per se. This is what I believe creates much of the confusion.
  3. Sequence-to-sequence models often generate the attention weights as a function of both the encoder hidden states and the decoder hidden states. Hence, as in the “Attention is All You Need” paper the attention weights are computed as a function of a set of keys and a query, where the query represents a hidden state of a decoder at a time that a next token is being generated. Hence, a = f(Q, K). Like the encoder hidden state, the decoder hidden state is often of high dimensionality (e.g. 200-300D). The function may thus be a function of a vector Q and a matrix K. “Attention is All You Need” further confuses matters by operating on a matrix Q, representing hidden states for multiple tokens in the decoder. For example, you could generate your attention weights as a function of all previous decoder hidden states.
  4. Many papers and code resources optimise mathematical operations for efficiency. For example, most operations will be represented as matrix multiplications to exploit graphical processing unit (GPU) speed-ups. However, this often acts to lose some logical coherence as it is difficult to unpick the separate computations that are involved.
  5. Attention in a sequence-to-sequence model operates over time via the backdoor. Visual attention models are easier to conceptually understand, as you can think of them as a focus over a particular area of a 2D image. However, sentences represent words that are uttered or read at different times, i.e. they are a sequence where each element represents a successive time. Now, I haven’t seen many visual attention models that operate over time as well as 2D (this is in effect a form of object segmentation in time). The query in the attention model above thus has a temporal aspect, it changes the attention weights based on a particular output element, and the hidden state of the decoder will change as outputs are generated. However, “t” doesn’t explicitly occur anywhere.

Unpacking the confusion also helps us see why attention is powerful and leads to better results:

  • In a sequence-to-sequence model, attention changes with each output token. We are thus using different data to condition our output at each time step.
  • Attention teaches us that thinking of cognition as a one-way system leads to bottlenecks and worse results. Attention was developed to overcome the constraints imposed by trying to compress the meaning of a sentence into a single fixed-length representation.
  • In certain sequence-to-sequence models, attention also represents a form of feedback mechanism – if we generate our attention weights based on past output states we are modifying our input based on our output. We are getting closer to a dynamic system – the system is being applied iteratively with time as an implicit variable.
  • Visualisations of attention from papers such as “Show, Attend and Tell” –  and machine translation models seem to match our intuitive notions of cognition. When a system is generating the word “dog” the attention weights emphasise the image areas that feature a dog. When a system is translating a compound noun phrase it tends to attend jointly to all the words of the phrase. Attention can thus be seen as a form of filtering, it helps narrow the input to conditionally weigh an output.
  • Attention is fascinating because it suggests that we can learn a mechanism for attention separately from our mapping function. Our attention function f(…) is a parameterised function where we learn the parameters during training. However, these parameters are often separate from the parameters that implement the encoder and decoder.

Attention appears to have functional overlaps with areas of the thalamus and basal ganglia which form a feedback loop between incoming sensory inputs and the cortex. Knowing about how attention works in deep learning architectures may provide insight into mechanisms that could be implemented in the brain.


In the philosophical sense, an ontology is the study of “being”, i.e. what “things” or “entities” there are in the world, how they exist, what they are and how they relate to each other.

In a computer science sense, the term “ontology” has also been used to describe a method of organising data to describe things. I like to think of it representing something like a database schema on steroids. Over the last few decades, one popular form of an “ontology” has been the knowledge graph, a graph of things and relationships represented by triples, two “things” connected by a “relationship”, where the “things” and the “relationship” form part of the ontology.

Ontologies are another area that has faded out of fashion with the resurgence of deep neural networks. In the early 2000s there was a lot of hype surrounding the “semantic web” and other attempts to make data on the Internet more machine interpretable. Projects like DBpedia and standard drives around RDF and OWL offered to lead us to a brave new world of intelligent devices. As with many things they didn’t quite get there.

What happened? The common problem of overreach was one. Turns out organising human knowledge is hard. Another problem was one shared with grammars, human beings were trying to develop rule-sets, conventions and standards for something that was huge and statistical in nature. Another was that we ended up with a load of JAVA and an adapted form of SQL (SPARQL), while the Internet and research, being stubborn, decided to use hacky REST APIs and Python.

However, like grammars, ontologies got some things right, and we could do with saving some of the baby from the bathwater:

  • Thinking about things in terms of graphs and networks seems intuitively right. The fact that ontologies are a useful way to represent data says something in itself about how we think about the world.
  • It turns out that representing graph data as sets of triples works fairly well. This may be useful for further natural language processing engineering. This appears to reflect some of the fractal nature of grammars, and the self-similarity seen in language.
  • Ontologies failed in a similar way to grammars. Neural networks have taught us that hand-crafting features is “not the way to go”. We want to somehow combine the computing and representational aspects of ontologies, with learnt representations from the data. We need our ontologies to be messier. No one has quite got there yet, there have been graph convolutional networks but the maths is harder and so they form a niche area that is relatively unknown.
  • The “thing”, “relationship/property” way of thinking seems to (and was likely chosen to) reflect common noun/verb language patterns, and seems to reflect an underlying way of organising information in our brains, e.g. similar to the “what” and “where” pathways in vision or the “what” and “how” pathways in speech and motor control.

Language is Hard

To end, it is interesting to note that the recent advances in deep learning started with vision, in particularly image processing. Many attempted to port across techniques that had been successful in vision to work on text. Most of the time this failed. Linguists laughed.

For example, compare the output of recent Generative Adversarial Networks (GAN) with that of generative text systems.  There are now many high-resolution GAN architectures but generative text systems struggle with one or two coherent sentences and collapse completely over a paragraph. This strongly suggests that language is an emergent system that operates on top of vision and other sensory modalities (such as speech recognition and generation). One reason why deep learning architectures struggle with language is that they are seeking to indirectly replicate a very complex stack using only the surface form of the stack output. 

Take object persistence as another example. Natural language processing systems currently struggle with entity co-reference that a 5-year old can easily grasp, e.g. knowing that a cat at the start of the story is the same cat at the end of a story. Object persistence in the brain is likely based on at least low-level vision and motor representations. Can we model these independently of the low-level representations?

The current trend in natural language processing is towards bigger and more complex architectures that excel on beating benchmarks but generally fail miserably on real-world data. Are we now over-fitting in architecture space? Maybe one solution is to take a more modular approach where we can slot in different sub-systems that all feed into the priors for word selection. 

In part two, we will look at things from the other side of the fence. We review some of the key findings in neuro- and cognitive science, and have a look at what these could teach machine learning research.

Sampling vs Prediction

Some things have recently been bugging me when applying deep learning models to natural language generation. This post contains my random thoughts on two of these:  sampling and prediction. By writing this post, I hope to try to tease these apart in my head to help improve my natural language models.



Sampling is the act of selecting a value for a variable having an underlying probability distribution.

For example, when we toss a coin, we have a 50% chance of “heads” and a 50% chance of “tails”. In this case, our variable may be “toss outcome” or “t”, our values are “head” and “tail”, and our probability distribution may be modeled as [0.5, 0.5], representing the chances of “heads” and “tails”.

To make a choice according to the probability distribution we sample.

One way of sampling is to generate a (pseudo-) random number between 0 and 1 and to compare this random number with the probability distribution. For example, we can generate a cumulative probability distribution, where we create bands of values that sum to one from the probability distribution. In the present case, we can convert our probability distribution [0.5, 0.5] into a cumulative distribution [0.5, 1], where we say 0 to 0.5 is in the band “heads” and 0.5 to 1 is in the band “tails”. We then compare our random number with these bands, and the band the number falls into is the band we select. E.g. if our random variable is 0.2, we select the first band – 0 to 0.5 – and our variable value is “heads”; if our random variable is 0.7, we select the second band – 0.5 to 1 – and our variable value is tails.

Let’s compare this example to a case of a weighted coin. Our weighted coin has a probability distribution of [0.8, 0.2]. This means that it is weighted so that 80% of tosses come out as “heads”, and 20% of tosses come out as “tails”. If we use the same sampling technique, we generate a cumulative probability distribution of [0.8, 1], with a band of 0 to 0.8 being associated with “heads” and a band of 0.8 to 1 being associated with tails. Given the same random variable values of 0.2 and 0.7, we now get sample values of “heads” and “heads”, as both values fall within the band 0 to 0.8

Now that all seems a bit long winded. Why have I repeated the obvious?

Because most model architectures that act to predict or generate text skip over how sampling is performed. Looking at the code associated with the models, I would say 90% of cases generate an array of probabilities for each time step, and then take the maximum of this array (e.g. using numpy.argmax(array)). This array is typically the output of a softmax layer, and so sums to 1. It may thus be taken as a probability distribution. So our sampling often takes the form of a greedy decoder that just selects the highest probability output at each time step.

Now, if our model consistently outputs probabilities of [0.55, 0.45] to predict a coin toss, then based on the normal greedy decoder, our output will always be “heads”. Repeated 20 times, our output array would be 20 values of “heads”. This seems wrong – our model is saying that over 20 repetitions, we should have 11 “heads” and 9 “tails”.

Let’s park this for a moment and look at prediction.


With machine learning frameworks such as Tensorflow and Keras you create “models” that take an input “tensor” (a multidimensional array). Each “model” applies a number of transformations, typically via one or more neural networks, to output another “tensor”. The generation of an output is typically referred to as a “prediction”.

A “prediction” can thus be deemed an output of a machine learning model, given some input.

To make an accurate prediction, a model needs to be trained to determine a set of parameter values for the model. Often there are millions of parameters. During training, predictions are made based on input data. These predictions are compared with “ground truth” values, i.e. the actual observed/recorded/measured output values. Slightly confusingly, each pair of “input-ground truth values” is often called a sample. However, these samples have nothing really to do with our “sampling” as discussed above.

For training, loss functions for recurrent neural networks are typically based on cross entropy loss. This compares the “true” probability distribution, in the form of a one-hot array for the “ground truth” output token (e.g. for “heads” – [1, 0]), with the “probability distribution” generated by the model (which may be [0.823, 0.177]), where cross entropy is used to compute the difference. This post by Rob DiPietro has a nice explanation of cross entropy. The difference filters back through the model to modify the parameter values via back-propagation.

Sampling vs Prediction

Now we have looked at what we mean by “sampling” and “prediction” we come to the nub of our problem: how do these two processes interact in machine learning models?

One problem is that in normal English usage, a “prediction” is seen as the choice of a discrete output. For example, if you were asked to “predict” a coin toss, you would say one of “heads” and “tails”, not provide me with the array “[0.5, 0.5]”.

Hence, “prediction” in normal English actually refers to the output of the “sampling” process.

This means that, in most cases, when we are “predicting” with a machine learning model (e.g. using `model.predict(input)` in Keras), we are not actually “predicting”. Instead, we are generating an instance of the probability distribution given a set of current conditions. In particular, if our model has a softmax layer at its output, we are generating a conditional probability distribution, where the model is “conditioned” based on the input data. In a recurrent neural network model (such as an LSTM or GRU based model), at each time step, we are generating a conditional probability distribution, where the model is “conditioned” based on the input data and the “state” parameters of the neural network.

As many have suggested, I find it useful to think of machine learning models, especially those using multiple layers (“deep” learning), as function approximators. In many cases, a machine learning model is modeling the probability function for a particular context, where the probability function outputs probabilities across an output discrete variable space. For a coin toss, we our output discrete variable space is “heads” or “tails”; for language systems, our output discrete variable space is based on the set of possible output tokens, e.g. the set of possible characters or words. This is useful, because our problem then becomes: how do we make a model that accurate approximates the probability distribution function. Note this is different from: how do we generate a model that accurately predicts future output tokens.

These different interpretations of the term “prediction” matter less if our output array has most of its mass under one value, e.g. for each “prediction” you get an output that is similar to a one-hot vector. Indeed, this is what the model is trying to fit if we supply our “ground truth” output as one-hot vectors. In this case, taking a random sample and taking the maximum probability value will work out to be mostly the same.

I think some of the confusion occurs as different people think about the model prediction in different ways. If the prediction is seen to be a normalized score, then it makes more sense to pick the highest score. However, if the prediction is seen to be a probability distribution, then this makes less sense. Using a softmax computation on an output does result in a valid probability distribution. So I think we should be treating our model as approximating a probability distribution function and looking at sampling properly.

Another factor is the nature of the “conditioning” of the probability distribution. This can be thought of as: what constraints are there on the resultant distribution. A fully constrained probability distribution becomes deterministic: only one output is going to be correct, and this has a value of one. Hence, the argmax is the same as a random sample but random sampling will still be correct. If we can never truly know or model our constraints then we can never obtain a deterministic setup.

Deterministic vs Stochastic

The recent explosion in deep learning approaches was initially driven by image classification. In these cases, you have an image as an input, and you are trying to ascertain what is in an image (or a pixel). This selection is independent of time and reduces to a deterministic selection: there is either a dog in the image or there is not a dog in the image. The probability distribution output by our image classification model thus indicates more a confidence in our choice.

Language, though, is inherently stochastic: there are multiple interchangeable “right” answers. Also language unfolds over time. This amplifies the stochasticity.

Thinking about a single word choice, language models will generally produce flatter conditional probability distributions for certain contexts. For example, you can often use different words and phrases to mean the same thing (synonyms for example, or different length noun phrases). Also, the length of the sequence is itself a random variable.

In this case, the difference between randomly sampling the probability distribution and taking the maximum value of our probability array becomes more pronounced. Cast your mind back to our coin toss example: if we are accurately modeling the probability distribution then actually argmax doesn’t work – we have two equally likely choices and we have to hack our decoding function to make a choice when we have equal probabilities. However, sampling from the probability distribution works.

A key point people often miss is that we can very accurately model the probability distribution, while producing outputs that do not match “ground truth values”.

What does this all mean?

Firstly, it means that when selecting an output token for each time step, we need to be properly sampling rather than taking the maximum value.

Secondly, it means we have to pay more attention to the sampling of output in our models. For example, using beam search or the Viterbi algorithm to perform our sampling and generate our output sequence. It also suggests more left-field ideas, like how could we build sampling into our model, rather than have it as an external step.

Thirdly, we need to look at our loss function in our language models. At one extreme, you have discriminator-style loss functions, where we perform a binary prediction of whether the output looks right; at the other extreme, you having teaching forcing on individual tokens. The former is difficult to train (how do you know what makes a “good” output and move towards that) but the later also tends to memorize an input, or struggle to produce coherent sequences of tokens that resemble natural language. If we are trying to model a probability distribution, then should our loss be based on trying to replicate the distribution of our input? Is a one-hot encoding the right way to do this?

Of course, I could just be missing the point of all of this. However, the generally poor performance of many generative language models (look at the real output not the BLEU scores!) suggests there is some low-hanging fruit that just requires some shift in perspective.






Understanding Convolution in Tensorflow

This is a quick post intended to help those trying to understand convolution as applied in Tensorflow.

There are many good blog posts on the Internet explaining convolution as applied in convolutional neural networks (CNNs), e.g. see this one by Denny Britz. However, understanding the theory in one thing, knowing how to implement it is another. This is especially the case when trying to apply CNNs to word or character-level natural language processing (NLP) tasks – here the image metaphors break down a little.

I generally use Tensorflow for neural network modelling. Most of the stuff I want to do is a little bespoke, so I need something a little more expressive than Keras. Two dimensional convolution is explained in the Tensorflow documents here. I also found the examples in the stackoverflow answer here very useful.

To summarise, for our input we have a [a, b, c, d] tensor – i.e. a x b x c x d.

  • a is the number of input ‘images’ or examples (this will typically be your batch size);
  • b is the input width (e.g. image width, max. word length in characters or max. sequence length in words);
  • c is the input height (e.g. image height or embedding dimensionality); and
  • d is the number of input channels (grayscale images, characters or words = 1, RGB = 3).

For the convolution kernel filter you also have a [o, p, q, r] tensor – i.e. o x p x q x r.

  • o is the filter width (e.g. patch width or ‘n-gram’ size);
  • p is the filter height (e.g. patch height or embedding dimensionality);
  • q is the number of channels (from the input – i.e. input channels); and
  • r is the number of filters (or output channels).

q basically has to match d. r – the number of output channels – is equal to the number of filters you want. For each output channel, a different filter of width * height will be created and applied.

Most of the time the theory is talking about images of b x c, and a filter of o x p.

(This follows Denny’s configuration. However, I note you can transpose b and c and o and p and get the same outcome.)

For NLP the number of output channels becomes an important parameter. This is because you will typically max-pool over the sequence (e.g. word or character) length, such that you get one value for the complete sequence per filter. Each filter can be thought of as representing something like an n-gram, e.g. the learnt parameters of one filter of length 3 could represent the character embeddings for the suffix “ly” (e.g. [l_embedding, y_embedding, wordend_embedding]) or the prefix “un” (including word start token).

I found it instructive to work through the code associated with Denny’s further blog post here.

Practical Problems with Natural Language Processing

Or the stuff no one tells you that is actually quite hard.

Recently I’ve been playing around with the last 15 years of patent publications as a ‘big data’ source. This includes over 4 million individual documents. Here I thought I’d highlight some problems I faced. I found that a lot of academic papers tend to ignore or otherwise bypass this stuff.

Sentence Segmentation

Many recurrent neural network (RNN) architectures work with sentences as an input sequence, where the sentence is a sequence of word tokens. This introduces a first problem: how do you get your sentences?

A few tutorials and datasets get around this by providing files where each line in the file is a separate sentence. Hence, you can get your sentences by just reading the list of filelines.

In my experience, the only data where file lines are useful is code. For normal documents, there is no correlation between file lines and sentences; indeed, each sentence is typically of a different length and so is spread across multiple file lines.

In reality, text for a document is obtained as one large string (or at most a set of paragraph tags). This means you need a function that takes your document and returns a list of sentences: s = sent_tokenise(document).

A naive approach is to tokenise based on full stops. However, this quickly runs into issues with abbreviations: “U.S.”, “No.”, “e.g.” and the like will cut your sentences too soon.

The Python NLTK library provides a sentence tokeniser function out-of-the-box – sent_tokenize(text). It appears this is slightly more intelligent that a simple naive full stop tokenisation. However, it still appears to be cutting sentences too early based on some abbreviations. Also optical character recognition errors, such as “,” instead of “.”, or variable names, such as “” will give you erroneous tokenisation.


One option to resolve this is to train a pre-processing classifier to identify (i.e. add) <end-of-sentence> tokens. This could work at the word token level, as the NLTK word tokeniser does appear to extract abbreviations, websites and variable names as single word units.

You can train the Punkt sentence tokenizer – and . One option is to test training the Punkt sentence tokenizer with patent text.

Another option is to implement a deep learning segmenter on labelled data – e.g. starting from here – . You can have sequence in and control labels out (e.g. a sequence to sequence system). Or even character based labelling using a window around the character (10-12). This could use a simple feed-forward network. The problem with this approach is you would need a labelled dataset – ideally we would like an unsupervised approach.

Another option is to filter an imperfect sentence tokeniser to remove one or two word sentences.

Not All File Formats are Equal

An aside on file formats. The patent publication data is supplied as various forms of compressed file. One issue I had was that it was relatively quick and easy to access data in a nested zip file (e.g. multiple layers down – using Python’s zipfile). Zip files could be access as hierarchies of file objects. However, this approach didn’t work with tar files, for these files I needed to extract the whole file into memory before I could access the contents. This resulted in ‘.tar’ files taking up to 20x longer to access than ‘.zip’ files.


Related to sentence segmentation is the issue of section titles. These are typically set of <p></p> elements in my original patent XML files, and so form part of the long string of patent text. As such they can confuse sentence tokenisation: they do not end in a full stop and do not exhibit normal sentence grammar.


Titles can however be identified by new lines (\n). A title will have a preceding and following new line and no full stop. It could thus be extracted using a regular expression (e.g. “\n\s?(\W\s?)\n”).

Titles may be removed from the main text string. They may also be used as variables of “section” objects that model the document, where the object stores a long text string for the section.

As an aside, titles in HTML documents are sometimes wrapped in header tags (e.g. <h3></h3>). For web pages titles may thus be extracted as part of the HTML parsing.

Word Segmentation

About half the code examples I have seen use a naive word tokenisation that splits sentences or document strings based on spaces (e.g. as a list comprehension using doc.split()). This works fairly successfully but is not perfect.

The other half of code examples use a word tokenising function supplied by a toolkit (e.g. within Keras, TensorFlow or NLTK). I haven’t looked under the hood but I wouldn’t be surprised if they were just a wrapper for the simple space split technique above.

While these functions work well for small curated datasets, I find the following issues for real world data. For reference I was using word_tokenise( ) from NLTK.

Huge Vocabularies

A parse of 100,000 patent documents indicates that there are around 3 million unique “word” tokens.

Even with stemming and some preprocessing (e.g. replacing patent numbers with a single placeholder token), I can only cut this vocabulary down to 1 million unique tokens.


This indicates that the vocabulary on the complete corpus will easily be in the millions of tokens.

This then quickly makes “word” token based systems impractical for real world datasets. For many RNN system you will need to use a modified softmax (e.g. sampled or hierarchical) on your output, and even these techniques may grind to a holt at dimensionalities north of 1 million.

The underlying issue is that word vocabularies have a set of 50-100k words that are used frequently and a very long tail of infrequent words.

Looking at the vocabulary is instructive. You quickly see patterns of inefficiency.


This is a big one, especially for more technical text sources where numbers turn up a lot. Each unique number that is found is considered a separate token.


This becomes a bigger problem with patent publications. Numbers occur everywhere, from parameters and variable values to cited patent publication references.

Any person looking at this quickly asks – why can’t numbers be represented in real number space rather than token space?

This almost becomes absurd when our models use thousands of 32 bit floating point numbers as  parameters – just one parameter value can represent numbers in a range of -3.4E+38 to +3.4E+38. As such you could reduce your dimensionality by hundreds of thousands of points simply by mapping numbers onto one or two real valued ranges. The problem is this then needs bespoke model tinkering, which is exactly what most deep learning approaches are trying to avoid.

Looking at deep learning papers in the field I can’t see this really being discussed. I’ve noticed that a few replace financial amounts with zeros (e.g. “$123.23” > “$000.00”). This then only requires one token per digit or decimal place. You do then lose any information stored by the number.

I have also noticed that some word embedding models end up mapping numbers onto an approximate number range (they tend to be roughly grouped into linear structures in embedding space). However, you still have the issue of large input/output dimensions for your projections and the softmax issue remains. There is also no guarantee that your model will efficiently encode numbers, e.g. you can easily image local minima where numbers are greedily handled by a largish set of hidden dimensions.

Capital Letters

As can be seen in the list of most common tokens set out above, a naive approach that doesn’t take into account capital letters treats capitalised and uncapitalised versions of a word as separate independent entities, e.g. “The” and “the” are deemed to have no relation in at least the input and output spaces.


Many papers and examples deal with this by lowering the case of all words (e.g. preprocessing using wordstring.lower()). However, this again removes information; capitalisation is there for a reason: it indicates acronyms, proper nouns, the start of sentences etc.

Some hope that this issue is dealt with in a word embedding space, for example that “The” and “the” are mapped to similar real valued continuous n-dimensional word vectors (where n is often 128 or 300). I haven’t seen though any thought as to why this would necessarily happen, e.g. anyone thinking about the use of “The” and “the” and how a skip-gram, count-based or continuous bag of words model would map these tokens to neighbouring areas of space.

One pre-processing technique to deal with capital letters is to convert each word to lowercase, but to then insert an extra control token to indicate capital usage (such as <CAPITAL>). In this case, “The” becomes “<CAPITAL>”, “the”. This seems useful – you still have the indication of capital usage for sequence models but your word vocabulary only consists of lowercase tokens. You are simply transferring dimensionality from your word and embedding spaces to your sequence space. This seems okay – sentences and documents vary in length.

(The same technique can be applied to a character stream to reduce dimensionality by at least 25: if “t” is 24 and “T” is 65 then a <CAPITAL> character may be inserted (e.g. index 3) “T” can become “3”, “24”.)

Hyphenation and Either/Or

We find that most word tokenisation functions treat hyphenated words as single units.


The issue here is that the hyphenated words are considered as a further token that is independent of their component words. However, in our documents we find that many hyphenated words are used in a similar way to their unhyphenated versions, the meaning is approximately the same and the hyphen indicates a slightly closer relation than that of the words used independently.

One option is to hope that our magical word embeddings situate our hyphenated words in embedding space such that they have a relationship with their unhyphenated versions. However, hyphenated words are rare – typically very low use in our long tail – we may only see them one or two times in our data. This is not enough to provide robust mappings to the individual words (which may be seen 1000s of times more often).

So Many Hyphens

An aside on hyphens. There appear to be many different types of hyphen. For example, I can think of at least: a short hyphen, a long hyphen, and a minus sign. There are also hyphens from locale-specific unicode sets. The website here counts 27 different types: . All can be used to represent a hyphen. Some dimensionality reduction would seem prudent here.

Another answer to this problem is to use a similar technique to that for capital letters: we can split hyphenated words “word1-word2” into “word1”, “<HYPHEN>”, “word2”, where “<HYPHEN>” is a special control character in our vocabulary. Again, here we are transferring dimensionality into our sequence space (e.g. into the time dimension). This seems a good trade-off. A sentence typically has a variable token length of ~10-100 tokens. Adding another token would seem not to affect this too much: we seem to have space in the time dimension.

A second related issue is the use of slashes to indicate “either/or” alternatives, e.g. “input/output”, “embed/detect”, “single/double-click”. Again the individual words retain their meaning but the compound phrase is treated as a separate independent token. We are also in the long tail of word frequencies – any word embeddings are going to be unreliable.


One option is to see “/” as an alternative symbol for “or”. Hence, we could have “input”, “or”, “output” or “embed”, “or”, “detect”. Another option is to have a special “<SLASH>” token for “either/or”. Replacement can be performed by regular expression substitution.

Compound Words

The concept of “words” as tokens seems straightforward. However, reality is more complicated.

First, consider the terms “ice cream” and “New York”. Are these one word or two? If we treat them as two independent words “ice cream” becomes “ice”, “cream” where each token is modelled separately (and may have independent word embeddings). However, intuitively “ice cream” seems to be a discrete entity with a semantic meaning that is more than the sum of “ice” and “cream”. Similarly, if “New York” is “New”, “York” the former token may be modelled as a variation of “new” and the latter token as entity “York” (e.g. based on the use of “York” elsewhere in the corpus). Again this seems not quite right – “New York” is a discrete entity whose use is different from “New” and “York”. (For more fun also compare with “The Big Apple” – we feel this should be mapped to “New York” in semantic space, but the separate entities “New”, “York”, “The”, “Big”, “Apple” are unlikely to be modelled as related individually.)

The case of compound words probably has a solution different from that discussed for hyphenation and slashes. My intuition is that compound words reflect features in a higher feature space, i.e. in a feature space above that of the individual words. This suggests that word embedding may be a multi-layer problem.

Random Characters, Maths and Made-Up Names

This is a big issue for patent documents. Our character space has a dimensionality of around 600 unique characters, but only about 100 are used regularly – again we have a longish tail of infrequent characters.


Looking at our infrequent characters we see some patterns: accented versions of common characters (e.g. ‘Ć’ or ‘ë’); unprintable unicode values (possibly from different character sets in different locales); and maths symbols (e.g. ‘≼’ or ‘∯’).

When used to construct words we end up with many variations of common word tokens (‘cafe’ and ‘café’) due to the accented versions of common characters.

Our unprintable unicode values appear not to occur regularly in our rare vocabulary. It thus appears many of these are local variations on control characters, such as whitespace. This suggests that we would not lose too much information with a pre-processing step that removed these characters and replaced them with a space (if they are not printable on a particular set of computers this suggests they hold limited utility and importance).

The maths symbols cause us a bit of a problem. Many of our rare tokens are parts of equations or variable names. These appear somewhat independent of our natural language processing – they are not “words” as we would commonly understand them.

One option is to use a character embedding. I have seen approaches that use between 16 and 128 dimensions for character embedding. This seems overkill. There are 26 letters, this is multiplied by 2 for capital letters, and there are around 10 common punctuation characters (the printable set of punctuation in Python’s string library is only of length 32). All the unicode character space may be easily contained within one real 32 bit floating point dimension. High dimensionality would appear to risk overfitting and would quickly expand our memory and processing requirements. My own experiments suggest that case and punctuation clustering can be seen with just 3 dimensions, as well as use groupings such as vowels and consonants. One risk of low dimensional embeddings is a lack of redundant representations that make the system fragile. My gut says some low dimensionality of between 1 and 16 dimensions should do. Experimentation is probably required within a larger system context. (Colin Morris has a great blog post where he looks at the character embedding used in one of Google Brain’s papers – the dimensionality plots / blogpost link can be found here:

What would we like to see with a character embedding:

  • a relationship between lower and upper case characters – preferable some kind of dimensionality reduction based on a ‘capital’ indication;
  • mapping of out of band unicode characters onto space or an <UNPRINTABLE> character;
  • mapping of accented versions of characters near to their unaccented counterpart, e.g. “é” should be a close neighbour of “e”;
  • clustering of maths symbols; and
  • even spacing of frequently used characters (e.g. letters and common punctuation) to provide a robust embedding.

There is also a valid question as to whether character embedding is needed. Could we have a workable solution with a simple lookup table mapping?

Not Words

Our word tokenisation functions also output a lot of tokens that are not really words. These include variable names (e.g. ‘product.price’ or ‘APP_GROUP_MULTIMEDIA’), websites (e.g. ‘’), path names (e.g. ‘/mmt/glibc-15/lib/’, and text examples (‘PεRi’). These are often: rare – occurring one or two times in a corpus; and document-specific – they often are not repeated across documents. They make up a large proportion of the long-tail of words.

Often these tokens are <UNK>ed, i.e. they are replaced by a single token that represents rare words that are not taken into account. As our token use follows a power law distribution this can significantly reduce our vocabulary size.


For example, the plot above shows that with no filtering we have 3 million tokens. By filtering out tokens that only appear once we reduce our vocabulary size by half: to 1.5 million tokens. By filtering out tokens that only appear twice we can reduce our dimensionality by another 500,000 tokens. Gains fall off as we up the threshold, by removing tokens that appear fewer than 10 times, we can get down to a dimensionality of around 30,000. This is around the dimensionality you tend to see in most papers and public (toy) datasets.

The problem with doing this is you tend to throw away a lot of your document. The <UNK> then becomes like “the” or “of” in your model. You get samples such as “The <UNK> then started to <UNK> about <UNK>”, where any meaning of the sentence is lost. This doesn’t seem like a good approach.

The problem of <UNK> is a fundamental one. Luckily the machine learning community is beginning to wake up to this a little. In the last couple of years (2016+), character embedding approaches are gaining traction. Google’s neural machine translation system uses morpheme-lite word ‘portions’ to deal with out of vocabulary words (see The Google Brain paper on Exploring the Limits of Language Modelling ( explores a character convolutional neural network (‘CharCNN’) that is applied to characters (or character embeddings).

What Have We Learned?

Summing up, we have found the following issues:

  • sentence segmentation is imperfect and produces noisy sentence lists;
  • some parts of a document such as titles may produce further noise in our sentence lists; and
  • word segmentation is also imperfect:
    • our vocabulary size is huge: 3 million tokens (on a dataset of only 100,000 documents);
    • numbers are a problem – they are treated as separate discrete units, which seems inefficient;
    • the concept of a “word” is fuzzy – we need to deal with compound words, hyphenation and either/or notation;
    • there are different character sets that lead to separate discrete tokens – e.g. capitals and accented letters – when common underlying tokens appear possible; and
    • non-language features such as equations and variable names contribute heavily to the vocabulary size.

Filtering out non-printable unicode characters would appear a good preprocessing step that has minimal effect on our models.

Character embedding to a low dimensional space appears useful.

Word/Character Hybrid Approach?

Looking at my data, a character-based approach appears to be the one to use. Even if we include all the random characters, we only have a dimensionality of 600 rather than 3 million for the word token space.

Character-based models would appear well placed to deal with rare, out-of-vocabulary words (e.g. ‘product.price’). It also seems much more sensible to treat numbers as sequences of digits as opposed to discrete tokens. Much like image processing, words then arise as a layer of continuous features. Indeed, it would be relatively easy to insert a layer to model morpheme-like character groupings (it would appear this is what the CNN approaches are doing).

The big issue with character level models is training. Training times are increased by orders of magnitude (state of the art systems take weeks on systems with tens of £1k graphic cards).

However, we do have lots of useful training data at the word level. Our 50 most common unfiltered word tokens make up 46% (!) of our data. The top 10,000 tokens make up 95% of the data. Hence, character information seems most useful for the long tail.


This then suggests some kind of hybrid approach. This could be at a model or training level. The 95% of common data would suggest we could train a system using common words as ground truth labels and configure this model to generalise over the remaining 5%. Alternatively, we could have a model that operates on 10k top tokens with <UNK>able words being diverted into a character level encoding. The former seems preferable as it would allow a common low-level interface, and the character information from the common words could be generalised to rare words.

I’ll let you know what I come up with.

If you have found nifty solutions to any of these issues, or know of papers that address them, please let me know in the comments below!