Natural Language Processing & The Brain: What Can They Teach Each Other?

Part I of II – What Recent Advances in Natural Language Processing Can Teach Us About the Brain

I recently worked out that the day of my birth will be closer to the end of the Second World War than the present day. This means I am living in the future, hooray!

Over the years I’ve been tracking (and experimenting with) various concepts in natural language processing, as well as reading general texts on the brain. To me both streams of research have been running in parallel; in the last 5 years, natural language processing has found a new lease of engineering life via deep learning architectures, and the empirical sciences have been slowly chipping away at cognitive function. Both areas appear to be groping different parts of the same elephant. This piece provides an outlet for the cross-talk in my own head. With advances in natural language processing coming thick and fast, it also provides an opportunity for me to reflect on the important undercurrents, and to try to feel for more general principles.

The post will be in two halves. This half looks at what recent advances in natural language processing and deep learning could teach us about intelligence and the functioning of the human brain. The next half will look at what the brain could teach natural language processing.


I’ll say here that the heavy lifting has been performed by better and brighter folk. I do not claim credit for any of the outlines or summaries provided here; my effort is to try to write things down in a way that make sense to my addled brain, in the hope that things may also make sense to others. I also do not come from a research background, and so may take a few liberties for a general audience.

In natural language processing, these are the areas that have stayed with me:

  • Grammars,
  • Language Models,
  • Distributed Representations,
  • Neural Networks,
  • Attention,
  • Ontologies, and
  • Language is Hard.

In the next section, we’ll run through these (at a negligent speed), looking in particular at what they teach their respective sister fields. If you want to dig deeper, I recommend as a first step the Wikipedia entry on the respective concept, or any of the links set out in this piece.

Let’s get started. Hold on.


Mention the term grammar to most people and they’ll wince, remembering the pain inflicted in English or foreign language lessons. A grammar relates to the rules of language. While we don’t always know what the rules are, we can normally tell when they are being broken.

I would say that a majority of people view grammar like god (indeed Baptists can very nearly equate the two). There is one true Grammar, it is eternal and unchanging, and woe betide you if you break the rules. Peek behind the curtain though and you realise that linguists have proposed over a dozen different models for language, and all of them fail in some way. 

So what does this mean? Are we stuck in a post-modern relativist malaise? No! Luckily, there are some general points we can make.

Most grammars indicate that language is not a string of pearls (as the surface form of words seems to suggest) but has some underlying or latent structure. Many grammars indicate recursion and fractal patterns of self-similarity, nested over hierarchical structures. You can see this here:

  • The ball.
  • The ball was thrown.
  • The red ball was thrown over the wall.
  • In the depths of the game, the red ball was thrown over the wall, becoming a metaphor for the collapse of social morality following the fall of Communism.

Also the absence of “one grammar to rule them all”, teaches us that our rulesets are messy, incomplete and inconsistent. There is chaos with constraint. This hints that maybe language cannot be definitively defined using language. This hints further at Gödel and Church. This doesn’t necessarily rule out building machines that parse and generate language, but it does indicate that these machines may not be able to follow conventional rule-based deterministic processing.

With the resurgence of neural approaches, grammars have gone out of fashion, representing the “conventional rule-based deterministic processing” that “does not work”. But we should not ignore their lessons. Many modern architectures do not seem to accurately capture the recursion and self-similarity, and it appears difficult to train different layers to capture the natural hierarchy. For example, a preferred neural network approach, which we will discuss in more detail below is the recurrent neural network. But this is performing gated repeated multiplication. This means that each sentence above is treated quite differently. This seems to miss the point above. While attention has helped, this seems to be a band-aid as opposed to a solution.

Language Models

A language model is a probabilistic model that seeks to predict a next word given one or more previous words. The “probabilistic” aspect basically means that we are given a list of probabilities associated with a list of candidate words. A word with a probability of 1 would indicate that a word was definitely next (if you are a Bayesian that you are sure this is the next word). A word with a probability of 0 would indicate that the word was definitely not next. A probability is a probability if all the possible outcomes add up to one, so all our probability values across our words need to do this.

In the early 2000s, big strides were made using so-called ‘n-gram‘ approaches. Translated, ‘n-gram’ approaches count different sequences of words. The “n” refers to a number of words in the sequence. If n=3, we count different sequences of three words and use their frequency of occurrence to generate a probability. Here are some examples:

  • the cat sat (fairly high count)
  • he said that (fairly high count)
  • sat cat the (count near 0)
  • said garbage tree (count near 0)

If we have enough digital data, say by scanning all the books, then we can get count data indicating the probabilities of millions of sequences. This can help with things such as spell checking, encoding, optical character recognition and speech recognition.

We can also scale up and down our ‘n-gram’ models to do things like count sequences of characters or sequences of phonemes instead of words.

Language models were useful as they introduced statistical techniques that laid the groundwork for later neural network approaches. They offered a different perspective from rule-based grammars, and were well suited to real-world data that was “messy, incomplete and inconsistent”. They showed that just because a sentence fails the rules of a particular grammar, it does not mean it will not occur in practice. They were good for classification and search: it turns out that there were regular patterns behind language that could enable us to apply topic labels, group documents or search.

Modern language models tend to be built not from n-grams but using recurrent neural networks, such as one or bi- directional Long Short Term Memories (LSTMs). In theory, the approaches are not dissimilar, the LSTMs are in effect counting word sequences and storing weights that reflect regular patterns within text. There are just adding a sprinkling of non-linearity.

Like all models, language models have their downsides. A big one is that people consistently fail to understand what they can and cannot do. They show the general patterns of use, and show where the soft boundaries in language use lie. It turns out that we are more predictable than we think. However, they are not able, on their own, to help us with language generation. If you want to say something new, then this is by its nature going to be of low probability. They do not provide help for semantics, the layer of meaning below the surface form of language. This is why LSTMs can produce text that at first glance seems sensible, with punctuation, grammatically correct endings and what seem like correct spellings. But look closely and you will see that the text is meaningless gibberish.

Quite commonly the answer to the current failings of recurrent neural networks has been to add more layers. This does seem to help a little, as seen with models such as BERT. But just adding more layers doesn’t seem to provide a magic bullet to the problems of meaning or text generation. Outside of the artificial training sets these models still fail in an embarrassing manner.

It is instructive to compare the failures of grammars and language models, as they both fail in different ways. Grammars show that our thoughts and speech have non-linear patterns of structure, that there is something behind language. Language models show that our thoughts and speech do not follow well-defined rules, but do show statistical regularity, normally to an extent that surprises us “free” human agents.

Distributed Representations

Distributed representations are what Geoff Hinton has been banging on about for years and for me are one of the most important principles to emerge from recent advances in machine learning. I’m lying a little when I link them to natural language processing as they originally came to prominence in vision research. Indeed, much of the work on initial neural networks for image recognition was inspired by the earlier neuroscience of Nobel Prize Winners Hubel and Wiesel.

Distributed representations mean that our representations of “things” or “concepts” are shared among multiple components or sub-components, where each component or sub-component forms part of numerous different “things” or “concepts”.

Put another way, it’s a form of reductionist recycling. Imagine you had a box of Lego bricks. You can build different models from the bricks, where the model is something more than the underlying bricks (a car is more than the 2×8 plank, the wheels, those little wheel arch things etc.). So far, so reductionist. The Greeks knew this several millennia ago. However, now imagine that each Lego brick of a particular type (e.g. red 2×1 block, the 2×8 plank, each white “oner”) is the same brick. So all your models that have use red 2×1 blocks use the same red 2×1 block. This tend to turn your head inside out. Of course, in reality you can’t be in two places at the same time, but you can imagine your brain assembling different Lego models really quickly in sequence as we think about “things” (or even not “things”, like abstractions or actions or feelings).

This is most easily understood when thinking about images. This image from Wei et al at MIT might help:

Kindly reproduced from Wei et al here.

In this convolutional neural network, the weights of each layer are trained such that different segments of each layer end up representing different aspects of a complex object. These segments form the “Lego bricks” that are combined to represent the complex object. In effect, the segments reflect different regular patterns in the external environment, and different objects are represented via different combinations of low-level features. As we move up the layers our representations become more independent of the actual sensory input, e.g. they are activated even if lighting conditions change, or if the object moves in our visual field.

Knowing this, several things come to mind with regard to language:

  • It is likely that the representations that relate to words are going to follow a similar pattern to visual objects. Indeed, many speech recognition pipelines use convolutional neural networks to decipher audio signals and convert this to text. This form of representation also fits with the findings from studying grammars: we reuse semantic and syntactic structures and the things we describe can be somewhat invariant of the way we describe them.
  • Our components are going to be hard to imagine. Language seems to come to us fully-formed as discrete units. Even Plato got confused and thought there was some magical free-floating “tree” that existed in an independent reality. We are going to have to become comfortable describing the nuts and bolts of sentences, paragraphs and documents using words to describe things that may not be words.
  • For images, convolutional neural networks are very good at building these distributed representations across the weights of the layer. This is because the convolution and aggregation is good at abstracting over two-dimensional space. But words, sentences, paragraphs and documents are going to need a different architecture; they do not exist in two-dimensional space. Even convolutional neural networks struggle when we move beyond two dimensions into real-world space and time.

Neural Networks

Neural networks are back in fashion! Neural networks have been around since the 1950s but it is only recently we have got them to work in a useful manner. This is due to a number of factors:

  • We now have hugely fast computers and vastly greater memory sizes.
  • We worked out how to practically perform automatic differentiation and build compute graphs.
  • We began to have access to huge datasets to use for training data.

The best way to think of neural network is that they implement differentiable function approximators. Given a set of (data-X, label-Y) pairs neural networks perform a form of non-linear line fitting that maps X>Y.

Within natural language processing, neural networks have out performed comparative approaches in many areas, including:

  • speech processing (text-to-speech and speech-to-text);
  • machine translation;
  • language modelling;
  • question-answering (e.g. simple multiple choice logic problems);
  • summarisation; and
  • image captioning.

In the field of image processing, as set out above, convolutional neural networks rule. In natural language processing, the weapon of choice is the recurrent neural network, especially the LSTM or Gated Recurrent Unit (GRU). Often recurrent neural networks are applied as part of a sequence-to-sequence model. In this model an “encoder” receives a sequence of tokens and generates a fixed-size numeric vector. This vector is then supplied to a “decoder”, which outputs another sequence of tokens. Both the encoder and the decoder are implemented using recurrent neural networks. This is one way machine translation may be performed.

Neural networks do not work like the brain. But they show that a crude model can approximate some aspects of cortical function. They show that it is possible to build models of the world by feeding back small errors between our expectations and reality. No magic is needed.

The limitations of neural networks also show us that we are missing fairly large chunks of the brain puzzle – intelligence is not just sections of cortex. Most of the progress in the field of machine learning has resulted from greater architectural complexity, rather than any changes to the way neural networks are trained, or defined. At the moment things resemble the wild west, with architectures growing based on hunches and hacking. This kind of shifts the problem: the architectures still need to be explicitly defined by human beings. We could do with some theoretical scaffolding for architectures, and a component-based system of parts.

Most state of the art neural network models include some form of attention mechanism. In an over-simplified way, attention involves weighting components of an input sequence for every element in an output sequence.

In a traditional sequence-to-sequence system, such as those used for machine translation, you have an encoder, which encodes tokens in an input sentence (say words in English), and a decoder, which takes an encoded vector from the encoder and generates an output sentence (say words in Chinese). Attention models sought to weight different encoder hidden states, e.g. after each word in the sentence, when producing each decoder state (e.g. each output word).

A nice way to think of attention is in the form of values, keys and queries (as explained here).

In a value/key/query attention model:

  • the query is the decoder state at the time just before replicating the next word (e.g. a a given embedding vector for a word);
  • the keys are items that together with the query are used to determine the attention weights (e.g. these can be the encoder hidden states); and
  • the values are the items that may be weighted using attention (e.g. these can also be the encoder hidden states).

In the paper “Attention is All You Need”, the authors performed a nifty trick by leading with the attention mechanism and ditching some of the sequence-to-sequence model. They used an attention function that used a scaled dot-product to compute the weighted output. If you want to play around with attention in Keras this repository from Philippe Rémy is a good place to start.

Adam Kosiorek provides a good introduction to attention in this blogpost. In his equation (1), the feature vector z is equivalent to the values and the keys are the input vector x. A value/key/query attention model expands the parameterised attention network f(…) to be a function of both two input vectors: the hidden state of the decoder (the query Q) and the hidden states of the encoder (the keys K) – a = f(Q, K). The query here changes how the attention weights are computed based on the output we wish to produce.

Now I have to say that many of the blogposts I have read that try to explain attention fail. What makes attention difficult to explain?

Attention is weighting of an input. This is easy to understand in a simple case: a single, low-dimensionality feature vector x, where the attention weights are calculated using x and are in turn applied to x. Here we have:

a = f(x)


g = ax.

g is the result of applying attention, in this simple example a weighted set of inputs. The element-wise multiplication ⊙ is simple to understand – the kth element of g is computed as g_k = a_k*x_k (i.e. an element of a is used to weight a corresponding element of x). So attention, put like this, is just the usual weighting of an input, where the weights are generated as a function of the input.

Now attention becomes more difficult to understand as we move away from this simple case.

  1. Our keys and values may be different (i.e. x and z may be different sizes with different values). In this case, I find it best to consider the values (i.e. z) as the input we are weighting with the attention weights (i.e. a). However, in this case, we have another input – the keys (or x) – that are used to generate our attention weights, i.e. a=f(x). In sequence-to-sequence examples the keys and values are often the same, but they are sometimes different. This means that some explanations conflate the two, whereas others separate them out, leading to bifurcated explanations.
  2. In a sequence-to-sequence model our keys and/or values are often the hidden states of the encoder. Each hidden state of the encoder may be considered an element of the keys and/or values (e.g. an element of x and/or z). However, encoders are often recurrent neural networks with hidden dimensions of between 200 and 300 values (200-300D), as opposed to single dimension elements (e.g. 1D elements – [x1, x2, x3…]). Our keys and values are thus matrices rather than arrays. Each element of the keys and/or values thus becomes a vector in itself (e.g. x1 = [x11, x12, x13…]). This opens up another degree of freedom when applying the attention weights. Now many systems treat each hidden state as a single unit and multiply all the elements of the hidden state by the attention weight for the hidden state, i.e. gk = ak*[xk1, xk2, xk3…] = [akxk1, akxk2, akxk3…..]. However, it is also possible to apply attention in a multi-dimensional manner, e.g. where each attention weight is a vector that is multiplied by the hidden state vector. In this case, you can apply attention weights across the different dimensions of the encoder hidden state as well as apply attention to the encoder hidden state per se. This is what I believe creates much of the confusion.
  3. Sequence-to-sequence models often generate the attention weights as a function of both the encoder hidden states and the decoder hidden states. Hence, as in the “Attention is All You Need” paper the attention weights are computed as a function of a set of keys and a query, where the query represents a hidden state of a decoder at a time that a next token is being generated. Hence, a = f(Q, K). Like the encoder hidden state, the decoder hidden state is often of high dimensionality (e.g. 200-300D). The function may thus be a function of a vector Q and a matrix K. “Attention is All You Need” further confuses matters by operating on a matrix Q, representing hidden states for multiple tokens in the decoder. For example, you could generate your attention weights as a function of all previous decoder hidden states.
  4. Many papers and code resources optimise mathematical operations for efficiency. For example, most operations will be represented as matrix multiplications to exploit graphical processing unit (GPU) speed-ups. However, this often acts to lose some logical coherence as it is difficult to unpick the separate computations that are involved.
  5. Attention in a sequence-to-sequence model operates over time via the backdoor. Visual attention models are easier to conceptually understand, as you can think of them as a focus over a particular area of a 2D image. However, sentences represent words that are uttered or read at different times, i.e. they are a sequence where each element represents a successive time. Now, I haven’t seen many visual attention models that operate over time as well as 2D (this is in effect a form of object segmentation in time). The query in the attention model above thus has a temporal aspect, it changes the attention weights based on a particular output element, and the hidden state of the decoder will change as outputs are generated. However, “t” doesn’t explicitly occur anywhere.

Unpacking the confusion also helps us see why attention is powerful and leads to better results:

  • In a sequence-to-sequence model, attention changes with each output token. We are thus using different data to condition our output at each time step.
  • Attention teaches us that thinking of cognition as a one-way system leads to bottlenecks and worse results. Attention was developed to overcome the constraints imposed by trying to compress the meaning of a sentence into a single fixed-length representation.
  • In certain sequence-to-sequence models, attention also represents a form of feedback mechanism – if we generate our attention weights based on past output states we are modifying our input based on our output. We are getting closer to a dynamic system – the system is being applied iteratively with time as an implicit variable.
  • Visualisations of attention from papers such as “Show, Attend and Tell” –  and machine translation models seem to match our intuitive notions of cognition. When a system is generating the word “dog” the attention weights emphasise the image areas that feature a dog. When a system is translating a compound noun phrase it tends to attend jointly to all the words of the phrase. Attention can thus be seen as a form of filtering, it helps narrow the input to conditionally weigh an output.
  • Attention is fascinating because it suggests that we can learn a mechanism for attention separately from our mapping function. Our attention function f(…) is a parameterised function where we learn the parameters during training. However, these parameters are often separate from the parameters that implement the encoder and decoder.

Attention appears to have functional overlaps with areas of the thalamus and basal ganglia which form a feedback loop between incoming sensory inputs and the cortex. Knowing about how attention works in deep learning architectures may provide insight into mechanisms that could be implemented in the brain.


In the philosophical sense, an ontology is the study of “being”, i.e. what “things” or “entities” there are in the world, how they exist, what they are and how they relate to each other.

In a computer science sense, the term “ontology” has also been used to describe a method of organising data to describe things. I like to think of it representing something like a database schema on steroids. Over the last few decades, one popular form of an “ontology” has been the knowledge graph, a graph of things and relationships represented by triples, two “things” connected by a “relationship”, where the “things” and the “relationship” form part of the ontology.

Ontologies are another area that has faded out of fashion with the resurgence of deep neural networks. In the early 2000s there was a lot of hype surrounding the “semantic web” and other attempts to make data on the Internet more machine interpretable. Projects like DBpedia and standard drives around RDF and OWL offered to lead us to a brave new world of intelligent devices. As with many things they didn’t quite get there.

What happened? The common problem of overreach was one. Turns out organising human knowledge is hard. Another problem was one shared with grammars, human beings were trying to develop rule-sets, conventions and standards for something that was huge and statistical in nature. Another was that we ended up with a load of JAVA and an adapted form of SQL (SPARQL), while the Internet and research, being stubborn, decided to use hacky REST APIs and Python.

However, like grammars, ontologies got some things right, and we could do with saving some of the baby from the bathwater:

  • Thinking about things in terms of graphs and networks seems intuitively right. The fact that ontologies are a useful way to represent data says something in itself about how we think about the world.
  • It turns out that representing graph data as sets of triples works fairly well. This may be useful for further natural language processing engineering. This appears to reflect some of the fractal nature of grammars, and the self-similarity seen in language.
  • Ontologies failed in a similar way to grammars. Neural networks have taught us that hand-crafting features is “not the way to go”. We want to somehow combine the computing and representational aspects of ontologies, with learnt representations from the data. We need our ontologies to be messier. No one has quite got there yet, there have been graph convolutional networks but the maths is harder and so they form a niche area that is relatively unknown.
  • The “thing”, “relationship/property” way of thinking seems to (and was likely chosen to) reflect common noun/verb language patterns, and seems to reflect an underlying way of organising information in our brains, e.g. similar to the “what” and “where” pathways in vision or the “what” and “how” pathways in speech and motor control.

Language is Hard

To end, it is interesting to note that the recent advances in deep learning started with vision, in particularly image processing. Many attempted to port across techniques that had been successful in vision to work on text. Most of the time this failed. Linguists laughed.

For example, compare the output of recent Generative Adversarial Networks (GAN) with that of generative text systems.  There are now many high-resolution GAN architectures but generative text systems struggle with one or two coherent sentences and collapse completely over a paragraph. This strongly suggests that language is an emergent system that operates on top of vision and other sensory modalities (such as speech recognition and generation). One reason why deep learning architectures struggle with language is that they are seeking to indirectly replicate a very complex stack using only the surface form of the stack output. 

Take object persistence as another example. Natural language processing systems currently struggle with entity co-reference that a 5-year old can easily grasp, e.g. knowing that a cat at the start of the story is the same cat at the end of a story. Object persistence in the brain is likely based on at least low-level vision and motor representations. Can we model these independently of the low-level representations?

The current trend in natural language processing is towards bigger and more complex architectures that excel on beating benchmarks but generally fail miserably on real-world data. Are we now over-fitting in architecture space? Maybe one solution is to take a more modular approach where we can slot in different sub-systems that all feed into the priors for word selection. 

In part two, we will look at things from the other side of the fence. We review some of the key findings in neuro- and cognitive science, and have a look at what these could teach machine learning research.


Rambles about Language Models

“But it must be recognised that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.”

Chomsky (1969)

Are language models a waste of time?

I recently found this post in my drafts, having written it over the Christmas period in 2017. Having talked with several technologists about so-called “AI”, I’ve realised that there is a wide public misconception about language models and how they are used. In this post I try to explain to myself, and anyone who feels like reading it, the properties of language models and their limits.

What is a language model?

A language model can be defined as a system that computes probabilities for sequences of tokens:


where the output is a probability value between 0 and 1 and t is a vector containing a sequence of tokens, t_1, t_2... t_n.

Languages have multiple levels of abstraction. The tokens can thus be:

  • characters (e.g. ‘a’, ‘.’, ‘ ‘);
  • words (e.g. ‘cat’, ‘mat’, ‘the’); or
  • larger word groups such as sentences or clauses.

So in one sense, a language model is a model of the probability of a sequence of things. For example, if you are given a particular set of words “the cat sat on the mat“, the language model can provide a probability value between 0 and 1 representing the likelihood of those tokens in the language.

Conditional Models

Another form of language model is a conditional token model. This looks to predict tokens one at a time given a set of previous tokens.

P(t_n|t_1, t_2, ... t_{n-1})

Here we have a function that provides a probability value for a particular token at position n in the sequence, given (the symbol “|“) a set of preceding symbols t_1, t_2... t_{n-1}.

Normally, we have a set of possible token values. For example, we might have a dictionary of words, where a token t can take one of the values in the dictionary of words. Many practical language models have a fixed-size dictionary, sometimes also called a “vocabulary“. This may be 10,000 or 100,000 words. Most people have an active vocabulary (i.e. a vocabulary that they can use in expressions) of between 10,000 and 40,000 words (depending on things like education and literacy). The Oxford English Dictionary has entries for around 170,000 words. So our probability function outputs an array or vector of values, representing a probability value for each token value in the dictionary, where the probability values over the complete dictionary sum to 1.

If you hear phrases such as “local effects” and “long range dependencies”, these relate to the number of tokens we need to add into the language model to predict the next token. For example, do we need all the previous tokens in a document or just the last few?

Language as Numbers

Now computers only understand numbers. So, t in many practical language models isn’t really a sequence of token values (such as words), it’s a sequence of numbers, where each number represents an entry in a dictionary or vocabulary. For example, you may have:

“hello”: 456 – the 456th entry in an array of 10,000 words.

“world”: 5633 – the 5633th entry in an array of 10,000 words.

So “hello world” = [456, 5633].

Building Language Models

Language models are typically constructed by processing large bodies of text. This may be a set of documents, articles in Wikipedia, or all webpages on the Internet. Hence, the output probability can be seen as a likelihood for a sequence of tokens, or a next token, based on some form of historical data.

For example, over a corpus of English text, “the cat is black” would have a higher probability value (e.g. 0.85) than “dog a pylon” (e.g. 0.05). This form of model is actually present in most pockets – it is difficult to type the last example on a smartphone, as each word is autocorrected based on a likely word given the characters.

A simple language model can be built using n-grams. N-grams are sequences of tokens of length n. Before the most recent comeback of neural network models, state of the art transcription and translation systems were based on n-grams. You can generate an n-gram model by simply counting sequences of tokens. For example, “the cat is black” contains 3 bi-grams (n=2) – “the cat”, “cat is”, “is black”. Over a large enough corpus of text “the cat” will occur more times than “cat the” and so “the cat” can be assigned a higher probability value proportional to its frequency.

The latest state of the art language models use recurrent neural networks. These networks are parameterised by a set of weights and biases and are trained on a corpus of data. Using stochastic gradient descent and back propagation on an unrolled network, values for the parameters can be estimated. The result is similar to the n-gram probabilities, where the frequency of relationships between sequences of characters influence the parameter values.

What Language Models Are Not

Over the last 10 or 20 years there have been great strides in analysing large bodies of text data to generate accurate language models. Projects such as Google books and the Common Crawl have built language models that cover a large proportion of the written word generated by human beings. This means we can now fairly accurately provide bounds of how likely a sequence of tokens is.

However, issues often when people naively try to use language models to generate text. It is often the case that the most likely sentence, given a corpus, is not the sentence we want or need. Indeed, for a sentence to have high (human) value it often needs to express something new, and so it will diverge from the corpus of past sentences. Hence, a “good” sentence (from a human task perspective) may have a lower probability in our language model than a “common” sentence.

As an exercise for you at home, try to dictate an unusual sentence into your phone. The likely outcome is that the phone tries to generate the most likely sentence based on history rather than the sentence you want to generate.

You also see this with toy implementations of recurrent neural networks such as CharRNN and its varieties. These networks seek to estimate a probability distribution over a vocabulary of terms; what is being generated is likely sequences of tokens given the training data. These toy implementations are what the popular press pick up on as “AI” writers. However, they are nothing more than sequences of likely tokens given a dataset.

Often the toy implementations appear to be smart because of the stochastic nature of the probabilistic models – each sequence will be slightly different due to probabilistic sampling of the token probabilities (plus things like searches over the probabilities). Hence, you will get a slightly different output every time, which looks more natural than a single most likely sentence. However, a closer reading shows that the outputs of these systems is gibberish.

So language models are not generative models, at least not in their popular form.

How Do You Generate Text?

Another exercise: tell me a sentence. Any sentence whatsoever.

It’s harder than it looks. People will generally do one of several things:

  1. pick archetypal sentences “the quick brown fox…” “the X did Y”, where most of these are conditionally learnt at a young age;
  2. pick a sentence based on what they were doing immediately prior to the question; or
  3. look around and pick something in their field of view. The experiment is even more fun with children, as the thought processes are often more transparent.

This test demonstrates a simple truth: the concept of a “random” sentence rarely occurs in practice. All language is conditional. It is easier to provide a sentence about something.

Here are some things (amongst a near infinite multitude of things) that influence what words we select:

  • Document type (report, blog post, novel);
  • Audience (children, adults, English professors);
  • Location in document (start, middle, end);
  • Characters and character history;
  • Country;
  • Previous sentences/paragraphs/chapters; and
  • Domain (engineering, drama, medical).

The better we get at modelling what we are talking about, the better our generative language models.

This is partially seen with summarisation systems. Some of these produce pretty coherent text. The reason? The context is severely constrained by the content of the piece of writing we are summarising.

Distributed Sensory Representations

There is more. Vision, sound and motor control can teach us a lot about language. Indeed different facets of language have piggybacked on the underlying neural configurations used for these abilities. All these areas have distributed hierarchical representations. Complex features are represented by fuzzy combinations of lower features, and at the bottom you have raw sensory input. There is no neuron for “banana” but a whole series of activations (and activation priming) for different aspects of a “banana”. Visualisations of feature layers in convolutional neural network architectures show how complex object representations may be constructed from a series of simple features, such as edges. It is likely that a semantic representation of a word in our brains is similarly constructed from webs of component representations over several layers.

Yet another question: how does a person blind from birth imagine an orange?

I don’t know the answer to this. (I need to find some research on it.) I’d hazard to guess that the mental representation is built from many non-visual sensory representations, where these may be more detailed than an average sighted person. But the key is they still “know” what an orange is. Hence our semantic representations are distributed over different sensory modalities as well as over different layers of complexity.

So I believe we are getting closer to useful generative language models when we look an systems that produce simple image caption labels. These systems typically use a dense vector representation of an image that is output by a convolutional neural network architecture to condition a recurrent neural network language model. The whole system is then trained together. Here the dense vector provides an “about” representation that allows the language model to pick the most likely words, given the image. The surreal errors these systems make (a baseball bat is a “toothbrush”) show the limitations of the abstract representations conditioning the text generation.

Another issue that tends to get ignored by academic papers I have seen is the limitation of selecting a particular input representation. Many systems start with clean, easily tokenised text sources. The limited time scales of research projects means that words are often picked as the input layer. Hence, the language model looks at providing word probabilities over a vocabulary. Often word embeddings are used on this input, which introduces some aspects of correlation in use. However, in our brains, words seem to be an intermediate representation; they are features built upon sounds, phonemes and lower symbols (e.g. probably at least several layers of representations). Given that language is primarily oral (writings is a relatively new bolt on), I’d hazard that these lower levels influence word choice and probability. (For example, why do you remember “the cat on the mat” more than “the cat on the carpet”?) Word embeddings help to free us from the discrete constraints of words as symbols but they may be applying use patterns too early in the layers of representations.

Looking at how motor activity is controlled in the brain, we find that our cortex does not store detailed low-level muscle activation patterns. Through training these patterns are often pushed out of the cortex itself, e.g. into the cerebellum, spinal cord or peripheral nervous system. Also we find that, if practised enough, fairly complex sequences may be encoded as a small number of cortical representations. This appears to apply to language generation as well, especially for spoken language. Our conversations are full of cliches and sayings (“at the end of the day”). Within the cortex itself, brain activity appears to cascade from higher levels to lower levels but with feedback between the layers during language generation, e.g. we translate a representation of an object into a series of sounds then a series of muscle activations.

So in our brain:

  • There is structure in language (otherwise it would be incomprehensible).
  • Comprehension arrives through shared conventions.
  • These shared conventions are fuzzy – as they are shaped through social use different contradictory rules may apply at the same time.
  • The structure of language at least partially reflects preferred methods of information representation and organisation in the human cortex.

This is just a quick run through of some of my thinking on this point. What you should take home is that language models are useful in many engineering applications but they are not “artificial intelligence” as believed by many.

How do we remember characters?

I have a question of cognitive science: how do we hold in our minds the combination of characteristics that make up a particular object or character?

How do we then keep that specific combination in mind and consistent over the span of a narrative?

Consistency is a hard one. Characters may be separated by several pages or chapters but we still expect consistency.

Possible clues:

  • Working memory – e.g. similar to remembering a phone number. We can remember about 3-5 discrete items – but often our characters or objects are of greater complexity than this. Chunking has been seen to be a way of holding more representations in mind; where we combine multiple properties under one limited placeholder representation.
  • It does not seem possible that combinations are stored via changes in synaptic configurations, at least in the short term. What may be the case is that we make use of slowly changeable underlying representations, which we modify (“delta”) and combine.
  • There would seem to be a hierarchy of features, which dictates what needs to be remembered for consistency (e.g. alive or dead is magnitudes more important than repetition of character thought). It maybe that consistency consists of presence or absence of clearly separable pre-constructed representations (e.g. we know alive and dead are clearly different so we just need to link one of those pre-existing concepts).
  • Hot and cold recall – it is a well known fact that it is easier to recall primed or pre-activated representations. This tends to fade but may have effects over a time scale of minutes up to an hour. It would appear that this is a relatively broad-brush effect – it is reported that related representations are also more easily retrieved. Cold recall is probably more influenced by repetition and longer-term changes in synaptic configurations.
  • Combinations of characteristics appear to be based on higher level sensory representations, which in turn may be constructed from lower level sub-components, all with affective weightings. Anecdotally, many of my representations appear to be primarily visual – is there any research on how those that are blind from birth “imagine” objects?
  • Imagination and remembering appear to use the same underlying mechanisms – they all involve the “top down” activation of portions of neuronal representations based on an internal driver that is (semi) independent of the current incoming sensory streams. This is all happening in time within nested feedback loops – locking onto a harmonic signal is a useful analogy.
  • Is imagination, i.e. the ability to create new combinations of characteristics, uniquely human? If so what is it about the human brain that allows this? Evidence suggests a looser coupling between sensory input and representation, or the fact that higher level features can be activated somewhat independently of lower level circuits.
  • The prefrontal cortex is keep to the initial top-down control and the coordination of timing. But imagination and manipulation of representations appears to require the coordination of the whole brain. Often ignored structures such as the thalamus and basal ganglia are likely important for constructing and maintaining the feedback loop.
  • Although we feel an imagined combination appears instantaneously, this may be just how we perceive it – the feedback loops may activate over time in an iterative manner to “lock onto” the combination such that it appears in our minds.

Reflections on “Meaning”

Existential “meaning” is partly the telling of a story featuring ourselves that is available and consistent with our higher-level representations of the world. It is not, generally, rational; it is more a narrative correlated with a feeling of “selfness” and “correctness”.

For example, think of how you feel when you hear your own internal voice as opposed to hearing another person speak. You feel that the internal voice is somehow “you”. This is not a rational thought, indeed the language of rational thought may be seen, in part, to *be* the internal voice. This feeling breaks down in certain brain diseases, such as schizophrenia. With these diseases, the “me” feeling is lost or broken, and hence the internal voice of “you” becomes an auditory hallucination, a voice of “them”.

Concentrate on this feeling of “you” for a moment, try to explore what it feels like.

Now think about a feeling of “correctness”. This can also be seen as a feeling of “truthiness”. For example, try to concentrate on how the feeling of “1 + 1 = 2”, differs from the feeling of “1 + 1 = 5”. The latter invokes a feeling of uneasiness, an itching to correct. It has tones of unpleasantness. It induces a slight anxiety, a feeling that action is needed. The former invokes a feeling of contentment, that no action is required; it may be contemplated for an extended period without unease of additional thought. It’s a similar feeling to artistic “beauty”, the way we can contemplate a great painting or a landscape.

Both feelings may arise from a common mechanism in the cingulate cortex, a medial layer of cortex that sits between older brain structures such as the thalamus and the higher cortical layers. Indeed, certain forms of schizophrenia have been traced back to this structure. The cingulate cortex may be considered to be the emotional gateway between complex neural representations in the upper cortex and structures that manage low-level sensory input and co-ordinate physiology response. “Sensory input” in this context also includes “gut feeling”, sensory input from internal viscera. Work, such as that performed by Antonio Damasio, shows that this input is important for the embodied feeling of self, e.g. “self” may in part be a representation formed from signals from these viscera. In a not dissimilar manner, “correctness” may be based on a representation of error or inconsistency between a sequence of activated cortical representations. At a very naive level this could be built from familiarity, e.g. at a statistical level, does this sequence match previously activated sequences? Over a human life this is a “big data” exercise.

So, back to “meaning”. Our higher-level cortical systems, e.g. the frontal lobes, create narratives as patterned sequences of goal-orientated social behaviour, which may be expressed in various media (stories, plays, comics, dance, songs, poems etc). For “meaning” to be present, we are looking for strong positive correlations between these narratives and the emotional representations of “self” and “correctness”. What form could these correlations take?

First, let’s look at “goal-orientated behaviour”. The frontal lobes build representations of sequences of other representations. These sequences can represent “situation, action, outcome”. There is some overlap with the methods of reinforcement learning. Over time we learn the patterns that these sequences tend to match (google Kurt Vonnegut’s story graphs). The frontal lobes are powerful as they can stack representations over one to seven layers of cortex, allowing for increasing abstraction. Narratives are thus formed from hierarchical sequences of sequences. (There may also be a bottom-up contributions from lower brain structures such as the basal ganglia, which represents more explicit cause-effect pairings without abstract, e.g. “button tap: cocaine”.)

Second let’s look at activities that are widely reported to be “meaningful”, and those that are not. “Meaningful” activities tend to be pro-social. For example, imagine you are an artist on your deathbed; what feels more meaningful, that you produced a work of art seen by no one or that you produced a work of art seen by millions? I’d hazard that the second scenario provides greater “meaning”. We need a sense that we have affected others in a positive manner. Similarly, does “1+1=2” feel “meaningful”? Do your tax returns feel “meaningful”? Does the furniture in a new home feel “meaningful”? I’d hazard “no”. These things do not elicit a strong emotional reaction. The furniture example is a good one; the furniture in your family home may come to have “meaning”, but only because it forms the background of your social memories. We are social animals, like parrots, dolphins or baboons, and so the social realm forms a bedrock to our emotional states.

For correlations to stick in the brain we need two things: 1) for correlations to be present in the outside world (or at least some situations that form a sensory base to those correlations); and 2) for us to regularly experience these external situations. Here “regularly” means at a daily or at least weekly.

Religions have long been aware of these aspects, indeed we often define “religion” as a structured practice built on a common mythological framework. It is widely reported that it is not possible to feel “faith” without practice. In Islam this is explicit, “Islam” means to submit or surrender; you have to practice to believe. The structure of religion provides the regular experience of 2): daily prayers, weekly worship, and annual festivals.

To provide correlations between feelings of “self” and “correctness” and particular narratives, we need to experience them all collectively. It is important that the narratives are at least analogous to our daily experience. If they are not, we cannot experience them as being “correct” or “true”. All of this also needs to take place below a level of conscious awareness.

Again, we can learn a lot from religion. Rituals light up the brain of those experiencing them. You are the one experiencing the ritual, you are taking in heavy sensory stimulation, and are performing actions within the world. Rituals often involve singing, collective and stylised movement, repetition of motifs. These activate common neural representations each time the ritual is performed. The connections that are formed fuse the self and the experience.

The last part of the puzzle involves fusing the narrative and the experience. The self is thus fused with the narrative via the repeated experience.

In religion, rituals tell a story. The Eucharist is rooted in the story of last summer, Passover the liberation of the Israelites, and Ramadan commemorates the receipt of the Quran. It is important that the faithful act in a manner that is consistent with the story. Although many stories are based on an echo of history, historical fact is not important. More important is that the story, or an abstraction of the story, mirrors experience outside of the ritual. For example, many religious stories are based on familial relations, which most can instantly relate to. Many religious stories acknowledge suffering and struggle, as well as moments of joy and exhilaration, which again people regularly feel in their daily lives. The stories are also dynamic, they emerge from history through retelling, emphasis, interpretation. Like our own memories they are recreated every time they are retold. A key role of clergy is to draw parallels between these stories and our daily tribulations.

Having considered these points, we can see why the rational secularism of modernity often leaves people cold and lacking meaning. Science is not a vehicle for creating human meaning. Indeed, I would go as far to say that the factors that make science successful move us in a direction away from meaning. The workaday stories of science need to be sterile for science to work; they need to be objective, unbiased and unemotional. They then provide us with predictive power. But predictive power is not emotional resonance. The predictive stories of science, equations and theories, are not human-centric in a way that matches what we feel. If they were, we would all be reading scientific papers as opposed to watching Netflix.

If people lack meaning in their lives, they quickly fall into nihilism and despair. The challenge of Western post-modernism, having largely ditched religion, is thus to fill the void and create something we can use in our daily lives. Science and engineering could provide the tools and understanding, but they will not provide the solution themselves.

Getting All the Books

This is a short post explaining how to obtain over 50,000 text books for your natural language processing projects.

books on bookshelves
Photo by Mikes Photos on

The source of these books is the excellent Project Gutenberg.

Project Gutenberg offers the ability to use sync the collection of books. To obtain the collection you can set up a private mirror as explained here. However, I’ve found that a couple of tweaks to the rsync setup can be useful.

First, you can use the --list only option in rsync to first obtain a list of files that will be synced. Based on this random Github issue comment, I initially used the command below to generate a list of the files on the UK mirror server (based at the University of Kent):
rsync -av --list-only | awk '{print $5}' > log_gutenberg
(The piping via awk simply takes the 5th column of the list output.)

This file list is around 80MB. We can use this list to add some filters to the rsync command.

On the server books are stored as .txt files. Helpfully, each text file also has a compressed .zip file. Only syncing the .zip files will help to reduce the amount of data that is downloaded. We can either programmatically access the .zip files, or run a script to uncompress (the former is preferred to save disk space).

Some books have accompanying HTML files and/or alternate encodings. We only need ASCII encodings for now. We can thus ignore any file with dash (-) in it (HTML files are *-h* and are zipped; encodings are *-[number].* files).

A book also sometimes has an old folder containing old versions and other rubbish. We can ignore this (as per here). We can use the -m flag to prune empty directories (see here for more details on rsync options).

Also there are some stray .zip files that contain audio readings of books. We want to avoid these as they can be 100s MB. We can thus add an upper size limit of about 10MB (most book files are hundreds of KB).

We can use the --include and --exclude flags in a particular order to filter the files – we first include all subdirectories then exclude files we don’t want before finally only including what we do want.

Bringing this all together gives us the following rsync command-line (i.e. shell) command:

rsync -avm \
--max-size=10m \
--include="*/" \
--exclude="*-*.zip" \
--exclude="*/old/*" \
--include="*.zip" \
--exclude="*" \ ~/data/gutenberg

This syncs the data/gutenberg folder in our home directory with the Kent mirror server. All in all we have about 8GB.

The next steps are then to generate a quick Python wrapper that navigates the directory structure and unzips the files on the fly. We also need to filter out non-English texts and remove the standard Project Gutenberg text headers.

There is a useful GUTINDEX.ALL text file which contains a list of each book and its book number. This can be used to determine the correct path (e.g. book 10000 has a path of 1/0/0/0/10000). The index text file also indicates non-English books, which we could use to filter the books. One option is to create a small SQL database which stores title and path information for English books. It would also be useful to filter fiction from non-fiction, but this may need some clever in-text classification.

So there we are, we have a large folder full of books written before 1920ish, including some of the greatest books ever written (e.g. Brothers Karamazov and Anna Karenina).