Rambles about Language Models

“But it must be recognised that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.”

Chomsky (1969)

Are language models a waste of time?

I recently found this post in my drafts, having written it over the Christmas period in 2017. Having talked with several technologists about so-called “AI”, I’ve realised that there is a wide public misconception about language models and how they are used. In this post I try to explain to myself, and anyone who feels like reading it, the properties of language models and their limits.

What is a language model?

A language model can be defined as a system that computes probabilities for sequences of tokens:


where the output is a probability value between 0 and 1 and t is a vector containing a sequence of tokens, t_1, t_2... t_n.

Languages have multiple levels of abstraction. The tokens can thus be:

  • characters (e.g. ‘a’, ‘.’, ‘ ‘);
  • words (e.g. ‘cat’, ‘mat’, ‘the’); or
  • larger word groups such as sentences or clauses.

So in one sense, a language model is a model of the probability of a sequence of things. For example, if you are given a particular set of words “the cat sat on the mat“, the language model can provide a probability value between 0 and 1 representing the likelihood of those tokens in the language.

Conditional Models

Another form of language model is a conditional token model. This looks to predict tokens one at a time given a set of previous tokens.

P(t_n|t_1, t_2, ... t_{n-1})

Here we have a function that provides a probability value for a particular token at position n in the sequence, given (the symbol “|“) a set of preceding symbols t_1, t_2... t_{n-1}.

Normally, we have a set of possible token values. For example, we might have a dictionary of words, where a token t can take one of the values in the dictionary of words. Many practical language models have a fixed-size dictionary, sometimes also called a “vocabulary“. This may be 10,000 or 100,000 words. Most people have an active vocabulary (i.e. a vocabulary that they can use in expressions) of between 10,000 and 40,000 words (depending on things like education and literacy). The Oxford English Dictionary has entries for around 170,000 words. So our probability function outputs an array or vector of values, representing a probability value for each token value in the dictionary, where the probability values over the complete dictionary sum to 1.

If you hear phrases such as “local effects” and “long range dependencies”, these relate to the number of tokens we need to add into the language model to predict the next token. For example, do we need all the previous tokens in a document or just the last few?

Language as Numbers

Now computers only understand numbers. So, t in many practical language models isn’t really a sequence of token values (such as words), it’s a sequence of numbers, where each number represents an entry in a dictionary or vocabulary. For example, you may have:

“hello”: 456 – the 456th entry in an array of 10,000 words.

“world”: 5633 – the 5633th entry in an array of 10,000 words.

So “hello world” = [456, 5633].

Building Language Models

Language models are typically constructed by processing large bodies of text. This may be a set of documents, articles in Wikipedia, or all webpages on the Internet. Hence, the output probability can be seen as a likelihood for a sequence of tokens, or a next token, based on some form of historical data.

For example, over a corpus of English text, “the cat is black” would have a higher probability value (e.g. 0.85) than “dog a pylon” (e.g. 0.05). This form of model is actually present in most pockets – it is difficult to type the last example on a smartphone, as each word is autocorrected based on a likely word given the characters.

A simple language model can be built using n-grams. N-grams are sequences of tokens of length n. Before the most recent comeback of neural network models, state of the art transcription and translation systems were based on n-grams. You can generate an n-gram model by simply counting sequences of tokens. For example, “the cat is black” contains 3 bi-grams (n=2) – “the cat”, “cat is”, “is black”. Over a large enough corpus of text “the cat” will occur more times than “cat the” and so “the cat” can be assigned a higher probability value proportional to its frequency.

The latest state of the art language models use recurrent neural networks. These networks are parameterised by a set of weights and biases and are trained on a corpus of data. Using stochastic gradient descent and back propagation on an unrolled network, values for the parameters can be estimated. The result is similar to the n-gram probabilities, where the frequency of relationships between sequences of characters influence the parameter values.

What Language Models Are Not

Over the last 10 or 20 years there have been great strides in analysing large bodies of text data to generate accurate language models. Projects such as Google books and the Common Crawl have built language models that cover a large proportion of the written word generated by human beings. This means we can now fairly accurately provide bounds of how likely a sequence of tokens is.

However, issues often when people naively try to use language models to generate text. It is often the case that the most likely sentence, given a corpus, is not the sentence we want or need. Indeed, for a sentence to have high (human) value it often needs to express something new, and so it will diverge from the corpus of past sentences. Hence, a “good” sentence (from a human task perspective) may have a lower probability in our language model than a “common” sentence.

As an exercise for you at home, try to dictate an unusual sentence into your phone. The likely outcome is that the phone tries to generate the most likely sentence based on history rather than the sentence you want to generate.

You also see this with toy implementations of recurrent neural networks such as CharRNN and its varieties. These networks seek to estimate a probability distribution over a vocabulary of terms; what is being generated is likely sequences of tokens given the training data. These toy implementations are what the popular press pick up on as “AI” writers. However, they are nothing more than sequences of likely tokens given a dataset.

Often the toy implementations appear to be smart because of the stochastic nature of the probabilistic models – each sequence will be slightly different due to probabilistic sampling of the token probabilities (plus things like searches over the probabilities). Hence, you will get a slightly different output every time, which looks more natural than a single most likely sentence. However, a closer reading shows that the outputs of these systems is gibberish.

So language models are not generative models, at least not in their popular form.

How Do You Generate Text?

Another exercise: tell me a sentence. Any sentence whatsoever.

It’s harder than it looks. People will generally do one of several things:

  1. pick archetypal sentences “the quick brown fox…” “the X did Y”, where most of these are conditionally learnt at a young age;
  2. pick a sentence based on what they were doing immediately prior to the question; or
  3. look around and pick something in their field of view. The experiment is even more fun with children, as the thought processes are often more transparent.

This test demonstrates a simple truth: the concept of a “random” sentence rarely occurs in practice. All language is conditional. It is easier to provide a sentence about something.

Here are some things (amongst a near infinite multitude of things) that influence what words we select:

  • Document type (report, blog post, novel);
  • Audience (children, adults, English professors);
  • Location in document (start, middle, end);
  • Characters and character history;
  • Country;
  • Previous sentences/paragraphs/chapters; and
  • Domain (engineering, drama, medical).

The better we get at modelling what we are talking about, the better our generative language models.

This is partially seen with summarisation systems. Some of these produce pretty coherent text. The reason? The context is severely constrained by the content of the piece of writing we are summarising.

Distributed Sensory Representations

There is more. Vision, sound and motor control can teach us a lot about language. Indeed different facets of language have piggybacked on the underlying neural configurations used for these abilities. All these areas have distributed hierarchical representations. Complex features are represented by fuzzy combinations of lower features, and at the bottom you have raw sensory input. There is no neuron for “banana” but a whole series of activations (and activation priming) for different aspects of a “banana”. Visualisations of feature layers in convolutional neural network architectures show how complex object representations may be constructed from a series of simple features, such as edges. It is likely that a semantic representation of a word in our brains is similarly constructed from webs of component representations over several layers.

Yet another question: how does a person blind from birth imagine an orange?

I don’t know the answer to this. (I need to find some research on it.) I’d hazard to guess that the mental representation is built from many non-visual sensory representations, where these may be more detailed than an average sighted person. But the key is they still “know” what an orange is. Hence our semantic representations are distributed over different sensory modalities as well as over different layers of complexity.

So I believe we are getting closer to useful generative language models when we look an systems that produce simple image caption labels. These systems typically use a dense vector representation of an image that is output by a convolutional neural network architecture to condition a recurrent neural network language model. The whole system is then trained together. Here the dense vector provides an “about” representation that allows the language model to pick the most likely words, given the image. The surreal errors these systems make (a baseball bat is a “toothbrush”) show the limitations of the abstract representations conditioning the text generation.

Another issue that tends to get ignored by academic papers I have seen is the limitation of selecting a particular input representation. Many systems start with clean, easily tokenised text sources. The limited time scales of research projects means that words are often picked as the input layer. Hence, the language model looks at providing word probabilities over a vocabulary. Often word embeddings are used on this input, which introduces some aspects of correlation in use. However, in our brains, words seem to be an intermediate representation; they are features built upon sounds, phonemes and lower symbols (e.g. probably at least several layers of representations). Given that language is primarily oral (writings is a relatively new bolt on), I’d hazard that these lower levels influence word choice and probability. (For example, why do you remember “the cat on the mat” more than “the cat on the carpet”?) Word embeddings help to free us from the discrete constraints of words as symbols but they may be applying use patterns too early in the layers of representations.

Looking at how motor activity is controlled in the brain, we find that our cortex does not store detailed low-level muscle activation patterns. Through training these patterns are often pushed out of the cortex itself, e.g. into the cerebellum, spinal cord or peripheral nervous system. Also we find that, if practised enough, fairly complex sequences may be encoded as a small number of cortical representations. This appears to apply to language generation as well, especially for spoken language. Our conversations are full of cliches and sayings (“at the end of the day”). Within the cortex itself, brain activity appears to cascade from higher levels to lower levels but with feedback between the layers during language generation, e.g. we translate a representation of an object into a series of sounds then a series of muscle activations.

So in our brain:

  • There is structure in language (otherwise it would be incomprehensible).
  • Comprehension arrives through shared conventions.
  • These shared conventions are fuzzy – as they are shaped through social use different contradictory rules may apply at the same time.
  • The structure of language at least partially reflects preferred methods of information representation and organisation in the human cortex.

This is just a quick run through of some of my thinking on this point. What you should take home is that language models are useful in many engineering applications but they are not “artificial intelligence” as believed by many.


How do we remember characters?

I have a question of cognitive science: how do we hold in our minds the combination of characteristics that make up a particular object or character?

How do we then keep that specific combination in mind and consistent over the span of a narrative?

Consistency is a hard one. Characters may be separated by several pages or chapters but we still expect consistency.

Possible clues:

  • Working memory – e.g. similar to remembering a phone number. We can remember about 3-5 discrete items – but often our characters or objects are of greater complexity than this. Chunking has been seen to be a way of holding more representations in mind; where we combine multiple properties under one limited placeholder representation.
  • It does not seem possible that combinations are stored via changes in synaptic configurations, at least in the short term. What may be the case is that we make use of slowly changeable underlying representations, which we modify (“delta”) and combine.
  • There would seem to be a hierarchy of features, which dictates what needs to be remembered for consistency (e.g. alive or dead is magnitudes more important than repetition of character thought). It maybe that consistency consists of presence or absence of clearly separable pre-constructed representations (e.g. we know alive and dead are clearly different so we just need to link one of those pre-existing concepts).
  • Hot and cold recall – it is a well known fact that it is easier to recall primed or pre-activated representations. This tends to fade but may have effects over a time scale of minutes up to an hour. It would appear that this is a relatively broad-brush effect – it is reported that related representations are also more easily retrieved. Cold recall is probably more influenced by repetition and longer-term changes in synaptic configurations.
  • Combinations of characteristics appear to be based on higher level sensory representations, which in turn may be constructed from lower level sub-components, all with affective weightings. Anecdotally, many of my representations appear to be primarily visual – is there any research on how those that are blind from birth “imagine” objects?
  • Imagination and remembering appear to use the same underlying mechanisms – they all involve the “top down” activation of portions of neuronal representations based on an internal driver that is (semi) independent of the current incoming sensory streams. This is all happening in time within nested feedback loops – locking onto a harmonic signal is a useful analogy.
  • Is imagination, i.e. the ability to create new combinations of characteristics, uniquely human? If so what is it about the human brain that allows this? Evidence suggests a looser coupling between sensory input and representation, or the fact that higher level features can be activated somewhat independently of lower level circuits.
  • The prefrontal cortex is keep to the initial top-down control and the coordination of timing. But imagination and manipulation of representations appears to require the coordination of the whole brain. Often ignored structures such as the thalamus and basal ganglia are likely important for constructing and maintaining the feedback loop.
  • Although we feel an imagined combination appears instantaneously, this may be just how we perceive it – the feedback loops may activate over time in an iterative manner to “lock onto” the combination such that it appears in our minds.

Reflections on “Meaning”

Existential “meaning” is partly the telling of a story featuring ourselves that is available and consistent with our higher-level representations of the world. It is not, generally, rational; it is more a narrative correlated with a feeling of “selfness” and “correctness”.

For example, think of how you feel when you hear your own internal voice as opposed to hearing another person speak. You feel that the internal voice is somehow “you”. This is not a rational thought, indeed the language of rational thought may be seen, in part, to *be* the internal voice. This feeling breaks down in certain brain diseases, such as schizophrenia. With these diseases, the “me” feeling is lost or broken, and hence the internal voice of “you” becomes an auditory hallucination, a voice of “them”.

Concentrate on this feeling of “you” for a moment, try to explore what it feels like.

Now think about a feeling of “correctness”. This can also be seen as a feeling of “truthiness”. For example, try to concentrate on how the feeling of “1 + 1 = 2”, differs from the feeling of “1 + 1 = 5”. The latter invokes a feeling of uneasiness, an itching to correct. It has tones of unpleasantness. It induces a slight anxiety, a feeling that action is needed. The former invokes a feeling of contentment, that no action is required; it may be contemplated for an extended period without unease of additional thought. It’s a similar feeling to artistic “beauty”, the way we can contemplate a great painting or a landscape.

Both feelings may arise from a common mechanism in the cingulate cortex, a medial layer of cortex that sits between older brain structures such as the thalamus and the higher cortical layers. Indeed, certain forms of schizophrenia have been traced back to this structure. The cingulate cortex may be considered to be the emotional gateway between complex neural representations in the upper cortex and structures that manage low-level sensory input and co-ordinate physiology response. “Sensory input” in this context also includes “gut feeling”, sensory input from internal viscera. Work, such as that performed by Antonio Damasio, shows that this input is important for the embodied feeling of self, e.g. “self” may in part be a representation formed from signals from these viscera. In a not dissimilar manner, “correctness” may be based on a representation of error or inconsistency between a sequence of activated cortical representations. At a very naive level this could be built from familiarity, e.g. at a statistical level, does this sequence match previously activated sequences? Over a human life this is a “big data” exercise.

So, back to “meaning”. Our higher-level cortical systems, e.g. the frontal lobes, create narratives as patterned sequences of goal-orientated social behaviour, which may be expressed in various media (stories, plays, comics, dance, songs, poems etc). For “meaning” to be present, we are looking for strong positive correlations between these narratives and the emotional representations of “self” and “correctness”. What form could these correlations take?

First, let’s look at “goal-orientated behaviour”. The frontal lobes build representations of sequences of other representations. These sequences can represent “situation, action, outcome”. There is some overlap with the methods of reinforcement learning. Over time we learn the patterns that these sequences tend to match (google Kurt Vonnegut’s story graphs). The frontal lobes are powerful as they can stack representations over one to seven layers of cortex, allowing for increasing abstraction. Narratives are thus formed from hierarchical sequences of sequences. (There may also be a bottom-up contributions from lower brain structures such as the basal ganglia, which represents more explicit cause-effect pairings without abstract, e.g. “button tap: cocaine”.)

Second let’s look at activities that are widely reported to be “meaningful”, and those that are not. “Meaningful” activities tend to be pro-social. For example, imagine you are an artist on your deathbed; what feels more meaningful, that you produced a work of art seen by no one or that you produced a work of art seen by millions? I’d hazard that the second scenario provides greater “meaning”. We need a sense that we have affected others in a positive manner. Similarly, does “1+1=2” feel “meaningful”? Do your tax returns feel “meaningful”? Does the furniture in a new home feel “meaningful”? I’d hazard “no”. These things do not elicit a strong emotional reaction. The furniture example is a good one; the furniture in your family home may come to have “meaning”, but only because it forms the background of your social memories. We are social animals, like parrots, dolphins or baboons, and so the social realm forms a bedrock to our emotional states.

For correlations to stick in the brain we need two things: 1) for correlations to be present in the outside world (or at least some situations that form a sensory base to those correlations); and 2) for us to regularly experience these external situations. Here “regularly” means at a daily or at least weekly.

Religions have long been aware of these aspects, indeed we often define “religion” as a structured practice built on a common mythological framework. It is widely reported that it is not possible to feel “faith” without practice. In Islam this is explicit, “Islam” means to submit or surrender; you have to practice to believe. The structure of religion provides the regular experience of 2): daily prayers, weekly worship, and annual festivals.

To provide correlations between feelings of “self” and “correctness” and particular narratives, we need to experience them all collectively. It is important that the narratives are at least analogous to our daily experience. If they are not, we cannot experience them as being “correct” or “true”. All of this also needs to take place below a level of conscious awareness.

Again, we can learn a lot from religion. Rituals light up the brain of those experiencing them. You are the one experiencing the ritual, you are taking in heavy sensory stimulation, and are performing actions within the world. Rituals often involve singing, collective and stylised movement, repetition of motifs. These activate common neural representations each time the ritual is performed. The connections that are formed fuse the self and the experience.

The last part of the puzzle involves fusing the narrative and the experience. The self is thus fused with the narrative via the repeated experience.

In religion, rituals tell a story. The Eucharist is rooted in the story of last summer, Passover the liberation of the Israelites, and Ramadan commemorates the receipt of the Quran. It is important that the faithful act in a manner that is consistent with the story. Although many stories are based on an echo of history, historical fact is not important. More important is that the story, or an abstraction of the story, mirrors experience outside of the ritual. For example, many religious stories are based on familial relations, which most can instantly relate to. Many religious stories acknowledge suffering and struggle, as well as moments of joy and exhilaration, which again people regularly feel in their daily lives. The stories are also dynamic, they emerge from history through retelling, emphasis, interpretation. Like our own memories they are recreated every time they are retold. A key role of clergy is to draw parallels between these stories and our daily tribulations.

Having considered these points, we can see why the rational secularism of modernity often leaves people cold and lacking meaning. Science is not a vehicle for creating human meaning. Indeed, I would go as far to say that the factors that make science successful move us in a direction away from meaning. The workaday stories of science need to be sterile for science to work; they need to be objective, unbiased and unemotional. They then provide us with predictive power. But predictive power is not emotional resonance. The predictive stories of science, equations and theories, are not human-centric in a way that matches what we feel. If they were, we would all be reading scientific papers as opposed to watching Netflix.

If people lack meaning in their lives, they quickly fall into nihilism and despair. The challenge of Western post-modernism, having largely ditched religion, is thus to fill the void and create something we can use in our daily lives. Science and engineering could provide the tools and understanding, but they will not provide the solution themselves.

Getting All the Books

This is a short post explaining how to obtain over 50,000 text books for your natural language processing projects.

books on bookshelves
Photo by Mikes Photos on Pexels.com

The source of these books is the excellent Project Gutenberg.

Project Gutenberg offers the ability to use sync the collection of books. To obtain the collection you can set up a private mirror as explained here. However, I’ve found that a couple of tweaks to the rsync setup can be useful.

First, you can use the --list only option in rsync to first obtain a list of files that will be synced. Based on this random Github issue comment, I initially used the command below to generate a list of the files on the UK mirror server (based at the University of Kent):
rsync -av --list-only rsync.mirrorservice.org::gutenberg.org | awk '{print $5}' > log_gutenberg
(The piping via awk simply takes the 5th column of the list output.)

This file list is around 80MB. We can use this list to add some filters to the rsync command.

On the server books are stored as .txt files. Helpfully, each text file also has a compressed .zip file. Only syncing the .zip files will help to reduce the amount of data that is downloaded. We can either programmatically access the .zip files, or run a script to uncompress (the former is preferred to save disk space).

Some books have accompanying HTML files and/or alternate encodings. We only need ASCII encodings for now. We can thus ignore any file with dash (-) in it (HTML files are *-h* and are zipped; encodings are *-[number].* files).

A book also sometimes has an old folder containing old versions and other rubbish. We can ignore this (as per here). We can use the -m flag to prune empty directories (see here for more details on rsync options).

Also there are some stray .zip files that contain audio readings of books. We want to avoid these as they can be 100s MB. We can thus add an upper size limit of about 10MB (most book files are hundreds of KB).

We can use the --include and --exclude flags in a particular order to filter the files – we first include all subdirectories then exclude files we don’t want before finally only including what we do want.

Bringing this all together gives us the following rsync command-line (i.e. shell) command:

rsync -avm \
--max-size=10m \
--include="*/" \
--exclude="*-*.zip" \
--exclude="*/old/*" \
--include="*.zip" \
--exclude="*" \
rsync.mirrorservice.org::gutenberg.org ~/data/gutenberg

This syncs the data/gutenberg folder in our home directory with the Kent mirror server. All in all we have about 8GB.

The next steps are then to generate a quick Python wrapper that navigates the directory structure and unzips the files on the fly. We also need to filter out non-English texts and remove the standard Project Gutenberg text headers.

There is a useful GUTINDEX.ALL text file which contains a list of each book and its book number. This can be used to determine the correct path (e.g. book 10000 has a path of 1/0/0/0/10000). The index text file also indicates non-English books, which we could use to filter the books. One option is to create a small SQL database which stores title and path information for English books. It would also be useful to filter fiction from non-fiction, but this may need some clever in-text classification.

So there we are, we have a large folder full of books written before 1920ish, including some of the greatest books ever written (e.g. Brothers Karamazov and Anna Karenina).