Part I of II – What Recent Advances in Natural Language Processing Can Teach Us About the Brain
I recently worked out that the day of my birth will be closer to the end of the Second World War than the present day. This means I am living in the future, hooray!
Over the years I’ve been tracking (and experimenting with) various concepts in natural language processing, as well as reading general texts on the brain. To me both streams of research have been running in parallel; in the last 5 years, natural language processing has found a new lease of engineering life via deep learning architectures, and the empirical sciences have been slowly chipping away at cognitive function. Both areas appear to be groping different parts of the same elephant. This piece provides an outlet for the cross-talk in my own head. With advances in natural language processing coming thick and fast, it also provides an opportunity for me to reflect on the important undercurrents, and to try to feel for more general principles.
The post will be in two halves. This half looks at what recent advances in natural language processing and deep learning could teach us about intelligence and the functioning of the human brain. The next half will look at what the brain could teach natural language processing.
I’ll say here that the heavy lifting has been performed by better and brighter folk. I do not claim credit for any of the outlines or summaries provided here; my effort is to try to write things down in a way that make sense to my addled brain, in the hope that things may also make sense to others. I also do not come from a research background, and so may take a few liberties for a general audience.
In natural language processing, these are the areas that have stayed with me:
- Language Models,
- Distributed Representations,
- Neural Networks,
- Ontologies, and
- Language is Hard.
In the next section, we’ll run through these (at a negligent speed), looking in particular at what they teach their respective sister fields. If you want to dig deeper, I recommend as a first step the Wikipedia entry on the respective concept, or any of the links set out in this piece.
Let’s get started. Hold on.
Mention the term grammar to most people and they’ll wince, remembering the pain inflicted in English or foreign language lessons. A grammar relates to the rules of language. While we don’t always know what the rules are, we can normally tell when they are being broken.
I would say that a majority of people view grammar like god (indeed Baptists can very nearly equate the two). There is one true Grammar, it is eternal and unchanging, and woe betide you if you break the rules. Peek behind the curtain though and you realise that linguists have proposed over a dozen different models for language, and all of them fail in some way.
So what does this mean? Are we stuck in a post-modern relativist malaise? No! Luckily, there are some general points we can make.
Most grammars indicate that language is not a string of pearls (as the surface form of words seems to suggest) but has some underlying or latent structure. Many grammars indicate recursion and fractal patterns of self-similarity, nested over hierarchical structures. You can see this here:
- The ball.
- The ball was thrown.
- The red ball was thrown over the wall.
- In the depths of the game, the red ball was thrown over the wall, becoming a metaphor for the collapse of social morality following the fall of Communism.
Also the absence of “one grammar to rule them all”, teaches us that our rulesets are messy, incomplete and inconsistent. There is chaos with constraint. This hints that maybe language cannot be definitively defined using language. This hints further at Gödel and Church. This doesn’t necessarily rule out building machines that parse and generate language, but it does indicate that these machines may not be able to follow conventional rule-based deterministic processing.
With the resurgence of neural approaches, grammars have gone out of fashion, representing the “conventional rule-based deterministic processing” that “does not work”. But we should not ignore their lessons. Many modern architectures do not seem to accurately capture the recursion and self-similarity, and it appears difficult to train different layers to capture the natural hierarchy. For example, a preferred neural network approach, which we will discuss in more detail below is the recurrent neural network. But this is performing gated repeated multiplication. This means that each sentence above is treated quite differently. This seems to miss the point above. While attention has helped, this seems to be a band-aid as opposed to a solution.
A language model is a probabilistic model that seeks to predict a next word given one or more previous words. The “probabilistic” aspect basically means that we are given a list of probabilities associated with a list of candidate words. A word with a probability of 1 would indicate that a word was definitely next (if you are a Bayesian that you are sure this is the next word). A word with a probability of 0 would indicate that the word was definitely not next. A probability is a probability if all the possible outcomes add up to one, so all our probability values across our words need to do this.
In the early 2000s, big strides were made using so-called ‘n-gram‘ approaches. Translated, ‘n-gram’ approaches count different sequences of words. The “n” refers to a number of words in the sequence. If n=3, we count different sequences of three words and use their frequency of occurrence to generate a probability. Here are some examples:
- the cat sat (fairly high count)
- he said that (fairly high count)
- sat cat the (count near 0)
- said garbage tree (count near 0)
If we have enough digital data, say by scanning all the books, then we can get count data indicating the probabilities of millions of sequences. This can help with things such as spell checking, encoding, optical character recognition and speech recognition.
We can also scale up and down our ‘n-gram’ models to do things like count sequences of characters or sequences of phonemes instead of words.
Language models were useful as they introduced statistical techniques that laid the groundwork for later neural network approaches. They offered a different perspective from rule-based grammars, and were well suited to real-world data that was “messy, incomplete and inconsistent”. They showed that just because a sentence fails the rules of a particular grammar, it does not mean it will not occur in practice. They were good for classification and search: it turns out that there were regular patterns behind language that could enable us to apply topic labels, group documents or search.
Modern language models tend to be built not from n-grams but using recurrent neural networks, such as one or bi- directional Long Short Term Memories (LSTMs). In theory, the approaches are not dissimilar, the LSTMs are in effect counting word sequences and storing weights that reflect regular patterns within text. There are just adding a sprinkling of non-linearity.
Like all models, language models have their downsides. A big one is that people consistently fail to understand what they can and cannot do. They show the general patterns of use, and show where the soft boundaries in language use lie. It turns out that we are more predictable than we think. However, they are not able, on their own, to help us with language generation. If you want to say something new, then this is by its nature going to be of low probability. They do not provide help for semantics, the layer of meaning below the surface form of language. This is why LSTMs can produce text that at first glance seems sensible, with punctuation, grammatically correct endings and what seem like correct spellings. But look closely and you will see that the text is meaningless gibberish.
Quite commonly the answer to the current failings of recurrent neural networks has been to add more layers. This does seem to help a little, as seen with models such as BERT. But just adding more layers doesn’t seem to provide a magic bullet to the problems of meaning or text generation. Outside of the artificial training sets these models still fail in an embarrassing manner.
It is instructive to compare the failures of grammars and language models, as they both fail in different ways. Grammars show that our thoughts and speech have non-linear patterns of structure, that there is something behind language. Language models show that our thoughts and speech do not follow well-defined rules, but do show statistical regularity, normally to an extent that surprises us “free” human agents.
Distributed representations are what Geoff Hinton has been banging on about for years and for me are one of the most important principles to emerge from recent advances in machine learning. I’m lying a little when I link them to natural language processing as they originally came to prominence in vision research. Indeed, much of the work on initial neural networks for image recognition was inspired by the earlier neuroscience of Nobel Prize Winners Hubel and Wiesel.
Distributed representations mean that our representations of “things” or “concepts” are shared among multiple components or sub-components, where each component or sub-component forms part of numerous different “things” or “concepts”.
Put another way, it’s a form of reductionist recycling. Imagine you had a box of Lego bricks. You can build different models from the bricks, where the model is something more than the underlying bricks (a car is more than the 2×8 plank, the wheels, those little wheel arch things etc.). So far, so reductionist. The Greeks knew this several millennia ago. However, now imagine that each Lego brick of a particular type (e.g. red 2×1 block, the 2×8 plank, each white “oner”) is the same brick. So all your models that have use red 2×1 blocks use the same red 2×1 block. This tend to turn your head inside out. Of course, in reality you can’t be in two places at the same time, but you can imagine your brain assembling different Lego models really quickly in sequence as we think about “things” (or even not “things”, like abstractions or actions or feelings).
This is most easily understood when thinking about images. This image from Wei et al at MIT might help:
In this convolutional neural network, the weights of each layer are trained such that different segments of each layer end up representing different aspects of a complex object. These segments form the “Lego bricks” that are combined to represent the complex object. In effect, the segments reflect different regular patterns in the external environment, and different objects are represented via different combinations of low-level features. As we move up the layers our representations become more independent of the actual sensory input, e.g. they are activated even if lighting conditions change, or if the object moves in our visual field.
Knowing this, several things come to mind with regard to language:
- It is likely that the representations that relate to words are going to follow a similar pattern to visual objects. Indeed, many speech recognition pipelines use convolutional neural networks to decipher audio signals and convert this to text. This form of representation also fits with the findings from studying grammars: we reuse semantic and syntactic structures and the things we describe can be somewhat invariant of the way we describe them.
- Our components are going to be hard to imagine. Language seems to come to us fully-formed as discrete units. Even Plato got confused and thought there was some magical free-floating “tree” that existed in an independent reality. We are going to have to become comfortable describing the nuts and bolts of sentences, paragraphs and documents using words to describe things that may not be words.
- For images, convolutional neural networks are very good at building these distributed representations across the weights of the layer. This is because the convolution and aggregation is good at abstracting over two-dimensional space. But words, sentences, paragraphs and documents are going to need a different architecture; they do not exist in two-dimensional space. Even convolutional neural networks struggle when we move beyond two dimensions into real-world space and time.
Neural networks are back in fashion! Neural networks have been around since the 1950s but it is only recently we have got them to work in a useful manner. This is due to a number of factors:
- We now have hugely fast computers and vastly greater memory sizes.
- We worked out how to practically perform automatic differentiation and build compute graphs.
- We began to have access to huge datasets to use for training data.
The best way to think of neural network is that they implement differentiable function approximators. Given a set of (data-X, label-Y) pairs neural networks perform a form of non-linear line fitting that maps X>Y.
Within natural language processing, neural networks have out performed comparative approaches in many areas, including:
- speech processing (text-to-speech and speech-to-text);
- machine translation;
- language modelling;
- question-answering (e.g. simple multiple choice logic problems);
- summarisation; and
- image captioning.
In the field of image processing, as set out above, convolutional neural networks rule. In natural language processing, the weapon of choice is the recurrent neural network, especially the LSTM or Gated Recurrent Unit (GRU). Often recurrent neural networks are applied as part of a sequence-to-sequence model. In this model an “encoder” receives a sequence of tokens and generates a fixed-size numeric vector. This vector is then supplied to a “decoder”, which outputs another sequence of tokens. Both the encoder and the decoder are implemented using recurrent neural networks. This is one way machine translation may be performed.
Neural networks do not work like the brain. But they show that a crude model can approximate some aspects of cortical function. They show that it is possible to build models of the world by feeding back small errors between our expectations and reality. No magic is needed.
The limitations of neural networks also show us that we are missing fairly large chunks of the brain puzzle – intelligence is not just sections of cortex. Most of the progress in the field of machine learning has resulted from greater architectural complexity, rather than any changes to the way neural networks are trained, or defined. At the moment things resemble the wild west, with architectures growing based on hunches and hacking. This kind of shifts the problem: the architectures still need to be explicitly defined by human beings. We could do with some theoretical scaffolding for architectures, and a component-based system of parts.
Most state of the art neural network models include some form of attention mechanism. In an over-simplified way, attention involves weighting components of an input sequence for every element in an output sequence.
In a traditional sequence-to-sequence system, such as those used for machine translation, you have an encoder, which encodes tokens in an input sentence (say words in English), and a decoder, which takes an encoded vector from the encoder and generates an output sentence (say words in Chinese). Attention models sought to weight different encoder hidden states, e.g. after each word in the sentence, when producing each decoder state (e.g. each output word).
A nice way to think of attention is in the form of values, keys and queries (as explained here).
In a value/key/query attention model:
- the query is the decoder state at the time just before replicating the next word (e.g. a a given embedding vector for a word);
- the keys are items that together with the query are used to determine the attention weights (e.g. these can be the encoder hidden states); and
- the values are the items that may be weighted using attention (e.g. these can also be the encoder hidden states).
In the paper “Attention is All You Need”, the authors performed a nifty trick by leading with the attention mechanism and ditching some of the sequence-to-sequence model. They used an attention function that used a scaled dot-product to compute the weighted output. If you want to play around with attention in Keras this repository from Philippe Rémy is a good place to start.
Adam Kosiorek provides a good introduction to attention in this blogpost. In his equation (1), the feature vector z is equivalent to the values and the keys are the input vector x. A value/key/query attention model expands the parameterised attention network f(…) to be a function of both two input vectors: the hidden state of the decoder (the query Q) and the hidden states of the encoder (the keys K) – a = f(Q, K). The query here changes how the attention weights are computed based on the output we wish to produce.
Now I have to say that many of the blogposts I have read that try to explain attention fail. What makes attention difficult to explain?
Attention is weighting of an input. This is easy to understand in a simple case: a single, low-dimensionality feature vector x, where the attention weights are calculated using x and are in turn applied to x. Here we have:
a = f(x)
g = a⊙x.
g is the result of applying attention, in this simple example a weighted set of inputs. The element-wise multiplication ⊙ is simple to understand – the kth element of g is computed as g_k = a_k*x_k (i.e. an element of a is used to weight a corresponding element of x). So attention, put like this, is just the usual weighting of an input, where the weights are generated as a function of the input.
Now attention becomes more difficult to understand as we move away from this simple case.
- Our keys and values may be different (i.e. x and z may be different sizes with different values). In this case, I find it best to consider the values (i.e. z) as the input we are weighting with the attention weights (i.e. a). However, in this case, we have another input – the keys (or x) – that are used to generate our attention weights, i.e. a=f(x). In sequence-to-sequence examples the keys and values are often the same, but they are sometimes different. This means that some explanations conflate the two, whereas others separate them out, leading to bifurcated explanations.
- In a sequence-to-sequence model our keys and/or values are often the hidden states of the encoder. Each hidden state of the encoder may be considered an element of the keys and/or values (e.g. an element of x and/or z). However, encoders are often recurrent neural networks with hidden dimensions of between 200 and 300 values (200-300D), as opposed to single dimension elements (e.g. 1D elements – [x1, x2, x3…]). Our keys and values are thus matrices rather than arrays. Each element of the keys and/or values thus becomes a vector in itself (e.g. x1 = [x11, x12, x13…]). This opens up another degree of freedom when applying the attention weights. Now many systems treat each hidden state as a single unit and multiply all the elements of the hidden state by the attention weight for the hidden state, i.e. gk = ak*[xk1, xk2, xk3…] = [akxk1, akxk2, akxk3…..]. However, it is also possible to apply attention in a multi-dimensional manner, e.g. where each attention weight is a vector that is multiplied by the hidden state vector. In this case, you can apply attention weights across the different dimensions of the encoder hidden state as well as apply attention to the encoder hidden state per se. This is what I believe creates much of the confusion.
- Sequence-to-sequence models often generate the attention weights as a function of both the encoder hidden states and the decoder hidden states. Hence, as in the “Attention is All You Need” paper the attention weights are computed as a function of a set of keys and a query, where the query represents a hidden state of a decoder at a time that a next token is being generated. Hence, a = f(Q, K). Like the encoder hidden state, the decoder hidden state is often of high dimensionality (e.g. 200-300D). The function may thus be a function of a vector Q and a matrix K. “Attention is All You Need” further confuses matters by operating on a matrix Q, representing hidden states for multiple tokens in the decoder. For example, you could generate your attention weights as a function of all previous decoder hidden states.
- Many papers and code resources optimise mathematical operations for efficiency. For example, most operations will be represented as matrix multiplications to exploit graphical processing unit (GPU) speed-ups. However, this often acts to lose some logical coherence as it is difficult to unpick the separate computations that are involved.
- Attention in a sequence-to-sequence model operates over time via the backdoor. Visual attention models are easier to conceptually understand, as you can think of them as a focus over a particular area of a 2D image. However, sentences represent words that are uttered or read at different times, i.e. they are a sequence where each element represents a successive time. Now, I haven’t seen many visual attention models that operate over time as well as 2D (this is in effect a form of object segmentation in time). The query in the attention model above thus has a temporal aspect, it changes the attention weights based on a particular output element, and the hidden state of the decoder will change as outputs are generated. However, “t” doesn’t explicitly occur anywhere.
Unpacking the confusion also helps us see why attention is powerful and leads to better results:
- In a sequence-to-sequence model, attention changes with each output token. We are thus using different data to condition our output at each time step.
- Attention teaches us that thinking of cognition as a one-way system leads to bottlenecks and worse results. Attention was developed to overcome the constraints imposed by trying to compress the meaning of a sentence into a single fixed-length representation.
- In certain sequence-to-sequence models, attention also represents a form of feedback mechanism – if we generate our attention weights based on past output states we are modifying our input based on our output. We are getting closer to a dynamic system – the system is being applied iteratively with time as an implicit variable.
- Visualisations of attention from papers such as “Show, Attend and Tell” – and machine translation models seem to match our intuitive notions of cognition. When a system is generating the word “dog” the attention weights emphasise the image areas that feature a dog. When a system is translating a compound noun phrase it tends to attend jointly to all the words of the phrase. Attention can thus be seen as a form of filtering, it helps narrow the input to conditionally weigh an output.
- Attention is fascinating because it suggests that we can learn a mechanism for attention separately from our mapping function. Our attention function f(…) is a parameterised function where we learn the parameters during training. However, these parameters are often separate from the parameters that implement the encoder and decoder.
Attention appears to have functional overlaps with areas of the thalamus and basal ganglia which form a feedback loop between incoming sensory inputs and the cortex. Knowing about how attention works in deep learning architectures may provide insight into mechanisms that could be implemented in the brain.
In the philosophical sense, an ontology is the study of “being”, i.e. what “things” or “entities” there are in the world, how they exist, what they are and how they relate to each other.
In a computer science sense, the term “ontology” has also been used to describe a method of organising data to describe things. I like to think of it representing something like a database schema on steroids. Over the last few decades, one popular form of an “ontology” has been the knowledge graph, a graph of things and relationships represented by triples, two “things” connected by a “relationship”, where the “things” and the “relationship” form part of the ontology.
Ontologies are another area that has faded out of fashion with the resurgence of deep neural networks. In the early 2000s there was a lot of hype surrounding the “semantic web” and other attempts to make data on the Internet more machine interpretable. Projects like DBpedia and standard drives around RDF and OWL offered to lead us to a brave new world of intelligent devices. As with many things they didn’t quite get there.
What happened? The common problem of overreach was one. Turns out organising human knowledge is hard. Another problem was one shared with grammars, human beings were trying to develop rule-sets, conventions and standards for something that was huge and statistical in nature. Another was that we ended up with a load of JAVA and an adapted form of SQL (SPARQL), while the Internet and research, being stubborn, decided to use hacky REST APIs and Python.
However, like grammars, ontologies got some things right, and we could do with saving some of the baby from the bathwater:
- Thinking about things in terms of graphs and networks seems intuitively right. The fact that ontologies are a useful way to represent data says something in itself about how we think about the world.
- It turns out that representing graph data as sets of triples works fairly well. This may be useful for further natural language processing engineering. This appears to reflect some of the fractal nature of grammars, and the self-similarity seen in language.
- Ontologies failed in a similar way to grammars. Neural networks have taught us that hand-crafting features is “not the way to go”. We want to somehow combine the computing and representational aspects of ontologies, with learnt representations from the data. We need our ontologies to be messier. No one has quite got there yet, there have been graph convolutional networks but the maths is harder and so they form a niche area that is relatively unknown.
- The “thing”, “relationship/property” way of thinking seems to (and was likely chosen to) reflect common noun/verb language patterns, and seems to reflect an underlying way of organising information in our brains, e.g. similar to the “what” and “where” pathways in vision or the “what” and “how” pathways in speech and motor control.
Language is Hard
To end, it is interesting to note that the recent advances in deep learning started with vision, in particularly image processing. Many attempted to port across techniques that had been successful in vision to work on text. Most of the time this failed. Linguists laughed.
For example, compare the output of recent Generative Adversarial Networks (GAN) with that of generative text systems. There are now many high-resolution GAN architectures but generative text systems struggle with one or two coherent sentences and collapse completely over a paragraph. This strongly suggests that language is an emergent system that operates on top of vision and other sensory modalities (such as speech recognition and generation). One reason why deep learning architectures struggle with language is that they are seeking to indirectly replicate a very complex stack using only the surface form of the stack output.
Take object persistence as another example. Natural language processing systems currently struggle with entity co-reference that a 5-year old can easily grasp, e.g. knowing that a cat at the start of the story is the same cat at the end of a story. Object persistence in the brain is likely based on at least low-level vision and motor representations. Can we model these independently of the low-level representations?
The current trend in natural language processing is towards bigger and more complex architectures that excel on beating benchmarks but generally fail miserably on real-world data. Are we now over-fitting in architecture space? Maybe one solution is to take a more modular approach where we can slot in different sub-systems that all feed into the priors for word selection.
In part two, we will look at things from the other side of the fence. We review some of the key findings in neuro- and cognitive science, and have a look at what these could teach machine learning research.