Free Will: Do We Have It?

The more we design intelligent systems, the more we come up against the concepts of free will and determinism. These concepts underlie the stories we tell ourselves and underpin our legal systems. But what does free will mean? How does it influence our actions? And can we get rid of it?


The approach of this piece is as follows:

  • First, we will take a look at our starting assumptions when we use the term “free will”.
  • Then we will look in more detail at what the “will” could be.
  • We will then turn to the term “free” and have a look at the “freedom” this could entail.
  • Lastly, we will look at choice and probability, before trying to tie it all up at some kind of conclusion.

Free Will

Ask people whether they believe in free will and most will say yes.  Ask them what “free will” is and they will say something like:

the ability to choose, think, and act voluntarily.

To believe in free will is to believe that human beings can be the authors of their own actions and to reject the idea that human actions are determined by external conditions or fate.

Free will is often contrasted with determinism:

the doctrine that all events, including human action, are ultimately determined by causes regarded as external to the will.

Indeed, this definition rather circularly refers back to the “will”, which is the very thing we are saying is “free”. As the definition of determinism itself requires a will, maybe we can make progress by breaking the term down into its components: “free” and “will”.

What is this “will”?

Will, generally, is that faculty of the mind which selects, at the moment of decision, the strongest desire from among the various desires present.

So “will” is a “faculty” and it is “free”. It is the “thing” that selects. It is part of our “mind”, but may or may not be part of our “brains”. It is possibly the “I” that makes a decision, or at least a part of that “I”. Looking at the “will” separately from its freedom is useful. What the “will” is, and how it relates to us as human beings appears to be debatable separately from its property of being “free”.

What does it mean to be “free”?

Freedom can be defined positively or negatively: i.e. from Aristotle it can be the power to do something or the power not to do something.

Fascinatingly, the English term “free” allegedly has its root, via old German, in an Indo-European word meaning “to love”. Its modern meaning seems to have followed the logic that what is loved is not in bondage, or that what is loved belongs to those that love.

Much of the debate surrounding “free will” concerns the amount of freedom that is implied by “free”. If we are “free”, can we choose to do anything? Or, at its limit, is to be “free” simply to have a choice, i.e. to be able to choose one thing rather than another?

We will come to these points later.

The Will

Who Is Driving?

Let’s start by looking in more detail at the “will”. There are some presumptions we can make. You may not agree with all of them. But setting them out helps our thinking.


Human beings are just matter. I am a materialist: matter is all we need to explain how human beings operate in the world. This denies the existence of any spiritual component that is independent of the matter of our bodies. We do not require a ghost to inhabit the machine, whether that be a religious soul or quantum shenanigans.

Human beings are embodied. Each of us is delineated within the world by pointing to a separable collection of matter: our bodies.

It is worth noting that human beings are not a constant set of matter: our cells are generated, act and then die. Different cells have different lifespans, and many cells only survive for days or weeks. Our bodies are themselves an abstraction, à la Heraclitus. However, the longest-living cells, those of our brain and nervous system, also seem to be the cells that help define us.

The primary organ of control is the brain. To live and to be able to act we need certain parts of our bodies more than others. We can lose limbs and certain organs but not our head or heart. Although our sensory and control apparatus are distributed throughout our bodies, the root of intelligent control is the brain inside the head (you can have a heart transplant but not a brain transplant).

Even to position control within the brain is semi-controversial. Try this thought experiment: did Mike the headless chicken have any choice over its actions? It apparently “attempted to preen, peck for food and crow” (“though with limited success”). Observing Mike we ask: did he choose to preen at that time rather than peck?

Pulling these together, we can say that if the “will” is that which makes decisions and controls our actions, it is primarily located with the matter of our brains within our bodies.

The Will and Our Frontal Lobes

We have seen how the “will” may be primarily located within the brain. Neuroscience allows us to go one further: many of the functions we ascribe to the “will” may be located with the frontal lobes of the brain.

Quick caveat: when we look at the frontal lobes we have to beware the reductionist quest for the homunculus. When we as human beings think and act we use the whole brain (not 10%!), with different aspects of our thoughts being represented in neuronal structures in different areas that are all tied together in time by more neuronal structures. However, brain damage and brain imaging do show correlations between particular areas and certain functions. How we think is also heavily influenced by our bodily state.

Damage to the frontal lobes is known to result in issues with acting in the world. Planning and control may suffer. We may struggle with the appropriate response to a particular situation. We may struggle to control our emotions, or to provide an appropriate emotional response. We may lose the feeling of “trueness” that is associated with people and places that we know. All these functions appear to relate to some form of “will”.

There is also a (possibly loose) correlation between the relatively large size of our frontal lobes and our capacities as human beings. The frontal lobes are best seen as facilitators: they are connected to many other brain areas and help synchronise firing activity across the brain. It is likely the frontal lobes store some of our more abstract neuronal representations. They act in a feedback loop that covers the whole brain (e.g. the fronto-cingulo-parietal network is associated with executive functions). The frontal lobes help to coordinate sequences of mental representations that form the basis for conscious movement and language. In particular, they may help to steer these sequences, while the basal ganglia assemble them in the context of changing sensory signals.

Or as set out here:

the prefrontal cortex, in all of its presumed functions, is neither sensory nor motor, but supports those processes that convey information in the central nervous system in a direction opposite to the classical one: not from input to output but conversely, by corollary discharges that modulate sensory systems in anticipation of future change.

We can see how the prefrontal cortex could be associated with the “will” – it provides a level of control that operates “internal” > “external”, as opposed to the more traditionally deterministic “external” > “internal”. Similarly, this paper argues that a key function of the frontal lobes is:

shaping of behaviour by activation of action requirements or goals specified at multiple levels of abstraction.

To be successful in controlling our actions, we need to represent complex tasks within our brains and carry out multiple goals and subgoals in appropriate sequence. This involves breaking down complex problems into manageable chunks. The frontal lobes play an active role in this control.

The frontal lobes have some interesting properties. You need your frontal lobes to perform advanced “human” acts, but we cannot seem to pin down functions to any one particular area (“there is no unitary executive function”). We seem to need many areas working cooperatively to perform complex tasks. There is also a large level of variation between individuals, but many traits are heritable.

So while we cannot say that “will” is another name for the brain matter of the frontal lobes, the frontal lobes or prefrontal cortex appears necessary but not sufficient for the action of the “will”.


To recap:

  • free will is the thought that we can control our choices and actions;
  • we can look at it in terms of a “will” that is “free”;
  • if we assume that human beings are embodied and material, the “will” can be primarily located within the brain; and
  • the “will” is likely associated with the functioning of the frontal lobes.

The Ambiguity of “Free”

I don’t think it is controversial to place many of the problems with “free will” at the feet of the term “free”. While consideration of the “will” raises spectres such as the non-unity of a static “self”, it is hard to disagree that our bodies encase some form of delineated object that can act in the world. “Free” on the other hand is a philosophical bag of worms.

You can discern that we might run into problems by just glancing at a dictionary definition. “Free” can be used as an adjective, an adverb or a verb. As an adjective it can have multiple meanings, including:

  • able to act or be done as one wishes; not under the control of another.
  • not or no longer confined or imprisoned.
  • not subject to engagements or obligations.
  • not subject to or affected by (something undesirable).
  • given or available without charge.
  • using or expending something without restraint; lavish.
  • not observing the normal conventions of style or form.

We find that freedom is often defined more by an absence of something as opposed to a positive property in itself. All of these definitions include “not” or “without”. Without constraint or restraint are we “free”?

Freedom also appears to be defined with reference to the outside: other people or external conditions. Is there such a thing as internal mental freedom?

These definitions also hint at a darker side to freedom. We can think of situations where acting without restraint or undesirable restrictions can be bad for us.

The freedom of the “will”

In “free will”, “free” is used as a modifier for the “will”. Being a modifier, it at first appears to place limitations on the will. But actually, we are using the modifier to stress that the will is not constrained or limited. “Free” is thus used as a negation: it stresses that the will is not “not free”. Its use implies that an “unfree will” is conceivable, rather than freedom being an inherent property of the will.

This is interesting because “free” makes more sense as meaning “not totally constrained and controlled” than it makes as meaning “completely unconstrained”.

One way in which the “will” seems “free” is that two different people experiencing the same set of external conditions may have quite different internal worlds. For example, they could have a differing internal voice, this in turn being implemented by different brain configurations. This is the ethos behind much of Stoic philosophy and more recent reflection.

This position requires that our internal realm, our internal voice of reason, has causal effect on our external circumstances. If it does not, then our external circumstances will define our actions, and the two different people will act in the same way. This thinking has been used over the years to neatly kill off some forms of dualism. If the “will” is rooted in matter, and our internal realm results from the operation of our brains, then there appears to be no reason why our actions cannot be driven by processes that have few external cues. This appears consistent with common sense: even if what we think does not always change how we act, we feel that what we think has at least some causal effect on how we act.

However, now we seem to have converted the issue of “freedom” into a question of time and space.

Free from External Constraint

Saying that the “will” is free of external conditions seems to be one way forward in analysing free will. Even if our external conditions are completely constrained, our internal mental world can act without constraint. However, this is not entirely true. The conditions inside the human body are not independent of those outside it: if we are starving, ill or on fire, this will surely have a causal effect on our internal processes. There is also variation between individuals – spiritual mastery is often associated with a looser coupling between external and internal conditions (e.g. the zen monk who lives a rich inner life despite stark outer conditions), while political thought on the left often assumes a strong coupling between these conditions.

The problems tend to arrive once we denounce dualism and see the will as matter. If our brains are matter that is configured by genetics and environment, then the “external” produces the “internal”. We can say that the two different people experiencing the same set of external conditions at one point in time have different internal worlds because they have experienced different external conditions at preceding points in time.

Twin studies provide a useful tool to investigate these questions. They often show that a certain proportion of our inner world is genetic (say 50% for certain factors). A common environment growing up constrains another proportion of variation (say 25%). The remaining variation (say 25%) is based on different experiences within the environment over time. Hence, while identical twins raised in the same family may not share the same thoughts, they are likely to have more correlated inner worlds than non-siblings raised in different environments. “Freedom” appears a matter of degree.

Degrees of Freedom

Often the “free” in “free will” is interpreted as a binary proposition: we are either free or we are not; we are either in control of our actions or we are not. However, this does not seem consistent with twin studies, or with an examination of how external conditions construct our internal realms.

It is maybe better to say that our choices are constrained, but not entirely predictable. Let’s look at some examples:

  • We can and cannot choose to starve ourselves. People have died from undertaking hunger strikes, but deaths from voluntary starvation are likely to number in the thousands, as compared to the billions of people who “choose” to eat every day. We appear to believe that we have the freedom of will to starve ourselves, but we also give in to the demands of hunger in 99% of cases.
  • Can you choose actions that determine your physical state? We cannot choose to be tall when we are short. We cannot choose the colour of our skin. We cannot change the basic blueprint of our body, but we may be able to exercise and work out to change muscle mass and fitness. But the effects of exercise may vary depending on our genetic configurations. We may work at jobs that require physical exertion and so do not require “leisure” training. Our upbringing may or may not value sports. We may have good levels of co-ordination or bad. We may or may not have commitments that take up our time.
  • Can you choose not to eat junk food? Our bodies are designed to crave high-calorie foods that were rare in our ancient lives: high-fat and high-sugar foods are ideal if you are starving. The problem is that they are not great over the long term when we are not starving. If your body is working against you to naturally pick junk food, is there freedom? If you can afford healthier options, and frequent establishments and social circles that stigmatise junk food, have you really chosen to avoid it?
  • Religion appears something that we can freely choose (in modern liberal democracies). But many people gain their religion from the environment of their upbringing, and even those that “rebel” against a particular religious upbringing need something to rebel against.
  • Compare a novel and a biography: in the former, a character has no choice, their actions are determined by the author; in the latter, at the time of their action, there subjectively seems to be a choice, but in the book everything is determined. Indeed, as we abstract over people in history subjective choice appears to disappear.
  • If we have two identical twins, who are brought up in the same environment by their genetic parents, we will likely see that their choices, over time, show correlations. With relation to each other are their choices less free than two entirely separate people?

Choice, when explored, appears fractal. Causal factors extend and unravel like twine. As you unpick one causal chain, you find others. Looking back with hindsight, actions often appear more determined than our subjective experience of them at the time.

So Maybe Not So Free

Our discussions above seem to suggest that we may not be as free as we feel. The fractal nature of causality, which expands as you look into it, seems to add constraint as we know more and go back further in time. Would human beings appear free to an omniscient being?

We can ask: why does this matter? As an answer, I’d argue that the thought that we lack freedom has throughout history been seen as extremely dangerous. If we have no freedom, then arguably we are not in control of our actions. Does this enable us to do whatever we want? Do social order and society break down?

Let’s park these difficulties for a moment while we look in more detail at how “free will” can manifest itself in the world. Maybe this can help us restore our freedom.

Free will and Probability

Let’s regroup:

  • free will is the thought that we can control our choices and actions;
  • we can look at it in terms of a “will” that is “free”;
  • if we assume that human beings are embodied and material, the “will” can be primarily located within the brain;
  • the “will” is likely associated with the functioning of the frontal lobes;
  • freedom seems to relate to a lack of constraint, where the constraint relates to conditions outside of our brains;
  • however, the conditions outside of our brains contribute to the construction of our brains and have a causal effect on what we think; and
  • when we look closely at the interactions between our external and internal worlds, we do not seem to be entirely without constraint.

Another way we can look at “freedom” is through the lens of probability. Probability may be thought of as a way of representing uncertainty when attempting to predict events.

Toss a coin but don’t look at it. It has landed one of two ways (ignore the edge). Probability involves thinking about what outcomes an event can take. In the coin case we have two: heads or tails. As there are two equally likely outcomes, we can say that each outcome has a 50% chance of occurring.

So how does probability relate to freedom?

If there is more than one outcome for an event we have a form of freedom. The event is not constrained to result in a single outcome. When predicting the coin toss we have a “choice” in which outcome to bet on.

However, probability can still provide a form of constraint. In our coin case we are constrained to heads or tails. We don’t posit that the coin will explode, disappear into a void or turn into a cat.

Human beings struggle with probability. Our struggles with probability are similar to our struggles with freedom. Using probability we can model outcomes of events where we cannot predict the outcome but we can predict the pattern of outcomes that may occur. Does the freedom of will work in a similar way?

Both the freedom of free will and probability require there to be multiple possible outcomes. In an entirely deterministic world we have no free will and no probability. A more sophisticated understanding of probability is that it represents a model of what we do not know. It sets limits on our knowledge. This is a more Bayesian approach – we say that a probability of 0.5 for a head in a coin toss indicates that we can be 50% confident in a head, or equally that we can be 50% confident that we will not get a head. The reality of our lives is that we will always have less than perfect knowledge. As we exist physically in time, and make our measurements physically in time, there will always be something we cannot measure. Indeed, the balance is very heavily weighted: we are only able to measure a tiny fraction of our current or past environments, and those measurements will always have some possible error.
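To make this Bayesian reading concrete, here is a toy sketch of how confidence in a coin’s bias can be updated as evidence arrives (a standard Beta-Binomial update; the function names and numbers are my own illustration, not anything from a library):

```python
# Toy Bayesian update: belief about a coin's bias as a Beta(a, b)
# distribution, updated by observed tosses (Beta-Binomial conjugacy).

def update(a, b, heads, tails):
    """Add observed counts to the prior pseudo-counts."""
    return a + heads, b + tails

def mean(a, b):
    """Expected probability of heads under a Beta(a, b) belief."""
    return a / (a + b)

# Start totally ignorant: Beta(1, 1) is uniform over all possible biases.
a, b = 1, 1
print(mean(a, b))  # 0.5 - maximum uncertainty

# Observe 8 heads and 2 tails: belief shifts, but never reaches certainty.
a, b = update(a, b, heads=8, tails=2)
print(mean(a, b))  # 0.75 - more confident, still uncertain
```

However much evidence arrives, the belief only approaches, and never reaches, 0 or 1 – a small formal echo of the claim that our knowledge always has limits.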

Coming back to free will with this interpretation of probability, we see that our brains can never with 100% confidence predict the future states of ourselves or our environment, let alone those of others. Indeed, we cannot with 100% confidence know the equivalent current or past states of ourselves. Also, if we see our “will” as “free” in that it has a choice, this choice requires multiple outcomes in a similar way to probability theory.  So both free will and probability involve multiple outcomes that we cannot predict with 100% confidence.


What does it mean to make a choice?

A choice exists at a time before action relating to the choice occurs. In certain cases, having a choice requires multiple outcomes that are not certain, i.e. there appears to be a possibility of each occurring. Does making a choice then consist of differentiated actions towards those outcomes? Does this definition hold for all aspects of choice? Does it require an external observer? Do these “outcomes” even exist in reality?

It would appear that a choice may involve only mental actions: most people would say you can choose to think of a situation in a particular way (the famous “glass half full”). Here an action may be as simple as holding a thought.

Can you have a choice without time? If a choice requires action then action would appear to require time. Also a choice would appear to require a particular point in time before any outcome occurs. We can read about a choice in the past, and the possibility of a choice can extend for a period of time, but once action is taken, or an outcome occurs, the choice appears to no longer exist. Does this mean that we are only possibly “free” if we exist in time?

Then there are studies like the Libet experiments. The “choice” in these experiments was when to move a finger. The outcomes of this choice were a set of times spread over a continuous range. The choice ended, at the latest, when brain activity was first detected and a series of actions was put in motion that led to the finger moving.

There is also an overlap in the way we use “free” to both refer to the “will” and a “choice”: we can have a “free” choice as well as a “free” will. In a free choice we can select any of the multiple outcomes. While it appears to suggest a uniform probability distribution (each outcome has an equal probability), this would only be the appearance to a totally ignorant observer. The more you knew about the situation, say the preferences of the chooser or the patterns of choice taken by a larger population, the more you can weight the different probabilities.

This can be carried across to our use of “free” when applied to the will. The will is maybe only totally free when observed by a completely ignorant observer. This is the opposite of an omniscient god or God. But as the observer knows more, and their ignorance decreases, we become less free. To an omniscient god or God we are not free. Harking back to the old Germanic root, for a god or God to truly love humanity, he’d need to be an ignorant fool.


We are not gods. This gives us our freedom.

Beings rooted in time and space will always have uncertainty. Uncertainty means we can never truly know how someone will act. But we are also not idiots. We can use knowledge to weight our beliefs about different outcomes. And we will be right some of the time. Just not all of the time.

Even a little uncertainty means that human beings have the benefit of the doubt in their actions. We cannot know for certain whether any one person will commit a crime. But we can allocate resources to reduce crime based on our knowledge. We also cannot know for certain why any crime was committed. But we can have hunches and theories. And if we need to select a reason, we can pick our best one. We should remember that our baseline is a blind guess. And the likelihood of a blind guess being right depends on the number of outcomes; the more complex our system, the more we need some form of knowledge.

So: Do We Have Free Will?

Let’s regroup:

  • free will is the thought that we can control our choices and actions;
  • we can look at it in terms of a “will” that is “free”;
  • if we assume that human beings are embodied and material, the “will” can be primarily located within the brain;
  • the “will” is likely associated with the functioning of the frontal lobes;
  • freedom relates to the knowledge of an observer;
  • we can be an observer of ourselves; and
  • we can never know anything with certainty.

Like quantum physics, the answer to the question of whether the will is free seems to depend on the observer. To an omniscient being that exists outside of time, we are not free. On the other hand, as human beings, we can never know for certain how things are going to turn out. It thus seems like the question regarding determinism and free will is not a helpful one. We are constrained by external circumstances in all we do. But there is also a universe of possibilities within our own minds. We can never know for certain how these will interact, or what the outcome will be. That does not mean that we cannot say what may be more or less likely.

So to say someone has “free will”, is maybe to say that their brains act in a way where even if we try to constrain or control factors outside of their body, there is still a large amount of uncertainty in our predictions of their future actions. This uncertainty is such that we cannot judge or take actions based on one predicted outcome. Our freedom lies in our necessary ignorance.



Sampling vs Prediction

Some things have recently been bugging me when applying deep learning models to natural language generation. This post contains my random thoughts on two of these: sampling and prediction. By writing this post, I hope to tease these apart in my head and help improve my natural language models.



Sampling is the act of selecting a value for a variable having an underlying probability distribution.

For example, when we toss a coin, we have a 50% chance of “heads” and a 50% chance of “tails”. In this case, our variable may be “toss outcome” or “t”, our values are “head” and “tail”, and our probability distribution may be modeled as [0.5, 0.5], representing the chances of “heads” and “tails”.

To make a choice according to the probability distribution we sample.

One way of sampling is to generate a (pseudo-)random number between 0 and 1 and to compare this random number with the probability distribution. For example, we can generate a cumulative probability distribution, where we create bands of values that sum to one from the probability distribution. In the present case, we can convert our probability distribution [0.5, 0.5] into a cumulative distribution [0.5, 1], where we say 0 to 0.5 is in the band “heads” and 0.5 to 1 is in the band “tails”. We then compare our random number with these bands, and the band the number falls into determines the value we select. E.g. if our random number is 0.2, we select the first band – 0 to 0.5 – and our variable value is “heads”; if our random number is 0.7, we select the second band – 0.5 to 1 – and our variable value is “tails”.

Let’s compare this example to the case of a weighted coin. Our weighted coin has a probability distribution of [0.8, 0.2]. This means that it is weighted so that 80% of tosses come out as “heads” and 20% of tosses come out as “tails”. If we use the same sampling technique, we generate a cumulative probability distribution of [0.8, 1], with the band 0 to 0.8 being associated with “heads” and the band 0.8 to 1 being associated with “tails”. Given the same random numbers of 0.2 and 0.7, we now get sample values of “heads” and “heads”, as both values fall within the band 0 to 0.8.
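The banding described above can be sketched in a few lines of Python (an illustrative implementation; the `sample` function is my own naming, not a library call):

```python
import random

def sample(outcomes, probs, r=None):
    """Sample an outcome by comparing a random number in [0, 1)
    against the cumulative probability bands."""
    if r is None:
        r = random.random()
    cumulative = 0.0
    for outcome, p in zip(outcomes, probs):
        cumulative += p
        if r < cumulative:
            return outcome
    return outcomes[-1]  # guard against floating point rounding

# Fair coin: bands are 0-0.5 ("heads") and 0.5-1 ("tails").
print(sample(["heads", "tails"], [0.5, 0.5], r=0.2))  # heads
print(sample(["heads", "tails"], [0.5, 0.5], r=0.7))  # tails

# Weighted coin: bands are 0-0.8 ("heads") and 0.8-1 ("tails").
print(sample(["heads", "tails"], [0.8, 0.2], r=0.2))  # heads
print(sample(["heads", "tails"], [0.8, 0.2], r=0.7))  # heads
```

Passing `r` explicitly just makes the banding visible; in practice you would omit it and let each call draw its own random number.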

Now that all seems a bit long-winded. Why have I repeated the obvious?

Because most model architectures that act to predict or generate text skip over how sampling is performed. Looking at the code associated with the models, I would say 90% of cases generate an array of probabilities for each time step, and then take the maximum of this array (e.g. using numpy.argmax(array)). This array is typically the output of a softmax layer, and so sums to 1. It may thus be taken as a probability distribution. So our sampling often takes the form of a greedy decoder that just selects the highest probability output at each time step.

Now, if our model consistently outputs probabilities of [0.55, 0.45] to predict a coin toss, then based on the normal greedy decoder, our output will always be “heads”. Repeated 20 times, our output array would be 20 values of “heads”. This seems wrong – our model is saying that over 20 repetitions, we should expect around 11 “heads” and 9 “tails”.
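To see the mismatch concretely, here is a sketch comparing a greedy decoder with true sampling for the [0.55, 0.45] model (the sampled counts will, of course, vary from run to run):

```python
import random

probs = [0.55, 0.45]  # model output at every step: P("heads"), P("tails")

# Greedy decoding: always take the argmax, so every toss comes out "heads".
greedy = ["heads" if probs[0] >= probs[1] else "tails" for _ in range(20)]
print(greedy.count("heads"))  # 20

# Proper sampling: draw each toss from the distribution itself.
sampled = random.choices(["heads", "tails"], weights=probs, k=20)
print(sampled.count("heads"))  # roughly 11, varying between runs

# What the model actually claims about 20 repetitions:
expected_heads = probs[0] * 20  # ~11 of the 20 tosses
```

The greedy decoder collapses the distribution to its mode; sampling reproduces the pattern of outcomes the model was trained to represent.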

Let’s park this for a moment and look at prediction.


With machine learning frameworks such as TensorFlow and Keras you create “models” that take an input “tensor” (a multidimensional array). Each “model” applies a number of transformations, typically via one or more neural networks, to output another “tensor”. The generation of an output is typically referred to as a “prediction”.

A “prediction” can thus be deemed an output of a machine learning model, given some input.

To make an accurate prediction, a model needs to be trained to determine a set of parameter values for the model. Often there are millions of parameters. During training, predictions are made based on input data. These predictions are compared with “ground truth” values, i.e. the actual observed/recorded/measured output values. Slightly confusingly, each pair of “input-ground truth values” is often called a sample. However, these samples have nothing really to do with our “sampling” as discussed above.

For training, loss functions for recurrent neural networks are typically based on cross entropy loss. This compares the “true” probability distribution, in the form of a one-hot array for the “ground truth” output token (e.g. for “heads” – [1, 0]), with the “probability distribution” generated by the model (which may be [0.823, 0.177]), where cross entropy is used to compute the difference. This post by Rob DiPietro has a nice explanation of cross entropy. The difference filters back through the model to modify the parameter values via back-propagation.
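For the single example above, the cross entropy between the one-hot “ground truth” [1, 0] and the model output [0.823, 0.177] reduces to the negative log probability assigned to the true class. A minimal sketch (the function name is my own):

```python
import math

def cross_entropy(true_dist, predicted_dist):
    """H(p, q) = -sum(p * log(q)). With a one-hot p, this is simply
    the negative log of the probability assigned to the true class."""
    return -sum(p * math.log(q)
                for p, q in zip(true_dist, predicted_dist) if p > 0)

loss = cross_entropy([1, 0], [0.823, 0.177])
print(round(loss, 4))  # 0.1948 - only the "heads" term contributes
```

As the model’s probability for the true class approaches 1, this loss approaches 0, which is why training with one-hot targets pushes the output distribution towards a spike.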

Sampling vs Prediction

Now we have looked at what we mean by “sampling” and “prediction” we come to the nub of our problem: how do these two processes interact in machine learning models?

One problem is that in normal English usage, a “prediction” is seen as the choice of a discrete output. For example, if you were asked to “predict” a coin toss, you would say “heads” or “tails”, not provide me with the array “[0.5, 0.5]”.

Hence, “prediction” in normal English actually refers to the output of the “sampling” process.

This means that, in most cases, when we are “predicting” with a machine learning model (e.g. using `model.predict(input)` in Keras), we are not actually “predicting”. Instead, we are generating an instance of the probability distribution given a set of current conditions. In particular, if our model has a softmax layer at its output, we are generating a conditional probability distribution, where the model is “conditioned” based on the input data. In a recurrent neural network model (such as an LSTM or GRU based model), at each time step, we are generating a conditional probability distribution, where the model is “conditioned” based on the input data and the “state” parameters of the neural network.
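To illustrate the per-step conditioning, here is a sketch of a decoding loop that samples each token from a conditional distribution. The lookup table is a hypothetical stand-in for what `model.predict` would return at each step of a real recurrent model:

```python
import random

# A mocked conditional model: returns P(next token | previous token).
# A real system would call model.predict(state) here; this table is a
# hypothetical stand-in just to show the shape of the decoding loop.
CONDITIONAL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the":     {"cat": 0.5, "dog": 0.5},
    "a":       {"cat": 0.7, "dog": 0.3},
    "cat":     {"<end>": 1.0},
    "dog":     {"<end>": 1.0},
}

def decode(max_steps=10):
    """Generate a sequence by sampling each step from the conditional
    distribution, rather than greedily taking the argmax."""
    token, output = "<start>", []
    for _ in range(max_steps):
        dist = CONDITIONAL[token]
        token = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if token == "<end>":
            break
        output.append(token)
    return output

print(decode())  # e.g. ["the", "cat"] - varies between runs
```

Each step is conditioned on the token just produced, so the loop traces one path through the model’s conditional distributions rather than the single most probable path.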

As many have suggested, I find it useful to think of machine learning models, especially those using multiple layers (“deep” learning), as function approximators. In many cases, a machine learning model is modeling the probability function for a particular context, where the probability function outputs probabilities across a discrete output variable space. For a coin toss, our discrete output variable space is “heads” or “tails”; for language systems, it is the set of possible output tokens, e.g. the set of possible characters or words. This framing is useful, because our problem then becomes: how do we make a model that accurately approximates the probability distribution function? Note this is different from: how do we generate a model that accurately predicts future output tokens?

These different interpretations of the term “prediction” matter less if our output array has most of its mass on one value, i.e. each “prediction” yields an output that is close to a one-hot vector. Indeed, this is what the model is trying to fit if we supply our “ground truth” outputs as one-hot vectors. In this case, taking a random sample and taking the maximum probability value will mostly produce the same result.

I think some of the confusion arises because different people think about the model output in different ways. If the output is seen as a normalized score, then it makes sense to pick the highest score. However, if the output is seen as a probability distribution, this makes less sense. Applying a softmax computation to an output does produce a valid probability distribution, so I think we should be treating our model as approximating a probability distribution function and looking at sampling properly.

Another factor is the nature of the “conditioning” of the probability distribution. This can be thought of as: what constraints are there on the resulting distribution? A fully constrained probability distribution becomes deterministic: only one output can be correct, and it has a probability of one. In that case the argmax coincides with a random sample; sampling will still return the correct output. But if we can never truly know or model all of our constraints, then we can never obtain such a deterministic setup.
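This degenerate case can be checked in a couple of lines (again with a hand-written stand-in distribution, not output from a real model): when all the probability mass sits on one outcome, argmax and random sampling necessarily agree.

```python
import random

vocab = ["heads", "tails"]
deterministic = [1.0, 0.0]  # fully constrained: all mass on one outcome

# Argmax selection: pick the index of the largest probability.
argmax_token = vocab[deterministic.index(max(deterministic))]

# Random sampling: with a degenerate distribution, "tails" has zero mass,
# so the sample always lands on "heads" too.
sampled_token = random.choices(vocab, weights=deterministic, k=1)[0]
```

Only once the distribution flattens out do the two selection rules start to diverge.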

Deterministic vs Stochastic

The recent explosion in deep learning approaches was initially driven by image classification. In these cases, you have an image as an input, and you are trying to ascertain what is in the image (or in a pixel). This selection is independent of time and reduces to a deterministic choice: there either is a dog in the image or there is not. The probability distribution output by our image classification model thus indicates more of a confidence in our choice.

Language, though, is inherently stochastic: there are multiple interchangeable “right” answers. Language also unfolds over time, which amplifies the stochasticity.

Thinking about a single word choice: language models will generally produce flatter conditional probability distributions for certain contexts. For example, you can often use different words and phrases to mean the same thing (synonyms, say, or noun phrases of different lengths). The length of the sequence is itself also a random variable.

In these cases, the difference between randomly sampling the probability distribution and taking the maximum value of our probability array becomes more pronounced. Cast your mind back to our coin toss example: if we are accurately modeling the probability distribution, then argmax simply doesn’t work – we have two equally likely choices, and we have to hack our decoding function to break the tie. Sampling from the probability distribution, however, works.
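The coin toss failure mode is easy to demonstrate empirically (a small stdlib sketch; the tie-breaking behaviour shown is Python's `list.index`, which always returns the first maximal entry):

```python
import random

vocab = ["heads", "tails"]
probs = [0.5, 0.5]

# Argmax decoding: the tie is silently broken in favour of the first entry,
# so every single "prediction" comes out as "heads".
argmax_outcomes = [vocab[probs.index(max(probs))] for _ in range(10_000)]

# Sampling reproduces the distribution: roughly half heads, half tails.
random.seed(0)
sampled_outcomes = random.choices(vocab, weights=probs, k=10_000)
heads_fraction = sampled_outcomes.count("heads") / 10_000
```

An accurate model of a fair coin decoded with argmax produces output that looks nothing like a fair coin; decoded by sampling, it does.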

A key point people often miss is that we can very accurately model the probability distribution, while producing outputs that do not match “ground truth values”.

What does this all mean?

Firstly, it means that when selecting an output token at each time step, we need to be properly sampling rather than taking the maximum value.

Secondly, it means we have to pay more attention to how we sample the output of our models – for example, using beam search or the Viterbi algorithm to perform our sampling and generate our output sequence. It also suggests more left-field ideas, such as building sampling into the model itself rather than leaving it as an external step.
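One simple knob in this design space – my illustration here, not a technique proposed above – is temperature-scaled sampling, which interpolates between argmax-like and fully random decoding. A minimal sketch:

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0):
    """Rescale a probability distribution and sample an index from it.

    temperature < 1 sharpens the distribution (behaviour closer to argmax);
    temperature > 1 flattens it (behaviour closer to uniform sampling).
    Assumes all probabilities are strictly positive.
    """
    logits = [math.log(p) / temperature for p in probs]
    scale = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - scale) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=weights, k=1)[0]

# Sharpened sampling from an illustrative output distribution.
idx = sample_with_temperature([0.823, 0.177], temperature=0.5)
```

At temperature 1.0 this is ordinary sampling from the model's distribution; as the temperature tends to zero it collapses towards argmax, which makes the trade-off discussed above explicit and tunable.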

Thirdly, we need to look at the loss functions in our language models. At one extreme, you have discriminator-style loss functions, where we make a binary prediction of whether the output looks right; at the other extreme, you have teacher forcing on individual tokens. The former is difficult to train (how do you know what makes a “good” output and move towards it?), but the latter tends to memorize the input, or struggles to produce coherent sequences of tokens that resemble natural language. If we are trying to model a probability distribution, should our loss be based on trying to replicate the distribution of our input? Is a one-hot encoding the right way to do this?

Of course, I could just be missing the point of all of this. However, the generally poor performance of many generative language models (look at the real output, not the BLEU scores!) suggests there is some low-hanging fruit that just requires a shift in perspective.