Getting All the Books

This is a short post explaining how to obtain over 50,000 books as plain text files for your natural language processing projects.

The source of these books is the excellent Project Gutenberg.

Project Gutenberg offers the ability to sync the collection of books using rsync. To obtain the collection you can set up a private mirror as explained here. However, I’ve found that a couple of tweaks to the rsync setup can be useful.

First, you can use the --list-only option in rsync to obtain a list of the files that will be synced. Based on this random GitHub issue comment, I initially used the command below to generate a list of the files on the UK mirror server (based at the University of Kent):
rsync -av --list-only rsync.mirrorservice.org::gutenberg.org | awk '{print $5}' > log_gutenberg
(The piping via awk simply takes the 5th column of the list output.)

This file list is around 80MB. We can use this list to add some filters to the rsync command.

On the server, books are stored as .txt files. Helpfully, each text file also has a compressed .zip version. Only syncing the .zip files will help to reduce the amount of data that is downloaded. We can either access the .zip files programmatically, or run a script to uncompress them (the former is preferred to save disk space).

Some books have accompanying HTML files and/or alternate encodings. We only need ASCII encodings for now. We can thus ignore any file with a dash (-) in its name (HTML files are *-h* and are zipped; encodings are *-[number].* files).

A book also sometimes has an old folder containing old versions and other rubbish. We can ignore this (as per here). We can use the -m flag to prune empty directories (see here for more details on rsync options).

Also there are some stray .zip files that contain audio readings of books. We want to avoid these as they can be hundreds of megabytes each. We can thus add an upper size limit of about 10MB (most book files are hundreds of KB).

We can use the --include and --exclude flags in a particular order to filter the files – we first include all subdirectories then exclude files we don’t want before finally only including what we do want.

Bringing this all together gives us the following rsync command-line (i.e. shell) command:

rsync -avm \
--max-size=10m \
--include="*/" \
--exclude="*-*.zip" \
--exclude="*/old/*" \
--include="*.zip" \
--exclude="*" \
rsync.mirrorservice.org::gutenberg.org ~/data/gutenberg

This syncs the data/gutenberg folder in our home directory with the Kent mirror server. All in all we have about 8GB.

The next steps are then to generate a quick Python wrapper that navigates the directory structure and unzips the files on the fly. We also need to filter out non-English texts and remove the standard Project Gutenberg text headers.
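
For what it’s worth, here is a rough sketch of what such a wrapper might look like (a sketch only – the function name is mine and the layout assumptions follow the sync above):

import os
import zipfile

def iter_book_texts(root="~/data/gutenberg"):
    # Walk the synced mirror and yield the text of each zipped book in turn,
    # decompressing in memory rather than unzipping to disk.
    root = os.path.expanduser(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".zip"):
                continue
            try:
                with zipfile.ZipFile(os.path.join(dirpath, filename)) as zf:
                    for member in zf.namelist():
                        if member.endswith(".txt"):
                            yield zf.read(member).decode("utf-8", errors="replace")
            except zipfile.BadZipFile:
                continue  # skip any corrupt downloads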

There is a useful GUTINDEX.ALL text file which contains a list of each book and its book number. This can be used to determine the correct path (e.g. book 10000 has a path of 1/0/0/0/10000). The index text file also indicates non-English books, which we could use to filter the books. One option is to create a small SQL database which stores title and path information for English books. It would also be useful to filter fiction from non-fiction, but this may need some clever in-text classification.
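
A quick sketch of that path rule (my own helper, based only on the example above; the layout for very low book numbers may differ on the mirror):

def book_path(book_number):
    # e.g. book_path(10000) -> "1/0/0/0/10000"
    digits = str(book_number)
    if len(digits) == 1:
        return "0/" + digits  # assumption: single-digit books sit under 0/
    return "/".join(digits[:-1]) + "/" + digits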

So there we are, we have a large folder full of books written before 1920ish, including some of the greatest books ever written (e.g. The Brothers Karamazov and Anna Karenina).

Taming the Docker Blob

Or understanding how to best use Docker.

Docker is a great way to build services with modular and changeable components without borking your server / computer. I like to think of Docker containers as a system version of Python’s virtual environment – you can build a stack of services and applications through a Dockerfile, and then easily use this on different computers.

However, Docker can become an amorphous blob if you are not careful. Containers and volumes can multiply, until your computer starts freezing because you have used up 100GB of space.

There are two tricks I have learnt to manage my Docker-based systems:

  • Working out the difference between images and containers, and understanding the lifecycle of the latter; and
  • Clever use of volumes.

Images

Images are like ISO disk images, the difference being that they are built layer-by-layer such that layers may be shared between images. An image may be thought of as a class definition.

Images are created when you issue a docker build command. To organise them, make sure you build an image using the -t tag option (e.g. -t image_name). Images are normally identified by an ID in the form of a hash, so giving your image a name is useful. To view the images on your computer use: docker images.

Containers

Containers are the computers that are created from images. They can be thought of as virtual machines, or as instances of a class definition.

One image may be used to create multiple container instances. A container is created when you use the docker run command and pass an image name. I recommend also using the --name option to create a name for your container, e.g. --name my_container.

Running containers may be viewed using the docker ps command. What took me a while to work out is that stopped and exited containers are also around. These can be viewed by using the -a option, i.e. docker ps -a.

Another tip is to use the --rm flag to automatically remove temporary containers after use. Beware though: removing a container will also delete all data generated during the running of the container, unless that data is stored in a separate volume.

Containers are really designed to be run as continual background processes. If you are working on a desktop or laptop you may want to turn your machine off and on again. If you exit a container, you can restart it using docker restart my_container.

Volumes

Volumes are chunks of file system that are handled by Docker. They can be connected into multiple containers.

It’s good practice to explicitly create a volume, so that it is easier to keep track of. To do this use the docker volume create vol_name command.

I had a problem where I had very large databases that were filling up a local solid state drive (SSD), and that I wanted to move to a 1TB hard drive. The easiest way I found to store volumes on a different drive is to create a symlink (by right-clicking in the Nautilus file manager or using ln -s) to the location where you want to store the data (e.g. /HDD/docker_volumes/vol_1) and then rename the link to match the proposed volume name (e.g. vol_1). Copy and paste this into the Docker volumes directory (typically /var/lib/docker/volumes) and then create the volume, e.g. docker volume create vol_1. The volume will now be managed by Docker but the data will be stored in the linked folder.

To use a volume with a container use the -v flag with the docker run command, e.g. docker run -v vol_1:path_in_container --name my_container my_image.

Checking Disk Usage

Once you get the hang of all this, a good check on disk usage may be performed using docker system df -v. This provides a full output showing your Docker disk usage.

A Mongo Example

Here is an example that pulls this all together. The situation is that we want to store some data in a Mongo database. Instead of installing Mongo locally we can use the mongo Docker image.

Now through detective work (docker inspect mongo_image) I worked out that the mongo Docker image is designed to work with two volumes: one that is mapped to the database directory /data/db in a container; and one that is mapped to the configuration database directory /data/configdb. If you don’t explicitly specify volumes to map to these locations, Docker creates them locally. Now these directories can grow quite large. I thus created two Docker volumes mongo_data and mongo_config with symlinks to a larger hard disk drive as described above.

To download the image and start a container with your data you can then use: docker run -v mongo_data:/data/db -v mongo_config:/data/configdb -p 27017:27017 --name mongo_container mongo. The -p flag maps the local port 27017 to the exposed Mongo port of the container, so we can connect to the Mongo database using localhost and the default port.

If you stop the container (e.g. to restart or switch off your server/computer), you can restart it using docker restart mongo_container. Even if you accidentally delete the container, you still keep the data, which is stored in the `mongo_data` volume. (Although I recommend backing up that data just in case you aggressively prune the wrong volumes!)

Sampling vs Prediction

Some things have recently been bugging me when applying deep learning models to natural language generation. This post contains my random thoughts on two of these: sampling and prediction. By writing this post, I hope to try to tease these apart in my head to help improve my natural language models.

Sampling

Sampling is the act of selecting a value for a variable having an underlying probability distribution.

For example, when we toss a coin, we have a 50% chance of “heads” and a 50% chance of “tails”. In this case, our variable may be “toss outcome” or “t”, our values are “head” and “tail”, and our probability distribution may be modeled as [0.5, 0.5], representing the chances of “heads” and “tails”.

To make a choice according to the probability distribution we sample.

One way of sampling is to generate a (pseudo-) random number between 0 and 1 and to compare this random number with the probability distribution. For example, we can generate a cumulative probability distribution, where we create bands of values that sum to one from the probability distribution. In the present case, we can convert our probability distribution [0.5, 0.5] into a cumulative distribution [0.5, 1], where we say 0 to 0.5 is in the band “heads” and 0.5 to 1 is in the band “tails”. We then compare our random number with these bands, and the band the number falls into determines the value we select. E.g. if our random number is 0.2, we select the first band – 0 to 0.5 – and our variable value is “heads”; if our random number is 0.7, we select the second band – 0.5 to 1 – and our variable value is “tails”.

Let’s compare this example to a case of a weighted coin. Our weighted coin has a probability distribution of [0.8, 0.2]. This means that it is weighted so that 80% of tosses come out as “heads”, and 20% of tosses come out as “tails”. If we use the same sampling technique, we generate a cumulative probability distribution of [0.8, 1], with a band of 0 to 0.8 being associated with “heads” and a band of 0.8 to 1 being associated with “tails”. Given the same random numbers of 0.2 and 0.7, we now get sample values of “heads” and “heads”, as both values fall within the band 0 to 0.8.
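
A minimal sketch of that band-based sampling (plain Python, with names of my own choosing):

import random

def sample(outcomes, probabilities):
    # Compare a random number in [0, 1) against cumulative probability bands.
    r = random.random()
    cumulative = 0.0
    for outcome, p in zip(outcomes, probabilities):
        cumulative += p
        if r < cumulative:
            return outcome
    return outcomes[-1]  # guard against floating point rounding

# sample(["heads", "tails"], [0.8, 0.2]) returns "heads" roughly 80% of the time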

Now that all seems a bit long winded. Why have I repeated the obvious?

Because most model architectures that act to predict or generate text skip over how sampling is performed. Looking at the code associated with the models, I would say 90% of cases generate an array of probabilities for each time step, and then take the maximum of this array (e.g. using numpy.argmax(array)). This array is typically the output of a softmax layer, and so sums to 1. It may thus be taken as a probability distribution. So our sampling often takes the form of a greedy decoder that just selects the highest probability output at each time step.

Now, if our model consistently outputs probabilities of [0.55, 0.45] to predict a coin toss, then based on the normal greedy decoder, our output will always be “heads”. Repeated 20 times, our output array would be 20 values of “heads”. This seems wrong – our model is saying that over 20 repetitions, we should have 11 “heads” and 9 “tails”.
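
To see the difference, here is a sketch comparing a greedy decoder with proper sampling on that [0.55, 0.45] output (numpy.random.choice does the band comparison for us):

import numpy as np

probs = np.array([0.55, 0.45])        # model output for one coin toss
outcomes = np.array(["heads", "tails"])

greedy = [outcomes[np.argmax(probs)] for _ in range(20)]
# -> 20 x "heads"

sampled = np.random.choice(outcomes, size=20, p=probs)
# -> roughly 11 "heads" and 9 "tails" on average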

Let’s park this for a moment and look at prediction.

Prediction

With machine learning frameworks such as Tensorflow and Keras you create “models” that take an input “tensor” (a multidimensional array). Each “model” applies a number of transformations, typically via one or more neural networks, to output another “tensor”. The generation of an output is typically referred to as a “prediction”.

A “prediction” can thus be deemed an output of a machine learning model, given some input.

To make an accurate prediction, a model needs to be trained to determine a set of parameter values for the model. Often there are millions of parameters. During training, predictions are made based on input data. These predictions are compared with “ground truth” values, i.e. the actual observed/recorded/measured output values. Slightly confusingly, each pair of “input-ground truth values” is often called a sample. However, these samples have nothing really to do with our “sampling” as discussed above.

For training, loss functions for recurrent neural networks are typically based on cross entropy loss. This compares the “true” probability distribution, in the form of a one-hot array for the “ground truth” output token (e.g. for “heads” – [1, 0]), with the “probability distribution” generated by the model (which may be [0.823, 0.177]), where cross entropy is used to compute the difference. This post by Rob DiPietro has a nice explanation of cross entropy. The difference filters back through the model to modify the parameter values via back-propagation.

Sampling vs Prediction

Now we have looked at what we mean by “sampling” and “prediction” we come to the nub of our problem: how do these two processes interact in machine learning models?

One problem is that in normal English usage, a “prediction” is seen as the choice of a discrete output. For example, if you were asked to “predict” a coin toss, you would say one of “heads” or “tails”, not provide me with the array “[0.5, 0.5]”.

Hence, “prediction” in normal English actually refers to the output of the “sampling” process.

This means that, in most cases, when we are “predicting” with a machine learning model (e.g. using `model.predict(input)` in Keras), we are not actually “predicting”. Instead, we are generating an instance of the probability distribution given a set of current conditions. In particular, if our model has a softmax layer at its output, we are generating a conditional probability distribution, where the model is “conditioned” based on the input data. In a recurrent neural network model (such as an LSTM or GRU based model), at each time step, we are generating a conditional probability distribution, where the model is “conditioned” based on the input data and the “state” parameters of the neural network.

As many have suggested, I find it useful to think of machine learning models, especially those using multiple layers (“deep” learning), as function approximators. In many cases, a machine learning model is modeling the probability function for a particular context, where the probability function outputs probabilities across an output discrete variable space. For a coin toss, our output discrete variable space is “heads” or “tails”; for language systems, our output discrete variable space is based on the set of possible output tokens, e.g. the set of possible characters or words. This is useful, because our problem then becomes: how do we make a model that accurately approximates the probability distribution function? Note this is different from: how do we generate a model that accurately predicts future output tokens?

These different interpretations of the term “prediction” matter less if our output array has most of its mass under one value, e.g. for each “prediction” you get an output that is similar to a one-hot vector. Indeed, this is what the model is trying to fit if we supply our “ground truth” output as one-hot vectors. In this case, taking a random sample and taking the maximum probability value will work out to be mostly the same.

I think some of the confusion occurs as different people think about the model prediction in different ways. If the prediction is seen to be a normalized score, then it makes more sense to pick the highest score. However, if the prediction is seen to be a probability distribution, then this makes less sense. Using a softmax computation on an output does result in a valid probability distribution. So I think we should be treating our model as approximating a probability distribution function and looking at sampling properly.

Another factor is the nature of the “conditioning” of the probability distribution. This can be thought of as: what constraints are there on the resultant distribution. A fully constrained probability distribution becomes deterministic: only one output can occur, and it has a probability of one. In that case the argmax and a random sample give the same answer, and random sampling is still correct. If we can never truly know or model our constraints, then we can never obtain a deterministic setup.

Deterministic vs Stochastic

The recent explosion in deep learning approaches was initially driven by image classification. In these cases, you have an image as an input, and you are trying to ascertain what is in an image (or a pixel). This selection is independent of time and reduces to a deterministic selection: there is either a dog in the image or there is not a dog in the image. The probability distribution output by our image classification model thus indicates more a confidence in our choice.

Language, though, is inherently stochastic: there are multiple interchangeable “right” answers. Also language unfolds over time. This amplifies the stochasticity.

Thinking about a single word choice, language models will generally produce flatter conditional probability distributions for certain contexts. For example, you can often use different words and phrases to mean the same thing (synonyms for example, or different length noun phrases). Also, the length of the sequence is itself a random variable.

In this case, the difference between randomly sampling the probability distribution and taking the maximum value of our probability array becomes more pronounced. Cast your mind back to our coin toss example: if we are accurately modeling the probability distribution then actually argmax doesn’t work – we have two equally likely choices and we have to hack our decoding function to make a choice when we have equal probabilities. However, sampling from the probability distribution works.

A key point people often miss is that we can very accurately model the probability distribution, while producing outputs that do not match “ground truth values”.

What does this all mean?

Firstly, it means that when selecting an output token for each time step, we need to be properly sampling rather than taking the maximum value.

Secondly, it means we have to pay more attention to the sampling of output in our models. For example, using beam search or the Viterbi algorithm to perform our sampling and generate our output sequence. It also suggests more left-field ideas, like how could we build sampling into our model, rather than have it as an external step.

Thirdly, we need to look at the loss function in our language models. At one extreme, you have discriminator-style loss functions, where we perform a binary prediction of whether the output looks right; at the other extreme, you have teacher forcing on individual tokens. The former is difficult to train (how do you know what makes a “good” output and how do you move towards it?), but the latter tends to memorize an input, or struggle to produce coherent sequences of tokens that resemble natural language. If we are trying to model a probability distribution, then should our loss be based on trying to replicate the distribution of our input? Is a one-hot encoding the right way to do this?

Of course, I could just be missing the point of all of this. However, the generally poor performance of many generative language models (look at the real output not the BLEU scores!) suggests there is some low-hanging fruit that just requires some shift in perspective.

Understanding Convolution in Tensorflow

This is a quick post intended to help those trying to understand convolution as applied in Tensorflow.

There are many good blog posts on the Internet explaining convolution as applied in convolutional neural networks (CNNs), e.g. see this one by Denny Britz. However, understanding the theory is one thing; knowing how to implement it is another. This is especially the case when trying to apply CNNs to word or character-level natural language processing (NLP) tasks – here the image metaphors break down a little.

I generally use Tensorflow for neural network modelling. Most of the stuff I want to do is a little bespoke, so I need something a little more expressive than Keras. Two dimensional convolution is explained in the Tensorflow documents here. I also found the examples in the stackoverflow answer here very useful.

To summarise, for our input we have a [a, b, c, d] tensor – i.e. a x b x c x d.

  • a is the number of input ‘images’ or examples (this will typically be your batch size);
  • b is the input width (e.g. image width, max. word length in characters or max. sequence length in words);
  • c is the input height (e.g. image height or embedding dimensionality); and
  • d is the number of input channels (grayscale images, characters or words = 1, RGB = 3).

For the convolution kernel filter you also have a [o, p, q, r] tensor – i.e. o x p x q x r.

  • o is the filter width (e.g. patch width or ‘n-gram’ size);
  • p is the filter height (e.g. patch height or embedding dimensionality);
  • q is the number of channels (from the input – i.e. input channels); and
  • r is the number of filters (or output channels).

q basically has to match d. r – the number of output channels – is equal to the number of filters you want. For each output channel, a different filter of width * height will be created and applied.

Most of the time the theory is talking about images of b x c, and a filter of o x p.

(This follows Denny’s configuration. However, I note you can transpose b and c and o and p and get the same outcome.)

For NLP the number of output channels becomes an important parameter. This is because you will typically max-pool over the sequence (e.g. word or character) length, such that you get one value for the complete sequence per filter. Each filter can be thought of as representing something like an n-gram, e.g. the learnt parameters of one filter of length 3 could represent the character embeddings for the suffix “ly” (e.g. [l_embedding, y_embedding, wordend_embedding]) or the prefix “un” (including word start token).
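
To make those shapes concrete, here is a minimal sketch (shapes follow the letters above; the exact call signatures may vary slightly between TensorFlow versions):

import numpy as np
import tensorflow as tf

a, b, c, d = 32, 50, 128, 1   # batch, sequence length, embedding dim, input channels
o, p, q, r = 3, 128, 1, 100   # 'n-gram' width, filter height (= c), channels (= d), filters

inputs = tf.constant(np.random.randn(a, b, c, d), dtype=tf.float32)
filters = tf.constant(np.random.randn(o, p, q, r), dtype=tf.float32)

conv = tf.nn.conv2d(inputs, filters, strides=[1, 1, 1, 1], padding="VALID")
# conv has shape [32, 48, 1, 100]: one value per n-gram position per filter

pooled = tf.reduce_max(conv, axis=1)
# max-pooling over the sequence: shape [32, 1, 100], one value per filter per example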

I found it instructive to work through the code associated with Denny’s further blog post here.

Practical Problems with Natural Language Processing

Or the stuff no one tells you that is actually quite hard.

Recently I’ve been playing around with the last 15 years of patent publications as a ‘big data’ source. This includes over 4 million individual documents. Here I thought I’d highlight some problems I faced. I found that a lot of academic papers tend to ignore or otherwise bypass this stuff.

Sentence Segmentation

Many recurrent neural network (RNN) architectures work with sentences as an input sequence, where the sentence is a sequence of word tokens. This introduces a first problem: how do you get your sentences?

A few tutorials and datasets get around this by providing files where each line in the file is a separate sentence. Hence, you can get your sentences by just reading the list of file lines.

In my experience, the only data where file lines are useful is code. For normal documents, there is no correlation between file lines and sentences; indeed, each sentence is typically of a different length and so is spread across multiple file lines.

In reality, text for a document is obtained as one large string (or at most a set of paragraph tags). This means you need a function that takes your document and returns a list of sentences: s = sent_tokenise(document).

A naive approach is to tokenise based on full stops. However, this quickly runs into issues with abbreviations: “U.S.”, “No.”, “e.g.” and the like will cut your sentences too soon.

The Python NLTK library provides a sentence tokeniser function out-of-the-box – sent_tokenize(text). It appears this is slightly more intelligent than a simple naive full stop tokenisation. However, it still appears to cut sentences too early based on some abbreviations. Also, optical character recognition errors, such as “,” instead of “.”, or variable names, such as “var.no.1”, will give you erroneous tokenisation.

[Image: sentence_error_2017-06-21]

One option to resolve this is to train a pre-processing classifier to identify (i.e. add) <end-of-sentence> tokens. This could work at the word token level, as the NLTK word tokeniser does appear to extract abbreviations, websites and variable names as single word units.

You can train the Punkt sentence tokenizer – see http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt and https://stackoverflow.com/questions/21160310/training-data-format-for-nltk-punkt. One option is to try training the Punkt sentence tokenizer on patent text.
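
As a rough sketch of that training route (variable names like raw_patent_text and document_text are placeholders):

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True   # learn collocations, which helps with abbreviations
trainer.train(raw_patent_text)       # raw_patent_text: a large unlabelled string of patent text
tokenizer = PunktSentenceTokenizer(trainer.get_params())

sentences = tokenizer.tokenize(document_text)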

Another option is to implement a deep learning segmenter trained on labelled data – e.g. starting from here – https://hal.archives-ouvertes.fr/hal-01344500/document. You could have a sequence in and control labels out (i.e. a sequence-to-sequence system), or even character-based labelling using a window of 10-12 characters around each character. This could use a simple feed-forward network. The problem with this approach is that you would need a labelled dataset – ideally we would like an unsupervised approach.

Another option is to filter an imperfect sentence tokeniser to remove one or two word sentences.

Not All File Formats are Equal

An aside on file formats. The patent publication data is supplied as various forms of compressed file. One issue I had was that it was relatively quick and easy to access data in a nested zip file (e.g. multiple layers down – using Python’s zipfile). Zip files could be accessed as hierarchies of file objects. However, this approach didn’t work with tar files; for these I needed to extract the whole file into memory before I could access the contents. This resulted in ‘.tar’ files taking up to 20x longer to access than ‘.zip’ files.
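
For reference, the nested zip access looks something like this (the function and argument names here are hypothetical):

import io
import zipfile

def read_from_nested_zip(outer_path, inner_zip_name, member_name):
    # Open a zip stored inside another zip without extracting either to disk.
    with zipfile.ZipFile(outer_path) as outer:
        inner_bytes = io.BytesIO(outer.read(inner_zip_name))
        with zipfile.ZipFile(inner_bytes) as inner:
            return inner.read(member_name)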

Titles

Related to sentence segmentation is the issue of section titles. These are typically a set of <p></p> elements in my original patent XML files, and so form part of the long string of patent text. As such they can confuse sentence tokenisation: they do not end in a full stop and do not exhibit normal sentence grammar.

[Image: sentence_error_2017-06-21(1)]

Titles can however be identified by new lines (\n). A title will have a preceding and following new line and no full stop. It could thus be extracted using a regular expression (e.g. “\n\s?(\W\s?)\n”).

Titles may be removed from the main text string. They may also be used as variables of “section” objects that model the document, where the object stores a long text string for the section.
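
As a rough illustration (this is an approximate pattern of my own, not the exact expression above, and it would need tuning on real documents):

import re

# A title line: surrounded by new lines, reasonably short, and with no full stop.
TITLE_PATTERN = re.compile(r"\n\s*\n([^.\n]{3,80})\n\s*\n")

def extract_titles(text):
    return [title.strip() for title in TITLE_PATTERN.findall(text)]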

As an aside, titles in HTML documents are sometimes wrapped in header tags (e.g. <h3></h3>). For web pages titles may thus be extracted as part of the HTML parsing.

Word Segmentation

About half the code examples I have seen use a naive word tokenisation that splits sentences or document strings based on spaces (e.g. using doc.split()). This works fairly successfully but is not perfect.

The other half of code examples use a word tokenising function supplied by a toolkit (e.g. within Keras, TensorFlow or NLTK). I haven’t looked under the hood but I wouldn’t be surprised if they were just a wrapper for the simple space split technique above.

While these functions work well for small curated datasets, I find the following issues arise with real world data. For reference, I was using word_tokenize() from NLTK.

Huge Vocabularies

A parse of 100,000 patent documents indicates that there are around 3 million unique “word” tokens.

Even with stemming and some preprocessing (e.g. replacing patent numbers with a single placeholder token), I can only cut this vocabulary down to 1 million unique tokens.

[Image: token_frequency_2017-06-21(1)]

This indicates that the vocabulary on the complete corpus will easily be in the millions of tokens.

This then quickly makes “word” token based systems impractical for real world datasets. For many RNN systems you will need to use a modified softmax (e.g. sampled or hierarchical) on your output, and even these techniques may grind to a halt at dimensionalities north of 1 million.

The underlying issue is that word vocabularies have a set of 50-100k words that are used frequently and a very long tail of infrequent words.

Looking at the vocabulary is instructive. You quickly see patterns of inefficiency.

Numbers

This is a big one, especially for more technical text sources where numbers turn up a lot. Each unique number that is found is considered a separate token.

[Image: sentence_error_2017-06-21(2)]

This becomes a bigger problem with patent publications. Numbers occur everywhere, from parameters and variable values to cited patent publication references.

Any person looking at this quickly asks – why can’t numbers be represented in real number space rather than token space?

This almost becomes absurd when our models use thousands of 32 bit floating point numbers as parameters – just one parameter value can represent numbers in a range of -3.4E+38 to +3.4E+38. As such you could reduce your dimensionality by hundreds of thousands of points simply by mapping numbers onto one or two real valued ranges. The problem is this then needs bespoke model tinkering, which is exactly what most deep learning approaches are trying to avoid.

Looking at deep learning papers in the field I can’t see this really being discussed. I’ve noticed that a few replace financial amounts with zeros (e.g. “$123.23” > “$000.00”). This then only requires one token per digit or decimal place. You do then lose any information stored by the number.

I have also noticed that some word embedding models end up mapping numbers onto an approximate number range (they tend to be roughly grouped into linear structures in embedding space). However, you still have the issue of large input/output dimensions for your projections, and the softmax issue remains. There is also no guarantee that your model will efficiently encode numbers, e.g. you can easily imagine local minima where numbers are greedily handled by a largish set of hidden dimensions.

Capital Letters

As can be seen in the list of most common tokens set out above, a naive approach that doesn’t take into account capital letters treats capitalised and uncapitalised versions of a word as separate independent entities, e.g. “The” and “the” are deemed to have no relation in at least the input and output spaces.

[Image: token_frequency_2017-06-21(3)]

Many papers and examples deal with this by lowering the case of all words (e.g. preprocessing using wordstring.lower()). However, this again removes information; capitalisation is there for a reason: it indicates acronyms, proper nouns, the start of sentences etc.

Some hope that this issue is dealt with in a word embedding space, for example that “The” and “the” are mapped to similar real valued continuous n-dimensional word vectors (where n is often 128 or 300). I haven’t seen though any thought as to why this would necessarily happen, e.g. anyone thinking about the use of “The” and “the” and how a skip-gram, count-based or continuous bag of words model would map these tokens to neighbouring areas of space.

One pre-processing technique to deal with capital letters is to convert each word to lowercase, but to then insert an extra control token to indicate capital usage (such as <CAPITAL>). In this case, “The” becomes “<CAPITAL>”, “the”. This seems useful – you still have the indication of capital usage for sequence models but your word vocabulary only consists of lowercase tokens. You are simply transferring dimensionality from your word and embedding spaces to your sequence space. This seems okay – sentences and documents vary in length.
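
A minimal sketch of this word-level preprocessing (the <CAPITAL> marker and function name are just illustrative):

def mark_capitals(tokens):
    # Lowercase every token, inserting a <CAPITAL> control token before any
    # token that originally started with an upper-case letter.
    out = []
    for token in tokens:
        if token[:1].isupper():
            out.append("<CAPITAL>")
        out.append(token.lower())
    return out

# mark_capitals(["The", "cat", "sat"]) -> ["<CAPITAL>", "the", "cat", "sat"]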

(The same technique can be applied to a character stream to reduce dimensionality by at least 25: if “t” has index 24 and “T” has index 65, then a <CAPITAL> character may be inserted (e.g. index 3) so that “T” becomes “3”, “24”.)

Hyphenation and Either/Or

We find that most word tokenisation functions treat hyphenated words as single units.

[Image: token_frequency_2017-06-21(5)]

The issue here is that the hyphenated words are considered as a further token that is independent of their component words. However, in our documents we find that many hyphenated words are used in a similar way to their unhyphenated versions: the meaning is approximately the same, and the hyphen indicates a slightly closer relation than that of the words used independently.

One option is to hope that our magical word embeddings situate our hyphenated words in embedding space such that they have a relationship with their unhyphenated versions. However, hyphenated words are rare – typically very low use in our long tail – we may only see them one or two times in our data. This is not enough to provide robust mappings to the individual words (which may be seen 1000s of times more often).

So Many Hyphens

An aside on hyphens. There appear to be many different types of hyphen. For example, I can think of at least: a short hyphen, a long hyphen, and a minus sign. There are also hyphens from locale-specific unicode sets. The website here counts 27 different types: https://www.cs.tut.fi/~jkorpela/dashes.html#unidash . All can be used to represent a hyphen. Some dimensionality reduction would seem prudent here.

Another answer to this problem is to use a similar technique to that for capital letters: we can split hyphenated words “word1-word2” into “word1”, “<HYPHEN>”, “word2”, where “<HYPHEN>” is a special control character in our vocabulary. Again, here we are transferring dimensionality into our sequence space (e.g. into the time dimension). This seems a good trade-off. A sentence typically has a variable token length of ~10-100 tokens. Adding another token would seem not to affect this too much: we seem to have space in the time dimension.

A second related issue is the use of slashes to indicate “either/or” alternatives, e.g. “input/output”, “embed/detect”, “single/double-click”. Again the individual words retain their meaning but the compound phrase is treated as a separate independent token. We are also in the long tail of word frequencies – any word embeddings are going to be unreliable.

[Image: either_or_2017-06-21]

One option is to see “/” as an alternative symbol for “or”. Hence, we could have “input”, “or”, “output” or “embed”, “or”, “detect”. Another option is to have a special “<SLASH>” token for “either/or”. Replacement can be performed by regular expression substitution.
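
Both replacements can be sketched with a simple regular expression split (the control tokens and pattern here are one possible choice, not a definitive implementation):

import re

def split_joined_words(token):
    # "word1-word2"  -> ["word1", "<HYPHEN>", "word2"]
    # "input/output" -> ["input", "<SLASH>", "output"]
    parts = re.split(r"([-/])", token)
    out = []
    for part in parts:
        if part == "-":
            out.append("<HYPHEN>")
        elif part == "/":
            out.append("<SLASH>")
        elif part:
            out.append(part)
    return out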

Compound Words

The concept of “words” as tokens seems straightforward. However, reality is more complicated.

First, consider the terms “ice cream” and “New York”. Are these one word or two? If we treat them as two independent words “ice cream” becomes “ice”, “cream” where each token is modelled separately (and may have independent word embeddings). However, intuitively “ice cream” seems to be a discrete entity with a semantic meaning that is more than the sum of “ice” and “cream”. Similarly, if “New York” is “New”, “York” the former token may be modelled as a variation of “new” and the latter token as entity “York” (e.g. based on the use of “York” elsewhere in the corpus). Again this seems not quite right – “New York” is a discrete entity whose use is different from “New” and “York”. (For more fun also compare with “The Big Apple” – we feel this should be mapped to “New York” in semantic space, but the separate entities “New”, “York”, “The”, “Big”, “Apple” are unlikely to be modelled as related individually.)

The case of compound words probably has a solution different from that discussed for hyphenation and slashes. My intuition is that compound words reflect features in a higher feature space, i.e. in a feature space above that of the individual words. This suggests that word embedding may be a multi-layer problem.

Random Characters, Maths and Made-Up Names

This is a big issue for patent documents. Our character space has a dimensionality of around 600 unique characters, but only about 100 are used regularly – again we have a longish tail of infrequent characters.

[Image: either_or_2017-06-21(2)]

Looking at our infrequent characters we see some patterns: accented versions of common characters (e.g. ‘Ć’ or ‘ë’); unprintable unicode values (possibly from different character sets in different locales); and maths symbols (e.g. ‘≼’ or ‘∯’).

When used to construct words we end up with many variations of common word tokens (‘cafe’ and ‘café’) due to the accented versions of common characters.

Our unprintable unicode values appear not to occur regularly in our rare vocabulary. It thus appears many of these are local variations on control characters, such as whitespace. This suggests that we would not lose too much information with a pre-processing step that removed these characters and replaced them with a space (if they are not printable on a particular set of computers this suggests they hold limited utility and importance).

The maths symbols cause us a bit of a problem. Many of our rare tokens are parts of equations or variable names. These appear somewhat independent of our natural language processing – they are not “words” as we would commonly understand them.

One option is to use a character embedding. I have seen approaches that use between 16 and 128 dimensions for character embedding. This seems overkill. There are 26 letters, this is multiplied by 2 for capital letters, and there are around 10 common punctuation characters (the printable set of punctuation in Python’s string library is only of length 32). All the unicode character space may be easily contained within one real 32 bit floating point dimension. High dimensionality would appear to risk overfitting and would quickly expand our memory and processing requirements. My own experiments suggest that case and punctuation clustering can be seen with just 3 dimensions, as well as use groupings such as vowels and consonants. One risk of low dimensional embeddings is a lack of redundant representations that make the system fragile. My gut says some low dimensionality of between 1 and 16 dimensions should do. Experimentation is probably required within a larger system context. (Colin Morris has a great blog post where he looks at the character embedding used in one of Google Brain’s papers – the dimensionality plots / blogpost link can be found here: http://colinmorris.github.io/lm1b/char_emb_dimens/.)

What would we like to see with a character embedding:

  • a relationship between lower and upper case characters – preferably some kind of dimensionality reduction based on a ‘capital’ indication;
  • mapping of out of band unicode characters onto space or an <UNPRINTABLE> character;
  • mapping of accented versions of characters near to their unaccented counterpart, e.g. “é” should be a close neighbour of “e”;
  • clustering of maths symbols; and
  • even spacing of frequently used characters (e.g. letters and common punctuation) to provide a robust embedding.

There is also a valid question as to whether character embedding is needed. Could we have a workable solution with a simple lookup table mapping?

Not Words

Our word tokenisation functions also output a lot of tokens that are not really words. These include variable names (e.g. ‘product.price’ or ‘APP_GROUP_MULTIMEDIA’), websites (e.g. ‘www.mybusiness.com’), path names (e.g. ‘/mmt/glibc-15/lib/libc.so.6’), and text examples (‘PεRi’). These are often: rare – occurring one or two times in a corpus; and document-specific – they often are not repeated across documents. They make up a large proportion of the long-tail of words.

Often these tokens are <UNK>ed, i.e. they are replaced by a single token that represents rare words that are not taken into account. As our token use follows a power law distribution this can significantly reduce our vocabulary size.

[Image: token_frequency_2017-06-21(2)]

For example, the plot above shows that with no filtering we have 3 million tokens. By filtering out tokens that only appear once we reduce our vocabulary size by half: to 1.5 million tokens. By filtering out tokens that only appear twice we can reduce our dimensionality by another 500,000 tokens. Gains fall off as we raise the threshold: by removing tokens that appear fewer than 10 times, we can get down to a dimensionality of around 30,000. This is around the dimensionality you tend to see in most papers and public (toy) datasets.
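
A sketch of that frequency-threshold filtering using collections.Counter (the helper names are mine; min_count is the threshold discussed above):

from collections import Counter

def build_vocab(token_stream, min_count=10):
    # Keep only tokens that appear at least min_count times; everything else
    # will later be replaced by the single <UNK> token.
    counts = Counter(token_stream)
    return {token for token, count in counts.items() if count >= min_count}

def unk_tokens(tokens, vocab):
    return [token if token in vocab else "<UNK>" for token in tokens]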

The problem with doing this is you tend to throw away a lot of your document. The <UNK> then becomes like “the” or “of” in your model. You get samples such as “The <UNK> then started to <UNK> about <UNK>”, where any meaning of the sentence is lost. This doesn’t seem like a good approach.

The problem of <UNK> is a fundamental one. Luckily the machine learning community is beginning to wake up to this a little. In the last couple of years (2016+), character embedding approaches are gaining traction. Google’s neural machine translation system uses morpheme-lite word ‘portions’ to deal with out of vocabulary words (see https://research.google.com/pubs/pub45610.html). The Google Brain paper on Exploring the Limits of Language Modelling (https://arxiv.org/pdf/1602.02410.pdf) explores a character convolutional neural network (‘CharCNN’) that is applied to characters (or character embeddings).

What Have We Learned?

Summing up, we have found the following issues:

  • sentence segmentation is imperfect and produces noisy sentence lists;
  • some parts of a document such as titles may produce further noise in our sentence lists; and
  • word segmentation is also imperfect:
    • our vocabulary size is huge: 3 million tokens (on a dataset of only 100,000 documents);
    • numbers are a problem – they are treated as separate discrete units, which seems inefficient;
    • the concept of a “word” is fuzzy – we need to deal with compound words, hyphenation and either/or notation;
    • there are different character sets that lead to separate discrete tokens – e.g. capitals and accented letters – when common underlying tokens appear possible; and
    • non-language features such as equations and variable names contribute heavily to the vocabulary size.

Filtering out non-printable unicode characters would appear a good preprocessing step that has minimal effect on our models.

Character embedding to a low dimensional space appears useful.

Word/Character Hybrid Approach?

Looking at my data, a character-based approach appears to be the one to use. Even if we include all the random characters, we only have a dimensionality of 600 rather than 3 million for the word token space.

Character-based models would appear well placed to deal with rare, out-of-vocabulary words (e.g. ‘product.price’). It also seems much more sensible to treat numbers as sequences of digits as opposed to discrete tokens. Much like image processing, words then arise as a layer of continuous features. Indeed, it would be relatively easy to insert a layer to model morpheme-like character groupings (it would appear this is what the CNN approaches are doing).

The big issue with character level models is training. Training times are increased by orders of magnitude (state of the art models take weeks to train on machines with tens of £1k graphics cards).

However, we do have lots of useful training data at the word level. Our 50 most common unfiltered word tokens make up 46% (!) of our data. The top 10,000 tokens make up 95% of the data. Hence, character information seems most useful for the long tail.

[Image: token_frequency_2017-06-21(4)]

This then suggests some kind of hybrid approach. This could be at a model or training level. The 95% of common data would suggest we could train a system using common words as ground truth labels and configure this model to generalise over the remaining 5%. Alternatively, we could have a model that operates on 10k top tokens with <UNK>able words being diverted into a character level encoding. The former seems preferable as it would allow a common low-level interface, and the character information from the common words could be generalised to rare words.

I’ll let you know what I come up with.

If you have found nifty solutions to any of these issues, or know of papers that address them, please let me know in the comments below!

Artificial Morality (or How Do We Teach Robots to Love)

One Saturday morning I came upon the website 80000 Hours. The idea of the site is to direct our activity to maximise impact. They have a list of world problems here. One of the most pressing is explained as the artificial intelligence “control problem”: how do we control forces that can outthink us? This got me thinking. Here are those thoughts.

The Definition Problem (You Say Semantics…)

As with any abstraction, we are first faced with the problems of definition. You could base a doctorate on this alone.

At its heart, ‘morality’ is about ‘right’ and ‘wrong’. These can be phrased as ‘good’ and ‘bad’ or ‘should’ and ‘should not’.

This is about where the agreement ends.

Let’s start with scope. Does morality apply to an internal world of thought as well as an external world of action? Religions often feature the concept of ‘immoral’ thoughts; however, most would agree that action is the final arbiter. Without getting too metaphysical, I would argue that thoughts (or data routines) are immoral to the extent that they cause physical change in the world in a manner that increases the likelihood of an immoral action (even though that action need not actually occur). For example, ruminating on killing is immoral in the sense that it leads to physical changes in the brain that make a person more likely to kill in future situations.

The main show in morality revolves around the moral groupings: just what is ‘right’ or ‘wrong’? This is where the mud tends to be thrown.

‘Morality’ itself has had a bad rap lately. There are overhangs from periods of dogmatic and repressive religious control. Post modernism, together with advanced knowledge of other cultures, has questioned the certainties that, at least in Europe and North America, supported the predominantly Judeo-Christian moral viewpoint. This has led to some voices questioning the very basis of morality: if the moral groupings seem arbitrary, do we even need them?

As with other subjects, I think the existential panic that post modernism delivered is constructive for our thinking on morality, but we should use it to build from firmer foundations rather than abandon the building altogether. The body of knowledge from other cultures helps us map the boundaries and commonalities in human morality that can teach us how to construct an artificial machine morality.

Interestingly, morality does appear to be a binary classification. For me, concepts such as an action being half moral or a quarter immoral don’t really make sense. When thinking of morality, it is similarly hard to think of a category that is neither moral nor immoral. There is the concept of amorality – but this indicates the absence of a classification. Hence, morality is a binary classification that can itself be applied in a binary manner.

An Aside on Tribalism

Morality has deep ties to its abstractive cousins: politics and religion. Moral groupings are often used to indicate tribal affiliations in these areas. Indeed, some suggest that the differences in moral groupings have come about to better delineate social groupings. This means that disagreement often becomes heated as definitions are intrinsically linked to a definition of (social) self.

Fear of falling into the wrong side of a social grouping can often constrain public discourse on morality. This is possibly one of the reasons for the limited field size described in the 80000 hours problem profile.

Another, often overlooked point, is that those with the strongest personal views on morality tend to lie on the right of the political spectrum (i.e. be conservative), whereas those writing about morality in culture and academia tend to lie on the left (i.e. be liberal in the US sense). Hence, those writing “objectively” about morality tend to view the subject from a different subjective viewpoint than those who feel most passionately about right and wrong. This sets up a continuing misunderstanding. In my reading I have felt that those on the left tend to underestimate the visceral pull of morality, while those on the right tend to over-emphasise a fixed rules based approach.

Seductive Rules

Programmers and engineers love rules. A simple set of rules appears as a seductive solution to the problem of morality: think the Ten Commandments or Asimov’s Three Laws. However, this does not work in practice. This is clear from nature. Social life is far too complex.

Rules may be better thought of as a high-level surface representation of an underlying complex probabilistic decision-making process. As such, in many situations the rules and behaviour will overlap. This gives us the causative fallacy that the rules cause the behaviour, whereas in reality similarities in genetics and culture lead human beings to act in ways that can be clustered and labelled as ‘rules’.

This is most apparent at edge cases of behaviour – in certain situations humans act in a reasonable or understandable way that goes against the rules. For example, “Thou shall not kill” unless you are at war, in which case you should. Or “Thou shall not steal”, unless your family is starving and those you are stealing from can afford it. Indeed, it is these messy edge cases that form the foundations of a lot of great literature.

However, we should not see rules of human behaviour as having no use – they are the human-intelligible labels we apply to make sense of the world and to communicate. Like the proverbial iceberg tip, they can also guide us to the underlying mechanisms. They can also provide a reference test set to evaluate an artificial morality: does our morality system organically arrive at well-known human moral rules without explicit programming?

How Humans Do It (Lord of the Flies)

When we evaluate artificial intelligence we need to understand we are doing this relative to human beings. For example, an artificial morality may be possible that goes against commonly-agreed moral groupings in a human based morality. Or we could come up with a corvid morality that overlapped inexactly with a human morality. However, the “control problem” defined in the 80000 Hours article is primarily concerned with constructing an artificial morality that is beneficial for, and consistent with generally held concepts of, humanity.

As with many philosophical abstracts, human morality likely arises from the interplay of multiple adaptive systems. I will look at some of the key suspects below.

(Maternal) Love is All You Need

In mammals at least, the filial bond is likely at the heart of many behavioural aspects that are deemed ‘good’ across cultures. The clue is kind of in the name: the extended periods of nursing found in mammals, and the biological mechanisms such as oxytocin that allow this, provide for a level of self-sacrifice and concern that human beings respect and revere. The book Affective Neuroscience gives a good basic grounding in these mechanisms.

This, I think, also solves much of the control problem – parents are more intelligent than their children but (when things are working) do not try to exterminate them as a threat at any opportunity.

Indeed, it is likely not a coincidence that the bureaucratic apparatus that forms the basis for the automation of artificial intelligence first arose in China. This is a country whose Confucian/Daoist morality prizes filial respect, and extends it across non-kin hierarchies.

If our machines cared for us as children we may not control them, but they would act in our best interest.

Moreover, one of the great inventions of the mono-theistic religions of the Middle East was the extension of filial love (think Father and Son) to other human beings. The concepts of compassion and love that at least Christian scholars developed in the first millennium (AD) had at their basis not the lust of romantic love but the platonic love of parent and child. This in turn was driven by the problem of regulating behaviour in urban societies that were growing increasingly distant from any kind of kin relationship.

Social Grouping

The approach discussed above does have its limitations. These are played out all over history. Despite those mono-theistic religions extending the filial bond, they were not able to extend it to all humanity; it hit a brick wall at the limits of social groups.

Although it goes in and out of fashion, it may be that the group selection mechanisms explored by clever people such as Edward O. Wilson, are at play. Are social group boundaries necessary for the survival of those within the group? Is there something inherently flawed, in the form of long-term survival, if the filial bond is extended too far? Or is this limitation only in the constraints of the inherited biology of human beings?

Returning to morality, Jared Diamond notes in The World Until Yesterday that many tribal cultures group human beings into ‘within tribe’ and ‘outside tribe’, wherein the latter are classed as ‘enemies’ that may be ‘morally’ killed. Furthermore, many tribal cultures are plagued by a tit-for-tat cycle of killing, which was deemed the ‘right’ action until the later arrival of a state mechanism where justice was out-sourced from the tribe. We are reminded that “Thou shall not kill” does not apply to all those smitten in the Old Testament.

For machines and morality, this seems an issue. Would an artificial intelligence need to define in and out groups for it to be accepted and trusted by human beings? If so, how can we escape cataclysmic conflict? Do you program a self-driving car to value all life equally, or the lives of your country’s citizens above others? As has been pointed out by many, bias may be implicit in our training data. Does our culture and observed behaviour train artificial intelligence systems to naturally favour one group over another? (Groups being defined by a collection of shared features detected from the data.) If so, this may be an area where explicit guidance is required.

Disgust

Marc Hauser in Moral Minds touches on how many visceral feelings of right and wrong may be driven, or built upon, our capacity for disgust.

Disgust as an emotion has clearly defined facial expressions (see the work of Paul Ekman) that are shared across different human groups, indicating a deep shared biological basis in the brain.

Disgust is primarily an emotion of avoidance. It is best understood as a reaction to substances and situations that may be hazardous to our health. For example, disgust is a natural reaction to faeces, tainted foods and water supplies, vomit and decaying flesh. This makes us avoid these items and thus avoid the diseases (whether viral, bacterial or fungal) that accompany them. The feeling itself is based around a sensing and control of digestive organs such as the stomach and colon; the feeling is the precursor to adaptive behaviours to purge the body of possibly disease-ridden consumables.

Hauser discusses research that suggests that the mechanisms of disgust have been extended to more abstract categories of items. When considering these items, people who have learned (or possibly inherited) an association feel an echo of the visceral disgust emotion that guides their decision making. There are also possible links to the natural strength of the disgust emotion in people and their moral sense: those who feel disgust more strongly tend also to be those who have a clearer binary feeling of right and wrong.

This is not to say that this linking of disgust and moral sense is always adaptive (and possibly ‘right’). Disgust is often a driving factor in out-group delineation. It may also underlie aversion to homosexuality among religious conservatives. However, it is often forgotten in moral philosophy, which tends to avoid ‘fluffy’ feelings and the subjective minefield they open up.

Any artificial morality needs to bear disgust in mind though. Not only does it suggest one mechanism for implementing a moral sense at a nuts and bolts level, any implementation that ignores it will likely slip into the uncanny valley when it comes to human appraisal.

Fear

Another overlooked component of a human moral sense is fear.

Fear is another avoidance emotion that is primarily driven through the amygdala. Indeed, there may be overlaps between fear and disgust, as implemented in the brain. The other side of fear is the kick-starting of the ‘fight’ reflex, the release of epinephrine, norepinephrine and cortisol.

In moral reasoning, fear, like disgust, may be a mechanism to provide quick decision making. Fear responses may be linked to cultural learning (e.g. the internalised response to an angry or fearful parent around dangerous or ‘bad’ behaviours) and may guide the actual decision itself, e.g. pushing someone off a bridge or into a river is ‘bad’ because of the associated fear of falling or drowning, which gives us a feeling of ‘badness’.

Frontal Lobes

The moral reasoning discussed above forms the foundations of our thoughts. The actual thoughts themselves, including their linguistic expression in notes such as this, are also driven and controlled by the higher executive areas of the frontal lobes and prefrontal cortex. These areas are the conductor, who oversees the expression of neural activity over time in the rest of the cortex, including areas associated with sensory and motor processing.

In the famous case of Phineas Gage, violent trauma to the frontal lobes led to a decline in ‘moral’ behaviour and an increase in the ‘immoral’ vices of gambling, drinking and loose women. Hence, they appear to form a necessary part of our equipment for moral reasoning. Indeed, any model of artificial morality would do well to model the action of the prefrontal cortex and its role in inhibiting behaviour that is believed to be morally unsound.

The prefrontal cortex may also have another role: that of storyteller to keep our actions consistent. You see this behaviour often with small children: in order to keep beliefs regarding behaviour consistent in the face of often quite obvious inconsistencies, elaborate (and often quite hilarious) stories are told. It is also found in split brain patients to explain a behaviour caused by a side of the brain that is inaccessible to consciousness. Hence, human beings highly rate, and respond to, explanations of moral behaviour that are narratively consistent, even if they deviate from the more random and chaotic nature of objective reality. This is the critical front-end of our moral apparatus.

Where Does Culture Fit In?

Culture fits in as the guiding force for growth of the mechanisms discussed above. Causation is two-way, the environment drives epigenetic changes and neural growth and as agents we shape our environment. This all happens constantly over time.

Often it is difficult to determine the level at which a behaviour is hard-wired. The environmental human beings now live in around the world has been largely shaped by human beings. Clues for evaluating the depth of mechanisms, and for determining the strength of any association, include: universal expression across cultures, appearance in close genetic relatives such as apes and mammals, independent evolution in more distant cousins (e.g. tool use and social behaviour in birds), and consistency of behaviour over recent recorded time (10k years).

My own inclination is that culture guides expression, but it is difficult if not impossible to overwrite inherited behaviour. This is both good and bad. For example, evidence points to slavery and genocide as being cultural, they come and go throughout history. However, it is very difficult to train yourself not to gag when faced with the smell of fresh vomit or a decaying corpse.

A Note on Imperfection

Abuse. Murder. Violence. Post-natal depression. Crimes of passion. War. Things can and do go wrong. Turn on the news, it’s there for all to see.

Humans have a certain acceptance that humans are imperfect. Again a lot of great art revolves around this idea. People make mistakes. However, I’d argue that a machine that made mistakes wouldn’t last long.

A machine that reasons morally would necessarily not be perfect. To deal with the complexity of reality machines would need to reason probabilistically. This then means we have to abandon certainty, in particular the certainty of prediction. Classification rates in many machine learning tasks plateau at an 80-90% success rate, with progress then being measured for years in fractions of a percent. Would we be happy with a machine that only seems to be right 80-90% of the time?

Saying this, I do note a tendency towards expecting perfection in society in recent years. When something goes wrong someone is to blame. Politicians need to step down; CEOs need to resign. There are lawsuits for negligence. This I feel is the flipside of technological certainty. We can predict events on a quantum scale and have supercomputers in our pockets; surely we can control the forces of nature? Maybe the development of imperfect yet powerful machines will allow us to regain some of our humanity.