Easy Audio/Video Capture with Python

At present it is difficult to obtain audio/video data in Python. For example, many deep learning methods assume you have easy access to your data in the form of a numpy array. Often you don’t. Based on the good efforts of those online, this post presents a number of Python classes to address this issue.

Just give me the code.

General Interface

Firstly, we can use threads to constantly update sensor data in the background. We can then read this data asynchronously.
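As a toy illustration of this pattern (a sketch, not one of the capture classes below), a daemon thread keeps updating a value in the background while the main thread reads it asynchronously:

```python
import threading
import time

class BackgroundCounter:
    """Toy 'sensor': a daemon thread continuously updates a value."""
    def __init__(self):
        self.value = 0
        self.started = False
        self.read_lock = threading.Lock()
    def start(self):
        self.started = True
        self.thread = threading.Thread(target=self.update, daemon=True)
        self.thread.start()
        return self
    def update(self):
        while self.started:
            with self.read_lock:
                self.value += 1
            time.sleep(0.001)
    def read(self):
        with self.read_lock:
            return self.value

c = BackgroundCounter().start()
time.sleep(0.05)      # main thread does other work here
print(c.read() > 0)   # True - updates happened in the background
c.started = False
```

The lock matters: without it, a read could see a partially updated value when the data is something larger than an integer, such as a video frame.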

Secondly, we can define a general interface for sensor data.

import threading

class SensorSource:
    """Abstract object for a sensory modality."""
    def __init__(self):
        """Initialise object."""
        self.started = False
        self.thread = None
    def start(self):
        """Start capture source."""
        if self.started:
            print('[!] Asynchronous capturing has already been started.')
            return None
        self.started = True
        self.thread = threading.Thread(
            target=self.update,
            args=()
        )
        self.thread.daemon = True
        self.thread.start()
        return self
    def update(self):
        """Update data."""
    def read(self):
        """Read data."""
    def stop(self):
        """Stop daemon."""
        self.started = False
        if self.thread is not None:
            self.thread.join()


For our video capture class, we can use OpenCV. You can install this in a conda environment using conda install opencv, or via pip using pip install opencv-python. Either route gives access to the cv2 library.

Beware: you may need to do a bit of tweaking to get your video capture working – different cameras / system configurations need different tweaks.

# Video source

import cv2

class VideoSource(SensorSource):
    """Object for video using OpenCV."""
    def __init__(self, src=0):
        """Initialise video capture."""
        # width=640, height=480
        self.src = src
        self.cap = cv2.VideoCapture(self.src)
        #self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
        #self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
        self.grabbed, self.frame = self.cap.read()
        self.started = False
        self.read_lock = threading.Lock()
    def update(self):
        """Update based on new video data."""
        while self.started:
            grabbed, frame = self.cap.read()
            with self.read_lock:
                self.grabbed = grabbed
                self.frame = frame
    def read(self):
        """Read video."""
        with self.read_lock:
            frame = self.frame.copy()
            grabbed = self.grabbed
        return grabbed, frame

    def __exit__(self, exec_type, exc_value, traceback):
        """Release the camera resource."""
        self.cap.release()

The initialisation sets up the camera and the threading lock. The update method runs as part of the thread to continuously update the self.frame data. The data may then be accessed (asynchronously) using the read() method on the object. The __exit__ method releases the camera resource when the object is deleted or the Python kernel is stopped, so you can then use the camera in other applications.

Beware: I had issues setting the width and height so I have commented out those lines. Also remember OpenCV provides the data in BGR format – so channels 0, 1, 2 correspond to Blue, Green and Red rather than RGB. You might also want to set to YUV mode by adding the following to the __init__ method:

self.cap.set(16, 0)


You’ll see many posts online that use pyaudio for audio capture. I couldn’t get this to work in a conda environment due to an issue with the underlying PortAudio library. I had more success with alsaaudio, which is provided by the pyalsaaudio package:

conda install -c conda-forge pyalsaaudio
pip install pyalsaaudio

# Audio source
import struct
from collections import deque
import numpy as np
import logging
import alsaaudio

class AudioSource(SensorSource):
    """Object for audio using alsaaudio."""
    def __init__(self, sample_freq=44100, nb_samples=65536):
        """Initialise audio capture."""
        # Initialise audio
        self.inp = alsaaudio.PCM(
            alsaaudio.PCM_CAPTURE,
            alsaaudio.PCM_NORMAL,
            device="default"
        )
        # set attributes: Mono, frequency, 16 bit little endian samples
        self.inp.setchannels(1)
        self.inp.setrate(sample_freq)
        self.inp.setformat(alsaaudio.PCM_FORMAT_S16_LE)
        self.inp.setperiodsize(512)
        # Create a FIFO structure for the data
        self._s_fifo = deque([0] * nb_samples, maxlen=nb_samples)
        self.l = 0
        self.started = False
        self.read_lock = threading.Lock()
    def update(self):
        """Update based on new audio data."""
        while self.started:
            self.l, data = self.inp.read()
            if self.l > 0:
                # extract and format sample
                raw_smp_l = struct.unpack('h' * self.l, data)
                with self.read_lock:
                    self._s_fifo.extend(raw_smp_l)
            else:
                logging.error(
                    f'Sampler error occurred (l={self.l} and len data={len(data)})'
                )
    def read(self):
        """Read audio."""
        with self.read_lock:
            return self.l, np.asarray(self._s_fifo, dtype=np.int16)

The approach for audio is similar to video. We set up an audio input source and a threading lock in the __init__ method. In the audio case, we are recording a (time) series of audio samples, so we do this in a buffer of length nb_samples. The deque object acts as a FIFO queue and provides this buffer. The update method is run continuously in the background within the thread and adds new samples to the queue over time, with old samples falling off the back of the queue. The struct library is used to decode the binary data from the alsaaudio object and convert it into integer values that we can add to the queue. When we read the data, we convert the queue to a 16-bit integer numpy array.
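The buffering can be seen in isolation with fake byte data: struct.unpack decodes raw little-endian 16-bit bytes into integers, and the deque acts as a fixed-length FIFO.

```python
import struct
from collections import deque

# Fixed-length FIFO holding the last 8 samples
buffer = deque([0] * 8, maxlen=8)
# Fake a "captured" period of four int16 samples as raw bytes
raw = struct.pack('<4h', 1, 2, 3, 4)
# Decode the bytes and push them onto the queue
buffer.extend(struct.unpack('<4h', raw))
buffer.extend(struct.unpack('<4h', struct.pack('<4h', 5, 6, 7, 8)))
print(list(buffer))  # [1, 2, 3, 4, 5, 6, 7, 8] - the zeros fell off the back
```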

In both cases, the read() method returns a tuple: (data_check_value, data) where the data_check_value is a value returned from the underlying capture objects. It is often useful for debugging.

Combining and Simplifying

Now we have defined sensor data sources, we can combine them so that we only need to perform one read() call to obtain data from all sources. To do this, we create a wrapper object that allows us to iterate through each added sensor data source.

class CombinedSource:
    """Object to combine multiple modalities."""
    def __init__(self):
        self.sources = dict()
    def add_source(self, source, name=None):
        """Add a source object.
        source is a derived class from SensorSource
        name is an optional string name."""
        if not name:
            name = source.__class__.__name__
        self.sources[name] = source
    def start(self):
        """Start all sources."""
        for name, source in self.sources.items():
            source.start()
    def read(self):
        """Read from all sources.
        return as dict of tuples."""
        data = dict()
        for name, source in self.sources.items():
            data[name] = source.read()[1]
        return data
    def stop(self):
        """Stop all sources."""
        for name, source in self.sources.items():
            source.stop()
    def __del__(self):
        for name, source in self.sources.items():
            if source.__class__.__name__ == "VideoSource":
                source.cap.release()
    def __exit__(self, exec_type, exc_value, traceback):
        for name, source in self.sources.items():
            if source.__class__.__name__ == "VideoSource":
                source.cap.release()

The delete and exit logic is added to clean up the camera object – without these the camera is kept open and locked, which can cause problems. Data is returned as a dictionary, indexed by a string name for the data source.

We can simplify things even further by creating a derived class that automatically adds an audio and video capture object.

class AVCapture(CombinedSource):
    """Auto populate with audio and video."""
    def __init__(self):
        self.sources = dict()
        a = AudioSource()
        self.add_source(a, "audio")
        v = VideoSource()
        self.add_source(v, "video")

This then allows us to access audio and video data in a couple of lines.

av = AVCapture()
data = av.read()

Here are some outputs, plotted with matplotlib (import matplotlib.pyplot as plt):

[Video frame – colours are crazy because imshow expects RGB not BGR!]

[Audio plot – BBC Radio 6 Music in graphical form]

Finishing Off

You can find the code in a Gist here, together with some testing lines that you could easily convert into a library.

You can also expand the sensor classes to capture other data. I plan to create a class to capture CPU and memory use information.


Making Proper Histograms with Numpy and Matplotlib

Often I find myself needing to visualise an array, such as bunch of pixel or audio channel values. A nice way to do this is via a histogram.

When building histograms you have two options: numpy’s histogram or matplotlib’s hist. As you may expect numpy is faster when you just need the data rather than the visualisation. Matplotlib is easier to apply to get a nice bar chart.

So I remember, here is a quick post with an example.

# First import numpy and matplotlib
import numpy as np
import matplotlib.pyplot as plt

I started with a data volume of size 256 x 256 x 8 x 300, corresponding to 300 frames of video at a resolution of 256 by 256 with 8 different image processing operations. The data values were 8-bit, i.e. 0 to 255. I wanted to visualise the distribution of pixel values within this data volume.

Using numpy, you can easily pass in the whole data volume and it will flatten the arrays within the function. Hence, to get a set of histogram values and integer bins between 0 and 255 you can run:

values, bins = np.histogram(data_vol, np.arange(0, 257))

You can then use matplotlib’s bar chart to plot this:

plt.bar(bins[:-1], values, width = 1)

Using matplotlib’s hist function, we need to flatten the data first:

results = plt.hist(data_vol.ravel(), bins=np.arange(0, 257))

The result of both approaches is the same. If we are being more professional, we can also use more of matplotlib’s functionality:

fig, ax = plt.subplots()
results = ax.hist(data_vol.ravel(), bins=np.arange(0, 255))
ax.set_title('Pixel Value Histogram')
ax.set_xlabel('Pixel Value')
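One pitfall worth noting with integer-valued data: np.histogram treats the bins argument as bin edges, so np.arange(0, 256) gives 256 edges and only 255 bins, with the final bin collecting both 254 and 255. Passing one extra edge keeps each 8-bit value in its own bin. A standalone numpy check:

```python
import numpy as np

data = np.array([0, 1, 1, 254, 255, 255])
# 257 edges -> 256 bins, one per possible 8-bit value
values, bins = np.histogram(data, bins=np.arange(0, 257))
print(len(values))   # 256
print(values[254])   # 1 - the single 254
print(values[255])   # 2 - the two 255s get their own bin
```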

Things get a little more tricky when we start changing our bin sizes. A good run through is found here. In this case, the slower Matplotlib function becomes easier:

fig, ax = plt.subplots()
results = ax.hist(
    data_vol.ravel(),
    bins=np.linspace(0, 255, 17)  # 17 edges give 16 bins
)
ax.set_title('Pixel Value Histogram (4-bit)')
ax.set_xlabel('Pixel Value')

Using 16 bins provides us with 4-bit quantisation. You can see here we could represent a large chunk of the data with just three values (if we subtract 128: <0 = -1, 0 = 0 and >0 = 1).

Capturing Live Audio and Video in Python

In my robotics projects I want to capture live audio and video data and convert it into Numpy multi-dimensional arrays for further processing. To save you several days, this blog post explains how I go about doing this.

Audio / Video Not Audio + Video

A first realisation is that you need to capture audio and video independently. You can record movie files with audio, but as far as I could find there is no simple way to live capture both audio and video data.


For video processing, I found there were two different approaches that could be used to process video data:

  • Open CV in Python; and
  • Wrapping FFMPEG using SubProcess.

Open CV

The default library for video processing in Python is OpenCV. Things have come a long way since my early experiences with OpenCV in C++ over a decade ago. Now there is a nice Python wrapper and you don’t need to touch any low-level code. The tutorials here are a good place to start.

I generally use Conda/Anaconda these days to manage my Python environments (the alternative being old skool virtual environments). Setting up a new environment with Jupyter Notebook and Open CV is now straightforward:

conda install opencv jupyter

As a note – installing OpenCV in Conda seems to have been a pain up to a few years ago. There are thus several out of date Stack Overflow answers that come up in the searches, that refer to installing from specific sources (e.g. from menpo). This appears not to be needed now.

One problem I had in Linux (Ubuntu 18.04) is that the GTK libraries didn’t play nicely in the Conda environment. I could capture images from the webcam but not display them in a window. This led me to look for alternative visualisation strategies that I describe below.

A good place to start with OpenCV is this video tutorial. As drawing windows led to errors I designed a workaround where I used PIL (Python Image Library) and IPython to generate an image from the Numpy array and then show it at about 30 fps. The code separates out each of the YUV components and displays them next to each other. This is useful for bio-inspired processing.

# Imports
import PIL
import io
import cv2
import matplotlib.pyplot as plt
from IPython import display
import time
import numpy as np

# Function to convert array to JPEG for display as video frame
def showarray(a, fmt='jpeg'):
    f = io.BytesIO()
    PIL.Image.fromarray(a).save(f, fmt)
    display.display(display.Image(data=f.getvalue()))

# Initialise camera
cam = cv2.VideoCapture(0)
# Optional - set to YUV mode (remove for BGR)
cam.set(16, 0)
# These allow for a frame rate to be printed
t1 = time.time()

# Loop until an interrupt
try:
    while True:
        t2 = time.time()
        # Capture frame-by-frame
        ret, frame = cam.read()
        # Join Y, U and V components horizontally
        joined_array = np.concatenate((
            frame[:, :, 0],
            frame[:, 1::2, 1],
            frame[:, 0::2, 1]
        ), axis=1)
        # Use above function to show array
        showarray(joined_array)
        # Print frame rate
        print(f"{int(1/(t2-t1))} FPS")
        # Display the frame until new frame is available
        display.clear_output(wait=True)
        t1 = t2
except KeyboardInterrupt:
    # Release the camera when interrupted
    cam.release()
    print("Stream stopped")

In the above code, “frame” is a three-dimensional tensor or array where the first dimension relates to rows of the image (e.g. the y-direction of the image), the second dimension relates to columns of the image (e.g. the x-direction of the image) and the third dimension relates to the three colour channels. Often for image processing it is useful to separate out the channels and just work on a single channel at a time (e.g. equivalent to a 2D matrix or grayscale image).
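The layout can be checked with a dummy array of the same shape (a sketch; a real frame would come from cam.read()):

```python
import numpy as np

# A dummy 480x640 three-channel frame: (rows, columns, channels)
frame = np.zeros((480, 640, 3), dtype=np.uint8)
# Selecting one channel gives a 2D matrix, i.e. a grayscale image
single_channel = frame[:, :, 0]
print(frame.shape)           # (480, 640, 3)
print(single_channel.shape)  # (480, 640)
```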


An alternative to using OpenCV is to use subprocess to wrap FFMPEG, a command line video and audio processing utility.

This is a little trickier as it involves accessing the video buffers. I have based my solution on this guide by Zulko here.

import subprocess as sp
import numpy as np
import matplotlib.pyplot as plt

FFMPEG_BIN = "ffmpeg"
# Define command line command
command = [ FFMPEG_BIN,
            '-i', '/dev/video0',
            '-f', 'image2pipe',
            '-pix_fmt', 'rgb24',
            '-an','-sn', #-an, -sn disables audio and sub-title processing respectively
            '-vcodec', 'rawvideo', '-']
# Open pipe
pipe = sp.Popen(command, stdout = sp.PIPE, bufsize=(640*480*3))

# Display a few frames
no_of_frames = 5
fig, axes = plt.subplots(no_of_frames, 1)

for i in range(0, no_of_frames):
    # Get the raw byte values from the buffer
    raw_image = pipe.stdout.read(640*480*3)
    # Transform the bytes read into a numpy array
    image = np.frombuffer(raw_image, dtype='uint8')
    image = image.reshape((480, 640, 3))
    # Show the frame
    axes[i].imshow(image)
    # Flush the pipe
    pipe.stdout.flush()

I had issues flushing the pipe in a Jupyter notebook, so I ended up using the OpenCV method. It is also trickier working out the byte structure for YUV data.


My middle daughter generates a lot of noise.

For audio, there are also a number of options; PyAudio appears to be the preferred library in most online examples. However, I am quickly learning that audio/video processing in Python is not yet as polished as pure image processing or building a neural network.

PyAudio provides a series of wrappers around the PortAudio libraries. However, I had issues getting this to work in an Conda environment. Initially, no audio devices showed up. After a long time working through Stack Overflow, I found that installing from the Conda-Forge source did allow me to find audio devices (see here). But even though I could see the audio devices I then had errors opening an audio stream. (One tip for both audio and video is to look at your terminal output when capturing audio and video – the low level errors will be displayed here rather than in a Jupyter notebook.)


Given my difficulties with PyAudio I then tried AlsaAudio. I had more success with this.

My starting point was the code for recording audio that is provided in the AlsaAudio Github repository. The code below records a snippet of audio then loads it from the file into a Numpy array. It became the starting point for a streaming solution.

# Imports
import alsaaudio
import time
import numpy as np

# Setup Audio for Capture
inp = alsaaudio.PCM(alsaaudio.PCM_CAPTURE, alsaaudio.PCM_NONBLOCK, device="default")
# Set attributes: Mono, 44100 Hz, 16 bit little endian frames
inp.setchannels(1)
inp.setrate(44100)
inp.setformat(alsaaudio.PCM_FORMAT_S16_LE)
inp.setperiodsize(160)

# Record a short snippet
with open("test.wav", 'wb') as f:
    loops = 1000000
    while loops > 0:
        loops -= 1
        # Read data from device
        l, data = inp.read()
        if l:
            f.write(data)
            time.sleep(.001)

f = open("test.wav", 'rb')

# Open the device in playback mode (to check the recording)
out = alsaaudio.PCM(alsaaudio.PCM_PLAYBACK, device="default")

# Set attributes: Mono, 44100 Hz, 16 bit little endian frames
out.setchannels(1)
out.setrate(44100)
out.setformat(alsaaudio.PCM_FORMAT_S16_LE)

# The period size controls the internal number of frames per period.
# The significance of this parameter is documented in the ALSA api.
# We also have 2 bytes per sample so 160*2 = 320 = number of bytes read from buffer
out.setperiodsize(160)

# Read data from the file
data = f.read(320)
numpy_array = np.frombuffer(data, dtype='<i2')
while data:
    data = f.read(320)
    decoded_block = np.frombuffer(data, dtype='<i2')
    numpy_array = np.concatenate((numpy_array, decoded_block))

The numpy_array is then a long array of sound amplitudes.
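The '<i2' data type string tells numpy to interpret the bytes as little-endian 16-bit integers, matching the PCM format set above. A quick standalone check of the round trip:

```python
import struct
import numpy as np

# Pack three known amplitudes as little-endian 16-bit bytes
data = struct.pack('<3h', -32768, 0, 32767)
decoded = np.frombuffer(data, dtype='<i2')
print(decoded.tolist())  # [-32768, 0, 32767]
```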

Sampler Object

I found a nice little Gist for computing the FFT here. This uses a Sampler object to wrap the AlsaAudio object.

from collections import deque
import struct
import sys
import threading
import alsaaudio
import numpy as np

# some const
# 44100 Hz sampling rate (for 0-22050 Hz view, 0.0227ms/sample)
# 66000 samples buffer size (near 1.5 second)
NB_SAMPLE = 66000

class Sampler(threading.Thread):
    def __init__(self):
        # init thread
        threading.Thread.__init__(self)
        self.daemon = True
        # init ALSA audio
        self.inp = alsaaudio.PCM(alsaaudio.PCM_CAPTURE, alsaaudio.PCM_NORMAL, device="default")
        # set attributes: Mono, frequency, 16 bit little endian samples
        self.inp.setchannels(1)
        self.inp.setrate(44100)
        self.inp.setformat(alsaaudio.PCM_FORMAT_S16_LE)
        self.inp.setperiodsize(512)
        # sample FIFO
        self._s_lock = threading.Lock()
        self._s_fifo = deque([0] * NB_SAMPLE, maxlen=NB_SAMPLE)

    def get_sample(self):
        with self._s_lock:
            return list(self._s_fifo)

    def run(self):
        while True:
            # read data from device
            l, data = self.inp.read()
            if l > 0:
                # extract and format sample (normalize sample to 1.0/-1.0 float)
                raw_smp_l = struct.unpack('h' * l, data)
                smp_l = (float(raw_smp) / 32767 for raw_smp in raw_smp_l)
                with self._s_lock:
                    self._s_fifo.extend(smp_l)
            else:
                print('sampler error occur (l=%s and len data=%s)' % (l, len(data)), file=sys.stderr)

Next Steps

This is where I am so far.

The next steps are:

  • look into threading and multiprocessing so that we can run parallel audio and video sampling routines;
  • extend the audio (and video?) processing to obtain the FFT; and
  • optimise for speed of capture.

Playing Around with Retinal-Cortex Mappings

Here is a little notebook where I play around with converting images from a polar representation to a Cartesian representation. This is similar to the way our bodies map information from the retina onto the early visual areas.

Mapping from the visual field (A) to the thalamus (B) to the cortex (C)

These ideas are based on information we have about how the visual field is mapped to the cortex. As can be seen in the above figures, we view the world in a polar sense and this is mapped to a two-dimensional grid of values in the lower cortex.

You can play around with mappings between polar and Cartesian space at this website.

To develop some methods in Python I’ve leaned heavily on this great blogpost by Amnon Owed. This gives us some methods in Processing that I have adapted for my purposes.

Amnon suggests using a look-up table to speed up the mapping. In this way we build a look-up table that maps co-ordinates in polar space to an equivalent co-ordinate in Cartesian space. We then use this look-up table to look-up the mapping and use the mapping to transform the image data.
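Before building the tables, it is worth noting that the underlying maths is just the polar to Cartesian round trip, with angles measured from the vertical (the convention used in the code below):

```python
import math

r, angle = 5.0, 30.0
theta = math.radians(angle)
# Polar to Cartesian, angle taken from the vertical
col = r * math.sin(theta)
row = r * math.cos(theta)
# ...and back again
r_back = math.hypot(row, col)
angle_back = math.degrees(math.atan2(col, row))
print(round(r_back, 6), round(angle_back, 6))  # 5.0 30.0
```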

import math
import numpy as np
import matplotlib.pyplot as plt

def calculateLUT(radius):
    """Precalculate a lookup table with the image maths."""
    LUT = np.zeros((radius, 360, 2), dtype=np.int16)
    # Iterate around angles of field of view
    for angle in range(0, 360):
        # Iterate over diameter
        for r in range(0, radius):
            theta = math.radians(angle)
            # Take angles from the vertical
            col = math.floor(r*math.sin(theta))
            row = math.floor(r*math.cos(theta))
            # rows and cols will be +ve and -ve representing
            # at offset from an origin
            LUT[r, angle] = [row, col]
    return LUT

def convert_image(img, LUT):
    """Convert image from Cartesian to polar co-ordinates.

    img is a numpy 2D array having shape (height, width)
    LUT is a numpy array having shape (radius, 360, 2)
    storing [row, col] co-ords corresponding to [r, angle].
    """
    # Use centre of image as origin
    centre_row = img.shape[0] // 2
    centre_col = img.shape[1] // 2
    # Determine the largest radius
    if centre_row > centre_col:
        radius = centre_col
    else:
        radius = centre_row
    output_image = np.zeros(shape=(radius, 360))
    # Iterate around angles of field of view
    for angle in range(0, 360):
        # Iterate over radius
        for r in range(0, radius):
            # Get mapped row, col
            (row, col) = tuple(LUT[r, angle])
            # Translate origin to centre
            m_row = centre_row - row
            m_col = col + centre_col
            output_image[r, angle] = img[m_row, m_col]
    return output_image

def calculatebackLUT(max_radius):
    """Precalculate a lookup table for mapping from x,y to polar."""
    LUT = np.zeros((max_radius*2, max_radius*2, 2), dtype=np.int16)
    # Iterate around x and y
    for row in range(0, max_radius*2):
        for col in range(0, max_radius*2):
            # Translate to centre
            m_row = max_radius - row
            m_col = col - max_radius
            # Calculate angle w.r.t. y axis
            angle = math.atan2(m_col, m_row)
            # Convert to degrees
            degrees = math.degrees(angle)
            # Calculate radius
            radius = math.sqrt(m_row*m_row+m_col*m_col)
            # print(angle, radius)
            LUT[row, col] = [int(radius), int(degrees)]
    return LUT

def build_mask(img, backLUT, ticks=20):
    """Build a mask showing polar co-ord system."""
    overlay = np.zeros(shape=img.shape, dtype=bool)
    # Adjust the origin: backLUT has its origin at (radius, radius)
    row_adjust = backLUT.shape[0]//2 - img.shape[0]//2
    col_adjust = backLUT.shape[1]//2 - img.shape[1]//2
    for row in range(0, img.shape[0]):
        for col in range(0, img.shape[1]):
            m_row = row + row_adjust
            m_col = col + col_adjust
            (r, theta) = backLUT[m_row, m_col]
            if (r % ticks) == 0 or (theta % ticks) == 0:
                overlay[row, col] = 1
    masked = np.ma.masked_where(overlay == 0, overlay)
    return masked

First build the backwards and forwards look-up tables. We’ll set a max radius of 300 pixels, allowing us to map images of 600 by 600.

backLUT = calculatebackLUT(300)
forwardLUT = calculateLUT(300)

Now we’ll try this out with some test images from skimage. We’ll normalise these to a range of 0 to 1.

from skimage.data import chelsea, astronaut, coffee

img = chelsea()[...,0] / 255.

masked = build_mask(img, backLUT, ticks=50)
out_image = convert_image(img, forwardLUT)
fig, ax = plt.subplots(2, 1, figsize=(6,8))
ax[0].imshow(img, cmap=plt.cm.gray, interpolation='bicubic')

ax[0].imshow(masked, cmap=plt.cm.hsv, alpha=0.5)

ax[1].imshow(out_image, cmap=plt.cm.gray, interpolation='bicubic')

img = astronaut()[...,0] / 255.

masked = build_mask(img, backLUT, ticks=50)
out_image = convert_image(img, forwardLUT)
fig, ax = plt.subplots(2, 1, figsize=(6,8))
ax[0].imshow(img, cmap=plt.cm.gray, interpolation='bicubic')

ax[0].imshow(masked, cmap=plt.cm.hsv, alpha=0.5)

ax[1].imshow(out_image, cmap=plt.cm.gray, interpolation='bicubic')

img = coffee()[...,0] / 255.

masked = build_mask(img, backLUT, ticks=50)
out_image = convert_image(img, forwardLUT)
fig, ax = plt.subplots(2, 1, figsize=(6,8))
ax[0].imshow(img, cmap=plt.cm.gray, interpolation='bicubic')

ax[0].imshow(masked, cmap=plt.cm.hsv, alpha=0.5)

ax[1].imshow(out_image, cmap=plt.cm.gray, interpolation='bicubic')

In the methods, the positive y axis is the reference for the angle, which extends clockwise.

Now, within the brain the visual field is actually divided in two. As such, each hemisphere gets half of the bottom image (0-180 to the right hemisphere and 180-360 to the left hemisphere).

Also within the brain, the map on the cortex is rotated clockwise by 90 degrees, such that angle from the horizontal eye line is on the x-axis. The brain receives information from the fovea at a high resolution and information from the periphery at a lower resolution.

The short Jupyter Notebook can be found here.

Extra: proof this occurs in the human brain!

Understanding Convolution in Tensorflow

This is a quick post intended to help those trying to understand convolution as applied in Tensorflow.

There are many good blog posts on the Internet explaining convolution as applied in convolutional neural networks (CNNs), e.g. see this one by Denny Britz. However, understanding the theory is one thing; knowing how to implement it is another. This is especially the case when trying to apply CNNs to word or character-level natural language processing (NLP) tasks – here the image metaphors break down a little.

I generally use Tensorflow for neural network modelling. Most of the stuff I want to do is a little bespoke, so I need something a little more expressive than Keras. Two dimensional convolution is explained in the Tensorflow documents here. I also found the examples in the stackoverflow answer here very useful.

To summarise, for our input we have a [a, b, c, d] tensor – i.e. a x b x c x d.

  • a is the number of input ‘images’ or examples (this will typically be your batch size);
  • b is the input width (e.g. image width, max. word length in characters or max. sequence length in words);
  • c is the input height (e.g. image height or embedding dimensionality); and
  • d is the number of input channels (grayscale images, characters or words = 1, RGB = 3).

For the convolution kernel filter you also have a [o, p, q, r] tensor – i.e. o x p x q x r.

  • o is the filter width (e.g. patch width or ‘n-gram’ size);
  • p is the filter height (e.g. patch height or embedding dimensionality);
  • q is the number of channels (from the input – i.e. input channels); and
  • r is the number of filters (or output channels).

q basically has to match d. r – the number of output channels – is equal to the number of filters you want. For each output channel, a different filter of width * height will be created and applied.

Most of the time the theory is talking about images of b x c, and a filter of o x p.

(This follows Denny’s configuration. However, I note you can transpose b and c and o and p and get the same outcome.)

For NLP the number of output channels becomes an important parameter. This is because you will typically max-pool over the sequence (e.g. word or character) length, such that you get one value for the complete sequence per filter. Each filter can be thought of as representing something like an n-gram, e.g. the learnt parameters of one filter of length 3 could represent the character embeddings for the suffix “ly” (e.g. [l_embedding, y_embedding, wordend_embedding]) or the prefix “un” (including word start token).
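To make the shapes concrete, here is a small helper (a plain-Python sketch, not TensorFlow itself) computing the 'VALID'-padding output shape for the tensors described above:

```python
def conv2d_output_shape(input_shape, filter_shape, stride=1):
    """Output shape of a 'VALID'-padding 2D convolution."""
    a, b, c, d = input_shape    # batch, width, height, channels
    o, p, q, r = filter_shape   # filter width, height, in-channels, filters
    assert q == d, "filter channels must match input channels"
    out_b = (b - o) // stride + 1
    out_c = (c - p) // stride + 1
    return (a, out_b, out_c, r)

# A batch of 32 sequences, 20 characters long, 64-dim embeddings (1 channel),
# convolved with 128 filters of 'n-gram' size 3 spanning the full embedding:
print(conv2d_output_shape((32, 20, 64, 1), (3, 64, 1, 128)))
# (32, 18, 1, 128)
```

Max-pooling over the 18 sequence positions then leaves one value per filter, i.e. a 128-dimensional representation per example.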

I found it instructive to work through the code associated with Denny’s further blog post here.

Practical Problems with Natural Language Processing

Or the stuff no one tells you that is actually quite hard.

Recently I’ve been playing around with the last 15 years of patent publications as a ‘big data’ source. This includes over 4 million individual documents. Here I thought I’d highlight some problems I faced. I found that a lot of academic papers tend to ignore or otherwise bypass this stuff.

Sentence Segmentation

Many recurrent neural network (RNN) architectures work with sentences as an input sequence, where the sentence is a sequence of word tokens. This introduces a first problem: how do you get your sentences?

A few tutorials and datasets get around this by providing files where each line is a separate sentence. Hence, you can get your sentences by just reading the file lines.

In my experience, the only data where file lines are useful is code. For normal documents, there is no correlation between file lines and sentences; indeed, each sentence is typically of a different length and so is spread across multiple file lines.

In reality, text for a document is obtained as one large string (or at most a set of paragraph tags). This means you need a function that takes your document and returns a list of sentences: s = sent_tokenise(document).

A naive approach is to tokenise based on full stops. However, this quickly runs into issues with abbreviations: “U.S.”, “No.”, “e.g.” and the like will cut your sentences too soon.
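A toy illustration of the abbreviation problem with a naive full-stop tokeniser:

```python
def naive_sent_tokenise(text):
    """Split on full stops - too naive for real text."""
    return [s.strip() for s in text.split('.') if s.strip()]

text = "The patent was filed in the U.S. No. 1234 is an example."
for sentence in naive_sent_tokenise(text):
    print(sentence)
# "U.S." and "No." are wrongly cut into tiny "sentences"
```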

The Python NLTK library provides a sentence tokeniser function out-of-the-box – sent_tokenize(text). This is slightly more intelligent than a simple naive full stop tokenisation. However, it still appears to cut sentences too early based on some abbreviations. Also, optical character recognition errors, such as "," instead of ".", or variable names, such as "var.no.1", will give you erroneous tokenisation.


One option to resolve this is to train a pre-processing classifier to identify (i.e. add) <end-of-sentence> tokens. This could work at the word token level, as the NLTK word tokeniser does appear to extract abbreviations, websites and variable names as single word units.

You can train the Punkt sentence tokenizer – http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt and https://stackoverflow.com/questions/21160310/training-data-format-for-nltk-punkt . One option is to test training the Punkt sentence tokenizer with patent text.

Another option is to implement a deep learning segmenter on labelled data – e.g. starting from here – https://hal.archives-ouvertes.fr/hal-01344500/document . You could have a sequence in and control labels out (i.e. a sequence to sequence system), or even character-based labelling using a window around each character (10-12 characters). This could use a simple feed-forward network. The problem with this approach is that you would need a labelled dataset – ideally we would like an unsupervised approach.

Another option is to filter the output of an imperfect sentence tokeniser to remove one or two word sentences.

Not All File Formats are Equal

An aside on file formats. The patent publication data is supplied as various forms of compressed file. One issue I had was that it was relatively quick and easy to access data in a nested zip file (e.g. multiple layers down – using Python’s zipfile); zip files could be accessed as hierarchies of file objects. However, this approach didn’t work with tar files: for these I needed to extract the whole file into memory before I could access the contents. This resulted in ‘.tar’ files taking up to 20x longer to access than ‘.zip’ files.


Section Titles

Related to sentence segmentation is the issue of section titles. These are typically a set of <p></p> elements in my original patent XML files, and so form part of the long string of patent text. As such, they can confuse sentence tokenisation: they do not end in a full stop and do not exhibit normal sentence grammar.


Titles can however be identified by new lines (\n). A title will have a preceding and following new line and no full stop. It could thus be extracted using a regular expression (e.g. “\n\s?((?:\w+\s?)+)\n”).

Titles may be removed from the main text string. They may also be used as variables of “section” objects that model the document, where the object stores a long text string for the section.
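A hedged sketch of title extraction along these lines (the exact pattern is an assumption – real patent titles will need tuning, e.g. for numbering or punctuation within headings):

```python
import re

# A title sits between two newlines, contains only word characters
# and spaces, and has no terminating full stop.
TITLE_PATTERN = re.compile(r"\n\s?((?:\w+[ \t]?)+)\n")

def extract_titles(text):
    """Return candidate section titles found between newlines."""
    return [m.strip() for m in TITLE_PATTERN.findall(text)]
```

Matched spans could then be removed from the main text string, or attached to “section” objects as described above.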

As an aside, titles in HTML documents are sometimes wrapped in header tags (e.g. <h3></h3>). For web pages titles may thus be extracted as part of the HTML parsing.

Word Segmentation

About half the code examples I have seen use a naive word tokenisation that splits sentences or document strings on spaces (e.g. using doc.split()). This works fairly well but is not perfect.

The other half of the code examples use a word tokenising function supplied by a toolkit (e.g. within Keras, TensorFlow or NLTK). I haven’t looked under the hood of all of these, but I wouldn’t be surprised if some were just wrappers for the simple space split technique above (NLTK’s word_tokenize is an exception – it applies Treebank-style regular expression rules).

While these functions work well for small curated datasets, I find the following issues arise with real world data. For reference I was using word_tokenize() from NLTK.

Huge Vocabularies

A parse of 100,000 patent documents indicates that there are around 3 million unique “word” tokens.

Even with stemming and some preprocessing (e.g. replacing patent numbers with a single placeholder token), I can only cut this vocabulary down to 1 million unique tokens.


This indicates that the vocabulary on the complete corpus will easily be in the millions of tokens.

This quickly makes “word” token based systems impractical for real world datasets. For many RNN systems you will need to use a modified softmax (e.g. sampled or hierarchical) on your output, and even these techniques may grind to a halt at dimensionalities north of 1 million.

The underlying issue is that word vocabularies have a set of 50-100k words that are used frequently and a very long tail of infrequent words.

Looking at the vocabulary is instructive. You quickly see patterns of inefficiency.


Numbers

This is a big one, especially for more technical text sources where numbers turn up a lot. Each unique number that is found is considered a separate token.


This becomes a bigger problem with patent publications. Numbers occur everywhere, from parameters and variable values to cited patent publication references.

Any person looking at this quickly asks – why can’t numbers be represented in real number space rather than token space?

This almost becomes absurd when our models use thousands of 32 bit floating point numbers as  parameters – just one parameter value can represent numbers in a range of -3.4E+38 to +3.4E+38. As such you could reduce your dimensionality by hundreds of thousands of points simply by mapping numbers onto one or two real valued ranges. The problem is this then needs bespoke model tinkering, which is exactly what most deep learning approaches are trying to avoid.

Looking at deep learning papers in the field I can’t see this really being discussed. I’ve noticed that a few replace financial amounts with zeros (e.g. “$123.23” > “$000.00”). This then only requires one token per digit or decimal place. You do then lose any information stored by the number.
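The digit-to-zero replacement can be done with a one-line regex substitution (the function name is mine):

```python
import re

def normalise_numbers(text):
    """Map every digit to '0', so "$123.23" becomes "$000.00".

    Collapses all numbers of the same shape onto one token, at the
    cost of losing the actual values.
    """
    return re.sub(r"\d", "0", text)
```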

I have also noticed that some word embedding models end up mapping numbers onto an approximate number range (they tend to be roughly grouped into linear structures in embedding space). However, you still have the issue of large input/output dimensions for your projections and the softmax issue remains. There is also no guarantee that your model will efficiently encode numbers, e.g. you can easily imagine local minima where numbers are greedily handled by a largish set of hidden dimensions.

Capital Letters

As can be seen in the list of most common tokens set out above, a naive approach that doesn’t take into account capital letters treats capitalised and uncapitalised versions of a word as separate independent entities, e.g. “The” and “the” are deemed to have no relation in at least the input and output spaces.


Many papers and examples deal with this by lowering the case of all words (e.g. preprocessing using wordstring.lower()). However, this again removes information; capitalisation is there for a reason: it indicates acronyms, proper nouns, the start of sentences etc.

Some hope that this issue is dealt with in a word embedding space, for example that “The” and “the” are mapped to similar real valued continuous n-dimensional word vectors (where n is often 128 or 300). I haven’t seen though any thought as to why this would necessarily happen, e.g. anyone thinking about the use of “The” and “the” and how a skip-gram, count-based or continuous bag of words model would map these tokens to neighbouring areas of space.

One pre-processing technique to deal with capital letters is to convert each word to lowercase, but to then insert an extra control token to indicate capital usage (such as <CAPITAL>). In this case, “The” becomes “<CAPITAL>”, “the”. This seems useful – you still have the indication of capital usage for sequence models but your word vocabulary only consists of lowercase tokens. You are simply transferring dimensionality from your word and embedding spaces to your sequence space. This seems okay – sentences and documents vary in length.
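A sketch of this pre-processing step (the <CAPITAL> token follows the text above; the function name is mine):

```python
def mark_capitals(tokens):
    """Lowercase tokens, inserting <CAPITAL> before capitalised ones.

    "The" becomes "<CAPITAL>", "the" – capital usage is preserved in
    the sequence while the vocabulary stays lowercase.
    """
    out = []
    for tok in tokens:
        if tok[:1].isupper():
            out.append("<CAPITAL>")
        out.append(tok.lower())
    return out
```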

(The same technique can be applied to a character stream to reduce dimensionality by 26: if “t” is 24 and “T” is 65, a <CAPITAL> character may be inserted (e.g. at index 3) so that “T” becomes “3”, “24”.)
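The character-stream version might look like this (the indices 3, 24 and 65 follow the illustrative example above and are not from any standard encoding):

```python
def encode_chars(text, index, capital_index=3):
    """Encode characters as indices, prefixing capitals with the
    <CAPITAL> index (3 here, purely illustrative)."""
    out = []
    for ch in text:
        if ch.isupper():
            # emit the <CAPITAL> marker, then the lowercase index
            out.extend([capital_index, index[ch.lower()]])
        else:
            out.append(index[ch])
    return out
```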

Hyphenation and Either/Or

We find that most word tokenisation functions treat hyphenated words as single units.


The issue here is that the hyphenated words are considered as further tokens that are independent of their component words. However, in our documents we find that many hyphenated words are used in a similar way to their unhyphenated versions; the meaning is approximately the same, and the hyphen indicates a slightly closer relation than that of the words used independently.

One option is to hope that our magical word embeddings situate our hyphenated words in embedding space such that they have a relationship with their unhyphenated versions. However, hyphenated words are rare – typically very low use in our long tail – we may only see them one or two times in our data. This is not enough to provide robust mappings to the individual words (which may be seen 1000s of times more often).

So Many Hyphens

An aside on hyphens. There appear to be many different types of hyphen. For example, I can think of at least: a short hyphen, a long hyphen, and a minus sign. There are also hyphens from locale-specific unicode sets. The website here counts 27 different types: https://www.cs.tut.fi/~jkorpela/dashes.html#unidash . All can be used to represent a hyphen. Some dimensionality reduction would seem prudent here.

Another answer to this problem is to use a similar technique to that for capital letters: we can split hyphenated words “word1-word2” into “word1”, “<HYPHEN>”, “word2”, where “<HYPHEN>” is a special control character in our vocabulary. Again, here we are transferring dimensionality into our sequence space (e.g. into the time dimension). This seems a good trade-off. A sentence typically has a variable token length of ~10-100 tokens. Adding another token would seem not to affect this too much: we seem to have space in the time dimension.
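A sketch of the hyphen split (the function name is mine; note it leaves lone ‘-’ tokens untouched, and assumes hyphen variants have already been normalised to ‘-’):

```python
def split_hyphens(tokens):
    """Split "word1-word2" into "word1", "<HYPHEN>", "word2"."""
    out = []
    for tok in tokens:
        parts = tok.split("-")
        if len(parts) > 1 and all(parts):
            # interleave <HYPHEN> between the component words
            for part in parts[:-1]:
                out.extend([part, "<HYPHEN>"])
            out.append(parts[-1])
        else:
            out.append(tok)
    return out
```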

A second related issue is the use of slashes to indicate “either/or” alternatives, e.g. “input/output”, “embed/detect”, “single/double-click”. Again the individual words retain their meaning but the compound phrase is treated as a separate independent token. We are also in the long tail of word frequencies – any word embeddings are going to be unreliable.


One option is to see “/” as an alternative symbol for “or”. Hence, we could have “input”, “or”, “output” or “embed”, “or”, “detect”. Another option is to have a special “<SLASH>” token for “either/or”. Replacement can be performed by regular expression substitution.
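One possible regex substitution for the slash case (the pattern and function name are my own; it only splits slashes between word characters, so path names are mostly left alone):

```python
import re

def replace_slashes(text):
    """Rewrite "input/output" as "input <SLASH> output"."""
    return re.sub(r"(\w)/(\w)", r"\1 <SLASH> \2", text)
```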

Compound Words

The concept of “words” as tokens seems straightforward. However, reality is more complicated.

First, consider the terms “ice cream” and “New York”. Are these one word or two? If we treat them as two independent words “ice cream” becomes “ice”, “cream” where each token is modelled separately (and may have independent word embeddings). However, intuitively “ice cream” seems to be a discrete entity with a semantic meaning that is more than the sum of “ice” and “cream”. Similarly, if “New York” is “New”, “York” the former token may be modelled as a variation of “new” and the latter token as entity “York” (e.g. based on the use of “York” elsewhere in the corpus). Again this seems not quite right – “New York” is a discrete entity whose use is different from “New” and “York”. (For more fun also compare with “The Big Apple” – we feel this should be mapped to “New York” in semantic space, but the separate entities “New”, “York”, “The”, “Big”, “Apple” are unlikely to be modelled as related individually.)

The case of compound words probably has a solution different from that discussed for hyphenation and slashes. My intuition is that compound words reflect features in a higher feature space, i.e. in a feature space above that of the individual words. This suggests that word embedding may be a multi-layer problem.

Random Characters, Maths and Made-Up Names

This is a big issue for patent documents. Our character space has a dimensionality of around 600 unique characters, but only about 100 are used regularly – again we have a longish tail of infrequent characters.


Looking at our infrequent characters we see some patterns: accented versions of common characters (e.g. ‘Ć’ or ‘ë’); unprintable unicode values (possibly from different character sets in different locales); and maths symbols (e.g. ‘≼’ or ‘∯’).

When used to construct words we end up with many variations of common word tokens (‘cafe’ and ‘café’) due to the accented versions of common characters.

Our unprintable unicode values appear not to occur regularly in our rare vocabulary. It thus appears many of these are local variations on control characters, such as whitespace. This suggests that we would not lose too much information with a pre-processing step that removed these characters and replaced them with a space (if they are not printable on a particular set of computers this suggests they hold limited utility and importance).
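Such a pre-processing step might be sketched as follows (note that str.isprintable() also treats newlines and tabs as non-printable, so this should be applied after any line-based processing such as title extraction):

```python
def strip_unprintable(text):
    """Replace non-printable characters with spaces.

    Covers control and format characters (e.g. zero-width spaces)
    that otherwise create spurious token variants.
    """
    return "".join(ch if ch.isprintable() else " " for ch in text)
```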

The maths symbols cause us a bit of a problem. Many of our rare tokens are parts of equations or variable names. These appear somewhat independent of our natural language processing – they are not “words” as we would commonly understand them.

One option is to use a character embedding. I have seen approaches that use between 16 and 128 dimensions for character embedding. This seems overkill. There are 26 letters, doubled for capital letters, and around 10 common punctuation characters (the printable set of punctuation in Python’s string library is only of length 32). All of the unicode character space may be easily contained within one real 32 bit floating point dimension.

High dimensionality would appear to risk overfitting and would quickly expand our memory and processing requirements. My own experiments suggest that case and punctuation clustering can be seen with just 3 dimensions, as well as use groupings such as vowels and consonants. One risk of low dimensional embeddings is a lack of redundant representations that makes the system fragile. My gut says some low dimensionality of between 1 and 16 dimensions should do. Experimentation is probably required within a larger system context.

(Colin Morris has a great blog post where he looks at the character embedding used in one of Google Brain’s papers – the dimensionality plots / blogpost link can be found here: http://colinmorris.github.io/lm1b/char_emb_dimens/.)

What would we like to see with a character embedding:

  • a relationship between lower and upper case characters – preferably some kind of dimensionality reduction based on a ‘capital’ indication;
  • mapping of out of band unicode characters onto space or an <UNPRINTABLE> character;
  • mapping of accented versions of characters near to their unaccented counterpart, e.g. “é” should be a close neighbour of “e”;
  • clustering of maths symbols; and
  • even spacing of frequently used characters (e.g. letters and common punctuation) to provide a robust embedding.

There is also a valid question as to whether character embedding is needed. Could we have a workable solution with a simple lookup table mapping?
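A lookup table needs nothing more than a dict (the special tokens here are illustrative, not from any library):

```python
def build_char_index(corpus, specials=("<PAD>", "<UNK>", "<CAPITAL>")):
    """Build a char-to-index lookup table from a corpus.

    Low indices are reserved for special control tokens; remaining
    characters are assigned indices in sorted order.
    """
    table = {tok: i for i, tok in enumerate(specials)}
    for ch in sorted(set(corpus)):
        table.setdefault(ch, len(table))
    return table
```

An embedding layer could still be learned on top of these indices later; the table just fixes the input dimensionality.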

Not Words

Our word tokenisation functions also output a lot of tokens that are not really words. These include variable names (e.g. ‘product.price’ or ‘APP_GROUP_MULTIMEDIA’), websites (e.g. ‘www.mybusiness.com’), path names (e.g. ‘/mmt/glibc-15/lib/libc.so.6’), and text examples (‘PεRi’). These are often: rare – occurring one or two times in a corpus; and document-specific – they are often not repeated across documents. They make up a large proportion of the long-tail of words.

Often these tokens are <UNK>ed, i.e. they are replaced by a single token that represents rare words that are not taken into account. As our token use follows a power law distribution this can significantly reduce our vocabulary size.


For example, the plot above shows that with no filtering we have 3 million tokens. By filtering out tokens that only appear once we reduce our vocabulary size by half: to 1.5 million tokens. By filtering out tokens that only appear twice we can reduce our dimensionality by another 500,000 tokens. Gains fall off as we raise the threshold; by removing tokens that appear fewer than 10 times, we can get down to a dimensionality of around 30,000. This is around the dimensionality you tend to see in most papers and public (toy) datasets.
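The threshold filtering described above can be sketched with collections.Counter (the names are mine):

```python
from collections import Counter

def build_vocab(tokens, min_count=2):
    """Keep tokens seen at least min_count times."""
    counts = Counter(tokens)
    return {tok for tok, c in counts.items() if c >= min_count}

def unk_filter(tokens, vocab):
    """Replace out-of-vocabulary tokens with <UNK>."""
    return [tok if tok in vocab else "<UNK>" for tok in tokens]
```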

The problem with doing this is you tend to throw away a lot of your document. The <UNK> then becomes like “the” or “of” in your model. You get samples such as “The <UNK> then started to <UNK> about <UNK>”, where any meaning of the sentence is lost. This doesn’t seem like a good approach.

The problem of <UNK> is a fundamental one. Luckily the machine learning community is beginning to wake up to this a little. In the last couple of years (2016+), character embedding approaches have been gaining traction. Google’s neural machine translation system uses morpheme-like sub-word units (‘wordpieces’) to deal with out of vocabulary words (see https://research.google.com/pubs/pub45610.html). The Google Brain paper on Exploring the Limits of Language Modelling (https://arxiv.org/pdf/1602.02410.pdf) explores a character convolutional neural network (‘CharCNN’) that is applied to characters (or character embeddings).

What Have We Learned?

Summing up, we have found the following issues:

  • sentence segmentation is imperfect and produces noisy sentence lists;
  • some parts of a document such as titles may produce further noise in our sentence lists; and
  • word segmentation is also imperfect:
    • our vocabulary size is huge: 3 million tokens (on a dataset of only 100,000 documents);
    • numbers are a problem – they are treated as separate discrete units, which seems inefficient;
    • the concept of a “word” is fuzzy – we need to deal with compound words, hyphenation and either/or notation;
    • there are different character sets that lead to separate discrete tokens – e.g. capitals and accented letters – when common underlying tokens appear possible; and
    • non-language features such as equations and variable names contribute heavily to the vocabulary size.

Filtering out non-printable unicode characters would appear a good preprocessing step that has minimal effect on our models.

Character embedding to a low dimensional space appears useful.

Word/Character Hybrid Approach?

Looking at my data, a character-based approach appears to be the one to use. Even if we include all the random characters, we only have a dimensionality of 600 rather than 3 million for the word token space.

Character-based models would appear well placed to deal with rare, out-of-vocabulary words (e.g. ‘product.price’). It also seems much more sensible to treat numbers as sequences of digits as opposed to discrete tokens. Much like image processing, words then arise as a layer of continuous features. Indeed, it would be relatively easy to insert a layer to model morpheme-like character groupings (it would appear this is what the CNN approaches are doing).

The big issue with character level models is training. Training times increase by orders of magnitude (state of the art systems take weeks to train on machines with tens of £1k graphics cards).

However, we do have lots of useful training data at the word level. Our 50 most common unfiltered word tokens make up 46% (!) of our data. The top 10,000 tokens make up 95% of the data. Hence, character information seems most useful for the long tail.
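Coverage figures like these can be computed with collections.Counter (the function name is mine):

```python
from collections import Counter

def top_k_coverage(tokens, k):
    """Fraction of all token occurrences covered by the k most
    common token types."""
    counts = Counter(tokens)
    covered = sum(c for _, c in counts.most_common(k))
    return covered / len(tokens)
```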


This then suggests some kind of hybrid approach. This could be at a model or training level. The 95% of common data would suggest we could train a system using common words as ground truth labels and configure this model to generalise over the remaining 5%. Alternatively, we could have a model that operates on 10k top tokens with <UNK>able words being diverted into a character level encoding. The former seems preferable as it would allow a common low-level interface, and the character information from the common words could be generalised to rare words.

I’ll let you know what I come up with.

If you have found nifty solutions to any of these issues, or know of papers that address them, please let me know in the comments below!

Fixing Errors on Apache-Served Flask Apps

This is just a quick post to remind me of the steps to resolve errors on an Apache-served Flask app. I’m using Anaconda as I’m on Puppy Linux (old PC) and some compilations give me errors. Stuff in square brackets is for you to fill in.

Log into remote server (I use ssh keys):

ssh -p [MyPort] [user]@[server]

Check the error logs (the name of the log is set in the app configuration):

nano /var/log/apache2/[my_app_error].log

On a local machine clone the production Flask App (again I have ssh keys setup):

git clone git@github.com:[user]/[project].git
cd [project]

Setup a local virtual environment (with the right version of python):

conda create -n [project] python=2.7

Activate the environment:

source activate [project]

Install requirements:

pip install -r requirements.txt

[Use ‘conda install X’ for stuff that has trouble compiling (‘lxml’ is a nightmare).]

Setup environment variables:

Add ‘etc/conda/activate.d’ and ‘etc/conda/deactivate.d’ folders in the Anaconda environments directory and create an env_vars.sh file in each folder:

mkdir -p ~/anaconda3/envs/[project]/etc/conda/activate.d
touch ~/anaconda3/envs/[project]/etc/conda/activate.d/env_vars.sh
mkdir -p ~/anaconda3/envs/[project]/etc/conda/deactivate.d
touch ~/anaconda3/envs/[project]/etc/conda/deactivate.d/env_vars.sh

(The ‘-p’ flag in ‘mkdir’ also creates the required parent directories.)

In the ‘activate.d/env_vars.sh’ set the environment variables:

cd [project_path]
export HOST=""
export PORT="80"
export MY_VAR='customvalue'

In the ‘deactivate.d/env_vars.sh’ clear the environment variables:

unset HOST
unset PORT
unset MY_VAR

Now you should be able to run the app and have it hosted locally.

You can then test and fix the bug. Then add, commit and push the updates.

Then re-log into the remote server. Go to the project directory. Pull the updates from github. Restart the server.

cd [project]
git pull origin master
sudo service apache2 restart