Easy Audio/Video Capture with Python

At present it is difficult to obtain audio/video data in Python. For example, many deep learning methods assume you have easy access to your data in the form of a numpy array. Often you don’t. Based on the good efforts of those online, this post presents a number of Python classes to address this issue.

Just give me the code.

General Interface

Firstly, we can use threads to constantly update sensor data in the background. We can then read this data asynchronously.

Secondly, we can define a general interface for sensor data.

import threading

class SensorSource:
    """Abstract object for a sensory modality."""
    def __init__(self):
        """Initialise object."""
    def start(self):
        """Start capture source."""
        if self.started:
            print('[!] Asynchronous capturing has already been started.')
            return None
        self.started = True
        self.thread = threading.Thread(
        return self
    def update(self):
        """Update data."""
    def read(self):
        """Read data."""
    def stop(self):
        """Stop daemon."""
        self.started = False


For our video capture class, we can use OpenCV. You can install this in a conda environment using

conda install opencv
or via pip using
pip install opencv
. This allows access to the cv2 library.

Beware: you may need to do a bit of tweaking to get your video capture working – different cameras / system configurations need different tweaks.

# Video source

import cv2

class VideoSource(SensorSource):
    """Object for video using OpenCV."""
    def __init__(self, src=0):
        """Initialise video capture."""
        # width=640, height=480
        self.src = src
        self.cap = cv2.VideoCapture(self.src)
        #self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
        #self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
        self.grabbed, self.frame = self.cap.read()
        self.started = False
        self.read_lock = threading.Lock()
    def update(self):
        """Update based on new video data."""
        while self.started:
            grabbed, frame = self.cap.read()
            with self.read_lock:
                self.grabbed = grabbed
                self.frame = frame
    def read(self):
        """Read video."""
        with self.read_lock:
            frame = self.frame.copy()
            grabbed = self.grabbed
        return grabbed, frame

    def __exit__(self, exec_type, exc_value, traceback):

The initialisation sets up the camera and the threading lock. The update method is run as part of the thread to continuously update the self.frame data. The data may then be (asynchronously) accessed using the read() method on the object. The exit line means that the camera resource is released when the object is deleted or the Python kernel is stopped so you can then use the camera in other applications.

Beware: I had issues setting the width and height so I have commented out those lines. Also remember OpenCV provides the data in BGR format – so channels 0, 1, 2 correspond to Blue, Green and Red rather than RGB. You might also want to set to YUV mode by adding the following to the __init__ method:

self.cap.set(16, 0)


You’ll see many posts online that use pyaudio for audio capture. I couldn’t get this to work in a conda environment due to an issue with the underlying PortAudio library. I had more success with alsaaudio:

conda install alsaaudio
pip install alsaaudio

# Audio source
import struct
from collections import deque
import numpy as np
import logging
import alsaaudio

class AudioSource(SensorSource):
    """Object for audio using alsaaudio."""
    def __init__(self, sample_freq=44100, nb_samples=65536):
        """Initialise audio capture."""
        # Initialise audio
        self.inp = alsaaudio.PCM(
        # set attributes: Mono, frequency, 16 bit little endian samples
        self.read_lock = threading.Lock()
        # Create a FIFO structure for the data
        self._s_fifo = deque([0] * nb_samples, maxlen=nb_samples)
        self.l = 0
        self.started = False
        self.read_lock = threading.Lock()
    def update(self):
        """Update based on new audio data."""
        while self.started:
            self.l, data = self.inp.read()
            if self.l > 0:
                # extract and format sample 
                raw_smp_l = struct.unpack('h' * self.l, data)
                with self.read_lock:
                    f'Sampler error occur (l={self.l} and len data={len(data)})'
    def read(self):
        """Read audio."""
        with self.read_lock:
            return self.l, np.asarray(self._s_fifo, dtype=np.int16)

The approach for audio is similar to video. We set up an audio input source and a threading lock in the __init__ method. In the audio case, we are recording a (time) series of audio samples, so we do this in a buffer of length nb_samples. The deque object acts as a FIFO queue and provides this buffer. The update method is run continuously in the background within the thread and adds new samples to the queue over time, with old samples falling off the back of the queue. The struct library is used to decode the binary data from the alsaaudio object and convert it into integer values that we can add to the queue. When we read the data, we convert the queue to a 16-bit integer numpy array.

In both cases, the read() method returns a tuple: (data_check_value, data) where the data_check_value is a value returned from the underlying capture objects. It is often useful for debugging.

Combining and Simplifying

Now we have defined sensor data sources, we can combine them so that we only need to perform one read() call to obtain data from all sources. To do this, we create a wrapper object that allows us to iterate through each added sensor data source.

class CombinedSource:
    """Object to combine multiple modalities."""
    def __init__(self):
        self.sources = dict()
    def add_source(self, source, name=None):
        """Add a source object.
        source is a derived class from SensorSource
        name is an optional string name."""
        if not name:
            name = source.__class__.__name__
        self.sources[name] = source
    def start(self):
        """Start all sources."""
        for name, source in self.sources.items():
    def read(self):
        """Read from all sources.
        return as dict of tuples."""
        data = dict()
        for name, source in self.sources.items():
            data[name] = source.read()[1]
        return data
    def stop(self):
        """Stop all sources."""
        for name, source in self.sources.items():
    def __del__(self):
        for name, source in self.sources.items():
            if source.__class__.__name__ == "VideoSource":
    def __exit__(self, exec_type, exc_value, traceback):
        for name, source in self.sources.items():
            if source.__class__.__name__ == "VideoSource":

The delete and exit logic is added to clean up the camera object – without these the camera is kept open and locked, which can cause problems. Data is returned as a dictionary, indexed by a string name for the data source.

We can simplify things even further by creating a derived class that automatically adds an audio and video capture object.

class AVCapture(CombinedSource):
    """Auto populate with audio and video."""
    def __init__(self):
        self.sources = dict()
        a = AudioSource()
        self.add_source(a, "audio")
        v = VideoSource()
        self.add_source(v, "video")

This then allows us to access audio and video data in a couple of lines.

av = AVCapture()
data = av.read()

Here are some outputs from:

import matplotlib.pyplot as plt
Colours are crazy because imshow expects RGB not BGR!
BBC Radio 6 Music in Graphical Form

Finishing Off

You can find the code in a Gist here, together with some testing lines that you could easily convert into a library.

You can also expand the sensor classes to capture other data. I plan to create a class to capture CPU and memory use information.


Making Proper Histograms with Numpy and Matplotlib

Often I find myself needing to visualise an array, such as bunch of pixel or audio channel values. A nice way to do this is via a histogram.

When building histograms you have two options: numpy’s histogram or matplotlib’s hist. As you may expect numpy is faster when you just need the data rather than the visualisation. Matplotlib is easier to apply to get a nice bar chart.

So I remember, here is a quick post with an example.

# First import numpy and matplotlib
import numpy as np
import matplotlib.pyplot as plt

I started with a data volume of size 256 x 256 x 8 x 300, corresponding to 300 frames of video at a resolution of 256 by 256 with 8 different image processing operations. The data values were 8-bit, i.e. 0 to 255. I wanted to visualise the distribution of pixel values within this data volume.

Using numpy, you can easily pass in the whole data volume and it will flatten the arrays within the function. Hence, to get a set of histogram values and integer bins between 0 and 255 you can run:

values, bins = np.histogram(data_vol, np.arange(0, 255))

You can then use matplotlib’s bar chart to plot this:

plt.bar(bins[:-1], values, width = 1)

Using matplotlib’s hist function, we need to flatten the data first:

results = plt.hist(data_vol.ravel(), bins=np.arange(0, 255))

The result of both approaches is the same. If we are being more professional, we can also use more of matplotlib’s functionality:

fig, ax = plt.subplots()
results = ax.hist(data_vol.ravel(), bins=np.arange(0, 255))
ax.set_title('Pixel Value Histogram')
ax.set_xlabel('Pixel Value')

Things get a little more tricky when we start changing our bin sizes. A good run through is found here. In this case, the slower Matplotlib function becomes easier:

fig, ax = plt.subplots()
results = ax.hist(
    bins=np.linspace(0, 255, 16)
ax.set_title('Pixel Value Histogram (4-bit)')
ax.set_xlabel('Pixel Value')
Using 16-bins provides us with 4-bit quantisation. You can see here we could represent a large chunk of the data with just three values (if we subtract 128: <0 = -1, 0 = 0 and >0 = 1).

Bio-Inspired Robotics for Beginners

There are several well-known problems with modern “deep” learning approaches. These include the need for large quantities of training data and a lack of robustness. These are related. Neural network architectures are trained by computing an error between some “ground truth” data in the training data and the architecture’s prediction of that same data, given a set of associated input data. The error is directed based on differentials for each component of the architecture. This feels to me as doing everything in reverse.

Animal brains offer an alternative view of intelligence. Animals are not taught complex behaviours by providing millions of labelled training examples. Instead, they learn the natural (local) structure of their environment over time.

Photo by Pixabay on Pexels.com

What Do We Know that Might Help?

So, what do we know about animal brains that could help us build artificial intelligent systems?

Quite a lot it turns out.

Mammals have evolved a clever solution to the problem of predicting their environment. They have multiple sense organs that provide information in the form of electrical signals. These electrical signals are provided to neural structures. There are roughly three tiers: the brainstem, the midbrain and the cortex. These represent different levels of processing. They also show us the pathway of evolution: newer structures have been built on top of older structures, and older structures have been co-opted to provide important support functions for the newer structures.

In the mammalian brain there are some commonalities as to how information is received and processed. The cortex appears to be the main computational device. Different sensory modalities appear to be processed using similar cortical processes. Pre-processing and feedback is controlled by the mid-brain structures such as the thalamus and basal ganglia.  

In terms of knowledge of sensory processing, we know most about the visual system. We then know similar amounts about auditory and motor systems, including perception of the muscles and skin (somatosensory). We know the least about smell and interoception, our sensing of internal signals relating to organs and keeping our body in balance. All the sensory systems use nerve fibres to communicate information in the form of electrical signals.

Cortex I/O

In mammalian brains, the patterns of sensory input to the brain and motor output are fairly well conserved across species. All input and output (apart from smell) is routed via the thalamus. The basal ganglia appears to be a support structure for at least motor control. The brain is split into two halves, and each half receives input from one side of the body. Most mammals have the following:

  • V1 – the primary visual cortex – an area of the cortex that receives an input from the eye (the retina) via the thalamus;
  • A1 – the primary audio cortex – an area of the cortex that receives an input from the ear via the thalamus;
  • S1 – the primary somatosensory cortex – an area of the cortex that receives an input from touch, position, pain and temperature sensors that are positioned over the skin and within muscle structures (again via the thalamus);
  • Insula – an area of the cortex that receives an input from the body, indicating its internal state, from the thalamus; and
  • M1 – the primary motor cortex – an area of the cortex that provides an output to the muscles (it receives feedback from the thalamus and basal ganglia).

For a robotic device, we normally have access to the following:

  • Image data – for video, frames (two-dimensional matrices) at a certain resolution (e.g. 640 by 480) with a certain frame rate (e.g. 30-60 frames per second) and a certain number of channels (e.g. 3 for RGB). The cortex receives image information via the lateral geniculate nucleus (LGN) of the thalamus. The image information is split down the centre, so each hemisphere receives one half of the visual field.  The LGN has been shown to perform a Difference Of Gaussians (DOG) computation to provide something similar to an edge image. The image that is formed in M1 of the cortex is also mapped to a polar representation, with axes representing an angle of rotation and a radius (or visual degree from the horizontal).
  • Audio – in one form, a one-dimensional array of intensities or amplitudes (typically 44100 samples per second) with two channels (e.g. left and right). The cochlea of the ear actually outputs frequency information as opposed to raw sound (e.g. pressure) information. This can be approximated in a robotic device by taking the Fast Fourier Transform to get a one-dimensional array of frequencies and amplitudes. 
  • Touch sensors / motor positions / capacitive or resistive touch – this is more ropey and has a variety of formats. We can normally pre-process to a position (in 2 or 3 dimensions) and an intensity. This could be multiple channel, e.g. at each position we could have a temperature reading and a pressure reading. On a LEGO Mindstorms ev3 robot, we have a Motor class that provides information such as the current position of a motor in pulses of the rotary encoder, the current motor speed, whether the motor is running and certain motor parameters. 
  • Computing device information – this is again more of a jump. An equivalent of interoception could be seen as information on running processes and system utilization. If the robotic device has a battery this could also include battery voltages and currents. In Python, we can use something like psutil to get some of this information. On a LEGO Mindstorms ev3 robot, we have PowerSupply classes that provide information on the battery power.
  • Motor commands – robotic devices may have one or more linear or rotary motors. Typically, these are controlled by sending motor commands. This may be to move to a particular relative position, to move at a given speed for a given time and/or to rotate by a given number of rotations.

We will leave smell for now (although it’s a large part of many mammals sensory repertoire). It probably slots in best as part of interoception. One day sensors may provide an equivalent data stream. 

We can summarise this rough mapping as follows:

Eyes / Retina > V1Video Camera > Split L/R > Polar Mapping
Ears / Cochlea > A1Microphone > Channel Split L/R > FFT
Outer body + muscle > S1Multisensor > Position + value
Interoception > InsulaDevice measurements > numeric array
M1 > musclesNumeric commands > motors

One advantage of the “deep” learning movement is that it is now conventional to represent all information as multidimensional arrays, such as Numpy arrays within Python. Indeed, we can consider “intelligence” as a series of transformations between different multidimensional arrays.

Cortex Properties 

The mammalian cortex is a two-dimensional sheet. This provides a strong constraint on the computational architecture. Human brains appear wrinkly because they are trapped within our skulls and need to maximise their surface area. Mouse brains are quite smooth.

The cortex is a layered structure, providing its thickness. Each layer is a few mm in height. There are between 4 and 6 layers, depending on the area of the cortex. The layers contain a large number of implementing neurons. These neurons provide a combination of excitatory and inhibitory connections. Different layers receive different inputs and provide different outputs. Feedback appears to be supplied over a large area via layer 1, input is received from the thalamus at layer 4, input is received from other parts of the cortex at layers 3 and 4, layer 2 provides feed back to a neighbouring cortical area, layer 3 provides a feed forward output to other cortical areas (wider range) and layers 5 and 6 provide feedback to the thalamus, layer 5 also provides a feed forward output to a neighbouring cortical area. Computation occurs vertically in the layers and information is passed within the plane of the cortical sheet.

The two-dimensional cortical sheet of many mammals appears to have a common general topology. The input and output areas appear reasonably fixed, and are likely genetically determined.  The visual, somatosensory and motor areas appear aligned. This may be what creates “embodied” intelligence; we think conceptually using a common co-ordinate system. For example, the half an image from the eyes is aligned bottom-to-top within the visual processing areas, which is aligned with the feet-to-head axis of the body as felt and the feet-to-head axis of the body as acted-upon (i.e. V1, S1 and M1 are aligned). This makes sense – it is more efficient to arrange our maps in this way.

From Finlay et al

The cortex also appears to have neuronal groupings that have certain functional roles. This is best understood in the visual processing areas, where different cortical columns are found to relate to different portions of the visual field; each column has a receptive field equivalent to a small group of pixels (say somewhere around 1000). Outside of the visual field evidence is more shaky. 

The cortex of higher mammals also appears to have a uniform volume but a differing neuronal density. This neuronal density appears similar to a diffusion gradient. Within the visual areas towards the back of the brain there are a large number of neurons per square millimetre; towards the front of the brain there are fewer neurons per square millimetre. The baboon has a 4:1 ratio. However, because the volume of the cortical sheet is reasonably constant, the neurons towards the front of the brain are more densely connected (as there is room). A simple gradient in neuron number may provide an information bottleneck that forces a compression of neural representations, leading to greater abstraction.

Considering a two-dimensional computing sheet gives us an insight into two cortical pathways that have been described for vision. A first processing pathway (the dorsal stream) is the line drawn from V1 to S1 and M1. This pathway is swayed towards motion, i.e. information that is useful for muscle control. A second pathway (the ventral stream) is the line drawn from V1 to A1. This pathway is swayed towards object recognition, which makes sense because we can correlate audio and visual representations of objects to identify them. What is also interesting is that the lower visual field abuts the first processing pathway and the upper visual field abuts the second processing pathway – this may reflect the fact that the body is orientated below the horizontal line of vision and so it makes sense to map the lower visual field to the body in a more one-to-one manner.

The cortical sheet also gives us an insight into implementing efficient motor control. You will see that S1, the primary somatosensory area, is adjacent to M1, the primary motor cortex. This means that activation of the motor cortex, e.g. to move muscles, will activate the somatosensory representations regardless of any somatosensory input from the thalamus. We thus have two ways of activating the body representations, one being generated from the motor commands and one being generated by the input from the body. This gives us the possibility to compare and integrate these signals in time to provide control. For example, if the signal received from the body did not match the proposed representation from the motor commands then either the motor commands may be modified or our sensing of our body is modified (or both).

So to recap:

  • the main computing machinery of the brain is a two-dimensional sheet;
  • it has an embodied organisation – the relative placement of processing areas has functional relevance;
  • it has a density gradient that is aligned with the embodied organisation; and
  • it appears to use common, repeated computing units.

Cortex to Robot

This suggests a plan for organising computation for a robotic device.
Computers typically work according to a one-dimensional representation: data in memory. If we are creating a bio-inspired robotic device, we need to extend this to two dimensions, where the two dimensions indicate an ordering of processing units. It might be easier to image a football field of interconnected electronic calculators.

We can then align our input arrays with the processing units. At input and output areas of any computing architecture there could be a one-to-one organisation of array values to processing units. The number of processing units may decrease along an axis of the two-dimensional representation, while their interconnections increase. In fact, a general pattern is that connectivity is local at points of input or output but becomes global in between these points. These spaces in between store more abstract representations.

This architecture of the cortex reminds us of another architecture that has developed with “deep” learning: the autoencoder.

Standard Autoencoder
Standard Autoencoder

Indeed, this is not by accident – the structure of the visual processing areas of the brain has been the inspiration for these kind of structures. Taking a two-dimensional view of computation also allows us to visualise connectivity using a two-dimensional graph structure.


So we can start our bio-inspired robot by constructing a setup where different sensory modalities are converted into Numpy arrays and then processed so that we can compare them in a two-dimensional structure. The picture becomes murkier once we look at how associations are formed and how the thalamus comes into the equation. However, by looking at possible forms of our signals in a common processing environment we can begin to join the dots.


Capturing Live Audio and Video in Python

In my robotics projects I want to capture live audio and video data and convert it into Numpy multi-dimensional arrays for further processing. To save you several days, this blog post explains how I go about doing this.

Audio / Video Not Audio + Video

A first realisation is that you need to capture audio and video independently. You can record movie files with audio, but as far as I could find there is no simple way to live capture both audio and video data.


For video processing, I found there were two different approaches that could be used to process video data:

  • Open CV in Python; and
  • Wrapping FFMPEG using SubProcess.

Open CV

The default library for video processing in Python is OpenCV. Things have come a long way since my early experiences with OpenCV in C++ over a decade ago. Now there is a nice Python wrapper and you don’t need to touch any low-level code. The tutorials here are a good place to start.

I generally use Conda/Anaconda these days to manage my Python environments (the alternative being old skool virtual environments). Setting up a new environment with Jupyter Notebook and Open CV is now straightforward:

conda install opencv jupyter

As a note – installing OpenCV in Conda seems to have been a pain up to a few years ago. There are thus several out of date Stack Overflow answers that come up in the searches, that refer to installing from specific sources (e.g. from menpo). This appears not to be needed now.

One problem I had in Linux (Ubuntu 18.04) is that the GTK libraries didn’t play nicely in the Conda environment. I could capture images from the webcam but not display them in a window. This lead me to look for alternative visualisation strategies that I describe below.

A good place to start with OpenCV is this video tutorial. As drawing windows led to errors I designed a workaround where I used PIL (Python Image Library) and IPython to generate an image from the Numpy array and then show it at about 30 fps. The code separates out each of the YUV components and displays them next to each other. This is useful for bio-inspired processing.

# Imports
import PIL
import io
import cv2
import matplotlib.pyplot as plt
from IPython import display
import time
import numpy as np

# Function to convert array to JPEG for display as video frame
def showarray(a, fmt='jpeg'):
    f = io.BytesIO()
    PIL.Image.fromarray(a).save(f, fmt)

# Initialise camera
cam = cv2.VideoCapture(0)
# Optional - set to YUV mode (remove for BGR)
cam.set(16, 0)
# These allow for a frame rate to be printed
t1 = time.time()

# Loops until an interrupt
        t2 = time.time()
        # Capture frame-by-frame
        ret, frame = cam.read()
        # Join components horizontally
        joined_array = np.concatenate(
            frame[:, 1::2, 1], 
            frame[:, 0::2, 1]
        ), axis=1)
        # Use above function to show array
        # Print frame rate
        print(f"{int(1/(t2-t1))} FPS")
        # Display the frame until new frame is available
        t1 = t2
except KeyboardInterrupt:
    # Release the camera when interrupted
    print("Stream stopped")</code></pre>

In the above code, “frame” is a three-dimensional tensor or array where the first dimension relates to rows of the image (e.g. the y-direction of the image), the second dimension relates to columns of the image (e.g. the x-direction of the image) and the third dimension relates to the three colour channels. Often for image processing it is useful to separate out the channels and just work on a single channel at a time (e.g. equivalent to a 2D matrix or grayscale image).


An alternative to using OpenCV is to use subprocess to wrap the FFMPEG, a command line video and audio processing utility.

This is a little trickier as it involves accessing the video buffers. I have based on solution on this guide by Zulko here.

import subprocess as sp
import numpy as np
import matplotlib.pyplot as plt

FFMPEG_BIN = "ffmpeg"
# Define command line command
command = [ FFMPEG_BIN,
            '-i', '/dev/video0',
            '-f', 'image2pipe',
            '-pix_fmt', 'rgb24',
            '-an','-sn', #-an, -sn disables audio and sub-title processing respectively
            '-vcodec', 'rawvideo', '-']
# Open pipe
pipe = sp.Popen(command, stdout = sp.PIPE, bufsize=(640*480*3))

# Display a few frames
no_of_frames = 5
fig, axes = plt.subplots(no_of_frames, 1)

for i in range(0, no_of_frames):
    # Get the raw byte values from the buffer
    raw_image = pipe.stdout.read(640*480*3)
    # transform the byte read into a numpy array
    image = np.frombuffer(raw_image, dtype='uint8')
    image = image.reshape((480,640, 3))
    # Flush the pipe

Now I had issues flushing the pipe in a Jupyter notebook, so I ended up using the OpenCV method in the end. Also it is trickier working out the byte structure for YUV data.


My middle daughter generates a lot of noise.

For audio, there are also a number of options. I have tried:

Now PyAudio appears to be preferred. However, I am quickly learning that audio / video processing in Python is not yet as polished as pure image processing or building a neural network.

PyAudio provides a series of wrappers around the PortAudio libraries. However, I had issues getting this to work in an Conda environment. Initially, no audio devices showed up. After a long time working through Stack Overflow, I found that installing from the Conda-Forge source did allow me to find audio devices (see here). But even though I could see the audio devices I then had errors opening an audio stream. (One tip for both audio and video is to look at your terminal output when capturing audio and video – the low level errors will be displayed here rather than in a Jupyter notebook.)


Given my difficulties with PyAudio I then tried AlsaAudio. I had more success with this.

My starting point was the code for recording audio that is provided in the AlsaAudio Github repository. The code below records a snippet of audio then loads it from the file into a Numpy array. It became the starting point for a streaming solution.

# Imports
import alsaaudio
import time
import numpy as np

# Setup Audio for Capture
inp = alsaaudio.PCM(alsaaudio.PCM_CAPTURE, alsaaudio.PCM_NONBLOCK, device="default")

# Record a short snippet
with open("test.wav", 'wb') as f:
    loops = 1000000
    while loops > 0:
        loops -= 1
        # Read data from device
        l, data = inp.read()
        if l:

f = open("test.wav", 'rb')

# Open the device in playback mode. 
out = alsaaudio.PCM(alsaaudio.PCM_PLAYBACK, device="default")

# Set attributes: Mono, 44100 Hz, 16 bit little endian frames

# The period size controls the internal number of frames per period.
# The significance of this parameter is documented in the ALSA api.
# We also have 2 bytes per sample so 160*2 = 320 = number of bytes read from buffer

# Read data from stdin
data = f.read(320)
numpy_array = np.frombuffer(data, dtype='<i2')
while data:
    data = f.read(320)
    decoded_block = np.frombuffer(data, dtype='<i2')
    numpy_array = np.concatenate((numpy_array, decoded_block))

The numpy_array is then a long array of sound amplitudes.

Sampler Object

I found a nice little Gist for computing the FFT here. This uses a Sampler object to wrap the AlsaAudio object.

from collections import deque
import struct
import sys
import threading
import alsaaudio
import numpy as np

# some const
# 44100 Hz sampling rate (for 0-22050 Hz view, 0.0227ms/sample)
# 66000 samples buffer size (near 1.5 second)
NB_SAMPLE = 66000

class Sampler(threading.Thread):
    def __init__(self):
        # init thread
        self.daemon = True
        # init ALSA audio
        self.inp = alsaaudio.PCM(alsaaudio.PCM_CAPTURE, alsaaudio.PCM_NORMAL, device="default")
        # set attributes: Mono, frequency, 16 bit little endian samples
        # sample FIFO
        self._s_lock = threading.Lock()
        self._s_fifo = deque([0] * NB_SAMPLE, maxlen=NB_SAMPLE)

    def get_sample(self):
        with self._s_lock:
            return list(self._s_fifo)

    def run(self):
        while True:
            # read data from device
            l, data = self.inp.read()
            if l > 0:
                # extract and format sample (normalize sample to 1.0/-1.0 float)
                raw_smp_l = struct.unpack('h' * l, data)
                smp_l = (float(raw_smp / 32767) for raw_smp in raw_smp_l)
                with self._s_lock:
                print('sampler error occur (l=%s and len data=%s)' % (l, len(data)), file=sys.stderr)

Next Steps

This is where I am so far.

The next steps are:

  • look into threading and multiprocessing so that we can run parallel audio and video sampling routines;
  • extend the audio (and video?) processing to obtain the FFT; and
  • optimise for speed of capture.

Predicting the Future

The rise of machine learning and developments in neuroscience hint that prediction is key to how brains navigate the world. But how could this work in practice?

Let’s ignore neuroscience a minute and just think how we would manually predict the future. Off the top of my head there appear to be three approaches, which we will look at below.

Instantaneous Prediction Using Rates of Change

Photo by Pixabay on Pexels.com

In the physical world, one way to predict the future is to remember our secondary school physics (or university kinematics). For example, if we wanted to predict a position of an object moving along one dimension, we would attempt to find out its position, speed and acceleration, and use equations of motion. In one-dimension, speed is just a measure of a change of position over time (a first-order differential with respect to time). Acceleration is a measure of change of speed over time (a second order differential with respect to time, or a change in a change).

In fact, it turns out the familiar equations of motion are actually just a specific instance of a more general mathematical pattern. James Gregory, Brook Taylor and Colin Maclaurin formalised this approach in the 17th and 18th centuries, but thinking on this issue goes back at least to Zeno, Democritus and Aristotle in ancient Greece, as well as Chinese, Indian and Middle Eastern mathematicians. In modern times, we generally refer to the pattern as a Taylor Series: a function may be represented as an infinite sum of differentials about a point.

For example, our normal equation for distance travelled in one dimension is d(t) = d_0 + v*t + \frac{1}{2}*a*t^2, where d is distance, v is velocity and a is acceleration. In this case, velocity is our first differential – a first-order rate of change of distance with time – and acceleration is our second differential – a second-order rate of change of distance with time (or a rate of change of velocity). We know from school that if we have constant velocity, a is zero and we just have d(t) = d_0 + v*t. What isn’t stressed as much is that the equations learnt by every school kid are for a single dimension with at worst constant acceleration. It is not until university that the veil begins to be removed and we realise that if we have changing acceleration (third-order changes), we need to add another term, and that we can continue this ad-nausem.

As you move into more advanced engineering and physics classes, you are also taught how to extend the equations of motion into two or three dimensions. When we have movement in three-dimensional space, we can model points in spacetime using four-dimensional vectors ([x, y, z, t]). As we move into the multidimensional case, we can look at how each dimension changes with respect to each other dimension for the point. For example, changes in an x-dimension with time (t) may resemble our one-dimensional speed. However, in the multivariate case we now can determine changes in the y and z direction with time. Hence, our velocity becomes a vector indicating how the x, y and z dimensions of the point change over time. Because the directions of modelled Cartesian space are orthogonal, we can analyse them separately.

We will stop here for now but we can also go further. If we can guess at the mass of an object, we can predict an acceleration using F=ma. Hence, our brains can begin to build a suite of approximated functions that predict the rates of change from additional data or latent variables discerned from the data.

Neural Units

So let’s return to thinking about brains. Populations of neurons in our brains do not have god-like powers to view an objective reference frame that depicts spacetime. Instead, they are fixed in space, and travel through time. What they can be is connected. But even this must be limited by practical reality, such as space and energy.

In fact, we can think about “neural units”, which may be a single neuron or a population of neurons. What makes a neural unit “a unit” is that it operates on a discrete set of information. In an image processing case this may be a pixel, in a audio processing example this may be a sample in time, or a frequency measurement.

Now the operation of a neural unit begins to resemble the assumptions for the Taylor Series: it is the point around which our function is evaluated, and all we need is local information relating to our derivatives. We’ll ignore for now the fact that our functions may not be infinitely differentiable about our unit, as it turns out approximations often seem to work fine.

So bringing this all together, we see that a neural unit may be able to (approximately) predict future activity, either of itself in time, or its neighbours in space, by determining local rates of change of different orders.

For example, if we consider the image intensity of a single pixel, we can see that a neural unit may be able to predict the intensity of that pixel at a future time, if there are patterns in the rates of change. For example, if the pixel intensity is increasing at a constant rate, r, then the intensity at a future time, t, may be determined like our velocity above: I(t) = I_0 + r*t.

Linear Approximations

Another way of viewing the same thing is to think about linear approximations.

What are linear approximations? They are just functions where the terms are linear, i.e. are not a power. In a Taylor Series this means chopping off everything past a first order differential. Now, if a car is accelerating towards you, assuming a constant velocity is going to be a very costly mistake. But what is surprising is that a fair bit of engineering is built upon linear approximations. In fact, even now some engineers pull a funny face and start sweating when you move away from linear models. It turns out that a significant portion of the world we live in has first order patterns.

If you go up a little way to include second order patterns, you find that another large chunk of the world can be approximated there. This is most visible in the equations of motion. Why can we stop with second order acceleration terms? Because gravity is constant at 9.8m/s. Until recently, the main things that displayed changing accelerations were animals.

So can we just throw away higher order terms? Not exactly. One issue with a power series is that as we try to predict further from our point or neural unit, higher order differentials become more important. For example, looking forward beyond the first few immediate units and the higher terms will dominate the prediction. Often this is compounded by sensory inaccuracy, the rates of change will never be exact, and errors in measurement are multiplied by the large high-order terms.

So what have we learnt? If we are predicting locally, either in space or in time, in a world with patterns in space-time, we can make good approximations using a Taylor series. However, these predictions become less useful in a rapidly changing world where we need to predict over longer distances in space and time.

Prediction Using Cycles

Image by Dirk Rabe from Pixabay 

Many of the patterns in the natural world are cyclical. These include the patterns of day and night caused by the rotation of the Earth upon its (roughly) north-south axis, the lunar months and tides caused by the rotation of the moon around the Earth, and the seasons caused by the rotation of the Earth around the sun. These are “deep” patterns – they have existed for the whole of our evolutionary history, and so our modelled at least chemically at a low-level in our DNA.

There are then patterns that our based on these patterns. Patterns of sleep and rest, of meal times, of food harvest, of migration, or our need for shelter. Interestingly many of these patterns are interoceptive patterns, e.g. relating to an inner state of our bodies representing how we feel.

The physical world also has patterns. Oscillations in time generate sound. Biological repetition and feedback cycles generates cyclical patterns, such as the vertical lines of light-and-dark observed looking into a forest or the stripes on a zebra.

How do we make predictions using cycles?

Often we have reference patterns that we apply at different rates. In engineering, this forms the basis of Fourier analysis. The rate of repetition we refer to as “frequency”. We can then build up complex functions and signals by the addition of simple periodic or repeating functions with different magnitudes and phases. In mathematics the simple periodic functions are typically the sine or cosine functions.

Image by David Zydd from Pixabay

So if we can use a base set of reference patterns, how does working in the frequency domain help prediction?

In one dimension the answer is that we can make predictions of values before or after a particular point based on our knowledge of the reference functions, and estimates for magnitudes and phases. For a signal that extends in space or time, we only need to know a general reference pattern for a short patch of space or time, stretch or shift it and repeat it, rather than trying to predict each point separately and individually.

When our brain attempts to predict sounds, it can thus attempt to predict frequencies and phases as opposed to complex sound waveforms. In space, things are less intuitive but apply similarly. For example, repeating patterns of intensity in space, such as stretches of light and dark lines (the stripes on a zebra) may be approximated using a reference pattern of one light and one dark line, and then repeating the pattern at an estimate scale, strength and phase. Many textures can be efficiently represented in this way (think of the patterns on plants and animals).

Thinking about neural units, we can see how hierarchies of units may be useful to implement predictions of periodic sequences. We need a unit or population of units to replicate the reference pattern, and to somehow represent an amplitude and phase.

Statistical Prediction

A third way to make predictions is using statistics and probability.

Photo by Balázs Utasi on Pexels.com

Statistics is all about large numbers of measurements (“big data”, when that was trendy). If we have large numbers of measurement we can look for patterns in those numbers.

Roll a six-sided dice a few times and you will record what look like random outcomes. We might have three “4s”, and two “1s”. Roll a dice a few million times and you will see that each of the six numbers occur in more-or-less similar proportions: each number occurs 1/6th of the time. The probability of rolling each number may then be represented as “1/6”.

Rates of change are fairly useless here. This is because we are dealing with discrete outcomes that are often independent. These “discrete outcomes” are also typically complex high-order events (try explaining “roll a dice” to an alien). If you were to measure the change in rolled number (e.g. “4” on roll 1 minus “1” on roll 0 = 3), this wouldn’t be very useful. Similarly, there are no repeating patterns in time or space that make Fourier analysis immediately useful for prediction.

Thinking about a neural unit, we can see that probability may be another way to predict the future. If a neural unit received an intensity for a pixel associated with the centre of a dice, it could learn that the intensity could be 0 or 1 with a roughly 50% likelihood (e.g. numbers 1, 3 and 5 having a central dot, which is absent from numbers 2, 4 and 6). If it got an intensity of 0.5, something strange has happened.

Probability, at its heart, is simply a normalised weight for a likelihood of an outcome. We use a value between 0 and 1 (or 0 and 100%) so that we can compare different events, such as rolling a dice or determining if a cow is going to charge us. In a discrete case, we have a set of defined outcomes. In a continuous case, we have a defined range of outcome values.

How Do Rates of Change and Probabilities Fit Together?

Imagine a set of neural units relate to a pixel in an image. For example, we might look at a nearest pixel to a centre of a webcam image.

In this case, each neural unit may have one associated variable: an intensity or amplitude. Say we have an 8-bit image processing system, so the neural unit can receive a value between 0 and 255 representing a measured image characteristic. This could be a channel measurement, e.g. an intensity for lightness (say 0 is black and 255 is white) or for “Red” (say 0 is not red and 255 is the most red) or an opponent colour space (say 0 is green and 255 is red).

Now nature is lazy. And thinking is hard work. Our neural units want to minimise any effort or activity.

One way to minimise effort is to make local predictions of sensory inputs, and to only pass on a signal when those predictions fail, i.e. to output a prediction error.

A neural unit could predict its own intensity at a future time I(t_0 + t_{interval}) or the intensity of one of its neighbours, e.g. I(x_i + x_{i+1}, y_j + y_{j+1}) in space. If a neural unit receives an intensity in I_{sensory}, it can compute an overall intensity prediction based on time and space prediction I_{prediction} and then determine an error between them e = I_{prediction} - I_{sensory}.

One way to approximate a rate of change is to simply compare neighbouring units in space, or current and past values in time. To compute higher orders, we just repeat this comparison on previously computed rates of change.

If they are arranged in multiple layers, our neural units could begin to predict cyclical patterns. Over time repeated patterns of activity could be represented by the activity of a single set of neural units and a reference to the underlying units that show this activity, e.g. as scaled or shifted. This would be lazier – we could just copy or communicate the activity of the single unit to the lower neural units.

Probability may come into play when looking at a default level of activity for a given context. For example, consider an “at rest” case. In many animals the top of the visual field is generally lighter than the bottom of the visual field. Why is this? Because the sky is above and the ground is below. Of course, this won’t always be the case, but it will be a general average over time. Hence, if you have no other information, a neural unit in the upper visual field would do wise to err on a base level of intensity that is higher (e.g. lighter) that a neural unit in the lower visual field. This also allows laziness in the brain. A non-light intensity signal received by the neural unit in the upper visual field is more informative than a light intensity signal as it is more unlikely. Hence, if there is a finite amount of energy, the neural unit in the upper visual field wants to use more energy to provide a signal in the case of a received non-light intensity signal than in the case of a received light intensity signal. Some of you would spot that we are now moving into the realms of (Shannon) entropy.

In the brain then, it is likely that all these approaches for prediction are applied simultaneously. Indeed, it is probable that the separate functions are condensed into common non-linear predictive functions. It is also likely that modern multi-layer neural networks are able to learn these functions from available data (or at least rough approximations based on the nature of the training data and the high-level error representation).

Playing Around with Retinal-Cortex Mappings

Here is a little notebook where I play around with converting images from a polar representation to a Cartesian representation. This is similar to the way our bodies map information from the retina onto the early visual areas.

Mapping from the visual field (A) to the thalamus (B) to the cortex (C)

These ideas are based on information we have about how the visual field is mapped to the cortex. As can be seen in the above figures, we view the world in a polar sense and this is mapped to a two-dimensional grid of values in the lower cortex.

You can play around with mappings between polar and Cartesian space at this website.

To develop some methods in Python I’ve leaned heavily on this great blogpost by Amnon Owed. This gives us some methods in Processing I have adapted for my purposes.

Amnon suggests using a look-up table to speed up the mapping. In this way we build a look-up table that maps co-ordinates in polar space to an equivalent co-ordinate in Cartesian space. We then use this look-up table to look-up the mapping and use the mapping to transform the image data.

import math
import numpy as np
import matplotlib.pyplot as plt

def calculateLUT(radius):
    """Precalculate a lookup table with the image maths."""
    LUT = np.zeros((radius, 360, 2), dtype=np.int16)
    # Iterate around angles of field of view
    for angle in range(0, 360):
        # Iterate over diameter
        for r in range(0, radius):
            theta = math.radians(angle)
            # Take angles from the vertical
            col = math.floor(r*math.sin(theta))
            row = math.floor(r*math.cos(theta))
            # rows and cols will be +ve and -ve representing
            # at offset from an origin
            LUT[r, angle] = [row, col]
    return LUT

def convert_image(img, LUT):
    Convert image from cartesian to polar co-ordinates.

    img is a numpy 2D array having shape (height, width)
    LUT is a numpy array having shape (diameter, 180, 2)
    storing [x, y] co-ords corresponding to [r, angle]
    # Use centre of image as origin
    centre_row = img.shape[0] // 2
    centre_col = img.shape[1] // 2
    # Determine the largest radius
    if centre_row > centre_col:
        radius = centre_col
        radius = centre_row
    output_image = np.zeros(shape=(radius, 360))
    # Iterate around angles of field of view
    for angle in range(0, 360):
        # Iterate over radius
        for r in range(0, radius):
            # Get mapped x, y
            (row, col) = tuple(LUT[r, angle])
            # Translate origin to centre
            m_row = centre_row - row
            m_col = col+centre_col
            output_image[r, angle] = img[m_row, m_col]
    return output_image

def calculatebackLUT(max_radius):
    """Precalculate a lookup table for mapping from x,y to polar."""
    LUT = np.zeros((max_radius*2, max_radius*2, 2), dtype=np.int16)
    # Iterate around x and y
    for row in range(0, max_radius*2):
        for col in range(0, max_radius*2):
            # Translate to centre
            m_row = max_radius - row
            m_col = col - max_radius
            # Calculate angle w.r.t. y axis
            angle = math.atan2(m_col, m_row)
            # Convert to degrees
            degrees = math.degrees(angle)
            # Calculate radius
            radius = math.sqrt(m_row*m_row+m_col*m_col)
            # print(angle, radius)
            LUT[row, col] = [int(radius), int(degrees)]
    return LUT

def build_mask(img, backLUT, ticks=20):
    """Build a mask showing polar co-ord system."""
    overlay = np.zeros(shape=img.shape, dtype=np.bool)
    # We need to set origin backLUT has origin at radius, radius
    row_adjust = backLUT.shape[0]//2 - img.shape[0]//2
    col_adjust = backLUT.shape[1]//2 - img.shape[1]//2
    for row in range(0, img.shape[0]):
        for col in range(0, img.shape[1]):
            m_row = row + row_adjust
            m_col = col + col_adjust
            (r, theta) = backLUT[m_row, m_col]
            if (r % ticks) == 0 or (theta % ticks) == 0:
                overlay[row, col] = 1
    masked = np.ma.masked_where(overlay == 0, overlay)
    return masked

First build the backwards and forwards look-up tables. We’ll set a max radius of 300 pixels, allowing us to map images of 600 by 600.

backLUT = calculatebackLUT(300)
forwardLUT = calculateLUT(300)

Now we’ll try this out with some test images from skimage. We’ll normalise these to a range of 0 to 255.

from skimage.data import chelsea, astronaut, coffee

img = chelsea()[...,0] / 255.

masked = build_mask(img, backLUT, ticks=50)
out_image = convert_image(img, forwardLUT)
fig, ax = plt.subplots(2, 1, figsize=(6,8))
ax[0].imshow(img, cmap=plt.cm.gray, interpolation='bicubic')

ax[0].imshow(masked, cmap=plt.cm.hsv, alpha=0.5)

ax[1].imshow(out_image, cmap=plt.cm.gray, interpolation='bicubic')

img = astronaut()[...,0] / 255.

masked = build_mask(img, backLUT, ticks=50)
out_image = convert_image(img, forwardLUT)
fig, ax = plt.subplots(2, 1, figsize=(6,8))
ax[0].imshow(img, cmap=plt.cm.gray, interpolation='bicubic')

ax[0].imshow(masked, cmap=plt.cm.hsv, alpha=0.5)

ax[1].imshow(out_image, cmap=plt.cm.gray, interpolation='bicubic')

img = coffee()[...,0] / 255.

masked = build_mask(img, backLUT, ticks=50)
out_image = convert_image(img, forwardLUT)
fig, ax = plt.subplots(2, 1, figsize=(6,8))
ax[0].imshow(img, cmap=plt.cm.gray, interpolation='bicubic')

ax[0].imshow(masked, cmap=plt.cm.hsv, alpha=0.5)

ax[1].imshow(out_image, cmap=plt.cm.gray, interpolation='bicubic')

In the methods, the positive y axis is the reference for the angle, which is extends clockwise.

Now, within the brain the visual field is actually divided in two. As such, each hemisphere gets half of the bottom image (0-180 to the right hemisphere and 180-360 to the left hemisphere).

Also within the brain, the map on the cortex is rotated clockwise by 90 degrees, such that angle from the horizontal eye line is on the x-axis. The brain receives information from the fovea at a high resolution and information from the periphery at a lower resolution.

The short Jupyter Notebook can be found here.

Extra: proof this occurs in the human brain!

Folding Autoencoders

It’s nice when different ways of seeing things come together. This often occurs when comparing models of computation in the brain and machine learning architectures.

Many of you will be familiar with the standard autoencoder architecture. This takes an input \mathbf{X}, which may be an image (in a 2D or flattened 1D form) or another input array. It applies one or more neural network layers that progressively reduce the dimensionality of the output. These one or more neural network layers are sometimes called an “encoder”. This forms a “bottleneck” to “compress” the input (in a lossy manner) to form a latent representation. This latent representation is sometimes referred to as a hidden layer. It may be seen as an encoded or compressed form of the input.

A standard autoencoder may have a 1D latent representation with a dimensionality that is much less than the input dimensionality. A variational autoencoder may seek to learn values for a set of parameters that represent a latent distribution, such as a probability distribution for the input.

The autoencoder also has another set of one or more neural network layers that receive the latent representation as an input. These one or more neural network layers are sometimes called a “decoder”. The layers generate an output \mathbf{X'}, which is a reconstruction of the input \mathbf{X}.

The whole shebang is illustrated in the figure below.

Standard Autoencoder
Standard Autoencoder

The parameters of the neural network layers, the “encoder” and the “decoder”, are learnt during training. For a set of inputs, the output of the autoencoder \mathbf{X'} is compared with the input \mathbf{X}, and an error is back-propagated through the neural network layers. Using gradient descent to minimise the errors, an optimal set of parameters may be learnt.

As autoencoders do not require a labelled “ground-truth” output to compare with the autoencoder output during training, they provide a form of unsupervised learning. Most are static, i.e. they operate on a fixed instance of the input to generate a fixed instance of the output, and there are no temporal dynamics.

Now, I’m quite partial to unsupervised learning methods. They are hard, but they also are much better at reflecting “natural” intelligence; most animals (outside of school) do not learn based on pairs of inputs and desired outputs. Models of the visual processing pathway in primates, which have been developed over the last 50 years or so, all indicate that some form of unsupervised learning is used.

In the brain, there are several pathways through the cortex that appear similar to the “encoder” neural network layers. With a liberal interpretation, we can see the latent representation of the autoencoder as an association representation formed in the “higher” levels of the cortex. (In reality, latent representations in the brain are not found at a “single” flat level but are distributed over different levels of abstraction.)

If the brain implements some form of unsupervised learning, and seems to “encode” incoming sensory information, this leads to the question: where is the “decoder”?

This is where predictive coding models can help. In predictive coding models, predictions are fed back through the cortex to attempt to reconstruct input sensory information. This seems similar to the “decoder” layers of the autoencoder. However, in this case, the “encoder” and the “decoder” appear related. In fact, one way we can model this is to see the “encoder” and the “decoder” as parts of a common reversible architecture. This looks at things in a similar way to recent work on reversible neural networks, such as the “Glow” model proposed by OpenAI. The “encoder” represents a forward pass through a set of neural network layers, while the “decoder” represents a backward pass through the same set of layers. The “decoder” function thus represents the inverse or “reverse” of the “encoder” function.

This can be illustrated as follows:

Folding the Autoencoder
Folding the Autoencoder

In this variation of the autoencoder, we effectively fold the model in half, and stick together the “encoder” and “decoder” neural network layers to form a single “reversible neural network”.

In fact, the brain is even cooler than this. If we extend across modalities to consider sensory input and motor output, the brain appears to replicate the folded autoencoder shown above, resulting in something that resembles again our original standard autoencoder:

The brain as a form of autoencoder.
The brain as a form of autoencoder.

Here, we have a lower input “encoder” that represents the encoding of sensory input \mathbf{X} into a latent “associative” representation. A forward pass provides the encoding, while a backward pass seeks to predict the sensory input \mathbf{X} by generating a reconstruction \mathbf{X'}. An error between the sensory input \mathbf{X} and the reconstruction \mathbf{X'} is used to “train” the “encoder”.

We also have an upper output “decoder” that begins with the latent “associative” representation and generates a motor output \mathbf{Y}. A forward pass decodes the latent “associative” representation to generate muscle commands.

The “backward” pass of the upper layer is more uncertain. I need to research the muscle movement side – there is quite a lot based on the pathology of Parkinson’s disease and the action of the basal ganglia. (See the link for some great video lectures from Khan Academy.)

In one model, somasensory input may be received as input \mathbf{Y'}, which represents the muscle movements as actuated based on the muscle commands. The backward pass thus seeks to reconstruct the latent “associative” representation from the input \mathbf{Y'}. The “error” may be computed at one or more of the input level (e.g. looking at the difference between \mathbf{Y} and \mathbf{Y'}) and the latent “associative” representation level (e.g. between the “sensory” input encoding and the “motor” input encoding). In any case there seem to be enough error signals to drive optimisation and learning.

Of course, this model leaves out one vital component: time. Our sensory input and muscle output is constantly changing. This means that the model should be indexing all the variables by time (e.g. \mathbf{X(t)}, \mathbf{X'(t)}, \mathbf{Y(t)}, \mathbf{Y'(t)}, and the latent “associative” representations). Also the forward and backward passes will occur at different times (the forward pass being available earlier than the backward pass). This means that our errors are also errors in time, e.g. between \mathbf{X(t=1)} and \mathbf{X'(t=2)} for a common discrete time basis.

Many of the machine learning papers I read nowadays feature a bewildering array of architectures and names (ResNet, UNet, Glow, BERT, Transformer XL etc. etc.). However, most appear to be using similar underlying principles of architecture design. The way I deal with the confusion and noise is to try to look for these underlying principles and to link these to neurological and cognitive principles (if only to reduce the number of things I need to remember and make things simple enough for my addled brain!). Autoencoder origami is one way of making sense of things.