Making Proper Histograms with Numpy and Matplotlib

Often I find myself needing to visualise an array, such as a bunch of pixel or audio channel values. A nice way to do this is via a histogram.

When building histograms you have two options: numpy’s histogram or matplotlib’s hist. As you may expect, numpy is faster when you just need the data rather than the visualisation; matplotlib makes it easier to get a nice bar chart.

So that I remember, here is a quick post with an example.

# First import numpy and matplotlib
import numpy as np
import matplotlib.pyplot as plt

I started with a data volume of size 256 x 256 x 8 x 300, corresponding to 300 frames of video at a resolution of 256 by 256 with 8 different image processing operations. The data values were 8-bit, i.e. 0 to 255. I wanted to visualise the distribution of pixel values within this data volume.
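
If you want to follow along without video data to hand, a random volume of the same shape and range works as a stand-in (data_vol is simply the variable name the snippets below assume):

data_vol = np.random.randint(0, 256, size=(256, 256, 8, 300), dtype=np.uint8)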

Using numpy, you can easily pass in the whole data volume and it will flatten the arrays within the function. Hence, to get a set of histogram counts with one integer bin per 8-bit value (bin edges 0 to 256), you can run:

values, bins = np.histogram(data_vol, bins=np.arange(257))

You can then use matplotlib’s bar chart to plot this (np.histogram returns one more bin edge than counts, hence bins[:-1]):

plt.bar(bins[:-1], values, width = 1)

Using matplotlib’s hist function, we need to flatten the data first:

results = plt.hist(data_vol.ravel(), bins=np.arange(257))
plt.show()
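
Here, results is a (counts, bin_edges, patches) tuple, so the numbers are available without a separate numpy call.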

The result of both approaches is the same. If we are being more professional, we can also use more of matplotlib’s functionality:

fig, ax = plt.subplots()
results = ax.hist(data_vol.ravel(), bins=np.arange(257))
ax.set_title('Pixel Value Histogram')
ax.set_xlabel('Pixel Value')
ax.set_ylabel('Count')
plt.show()

Things get a little more tricky when we start changing our bin sizes. A good run-through is found here. In this case, the slower matplotlib function becomes easier:

fig, ax = plt.subplots()
results = ax.hist(
    data_vol.ravel(), 
    bins=np.linspace(0, 255, 17)  # 17 edges give 16 bins
)
ax.set_title('Pixel Value Histogram (4-bit)')
ax.set_xlabel('Pixel Value')
ax.set_ylabel('Count')
plt.show()

Using 16 bins gives us 4-bit quantisation. You can see here that we could represent a large chunk of the data with just three values (if we subtract 128: <0 = -1, 0 = 0 and >0 = 1).
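
For completeness, here is a minimal sketch of the same plot via the faster numpy route; with coarser or non-uniform bins the bar widths have to come from the bin edges themselves, which is why the matplotlib call above is less fiddly:

values, bins = np.histogram(data_vol, bins=np.linspace(0, 255, 17))
plt.bar(bins[:-1], values, width=np.diff(bins), align='edge')
plt.show()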

Getting All the Books

This is a short post explaining how to obtain over 50,000 books as plain text for your natural language processing projects.

[Photo: books on bookshelves, by Mikes Photos on Pexels.com]

The source of these books is the excellent Project Gutenberg.

Project Gutenberg offers the ability to sync its collection of books. To obtain the collection you can set up a private mirror as explained here. However, I’ve found that a couple of tweaks to the rsync setup can be useful.

First, you can use the --list-only option in rsync to obtain a list of the files that will be synced. Based on this random Github issue comment, I initially used the command below to generate a list of the files on the UK mirror server (based at the University of Kent):
rsync -av --list-only rsync.mirrorservice.org::gutenberg.org | awk '{print $5}' > log_gutenberg
(The piping via awk simply takes the 5th column of the list output.)

This file list is around 80MB. We can use this list to add some filters to the rsync command.

On the server, books are stored as .txt files. Helpfully, each text file also has a compressed .zip version. Only syncing the .zip files helps to reduce the amount of data that is downloaded. We can either access the .zip files programmatically, or run a script to uncompress them (the former is preferred to save disk space).

Some books have accompanying HTML files and/or alternate encodings. We only need ASCII encodings for now. We can thus ignore any file with a dash (-) in it (HTML files are *-h* and are zipped; encodings are *-[number].* files).

A book also sometimes has an old folder containing old versions and other rubbish. We can ignore this (as per here). We can use the -m flag to prune empty directories (see here for more details on rsync options).

Also, there are some stray .zip files that contain audio readings of books. We want to avoid these as they can be hundreds of MB each. We can thus add an upper size limit of about 10MB (most book files are hundreds of KB).

We can use the --include and --exclude flags in a particular order to filter the files – first include all subdirectories, then exclude the files we don’t want, then include what we do want, and finally exclude everything else.

Bringing this all together gives us the following rsync command-line (i.e. shell) command:

rsync -avm \
--max-size=10m \
--include="*/" \
--exclude="*-*.zip" \
--exclude="*/old/*" \
--include="*.zip" \
--exclude="*" \
rsync.mirrorservice.org::gutenberg.org ~/data/gutenberg

This syncs the data/gutenberg folder in our home directory with the Kent mirror server. All in all, this comes to about 8GB.
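
(If you want to sanity-check the filters first, adding -n to the rsync command performs a dry run, printing what would be transferred without downloading anything.)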

The next steps are then to generate a quick Python wrapper that navigates the directory structure and unzips the files on the fly. We also need to filter out non-English texts and remove the standard Project Gutenberg text headers.
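
A minimal sketch of such a wrapper, assuming the rsync destination above and that each zip holds a single .txt file (the usual layout):

import os
import zipfile

def iter_books(root='~/data/gutenberg'):
    """Yield the text of each zipped book under root, one at a time."""
    root = os.path.expanduser(root)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith('.zip'):
                continue
            with zipfile.ZipFile(os.path.join(dirpath, name)) as zf:
                for member in zf.namelist():
                    if member.endswith('.txt'):
                        # Most early texts are plain ASCII; ignore odd bytes
                        yield zf.read(member).decode('utf-8', errors='ignore')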

There is a useful GUTINDEX.ALL text file which contains a list of each book and its book number. This can be used to determine the correct path (e.g. book 10000 has a path of 1/0/0/0/10000). The index text file also indicates non-English books, which we could use to filter the books. One option is to create a small SQL database which stores title and path information for English books. It would also be useful to filter fiction from non-fiction, but this may need some clever in-text classification.
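
The path logic described above is easy to capture in a small helper. A sketch for multi-digit book numbers (single-digit numbers are laid out differently on the server, so are left out here):

def book_path(number):
    # e.g. 10000 -> '1/0/0/0/10000'
    digits = str(number)
    return '/'.join(list(digits[:-1]) + [digits])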

So there we are, we have a large folder full of books written before 1920ish, including some of the greatest books ever written (e.g. Brothers Karamazov and Anna Karenina).

Accessing Our Sensor Data: PHP + SQLite + SVG

So we have our sensor(s) dutifully logging their data to an SQLite database. Now we need to process and view this data.

LASP

First we need to set our Raspberry Pi up as a webserver. This typically requires Linux, Apache, MySQL and PHP – a LAMP stack. There is a handy guide here (thanks Dave!).

In my case I wanted a lightweight alternative to MySQL, so I decided to use SQLite – making this a LASP stack rather than LAMP. You can also swap Apache for something like Lighttpd if you feel like it (I may in the future but didn’t this time).

On Raspbian (and in summary of the guide in Dave’s link) I installed the following packages:

sudo apt-get update
sudo apt-get install apache2 php5 libapache2-mod-php5
sudo apt-get install php5-sqlite
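
Browsing to the Pi’s IP address should then show Apache’s default ‘It works!’ page, confirming the webserver is up.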

To avoid permission complications I set up my web directory in my home directory. This meant I could easily access and upload files from the iPad. To do this, edit /etc/apache2/sites-available/default:

sudo nano /etc/apache2/sites-available/default

In the section:

DocumentRoot /var/www
    <Directory />
        Options FollowSymLinks
        AllowOverride None
    </Directory>
    <Directory /var/www/>
        Options Indexes FollowSymLinks MultiViews
        AllowOverride None
        Order allow,deny
        allow from all
    </Directory>
 

Change /var/www to /home/[yourusername]/[newfolder], e.g.:

DocumentRoot /home/jimbojones/www
    <Directory />
        Options FollowSymLinks
        AllowOverride None
    </Directory>
    <Directory /home/jimbojones/www/>
        Options Indexes FollowSymLinks MultiViews
        AllowOverride None
        Order allow,deny
        allow from all
    </Directory>
 

Hurray! Now you should be ready to write some PHP.

Accessing the Database

In PHP versions 5.3+, SQLite is enabled by default. To check, run:

<?php phpinfo(); ?>

and search the page for ‘sqlite’.
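
Alternatively, if the PHP command-line binary is installed, you can check for the module without a browser:

php -m | grep -i sqlite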

I found that getting this working took a bit of fiddling. There were several different versions of the access code floating about and not all of them worked. If you google, only look at recent entries.

However, eventually this bit of PHP worked for me:

$db = new SQLite3('EnergyDB/energymonitor.db');

$results = $db->query('SELECT * FROM readings');
while ($row = $results->fetchArray())
{
    var_dump($row);  // dump each row to confirm access works
}

Fun with SQL

Now that we have database access working, we can experiment with some SQL queries. While testing, it is easier to do this via the SQLite command line. This way we can obtain a set of working queries without the added burden of PHP debugging. These working queries can then be added to our PHP code for testing.
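
For example, assuming the database path used later in the PHP:

sqlite3 EnergyDB/energymonitor.db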

My first attempt was:

SELECT * FROM readings WHERE r_datetime > (SELECT DATETIME('now', '-1 day'))

This worked!

You can change ‘day’ to ‘hour’ to get the last hour’s report:

SELECT * FROM readings WHERE r_datetime > (SELECT DATETIME('now', '-1 hour'))

Of course, when copying into the PHP, remember to change the inner single quotes to double quotes (d’oh!).

Graph It Up

No-one likes dry numbers. Everyone loves a pretty graph. Now we will create our own using PHP and SVG (scalable vector graphics).

HTML5 has support for SVG out of the box. This makes life easy for us. I found this summary from IBM useful. W3Schools is also useful. To knock up a quick graph, I followed these steps:

  • Dump the data from the SQL query into data arrays;
  • Set the dimensions of our graph;
  • Set scale factors for our y-axes (basically: chart_height / max(array_value));
  • Draw our x, y axes;
  • For each data value:
    • Draw an x ticker, and
    • Determine a y value; and
  • Plot polylines for each set of data values.

And hey presto:

[Graph: last 12 hours]

[Graph: last 24 hours]

The PHP code is as follows:

<?php
	// Open the SQLite database created by the logging script
	$db = new SQLite3('EnergyDB/energymonitor.db');
	if (!$db) die('Could not open database.');

	// Fetch the last 24 hours of readings (note the inner double quotes)
	$statement = $db->prepare('SELECT * FROM readings WHERE r_datetime > (SELECT DATETIME("now", "-24 hour"))');
	$results = $statement->execute();
	if (!$results) die('Cannot execute statement.');

	// Collect the power and temperature readings into arrays
	$watts = array();
	$temp = array();
	while ($row = $results->fetchArray())
	{
		$watts[] = $row['r_watts'];
		$temp[] = $row['r_temp'];
	}
	$maxwatts = max($watts);
	$maxtemp = max($temp);

	// Set parameters for the width and height of the chart
	$width = 800;
	$height = 600;
	$tickerlength = 5;
	$scalefactorwatts = $height / $maxwatts;
	$scalefactortemp = $height / $maxtemp;
	$totalheight = $height + (2 * $tickerlength);
	$xticker = floor($width / count($watts));

	// Draw the x and y axes
	echo '<svg xmlns="http://www.w3.org/2000/svg" version="1.1" width="'.$width.'" height="'.$totalheight.'" >
		<line x1="0" y1="'.$height.'" x2="'.$width.'" y2="'.$height.'" style="stroke:black;stroke-width:2"/>
		<line x1="0" y1="0" x2="0" y2="'.$height.'" style="stroke:black;stroke-width:2"/>';

	// Build a polyline for each series, drawing an x ticker per reading
	$polystringwatts = '';
	$polystringtemp = '';
	for ($j = 0; $j < count($watts); $j++) {
		$x = $j * $xticker;
		$tickerheight = $height + $tickerlength;
		echo '<line x1="'.$x.'" y1="'.$height.'" x2="'.$x.'" y2="'.$tickerheight.'" style="stroke:black;stroke-width:1"/>';

		// SVG y runs top-down, so subtract the scaled value from the height
		$ywatts = $height - $watts[$j] * $scalefactorwatts;
		$polystringwatts .= ' '.$x.','.$ywatts;

		$ytemp = $height - $temp[$j] * $scalefactortemp;
		$polystringtemp .= ' '.$x.','.$ytemp;
	}
	echo '<polyline points="'.$polystringwatts.'" style="fill:none;stroke:red;stroke-width:4"/>';
	echo '<polyline points="'.$polystringtemp.'" style="fill:none;stroke:green;stroke-width:4"/>';
	echo '</svg>';
?>

OCR Meter Readings using Raspberry Pi?

I have a wireless energy meter and thermostat at home. I could try to hack them, taking them apart and listening to certain key voltages. However, the circuits are likely small and breakable. And I would like to use the units again and not pay for replacements.

So I was wondering whether I could cheat and input data values using OCR from a webcam or camera. The Raspberry Pi would be well placed to do this. My thoughts so far for the process are set out below. I can probably tackle each step independently.

  1. Place meters;
  2. Acquire image;
  3. OCR on image;
  4. Output of OCR to DB or file.

1. Place meters

  • Needs to be a set distance from acquisition device;
  • Mark out so can replicate even if need to take meters in and out;
  • Illumination for night time:
    • Low power (LED?)
    • Filter image when LED is lit?

2. Acquire image

  • Frame grab from webcam;
    • Need to get webcam working;
    • Need to learn command to acquire image;
  • Segment image for different data:
    • Set x,y area in image if meters are placed consistently;
      • Is there a command line tool for this?
    • Test with crop in iPhone/iPad;
    • Output image files for different areas – use these as input for OCR (see the sketch below).
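
A rough sketch of the capture-and-segment steps above, assuming the common fswebcam tool for the frame grab and Pillow for the cropping (both are assumptions, and the crop box coordinates are placeholders):

import subprocess
from PIL import Image

# Grab a frame from the webcam (fswebcam must be installed)
subprocess.run(['fswebcam', '-r', '640x480', '--no-banner', 'frame.jpg'], check=True)

# Crop a fixed (left, upper, right, lower) region where one meter
# display sits; the coordinates here are placeholders
meter = Image.open('frame.jpg').crop((100, 200, 300, 260))
meter.save('meter_readout.png')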

3. OCR on image

  • No obvious OCR tool on Raspberry Pi – keep looking;
  • Web services? Does Google/Tesseract have a web service? Use URL?
  • Do a common sense check on output:
    • Values will be integer (input parameter for OCR)
    • Values will have decimal point;
  • Create own OCR tool? (See the sketch below.)
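
For reference, one route to an ‘own OCR tool’ would be driving the open-source Tesseract engine from Python via the pytesseract wrapper. A minimal sketch, assuming both are installed (neither ships with Raspbian by default):

from PIL import Image
import pytesseract

# Whitelist digits and the decimal point, in line with the
# common sense checks above
text = pytesseract.image_to_string(
    Image.open('meter_readout.png'),
    config='-c tessedit_char_whitelist=0123456789.'
)
print(text.strip())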

4. Output of OCR to DB or file

  • MySQL DB?
  • Key field = time stamp (inc. seconds);
  • Other fields for each item of OCR data;
  • Or flat file, e.g. CSV, with {timestamp, data} tuples (sketched below).
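
The flat-file option is simple enough to sketch now: a few lines of Python appending {timestamp, data} rows to a CSV (the file name and value are illustrative):

import csv
from datetime import datetime

def log_reading(value, path='meter_readings.csv'):
    # Append a (timestamp, value) row to the CSV log
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerow([datetime.now().isoformat(), value])

log_reading('1234.5')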