Getting All the Books

This is a short post explaining how to obtain over 50,000 plain-text books for your natural language processing projects.


The source of these books is the excellent Project Gutenberg.

Project Gutenberg offers the ability to sync its collection of books using rsync. To obtain the collection you can set up a private mirror as explained here. However, I’ve found that a couple of tweaks to the rsync setup can be useful.

First, you can use the --list-only option in rsync to obtain a list of the files that will be synced. Based on this random GitHub issue comment, I initially used the command below to generate a list of the files on the UK mirror server (based at the University of Kent):
rsync -av --list-only rsync.mirrorservice.org::gutenberg.org | awk '{print $5}' > log_gutenberg
(The piping via awk simply takes the 5th column of the list output.)

This file list is around 80MB. We can use this list to add some filters to the rsync command.
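Before writing any filters it helps to see which file types dominate the list. Here is a minimal Python sketch that tallies file extensions. The function name and the sample paths are made up for illustration; in practice you would pass it the lines of the log_gutenberg file generated above.

```python
from collections import Counter
import os


def count_extensions(paths):
    """Count file extensions (lower-cased) in an iterable of path strings."""
    counts = Counter()
    for path in paths:
        path = path.strip()
        if not path:
            continue  # skip blank lines
        ext = os.path.splitext(path)[1].lower()
        # Entries without an extension (e.g. directories) are grouped together.
        counts[ext or "(none)"] += 1
    return counts


# Hypothetical sample lines; for the real list use:
#     with open("log_gutenberg") as f: counts = count_extensions(f)
sample = [
    "1/0/0/0/10000",
    "1/0/0/0/10000/10000.txt",
    "1/0/0/0/10000/10000.zip",
    "1/0/0/0/10000/10000-8.zip",
]
counts = count_extensions(sample)
print(counts.most_common())
```

Note that the awk step keeps only the 5th whitespace-separated column, so any file name containing spaces will appear truncated in the list.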

On the server, books are stored as .txt files. Helpfully, each text file also has a compressed .zip counterpart. Syncing only the .zip files reduces the amount of data that is downloaded. We can either access the .zip files programmatically, or run a script to uncompress them (the former is preferred to save disk space).
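As a rough sketch of the programmatic route, Python's standard zipfile module can read a book straight out of its archive without writing the uncompressed text to disk. The function name is my own, and I am assuming each archive holds a single .txt member and that decoding as UTF-8 (a superset of ASCII) with replacement of bad bytes is acceptable. The demo builds an in-memory archive as a stand-in for a synced file.

```python
import io
import zipfile


def read_book_text(zip_source, encoding="utf-8"):
    """Return the text of the first .txt member of a zip archive.

    zip_source may be a file path or a file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        txt_names = [n for n in archive.namelist() if n.lower().endswith(".txt")]
        if not txt_names:
            raise ValueError("no .txt member found in archive")
        with archive.open(txt_names[0]) as member:
            # Undecodable bytes are replaced rather than raising an error.
            return member.read().decode(encoding, errors="replace")


# Demo: an in-memory archive standing in for a file from the mirror.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("10000.txt", "Sample book text.\n")
buffer.seek(0)

text = read_book_text(buffer)
print(text)
```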

Some books have accompanying HTML files and/or alternate encodings. We only need ASCII encodings for now. We can thus ignore any file with a dash (-) in its name (HTML files are *-h* and are zipped; encodings are *-[number].* files).

A book also sometimes has an old folder containing old versions and other rubbish. We can ignore this (as per here). We can use the -m flag to prune empty directories (see here for more details on rsync options).

There are also some stray .zip files that contain audio readings of books. We want to avoid these as they can be hundreds of MB each. We can thus add an upper size limit of about 10MB (most book files are hundreds of KB).

We can use the --include and --exclude flags in a particular order to filter the files – we first include all subdirectories then exclude files we don’t want before finally only including what we do want.
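The order matters because rsync acts on the first rule that matches each file or directory. The Python sketch below imitates that first-match-wins behaviour for an include/exclude list in this order. It is a simplified approximation, not rsync's real matcher: it uses fnmatch, whose * also matches slashes, and it ignores the size limit. The decide function and the example paths are hypothetical.

```python
from fnmatch import fnmatchcase

# Filter rules as (action, pattern) pairs, in the order they are evaluated.
RULES = [
    ("include", "*/"),       # every directory, so rsync can descend into it
    ("exclude", "*-*.zip"),  # HTML and alternate-encoding archives
    ("exclude", "*/old/*"),  # anything directly inside an "old" folder
    ("include", "*.zip"),    # the plain book archives we want
    ("exclude", "*"),        # everything else
]


def decide(path, is_dir=False, rules=RULES):
    """Return the action of the first rule that matches path (simplified)."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    for action, pattern in rules:
        if pattern.endswith("/"):
            # A trailing slash means the rule applies to directories only.
            if is_dir and fnmatchcase(name, pattern[:-1]):
                return action
        elif "/" in pattern:
            # A pattern containing a slash is compared against the whole path.
            if fnmatchcase(path, pattern):
                return action
        elif fnmatchcase(name, pattern):
            # A plain pattern is compared against the final path component.
            return action
    return "include"  # nothing matched: rsync would transfer the file


for example in [
    "1/0/0/0/10000/10000.zip",
    "1/0/0/0/10000/10000-8.zip",
    "1/0/0/0/10000/old/10000.zip",
    "1/0/0/0/10000/10000.txt",
]:
    print(example, "->", decide(example))
```

Moving the final catch-all exclude to the front of the list would exclude everything, which is why the broad rules come last.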

Bringing this all together gives us the following rsync command-line (i.e. shell) command:

rsync -avm \
--max-size=10m \
--include="*/" \
--exclude="*-*.zip" \
--exclude="*/old/*" \
--include="*.zip" \
--exclude="*" \
rsync.mirrorservice.org::gutenberg.org ~/data/gutenberg

This syncs the data/gutenberg folder in our home directory with the Kent mirror server. All in all, the synced collection comes to about 8GB.

The next steps are then to generate a quick Python wrapper that navigates the directory structure and unzips the files on the fly. We also need to filter out non-English texts and remove the standard Project Gutenberg text headers.
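For the header removal, one possible approach is to keep only the text between the "*** START OF ..." and "*** END OF ..." marker lines. This is a sketch under the assumption that a file uses markers of that general shape; the exact wording varies between files and some older texts use different boilerplate altogether, so the patterns are deliberately loose and the function falls back to the full text when a marker is missing. The sample text is invented.

```python
import re

# Loose patterns: the marker wording differs from file to file.
START_RE = re.compile(r"^\*\*\*\s*START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\*\s*END OF.*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, if present."""
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


# Invented sample imitating the layout of a book file.
sample = (
    "The Project Gutenberg EBook of Example\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "\n"
    "Chapter 1\n"
    "It was a dark night.\n"
    "\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```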

There is a useful GUTINDEX.ALL text file which contains a list of each book and its book number. This can be used to determine the correct path (e.g. book 10000 has a path of 1/0/0/0/10000). The index text file also indicates non-English books, which we could use to filter the books. One option is to create a small SQL database which stores title and path information for English books. It would also be useful to filter fiction from non-fiction, but this may need some clever in-text classification.
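The path rule and the database idea can be sketched together in a few lines of Python using the standard sqlite3 module. The table layout, the function name and the titles are placeholders of my own, and I have not tried to parse GUTINDEX.ALL here. The single-digit case (a "0" directory) is an assumption that should be checked against the mirror.

```python
import sqlite3


def book_path(number):
    """Build the directory for a book number, e.g. 10000 -> 1/0/0/0/10000."""
    digits = str(number)
    if len(digits) == 1:
        # Assumption: single-digit books live under a "0" directory.
        return f"0/{digits}"
    # Every digit except the last becomes a directory level.
    return "/".join(digits[:-1]) + f"/{digits}"


# In-memory database for the demo; pass a file name to persist it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (number INTEGER PRIMARY KEY, title TEXT, path TEXT)"
)

# Placeholder records; real titles would come from the index file.
records = [(10000, "Placeholder Title A"), (12345, "Placeholder Title B")]
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [(number, title, book_path(number)) for number, title in records],
)

row = conn.execute(
    "SELECT path FROM books WHERE number = ?", (12345,)
).fetchone()
print(row[0])
```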

So there we are: we have a large folder full of books written before 1920ish, including some of the greatest books ever written (e.g. The Brothers Karamazov and Anna Karenina).
