This is a quick post intended to help those trying to understand convolution as applied in TensorFlow.
There are many good blog posts on the Internet explaining convolution as applied in convolutional neural networks (CNNs), e.g. see this one by Denny Britz. However, understanding the theory is one thing; knowing how to implement it is another. This is especially the case when trying to apply CNNs to word or character-level natural language processing (NLP) tasks – here the image metaphors break down a little.
I generally use TensorFlow for neural network modelling. Most of what I want to do is somewhat bespoke, so I need something more expressive than Keras. Two-dimensional convolution is explained in the TensorFlow documentation here. I also found the examples in the Stack Overflow answer here very useful.
To summarise, our input is an [a, b, c, d] tensor – i.e. a x b x c x d.
- a is the number of input ‘images’ or examples (this will typically be your batch size);
- b is the input width (e.g. image width, max. word length in characters or max. sequence length in words);
- c is the input height (e.g. image height or embedding dimensionality); and
- d is the number of input channels (grayscale images, characters or words = 1, RGB = 3).
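As a concrete sketch of this layout (the sizes here are hypothetical, not taken from any particular model), an NLP-style input tensor might look like this:

```python
import numpy as np

# Hypothetical sizes: a = 32 examples per batch, b = 16 words per
# sequence, c = 128 embedding dimensions, d = 1 input channel.
inputs = np.zeros((32, 16, 128, 1), dtype=np.float32)

assert inputs.shape == (32, 16, 128, 1)  # a x b x c x d
```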
For the convolution kernel filter you also have a [o, p, q, r] tensor – i.e. o x p x q x r.
- o is the filter width (e.g. patch width or ‘n-gram’ size);
- p is the filter height (e.g. patch height or embedding dimensionality);
- q is the number of channels (from the input – i.e. input channels); and
- r is the number of filters (or output channels).
q has to match d, the number of input channels. r – the number of output channels – is equal to the number of filters you want: for each output channel, a separate o x p filter is created and applied.
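Putting the two shapes together, a minimal tf.nn.conv2d sketch might look as follows. The sizes are hypothetical: 16-word sequences of 128-dimensional embeddings, with filters spanning 3 words (a 'trigram' window) and the full embedding height.

```python
import numpy as np
import tensorflow as tf

# Hypothetical input: a = 32 examples, b = 16 words, c = 128
# embedding dims, d = 1 input channel.
inputs = tf.constant(np.zeros((32, 16, 128, 1), dtype=np.float32))

# Hypothetical filter: o = 3 ('n-gram' width), p = 128 (full
# embedding height), q = 1 channel (matches d), r = 64 filters.
filters = tf.constant(np.zeros((3, 128, 1, 64), dtype=np.float32))

conv = tf.nn.conv2d(inputs, filters, strides=[1, 1, 1, 1],
                    padding="VALID")

# With VALID padding the sequence dimension shrinks to 16 - 3 + 1 = 14
# and the embedding dimension collapses to 1, one map per filter.
assert list(conv.shape) == [32, 14, 1, 64]
```

Note how the filter's p dimension (128) covers the whole embedding, so each filter slides only along the sequence.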
Most of the time, the theory talks about an image of b x c and a filter of o x p.
(This follows Denny's configuration. However, note that you can transpose b and c – and correspondingly o and p – and get the same outcome.)
For NLP the number of output channels becomes an important parameter. This is because you will typically max-pool over the sequence (e.g. word or character) length, such that you get one value for the complete sequence per filter. Each filter can be thought of as representing something like an n-gram, e.g. the learnt parameters of one filter of length 3 could represent the character embeddings for the suffix “ly” (e.g. [l_embedding, y_embedding, wordend_embedding]) or the prefix “un” (including word start token).
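As a sketch of this pooling step, assume a hypothetical convolution output of shape [32, 14, 1, 64] (32 examples, 14 valid window positions over the sequence, 64 filters); max-pooling over the full sequence dimension leaves one value per filter:

```python
import numpy as np
import tensorflow as tf

# Hypothetical conv output: 32 examples, 14 positions (16-word
# sequence, 3-wide filter, VALID padding), height 1, 64 filters.
conv = tf.constant(np.zeros((32, 14, 1, 64), dtype=np.float32))

# Pool over the entire sequence length (ksize of 14 in that
# dimension), so each filter yields a single value per example.
pooled = tf.nn.max_pool(conv, ksize=[1, 14, 1, 1],
                        strides=[1, 1, 1, 1], padding="VALID")

assert list(pooled.shape) == [32, 1, 1, 64]
```

The resulting [32, 1, 1, 64] tensor is typically squeezed to [32, 64] – one feature per filter per example – before the next layer.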
I found it instructive to work through the code associated with Denny’s further blog post here.