
I've got a sequence labeling task, and I'd like to build a CNN that takes a fixed number of embeddings (character- or word-based) as input and extracts n-gram-like features via convolution/pooling.

I haven't previously used convolution (for text or otherwise), so I'm not sure which architecture makes more sense in this setup:

  • Conv1D/MaxPool1D - extracting n-grams at the Conv stage makes sense, but what does such pooling produce? Is it just one dimension holding the max value of the embedding?
  • Conv2D/MaxPool2D - although I've seen it more frequently in existing approaches, convolving along the token embedding's dimensions doesn't make sense to me.

Could you please share your intuition on that?

Igor Shalyminov

1 Answer


I have only done sequence labeling with RNNs (Recurrent Neural Networks) and image classification with CNNs, but I think I can at least describe what the first and second setups are doing:

  1. Conv1D/MaxPool1D: The convolution takes the embeddings and convolves them with a filter, in this case an f X f X 1 kernel. That produces a smaller feature map that is now more sophisticated than the input.

    Let me describe what I mean here. Say you have a grayscale image that is 6 X 6 pixels (grayscale means only 1 channel, where RGB would have 3 channels) and you convolve it with a filter (kernel) that is

    [[1, 0, -1],
     [1, 0, -1],
     [1, 0, -1]]
    Then the output will be a 4 X 4 X 1 image (with stride 1 and no padding, 6 - 3 + 1 = 4), and the pixels no longer represent a grayscale value but instead how strongly the filter thinks there is a vertical edge at that position, due to the 0 middle column with 1s on one side and -1s on the other. (The NumPy sketch after this list reproduces these numbers.)

    Then the max pool will take that 4 X 4 X 1 output, place an f-sized square on it, and slide that square across the new image with stride s. So in this example say f = 2 and s = 2. Then we have a 2 X 2 square moving across that image, and the result will be a 2 X 2 X 1 output with

    [[max of upper-left 2 X 2 block, max of upper-right 2 X 2 block],
    [max of lower-left 2 X 2 block, max of lower-right 2 X 2 block]]

    This is then a 2 X 2 matrix whose cell values indicate a high or low chance of a vertical bar in the corresponding region of the original image.

    Now to put that into embedded sequences: if you have n words, each represented by values for m variables (embedding dimensions), then you have an n X m X 1 matrix. You would convolve that with an f X f filter to create your more sophisticated feature map, which would then be run through the max pooling stage with an f X f square (or window) and a stride of s. Depending on how you set up your filter, this could pick out the words that are most strongly influenced by certain variables (the Keras sketch after this list makes the shapes concrete).

  2. Conv2D/MaxPool2D: This is exactly the same as the above, but now you have an n X m X 2 matrix. Every other step is the same, except that your filter is now an f X f X 2 kernel and your output (continuing our example) is a 2 X 2 X 2 matrix rather than a 2 X 2 X 1 matrix. This could have a few different effects. You could copy the n X m X 1 matrix into the second channel and give the filter different weights in each channel, which in turn could give more practical insight at the end by showing variables that are maximal in not just one but two categories of variables. (The Keras sketch below shows the more common single-channel Conv2D setup.)
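
To make the numbers in the image walk-through concrete, here is a minimal NumPy sketch of the same pipeline. This is illustration code under my own assumptions, not library code: the loop implements "valid" sliding-window filtering with stride 1 (strictly speaking cross-correlation, which is what deep learning libraries compute as "convolution" anyway), and the reshape trick implements 2 X 2 max pooling with stride 2.

    import numpy as np

    # Toy 6 x 6 grayscale image with a bright vertical bar in columns 2-3.
    img = np.zeros((6, 6))
    img[:, 2:4] = 1.0

    # The vertical-edge filter from the answer above.
    kernel = np.array([[1, 0, -1],
                       [1, 0, -1],
                       [1, 0, -1]], dtype=float)

    # "Valid" filtering with stride 1: a 6 x 6 input and a 3 x 3 filter
    # give a (6 - 3 + 1) x (6 - 3 + 1) = 4 x 4 feature map.
    fmap = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            fmap[i, j] = np.sum(img[i:i + 3, j:j + 3] * kernel)

    # 2 x 2 max pooling with stride 2: the 4 x 4 map becomes 2 x 2.
    pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))

    print(fmap.shape, pooled.shape)  # (4, 4) (2, 2)

And here is a sketch of the two setups the question actually asks about, assuming TensorFlow/Keras; the sizes n_tokens and embed_dim and the filter counts are made up for illustration. The key contrast: a Conv1D filter spans the full embedding dimension, so it acts like an n-gram detector over tokens, while a Conv2D filter also slides along the embedding axis.

    import numpy as np
    from tensorflow.keras import Input, Model, layers

    n_tokens, embed_dim = 20, 50  # illustrative sizes
    x = np.random.rand(1, n_tokens, embed_dim).astype("float32")

    # Conv1D/MaxPool1D: each filter covers kernel_size tokens and all
    # embed_dim dimensions at once; pooling then keeps the strongest
    # n-gram response per window along the token axis.
    inp1 = Input(shape=(n_tokens, embed_dim))
    h1 = layers.Conv1D(filters=64, kernel_size=3, activation="relu")(inp1)  # -> (18, 64)
    h1 = layers.MaxPool1D(pool_size=2)(h1)                                  # -> (9, 64)
    print(Model(inp1, h1)(x).shape)  # (1, 9, 64)

    # Conv2D/MaxPool2D: the embedding matrix is treated as an
    # n_tokens x embed_dim "image" with 1 channel, so the filter also
    # slides along the embedding axis, mixing neighboring dimensions.
    inp2 = Input(shape=(n_tokens, embed_dim, 1))
    h2 = layers.Conv2D(filters=64, kernel_size=(3, 3), activation="relu")(inp2)  # -> (18, 48, 64)
    h2 = layers.MaxPool2D(pool_size=(2, 2))(h2)                                  # -> (9, 24, 64)
    print(Model(inp2, h2)(x[..., np.newaxis]).shape)  # (1, 9, 24, 64)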

I hope this helped in some way.