I've got a sequence labeling task, and I'd like to build a CNN that takes a fixed number of embeddings (character- or word-based) as input and extracts n-gram-like features via convolution/pooling.
I haven't previously used convolution (for text or otherwise), so I'm not sure which architecture makes more sense in this setup:
- Conv1D/MaxPool1D - extracting n-grams at the Conv stage makes sense, but what does such pooling produce? Is it just one dimension holding the max value of the embedding?
- Conv2D/MaxPool2D - although I've seen it more often in existing approaches, convolving along the dimensions of a token's embedding doesn't make sense to me. (I've sketched both options below.)
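
To make the shapes concrete, here is a minimal sketch of the two options as I understand them (assuming tf.keras; `seq_len`, `emb_dim`, and all hyperparameter values are arbitrary placeholders, not a real model):

```python
import tensorflow as tf
from tensorflow.keras import layers

seq_len, emb_dim = 50, 100  # fixed number of tokens, embedding size (placeholders)

# Option 1: Conv1D slides along the token axis; embedding dims are the input channels.
x1 = tf.keras.Input(shape=(seq_len, emb_dim))      # (batch, 50, 100)
c1 = layers.Conv1D(filters=64, kernel_size=3)(x1)  # (batch, 48, 64): one 3-gram feature map per filter
p1 = layers.MaxPool1D(pool_size=2)(c1)             # (batch, 24, 64): max over the position axis, per filter

# Option 2: Conv2D on the (tokens x dims) grid, with the kernel spanning
# the full embedding width so it never slides along the embedding axis.
x2 = tf.keras.Input(shape=(seq_len, emb_dim, 1))              # (batch, 50, 100, 1)
c2 = layers.Conv2D(filters=64, kernel_size=(3, emb_dim))(x2)  # (batch, 48, 1, 64)
p2 = layers.MaxPool2D(pool_size=(2, 1))(c2)                   # (batch, 24, 1, 64)
```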
Could you please share your intuition on that?