It has been shown that CNNs (convolutional neural networks) are quite useful for text/document classification. Since articles usually differ in length, I wonder how to deal with these length differences. Are there any examples in Keras? Thanks!!
4 Answers
Here are three options:
- Crop the longer articles.
- Pad the shorter articles (a short sketch of cropping and padding follows below).
- Use a recurrent neural network, which naturally supports variable-length inputs.
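For options 1 and 2, Keras has a helper that does both at once. Here is a minimal sketch; the maxlen of 5 is an arbitrary choice, and the integer sequences stand in for tokenized articles:
from keras.preprocessing.sequence import pad_sequences

# Two integer-encoded "articles" of different lengths
sequences = [[12, 7, 301],
             [5, 9, 42, 13, 8, 77, 2]]

# Longer articles are truncated, shorter ones are zero-padded,
# so every row ends up with exactly maxlen tokens.
X = pad_sequences(sequences, maxlen=5, padding='post', truncating='post')
print(X.shape)  # (2, 5)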

- Will options 1 and 2 affect the original meaning of the articles after cutting or padding? – Fiong Jun 02 '16 at 02:21
- Cutting probably will (padding not so much), but do you really need to read an entire news article to get the gist of it? How disadvantageous cutting is depends on your task. – 1'' Jun 02 '16 at 02:39
- Regarding 3, I think that is true if you have a sequence-to-sequence problem, like POS tagging. For sequence classification, like sentiment analysis or emotion detection, I believe you have to do truncating/padding in Keras in order to use an RNN. – pedrobisp Jun 02 '16 at 12:21
- @pedrobisp Labelling variable-length sequences should definitely be possible with RNNs. – 1'' Jun 02 '16 at 16:40
- Do you have any example code where variable-length sequences are given as input to an RNN in Keras? From what I have seen so far, you always have to apply padding/truncating to get sequences of the same size. – pedrobisp Jul 04 '16 at 15:34
- Hm, maybe you do after all. I haven't tried it personally. – 1'' Jul 04 '16 at 16:41
- @pedrobisp I am really new to Keras and deep learning; could you give me an explanation of why `sentiment analysis or emotion detection` need truncating/padding? – Mithril Jan 12 '17 at 03:09
- This has nothing to do with the task of sentiment analysis. We are talking about the RNN implementation in Keras. – pedrobisp Jan 12 '17 at 18:56
You can see a concrete example here: https://github.com/fchollet/keras/blob/master/examples/imdb_cnn.py
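That script crops/pads the IMDB reviews to a fixed length and then runs a 1-D convolution over the embedded tokens. A condensed sketch of the idea (the layer sizes below are abbreviated, not the exact values from the example):
from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

max_features = 5000   # vocabulary size
maxlen = 400          # every review is cropped/padded to this many tokens

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
model.add(Embedding(max_features, 50, input_length=maxlen))
model.add(Conv1D(250, 3, padding='valid', activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_test, y_test))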

One possible solution is to send your sequences in batches of 1.
n_batch = 1
model.fit(X, y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False)
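With a batch size of one, nothing has to be padded to a common length; the model just needs to declare the timestep dimension as None. A minimal sketch of that idea (the layer sizes and the 8-dimensional features are arbitrary assumptions):
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, input_shape=(None, 8)))   # None = variable number of timesteps
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

# Each sample keeps its own length; with a batch of one no padding is needed.
for x, target in [(np.random.rand(1, 5, 8), np.array([1])),    # 5 timesteps
                  (np.random.rand(1, 11, 8), np.array([0]))]:  # 11 timesteps
    model.train_on_batch(x, target)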
This issue on the official Keras repo gives good insight and a possible solution: https://github.com/keras-team/keras/issues/85
Quoting patyork's comment:
There are two simple and most often implemented ways of handling this:
- Bucketing and Padding
Separate input samples into buckets that have similar lengths, ideally such that each bucket has a number of samples that is a multiple of the mini-batch size. For each bucket, pad the samples to the length of the longest sample in that bucket with a neutral number. Zeros are frequent, but for something like speech data a representation of silence is used, which is often not zeros (e.g. the FFT of a silent portion of audio is used as a neutral padding).
- Bucketing
Separating input samples into buckets of exactly the same length removes the need to determine what a neutral padding is; however, the size of the buckets in this case will frequently not be a multiple of the mini-batch size, so in each epoch several updates will not be based on a full mini-batch.
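A small sketch of the bucketing-and-padding idea (the bucket boundaries and the helper name are just illustrative assumptions):
from keras.preprocessing.sequence import pad_sequences

def bucket_and_pad(sequences, boundaries=(16, 32, 64)):
    """Group integer-encoded sequences into length buckets, then pad each
    bucket (with zeros) to the length of its longest member."""
    buckets = {b: [] for b in boundaries}
    for seq in sequences:
        for b in boundaries:
            if len(seq) <= b:
                buckets[b].append(seq)
                break
        else:
            # anything longer than the last boundary gets cropped into it
            buckets[boundaries[-1]].append(seq[:boundaries[-1]])
    return {b: pad_sequences(seqs, padding='post')
            for b, seqs in buckets.items() if seqs}

# Each padded bucket can then be passed to model.fit / train_on_batch separately.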

I just built a model in Keras using its LSTM layer, and it forced me to pad my inputs (i.e. the sentences). I simply appended empty strings to each sentence until it reached the desired length, typically the length (in words) of the longest feature. Then I was able to use GloVe to transform my features into vector space before running them through my model.
import numpy as np

def getWordVector(X):
    global num_words_kept
    global word2vec
    global word_vec_dim
    input_vector = []
    for row in X:
        words = row.split()
        # Truncate long sentences and pad short ones with empty strings
        if len(words) > num_words_kept:
            words = words[:num_words_kept]
        elif len(words) < num_words_kept:
            for i in range(num_words_kept - len(words)):
                words.append("")
        input_to_vector = []
        for word in words:
            if word in word2vec:
                # multidimensional word vector from the embedding lookup
                input_to_vector.append(np.array(word2vec[word]).astype(float).tolist())
            else:
                # a value far from the real embeddings, so padding/unknown words
                # are not too similar to any actual word
                input_to_vector.append([5.0] * word_vec_dim)
        input_vector.append(np.array(input_to_vector).tolist())
    input_vector = np.array(input_vector)
    return input_vector
Where X is the list of sentences, this function will return a word vector (using GloVe's word-to-vector mapping) of num_words_kept length for each entry in the returned array. So I am using both padding and truncating: padding for the Keras implementation, and truncating because Keras also has issues when your inputs differ vastly in size... I'm not entirely sure why, but I ran into problems once I started padding some sentences with more than 100 empty strings.
X = getWordVector(features)
y = to_categorical(y)# for categorical_crossentropy
model.fit(X, y, batch_size=16, epochs=5, shuffle=False)
Keras requires that you use numpy arrays before feeding in your data, so both my features and labels are numpy arrays.
