I'm using the Rotten Tomatoes database as my dataset. I formatted the data following this code, so every sentence is 56 words long: if a sentence has fewer than 56 words, the code appends PAD tokens to the end of the sequence. For example, to keep it readable, imagine the sentence length is 5 instead of 56:
Before:
complete_sentence = ['a', 'b', 'c', 'd', 'e']
not_complete_sentence = ['a', 'b', 'c']
After:
complete_sentence = ['a', 'b', 'c', 'd', 'e']
not_complete_sentence = ['a', 'b', 'c', 'PAD', 'PAD']
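To make the padding step concrete, it amounts to roughly the following. This is only a minimal sketch, not the linked code itself: `pad_sentence` and the `PAD` constant are names made up for illustration, and it assumes no sentence is longer than the target length.

```python
PAD = "PAD"  # hypothetical padding token, matching the example above


def pad_sentence(tokens, length, pad_token=PAD):
    """Return a copy of tokens right-padded with pad_token up to length."""
    if len(tokens) > length:
        # Assumption: sentences never exceed the target length.
        raise ValueError("sentence is longer than the target length")
    return list(tokens) + [pad_token] * (length - len(tokens))


complete_sentence = pad_sentence(['a', 'b', 'c', 'd', 'e'], 5)
not_complete_sentence = pad_sentence(['a', 'b', 'c'], 5)
```

With a target length of 56 instead of 5, every sentence ends up with exactly 56 tokens.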
After processing the data, I convert each sentence to a Caffe Datum:
from caffe.proto import caffe_pb2

datum = caffe_pb2.Datum()
datum.channels = 1
datum.height = txt_array.shape[0]  # sentence length, 56
datum.width = 1
datum.label = label  # 0 or 1
datum.data = txt_array.tobytes()  # raw bytes of the np.array
Here label is 0 or 1 (positive or negative review) and txt_array is a formatted sentence as an np.array. Finally, I write these datums to two LMDBs, one for training and the other for testing.
I want to know whether this is a good configuration for my dataset. I used something similar for images, but is this configuration also valid for text, or do I have to do something different?