
I'm using the Rotten Tomatoes dataset. Following this code, I formatted the data so that every sentence is 56 words long: if a sentence is shorter than 56 words, the code appends PAD tokens to the end of the sequence. For comprehension, imagine the sentence length is 5 instead of 56:

Before:

complete_sentence = ['a', 'b', 'c', 'd', 'e']
not_complete_sentence = ['a', 'b', 'c']

After:

complete_sentence = ['a', 'b', 'c', 'd', 'e']
not_complete_sentence = ['a', 'b', 'c', 'PAD', 'PAD']
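
A minimal sketch of this padding step (the function name pad_sentence and the max_len argument are illustrative names of mine, not from the original preprocessing code):

def pad_sentence(tokens, max_len=56, pad_token='PAD'):
    # Append PAD tokens until the sentence reaches max_len words
    return tokens + [pad_token] * (max_len - len(tokens))

not_complete_sentence = ['a', 'b', 'c']
print(pad_sentence(not_complete_sentence, max_len=5))
# ['a', 'b', 'c', 'PAD', 'PAD']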

After processing the data, I convert each sentence into a Caffe Datum:

import caffe

# Wrap one padded sentence (shape 56x1) in a Caffe Datum
datum = caffe.proto.caffe_pb2.Datum()
datum.channels = 1
datum.height = txt_array.shape[0]  # 56 (sentence length)
datum.width = 1
datum.label = label                # 0 or 1 (review sentiment)
datum.data = txt_array.tobytes()

Here label is 0 or 1 (positive or negative review) and txt_array is the formatted sentence as an np.array. Finally, I put the datums into two LMDBs, one for training and one for testing.
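
Roughly, the LMDB writing step looks like this (a sketch using the lmdb Python package; the database name, key format, and map_size are illustrative assumptions, and index is the running sample counter from the loop over sentences):

import lmdb

# Open (or create) the training LMDB; map_size is just a generous upper bound
env = lmdb.open('train_lmdb', map_size=int(1e9))
with env.begin(write=True) as txn:
    # Keys only need to be unique; zero-padded indices keep entries ordered
    txn.put('{:08d}'.format(index).encode('ascii'), datum.SerializeToString())

The test LMDB is built the same way from the held-out sentences.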

I want to know whether this is a good configuration for my dataset. I used something similar for images, but is this configuration also valid for text, or do I have to do something different for text?

  • Did you try this [tutorial](http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/)? It uses TensorFlow, but I think you can adapt it to Caffe. – Pasdf Jun 13 '16 at 14:13
  • Yes, I'm following it, but I don't know how to adapt it to Caffe. =( – Carlos Porta Jun 13 '16 at 14:17

0 Answers