
I read the following blog post and tried to implement it via Keras: https://andriymulyar.com/blog/bert-document-classification

Now, I'm quite new to Keras and I do not understand how to use "seq2seq neural networks" to condense a sequence of sub-chunks (sentences) into a global context vector (document vector) via an LSTM.

For example, I have 10 documents consisting of 100 sentences each, and each sentence is represented by a 1x500 vector. So the array would look like this:

import numpy as np

X = np.array(Matrix).reshape(10, 100, 500)  # reshape to 10 documents, each a sequence of 100 sentence vectors with 500 features

So I know I want to train my network and take the last hidden state, because this represents my document vector / global context vector.

However, the hardest part for me is figuring out the target vector. Do I just enumerate my documents:

y = [1,2,3,4,5,6,7,8,9,10]
y = np.array(y)

or do I have to use one-hot-encoded output vectors:

from tensorflow.keras.utils import to_categorical

yy = to_categorical(y)

or even something else?

As far as I understand, the final model should look something like this:

model = Sequential()
model.add(LSTM(50, input_shape=(100,500)))
model.add(Dense(1))
model.compile(loss='categorical_crossentropy',optimizer='rmsprop')
model.fit(X, yy, epochs=100, validation_split=0.2, verbose=1)

Felix

1 Answer

It depends only on how your target data is encoded:

For one-hot encoded targets, use categorical crossentropy loss:

model.compile(loss='categorical_crossentropy',optimizer='rmsprop')

For label (integer) encoded targets, use sparse categorical crossentropy loss:

model.compile(loss='sparse_categorical_crossentropy',optimizer='rmsprop')

The underlying approach is the same in both versions. So if you have target data y like:

Class1 Class2 Class3
0      0      1
1      0      0
1      0      0
0      1      0

You should compile your model like:

model.compile(loss='categorical_crossentropy',optimizer='rmsprop')

Conversely, if you have target data y like:

labels
2
0
0
1

You should compile your model like:

model.compile(loss='sparse_categorical_crossentropy',optimizer='rmsprop')

The result and performance of your model will be the same; only the memory usage may differ.
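
For illustration, here is a minimal runnable sketch of both variants (assumptions: `tensorflow.keras`, randomly generated dummy data in the question's 10x100x500 shape, and a softmax output layer with one unit per class):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.utils import to_categorical

X = np.random.rand(10, 100, 500)        # 10 documents, 100 sentences, 500 features each
y = np.arange(10)                       # integer labels 0..9
yy = to_categorical(y, num_classes=10)  # one-hot labels, shape (10, 10)

def build_model():
    model = Sequential()
    model.add(LSTM(50, input_shape=(100, 500)))
    model.add(Dense(10, activation='softmax'))  # one output unit per class
    return model

# Variant 1: one-hot targets + categorical crossentropy
model_onehot = build_model()
model_onehot.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model_onehot.fit(X, yy, epochs=2, verbose=0)

# Variant 2: integer targets + sparse categorical crossentropy
model_sparse = build_model()
model_sparse.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop')
model_sparse.fit(X, y, epochs=2, verbose=0)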

Geeocode
  • Thanks, I see. For one-hot-encoded label data it works now, but I also forgot to adjust to `model.add(Dense(10))` to reflect the 10 labels. However, it still does not work for target data that is just one output vector `y=[1,2,...10]` to reflect all documents. I get the following error (if done for 5 documents, `y=[1,2,3,4,5]`): `InvalidArgumentError: Received a label value of 5 which is outside the valid range of [0, 5). Label values: 5 [[{{node loss_15/dense_15_loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits}}]]` – Felix Dec 25 '19 at 17:54
  • @Felix If the output Dense layer has 5 outputs, it will accept only labels in [0, 1, 2, 3, 4], as Python's indexing starts from `0`. So your labels should have these values; that is why it is complaining. Please don't forget to accept my answer with a check mark if you are satisfied. – Geeocode Dec 25 '19 at 21:54
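
For reference, a minimal sketch of the fix discussed in the comments above (assuming the 5-document case): shift the labels so they start at 0 and fall in the valid range [0, 5).

import numpy as np

# Labels 1..5 are outside the valid range [0, 5) for a Dense(5) output;
# shifting them down by one makes every label valid.
y = np.array([1, 2, 3, 4, 5]) - 1  # -> array([0, 1, 2, 3, 4])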