
I'm looking at the ELMo model on TensorFlow Hub, and I'm not very clear about what tokens_length = [6, 5] means in the example usage from the docs (https://tfhub.dev/google/elmo/2):

import tensorflow_hub as hub

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
tokens_input = [["the", "cat", "is", "on", "the", "mat"],
                ["dogs", "are", "in", "the", "fog", ""]]
tokens_length = [6, 5]
embeddings = elmo(
    inputs={
        "tokens": tokens_input,
        "sequence_len": tokens_length
    },
    signature="tokens",
    as_dict=True)["elmo"]

It doesn't look like the max length of the input token sentences, and it doesn't look like [max number of words per sentence, number of sentences] either, which makes me confused. Could someone explain this? Thanks!

xiao

1 Answer


The first example has length 6 and the second example has length 5; i.e. "the cat is on the mat" is 6 words long, but "dogs are in the fog" is only 5 words long. The extra empty string padding the second input does add a little confusion :-/

If you read the docs on that page, they explain why this is needed (bold mine):

With the tokens signature, the module takes tokenized sentences as input. The input tensor is a string tensor with shape [batch_size, max_length] and an int32 tensor with shape [batch_size] corresponding to the sentence length. **The length input is necessary to exclude padding in the case of sentences with varying length.**
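To make that concrete, here is a minimal sketch in plain Python (the sentences variable and the padding logic are my own illustration, not from the module docs) of how such a batch is typically built: pad every sentence with "" up to the longest one, and record each sentence's true length separately.

sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog"]]

# pad each row with "" to the length of the longest sentence
max_length = max(len(s) for s in sentences)
tokens_input = [s + [""] * (max_length - len(s)) for s in sentences]

# the true, unpadded length of each sentence: [6, 5]
tokens_length = [len(s) for s in sentences]

The module then reads exactly tokens_length[i] tokens from row i and ignores any padding that follows.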

Stewart_R
  • aha, thanks! One more question: do I have to give the valid length for each sentence when using the tokens signature? E.g. if the training data set has 100 samples, tokens_length would be a 100-dimensional vector recording the length of every sample. If that is true, it seems like some kind of trouble :( – xiao Jun 27 '19 at 03:23
  • Glad to help! :-) That's my understanding, yes. Shouldn't be too much trouble to count the words in each sample though, no? – Stewart_R Jun 27 '19 at 06:08
  • emmm... no, not too much trouble, but it still needs some work. Maybe as a beginner, I'm just 'spoiled', haha – xiao Jun 27 '19 at 06:28
  • Then is it possible to write the code like this? embeddings = elmo(inputs={"tokens": tokens_input, "sequence_len": [len(sample) - sample.count("") for sample in tokens_input]}, signature="tokens", as_dict=True)["elmo"] This returns the length of each sentence without the padding. – ScubaChris Dec 25 '19 at 21:05
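For reference, here is ScubaChris's suggestion as a self-contained snippet. It assumes "" appears only as padding, never as a real token, and uses the same TF1-style hub.Module API as the question:

import tensorflow_hub as hub

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
tokens_input = [["the", "cat", "is", "on", "the", "mat"],
                ["dogs", "are", "in", "the", "fog", ""]]

# count only the non-padding tokens in each row: [6, 5]
tokens_length = [len(sample) - sample.count("") for sample in tokens_input]

embeddings = elmo(
    inputs={"tokens": tokens_input, "sequence_len": tokens_length},
    signature="tokens",
    as_dict=True)["elmo"]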