
I am using GloVe embeddings and I am quite confused about the tokens and vocab listed for each embedding, like this one:

Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)

What do tokens and vocab mean, respectively? What is the difference?

Zhao

2 Answers


In NLP, tokens refers to the total number of "words" in your corpus. I put "words" in quotes because the definition varies by task. The vocab is the number of unique "words".

It should be the case that vocab <= tokens.
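
A quick way to see the distinction is to count both on a toy example. Here is a minimal Python sketch (the corpus string is made up, and real GloVe corpora are tokenized more carefully than a plain `split()`):

```python
from collections import Counter

# Made-up toy corpus; real GloVe corpora contain billions of tokens.
corpus = "the cat sat on the mat and the cat slept"

tokens = corpus.split()   # a "word" here is just a whitespace-separated string
vocab = Counter(tokens)   # the unique words (with their frequencies)

print(len(tokens))  # 10 tokens: every word occurrence counts
print(len(vocab))   # 7 vocab entries: the, cat, sat, on, mat, and, slept
```

In the Common Crawl download above, 840B is that total occurrence count, and 2.2M is the number of distinct words that each get a vector.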

aberger

Tokens are obtained by running a tokenizer (which may itself be trained on your corpus) over your text, and a token is not necessarily the same as a word.

A word of length 10 might be split into 2 or 3 subword tokens. How a word is split basically determines how well you can represent it and make it mean something to your model.
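
A minimal sketch of that idea in Python (the subword vocabulary and the greedy splitter below are hypothetical; real systems learn the vocabulary from data with algorithms such as byte-pair encoding):

```python
# Hypothetical subword vocabulary; real ones are learned from data (e.g. via BPE).
SUBWORDS = {"token", "ization", "embed", "ding", "s"}

def subword_split(word):
    """Greedily split a word into the longest known subword pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest remaining piece first
            if word[i:j] in SUBWORDS:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # no piece matched: fall back to one character
            i += 1
    return pieces

print(subword_split("tokenization"))  # ['token', 'ization'] -> 2 tokens, 12 letters
print(subword_split("embeddings"))    # ['embed', 'ding', 's'] -> 3 tokens, 10 letters
```

Note that the pretrained GloVe files in the question are word-level: each of the 2.2M vocab entries is a whole word, so subword splitting applies more to the tokenizers used by newer models than to GloVe itself.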

Daniel Rudy