I have a tokenized list of words in a vocabulary. (It's been passed through a set, so there are no duplicates.)

My problem

I want to build a dictionary that maps each word to its index in the vocabulary.

My attempt

My current method is like so:

mapping = { w : vocabulary.index(w) for w in vocabulary }

This works, but it is far too slow, probably because vocabulary.index(w) rescans the list for each of the thousands of words.
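
To make the cost concrete, here is a rough sketch that counts how many list elements a linear search inspects while building the mapping this way. The helper index_with_count is a hypothetical stand-in for list.index, written out only so the scan can be counted, and the vocabulary is made-up sample data.

```python
def index_with_count(items, target):
    """Hypothetical stand-in for list.index that also reports how many
    elements were inspected before the target was found."""
    for steps, item in enumerate(items, start=1):
        if item == target:
            return steps - 1, steps
    raise ValueError(f"{target!r} is not in list")


# Made-up sample vocabulary with no duplicates.
vocabulary = [f"word{n}" for n in range(1000)]

mapping = {}
total_steps = 0
for w in vocabulary:
    idx, steps = index_with_count(vocabulary, w)
    mapping[w] = idx
    total_steps += steps

# For n words the scans add up to n * (n + 1) / 2 inspections.
print(total_steps)
```

The total grows quadratically with the vocabulary size, which would explain the slowdown for large vocabularies.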

Question

Is there a library that does this more efficiently, or simply a more efficient way to write it?

Thanks.

POSSIBLE SOLUTION 1

Currently, for every word in 'vocabulary', vocabulary.index() is called, which scans 'vocabulary' from the start to find that word's index. As suggested in an answer, one option is to enumerate 'vocabulary' instead. This yields each word together with its index in a single pass, like so:

mapping = { w : i for i, w in enumerate(vocabulary) }
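
A small self-contained sketch of the single-pass version, checked against the index-based one. The sample tokens are made up, and sorting the set is an assumption added here so that the indices are reproducible between runs, since set iteration order is not guaranteed.

```python
# Made-up sample tokens; the set removes duplicates.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Assumption: sort so that each word gets the same index on every run.
vocabulary = sorted(set(tokens))

# Single pass: enumerate yields (index, word) pairs.
mapping = {w: i for i, w in enumerate(vocabulary)}

# Same result as the index-based version, which rescans the list per word.
slow_mapping = {w: vocabulary.index(w) for w in vocabulary}
assert mapping == slow_mapping

print(mapping)  # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
```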
quanty

1 Answer

Try changing your code as follows:

mapping = { w : i for i, w in enumerate(vocabulary) }

Here i is the index of the word w, so the list is traversed only once.

user3687197
  • Oh great, that sounds good. So in doing this, you enumerate the vocab, which allows you to pass through it once to create the dict? – quanty Feb 17 '18 at 15:12