
I'm using the spaCy tokenizer to tokenize my data and then build a vocab.

This is my code:

import spacy
nlp = spacy.load("en_core_web_sm")

def build_vocab(docs, max_vocab=10000, min_freq=3):
    stoi = {'<PAD>':0, '<UNK>':1}
    itos = {0:'<PAD>', 1:'<UNK>'}
    word_freq = {}
    idx = 2
    for sentence in docs:
        for word in [i.text.lower() for i in nlp(sentence)]:
            if word not in word_freq:
                word_freq[word] = 1
            else:
                word_freq[word] += 1

            if word_freq[word] == min_freq:
                if len(stoi) < max_vocab:
                    stoi[word] = idx
                    itos[idx] = word
                    idx += 1
    return stoi, itos

But it takes hours to complete, since I have more than 800,000 sentences.

Is there a faster and better way to achieve this? Thanks.

Update: I tried to remove min_freq:

def build_vocab(docs, max_vocab=10000):
    stoi = {'<PAD>':0, '<UNK>':1}
    itos = {0:'<PAD>', 1:'<UNK>'}
    idx = 2
    for sentence in docs:
        for word in [i.text.lower() for i in nlp(sentence)]:
            if word not in stoi:
                if len(stoi) < max_vocab:
                    stoi[word] = idx
                    itos[idx] = word
                    idx += 1
    return stoi, itos

It still takes a long time. Does spaCy have a function to build a vocab, like torchtext's .build_vocab?
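For reference, this is the kind of one-call vocab build I mean from torchtext. This is only a sketch: it assumes a newer torchtext (0.12+) where build_vocab_from_iterator accepts specials and max_tokens, and it still pays the same tokenization cost as my code above.

from torchtext.vocab import build_vocab_from_iterator

def yield_tokens(sentences):
    for sentence in sentences:
        yield [tok.text.lower() for tok in nlp(sentence)]

vocab = build_vocab_from_iterator(
    yield_tokens(docs),
    min_freq=3,
    specials=['<PAD>', '<UNK>'],
    max_tokens=10000,
)
vocab.set_default_index(vocab['<UNK>'])  # unknown words fall back to <UNK>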

testaja
  • Do you need to do `if word_freq[word] == min_freq` inside the second loop? – Hernán Alarcón Mar 28 '21 at 03:51
  • I tried to remove min_freq and set it to if word not in stoi: but that doesn't help much, still takes hours to complete. – testaja Mar 28 '21 at 03:56
  • I meant whether you have to check the frequency for every word in every sentence, or whether you can do it at the end. If you have 800,000 sentences and, let's say, every sentence has 10 words, you are doing that comparison 8,000,000 times. But I guess words are repeated among sentences, so if the 800,000 sentences are written using 8,000 unique words, you will only need to do that comparison 8,000 times (1,000 times fewer comparisons). – Hernán Alarcón Mar 28 '21 at 04:09
  • Sorry, I'm afraid I misunderstood what you meant; please check my updated question. Is that what you meant? – testaja Mar 28 '21 at 04:38
  • I think it is best to use something like `sentencepiece` to build a vocab file that can be used to tokenize corpora for training models. – Wiktor Stribiżew Mar 28 '21 at 14:54

1 Answer


There are a couple of things you can do to make this faster.

import spacy
from collections import Counter

def build_vocab(texts, max_vocab=10000, min_freq=3):
    nlp = spacy.blank("en")  # just the tokenizer, no tagger/parser/NER
    wc = Counter()
    # Count lowercased token frequencies over the whole corpus first
    for doc in nlp.pipe(texts):
        for word in doc:
            wc[word.lower_] += 1

    # Then assign IDs to the most frequent tokens, stopping at min_freq / max_vocab
    word2id = {}
    id2word = {}
    for word, count in wc.most_common():
        if count < min_freq:
            break
        if len(word2id) >= max_vocab:
            break
        wid = len(word2id)
        word2id[word] = wid
        id2word[wid] = word
    return word2id, id2word

Explanation:

  1. If you only need the tokenizer, you can use spacy.blank instead of loading a full pretrained pipeline
  2. nlp.pipe is faster for lots of text because it processes in batches (less important here, and maybe irrelevant with a blank model)
  3. Counter is optimized for this kind of counting task
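To make that concrete, here is a quick usage sketch; the example sentences and the min_freq=1 setting are made up purely for illustration:

texts = ["I like apples.", "I like green apples.", "Apples are great."]
word2id, id2word = build_vocab(texts, max_vocab=10000, min_freq=1)

print(len(word2id))  # number of vocabulary entries
print(word2id)       # token -> ID, with the most frequent tokens getting the lowest IDs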

Another thing: the way you are building your vocab in your initial example, you take the first N words that reach min_freq (in order of appearance), not the N most frequent words, which is probably not what you want.
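Here is a tiny made-up illustration of that difference, using a toy token stream:

from collections import Counter

# Toy stream: 'aardvark' reaches min_freq=3 early, but 'the' is far more frequent overall.
tokens = ["aardvark"] * 3 + ["the"] * 50
min_freq, max_vocab = 3, 1

# First-to-reach-min_freq approach (as in the question): the early word claims the only slot.
freq, first_n = {}, []
for tok in tokens:
    freq[tok] = freq.get(tok, 0) + 1
    if freq[tok] == min_freq and len(first_n) < max_vocab:
        first_n.append(tok)

# Frequency-ranked approach: the most common word claims the slot.
top_n = [w for w, c in Counter(tokens).most_common(max_vocab) if c >= min_freq]

print(first_n)  # ['aardvark']
print(top_n)    # ['the']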

Finally, if you're using spaCy you generally shouldn't build a vocab this way - spaCy has its own built-in Vocab/StringStore that handles converting tokens to IDs. You might still need your own mapping for a downstream task, but look at the vocab docs to see if you can use that instead.
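As a minimal sketch of what that looks like (note these IDs are 64-bit hashes from the StringStore, not the small contiguous integers an embedding layer usually expects, which is why you may still need your own mapping):

import spacy

nlp = spacy.blank("en")
doc = nlp("The quick brown fox")

for token in doc:
    # token.orth is the hash ID spaCy stores for this token's text in the StringStore
    print(token.text, token.orth, nlp.vocab.strings[token.orth])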

polm23