
I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for example, "playing" -> "play" + "##ing").

Right now, I have my text preprocessed using a standard tokenizer that splits on spaces / some punctuation, and then I have a lemmatizer ("playing" -> "play"). I'm wondering what the benefit of WordPiece tokenization is over standard tokenization + lemmatization. I know WordPiece helps with out-of-vocabulary words, but is there anything else? That is, even if I don't end up using BERT, should I consider replacing my tokenizer + lemmatizer with WordPiece tokenization? In what situations would that be useful?
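
For concreteness, here is roughly what the two pipelines look like side by side (a minimal sketch assuming NLTK and the Hugging Face transformers package; the model name "bert-base-uncased" is just an illustrative choice):

```python
import nltk
from nltk.stem import WordNetLemmatizer
from transformers import AutoTokenizer

nltk.download("wordnet", quiet=True)        # data needed by the lemmatizer

sentence = "the children were playing happily"

# Current pipeline: split on whitespace, then lemmatize each token as a verb
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(tok, pos="v") for tok in sentence.split()])
# e.g. ['the', 'children', 'be', 'play', 'happily'] -- inflection is collapsed

# BERT-style WordPiece tokenization
wp_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(wp_tokenizer.tokenize(sentence))
# common words usually stay whole (so 'playing' keeps its inflection); rare or
# unseen words are split into '##'-prefixed pieces, depending on the vocabulary
```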

Keshinko

1 Answer


WordPiece tokenization helps in multiple ways and should generally serve you better than a lemmatizer, for a couple of reasons:

  1. If the words 'playful', 'playing', and 'played' are all lemmatized to 'play', you lose information such as the fact that 'playing' is present tense and 'played' is past tense; that loss doesn't happen with WordPiece tokenization.
  2. WordPiece tokens cover every word, including words that do not occur in the vocabulary. An unseen word is split into word pieces, so you still get embeddings for those pieces instead of dropping the word or replacing it with an 'unknown' token (see the sketch after this list).
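
To illustrate point 2, a small sketch, assuming the transformers package; the toy word-level vocabulary is hypothetical and "bert-base-uncased" is only an illustrative model name:

```python
from transformers import AutoTokenizer

# Hypothetical fixed word-level vocabulary vs. WordPiece behaviour
word_vocab = {"the", "children", "were", "playing"}
sentence = "the children were snorkelling"

print([w if w in word_vocab else "[UNK]" for w in sentence.split()])
# -> ['the', 'children', 'were', '[UNK]']  (the unseen word is lost entirely)

wp_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(wp_tokenizer.tokenize("snorkelling"))
# -> a few '##'-prefixed pieces (the exact split depends on the learned
#    vocabulary), so every piece still maps to an embedding
```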

Using WordPiece tokenization instead of a tokenizer + lemmatizer is largely a design choice, and WordPiece tokenization should perform well. But you do have to take into account that WordPiece tokenization increases the number of tokens, which is not the case with lemmatization.
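
A quick way to see that length difference (again a sketch assuming the transformers package; the sentence is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "The researchers were re-evaluating their preprocessing pipelines"

print(len(text.split()))              # one token per whitespace-separated word
print(len(tokenizer.tokenize(text)))  # usually larger: hyphenated and rare
                                      # words are split into several pieces
```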

Ashwin Geet D'Sa
  • Thanks for the help. If I were to run a key-term extraction step using something like TF-IDF, would the extra tokens be a liability, because now "##ing" could come up as a key term? The idea I had was to run tokenization + lemmatization, put the lemmas through TF-IDF to get key terms, and then retokenize with WordPiece to embed with BERT. – Keshinko Jul 18 '19 at 13:17
    I don't know about TF-IDF as I have never used it. Sorry for this. – Ashwin Geet D'Sa Jul 18 '19 at 13:28
  • With just TF-IDF, doing only lemmatization would be the better choice; the additional '##' tokens would just be unnecessary noise in that case. But if you select key terms with TF-IDF and then tokenize with WordPiece, you lose much of the power of BERT's transformer architecture in the process, because you may be leaving out important information that could have turned out to be a key feature in the language-understanding task you are performing. – Benison Sam Feb 12 '23 at 22:41
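
For reference, the lemma → TF-IDF key-term step discussed in these comments could look roughly like this (a sketch assuming scikit-learn; the toy corpus and the top-3 cut-off are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Documents that have already been tokenized and lemmatized
lemmatized_docs = [
    "child play in the park",
    "the team play a match yesterday",
    "park close for repair",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(lemmatized_docs)

# Top 3 lemmas for the first document by TF-IDF weight, used as key terms
terms = np.array(vectorizer.get_feature_names_out())
weights = tfidf[0].toarray().ravel()
print(terms[weights.argsort()[::-1][:3]])
```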