I noticed that in some techniques, people convert URLs, numbers, and dates in text to placeholder tokens. Does the GloVe dataset have embeddings trained for these placeholders? Can I feed them directly into the dataset?
1 Answer
You can feed any tokens you want into a word2vec/GloVe training session.
But tokens with a lot of internal variety but little or diffuse semantic meaning (or too few examples of each individual variant) are often either elided or coalesced into a synthetic replacement token.
For example, every number might become '__NUM__' (or ranged buckets like '__1DIGITNUM__', '__2DIGITNUM__', etc.), and dates might become '__DATE__' (or buckets like '__1700s__', '__1990s__', etc.).
What any particular pre-trained model did must be checked directly with the model's creators, or by probing the tokens in the model. And of course, you should apply the same canonicalization to any entities/tokens you intend to look up against a pre-trained vector set.
So, what your set does is completely up to you if you're doing your own training, or up to the prior decisions made by a specific project, and thus only answerable with a specific project/dataset/codebase identified.
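As a minimal sketch of the idea above, here is one possible canonicalization scheme that buckets numbers by digit count and 4-digit years by decade, then checks candidate placeholder tokens against a vocabulary. The bucketing rules, token names, and the tiny in-memory vocabulary are all assumptions for illustration, not what any real GloVe release actually used; you would replace the vocabulary with the tokens loaded from your pre-trained vector file.

```python
import re

def canonicalize(token):
    """Replace high-variety tokens with synthetic placeholders.
    A sketch only: the exact scheme must match whatever the
    pre-trained model's authors actually did."""
    if re.fullmatch(r"\d{4}", token):       # crude: treat any 4-digit number as a year
        return f"__{token[:3]}0s__"         # bucket into decades, e.g. '__1990s__'
    if token.isdigit():
        return f"__{len(token)}DIGITNUM__"  # bucket remaining numbers by digit count
    return token

tokens = ["born", "in", "1994", "with", "42", "apples"]
canon = [canonicalize(t) for t in tokens]
print(canon)  # ['born', 'in', '__1990s__', 'with', '__2DIGITNUM__', 'apples']

# Probing: check which candidate placeholders a vector set knows about.
# In practice, build this set from the first column of your vectors file.
vocab = {"born", "in", "apples", "__1990s__"}  # hypothetical vocabulary
for t in canon:
    print(t, t in vocab)
```

The key point is that canonicalization at lookup time must mirror canonicalization at training time; if the pre-trained set used '<num>' and you look up '__NUM__', every query misses.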