
I want to train a word2vec model on the English Wikipedia using Python with gensim. I closely followed https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw for that.

It works for me, but what I don't like about the resulting word2vec model is that named entities are split into separate words, which makes the model unusable for my specific application. The model I need has to represent each named entity as a single vector.

That's why I planned to parse the Wikipedia articles with spaCy and merge entities like "north carolina" into "north_carolina", so that word2vec represents them as a single vector. So far so good.

The spaCy parsing has to be part of the preprocessing, which I originally did as recommended in the linked discussion using:

from gensim.corpora import WikiCorpus

wiki = WikiCorpus(wiki_bz2_file, dictionary={})  # wiki_bz2_file: path to the compressed dump
with open(output_file, "w", encoding="utf-8") as output:  # output_file: plain-text destination
    for text in wiki.get_texts():
        article = " ".join(text) + "\n"  # one preprocessed article per line
        output.write(article)

This removes punctuation, stop words, numbers, and capitalization, and saves each article on a separate line in the resulting output file. The problem is that spaCy's NER doesn't really work on this preprocessed text, presumably because it relies on punctuation and capitalization cues.
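A quick check with gensim's own Wikipedia tokenizer (importable like this in recent gensim releases) illustrates what spaCy ends up receiving:

from gensim.corpora.wikicorpus import tokenize

# Casing and punctuation are gone before spaCy ever sees the text:
print(tokenize("Raleigh is the capital of North Carolina."))
# ['raleigh', 'is', 'the', 'capital', 'of', 'north', 'carolina']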

Does anyone know if I can "disable" gensim's preprocessing so that it doesn't remove punctuation etc., but still extracts the article text directly from the compressed Wikipedia dump? Or does someone know a better way to accomplish this? Thanks in advance!

marlonfl
  • Update: I copied the WikiCorpus class from gensim to add the spaCy operations myself where I need them. For now this seems to be the way to go if you want to change the way gensim preprocesses the Wikipedia text (see the sketch after these comments). – marlonfl Apr 20 '17 at 00:44
    By the way, if you already implemented this and solved the issue it would be great if you could share the approach here. – sophros Sep 06 '17 at 09:05
  • Possible duplicate of [How to get the wikipedia corpus text with punctuation by using gensim wikicorpus?](https://stackoverflow.com/questions/50697092/how-to-get-the-wikipedia-corpus-text-with-punctuation-by-using-gensim-wikicorpus) – Ali Abul Hawa Mar 10 '19 at 15:53
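A minimal sketch of the idea in that update, assuming a newer gensim whose WikiCorpus exposes the tokenizer_func and lower hooks (with older gensim you copy the class as described); keep_raw_tokens is a hypothetical name:

from gensim.corpora import WikiCorpus

def keep_raw_tokens(content, token_min_len, token_max_len, lower):
    # No lowercasing and no punctuation stripping, so spaCy keeps its NER cues;
    # run the spaCy operations here or on the written-out text later.
    return content.split()

# lower=False preserves capitalization; the custom tokenizer bypasses the default filtering.
wiki = WikiCorpus(wiki_bz2_file, dictionary={},
                  tokenizer_func=keep_raw_tokens, lower=False)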

2 Answers


I wouldn't be surprised if spaCy operates at the level of sentences. For that it very likely uses sentence boundaries (periods, question marks, etc.). That is why spaCy's NER (or maybe even the POS tagger earlier in the pipeline) might be failing for you.

As for the way to represent named entities for gensim's word2vec, I would recommend substituting an artificial identifier (a non-existent word) for each entity. From the perspective of the model it does not make any difference, and it may save you the burden of reworking gensim's preprocessing.

To check that an identifier does not collide with a real word, you may refer to model.wv.vocab, where model = gensim.models.Word2Vec(...); for that you would have to train the model twice. Alternatively, build a vocabulary set from the raw text and pick a random string of letters that does not already occur in it.
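A minimal sketch of the two-pass approach, assuming sentences is your tokenized corpus; model.wv.vocab is the gensim 3.x attribute mentioned above (gensim 4.x renames it to model.wv.key_to_index):

import random
import string

import gensim

model = gensim.models.Word2Vec(sentences)  # first pass, only to obtain the vocabulary
vocab = set(model.wv.vocab)

def artificial_identifier(length=12):
    # Draw random lowercase strings until one is not an existing word.
    while True:
        candidate = "".join(random.choices(string.ascii_lowercase, k=length))
        if candidate not in vocab:
            return candidate

# Substitute each entity mention with its identifier in the corpus, then retrain.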

sophros

You can use a gensim word2vec pretrained model in spaCy, but the problem here is the order of your processing pipeline:

  1. You pass the texts to gensim
  2. Gensim parses and tokenizes the strings
  3. You normalize the tokens
  4. You pass the tokens back to spaCy
  5. You make a w2v corpus (with spaCy) (?)

That means the docs are already tokenized when spaCy gets them, and yes, its NER is... complex: https://www.youtube.com/watch?v=sqDHBH9IjRU

What you'd probably like to do instead is the following (sketched in code after the list):

  1. You pass the texts to spaCy
  2. spaCy parses them with NER
  3. spaCy tokenizes them accordingly, keeping entities as one token
  4. You load the gensim w2v model with spacy.load()
  5. You use the loaded model to create the w2v corpus in spaCy
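A minimal sketch of steps 1-3, assuming spaCy 2.x or later and a downloaded English model (python -m spacy download en_core_web_sm); entity_tokens is a hypothetical helper:

import spacy

nlp = spacy.load("en_core_web_sm")

def entity_tokens(text):
    doc = nlp(text)
    # Collapse each multi-word entity span into a single token.
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            if len(ent) > 1:
                retokenizer.merge(ent)
    return [t.text.lower().replace(" ", "_")
            for t in doc if not t.is_punct and not t.is_space]

print(entity_tokens("Raleigh is the capital of North Carolina."))
# e.g. ['raleigh', 'is', 'the', 'capital', 'of', 'north_carolina']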

All you need to do is download the model from gensim and tell spaCy to convert it from the command line:

  1. wget [url to model]
  2. python -m spacy init-model [lang] [output_dir] --vectors-loc [file you just downloaded]

Here is the command line documentation for init-model: https://spacy.io/api/cli#init-model

Then load it just like en_core_web_md, e.g. with spacy.load(). You can use .txt, .zip or .tgz vector files.
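For example, assuming the init-model command above wrote its output to a hypothetical ./gensim_w2v directory:

import spacy

nlp = spacy.load("./gensim_w2v")  # hypothetical output directory from init-model
doc = nlp("north_carolina")
print(doc[0].vector[:5])  # the gensim-trained vector, now served by spaCy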

Ray Johns