I want to train a word2vec model on the English Wikipedia using Python with gensim. I closely followed https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw for that.
It works, but what I don't like about the resulting word2vec model is that named entities are split into separate tokens, which makes the model unusable for my specific application. The model I need has to represent each named entity as a single vector.
That's why I planned to parse the Wikipedia articles with spaCy and merge entities like "north carolina" into "north_carolina", so that word2vec represents them as a single vector. So far so good.
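Roughly, the merging step I have in mind looks like the sketch below (merge_entities is just my own placeholder name, and it assumes the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")

def merge_entities(text):
    # join multi-word named entities into a single token,
    # e.g. "North Carolina" -> "north_carolina"
    doc = nlp(text)
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            if len(ent) > 1:
                retokenizer.merge(ent)
    return " ".join(tok.text.replace(" ", "_").lower() for tok in doc)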
The spaCy parsing has to be part of the preprocessing, which I originally did as recommended in the linked discussion:
...
from gensim.corpora import WikiCorpus

# wiki_bz2_file is the path to the compressed Wikipedia dump,
# output is an open text file (both defined in the code elided above)
wiki = WikiCorpus(wiki_bz2_file, dictionary={})
for text in wiki.get_texts():
    article = " ".join(text) + "\n"
    output.write(article)
...
This removes punctuation, stop words, numbers and capitalization, and saves each article on a separate line of the output file. The problem is that spaCy's NER doesn't really work on this preprocessed text, since I guess it relies on punctuation and capitalization cues (?).
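To make the problem concrete, gensim's default wiki tokenizer turns a sentence into something like this (the output shown is roughly what I'd expect, not verified character for character):

from gensim.corpora.wikicorpus import tokenize

print(tokenize("In 1789, North Carolina became a state."))
# roughly: ['in', 'north', 'carolina', 'became', 'state']
# no punctuation, numbers or capitalization left for spaCy to work with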
Does anyone know if I can "disable" gensim's preprocessing so that it doesn't remove punctuation etc., but still extracts the article text directly from the compressed Wikipedia dump? Or does someone know a better way to accomplish this? Thanks in advance!
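One idea I had, in case it helps to frame the question: newer gensim versions seem to accept a custom tokenizer_func (and a lower flag) in WikiCorpus, so something like the sketch below might keep punctuation and capitalization intact for spaCy. It's untested, and keep_raw_tokens plus the file names are just my own placeholders:

from gensim.corpora import WikiCorpus

def keep_raw_tokens(text, token_min_len, token_max_len, lower):
    # ignore gensim's length limits and lowercasing; split on whitespace
    # so punctuation and capitalization survive for spaCy's NER
    # (WikiCorpus should still strip the wiki markup before calling this)
    return text.split()

wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2",  # placeholder path
                  tokenizer_func=keep_raw_tokens, lower=False, dictionary={})

with open("wiki_raw_text.txt", "w", encoding="utf-8") as output:  # placeholder name
    for tokens in wiki.get_texts():
        output.write(" ".join(tokens) + "\n")

But I'm not sure this is the intended way to use WikiCorpus, hence the question.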