
I am currently trying to train a text classifier using spacy and I got stuck on the following question: what is the difference between creating a blank model with spacy.blank('en') and loading a pretrained model with spacy.load('en_core_web_sm')? Just to see the difference, I wrote this code:

text = "hello everyone, it's a wonderful day today"

nlp1 = spacy.load('en_core_web_sm')
for token in nlp1(text):
    print(token.text, token.lemma_, token.is_stop, token.pos_)

and it gave me the following result:

hello hello False INTJ
everyone everyone True PRON
, , False PUNCT
it -PRON- True PRON
's be True AUX
a a True DET
wonderful wonderful False ADJ
day day False NOUN
today today False NOUN

Then I tried this (for the same text)

nlp2 = spacy.blank('en')
for token in nlp2(text):
    print(token.text, token.lemma_, token.is_stop, token.pos_)

and the result was

hello hello False
everyone everyone True
, , False
it -PRON- True PRON
's 's True
a a True
wonderful wonderful False
day day False
today today False

Not only are the results different (for example, the lemma for 's is different), but there is also no POS tagging for most of the words in the blank model.

So obviously I need a pretrained model for normalizing my data. But I still don't understand how this should work with my text classifier. Should I 1) create a blank model for training the text classifier (using nlp.update()) and load a pretrained model for removing stop words, lemmatization and POS tagging, or 2) only load a pretrained model for both normalizing and training my text classifier?

Thanks in advance for any advice!


1 Answer


If you are using spacy's text classifier, then it is fine to start with a blank model. The TextCategorizer doesn't use features from any other pipeline components.
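For example, a minimal sketch of that setup might look like the following (spaCy 2.x API; the labels and the two training examples are invented purely for illustration):

import spacy
import random

# build a blank English pipeline with only a TextCategorizer
nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
nlp.add_pipe(textcat)
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# toy training data in the (text, annotations) format nlp.update() expects
train_data = [
    ("it's a wonderful day today", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("what a terrible day", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]

optimizer = nlp.begin_training()
for i in range(10):
    random.shuffle(train_data)
    losses = {}
    for text, annotations in train_data:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(losses)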

If you're using spacy to preprocess data for another text classifier, then you would need to decide which components make sense for your task. The pretrained models load a tagger, parser, and NER model by default.
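You can check which components a loaded model provides by inspecting nlp.pipe_names (the list in the comment is what I'd expect for en_core_web_sm, so treat it as illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # typically ['tagger', 'parser', 'ner']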

The lemmatizer, which isn't implemented as a separate component, is the most complicated part of this. It tries to provide the best results with the available data and models:

  • If you don't have the package spacy-lookups-data installed and you create a blank model, you'll get the lowercase form as a default/dummy lemma.

  • If you have the package spacy-lookups-data installed and you create a blank model, it will automatically load lookup lemmas if they're available for that language.

  • If you load a provided model and the pipeline includes a tagger, the lemmatizer switches to a better rule-based lemmatizer if one is available in spacy for that language (currently: Greek, English, French, Norwegian Bokmål, Dutch, Swedish). The provided models also always include the lookup data for that language so it can be used when the tagger isn't run.

If you want to get the lookup lemmas from a provided model, you can see them by loading the model without the tagger:

import spacy
nlp = spacy.load("en_core_web_sm", disable=["tagger"])

In general, the lookup lemma quality is not great (there's no information to help with ambiguous cases), and the rule-based lemmas will be a lot better. However, it does take additional time to run the tagger, so you can choose lookup lemmas to speed things up if the quality is good enough for your task.
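As a rough illustration of the difference (the exact lemmas depend on the model version, so treat the comments as expectations rather than guaranteed output):

import spacy

nlp_rules = spacy.load("en_core_web_sm")                       # tagger runs -> rule-based lemmas
nlp_lookup = spacy.load("en_core_web_sm", disable=["tagger"])  # no tagger -> lookup lemmas

text = "She left the meeting"
print([t.lemma_ for t in nlp_rules(text)])   # lemmas informed by the POS tags
print([t.lemma_ for t in nlp_lookup(text)])  # lookup lemmas, no POS information for ambiguous words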

And if you're not using the parser or NER model for preprocessing, you can speed things up by disabling them:

nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])
  • Thanks for your answer! Just to summarize: if I want to build a text categorization model for German with the following preprocessing steps (removing stop words, using lemmatization and deleting some parts of speech), I should create a blank model for training and load a pretrained one for preprocessing? But I should also disable the NER component in this pretrained model and install the lookup table? Is that correct? – Oleg Ivanytskyi Apr 08 '20 at 09:23
  • You only really need to disable things if speed turns into a problem. If you're using spacy's `TextCategorizer` I would try it first without any preprocessing. It doesn't expect any preprocessing and I'm not sure if this kind of preprocessing (the German lookup lemmatizer is not great) is going to help. (It might for a particular task, though. I don't really know since I've never used it with any preprocessing.) – aab Apr 08 '20 at 09:29
  • The problem is that I have more than 15 million articles in German and 21 categories for them, so it works really slowly. Accuracy is also worse when using no preprocessing. Now (using the steps I described in a previous comment) I have accuracy of 57%. Without lemmatization it is less than 40%. Maybe you have some advice on what else I can do to improve my model? – Oleg Ivanytskyi Apr 08 '20 at 09:41
  • Spacy's `TextCategorizer` is mainly intended for much smaller datasets. We'd typically recommend using something like Vowpal Wabbit instead. You might still want to do the preprocessing with spacy, of course. – aab Apr 08 '20 at 10:14