I am currently trying to train a text classifier using spaCy and I got stuck on the following question: what is the difference between creating a blank model with spacy.blank('en') and loading a pretrained model with spacy.load('en_core_web_sm')? Just to see the difference, I wrote this code:
import spacy

text = "hello everyone, it's a wonderful day today"
# Pretrained pipeline: tokenizer plus trained components
nlp1 = spacy.load('en_core_web_sm')
for token in nlp1(text):
    print(token.text, token.lemma_, token.is_stop, token.pos_)
and it gave me the following result:
hello hello False INTJ
everyone everyone True PRON
, , False PUNCT
it -PRON- True PRON
's be True AUX
a a True DET
wonderful wonderful False ADJ
day day False NOUN
today today False NOUN
Then I tried this (for the same text):
# Blank pipeline: tokenizer and language data only
nlp2 = spacy.blank('en')
for token in nlp2(text):
    print(token.text, token.lemma_, token.is_stop, token.pos_)
and the result was:
hello hello False
everyone everyone True
, , False
it -PRON- True PRON
's 's True
a a True
wonderful wonderful False
day day False
today today False
Not only are the results different (for example, the lemma for 's is different), but there is also no POS tagging for most of the words in the blank model.
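As far as I can tell, the difference comes from which components each pipeline contains. Here is a minimal sketch of how I checked this, assuming spaCy is installed; the helper name component_names is my own and not part of spaCy:

```python
import spacy


def component_names(nlp):
    """Return the names of the pipeline components, in processing order."""
    return list(nlp.pipe_names)


# A blank pipeline only has the tokenizer and the language data
# (stop words, tokenizer exceptions), so no components are listed.
nlp_blank = spacy.blank("en")
print(component_names(nlp_blank))  # []
```

Calling the same helper on the pipeline loaded with spacy.load('en_core_web_sm') lists its trained components (such as the tagger), which would explain why pos_ is filled in there.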
So obviously I need a pretrained model for normalizing my data. But I still don't understand how this fits with training my text classifier. Should I 1) create a blank model for training the text classifier (using nlp.update()) and load a pretrained model separately for removing stop words, lemmatization and POS tagging, or 2) load only a pretrained model and use it for both normalizing the data and training the text classifier?
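To make option 1 concrete, this is roughly how I understand the setup of the blank model would look. It is only a sketch: the labels POSITIVE and NEGATIVE are placeholders I made up, no training happens here, and the version check is there because the add_pipe call differs between spaCy 2 and spaCy 3:

```python
import spacy

# Option 1: start from a blank English pipeline for the classifier.
nlp = spacy.blank("en")

if int(spacy.__version__.split(".")[0]) >= 3:
    # spaCy 3: components are added by their registered name.
    textcat = nlp.add_pipe("textcat")
else:
    # spaCy 2: create the component first, then add the object.
    textcat = nlp.create_pipe("textcat")
    nlp.add_pipe(textcat, last=True)

# Placeholder labels; a real classifier would use its own categories.
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)

print(nlp.pipe_names)
```

For option 2, I assume the same steps would apply, just starting from spacy.load('en_core_web_sm') instead of spacy.blank('en').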
Thanks in advance for any advice!