I'm looking for a lemmatizer/PoS tagger for Italian that works in Python. I tried spaCy: it works, but it's not very precise, and especially for verbs it often returns the wrong lemma. NLTK only has English as a language. I'm looking for a tool optimized for Italian. Does one exist? If not, is it possible to create one, given a corpus? How much work would that take?
1 Answer
I ran into this problem too. In my experience, one of the best Italian lemmatizers is TreeTagger. I preferred it to spaCy's lemmatizer for some projects (I also think it may be better at POS tagging). You can also test it online to see whether it fits your use case.
I found it very useful to run it inside my spaCy pipeline, just for lemmatization, so that I keep the infrastructure spaCy provides. This is how you can replace spaCy's lemmatizer with TreeTagger in Python, using treetaggerwrapper (you could easily do the same with the POS tagger):
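import spacy
from spacy.language import Language
from treetaggerwrapper import TreeTagger
...
nlp = spacy.load("it_core_news_lg")
TREETAGGER = TreeTagger(TAGDIR="path_to_treetagger", TAGLANG="it")


@Language.component("treetagger")
def treetagger(doc):
    # Pass only the non-whitespace tokens, already tokenized by spaCy.
    tokens = [token.text for token in doc if not token.is_space]
    # tagonly=True skips TreeTagger's own preprocessing and tokenization.
    tags = TREETAGGER.tag_text(tokens, tagonly=True)
    # Each tag is "word\tpos\tlemma"; ambiguous lemmas are joined with "|".
    lemmas = [tag.split("\t")[2].split("|")[0] for tag in tags]
    j = 0
    for token in doc:
        if not token.is_space:
            token.lemma_ = lemmas[j]
            j += 1
        else:
            token.lemma_ = " "
    return doc


# Swap spaCy's built-in lemmatizer for the TreeTagger component.
nlp.replace_pipe("lemmatizer", "treetagger")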
This could be a useful temporary solution.
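One detail worth handling: for words outside its lexicon, TreeTagger reports the lemma as `<unknown>`, which the component above would write straight into `token.lemma_`. A minimal sketch of a parsing helper that falls back to the surface form, and also returns the POS tag so it could be stored on the token (for example in `token.tag_`), is shown here. The name `parse_tag_line` and the sample lines are my own illustration: the samples are hand-written in TreeTagger's tab-separated format, not real tagger output.

```python
UNKNOWN_LEMMA = "<unknown>"  # marker TreeTagger emits for out-of-lexicon words


def parse_tag_line(line):
    """Split one 'word<TAB>pos<TAB>lemma' line into a (word, pos, lemma) tuple.

    Hypothetical helper, not part of treetaggerwrapper.
    """
    parts = line.split("\t")
    if len(parts) != 3:
        # Not a regular tagged token: keep the text, leave the tag empty.
        return line, "", line
    word, pos, lemma = parts
    # Ambiguous lemmas come back as "lemma1|lemma2"; keep the first candidate.
    lemma = lemma.split("|")[0]
    # Fall back to the surface form when TreeTagger does not know the lemma.
    if lemma == UNKNOWN_LEMMA:
        lemma = word
    return word, pos, lemma


# Hand-written sample lines (illustrative only, not real tagger output).
sample = [
    "mangiavano\tVER:impf\tmangiare",
    "Xyzzy\tNOM\t<unknown>",
    "sono\tVER:pres\tessere|sonare",
]
parsed = [parse_tag_line(line) for line in sample]
```

Inside the component, the list comprehension that builds `lemmas` could call this helper on each tag instead of splitting inline. Whether the surface form is the right fallback depends on your use case; lowercasing it first is another option.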

Nicola Fanelli