
I'm trying to lemmatize some Korean sentences using some pretrained models. I'm very much a beginner with this sort of thing, so I'm sure I could be missing something obvious, but following examples I found for other languages and the Korean model's docs (https://spacy.io/models/ko#ko_core_news_sm) I tried:

# load the model (requires: python -m spacy download ko_core_news_sm)
import spacy
nlp = spacy.load("ko_core_news_sm")

# test on the first sentence
doc = nlp(sentences[0])
print(doc)
for token in doc:
    print(token.lemma_)

I would expect it to provide the base form of each word; if it were English, for example, something like apples -> apple.

For the Korean, however, the output of this code is WORD+affix. I can't post Korean text due to anti-spam measures, but basically, rather than providing the lemma, it appears to simply tell me how each word is composed. Am I doing something wrong, or is this simply how the model works? Is there any way to get the actual base word? Sorry if it's obvious, and thanks everyone for the help.

  • Not a Korean speaker, so apologies if I am missing specific details, but it seems like your code is perfectly correct and, as you guessed, the `ko_core_news_sm` model does not do what you'd expect it to do. The model's page has the following detail in parentheses: `lemmatizer (trainable_lemmatizer)` (https://spacy.io/models/ko). In addition, the lemmatizer API reference page does not have `ko` listed in the default lemmatizers list: https://spacy.io/api/lemmatizer My guess is that you'd either need to find a third-party lemmatizer to add to your pipeline or develop your own. – umit1010 Apr 15 '23 at 19:02
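  • Until a proper lemmatizer is wired into the pipeline, one crude workaround (my own sketch, not something the spaCy docs recommend) is to post-process the trainable lemmatizer's output: assuming the `lemma_` strings look like `STEM+affix1+affix2`, you can split on `+` and keep the first segment as a rough base form:

```python
def base_form(lemma: str) -> str:
    # Assumption: the model joins morphemes with "+", stem first.
    # Keep only the first "+"-separated segment as a rough base form.
    return lemma.split("+")[0]

# English stand-in, since Korean text can't be posted here:
print(base_form("apple+s"))  # prints "apple"
```

  This obviously discards the affix information rather than doing real morphological analysis, so treat it as a stopgap, not a substitute for an actual Korean lemmatizer.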

0 Answers