Questions tagged [lemmatization]

Lemmatization in linguistics is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item.

436 questions
10
votes
4 answers

How to solve Spanish lemmatization problems with SpaCy?

When trying to lemmatize a Spanish CSV with more than 60,000 words, SpaCy does not write certain words correctly. I understand that the model is not 100% accurate. However, I have not found any other solution, since NLTK does not bring a Spanish…
Y4RD13
  • 937
  • 1
  • 16
  • 42
10
votes
2 answers

Lemmatization with Apache Lucene

I'm developing a text analysis project using Apache Lucene. I need to lemmatize some text (transform the words to their canonical forms). I've already written the code that performs stemming. Using it, I am able to convert the following sentence The…
Kirill Simonov
  • 8,257
  • 3
  • 18
  • 42
9
votes
2 answers

Lemmatizing Italian sentences for frequency counting

I would like to lemmatize some Italian text in order to perform some frequency counting of words and further investigations on the output of this lemmatized content. I prefer lemmatizing to stemming because I could extract the word meaning…
TPPZ
  • 4,447
  • 10
  • 61
  • 106
8
votes
1 answer

How to invert the lemmatization process given a lemma and a token?

Generally, in natural language processing, we want to get the lemma of a token. For example, we can map 'eaten' to 'eat' using WordNet lemmatization. Are there any tools in Python that can invert a lemma to a certain form? For example, we map 'go' to…
Shifeng.Liu
  • 105
  • 2
  • 7
8
votes
2 answers

Lemmatization of non-English words?

I would like to apply lemmatization to reduce the inflectional forms of words. I know that for the English language WordNet provides such functionality, but I am also interested in applying lemmatization to Dutch, French, Spanish and Italian words.…
7
votes
1 answer

Wordpiece tokenization versus conventional lemmatization?

I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for…
Keshinko
  • 318
  • 1
  • 11
7
votes
2 answers

Analyze text (lemmatization, edit distance)

I need to check whether a text contains banned words. Suppose the blacklist contains the word "Forbid". The word has many forms. In the text the word can appear as, for example: "forbidding", "forbidden", "forbad". To bring the word to the initial form, I…
user348173
  • 8,818
  • 18
  • 66
  • 102
7
votes
1 answer

Solr/Lucene query lemmatization with context

I have successfully implemented a Czech lemmatizer for Lucene. I'm testing it with Solr and it works nicely at index time. But it doesn't work so well when used for queries, because the query parser doesn't provide any context (words before or…
dedek
  • 7,981
  • 3
  • 38
  • 68
7
votes
1 answer

Getting the root word using the Wordnet Lemmatizer

I need to find a common root word for all related words for a keyword extractor. How can I convert words into the same root using the Python NLTK lemmatizer? E.g.: generalized, generalization -> general; optimal, optimized -> optimize…
Shanika Ediriweera
  • 1,975
  • 2
  • 24
  • 31
7
votes
1 answer

Faster Lemmatization techniques in Python

I am trying to find a faster way to lemmatize words in a list (named text) using the NLTK WordNet Lemmatizer. Apparently this is the most time-consuming step in my whole program (I used cProfile to find this). Following is the piece of code…
7
votes
1 answer

Why does NLTK lemmatization give the wrong output even though verb.exc contains the right value?

When I open verb.exc, I can see the entry "saw see". But when I use lemmatization in code, >>> print lmtzr.lemmatize('saw', 'v') prints saw. How can this happen? Did I misunderstand something in revising WordNet?
Leo Hsieh
  • 351
  • 4
  • 12
7
votes
1 answer

Stemming unstructured text in NLTK

I tried the regex stemmer, but I get hundreds of unrelated tokens. I'm just interested in the "play" stem. Here is the code I'm working with: import nltk from nltk.book import * f = open('tupac_original.txt', 'rU') text = f.read() text1 =…
user2221429
  • 71
  • 1
  • 4
7
votes
1 answer

Looking for a database or text file of English words with their different forms

I am working on a project and I need to get the root of a given word (stemming). As you know, the stemming algorithms that don't use a dictionary are not accurate. I also tried WordNet, but it is not good for my project. I found the phpmorphy project…
Majid Darabi
  • 731
  • 6
  • 15
6
votes
2 answers

How to do lemmatization on German text?

I have a German text that I want to apply lemmatization to. If lemmatization is not possible, then I can live with stemming too. Data: This is my German text: mails=['Hallo. Ich spielte am frühen Morgen und ging dann zu einem Freund. Auf…
PParker
  • 1,419
  • 2
  • 10
  • 25
6
votes
1 answer

Does keras-tokenizer perform the task of lemmatization and stemming?

Does the Keras tokenizer provide functions such as stemming and lemmatization? If it does, how is it done? I need an intuitive understanding. Also, what does text_to_sequence do in that?
ASingh
  • 133
  • 1
  • 4