Questions tagged [stemming]

The process of reducing inflected words to their stem.

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form.
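To make the definition concrete, here is a minimal sketch of suffix stripping in Python. It is a toy illustration only: the suffix list, the `min_stem_len` threshold, and the function name `naive_stem` are assumptions made for this example, and it is not the Porter or Snowball algorithm.

```python
# Toy suffix stripper: illustrates the idea of stemming, nothing more.
# The suffix list and the minimum stem length are arbitrary choices.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("jumping", "jumped", "jumps", "sing"):
        print(w, "->", naive_stem(w))
```

Real stemmers add many more rules and conditions, for example to avoid stripping a suffix when too little of the word would remain.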

531 questions
2
votes
1 answer

Removing hyphens in http but preserving hyphenated words in corpus

I am trying to modify a stemming function so that it 1) removes hyphens in http (which appear in the corpus) while 2) preserving hyphens in meaningful hyphenated expressions (e.g., time-consuming, cost-prohibitive,…
Chris T.
2
votes
0 answers

Stemming French text with NLTK

I'm trying to stem a French text with NLTK: europe amérique nord fruits espèce fraisier bois petite taille connus depuis antiquité romains consommaient utilisaient produits cosmétiques raison odeur agréable cultivée jardins européens fraisier…
marin
2
votes
2 answers

Remove punctuation but keep hyphenated phrases in R text cleaning

Is there an effective way to remove punctuation from text while keeping hyphenated expressions, such as "accident-prone"? I used the following function to clean my text: clean.text = function(x) { # remove rt x = gsub("rt ", "", x) # remove at x…
Chris T.
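One possible approach to the question above, sketched in Python rather than R: treat a hyphen as punctuation only when it is not surrounded by word characters on both sides. The function name `strip_punct_keep_hyphens` and the exact regular expression are assumptions for illustration, not the asker's code or an accepted answer.

```python
import re

# Match any character that is not a word character, whitespace, or hyphen,
# plus any hyphen that is not sandwiched between two word characters.
_PUNCT = re.compile(r"[^\w\s-]|(?<!\w)-|-(?!\w)")


def strip_punct_keep_hyphens(text):
    """Replace punctuation with spaces but keep intra-word hyphens."""
    cleaned = _PUNCT.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", cleaned).strip()


if __name__ == "__main__":
    print(strip_punct_keep_hyphens("He is accident-prone, really - very!"))
```

Note that this sketch also splits contractions such as "don't", because the apostrophe is treated like any other punctuation mark; whether that is acceptable depends on the corpus.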
2
votes
1 answer

How to apply a custom stemmer before passing the training corpus to TfidfVectorizer in sklearn?

Here is my code. I have a sentence that I want to tokenize and stem before passing it to TfidfVectorizer to get a tf-idf representation of the sentence: from sklearn.feature_extraction.text import TfidfVectorizer import nltk from…
2
votes
1 answer

How to use new .sbl Snowball algorithm in Python?

I want to use a Lithuanian stemmer in Python, but common tools like NLTK do not support Lithuanian. I did find Snowball .sbl files for Lithuanian stemmers here and here, but how do I use them in Python? What I was able…
Lukas
2
votes
2 answers

Get the word from stem (stemming)

I am using the Porter stemmer as follows to get the stems of my words. from nltk.stem.porter import PorterStemmer stemmer = PorterStemmer() def stem_tokens(tokens, stemmer): stemmed = [] for item in tokens: …
user8871463
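Stemming is lossy, so a stem generally cannot be turned back into a single original word. One common workaround, sketched here under assumptions, is to record which surface forms produced each stem while stemming. `build_stem_index` and `toy_stem` are hypothetical names for this example; `toy_stem` is a stand-in, and any real stemmer's `stem` method could be passed instead.

```python
from collections import defaultdict


def build_stem_index(tokens, stem):
    """Map each stem to the set of original tokens that produced it."""
    index = defaultdict(set)
    for token in tokens:
        index[stem(token)].add(token)
    return index


def toy_stem(word):
    # Stand-in stemmer for illustration only: drops a trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


if __name__ == "__main__":
    index = build_stem_index(["cats", "cat", "dogs"], toy_stem)
    print(sorted(index["cat"]))
```

Looking up a stem then returns every word form seen for it, and the caller can choose one, for example the most frequent or the shortest.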
2
votes
2 answers

Russian Porter stemming in JavaScript

Does someone have an example of Russian Porter stemming in JavaScript?
Semen
2
votes
0 answers

R language - stem completion in Italian

I have a large corpus of Italian text to analyze in R. Almost all of the preprocessing can easily be adapted to my native language with a couple of default libraries. The problem is that I can't find a way to implement a…
Cristiano
2
votes
3 answers

Ruby: is there a stemmer that "knows" English irregular verbs?

There is a Ruby stemmer, https://github.com/aurelian/ruby-stemmer, but it 1) does not stem English irregular verbs and 2) fails to build native extensions on Windows. Is there an alternative that fixes at least one of these problems?
Alexey
2
votes
2 answers

MarkLogic generic language support

As per the documentation: The generic language support only stems words to themselves, queries in these languages will not include variations of words based on their meanings in the results. xdmp:document-insert("/lang/test.xml",
Yash
2
votes
1 answer

Word stemming in R

I am working on a text mining project and trying to clean the text: words in singular and plural forms, verbs in different tenses, and misspelled words. My sample looks like this: test <-…
Ran Tao
2
votes
0 answers

Avoiding specific words in word stemming with tm package

A previous post addressed this issue here: Text-mining with the tm-package - word stemming. However, I am still running into challenges with the tm package. My goal is to stem a large corpus of words, but I want to avoid stemming specific words.…
kdudeIA
2
votes
1 answer

Python Snowball Stemmer + RAKE: generates 'u's

I am trying to extract keywords from a text file, and I'm stemming the text first. The code below works, but for some reason it puts the letter 'u' in front of each keyword in the list. E.g. this is what I get: [(u'keyword1', 5),…
user7443687
2
votes
1 answer

How to provide (or generate) tags for nltk lemmatizers

I have a set of documents that I would like to transform into a form that lets me compute tf-idf for the words in those documents (so that each document is represented by a vector of tf-idf values). I thought that it is enough to…
Zbyszek M.
2
votes
1 answer

Snowball Stemming: defining Null Region

I'm trying to understand the Snowball stemming algorithm. HW90 asked a similar question with examples, but it does not cover my case. The algorithm uses two regions, R1 and R2, that are defined as follows: R1 is the region after the first non-vowel…
NewbieXXL
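For readers following the R1/R2 question, here is a short sketch of how the two regions can be computed. It assumes the general Snowball definition (R1 starts after the first non-vowel that follows a vowel; R2 is found the same way inside R1; either is the null region at the end of the word if no such non-vowel exists). The vowel set `aeiouy` and the helper names are assumptions for this example, and the language-specific exceptions that individual Snowball stemmers add are not handled.

```python
VOWELS = set("aeiouy")  # assumed vowel set; real stemmers define their own


def region_start(word, start=0, vowels=VOWELS):
    """Index just past the first non-vowel that follows a vowel.

    The search begins at `start`; len(word) is returned when no such
    position exists, which corresponds to the null region.
    """
    for i in range(start + 1, len(word)):
        if word[i] not in vowels and word[i - 1] in vowels:
            return i + 1
    return len(word)


def r1_r2(word):
    """Return the R1 and R2 regions of `word` as strings."""
    r1 = region_start(word)
    r2 = region_start(word, r1)  # same rule, applied inside R1
    return word[r1:], word[r2:]


if __name__ == "__main__":
    for w in ("beautiful", "beauty", "beau", "animadversion"):
        print(w, r1_r2(w))
```

An empty string in the result corresponds to the null region: for "beau" there is no non-vowel after a vowel, so both R1 and R2 are empty.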