Questions tagged [stemming]

The process for reducing inflected words to their stem.

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, which is generally a written word form.
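As a rough illustration of the definition above, here is a minimal sketch of suffix stripping in Python. The rule set and the name `toy_stem` are made up for this example; this is not the Porter or Snowball algorithm, and real stemmers apply many more rules and conditions.

```python
# Toy suffix-stripping stemmer: illustrative only, NOT Porter or Snowball.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny rule set


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["jumping", "jumped", "jumps", "jump"]
print([toy_stem(w) for w in words])  # every form is reduced to "jump"
```

Note that the result is a stem, not necessarily a dictionary word, which is one way stemming differs from lemmatization.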

531 questions
3
votes
3 answers

Stemming does not work properly for MongoDB text index

I am trying to use the full-text search feature of MongoDB and am observing some unexpected behavior. The problem is related to the "stemming" aspect of the text indexing feature. The way full-text search is described in many articles online, if you have a…
Michael Smolyak
3
votes
1 answer

Stemming in Text Classification - Degrades Accuracy?

I am implementing a text classification system using Mahout. I have read that stop-word removal and stemming help to improve the accuracy of text classification. In my case, removing stop words gives better accuracy, but stemming is not helping much. I…
3
votes
2 answers

Stemming some plurals with wordnet lemmatizer doesn't work

Hi, I have a problem with nltk (2.0.4): I'm trying to stem the word 'men' or 'teeth', but it doesn't seem to work. Here's my code: import nltk from nltk.corpus import…
BlackOwl
3
votes
1 answer

Text Classification - using stemmer degrades results?

There's this article about sentiment analysis of Arabic. In the beginning of page 5 it says that: "Experiments also show that stemming words before feature extraction and classification nearly always degrades the results". Later on in the same…
Cheshie
3
votes
1 answer

Sphinx morphology stem_en not working

I have a single-field Sphinx index with stemming set up as follows: index main_sphinxalert { # Options: type = rt path = /var/lib/sphinxsearch/data/main_sphinxalert morphology = stem_en #…
awidgery
3
votes
1 answer

Configuring Custom Lucene Analyzer to accept certain stop words

I need to modify the Lucene analyzer so that it recognizes the word "Ben" (a Dutch stop word). Kindly guide me further. How do I make the Lucene analyzer accept this word as a regular word? Repository.xml for…
3
votes
1 answer

How does Word find matching word forms in Advanced Search?

I have a Word document that has occurrences of both "perform" and "performance". When I use the advanced find tool in the Word UI (with the goal of eventually translating this to the Find.Execute command for programmatic searching in C#), I get different results…
Chris W.
3
votes
1 answer

Multi language full text: Which stemming [Snowball] language should be used?

Which stemming language should I use if I want to support full-text search in all languages? As far as I know, the index needs to be created using a specific stemming language to support search in that language, but this is not possible for me as…
ManojMarathayil
3
votes
2 answers

Does stemming harm precision in text classification?

I have read that stemming harms precision but improves recall in text classification. How does that happen? When you stem, you increase the number of matches between the query and the sample documents, right?
samsamara
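One way to see the trade-off asked about here is a small retrieval experiment. The sketch below is only an illustration: the four documents, the relevance labels, and the crude `toy_stem` function are all invented for this example, and it models query matching rather than a full classifier.

```python
def toy_stem(word):
    """Hypothetical crude stemmer: strip one suffix, keep at least three characters."""
    for suffix in ("ions", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Made-up corpus; d4 shares a stem with the query but is labeled off-topic.
DOCS = {
    "d1": "connect the cable",
    "d2": "connecting two routers",
    "d3": "connected devices list",
    "d4": "connections in high society",
}
RELEVANT = {"d1", "d2", "d3"}  # hand-labeled for the query "connect"


def search(query, normalize):
    """Return ids of documents containing the query term after normalization."""
    q = normalize(query)
    return {
        doc_id
        for doc_id, text in DOCS.items()
        if q in {normalize(token) for token in text.split()}
    }


def precision_recall(retrieved):
    hits = len(retrieved & RELEVANT)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(RELEVANT)
    return precision, recall


exact = search("connect", lambda w: w)  # no stemming
stemmed = search("connect", toy_stem)   # stem both query and documents

print("exact:  ", sorted(exact), precision_recall(exact))
print("stemmed:", sorted(stemmed), precision_recall(stemmed))
```

With these made-up labels, exact matching retrieves only d1 (precision 1.0, recall 1/3), while stemming retrieves all four documents (precision 0.75, recall 1.0). The extra matches raise recall, and the off-topic match that shares a stem lowers precision.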
2
votes
1 answer

Strange behavior of Lucene SpanishAnalyzer class with accented words

I'm using the SpanishAnalyzer class in Lucene 3.4. When I parse accented words, I get a strange result. If I parse, for example, these two words: "comunicación" and "comunicacion", the stems I'm getting are "comun" and "comunicacion".…
Max
2
votes
3 answers

How to use stemDocument in the R language tm (text mining) package?

I am trying to stem a Corpus using stemDocument in the R language tm package which calls Java. I have tried the example in the tm manual: data("crude") crude[[1]] stemDocument(crude[[1]]) and get the following error: Could not initialize the…
user974490
2
votes
1 answer

NLP - Worse result when adding stemming or lemmatization for Sentiment Analysis

I'm trying to create a full pipeline of results for sentiment analysis on a smaller subset of the IMDB reviews (only 2k pos, 2k neg), so I'm trying to show results at each stage, i.e. without any pre-processing, then basic cleaning (remove specials,…
2
votes
1 answer

Solr does not provide existing result

I hope you can help me, because this problem is driving me crazy. To keep it simple, I have documents with fields named name_text_de_de which have the following…
Fide
2
votes
1 answer

How to perform stemming and put back the words in the original review format?

I have a dataset with one column being full_text that contains review text from an online website. I wanted to clean these reviews, by removing stop words and stemming and putting them back to their original format (having all stemmed words forming…
Adrianna
2
votes
1 answer

Trying to convert plural words to singular words using regex but want to ignore a few words

I am currently trying to convert some plural words to singular in BigQuery, for example removing the "s" from "birds" to get "bird", but I want the replacement to ignore a few words like "less", "james", "this". I was able to come up with this which ignores the…
Kishan Kumar
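The idea in this question can be sketched in Python with the standard `re` module. This is not BigQuery SQL and not the asker's (truncated) expression; the function name `singularize` and the extra rule for words ending in "ss" are assumptions made for this example.

```python
import re

# Words named in the question that must keep their trailing "s".
EXCEPTIONS = {"less", "james", "this"}

WORD = re.compile(r"\b[A-Za-z]+\b")


def singularize(text: str) -> str:
    """Naively drop a trailing 's' from each word unless it is an exception."""

    def repl(match):
        word = match.group(0)
        lower = word.lower()
        # Keep exceptions, words without a trailing "s", and words ending in "ss".
        if lower in EXCEPTIONS or not lower.endswith("s") or lower.endswith("ss"):
            return word
        return word[:-1]

    return WORD.sub(repl, text)


print(singularize("birds eat less than james"))
```

Looking each word up in an exception set inside a replacement function avoids packing the exceptions into the pattern itself. The rule is still naive: it also trims verbs such as "thinks" and does not handle plurals like "boxes" or "cities".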