Questions tagged [snowball]

Snowball is a small language for writing stemming algorithms, used primarily in information retrieval and natural language processing.

Snowball was created by Dr. Martin Porter as a small string processing language for writing stemming algorithms used in Information Retrieval. It was created partly to provide a canonical implementation of Porter's stemming algorithm, and partly to make it easier to write stemmers for languages other than English.

A further aim of Porter's was to provide a way of creating and defining stemmers that could readily or automatically be translated into C, Java, or other programming languages. The Snowball compiler translates a Snowball script (a .sbl file) into either a thread-safe ANSI C program or a Java program. For ANSI C, each Snowball script produces a program file and corresponding header file (with .c and .h extensions).

The name "Snowball" is a tribute to the SNOBOL programming language.
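
Stemmers written in Snowball usually restrict suffix removal to two regions of the word, R1 and R2, which several questions under this tag ask about. The following is a minimal Python sketch of how those regions can be computed from the general definition in the Snowball documentation. It is not output of the Snowball compiler: the function names and the fixed vowel set `aeiouy` are assumptions made for this illustration, and the real English stemmer adds special cases (for example, a `y` that acts as a consonant, and a few exceptional prefixes) that are ignored here.

```python
# Assumption: a fixed vowel set, as used for English; other languages differ.
VOWELS = frozenset("aeiouy")


def region_start(word, start=0):
    """Return the index just past the first non-vowel that follows a vowel,
    looking only at characters from `start` onward.

    Returns len(word) (the null region at the end of the word) when no
    such non-vowel exists.
    """
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Return the (R1, R2) regions of `word` as substrings.

    R1 is the region after the first non-vowel following a vowel.
    R2 is defined the same way, but searched for inside R1.
    """
    r1 = region_start(word)
    r2 = region_start(word, r1)
    return word[r1:], word[r2:]


if __name__ == "__main__":
    for w in ("beautiful", "beauty", "beau", "animadversion"):
        print(w, r1_r2(w))
```

For `beautiful` this gives R1 = `iful` and R2 = `ul`; for `beau` both regions are empty, since no non-vowel follows a vowel. A suffix rule in a stemmer would then typically apply only if the suffix lies entirely within R1 or R2.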

73 questions
3
votes
2 answers

Python NLTK snowball stemmer UnicodeDecodeError in terminal but not Eclipse PyDev

I'm using the Snowball stemmer to stem words in documents, as shown in the code snippet below. stemmer = EnglishStemmer() # Stem, lowercase, substitute all punctuations, remove stopwords. attribute_names = [stemmer.stem(token.lower()) for…
Maal
  • 481
  • 6
  • 19
3
votes
1 answer

Multi language full text: Which stemming [Snowball] language should be used?

Which stemming language should I use if I want to support full-text search across all languages? As far as I know, the index needs to be created with a specific stemming language to support search in that language, but this is not possible for me as…
ManojMarathayil
  • 712
  • 11
  • 28
2
votes
1 answer

error using snowball in lucene

I have added Lucene 3.5.0, and when I added a separate jar for the snowball analyzer I got the following error: Exception in thread "main" java.lang.NoSuchMethodError:…
CTsiddharth
  • 907
  • 12
  • 21
2
votes
1 answer

how to write code for Lucene snowball in Java

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29); IndexSearcher indexSearcher; File file = new File("/sdcard/index/"); Directory indexDir = FSDirectory.open(file); indexSearcher = new IndexSearcher(indexDir, true); QueryParser parser =…
Joan
  • 61
  • 2
  • 10
2
votes
1 answer

How to perform stemming and put back the words in the original review format?

I have a dataset with a column, full_text, that contains review text from a website. I want to clean these reviews by removing stop words and stemming, then put them back into their original format (having all stemmed words forming…
Adrianna
  • 45
  • 3
2
votes
1 answer

What is the correct use of stemDocument?

I have already read this and this question, but I still don't understand the use of stemDocument in tm_map. Let's follow this example: q17 <- VCorpus(VectorSource(x = c("poder", "pode")), readerControl = list(language = "pt", …
2
votes
1 answer

How to use new .sbl Snowball algorithm in Python?

I want to use a Lithuanian language stemmer in Python, but common tools like NLTK do not include Lithuanian. I did find Snowball .sbl files for Lithuanian stemmers here and here, but how can I use them in Python? What I was able…
Lukas
  • 160
  • 2
  • 8
2
votes
1 answer

Snowball Stemming: defining Null Region

I'm trying to understand the Snowball stemming algorithm. HW90 has asked a similar question with examples, but it does not cover mine. The algorithm uses two regions, R1 and R2, that are defined as follows: R1 is the region after the first non-vowel…
NewbieXXL
  • 155
  • 1
  • 1
  • 11
2
votes
3 answers

Making a wordcloud, but with combined words?

I am trying to make a word cloud of publication keywords, for example: Educational data mining; collaborative learning; computer science...etc. My current code is as follows: KeywordsCorpus <- Corpus(VectorSource(subset(Words$Author.Keywords,…
Lian Ahmad
  • 109
  • 1
  • 10
2
votes
1 answer

stemDocument in tm package not working on past tense word

I have a file 'check_text.txt' that contains "said say says make made". I'd like to perform stemming on it to get "say say say make make". I tried to use stemDocument in the tm package, as follows, but only get "said say say make made". Is there a…
yuqian
  • 257
  • 3
  • 10
2
votes
1 answer

Snowball Stemmer [Java]

I am currently using the Snowball Stemmer (Porter2) in my Java project to stem words. However, it stems words that don't necessarily need to be stemmed, or stems them too much. For example, online -> onlin, why -> whi, raise -> rais,…
John Lewis
  • 139
  • 1
  • 2
  • 15
2
votes
0 answers

PostgreSQL snowball algorithm does not work on synonyms

I created a custom config and synonyms for this config. Here are the contents of my synonym_custom.syn file: gate door These are the creation scripts: CREATE TEXT SEARCH CONFIGURATION icons (copy='english'); CREATE TEXT SEARCH DICTIONARY my_synonym ( …
Jeff_Alieffson
  • 2,672
  • 29
  • 34
2
votes
2 answers

Stemming words with R

I'm having difficulty understanding the word stemming process in R. In my example, I created the following corpus object: a <- Corpus(VectorSource("device so much more funand unlike most android torrent download clients")) So a is a[[1]]$content [1]…
Tomer
  • 23
  • 3
2
votes
0 answers

(Lucene.Net) Turkish stemmer is causing SnowballProgram to throw an index out of range exception. How to fix it?

Certain words in the Turkish stemmer are causing SnowballProgram to throw an index out of range exception. Can anybody help me solve this problem?
1
vote
2 answers

ElasticSearch: strange search behaviour when using snowball analyzer

So let's say I have an ElasticSearch index defined like this: curl -XPUT 'http://localhost:9200/test' -d '{ "mappings": { "example": { "properties": { "text": { "type": "string", "analyzer": "snowball" …
tycooon
  • 398
  • 1
  • 11