Questions tagged [n-gram]

An N-gram is an ordered collection of N elements of the same kind, usually presented in a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, base pairs in DNA, amino acids in proteins, etc. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams" (or "digrams"), 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams". N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
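The split described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming whitespace tokenization and that a trailing period only signals the end of the sentence; the function name `word_ngrams` and the `end_marker` parameter are hypothetical and do not come from any library.

```python
def word_ngrams(sentence, n=2, end_marker="#"):
    """Return the word-level n-grams of a sentence as a list of tuples."""
    # Assumption: tokens are separated by whitespace, and a trailing
    # period only marks the sentence end, so it is dropped.
    tokens = sentence.rstrip(".").split()
    # Optionally append the end-of-sentence metadata marker.
    if end_marker is not None:
        tokens.append(end_marker)
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("Three cows eat grass.")
print(bigrams)
# [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]
```

Passing `end_marker=None` leaves the marker out, and a sentence with fewer than N tokens simply yields an empty list.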

As N-gram analysis embeds the data set into a vector space, it allows the application of many powerful statistical techniques to data for prediction, classification, and discernment of various properties.
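One common way to read the vector-space statement above is that each distinct N-gram becomes a dimension and each item becomes a vector of N-gram counts. The sketch below assumes character bigrams and cosine similarity purely for illustration; the helper names (`char_ngrams`, `count_vector`, `cosine`) are hypothetical, and other weightings or distance measures are equally possible.

```python
from collections import Counter
import math


def char_ngrams(text, n=2):
    """Return the character-level n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def count_vector(grams, vocabulary):
    """Map a list of n-grams to a count vector over a fixed vocabulary."""
    counts = Counter(grams)
    return [counts[g] for g in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


a = char_ngrams("grass")  # ['gr', 'ra', 'as', 'ss']
b = char_ngrams("glass")  # ['gl', 'la', 'as', 'ss']

# One dimension per distinct bigram seen in either word.
vocabulary = sorted(set(a) | set(b))
va = count_vector(a, vocabulary)
vb = count_vector(b, vocabulary)

print(vocabulary)      # ['as', 'gl', 'gr', 'la', 'ra', 'ss']
print(va, vb)          # [1, 0, 1, 0, 1, 1] [1, 1, 0, 1, 0, 1]
print(cosine(va, vb))  # 0.5
```

Once items are represented this way, standard vector-based methods for classification or similarity search can be applied to them.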

874 questions

votes

1 answer

creating Trigrams using LinkedHashMap java

I am trying to create a trigram model using LinkedHashMap> where Entry is the entry of the last entered bigram (whose structure is: LinkedHashMap Now the problem is that, being a map, it does not store duplicate keys (it overwrites the existing key-value pair…

asked Feb 24 '13 at 15:06

mag443

votes

1 answer

Multiple mappings per ActiveModel/Record?

Let's say I want to create two separate indexes on something like BlogPosts, so that I can do a quick search using one index (for autocomplete purposes, for example) and then use the other index for full-blown search querying. Is that something I can do…

autocomplete elasticsearch tire n-gram

asked Nov 30 '12 at 08:14

concept47

30,257 reputation, 12 gold badges, 52 silver badges, 74 bronze badges

votes

2 answers

Java Lucene Ngrams

I want to use the Lucene API to extract n-grams from sentences. However, I seem to be running into a peculiar problem. In the JavaDoc there is a class called NGramTokenizer. I have downloaded both the 3.6.1 and 4.0 APIs and I do not see any trace of…

java api lucene n-gram

asked Nov 10 '12 at 05:24

CodeKingPlusPlus

15,383 reputation, 51 gold badges, 135 silver badges, 216 bronze badges

votes

1 answer

Using a Lucene ShingleFilter to extract frequencies of bigrams in Lucene

If a Lucene ShingleFilter can be used to tokenize a string into shingles, or ngrams, of different sizes, e.g.: "please divide this sentence into shingles" Becomes: shingles "please divide", "divide this", "this sentence", "sentence into", and "into…

lucene filtering n-gram

asked Sep 03 '12 at 15:01

Mr Morgan

2,215 reputation, 15 gold badges, 48 silver badges, 78 bronze badges

votes

1 answer

How to prevent discounting to zero in calculating ngrams?

I'm using SRILM's ngram-count command line utility in an attempt to calculate a trigram model for a subset of the Gutenberg corpus. The command line is: -order 3 -kndiscount -text {$text} -lm {$lm} -gt2min 10 -gt3min 5 -vocab {$vocab}…

nlp n-gram

asked Jul 19 '12 at 15:40

saigafreak

votes

1 answer

Exact match in SOLR

I am using NGramFilterFactory. My schema is as given below …

solr lucene n-gram

asked May 31 '12 at 14:50

Tarun Nagpal

votes

2 answers

Linux dictionaries

I need files containing word lists for every available language. I searched for that and found that ftp.gnu.org hosts an aspell directory that contains lots of dictionaries, but when I extracted them I did not find any raw files with word data.…

linux dictionary n-gram

asked May 25 '12 at 05:39

5et

-1

votes

1 answer

bi-gram probability

Trying to find the probability of a phrase using bi-gram filename.txt # how many times bigram occurs bg_count = bigrams.count(('word1', 'word2')) # probability of bigram in text P(word1 word2) bg_count/number_of_bigrams

python nlp artificial-intelligence probability n-gram

asked Jul 18 '22 at 11:46

Nabila Eusha

-1

votes

2 answers

How do I visualize two columns/lists of trigrams to see if the same word combination occurs in both columns/lists?

So I have two trigram lists (20 word combinations each), e.g. l1 = ('hello', 'its', 'me'), ('I', 'need', 'help') ... l2 = ('I', 'need', 'help'), ('What', 'is', 'this') ... Now I want to visualize these two lists in one diagram (maybe a pairplot) to see…

python matplotlib seaborn visualization n-gram

asked Dec 07 '21 at 10:42

NeedPythonHelp

-1

votes

1 answer

How to make a dict of the n-grams in my dataframe that start with some string (Python)

I have a dataframe like this:

id  name      cat  subcat
-------------------------------
1   aa bb cc  A    a-a
2   bb cc dd  B    b-a
3   aa bb ee  C    c-a
4   aa gg cc  D    d-a

I want to make a dict of this dataframe which…

python scikit-learn n-gram

asked Jul 03 '21 at 07:07

miladjurablu

-1

votes

1 answer

java.io.IOException: Spill failed in MapReduce with Combiner

I'm using Hadoop MapReduce. When running the project without local aggregation, i.e. without a Combiner class, it runs without problems. When I add the Combiner class I get this message: java.lang.Exception: java.io.IOException: Spill failed In addition, the…

java hadoop mapreduce amazon-emr n-gram

asked May 31 '20 at 21:31

Maor Rocky

-1

votes

1 answer

Phrase detection using PhrasesTransformer

from gensim.sklearn_api.phrases import PhrasesTransformer # Create the model. Make sure no term is ignored and combinations seen 3+ times are captured. m = PhrasesTransformer(min_count=1, threshold=3) text = [['I', 'love', 'computer', 'science',…

nlp gensim n-gram phrase

asked Apr 28 '20 at 04:26

John

-1

votes

1 answer

How can we make this python code more efficient to run huge text files?

I have created a Python file with the following code. I want the code to do the following: extract the content from a text file, clean it of punctuation, remove non-alphabetic characters, and convert it to lower case; create unigrams and bigrams and combine them; remove…

python n-gram stop-words

asked Apr 25 '20 at 06:00

Moses

-1

votes

1 answer

PySpark - Remove white space in n-grams

I am trying to produce n-grams of 3 letters but Spark NGram inserts a white space between each letter. I want to remove (or not produce) this white space. I could explode the array, remove the white space, then reassemble the array, but this is…

python apache-spark pyspark n-gram

asked Mar 02 '20 at 20:54

Béatrice Moissinac

-1

votes

1 answer

NGram on dataset with one word

I'm dabbling with SparkML, trying to build out a fuzzy match using Spark's OOB capabilities. Along the way, I'm building NGrams with n=2. However, some lines in my dataset contain single words, on which the Spark pipeline fails. Regardless of Spark,…

apache-spark nlp apache-spark-mllib apache-spark-ml n-gram

asked Feb 25 '20 at 01:49

Sahas

3,046 reputation, 6 gold badges, 32 silver badges, 53 bronze badges
