Questions tagged [n-gram]

An N-gram is an ordered sequence of N elements of the same kind, usually presented in a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, nucleotide bases in DNA, and amino acids in proteins. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams" (or "digrams"), 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams". N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
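The example above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `word_ngrams`, the whitespace tokenization, and the stripping of the trailing period are all assumptions made for the sketch.

```python
def word_ngrams(sentence, n=2, end_marker="#"):
    """Split a sentence into word-level n-grams.

    Assumptions for this sketch: words are separated by whitespace,
    a trailing period is dropped, and `end_marker` (if not None) is
    appended as an end-of-sentence metadata token.
    """
    tokens = sentence.rstrip(".").split()
    if end_marker is not None:
        tokens.append(end_marker)
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("Three cows eat grass.")
print(bigrams)
# [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]
```

Passing `end_marker=None` omits the metadata marker, and `n=3` yields trigrams from the same token list.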

Because a data set can be represented by the counts of its N-grams, N-gram analysis embeds the data into a vector space. This allows the application of many powerful statistical techniques for prediction, classification, and discernment of various properties.
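One common way to obtain such a vector, shown here only as a rough sketch, is to count each distinct N-gram and use the counts as coordinates. The helper names (`bigram_counts`, `to_vector`) and the two toy documents are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs (bigrams) in a token list."""
    return Counter(zip(tokens, tokens[1:]))


def to_vector(counts, vocabulary):
    """Map bigram counts onto a fixed, ordered vocabulary of bigrams."""
    return [counts[gram] for gram in vocabulary]


# Two toy documents (illustrative data only).
doc_a = "the cow eats grass and the cow sleeps".split()
doc_b = "the cow eats hay".split()

counts_a = bigram_counts(doc_a)
counts_b = bigram_counts(doc_b)

# The shared vocabulary defines the axes of the vector space.
vocabulary = sorted(set(counts_a) | set(counts_b))

vec_a = to_vector(counts_a, vocabulary)
vec_b = to_vector(counts_b, vocabulary)

# A dot product is one simple similarity measure between the documents.
overlap = sum(a * b for a, b in zip(vec_a, vec_b))
print(vec_a, vec_b, overlap)
```

Once documents are vectors of equal length, standard tools such as cosine similarity or a linear classifier can be applied to them directly.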

More information:

  1. Google's Ngram Viewer
  2. Wikipedia article
874 questions
0
votes
1 answer

creating Trigrams using LinkedHashMap java

I am trying to create a trigram model using LinkedHashMap> where Entry is the entry of last inputed bigram (whose structure is: LinkedHashMap Now the problem is, being a map it does not store multiple keys (overwrites the existing key-value pair…
mag443
  • 191
  • 1
  • 4
  • 12
0
votes
1 answer

Multiple mappings per ActiveModel/Record?

Let's say I want to create two separate indexes on something like BlogPosts, so that I can do a quick search using one index (for autocomplete purposes, for example) and then use the other index for full-blown search querying. Is that something I can do…
concept47
  • 30,257
  • 12
  • 52
  • 74
0
votes
2 answers

Java Lucene Ngrams

I want to use the Lucene API to extract ngrams from sentences. However, I seem to be running into a peculiar problem. In the JavaDoc there is a class called NGramTokenizer. I have downloaded both the 3.6.1 and 4.0 APIs and I do not see any trace of…
CodeKingPlusPlus
  • 15,383
  • 51
  • 135
  • 216
0
votes
1 answer

Using a Lucene ShingleFilter to extract frequencies of bigrams in Lucene

If a Lucene ShingleFilter can be used to tokenize a string into shingles, or ngrams, of different sizes, e.g.: "please divide this sentence into shingles" Becomes: shingles "please divide", "divide this", "this sentence", "sentence into", and "into…
Mr Morgan
  • 2,215
  • 15
  • 48
  • 78
0
votes
1 answer

How to prevent discounting to zero in calculating ngrams?

I'm using SRILM's ngram-count command line utility in an attempt to calculate a trigram model for a subset of the Gutenberg corpus. The command line is: -order 3 -kndiscount -text {$text} -lm {$lm} -gt2min 10 -gt3min 5 -vocab {$vocab}…
saigafreak
  • 405
  • 6
  • 14
0
votes
1 answer

Exact match in SOLR

I am using NGramFilterFactory. My schema is as given below
Tarun Nagpal
  • 964
  • 1
  • 9
  • 25
0
votes
2 answers

Linux dictionaries

I need files containing word lists for every available language. I searched and found that ftp.gnu.org hosts an aspell directory that contains lots of dictionaries, but when I extracted them I did not find any raw files with word data.…
5et
  • 366
  • 1
  • 3
  • 10
-1
votes
1 answer

bi-gram probability

Trying to find the probability of a phrase using bi-gram filename.txt # how many times bigram occurs bg_count = bigrams.count(('word1', 'word2')) # probability of bigram in text P(word1 word2) bg_count/number_of_bigrams
-1
votes
2 answers

How do I visualize two columns/lists of trigrams to see if the same word combinations occur in both columns/lists?

So I have two trigram lists (20 word combinations each), e.g. l1 = ('hello', 'its', 'me'), ('I', 'need', 'help') ... l2 = ('I', 'need', 'help'), ('What', 'is', 'this') ... Now I want to visualize these two lists in one diagram (maybe a pairplot) to see…
-1
votes
1 answer

How to make Dict of Ngram of my dataframe start with some string Python

I have dataframe like this id name cat subcat ------------------------------- 1 aa bb cc A a-a 2 bb cc dd B b-a 3 aa bb ee C c-a 4 aa gg cc D d-a I want to make dict of this dataframe Which…
miladjurablu
  • 91
  • 1
  • 6
-1
votes
1 answer

java.io.IOException: Spill failed in MapReduce with Combiner

I'm using Hadoop MapReduce. When running the project without local aggregation, i.e. without a Combiner class, it runs without problems. When I add the combiner class I get this message: java.lang.Exception: java.io.IOException: Spill failed In addition, the…
Maor Rocky
  • 167
  • 1
  • 7
-1
votes
1 answer

Phrase detection using PhrasesTransformer

from gensim.sklearn_api.phrases import PhrasesTransformer # Create the model. Make sure no term is ignored and combinations seen 3+ times are captured. m = PhrasesTransformer(min_count=1, threshold=3) text = [['I', 'love', 'computer', 'science',…
John
  • 331
  • 2
  • 11
-1
votes
1 answer

How can we make this python code more efficient to run huge text files?

I have created a Python file with the following code. I want the code to do the following: Extract the content from a text file, clean it of punctuation, remove non-alphabetic characters, shift to lower case Create Unigrams and Bigrams and combine them Remove…
Moses
  • 1
  • 1
  • 4
-1
votes
1 answer

PySpark - Remove white space in n-grams

I am trying to produce n-grams of 3 letters but Spark NGram inserts a white space between each letter. I want to remove (or not produce) this white space. I could explode the array, remove the white space, then reassemble the array, but this is…
Béatrice Moissinac
  • 934
  • 2
  • 16
  • 41
-1
votes
1 answer

NGram on dataset with one word

I'm dabbling with SparkML, trying to build out a fuzzy match using Spark's OOB capabilities. Along the way, I'm building NGrams with n=2. However, some lines in my dataset contain single words, where the Spark pipeline fails. Regardless of Spark,…
Sahas
  • 3,046
  • 6
  • 32
  • 53