Questions tagged [n-gram]

An N-gram is an ordered sequence of N elements of the same kind, usually drawn from a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, nucleotides in DNA, and amino acids in proteins. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams" (or "digrams"), 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams". N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
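
The sentence example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `ngrams` and the `end_marker` parameter are illustrative assumptions, not part of any particular library, and a single `#` is appended to stand in for the end-of-sentence marker.

```python
def ngrams(tokens, n, end_marker="#"):
    """Return the list of n-grams (as tuples) over tokens plus one end marker."""
    # Append a single metadata marker for the end of the sentence (an assumption
    # matching the example; other conventions pad with n - 1 markers).
    padded = list(tokens) + [end_marker]
    # Slide a window of width n across the padded token list.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


words = "Three cows eat grass".split()
bigrams = ngrams(words, 2)
print(bigrams)
# [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]
```

Tuples are used here because, unlike sets, they preserve the order of the elements, which is what distinguishes one N-gram from another.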

Because N-gram counts can be treated as coordinates in a vector space, N-gram analysis allows many powerful statistical techniques to be applied to the data for prediction, classification, and discernment of various properties.
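
One common way to read the "vector space" statement is as a count vector: each distinct N-gram becomes one dimension, and each document is represented by how often that N-gram occurs in it. The sketch below assumes whitespace tokenization, lowercasing, and word bigrams; the helper name `bigram_counts` and the two sample documents are made up for illustration.

```python
from collections import Counter


def bigram_counts(text):
    """Count word bigrams in a text (lowercased, whitespace tokenized)."""
    words = text.lower().split()
    # zip pairs each word with its successor, yielding the bigrams in order.
    return Counter(zip(words, words[1:]))


docs = ["the cow eats grass", "the cow eats hay"]
counts = [bigram_counts(d) for d in docs]

# The vocabulary (one dimension per distinct bigram) is the sorted union
# of all bigrams seen in any document.
vocab = sorted(set().union(*counts))

# Each document becomes a vector of bigram counts over that vocabulary.
vectors = [[c[g] for g in vocab] for c in counts]

print(vocab)
# [('cow', 'eats'), ('eats', 'grass'), ('eats', 'hay'), ('the', 'cow')]
print(vectors)
# [[1, 1, 0, 1], [1, 0, 1, 1]]
```

Once documents are represented this way, standard vector-based methods (distance measures, classifiers, and so on) can be applied to them.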

More information:

  1. Google's Ngram Viewer
  2. Wikipedia article
874 questions
0
votes
1 answer

Solr - Return word NGrams, even with mixed word order

I haven't been able to find a resource which explains a means by which I can return the most common word NGrams which do not depend on word order, and have flexible word position boundaries. I think this concept is analogous to having slop in the…
0
votes
1 answer

Database schema for storing ngrams with multiple element search

I want to store a large number of ngrams on disk in such a way that I can perform the following queries on it: Fetch all ngrams Fetch all ngrams of a certain size Fetch all ngrams which contain all these given elements in any position…
mtanti
  • 794
  • 9
  • 25
0
votes
2 answers

Next-Word Prediction Engines - which branch of AI do they belong

Next-word prediction or phrase-prediction engines used in modern keyboards on mobiles and tablets, like SwiftKey and XT9, which predict the next word the user is going to type based on some pre-defined or dynamic corpus, based on n-grams (maximum…
0
votes
1 answer

Write output of two different Hadoop jobs to same set of reducers

I have a scenario where I need to run two Hadoop jobs calculating n-gram statistics for two different corpora and make sure that they write each n-gram (and its score) to the same reducer (so that in future I can read the data locally and compare…
abhinavkulkarni
  • 2,284
  • 4
  • 36
  • 54
0
votes
2 answers

ES Search partial word - ngram?

I am using Elastic Search to index entities that contain two fields: agencyName and agencyAddress. Let's say I have indexed one entity: { "agencyName": "Turismo Viajes", "agencyAddress": "Av. Maipú 500" } I would like to be able to search…
Agustin Lopez
  • 1,355
  • 4
  • 18
  • 34
0
votes
2 answers

elastic search ngram special characters

I have an Elasticsearch node with the following default config index : analysis : analyzer : default_index : type : custom tokenizer : whitespace filter : - lowercase - asciifolding -…
anishek
  • 1,675
  • 2
  • 13
  • 19
0
votes
1 answer

Digits being neglected while performing N-gram in R

I want to get the counts of all character-level N-grams present in a text file. I wrote a small piece of R code to do this. However, the code ignores all the digits present in the text. Could anyone help me fix this issue? Here is the code…
Aravind Asok
  • 514
  • 1
  • 7
  • 18
0
votes
3 answers

Fastest way to store n-grams (strings with variable amount of words) in python

I have an input file consisting of lines with numbers and word sequences, structured like this: \1-grams: number w1 number number w2 number \2-grams: number w1 w2 number number w1 w3 number number w2 w3 number \end\ I want to…
niefpaarschoenen
  • 560
  • 1
  • 8
  • 19
0
votes
1 answer

how to return/search for documents using nltk bigrams?

What I want to do is loop through my database searching each document for the presence of certain listed terms -- some of which I would like to be bigram and trigram if necessary. If the terms are present I will submit the document's index and blah…
Bee Smears
  • 803
  • 3
  • 12
  • 22
0
votes
2 answers

What is the calculation of Ngram?

I'm doing a project on dating books, and my main idea is to do it with "ngram". I went to http://books.google.com/ngrams and found the ngrams that have the most unequivocal graphs (non-constant value over the years). Then I wrote code in…
Doron
  • 161
  • 2
  • 5
  • 13
0
votes
1 answer

Validating length of ngrams in varchar field in mySQL

I have a column in a table in MySQL which contains ngrams, and I want to validate that each ngram is of the correct length. For example, if the ngram is 'sophisticated hacking scheme' then the number of blanks should be two. Is there anything in MySQL…
Mr Morgan
  • 2,215
  • 15
  • 48
  • 78
0
votes
1 answer

How to get the array of all ngrams in Perl Text::Ngrams

As you know the module Text::Ngrams in Perl can give Ngrams analysis. There is the following function to retrieve the array of Ngrams and…
Losa
  • 3
  • 2
0
votes
0 answers

How to calculate N-Grams

I ran into a problem when trying to figure out how n-grams are calculated. I am wondering whether, when calculating n-grams (not the frequency), the positions of common elements can be exchanged. Here are several examples: (assume that extra symbol is…
PinkiePie-Z
  • 525
  • 1
  • 6
  • 28
0
votes
0 answers

Getting User Token for Microsoft Web N-Gram Service (Public Beta)

I want to use the Python Library of the Microsoft Web N-Gram Service (Public Beta) but I have to retrieve a user token from Microsoft Research in order to do so. I cannot however find how to retrieve it. It is stated at the Microsoft Web N-Gram…
stois21
  • 59
  • 1
  • 4
0
votes
1 answer

What are the most feasible options to do processing on google books n-gram dataset using modest resources?

I need to calculate word co-occurrence statistics for some 10,000 target words and a few hundred context words, for each target word, from the Google Books n-gram corpus. Below is the link to the full dataset: Google Ngram Viewer. As evident database…
anshuman
  • 98
  • 7