Questions tagged [n-gram]

An N-gram is an ordered collection of N elements of the same kind, usually presented in a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, amino acids in proteins, bases in DNA, etc. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams" (sometimes "digrams"), 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams." N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
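
A minimal Python sketch of exactly this word-level 2-gram extraction, using the same "#" end marker:

    def word_ngrams(tokens, n, eos="#"):
        """Word-level n-grams with an end-of-sentence marker appended."""
        tokens = tokens + [eos]
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(word_ngrams("Three cows eat grass".split(), 2))
    # [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]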

As N-gram analysis embeds the data set into a vector space, it allows the application of many powerful statistical techniques to data for prediction, classification, and discernment of various properties.
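
For instance, counting the n-grams in each document yields one sparse count vector per document; a sketch with scikit-learn (assuming a recent version, where get_feature_names_out is available, and illustrative parameters):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["Three cows eat grass.", "Cows eat green grass."]

    # each document becomes a vector of 1-gram and 2-gram counts
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(docs)            # sparse document-term matrix
    print(vectorizer.get_feature_names_out())     # the n-gram vocabulary
    print(X.toarray())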

More information:

  1. Google's Ngram Viewer: https://books.google.com/ngrams
  2. Wikipedia article: https://en.wikipedia.org/wiki/N-gram
874 questions
20 votes, 5 answers

How to compute skipgrams in python?

A k-skipgram is an ngram which is a superset of all ngrams and each (k-i)-skipgram till (k-i) == 0 (which includes 0-skip-grams). So how can these skipgrams be computed efficiently in Python? Following is the code I tried, but it is not doing as…
stackit
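
A minimal pure-Python sketch of k-skip-n-grams (the function below is an illustration, not the asker's code): fix the first token of each window, then choose the remaining n-1 tokens from the next n+k-1 positions, so at most k tokens are skipped in total.

    from itertools import combinations

    def skipgrams(tokens, n, k):
        """All n-grams allowing up to k skipped tokens in total."""
        grams = set()
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n + k]
            # fix the first token, pick the other n-1 from the rest of the window
            for rest in combinations(window[1:], n - 1):
                grams.add((window[0],) + rest)
        return grams

    print(sorted(skipgrams("insurgents killed in ongoing fighting".split(), 2, 2)))
    # the classic example sentence yields 9 distinct 2-skip-bigrams
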
20 votes, 4 answers

"Anagram solver" based on statistics rather than a dictionary/table?

My problem is conceptually similar to solving anagrams, except I can't just use a dictionary lookup. I am trying to find plausible words rather than real words. I have created an N-gram model (for now, N=2) based on the letters in a bunch of text.…
user132748
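
One standard approach, sketched here under my own assumptions rather than as the asker's model: estimate letter-bigram log-probabilities from a corpus, then rank candidate orderings of the letters by their total log-probability (corpus.txt is a placeholder path):

    import math
    from collections import Counter
    from itertools import permutations

    def train_bigrams(text):
        """Letter-bigram log-probabilities with add-one smoothing."""
        letters = "".join(c for c in text.lower() if c.isalpha())
        pairs = Counter(zip(letters, letters[1:]))
        total = sum(pairs.values())
        vocab = max(len(set(letters)), 1) ** 2
        return lambda a, b: math.log((pairs[(a, b)] + 1) / (total + vocab))

    def best_orderings(letters, logp, top=5):
        """Rank permutations by summed bigram log-probability (factorial cost!)."""
        score = lambda w: sum(logp(a, b) for a, b in zip(w, w[1:]))
        return sorted({"".join(p) for p in permutations(letters)},
                      key=score, reverse=True)[:top]

    logp = train_bigrams(open("corpus.txt").read())   # placeholder corpus
    print(best_orderings("aetm", logp))
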
18 votes, 4 answers

Fast n-gram calculation

I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's…
Trindaz
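
A common fast idiom for this, shown as a sketch rather than NLTK's internals (the corpus path is a placeholder): build n-grams with zip over shifted slices and count them once with a dictionary, after which every lookup is constant-time.

    from collections import Counter

    def ngrams(tokens, n):
        """Generate n-grams lazily via zip over shifted slices."""
        return zip(*(tokens[i:] for i in range(n)))

    tokens = open("corpus.txt").read().split()   # placeholder corpus
    counts = Counter(ngrams(tokens, 3))          # one pass over the corpus
    print(counts[("of", "the", "same")])         # O(1) lookups afterwards
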
18 votes, 2 answers

N-grams: Explanation + 2 applications

I want to implement some applications with n-grams (preferably in PHP). Which type of n-grams is more adequate for most purposes? A word level or a character level n-gram? How could you implement an n-gram-tokenizer in PHP? First, I would like to…
caw
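
Character-level n-grams tend to suit language detection and fuzzy matching, while word-level n-grams suit phrase statistics. The tokenizer itself is a few lines; here is a Python sketch of the character-level case (the question asks for PHP, but the indexing logic ports directly):

    def char_ngrams(text, n):
        """All contiguous character n-grams, including spaces."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("ngram", 3))
    # ['ngr', 'gra', 'ram']
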
17 votes, 3 answers

Get bigrams and trigrams in word2vec Gensim

I am currently using uni-grams in my word2vec model as follows. def review_to_sentences( review, tokenizer, remove_stopwords=False ): #Returns a list of sentences, where each sentence is a list of words # #NLTK tokenizer to split the…
user8566323
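
The usual answer is gensim's Phrases model, which learns frequent collocations and rewrites token lists so that, e.g., "machine learning" becomes the single token "machine_learning" before Word2Vec sees it. A hedged sketch (the min_count and threshold values here are illustrative only):

    from gensim.models.phrases import Phrases, Phraser

    sentences = [["machine", "learning", "is", "fun"],
                 ["machine", "learning", "models", "need", "data"]]

    bigram = Phraser(Phrases(sentences, min_count=1, threshold=1))
    bigram_sentences = [bigram[s] for s in sentences]
    # feed bigram_sentences to Word2Vec instead of the raw unigram lists

    # stacking a second pass yields trigrams such as "machine_learning_models"
    trigram = Phraser(Phrases(bigram_sentences, min_count=1, threshold=1))
    trigram_sentences = [trigram[s] for s in bigram_sentences]
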
16 votes, 4 answers

Is there a bi gram or tri gram feature in Spacy?

The below code breaks the sentence into individual tokens and the output is as below "cloud" "computing" "is" "benefiting" " major" "manufacturing" "companies" import en_core_web_sm nlp = en_core_web_sm.load() doc = nlp("Cloud computing is…
venkatttaknev
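
spaCy has no built-in n-gram component, but bigrams and trigrams fall out of zipping the parsed tokens against shifted copies of themselves; a sketch assuming the en_core_web_sm model is installed:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Cloud computing is benefiting major manufacturing companies")
    tokens = [t.text for t in doc]

    bigrams = list(zip(tokens, tokens[1:]))
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    print(bigrams[:3])
    # [('Cloud', 'computing'), ('computing', 'is'), ('is', 'benefiting')]
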
16 votes, 2 answers

Really fast word ngram vectorization in R

edit: The new package text2vec is excellent, and solves this problem (and many others) really well. text2vec on CRAN text2vec on github vignette that illustrates ngram tokenization I have a pretty large text dataset in R, which I've imported as a…
Zach
16 votes, 3 answers

Effective 1-5 grams extraction with python

I have huge files of 3,000,000 lines, and each line has 20-40 words. I have to extract 1 to 5 ngrams from the corpus. My input files are tokenized plain text, e.g.: This is a foo bar sentence . There is a comma , in this sentence . Such is an…
alvas
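
At this scale the main trick is streaming: process one line at a time and never hold the corpus in memory. NLTK's everygrams helper emits all 1-to-5-grams of a token list in one pass; a sketch (the file path is a placeholder):

    from collections import Counter
    from nltk.util import everygrams

    counts = Counter()
    with open("corpus.txt", encoding="utf-8") as f:   # placeholder path
        for line in f:
            tokens = line.split()                     # input is pre-tokenized
            counts.update(everygrams(tokens, min_len=1, max_len=5))

    print(counts.most_common(5))
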
16 votes, 3 answers

Fast/Optimize N-gram implementations in python

Which ngram implementation is fastest in python? I've tried to profile nltk's vs scott's zip (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/): from nltk.util import ngrams as nltkngram import this, time def…
alvas
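
Rather than trusting any one benchmark, it is easy to time both on your own data with timeit; a sketch (the zip idiom usually wins because it avoids a per-element Python loop, but results vary by version and input size):

    import timeit
    from nltk.util import ngrams as nltk_ngrams

    def zip_ngrams(tokens, n):
        """n-grams via zip over n shifted views of the token list."""
        return list(zip(*(tokens[i:] for i in range(n))))

    tokens = ("the quick brown fox jumps over the lazy dog " * 1000).split()

    print("zip :", timeit.timeit(lambda: zip_ngrams(tokens, 3), number=100))
    print("nltk:", timeit.timeit(lambda: list(nltk_ngrams(tokens, 3)), number=100))
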
14 votes, 7 answers

What algorithm I need to find n-grams?

What algorithm is used for finding ngrams? Supposing my input data is an array of words and the size of the ngrams I want to find, what algorithm should I use? I'm asking for code, with a preference for R. The data is stored in a database, so it can be a…
Renato Dinhani
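
The algorithm is simply a sliding window: advance an index over the word array and take the length-N slice at each position. An index-based Python sketch (the same loop translates directly to R):

    def sliding_ngrams(words, n):
        """Slide a window of width n across the word array."""
        result = []
        for i in range(len(words) - n + 1):
            result.append(words[i:i + n])
        return result

    print(sliding_ngrams(["a", "b", "c", "d"], 2))
    # [['a', 'b'], ['b', 'c'], ['c', 'd']]
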
14 votes, 2 answers

Creating ARPA language model file with 50,000 words

I want to create an ARPA language model file with nearly 50,000 words. I can't generate the language model by passing my text file to the CMU Language Tool. Is there any other link available where I can get a language model for this many words?
Vipin
14 votes, 1 answer

Ngram model and perplexity in NLTK

To put my question in context, I would like to train and test/compare several (neural) language models. In order to focus on the models rather than data preparation I chose to use the Brown corpus from nltk and train the Ngrams model provided with…
zermelozf
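
The NgramModel this question originally used was removed from NLTK; recent releases (3.4+) provide language models under nltk.lm instead. A hedged sketch training a smoothed bigram model on Brown and computing perplexity on a held-out sentence:

    from nltk.corpus import brown                  # requires nltk.download("brown")
    from nltk.lm import Laplace
    from nltk.lm.preprocessing import padded_everygram_pipeline, padded_everygrams

    sents = [[w.lower() for w in s] for s in brown.sents()[:5000]]
    train, held_out = sents[:4500], sents[4500:]

    n = 2
    train_grams, vocab = padded_everygram_pipeline(n, train)
    lm = Laplace(n)                    # add-one smoothing keeps perplexity finite
    lm.fit(train_grams, vocab)

    print(lm.perplexity(padded_everygrams(n, held_out[0])))
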
13 votes, 1 answer

Is there a more efficient way to find most common n-grams?

I'm trying to find k most common n-grams from a large corpus. I've seen lots of places suggesting the naïve approach - simply scanning through the entire corpus and keeping a dictionary of the count of all n-grams. Is there a better way to do this?
bendl
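
If exact counts are required, the single-pass dictionary is essentially optimal, but memory can be bounded by periodically pruning rare entries and taking the top k with a heap at the end. A sketch of that pruning idea (the thresholds are assumptions and make the result approximate):

    import heapq
    from collections import Counter

    def top_k_ngrams(tokens, n, k, prune_every=1_000_000, min_keep=2):
        """Approximate top-k n-grams with bounded memory via periodic pruning."""
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
            if (i + 1) % prune_every == 0:
                # drop rare n-grams to cap the dictionary size (approximate!)
                counts = Counter({g: c for g, c in counts.items() if c >= min_keep})
        return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

    # usage sketch: top_k_ngrams(open("corpus.txt").read().split(), n=3, k=10)
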
13 votes, 4 answers

Java Lucene NGramTokenizer

I am trying to tokenize strings into ngrams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual ngrams that were tokenized. In fact, I only see two methods in the NGramTokenizer class that return…
CodeKingPlusPlus
12 votes, 3 answers

How to generate bi/tri-grams using spacy/nltk

The input text is always a list of dish names with 1~3 adjectives and a noun. Inputs: thai iced tea, spicy fried chicken, sweet chili pork, thai chicken curry. Outputs: thai tea, iced tea; spicy chicken, fried chicken; sweet pork, chili…
samol
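
What this asks for is not raw bigrams but each modifier paired with the head noun, which POS tagging approximates. A sketch with NLTK (the tag choices below are heuristics; participles like "iced" and "fried" are tagged VBN/VBG rather than JJ, and tagger accuracy on lowercase dish names is imperfect):

    import nltk  # requires nltk.download("averaged_perceptron_tagger")

    def adjective_noun_pairs(phrase):
        """Heuristic: pair each adjective or participle with the final noun."""
        tagged = nltk.pos_tag(phrase.split())
        nouns = [w for w, t in tagged if t.startswith("NN")]
        if not nouns:
            return []
        head = nouns[-1]
        # JJ = adjective; VBN/VBG cover participles like "iced", "fried"
        mods = [w for w, t in tagged if t in ("JJ", "VBN", "VBG") and w != head]
        return [(m, head) for m in mods]

    print(adjective_noun_pairs("spicy fried chicken"))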