Questions tagged [n-gram]

An N-gram is an ordered collection of N elements of the same kind, usually presented in a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, amino acids in proteins, bases in DNA, etc. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams" (sometimes "digrams"), 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams." N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
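
A minimal Python sketch of exactly this word-level 2-gram extraction, using the same "#" end marker:

    def word_ngrams(tokens, n, eos="#"):
        """Word-level n-grams with an end-of-sentence marker appended."""
        tokens = tokens + [eos]
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(word_ngrams("Three cows eat grass".split(), 2))
    # [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]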

As N-gram analysis embeds the data set into a vector space, it allows the application of many powerful statistical techniques to data for prediction, classification, and discernment of various properties.
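
For instance, counting the n-grams in each document yields one sparse count vector per document; a sketch with scikit-learn (assuming a recent version, where get_feature_names_out is available, and illustrative parameters):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["Three cows eat grass.", "Cows eat green grass."]

    # each document becomes a vector of 1-gram and 2-gram counts
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(docs)            # sparse document-term matrix
    print(vectorizer.get_feature_names_out())     # the n-gram vocabulary
    print(X.toarray())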

More information:

  1. Google's Ngram Viewer: https://books.google.com/ngrams
  2. Wikipedia article: https://en.wikipedia.org/wiki/N-gram
874 questions
20 votes, 5 answers

How to compute skipgrams in python?

A k-skipgram is an ngram which is a superset of all ngrams and each (k-i)-skipgram till (k-i) == 0 (which includes 0-skip-grams). So how can these skipgrams be computed efficiently in Python? Following is the code I tried, but it is not doing as…
stackit
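
A minimal pure-Python sketch of k-skip-n-grams (the function below is an illustration, not the asker's code): fix the first token of each window, then choose the remaining n-1 tokens from the next n+k-1 positions, so at most k tokens are skipped in total.

    from itertools import combinations

    def skipgrams(tokens, n, k):
        """All n-grams allowing up to k skipped tokens in total."""
        grams = set()
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n + k]
            # fix the first token, pick the other n-1 from the rest of the window
            for rest in combinations(window[1:], n - 1):
                grams.add((window[0],) + rest)
        return grams

    print(sorted(skipgrams("insurgents killed in ongoing fighting".split(), 2, 2)))
    # the classic example sentence yields 9 distinct 2-skip-bigrams
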
20 votes, 4 answers

"Anagram solver" based on statistics rather than a dictionary/table?

My problem is conceptually similar to solving anagrams, except I can't just use a dictionary lookup. I am trying to find plausible words rather than real words. I have created an N-gram model (for now, N=2) based on the letters in a bunch of text.…
user132748
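
One standard approach, sketched here under my own assumptions rather than as the asker's model: estimate letter-bigram log-probabilities from a corpus, then rank candidate orderings of the letters by their total log-probability (corpus.txt is a placeholder path):

    import math
    from collections import Counter
    from itertools import permutations

    def train_bigrams(text):
        """Letter-bigram log-probabilities with add-one smoothing."""
        letters = "".join(c for c in text.lower() if c.isalpha())
        pairs = Counter(zip(letters, letters[1:]))
        total = sum(pairs.values())
        vocab = max(len(set(letters)), 1) ** 2
        return lambda a, b: math.log((pairs[(a, b)] + 1) / (total + vocab))

    def best_orderings(letters, logp, top=5):
        """Rank permutations by summed bigram log-probability (factorial cost!)."""
        score = lambda w: sum(logp(a, b) for a, b in zip(w, w[1:]))
        return sorted({"".join(p) for p in permutations(letters)},
                      key=score, reverse=True)[:top]

    logp = train_bigrams(open("corpus.txt").read())   # placeholder corpus
    print(best_orderings("aetm", logp))
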
18 votes, 4 answers

Fast n-gram calculation

I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's…
Trindaz
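
A common fast idiom for this, shown as a sketch rather than NLTK's internals (the corpus path is a placeholder): build n-grams with zip over shifted slices and count them once with a dictionary, after which every lookup is constant-time.

    from collections import Counter

    def ngrams(tokens, n):
        """Generate n-grams lazily via zip over shifted slices."""
        return zip(*(tokens[i:] for i in range(n)))

    tokens = open("corpus.txt").read().split()   # placeholder corpus
    counts = Counter(ngrams(tokens, 3))          # one pass over the corpus
    print(counts[("of", "the", "same")])         # O(1) lookups afterwards
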
18 votes, 2 answers

N-grams: Explanation + 2 applications

I want to implement some applications with n-grams (preferably in PHP). Which type of n-grams is more adequate for most purposes? A word level or a character level n-gram? How could you implement an n-gram-tokenizer in PHP? First, I would like to…
caw
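
Character-level n-grams tend to suit language detection and fuzzy matching, while word-level n-grams suit phrase statistics. The tokenizer itself is a few lines; here is a Python sketch of the character-level case (the question asks for PHP, but the indexing logic ports directly):

    def char_ngrams(text, n):
        """All contiguous character n-grams, including spaces."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("ngram", 3))
    # ['ngr', 'gra', 'ram']
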
17 votes, 3 answers

Get bigrams and trigrams in word2vec Gensim

I am currently using uni-grams in my word2vec model as follows. def review_to_sentences( review, tokenizer, remove_stopwords=False ): #Returns a list of sentences, where each sentence is a list of words # #NLTK tokenizer to split the…
user8566323
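
The usual answer is gensim's Phrases model, which learns frequent collocations and rewrites token lists so that, e.g., "machine learning" becomes the single token "machine_learning" before Word2Vec sees it. A hedged sketch (the min_count and threshold values here are illustrative only):

    from gensim.models.phrases import Phrases, Phraser

    sentences = [["machine", "learning", "is", "fun"],
                 ["machine", "learning", "models", "need", "data"]]

    bigram = Phraser(Phrases(sentences, min_count=1, threshold=1))
    bigram_sentences = [bigram[s] for s in sentences]
    # feed bigram_sentences to Word2Vec instead of the raw unigram lists

    # stacking a second pass yields trigrams such as "machine_learning_models"
    trigram = Phraser(Phrases(bigram_sentences, min_count=1, threshold=1))
    trigram_sentences = [trigram[s] for s in bigram_sentences]
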
16 votes, 4 answers

Is there a bi gram or tri gram feature in Spacy?

The below code breaks the sentence into individual tokens and the output is as below "cloud" "computing" "is" "benefiting" " major" "manufacturing" "companies" import en_core_web_sm nlp = en_core_web_sm.load() doc = nlp("Cloud computing is…
venkatttaknev
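
spaCy has no built-in n-gram component, but bigrams and trigrams fall out of zipping the parsed tokens against shifted copies of themselves; a sketch assuming the en_core_web_sm model is installed:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Cloud computing is benefiting major manufacturing companies")
    tokens = [t.text for t in doc]

    bigrams = list(zip(tokens, tokens[1:]))
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    print(bigrams[:3])
    # [('Cloud', 'computing'), ('computing', 'is'), ('is', 'benefiting')]
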
16 votes, 2 answers

Really fast word ngram vectorization in R

edit: The new package text2vec is excellent, and solves this problem (and many others) really well. text2vec on CRAN text2vec on github vignette that illustrates ngram tokenization I have a pretty large text dataset in R, which I've imported as a…
Zach
16 votes, 3 answers

Effective 1-5 grams extraction with python

I have huge files of 3,000,000 lines, and each line has 20-40 words. I have to extract 1 to 5 ngrams from the corpus. My input files are tokenized plain text, e.g.: This is a foo bar sentence . There is a comma , in this sentence . Such is an…
alvas
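
At this scale the main trick is streaming: process one line at a time and never hold the corpus in memory. NLTK's everygrams helper emits all 1-to-5-grams of a token list in one pass; a sketch (the file path is a placeholder):

    from collections import Counter
    from nltk.util import everygrams

    counts = Counter()
    with open("corpus.txt", encoding="utf-8") as f:   # placeholder path
        for line in f:
            tokens = line.split()                     # input is pre-tokenized
            counts.update(everygrams(tokens, min_len=1, max_len=5))

    print(counts.most_common(5))
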
16 votes, 3 answers

Fast/Optimize N-gram implementations in python

Which ngram implementation is fastest in python? I've tried to profile nltk's vs scott's zip (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/): from nltk.util import ngrams as nltkngram import this, time def…
alvas
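
Rather than trusting any one benchmark, it is easy to time both on your own data with timeit; a sketch (the zip idiom usually wins because it avoids a per-element Python loop, but results vary by version and input size):

    import timeit
    from nltk.util import ngrams as nltk_ngrams

    def zip_ngrams(tokens, n):
        """n-grams via zip over n shifted views of the token list."""
        return list(zip(*(tokens[i:] for i in range(n))))

    tokens = ("the quick brown fox jumps over the lazy dog " * 1000).split()

    print("zip :", timeit.timeit(lambda: zip_ngrams(tokens, 3), number=100))
    print("nltk:", timeit.timeit(lambda: list(nltk_ngrams(tokens, 3)), number=100))
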
14 votes, 7 answers

What algorithm I need to find n-grams?

What algorithm is used for finding ngrams? Supposing my input data is an array of words and the size of the ngrams I want to find, what algorithm should I use? I'm asking for code, with a preference for R. The data is stored in a database, so it can be a…
Renato Dinhani
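
The algorithm is simply a sliding window: advance an index over the word array and take the length-N slice at each position. An index-based Python sketch (the same loop translates directly to R):

    def sliding_ngrams(words, n):
        """Slide a window of width n across the word array."""
        result = []
        for i in range(len(words) - n + 1):
            result.append(words[i:i + n])
        return result

    print(sliding_ngrams(["a", "b", "c", "d"], 2))
    # [['a', 'b'], ['b', 'c'], ['c', 'd']]
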
14 votes, 2 answers

Creating ARPA language model file with 50,000 words

I want to create an ARPA language model file with nearly 50,000 words. I can't generate the language model by passing my text file to the CMU Language Tool. Is there any other link available where I can get a language model for this many words?
Vipin
14 votes, 1 answer

Ngram model and perplexity in NLTK

To put my question in context, I would like to train and test/compare several (neural) language models. In order to focus on the models rather than data preparation I chose to use the Brown corpus from nltk and train the Ngrams model provided with…
zermelozf
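
The NgramModel this question originally used was removed from NLTK; recent releases (3.4+) provide language models under nltk.lm instead. A hedged sketch training a smoothed bigram model on Brown and computing perplexity on a held-out sentence:

    from nltk.corpus import brown                  # requires nltk.download("brown")
    from nltk.lm import Laplace
    from nltk.lm.preprocessing import padded_everygram_pipeline, padded_everygrams

    sents = [[w.lower() for w in s] for s in brown.sents()[:5000]]
    train, held_out = sents[:4500], sents[4500:]

    n = 2
    train_grams, vocab = padded_everygram_pipeline(n, train)
    lm = Laplace(n)                    # add-one smoothing keeps perplexity finite
    lm.fit(train_grams, vocab)

    print(lm.perplexity(padded_everygrams(n, held_out[0])))
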
13 votes, 1 answer

Is there a more efficient way to find most common n-grams?

I'm trying to find k most common n-grams from a large corpus. I've seen lots of places suggesting the naïve approach - simply scanning through the entire corpus and keeping a dictionary of the count of all n-grams. Is there a better way to do this?
bendl
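
If exact counts are required, the single-pass dictionary is essentially optimal, but memory can be bounded by periodically pruning rare entries and taking the top k with a heap at the end. A sketch of that pruning idea (the thresholds are assumptions and make the result approximate):

    import heapq
    from collections import Counter

    def top_k_ngrams(tokens, n, k, prune_every=1_000_000, min_keep=2):
        """Approximate top-k n-grams with bounded memory via periodic pruning."""
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
            if (i + 1) % prune_every == 0:
                # drop rare n-grams to cap the dictionary size (approximate!)
                counts = Counter({g: c for g, c in counts.items() if c >= min_keep})
        return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

    # usage sketch: top_k_ngrams(open("corpus.txt").read().split(), n=3, k=10)
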
13 votes, 4 answers

Java Lucene NGramTokenizer

I am trying to tokenize strings into ngrams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual ngrams that were tokenized. In fact, I only see two methods in the NGramTokenizer class that return…
CodeKingPlusPlus
12 votes, 3 answers

How to generate bi/tri-grams using spacy/nltk

The input text is always a list of dish names with 1~3 adjectives and a noun. Inputs: thai iced tea, spicy fried chicken, sweet chili pork, thai chicken curry. Outputs: thai tea, iced tea; spicy chicken, fried chicken; sweet pork, chili…
samol
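
What this asks for is not raw bigrams but each modifier paired with the head noun, which POS tagging approximates. A sketch with NLTK (the tag choices below are heuristics; participles like "iced" and "fried" are tagged VBN/VBG rather than JJ, and tagger accuracy on lowercase dish names is imperfect):

    import nltk  # requires nltk.download("averaged_perceptron_tagger")

    def adjective_noun_pairs(phrase):
        """Heuristic: pair each adjective or participle with the final noun."""
        tagged = nltk.pos_tag(phrase.split())
        nouns = [w for w, t in tagged if t.startswith("NN")]
        if not nouns:
            return []
        head = nouns[-1]
        # JJ = adjective; VBN/VBG cover participles like "iced", "fried"
        mods = [w for w, t in tagged if t in ("JJ", "VBN", "VBG") and w != head]
        return [(m, head) for m in mods]

    print(adjective_noun_pairs("spicy fried chicken"))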