Questions tagged [n-gram]

An N-gram is an ordered sequence of N elements of the same kind, usually drawn from a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, the bases in DNA sequences, etc. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams" (sometimes "digrams"), 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams". N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
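The same 2-grams can be produced with a short sliding-window function (a minimal sketch; the helper name `ngrams` is illustrative, not a library API):

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Three", "cows", "eat", "grass", "#"]  # "#" marks end of sentence
print(ngrams(tokens, 2))
# [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]
```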

As N-gram analysis embeds the data set into a vector space, it allows the application of many powerful statistical techniques to data for prediction, classification, and discernment of various properties.
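A minimal sketch of that embedding, using only the standard library (the vocabulary and sample sentence here are chosen for illustration):

```python
from collections import Counter

def ngram_vector(tokens, n, vocabulary):
    # Count every n-gram in the token sequence, then read the counts off
    # in a fixed vocabulary order: each document becomes a point in a
    # len(vocabulary)-dimensional vector space.
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [counts[gram] for gram in vocabulary]

vocab = [("this", "is"), ("is", "a"), ("a", "test"), ("test", "this")]
print(ngram_vector("this is a test this is".split(), 2, vocab))
# [2, 1, 1, 1]
```

Once documents are vectors like this, standard tools such as cosine similarity, classifiers, and clustering apply directly.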

More information:

  1. Google's Ngram Viewer
  2. Wikipedia article
874 questions
6 votes · 2 answers

NLP algorithm to 'fill out' search terms

I'm trying to write an algorithm (which I'm assuming will rely on natural language processing techniques) to 'fill out' a list of search terms. There is probably a name for this kind of thing which I'm unaware of. What is this kind of problem…
Trindaz • 17,029 • 21 • 82 • 111
6 votes · 2 answers

n-gram modeling with java hashmap

I need to model a collection of n-grams (sequences of n words) and their contexts (words that appear near the n-gram, along with their frequency). My idea was this: public class Ngram { private String[] words; private HashMap
Nikola • 694 • 8 • 15
6 votes · 1 answer

How to get the probability of bigrams in a text of sentences?

I have a text which has many sentences. How can I use nltk.ngrams to process it? This is my code: sequence = nltk.tokenize.word_tokenize(raw) bigram = ngrams(sequence,2) freq_dist = nltk.FreqDist(bigram) prob_dist =…
Ahmad • 8,811 • 11 • 76 • 141
6 votes · 1 answer

String Matching Using TF-IDF, NGrams and Cosine Similarity in Python

I am working on my first major data science project. I am attempting to match names from a large list of data in one source to a cleansed dictionary in another. I am using this string matching blog as a guide. I am attempting to use two…
HMan06 • 755 • 2 • 9 • 23
6 votes · 1 answer

ElasticSearch Edge NGram vs Prefix query

Let's say we have a text field that is relatively short, let's say maximum 10 characters and is saved as a keyword. I want my users to be able to prefix-search this field (not autocomplete / search-as-you-type). I have read on Elastic's…
DotnetProg • 790 • 9 • 24
6 votes · 3 answers

most common 2-grams using python

Given a string: this is a test this is How can I find the top-n most common 2-grams? In the string above, all 2-grams are: {this is, is a, a test, test this, this is} As you can see, the 2-gram this is appears 2 times. Hence the result should be: {this…
stfd1123581321 • 163 • 1 • 2 • 6
6 votes · 2 answers

Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus

Following the many guides to creating biGrams using the 'tm' and 'RWeka' packages, I was getting frustrated that only 1-Grams were being returned in the tdm. Through much trial and error I discovered that proper function was achieved using 'VCorpus'…
Paul_J • 61 • 1 • 4
6 votes · 1 answer

n-gram sentence similarity with cosine similarity measurement

I have been working on a project about sentence similarity. I know this has been asked many times on SO, but I just want to know whether my problem can be accomplished with the method I am using, or whether I should change my approach to the…
Ahmet Keskin • 1,025 • 1 • 15 • 25
6 votes · 5 answers

Find the most frequently occurring words in a text in R

Can someone help me find the most frequently used two- and three-word phrases in a text using R? My text is... text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a…
Madhu Sareen • 549 • 1 • 8 • 20
6 votes · 3 answers

Overcoming MemoryError / Slow Runtime in Ashton String task

In the Ashton String task, the goal is to: Arrange all the distinct substrings of a given string in lexicographical order and concatenate them. Print the Kth character of the concatenated string. It is assured that given value of K will be …
alvas • 115,346 • 109 • 446 • 738
6 votes · 1 answer

How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example: import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html ngram_size = 4 string = ["I really like python, it's pretty awesome."] vect =…
Franck Dernoncourt • 77,520 • 72 • 342 • 501
6 votes · 1 answer

How to extract character ngram from sentences? - python

The following word2ngrams function extracts character 3grams from a word: >>> x = 'foobar' >>> n = 3 >>> [x[i:i+n] for i in range(len(x)-n+1)] ['foo', 'oob', 'oba', 'bar'] This post shows the character ngrams extraction for a single word, Quick…
alvas • 115,346 • 109 • 446 • 738
6 votes · 2 answers

N-grams vs other classifiers in text categorization

I'm new to text categorization techniques. I want to know the difference between the N-gram approach to text categorization and text categorization based on other classifiers (decision tree, KNN, SVM). I want to know which one is better; does n-grams…
6 votes · 2 answers

NLTK makes it easy to compute bigrams of words. What about letters?

I've seen tons of documentation all over the web about how the python NLTK makes it easy to compute bigrams of words. What about letters? What I want to do is plug in a dictionary and have it tell me the relative frequencies of different letter…
isthmuses • 1,316 • 1 • 17 • 27
5 votes · 3 answers

Package to generate n-gram language models with smoothing? (Alternatives to NLTK)

I'd like to find some type of package or module (preferably Python or Perl, but others would do) that automatically generates n-gram probabilities from an input text and can automatically apply one or more smoothing algorithms as well. That is, I am…
Alan H. • 263 • 3 • 8