Questions tagged [n-gram]

An N-gram is an ordered sequence of N elements of the same kind, usually considered as part of a large collection of similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, nucleotides in DNA, or amino acids in proteins. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams" (or "digrams"), 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams". N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
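
The example above can be reproduced with a minimal Python sketch. The function name `word_ngrams`, the whitespace tokenization, and the punctuation stripping are assumptions made for illustration; real tokenizers handle many more cases.

```python
def word_ngrams(sentence, n=2, end_marker="#"):
    """Return the word N-grams of a sentence as a list of tuples.

    Tokenization here is deliberately naive (an assumption for this sketch):
    split on whitespace and strip common punctuation from each token.
    """
    tokens = [w.strip(".,!?;:") for w in sentence.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were only punctuation
    if end_marker is not None:
        tokens.append(end_marker)      # metadata marker for end of sentence
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("Three cows eat grass.")
print(bigrams)
# [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]
```

Passing `end_marker=None` omits the end-of-sentence marker, which matches the note that such metadata may or may not be included.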

Because N-gram counts can be represented as vectors, with one dimension per distinct N-gram, N-gram analysis embeds a data set into a vector space. This allows many powerful statistical techniques to be applied to the data for prediction, classification, and discernment of various properties.
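
One common way to obtain such vectors is to count how often each N-gram occurs in each document. The sketch below uses character bigrams and plain count vectors; the helper names `char_ngrams` and `count_vectors` are hypothetical, and count vectors are only one of several possible representations.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all character N-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def count_vectors(docs, n=2):
    """Map each document to a vector of N-gram counts.

    The vocabulary is the sorted set of N-grams seen in any document,
    so every vector has the same length and dimension order.
    """
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = [[c[g] for g in vocab] for c in counts]
    return vocab, vectors


vocab, vectors = count_vectors(["abab", "abba"])
print(vocab)    # ['ab', 'ba', 'bb']
print(vectors)  # [[2, 1, 0], [1, 1, 1]]
```

Once documents are vectors of equal length, standard tools such as distance measures or classifiers can be applied to them.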

More information:

  1. Google's Ngram Viewer
  2. Wikipedia article
874 questions
12
votes
1 answer

Is there an alternate for the now removed module 'nltk.model.NGramModel'?

I've been trying to find an alternative for two straight days now, and couldn't find anything relevant. I'm basically trying to get a probabilistic score of a synthesized sentence (synthesized by replacing some words from an original sentence…
Ketan
  • 1,467
  • 13
  • 16
12
votes
4 answers

Finding ngrams in R and comparing ngrams across corpora

I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple…
Markus D
  • 187
  • 1
  • 3
  • 10
12
votes
1 answer

Simulating a Markov Chain with Neo4J

A Markov chain is composed of a set of states which can transition to other states with a certain probability. A Markov chain can be easily represented in Neo4J by creating a node for each state, a relationship for each transition, and then…
JnBrymn
  • 24,245
  • 28
  • 105
  • 147
11
votes
3 answers

Extract keyphrases from text (1-4 word ngrams)

What's the best way to extract keyphrases from a block of text? I'm writing a tool to do keyword extraction: something like this. I've found a few libraries for Python and Perl to extract n-grams, but I'm writing this in Node so I need a JavaScript…
Carter Cole
  • 918
  • 9
  • 16
11
votes
2 answers

NLTK package to estimate the (unigram) perplexity

I am trying to calculate the perplexity for the data I have. The code I am using is: import sys sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk") from nltk.corpus import brown from nltk.model import NgramModel from…
Ana_Sam
  • 469
  • 2
  • 4
  • 12
11
votes
4 answers

Can Drupal's search module search for a substring? (Partial Search)

Drupal's core search module only searches for whole keywords, e.g. "sandwich". Can I make it search with a substring, e.g. "sandw", and return my sandwich results? Maybe there is a plugin that does that?
Dan Albey
  • 559
  • 1
  • 8
  • 14
10
votes
2 answers

Using Keras Tokenizer to generate n-grams

Is it possible to use n-grams in Keras? E.g., the sentences are contained in the X_train dataframe, in the "sentences" column. I use the tokenizer from Keras in the following manner: tokenizer = Tokenizer(lower=True, split='…
Simplex
  • 1,723
  • 2
  • 17
  • 26
10
votes
1 answer

Predicting phrases instead of just next word

For an application that we built, we are using a simple statistical model for word prediction (like Google Autocomplete) to guide search. It uses a sequence of ngrams gathered from a large corpus of relevant text documents. By considering the…
Jedi
  • 3,088
  • 2
  • 28
  • 47
10
votes
2 answers

Python interface to ARPA files

I'm looking for a pythonic interface to load ARPA files (back-off language models) and use them to evaluate some text, e.g. get its log-probability, perplexity etc. I don't need to generate the ARPA file in Python, only to use it for querying. Does…
Beka
  • 725
  • 6
  • 22
10
votes
1 answer

n-grams with Naive Bayes classifier

I'm new to Python and need help! I was practicing with Python NLTK text classification. Here is the code example I am practicing on: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/ I've tried this one: from nltk import…
Aikin
  • 319
  • 2
  • 5
  • 13
9
votes
2 answers

ElasticSearch n-gram tokenfilter not finding partial words

I have been playing around with ElasticSearch for a new project of mine. I have set the default analyzers to use the ngram tokenfilter. This is my elasticsearch.yml file: index: analysis: analyzer: default_index: tokenizer:…
asleepysamurai
  • 1,362
  • 2
  • 14
  • 23
9
votes
2 answers

ElasticSearch use "best match" of ngram terms instead of "synonym"?

Is it possible to tell ElasticSearch to use the "best match" of all grams instead of using grams as synonyms? By default ElasticSearch uses grams as synonyms and returns poorly matching documents. It's better to show this with an example, let's say we have…
Alex Craft
  • 13,598
  • 11
  • 69
  • 133
9
votes
5 answers

Creating a dictionary for each word in a file and counting the frequency of words that follow it

I am trying to solve a difficult problem and am getting lost. Here's what I'm supposed to do: INPUT: file OUTPUT: dictionary Return a dictionary whose keys are all the words in the file (broken by whitespace). The value for each word is a…
Kristie
  • 241
  • 3
  • 7
9
votes
2 answers

Finding conditional probability of trigram in python nltk

I have started learning NLTK and I am following a tutorial from here, where they find conditional probability using bigrams like this. import nltk from nltk.corpus import brown cfreq_brown_2gram =…
Riken Shah
  • 3,022
  • 5
  • 29
  • 56
9
votes
8 answers

The n-gram that is the most frequent one among all the words

I came across the following programming interview problem: Challenge 1: N-grams An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and…
andrestoga
  • 619
  • 3
  • 9
  • 19