Questions tagged [n-gram]

An N-gram is an ordered collection of N elements of the same kind, usually presented in a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, nucleotides in DNA, etc. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams" (or "digrams"), 3-grams are called "trigrams," and higher-order N-grams are simply named by number, e.g. "4-grams". N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
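
The sentence example above can be sketched in a few lines of Python. This is only a minimal illustration, assuming whitespace tokenization and a trailing "#" marker; the function name `word_ngrams` and its parameters are hypothetical and not taken from any library.

```python
def word_ngrams(sentence, n=2, end_marker="#"):
    """Return the word-level n-grams of a sentence as a list of tuples.

    Assumption: tokens are separated by whitespace and the sentence
    ends with at most simple punctuation (., ! or ?).
    """
    # Drop trailing sentence punctuation, then split on whitespace.
    tokens = sentence.rstrip(".!?").split()
    # Append the end-of-sentence metadata marker.
    tokens.append(end_marker)
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("Three cows eat grass.")
print(bigrams)
# [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]
```

Passing `n=3` to the same function yields trigrams; a real tokenizer would handle punctuation and casing more carefully.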

Because N-gram counts can be treated as coordinates in a vector space, N-gram analysis allows many powerful statistical techniques to be applied to the data for prediction, classification, and the discovery of various properties.
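
One way to picture the vector-space idea is to count character trigrams and compare the resulting count vectors. The sketch below is illustrative only: the helper names (`char_ngrams`, `count_vector`, `cosine`) and the three sample strings are assumptions made for this example, and cosine similarity is just one of many measures that could be used.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return all character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def count_vector(text, vocabulary, n=3):
    """Map text to a vector of n-gram counts, one entry per vocabulary item."""
    counts = Counter(char_ngrams(text, n))
    return [counts[gram] for gram in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical sample data, chosen only for illustration.
docs = ["sandwich", "sandwiches", "grass"]

# The shared vocabulary fixes the dimensions of the vector space.
vocabulary = sorted({gram for doc in docs for gram in char_ngrams(doc)})
vectors = [count_vector(doc, vocabulary) for doc in docs]

print(cosine(vectors[0], vectors[1]))  # similar strings score high
print(cosine(vectors[0], vectors[2]))  # no shared trigrams, so 0.0
```

Once every item is a vector over the same vocabulary, standard classifiers or clustering methods can be applied to the vectors directly.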

874 questions

1 answer

Is there an alternate for the now removed module 'nltk.model.NGramModel'?

I've been trying to find out an alternative for two straight days now, and couldn't find anything relevant. I'm basically trying to get a probabilistic score of a synthesized sentence (synthesized by replacing some words from an original sentence…

python nltk n-gram

asked Oct 18 '14 at 18:24

Ketan

4 answers

Finding ngrams in R and comparing ngrams across corpora

I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple…

r text-mining n-gram tm

asked Oct 27 '13 at 06:08

Markus D

1 answer

Simulating a Markov Chain with Neo4J

A Markov chain is composed of a set of states which can transition to other states with a certain probability. A Markov chain can be easily represented in Neo4J by creating a node for each state, a relationship for each transition, and then…

neo4j cypher n-gram markov-chains

asked May 17 '13 at 04:07

JnBrymn

3 answers

Extract keyphrases from text (1-4 word ngrams)

What's the best way to extract keyphrases from a block of text? I'm writing a tool to do keyword extraction: something like this. I've found a few libraries for Python and Perl to extract n-grams, but I'm writing this in Node so I need a JavaScript…

javascript keyword n-gram

asked Aug 16 '11 at 21:47

Carter Cole

2 answers

NLTK package to estimate the (unigram) perplexity

I am trying to calculate the perplexity for the data I have. The code I am using is: import sys sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk") from nltk.corpus import brown from nltk.model import NgramModel from…

python-2.7 nlp nltk n-gram language-model

asked Oct 21 '15 at 18:48

Ana_Sam

4 answers

Can Drupal's search module search for a substring? (Partial Search)

Drupal's core search module only searches for whole keywords, e.g. "sandwich". Can I make it search for a substring, e.g. "sandw", and return my sandwich results? Maybe there is a plugin that does that?

search drupal partial n-gram

asked Apr 16 '10 at 15:17

Dan Albey

2 answers

Using Keras Tokenizer to generate n-grams

Is it possible to use n-grams in Keras? E.g., the sentences are contained in an X_train dataframe with a "sentences" column. I use the tokenizer from Keras in the following manner: tokenizer = Tokenizer(lower=True, split='…

nlp keras tokenize text-processing n-gram

asked Sep 12 '17 at 10:02

Simplex

1 answer

Predicting phrases instead of just next word

For an application that we built, we are using a simple statistical model for word prediction (like Google Autocomplete) to guide search. It uses a sequence of ngrams gathered from a large corpus of relevant text documents. By considering the…

algorithm autocomplete n-gram phrases

asked Mar 22 '17 at 20:46

Jedi

2 answers

Python interface to ARPA files

I'm looking for a pythonic interface to load ARPA files (back-off language models) and use them to evaluate some text, e.g. get its log-probability, perplexity etc. I don't need to generate the ARPA file in Python, only to use it for querying. Does…

python nlp n-gram language-model

asked May 26 '14 at 04:05

Beka

1 answer

n-grams with Naive Bayes classifier

I'm new to Python and need help! I was practicing with Python NLTK text classification. Here is the code example I am practicing on: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/ I've tried this one: from nltk import…

python nltk n-gram

asked Dec 22 '12 at 13:40

Aikin

2 answers

ElasticSearch n-gram tokenfilter not finding partial words

I have been playing around with ElasticSearch for a new project of mine. I have set the default analyzers to use the ngram tokenfilter. This is my elasticsearch.yml file: index: analysis: analyzer: default_index: tokenizer:…

n-gram elasticsearch

asked Feb 18 '11 at 17:43

asleepysamurai

2 answers

ElasticSearch use "best match" of ngram terms instead of "synonym"?

Is it possible to tell ElasticSearch to use "best match" of all grams instead of using grams as synonyms? By default ElasticSearch uses grams as synonyms and returns poorly matching documents. It's better to showcase with example, let's say we have…

elasticsearch n-gram trigram

asked Dec 09 '17 at 13:17

Alex Craft

5 answers

Creating a dictionary for each word in a file and counting the frequency of words that follow it

I am trying to solve a difficult problem and am getting lost. Here's what I'm supposed to do: INPUT: file OUTPUT: dictionary Return a dictionary whose keys are all the words in the file (broken by whitespace). The value for each word is a…

python dictionary nltk counter n-gram

asked Jun 23 '17 at 20:22

Kristie

2 answers

Finding conditional probability of trigram in python nltk

I have started learning NLTK and I am following a tutorial from here, where they find conditional probability using bigrams like this. import nltk from nltk.corpus import brown cfreq_brown_2gram =…

python nlp nltk n-gram

asked Jun 28 '16 at 06:25

Riken Shah

8 answers

The n-gram that is the most frequent one among all the words

I came across the following programming interview problem: Challenge 1: N-grams An N-gram is a sequence of N consecutive characters from a given word. For the word "pilot" there are three 3-grams: "pil", "ilo" and "lot". For a given set of words and…

c algorithm n-gram

asked Sep 04 '14 at 00:27

andrestoga
