Questions tagged [n-gram]

An N-gram is an ordered collection of N elements of the same kind, usually presented in a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, nucleotide bases in DNA, etc. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams" (also "digrams"), 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams". N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
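
A minimal sketch of that example, assuming Python and using "#" as the end-of-sentence marker from the text above:

    def word_ngrams(sentence, n=2, eos="#"):
        # Tokenize on whitespace, drop the final period, append the end marker.
        tokens = sentence.rstrip(".").split() + [eos]
        # Slide a window of length n over the token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(word_ngrams("Three cows eat grass."))
    # [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]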

As N-gram analysis embeds the data set into a vector space, it allows the application of many powerful statistical techniques to data for prediction, classification, and discernment of various properties.
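
As an illustration of that embedding, here is a small sketch (assuming scikit-learn, which is not mentioned above) that turns documents into vectors of unigram and bigram counts:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["Three cows eat grass.", "Cows eat a lot of grass."]
    # ngram_range=(1, 2) extracts both unigrams and bigrams as features.
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(docs)          # sparse document-by-ngram matrix
    print(vectorizer.get_feature_names_out())   # the unigram/bigram vocabulary
    print(X.toarray())                          # one count vector per document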

More information:

  1. Google's Ngram Viewer
  2. Wikipedia article
874 questions
9
votes
5 answers

When are n-grams (n>3) important as opposed to just bigrams or trigrams?

I am just wondering what the use of n-grams (n>3) (and their occurrence frequency) is, considering the computational overhead of computing them. Are there any applications where bigrams or trigrams are simply not enough? If so, what is the…
Legend
  • 113,822
  • 119
  • 272
  • 400
8
votes
3 answers

How to generate n-grams in scala?

I am trying to code the dissociated press algorithm based on n-grams in Scala. How do I generate n-grams for a large file? For example, for the file containing "the bee is the bee of the bees". First it has to pick a random n-gram. For example, the…
user1002579
  • 129
  • 2
  • 8
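
The question above asks for Scala; as a language-neutral sketch of the same sliding-window idea (written here in Python, reusing the sample sentence from the question), dissociated-press-style generation can look like this:

    import random

    text = "the bee is the bee of the bees"
    tokens = text.split()

    # Build bigrams with a sliding window, then map each word to its successors.
    bigrams = [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    successors = {}
    for w1, w2 in bigrams:
        successors.setdefault(w1, []).append(w2)

    # Start from a random word and repeatedly follow a random successor.
    word = random.choice(tokens)
    output = [word]
    for _ in range(8):
        choices = successors.get(word)
        if not choices:
            break
        word = random.choice(choices)
        output.append(word)
    print(" ".join(output))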
8
votes
2 answers

Trying to set the max_gram and min_gram in Elasticsearch

I'm trying to deploy a Ruby on Rails app on an Ubuntu 16.04 EC2 server, but it gives an error about the difference between max_gram and min_gram in Elasticsearch. I don't have any experience with Elasticsearch yet, so I'm totally lost here and I need…
Cesar Rodriguez
  • 83
  • 1
  • 1
  • 5
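
A common cause of that error in recent Elasticsearch versions is that max_gram - min_gram exceeds the index.max_ngram_diff limit (default 1). Below is a hedged sketch of index settings that raise the limit; the index and analyzer names are made up for illustration, and any HTTP client (or the official elasticsearch package) could send them:

    import requests

    settings = {
        "settings": {
            "index": {"max_ngram_diff": 7},        # allow max_gram - min_gram up to 7
            "analysis": {
                "tokenizer": {
                    "my_ngram_tokenizer": {
                        "type": "ngram",
                        "min_gram": 3,
                        "max_gram": 10,            # now within the allowed difference
                    }
                },
                "analyzer": {
                    "my_ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "my_ngram_tokenizer",
                        "filter": ["lowercase"],
                    }
                },
            },
        }
    }

    # Create the (hypothetical) index with these settings.
    requests.put("http://localhost:9200/my_index", json=settings)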
8
votes
3 answers

Sequence prediction of characters?

I am new to machine learning, so please go easy in case the problem is trivial. I have been given a sequence of observed characters, say ABABBABBB..... (n characters). My goal is to predict the next characters by some "learning" mechanisms. My…
suzee
  • 563
  • 4
  • 25
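
A minimal sketch of the classic n-gram (Markov) approach to this kind of question: count which character follows each length-(n-1) context, then predict the most frequent continuation of the current context.

    from collections import Counter, defaultdict

    def train(seq, n=3):
        # Map each (n-1)-character context to a counter of next characters.
        model = defaultdict(Counter)
        for i in range(len(seq) - n + 1):
            context, nxt = seq[i:i + n - 1], seq[i + n - 1]
            model[context][nxt] += 1
        return model

    def predict(model, context):
        counts = model.get(context)
        return counts.most_common(1)[0][0] if counts else None

    model = train("ABABBABBB", n=3)
    print(predict(model, "BB"))   # most frequent character seen after "BB"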
8
votes
2 answers

Counting bigrams real fast (with or without multiprocessing) - python

Given the big.txt from norvig.com/big.txt, the goal is to count the bigrams really fast (Imagine that I have to repeat this counting 100,000 times). According to Fast/Optimize N-gram implementations in python, extracting bigrams like this would be…
alvas
  • 115,346
  • 109
  • 446
  • 738
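
A hedged baseline (not necessarily the fastest approach the question is after): count word bigrams with zip() and collections.Counter, assuming big.txt from norvig.com/big.txt has been downloaded locally.

    import re
    from collections import Counter

    with open("big.txt", encoding="utf-8") as f:
        words = re.findall(r"\w+", f.read().lower())   # simple word tokenization

    # Pair each word with its successor and count the pairs in one pass.
    bigram_counts = Counter(zip(words, words[1:]))
    print(bigram_counts.most_common(5))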
8
votes
5 answers

Detecting random keyboard hits considering QWERTY keyboard layout

The winner of a recent Wikipedia vandalism detection competition suggests that detection could be improved by "detecting random keyboard hits considering QWERTY keyboard layout". Example: woijf qoeoifwjf oiiwjf oiwj pfowjfoiwjfo oiwjfoewoh. Is there…
Nicolas Raoul
  • 58,567
  • 58
  • 222
  • 373
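
One common technique for this (not necessarily what the competition winner did, and it ignores the keyboard-layout aspect of the question) is to score strings by character-bigram log-probabilities learned from ordinary text; keyboard mashing tends to produce rare bigrams and scores much lower. A tiny sketch, with a toy corpus standing in for real training text:

    import math
    from collections import Counter

    corpus = "the quick brown fox jumps over the lazy dog and then runs away"
    bigrams = Counter(zip(corpus, corpus[1:]))
    total = sum(bigrams.values())

    def score(text):
        # Average log-probability per character bigram, add-one smoothed.
        pairs = list(zip(text, text[1:]))
        if not pairs:
            return 0.0
        logp = sum(math.log((bigrams[p] + 1) / (total + 1)) for p in pairs)
        return logp / len(pairs)

    print(score("the brown dog"))            # less negative: looks like English
    print(score("woijf qoeoifwjf oiiwjf"))   # more negative: looks like mashing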
8
votes
1 answer

Train NGramModel in Python

I am using Python 3.5, installed and managed with Anaconda. I want to train an NGramModel (from nltk) using some text. My installation does not find the module nltk.model. There are some possible answers to this question (pick the correct one, and…
Trylks
  • 1,458
  • 2
  • 18
  • 31
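
In recent NLTK releases the removed nltk.model module is replaced by nltk.lm; here is a hedged sketch of training a bigram model with that API (assuming NLTK 3.4 or later, with a made-up toy corpus):

    from nltk.lm import MLE
    from nltk.lm.preprocessing import padded_everygram_pipeline

    tokenized = [["the", "bee", "is", "the", "bee", "of", "the", "bees"]]
    n = 2
    train_data, vocab = padded_everygram_pipeline(n, tokenized)

    model = MLE(n)                      # maximum-likelihood n-gram model
    model.fit(train_data, vocab)
    print(model.score("bee", ["the"]))  # P(bee | the) under the trained model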
8
votes
3 answers

Bytes vs Characters vs Words - which granularity for n-grams?

At least 3 types of n-grams can be considered for representing text documents: byte-level n-grams, character-level n-grams, and word-level n-grams. It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read…
usual me
  • 8,338
  • 10
  • 52
  • 95
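
A small sketch contrasting two of those granularities on the same text, using scikit-learn (byte-level n-grams would work the same way on text.encode("utf-8")):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["three cows eat grass"]

    char_vec = CountVectorizer(analyzer="char", ngram_range=(3, 3))   # character trigrams
    word_vec = CountVectorizer(analyzer="word", ngram_range=(2, 2))   # word bigrams

    char_vec.fit(docs)
    word_vec.fit(docs)
    print(char_vec.get_feature_names_out()[:5])   # a few character trigrams
    print(word_vec.get_feature_names_out())       # word bigrams like 'three cows'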
8
votes
2 answers

Understanding cyclic polynomial hash collisions

I have a code that uses a cyclic polynomial rolling hash (Buzhash) to compute hash values of n-grams of source code. If I use small hash values (7-8 bits) then there are some collisions, i.e. different n-grams map to the same hash value. If I…
csprajeeth
  • 237
  • 2
  • 10
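
Not Buzhash itself, but a hedged illustration of why 7-8 bit hash values collide: with only 2**8 = 256 possible values, distinct n-grams quickly start sharing buckets.

    import hashlib

    def small_hash(ngram, bits=8):
        # Keep only the low `bits` bits of a full-width hash.
        digest = hashlib.md5(ngram.encode("utf-8")).digest()
        return digest[0] & ((1 << bits) - 1)

    text = "the quick brown fox jumps over the lazy dog" * 3
    ngrams = {text[i:i + 5] for i in range(len(text) - 4)}   # distinct 5-grams

    seen = {}
    collisions = 0
    for g in sorted(ngrams):
        h = small_hash(g)
        if h in seen and seen[h] != g:
            collisions += 1
        seen.setdefault(h, g)
    print(len(ngrams), "distinct 5-grams,", collisions, "landed on an occupied value")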
7
votes
0 answers

MySQL ngram fulltext index doesn't work with utf8mb4_bin

I'm using utf8mb4_bin for the title column, so I expected its fulltext search to be case-sensitive. But the query actually returns empty. CREATE TABLE `test_table` ( `id` int NOT NULL AUTO_INCREMENT, `title` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin…
7
votes
2 answers

TF-IDF vectorizer to extract ngrams

How can I use TF-IDF vectorizer from the scikit-learn library to extract unigrams and bigrams of tweets? I want to train a classifier with the output. This is the code from scikit-learn: from sklearn.feature_extraction.text import…
ECub Devs
  • 165
  • 3
  • 10
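
A minimal sketch of the scikit-learn call the question refers to: TfidfVectorizer with ngram_range=(1, 2) yields unigram and bigram TF-IDF features that can feed a classifier (the tweets below are placeholders).

    from sklearn.feature_extraction.text import TfidfVectorizer

    tweets = ["I love n-grams", "n-grams love me", "bigrams are n-grams too"]
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(tweets)        # feature matrix for a classifier
    print(vectorizer.get_feature_names_out())   # extracted unigrams and bigrams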
7
votes
1 answer

Token pattern for n-gram in TfidfVectorizer in python

Does TfidfVectorizer identify n-grams using Python regular expressions? This issue arose while reading the documentation for scikit-learn's TfidfVectorizer: I see that the pattern to recognize n-grams at the word level is…
nikosd
  • 919
  • 3
  • 16
  • 26
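
For context, token_pattern in scikit-learn matches individual tokens only; the n-grams are assembled from those tokens afterwards, so the regex never sees an n-gram. A small sketch with a custom pattern that also keeps one-character tokens (the default pattern drops them):

    from sklearn.feature_extraction.text import TfidfVectorizer

    vec = TfidfVectorizer(
        token_pattern=r"(?u)\b\w+\b",   # keep single-character tokens too
        ngram_range=(1, 2),             # n-grams are built from the matched tokens
    )
    vec.fit(["a b ab abc"])
    print(vec.get_feature_names_out())
    # includes 'a', 'b', 'ab', 'abc' and bigrams such as 'a b', 'ab abc'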
7
votes
2 answers

How to get n-gram collocations and association in python nltk?

In this documentation, there is an example using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder. There is an example method to find the nbest based on PMI for bigrams and…
Fahmi Rizal
  • 137
  • 2
  • 9
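
A short sketch of the NLTK collocation API the question cites, finding the bigrams with the highest PMI in a toy word list:

    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    words = "the bee is the bee of the bees and the bee likes the hive".split()

    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    print(finder.nbest(bigram_measures.pmi, 5))   # five bigrams with highest PMI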
7
votes
2 answers

Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights

When I use an analyzer with edgengram (min=3, max=7, front) + term_vector=with_positions_offsets, with a document having text = "CouchDB", when I search for "couc" my highlight is on "cou" and not "couc". It seems my highlight is only on the minimum…
Sebastien Lorber
  • 89,644
  • 67
  • 288
  • 419
7
votes
3 answers

n-gram name analysis in non-english languages (CJK, etc)

I'm working on deduping a database of people. For a first pass, I'm following a basic 2-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": iterate over the whole dataset, and bin each…
Matt Luongo
  • 14,371
  • 6
  • 53
  • 64
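
One common blocking trick for names, including CJK names that lack whitespace, is comparing character n-gram sets; here is a hedged sketch using Jaccard similarity over character bigrams (the names below are made up):

    def char_ngrams(name, n=2):
        # Strip spaces so "山田 太郎" and "山田太郎" produce the same grams.
        name = name.replace(" ", "")
        return {name[i:i + n] for i in range(max(len(name) - n + 1, 1))}

    def jaccard(a, b):
        a, b = char_ngrams(a), char_ngrams(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    print(jaccard("山田太郎", "山田 太郎"))   # 1.0: same name, spacing differs
    print(jaccard("山田太郎", "田中花子"))    # 0.0: no shared character bigrams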