Questions tagged [n-gram]

An N-gram is an ordered collection of N elements of the same kind, usually presented in a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, nucleotide bases in DNA, etc. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams" (also "digrams"), 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams". N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
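
A minimal sketch of that example, assuming Python and using "#" as the end-of-sentence marker from the text above:

    def word_ngrams(sentence, n=2, eos="#"):
        # Tokenize on whitespace, drop the final period, append the end marker.
        tokens = sentence.rstrip(".").split() + [eos]
        # Slide a window of length n over the token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(word_ngrams("Three cows eat grass."))
    # [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]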

As N-gram analysis embeds the data set into a vector space, it allows the application of many powerful statistical techniques to data for prediction, classification, and discernment of various properties.
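
As an illustration of that embedding, here is a small sketch (assuming scikit-learn, which is not mentioned above) that turns documents into vectors of unigram and bigram counts:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["Three cows eat grass.", "Cows eat a lot of grass."]
    # ngram_range=(1, 2) extracts both unigrams and bigrams as features.
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(docs)          # sparse document-by-ngram matrix
    print(vectorizer.get_feature_names_out())   # the unigram/bigram vocabulary
    print(X.toarray())                          # one count vector per document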

More information:

  1. Google's Ngram Viewer
  2. Wikipedia article
874 questions
9
votes
5 answers

When are n-grams (n>3) important as opposed to just bigrams or trigrams?

I am just wondering what the use of n-grams (n>3) (and their occurrence frequency) is, considering the computational overhead of computing them. Are there any applications where bigrams or trigrams are simply not enough? If so, what is the…
Legend
  • 113,822
  • 119
  • 272
  • 400
8
votes
3 answers

How to generate n-grams in scala?

I am trying to code the dissociated press algorithm based on n-grams in Scala. How do I generate n-grams for a large file? For example, for the file containing "the bee is the bee of the bees". First it has to pick a random n-gram. For example, the…
user1002579
  • 129
  • 2
  • 8
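
The question above asks for Scala; as a language-neutral sketch of the same sliding-window idea (written here in Python, reusing the sample sentence from the question), dissociated-press-style generation can look like this:

    import random

    text = "the bee is the bee of the bees"
    tokens = text.split()

    # Build bigrams with a sliding window, then map each word to its successors.
    bigrams = [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    successors = {}
    for w1, w2 in bigrams:
        successors.setdefault(w1, []).append(w2)

    # Start from a random word and repeatedly follow a random successor.
    word = random.choice(tokens)
    output = [word]
    for _ in range(8):
        choices = successors.get(word)
        if not choices:
            break
        word = random.choice(choices)
        output.append(word)
    print(" ".join(output))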
8
votes
2 answers

Trying to set the max_gram and min_gram in Elasticsearch

I'm trying to deploy a Ruby on Rails app on an Ubuntu 16.04 EC2 server, but it gives an error about the difference between max_gram and min_gram in Elasticsearch. I don't have any experience with Elasticsearch yet, so I'm totally lost here and I need…
Cesar Rodriguez
  • 83
  • 1
  • 1
  • 5
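
A common cause of that error in recent Elasticsearch versions is that max_gram - min_gram exceeds the index.max_ngram_diff limit (default 1). Below is a hedged sketch of index settings that raise the limit; the index and analyzer names are made up for illustration, and any HTTP client (or the official elasticsearch package) could send them:

    import requests

    settings = {
        "settings": {
            "index": {"max_ngram_diff": 7},        # allow max_gram - min_gram up to 7
            "analysis": {
                "tokenizer": {
                    "my_ngram_tokenizer": {
                        "type": "ngram",
                        "min_gram": 3,
                        "max_gram": 10,            # now within the allowed difference
                    }
                },
                "analyzer": {
                    "my_ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "my_ngram_tokenizer",
                        "filter": ["lowercase"],
                    }
                },
            },
        }
    }

    # Create the (hypothetical) index with these settings.
    requests.put("http://localhost:9200/my_index", json=settings)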
8
votes
3 answers

Sequence prediction of characters?

I am new to machine learning, so please go easy in case the problem is trivial. I have been given a sequence of observed characters, say ABABBABBB..... (n characters). My goal is to predict the next characters by some "learning" mechanisms. My…
suzee
  • 563
  • 4
  • 25
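
A minimal sketch of the classic n-gram (Markov) approach to this kind of question: count which character follows each length-(n-1) context, then predict the most frequent continuation of the current context.

    from collections import Counter, defaultdict

    def train(seq, n=3):
        # Map each (n-1)-character context to a counter of next characters.
        model = defaultdict(Counter)
        for i in range(len(seq) - n + 1):
            context, nxt = seq[i:i + n - 1], seq[i + n - 1]
            model[context][nxt] += 1
        return model

    def predict(model, context):
        counts = model.get(context)
        return counts.most_common(1)[0][0] if counts else None

    model = train("ABABBABBB", n=3)
    print(predict(model, "BB"))   # most frequent character seen after "BB"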
8
votes
2 answers

Counting bigrams real fast (with or without multiprocessing) - python

Given the big.txt from norvig.com/big.txt, the goal is to count the bigrams really fast (Imagine that I have to repeat this counting 100,000 times). According to Fast/Optimize N-gram implementations in python, extracting bigrams like this would be…
alvas
  • 115,346
  • 109
  • 446
  • 738
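
A hedged baseline (not necessarily the fastest approach the question is after): count word bigrams with zip() and collections.Counter, assuming big.txt from norvig.com/big.txt has been downloaded locally.

    import re
    from collections import Counter

    with open("big.txt", encoding="utf-8") as f:
        words = re.findall(r"\w+", f.read().lower())   # simple word tokenization

    # Pair each word with its successor and count the pairs in one pass.
    bigram_counts = Counter(zip(words, words[1:]))
    print(bigram_counts.most_common(5))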
8
votes
5 answers

Detecting random keyboard hits considering QWERTY keyboard layout

The winner of a recent Wikipedia vandalism detection competition suggests that detection could be improved by "detecting random keyboard hits considering QWERTY keyboard layout". Example: woijf qoeoifwjf oiiwjf oiwj pfowjfoiwjfo oiwjfoewoh. Is there…
Nicolas Raoul
  • 58,567
  • 58
  • 222
  • 373
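
One common technique for this (not necessarily what the competition winner did, and it ignores the keyboard-layout aspect of the question) is to score strings by character-bigram log-probabilities learned from ordinary text; keyboard mashing tends to produce rare bigrams and scores much lower. A tiny sketch, with a toy corpus standing in for real training text:

    import math
    from collections import Counter

    corpus = "the quick brown fox jumps over the lazy dog and then runs away"
    bigrams = Counter(zip(corpus, corpus[1:]))
    total = sum(bigrams.values())

    def score(text):
        # Average log-probability per character bigram, add-one smoothed.
        pairs = list(zip(text, text[1:]))
        if not pairs:
            return 0.0
        logp = sum(math.log((bigrams[p] + 1) / (total + 1)) for p in pairs)
        return logp / len(pairs)

    print(score("the brown dog"))            # less negative: looks like English
    print(score("woijf qoeoifwjf oiiwjf"))   # more negative: looks like mashing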
8
votes
1 answer

Train NGramModel in Python

I am using Python 3.5, installed and managed with Anaconda. I want to train an NGramModel (from nltk) using some text. My installation does not find the module nltk.model. There are some possible answers to this question (pick the correct one, and…
Trylks
  • 1,458
  • 2
  • 18
  • 31
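
In recent NLTK releases the removed nltk.model module is replaced by nltk.lm; here is a hedged sketch of training a bigram model with that API (assuming NLTK 3.4 or later, with a made-up toy corpus):

    from nltk.lm import MLE
    from nltk.lm.preprocessing import padded_everygram_pipeline

    tokenized = [["the", "bee", "is", "the", "bee", "of", "the", "bees"]]
    n = 2
    train_data, vocab = padded_everygram_pipeline(n, tokenized)

    model = MLE(n)                      # maximum-likelihood n-gram model
    model.fit(train_data, vocab)
    print(model.score("bee", ["the"]))  # P(bee | the) under the trained model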
8
votes
3 answers

Bytes vs Characters vs Words - which granularity for n-grams?

At least 3 types of n-grams can be considered for representing text documents: byte-level n-grams, character-level n-grams, and word-level n-grams. It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read…
usual me
  • 8,338
  • 10
  • 52
  • 95
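
A small sketch contrasting two of those granularities on the same text, using scikit-learn (byte-level n-grams would work the same way on text.encode("utf-8")):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["three cows eat grass"]

    char_vec = CountVectorizer(analyzer="char", ngram_range=(3, 3))   # character trigrams
    word_vec = CountVectorizer(analyzer="word", ngram_range=(2, 2))   # word bigrams

    char_vec.fit(docs)
    word_vec.fit(docs)
    print(char_vec.get_feature_names_out()[:5])   # a few character trigrams
    print(word_vec.get_feature_names_out())       # word bigrams like 'three cows'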
8
votes
2 answers

Understanding cyclic polynomial hash collisions

I have a code that uses a cyclic polynomial rolling hash (Buzhash) to compute hash values of n-grams of source code. If I use small hash values (7-8 bits) then there are some collisions, i.e. different n-grams map to the same hash value. If I…
csprajeeth
  • 237
  • 2
  • 10
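
Not Buzhash itself, but a hedged illustration of why 7-8 bit hash values collide: with only 2**8 = 256 possible values, distinct n-grams quickly start sharing buckets.

    import hashlib

    def small_hash(ngram, bits=8):
        # Keep only the low `bits` bits of a full-width hash.
        digest = hashlib.md5(ngram.encode("utf-8")).digest()
        return digest[0] & ((1 << bits) - 1)

    text = "the quick brown fox jumps over the lazy dog" * 3
    ngrams = {text[i:i + 5] for i in range(len(text) - 4)}   # distinct 5-grams

    seen = {}
    collisions = 0
    for g in sorted(ngrams):
        h = small_hash(g)
        if h in seen and seen[h] != g:
            collisions += 1
        seen.setdefault(h, g)
    print(len(ngrams), "distinct 5-grams,", collisions, "landed on an occupied value")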
7
votes
0 answers

MySQL ngram fulltext index doesn't work with utf8mb4_bin

I'm using utf8mb4_bin for the title column, so I expected its fulltext search to be case-sensitive. But the query actually returns empty. CREATE TABLE `test_table` ( `id` int NOT NULL AUTO_INCREMENT, `title` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin…
7
votes
2 answers

TF-IDF vectorizer to extract ngrams

How can I use TF-IDF vectorizer from the scikit-learn library to extract unigrams and bigrams of tweets? I want to train a classifier with the output. This is the code from scikit-learn: from sklearn.feature_extraction.text import…
ECub Devs
  • 165
  • 3
  • 10
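
A minimal sketch of the scikit-learn call the question refers to: TfidfVectorizer with ngram_range=(1, 2) yields unigram and bigram TF-IDF features that can feed a classifier (the tweets below are placeholders).

    from sklearn.feature_extraction.text import TfidfVectorizer

    tweets = ["I love n-grams", "n-grams love me", "bigrams are n-grams too"]
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(tweets)        # feature matrix for a classifier
    print(vectorizer.get_feature_names_out())   # extracted unigrams and bigrams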
7
votes
1 answer

Token pattern for n-gram in TfidfVectorizer in python

Does TfidfVectorizer identify n-grams using Python regular expressions? This issue arose while reading the documentation for scikit-learn's TfidfVectorizer: I see that the pattern to recognize n-grams at the word level is…
nikosd
  • 919
  • 3
  • 16
  • 26
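
For context, token_pattern in scikit-learn matches individual tokens only; the n-grams are assembled from those tokens afterwards, so the regex never sees an n-gram. A small sketch with a custom pattern that also keeps one-character tokens (the default pattern drops them):

    from sklearn.feature_extraction.text import TfidfVectorizer

    vec = TfidfVectorizer(
        token_pattern=r"(?u)\b\w+\b",   # keep single-character tokens too
        ngram_range=(1, 2),             # n-grams are built from the matched tokens
    )
    vec.fit(["a b ab abc"])
    print(vec.get_feature_names_out())
    # includes 'a', 'b', 'ab', 'abc' and bigrams such as 'a b', 'ab abc'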
7
votes
2 answers

How to get n-gram collocations and association in python nltk?

In this documentation, there is an example using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder. There is an example method to find the nbest based on PMI for bigrams and…
Fahmi Rizal
  • 137
  • 2
  • 9
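
A short sketch of the NLTK collocation API the question cites, finding the bigrams with the highest PMI in a toy word list:

    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    words = "the bee is the bee of the bees and the bee likes the hive".split()

    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    print(finder.nbest(bigram_measures.pmi, 5))   # five bigrams with highest PMI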
7
votes
2 answers

Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights

When I use an analyzer with edgengram (min=3, max=7, front) + term_vector=with_positions_offsets, with a document having text = "CouchDB", when I search for "couc" my highlight is on "cou" and not "couc". It seems my highlight is only on the minimum…
Sebastien Lorber
  • 89,644
  • 67
  • 288
  • 419
7
votes
3 answers

n-gram name analysis in non-english languages (CJK, etc)

I'm working on deduping a database of people. For a first pass, I'm following a basic 2-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": iterate over the whole dataset, and bin each…
Matt Luongo
  • 14,371
  • 6
  • 53
  • 64
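
One common blocking trick for names, including CJK names that lack whitespace, is comparing character n-gram sets; here is a hedged sketch using Jaccard similarity over character bigrams (the names below are made up):

    def char_ngrams(name, n=2):
        # Strip spaces so "山田 太郎" and "山田太郎" produce the same grams.
        name = name.replace(" ", "")
        return {name[i:i + n] for i in range(max(len(name) - n + 1, 1))}

    def jaccard(a, b):
        a, b = char_ngrams(a), char_ngrams(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    print(jaccard("山田太郎", "山田 太郎"))   # 1.0: same name, spacing differs
    print(jaccard("山田太郎", "田中花子"))    # 0.0: no shared character bigrams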