Questions tagged [n-gram]

An N-gram is an ordered sequence of N elements of the same kind, usually drawn from a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, the bases in DNA sequences, etc. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams" (sometimes "digrams"), 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams". N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
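The same 2-grams can be produced with a short sliding-window function (a minimal sketch; the helper name `ngrams` is illustrative, not a library API):

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Three", "cows", "eat", "grass", "#"]  # "#" marks end of sentence
print(ngrams(tokens, 2))
# [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]
```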

As N-gram analysis embeds the data set into a vector space, it allows the application of many powerful statistical techniques to data for prediction, classification, and discernment of various properties.
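A minimal sketch of that embedding, using only the standard library (the vocabulary and sample sentence here are chosen for illustration):

```python
from collections import Counter

def ngram_vector(tokens, n, vocabulary):
    # Count every n-gram in the token sequence, then read the counts off
    # in a fixed vocabulary order: each document becomes a point in a
    # len(vocabulary)-dimensional vector space.
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [counts[gram] for gram in vocabulary]

vocab = [("this", "is"), ("is", "a"), ("a", "test"), ("test", "this")]
print(ngram_vector("this is a test this is".split(), 2, vocab))
# [2, 1, 1, 1]
```

Once documents are vectors like this, standard tools such as cosine similarity, classifiers, and clustering apply directly.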

More information:

  1. Google's Ngram Viewer
  2. Wikipedia article
874 questions
6 votes · 2 answers

NLP algorithm to 'fill out' search terms

I'm trying to write an algorithm (which I'm assuming will rely on natural language processing techniques) to 'fill out' a list of search terms. There is probably a name for this kind of thing which I'm unaware of. What is this kind of problem…
Trindaz • 17,029 • 21 • 82 • 111
6 votes · 2 answers

n-gram modeling with java hashmap

I need to model a collection of n-grams (sequences of n words) and their contexts (words that appear near the n-gram, along with their frequency). My idea was this: public class Ngram { private String[] words; private HashMap
Nikola • 694 • 8 • 15
6 votes · 1 answer

How to get the probability of bigrams in a text of sentences?

I have a text which has many sentences. How can I use nltk.ngrams to process it? This is my code: sequence = nltk.tokenize.word_tokenize(raw) bigram = ngrams(sequence,2) freq_dist = nltk.FreqDist(bigram) prob_dist =…
Ahmad • 8,811 • 11 • 76 • 141
6 votes · 1 answer

String Matching Using TF-IDF, NGrams and Cosine Similarity in Python

I am working on my first major data science project. I am attempting to match names from a large list of data in one source to a cleansed dictionary in another. I am using this string matching blog as a guide. I am attempting to use two…
HMan06 • 755 • 2 • 9 • 23
6 votes · 1 answer

ElasticSearch Edge NGram vs Prefix query

Let's say we have a text field that is relatively short, let's say maximum 10 characters and is saved as a keyword. I want my users to be able to prefix-search this field (not autocomplete / search-as-you-type). I have read on Elastic's…
DotnetProg • 790 • 9 • 24
6 votes · 3 answers

most common 2-grams using python

Given a string: this is a test this is How can I find the top-n most common 2-grams? In the string above, all 2-grams are: {this is, is a, a test, test this, this is} As you can see, the 2-gram this is appears 2 times. Hence the result should be: {this…
stfd1123581321 • 163 • 1 • 2 • 6
6 votes · 2 answers

Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus

Following the many guides to creating biGrams using the 'tm' and 'RWeka' packages, I was getting frustrated that only 1-Grams were being returned in the tdm. Through much trial and error I discovered that proper function was achieved using 'VCorpus'…
Paul_J • 61 • 1 • 4
6 votes · 1 answer

n-gram sentence similarity with cosine similarity measurement

I have been working on a project about sentence similarity. I know this has been asked many times on SO, but I just want to know whether my problem can be accomplished with the method I am using, or whether I should change my approach to the…
Ahmet Keskin • 1,025 • 1 • 15 • 25
6 votes · 5 answers

Find the most frequently occurring words in a text in R

Can someone help me find the most frequently used two- and three-word phrases in a text using R? My text is... text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a…
Madhu Sareen • 549 • 1 • 8 • 20
6 votes · 3 answers

Overcoming MemoryError / Slow Runtime in Ashton String task

In the Ashton String task, the goal is to: Arrange all the distinct substrings of a given string in lexicographical order and concatenate them. Print the Kth character of the concatenated string. It is assured that given value of K will be …
alvas • 115,346 • 109 • 446 • 738
6 votes · 1 answer

How to use sklearn's CountVectorizer() to get ngrams that include any punctuation as separate tokens?

I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example: import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html ngram_size = 4 string = ["I really like python, it's pretty awesome."] vect =…
Franck Dernoncourt • 77,520 • 72 • 342 • 501
6 votes · 1 answer

How to extract character ngram from sentences? - python

The following word2ngrams function extracts character 3grams from a word: >>> x = 'foobar' >>> n = 3 >>> [x[i:i+n] for i in range(len(x)-n+1)] ['foo', 'oob', 'oba', 'bar'] This post shows the character ngrams extraction for a single word, Quick…
alvas • 115,346 • 109 • 446 • 738
6 votes · 2 answers

N-grams vs other classifiers in text categorization

I'm new to text categorization techniques. I want to know the difference between the N-gram approach to text categorization and text categorization based on other classifiers (decision tree, KNN, SVM). I want to know which one is better; does n-grams…
6 votes · 2 answers

NLTK makes it easy to compute bigrams of words. What about letters?

I've seen tons of documentation all over the web about how the python NLTK makes it easy to compute bigrams of words. What about letters? What I want to do is plug in a dictionary and have it tell me the relative frequencies of different letter…
isthmuses • 1,316 • 1 • 17 • 27
5 votes · 3 answers

Package to generate n-gram language models with smoothing? (Alternatives to NLTK)

I'd like to find some type of package or module (preferably Python or Perl, but others would do) that automatically generates n-gram probabilities from an input text and can automatically apply one or more smoothing algorithms as well. That is, I am…
Alan H. • 263 • 3 • 8