Questions tagged [n-gram]

An N-gram is an ordered sequence of N elements of the same kind, usually drawn from a large collection of many other similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, nucleotides in DNA, and amino acids in proteins. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams" (or "digrams"), 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams". N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
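The example above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `word_ngrams` and the `end_marker` parameter are made up for this sketch, and the whitespace split stands in for a real tokenizer.

```python
def word_ngrams(sentence, n=2, end_marker="#"):
    """Return word-level n-grams, with an end-of-sentence marker appended."""
    # Drop the final period and split on whitespace (a deliberately naive tokenizer).
    tokens = sentence.rstrip(".").split()
    # Append the metadata marker so the last word also starts an n-gram.
    tokens.append(end_marker)
    # Slide a window of width n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("Three cows eat grass.")
print(bigrams)
# [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]
```

Passing a different `n` gives unigrams, trigrams, and so on from the same window logic.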

Because N-gram counts can be represented as vectors, N-gram analysis embeds the data set into a vector space. This allows many powerful statistical techniques to be applied to the data for prediction, classification, and discernment of various properties.
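One way to picture the vector-space view is to treat each distinct bigram as a dimension and each document's bigram counts as its coordinates. The sketch below compares two short texts by cosine similarity. The helper names `bigram_counts` and `cosine_similarity` are hypothetical, and lowercasing plus whitespace splitting is an assumed, simplistic preprocessing step.

```python
from collections import Counter
import math


def bigram_counts(text):
    """Map a text to a sparse vector of word-bigram counts."""
    tokens = text.lower().split()
    # zip pairs each token with its successor, yielding the bigrams.
    return Counter(zip(tokens, tokens[1:]))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    # Only bigrams present in both vectors contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = bigram_counts("three cows eat grass")
doc2 = bigram_counts("three cows eat hay")
similarity = cosine_similarity(doc1, doc2)
print(similarity)
```

Here the two texts share two of their three bigrams, so the similarity is high but below 1. The same count vectors could be fed to a classifier or weighted with tf-idf instead.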

More information:

  1. Google's Ngram Viewer
  2. Wikipedia article
874 questions
-1
votes
1 answer

Python: Perform Count Operations on a List of Lists

I'm trying to count the number of bigrams in a large group of text. I have already taken the text line by line from standard input, cleaned the text, and generated my bigrams. Now I have a nested loop that looks like this line by…
chaztize7
  • 63
  • 5
-1
votes
2 answers

Most common sentences extractions with count using Python

I want to write a Python Script that searches all Excel rows and returns top 10 most common sentences. I have written the basics of ngrams for a txt file. The file contains csv text with dj is best 4 times and gd is cool 3 times. import nltk import…
DJKarma
  • 172
  • 9
-1
votes
1 answer

Comparison of two documents in python

Given two documents, I wish to calculate the similarity between them. I have measures to find out the cosine distance, N-Gram and tf-idf using this: This is a previously asked question I wish to know, what further needs to be done using these…
Chinmay Joshi
  • 89
  • 1
  • 9
-1
votes
1 answer

Count frequency of n-gram in text using R

I am using R to read the text. A passage consists of 100 sentences, then it is put in a list, the list is like: [[1]] [1] "WigWagCo: For #TBT here's a video of Travis McCollum (Co-Founder and COO of WigWag) at #SXSW2016 [[2]] [1] "chrisreedfilm:…
Paul
  • 3
  • 3
-1
votes
1 answer

How to create the bigram matrix?

I want to make a matrix of the bigram model. How can I do it? Any suggestions which match my code, please? import nltk from collections import Counter import codecs with codecs.open("Pezeshki339.txt",'r','utf8') as file: for line in…
marysd
  • 99
  • 1
  • 10
-1
votes
4 answers

PHP find n-grams in an array

I have a PHP array: $excerpts = array( 'I love cheap red apples', 'Cheap red apples are what I love', 'Do you sell cheap red apples?', 'I want red apples', 'Give me my red apples', 'OK now where are my apples?' ); I would…
mattspain
  • 723
  • 9
  • 18
-1
votes
1 answer

Could someone explain how to write an n-gram query in Java using Lucene?

I have a requirement to incorporate N-grams in my search engine and am using Lucene 4.4 as my search engine. Basically, I am having a hard time learning N-grams. Could someone help me out by showing some simple steps? Thanks in advance!
-1
votes
1 answer

Counting di-Amino Acid frequencies (Bigram frequencies) from FASTA files

Given a large amount of FASTA files (the peptidome for various organisms for secreted peptides), how can I read the FASTA files (from UNIProt) with Python (Or Matlab), and count the frequencies of each Amino Acid, and of amino-acid "double"…
GrimSqueaker
  • 412
  • 5
  • 17
-2
votes
3 answers

Generating bi-grams from a sentence in the list

I have a list which contains sentences split up from a test paragraph. I'm trying to generate bi-grams from this list of sentences, but I'm getting: My code: ..... print (words3) print (words4)
user1052462
  • 168
  • 3
  • 12
-2
votes
1 answer

How could I implement this for four elements, instead of 2?

for l in f: w1, w2 = l.strip().split('\t') I'm doing computerized text analysis in Python. This is the original code for splitting bigram elements (words) in a list of bigrams deemed significant (and the overall code is of course longer and…
-2
votes
1 answer

Convert ngrams into a frequency dictionary in Python

Can anybody help with a function to convert the following ngram into the result below? The return should concatenate the first N-1 elements of the ngram and count how often the different successors (Nth element) occur. I was thinking of some nested…
Nicolas
  • 31
  • 2
-2
votes
1 answer

What are N-grams?

What are N-grams? I want to find N-grams for n=4 (fourgram), n=5 (fivegram), n=6 (sixgram), n=7(sevengram) for the Sentence - "dog that barks does not bite" I know- Unigrams(n=1): dog, that, barks, does, not, bite Bigrams(n=2): dog that, that barks,…
user8487038
-2
votes
3 answers

Read from txt file and divide words

I would like to create a program in Python that reads a txt file as input from the user. Then I would like for the program to separate the words as follows in the example below: At the time of his accession, the Swedish Riksdag held more power than…
-2
votes
1 answer

Using known python packages for implementing N-Gram, TF-IDF and Cosine similarity

I'm trying to implement a similarity function using N-Grams TF-IDF Cosine Similarity Example Concept: words = [...] word = '...' similarity = predict(words,word) def predict(words,word): words_ngrams = create_ngrams(words,range=(2,4)) …
Sahar Millis
  • 801
  • 2
  • 13
  • 21
-2
votes
1 answer

Generate n-1 grams and n grams using Python

Hi, I am trying to generate n and n-1 grams and to compute the probabilities of the ngrams. However, the n-1 grams generated are not taking the last element of each sublist. Can somebody help me figure out where I am going wrong? Input: input1 =…
user3320097
  • 11
  • 1
  • 7