Questions tagged [n-gram]

An N-gram is an ordered collection of N elements of the same kind, usually drawn from a larger collection of similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, nucleotide bases in DNA, etc. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams," 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams." N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
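
As a small illustration, the sketch below (Python, with a hypothetical helper name word_ngrams) slides a window of size 2 over the tokenized sentence plus an end-of-sentence marker to produce exactly these 2-grams:

    # Sketch: build word 2-grams, appending "#" as the end-of-sentence marker.
    def word_ngrams(tokens, n=2, eos_marker="#"):
        padded = tokens + [eos_marker]
        return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

    tokens = "Three cows eat grass".split()
    print(word_ngrams(tokens, n=2))
    # [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]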

As N-gram analysis embeds the data set into a vector space, it allows the application of many powerful statistical techniques to data for prediction, classification, and discernment of various properties.
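
For example, one common embedding simply counts how often each N-gram occurs in each document. A minimal sketch, assuming scikit-learn is available and using made-up example documents:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["Three cows eat grass.", "Cows eat a lot of grass."]

    # Each document becomes a vector of unigram and bigram counts.
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # the unigram/bigram vocabulary
    print(X.toarray())                         # one count vector per document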

More information:

  1. Google's Ngram Viewer
  2. Wikipedia article
874 questions
0 votes, 1 answer

Create NGram chart from Solr data with Shingles?

I have been given the task of creating a Google-like Ngram view/chart of a data set. The chart is basically just a line chart of terms (ngrams) over time. I don't have any experience with Solr but have been given a core containing a lot of data and…
Paul M • 3,937 • 9 • 45 • 53
0 votes, 1 answer

solr use both n-gram search and default search

I'm trying to create a corpus using Solr. I have a field named "content" and I need to index and search bigrams and trigrams. I also need to index and search using the default search behaviour. How can I configure this?
Lahiru • 2,609 • 3 • 18 • 29
0 votes, 0 answers

NLTK Python: finding subcategories

I am new to NLP and NLTK in particular. I have a list of products such as: shampoo, toothpaste, vitamins, brushes, soap, etc. 1- or 4-grams. What I am trying to do is to walk several levels up the taxonomy tree and output a set of categories, like…
mel • 1,566 • 5 • 17 • 29
0 votes, 2 answers

Count of an ngram in the brown news corpus?

I know NLTK can tell you the likelihood of a word within a given context (see "nltk language model (ngram) calculate the prob of a word from context"). But can it tell you the count (or likelihood) of a given ngram within the Brown corpus? For instance, can…
bernie2436 • 22,841 • 49 • 151 • 244
0 votes, 1 answer

Elasticsearch - nGram on documents but not the search terms

I apparently misunderstood how nGram works with Elasticsearch. I wanted to be able to efficiently search for a substring. That way I could type 'loud' and still find words like 'clouds'. I have my nGram tokenizer set up to have min=2 and…
Travis Parks • 8,435 • 12 • 52 • 85
0 votes, 2 answers

How to estimate ngram probability?

I want to build a language model in which I need to estimate the ngram probabilities. So, my question is: what are the best corpora and tools that we could use to estimate the ngram probabilities? Thanks
Riadh Belkebir • 797 • 1 • 12 • 34
0 votes, 1 answer

Document multi-label classification - where do you get the labels? Ontology?

I am familiar with data mining techniques but not so much with text mining or Web mining. Here is a simple task: classify articles into a set of categories. Let us assume I have extracted the text content of the article and processed it. How and where do…
mel • 1,566 • 5 • 17 • 29
0 votes, 3 answers

SOLR eDISMAX product search

I'm new to SOLR and am implementing it to search our product catalog. I'm creating ngrams and edge ngrams on the brand name, display name and category fields. I'm using edismax and have qf defined as displayname_nge displayname_ng category_nge…
whitemtnelf • 181 • 2 • 10
0 votes, 1 answer

Can I create the full text as an index along with NGramFilterFactory

I defined a field type text_ngram.
buddy86 • 1,434 • 12 • 20
0 votes, 1 answer

How to search a corpus to find frequency of a string?

I'm working on an NLP project and I'd like to search through a corpus of text to try to find the frequency of a given verb-object pair. The aim would be to find which verb-object pair is most likely when given a few different possibilities. For…
user3163073 • 11 • 1 • 3
0 votes, 1 answer

sklearn CountVectorizer TypeError: refuses 'ngram_range' other than (1,1)

Is there a bug in Python 2.7.3 in sklearn CountVectorizer? A previous post mentioned an earlier bug. Here is my simple input and I get a TypeError.
    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> ngram_vectorizer =…
brendan8229 • 47 • 1 • 2 • 9
0 votes, 1 answer

grouping all Named entities in a Document

I would like to group all named entities in a given document. For example: "**Barack Hussein Obama** II is the 44th and current President of the United States, and the first African American to hold the office." I do not want to use OpenNLP APIs as…
Yogi • 1,035 • 2 • 13 • 39
0 votes, 1 answer

How to use Lucene ShingleFilter: Could not find implementing class for org.apache.lucene.analysis.tokenattributes.OffsetAttribute

Code is here: github link Error is: ren: null at []]: java.lang.IllegalArgumentException: Could not find implementing class for org.apache.lucene.analysis.tokenattributes.OffsetAttribute at…
rjurney • 4,824 • 5 • 41 • 62
0 votes, 2 answers

Elasticsearch starts with, multiple words

I'm trying to implement an autocomplete feature from phrases that contain multiple words. I want to be able to match only the beginning of words (edgeNGram?), but for every word searched. For example if I search for "monitor", I should receive all…
user3172790
0 votes, 0 answers

ElasticSearch: a gibberish query still returns results. How to ensure quality?

I implemented a custom filter which uses the EdgeGram tokenizer. The problem I face is that whether I search for something relevant or total garbage, I get a large number of hits. I suspect that this is due to the fact that I'm using an EdgeNgram…
Karan Verma • 1,721 • 1 • 15 • 24