Questions tagged [n-gram]

An N-gram is an ordered collection of N elements of the same kind, usually drawn from a larger collection of similar N-grams. The individual elements are commonly natural language words, though N-grams have been applied to many other data types, such as numbers, letters, nucleotide bases in DNA, etc. Statistical N-gram analysis is commonly performed as part of natural language processing, bioinformatics, and information theory.

N-grams may be derived for any positive integer N. 1-grams are called "unigrams," 2-grams are called "bigrams," 3-grams are called "trigrams," and higher-order N-grams are simply called by number, e.g. "4-grams." N-gram techniques may be applied to any kind of ordered data. Metadata such as end-of-sentence markers may or may not be included.

For example, using words as the elements and an N of 2, the English sentence "Three cows eat grass." could be broken into the 2-grams [{Three cows}, {cows eat}, {eat grass}, {grass #}], where # is a metadata marker denoting the end of the sentence.
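
As a small illustration, the sketch below (Python, with a hypothetical helper name word_ngrams) slides a window of size 2 over the tokenized sentence plus an end-of-sentence marker to produce exactly these 2-grams:

    # Sketch: build word 2-grams, appending "#" as the end-of-sentence marker.
    def word_ngrams(tokens, n=2, eos_marker="#"):
        padded = tokens + [eos_marker]
        return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

    tokens = "Three cows eat grass".split()
    print(word_ngrams(tokens, n=2))
    # [('Three', 'cows'), ('cows', 'eat'), ('eat', 'grass'), ('grass', '#')]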

As N-gram analysis embeds the data set into a vector space, it allows the application of many powerful statistical techniques to data for prediction, classification, and discernment of various properties.
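
For example, one common embedding simply counts how often each N-gram occurs in each document. A minimal sketch, assuming scikit-learn is available and using made-up example documents:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["Three cows eat grass.", "Cows eat a lot of grass."]

    # Each document becomes a vector of unigram and bigram counts.
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # the unigram/bigram vocabulary
    print(X.toarray())                         # one count vector per document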

More information:

  1. Google's Ngram Viewer
  2. Wikipedia article
874 questions
0 votes, 1 answer

Create NGram chart from Solr data with Shingles?

I have been given the task of creating a Google-like Ngram view/chart of a data set. The chart is basically just a line chart of terms (ngrams) over time. I don't have any experience with Solr but have been given a core containing a lot of data and…
Paul M • 3,937 • 9 • 45 • 53
0 votes, 1 answer

solr use both n-gram search and default search

I'm trying to create a corpus using Solr. I have a field named "content" and I need to index and search bigrams and trigrams. I also need to index and search using the default search behaviour. How can I configure this?
Lahiru • 2,609 • 3 • 18 • 29
0 votes, 0 answers

NLTK Python: finding subcategories

I am new to NLP and NLTK in particular. I have a list of products such as: shampoo, toothpaste, vitamins, brushes, soap, etc. 1- or 4-grams. What I am trying to do is to walk several levels up the taxonomy tree and output a set of categories, like…
mel • 1,566 • 5 • 17 • 29
0 votes, 2 answers

Count of an ngram in the brown news corpus?

I know NLTK can tell you the likelihood of a word within a given context (see "nltk language model (ngram) calculate the prob of a word from context"). But can it tell you the count (or likelihood) of a given ngram within the Brown corpus? For instance, can…
bernie2436 • 22,841 • 49 • 151 • 244
0 votes, 1 answer

Elasticsearch - nGram on documents but not the search terms

I apparently misunderstood how nGram works with Elasticsearch. I wanted to be able to efficiently search for a substring. That way I could type 'loud' and still find words like 'clouds'. I have my nGram tokenizer set up to have min=2 and…
Travis Parks • 8,435 • 12 • 52 • 85
0 votes, 2 answers

How to estimate ngram probability?

I want to build a language model in which I need to estimate the ngram probabilities. So, my question is: what are the best corpora and tools that we could use to estimate the ngram probabilities? Thanks
Riadh Belkebir • 797 • 1 • 12 • 34
0 votes, 1 answer

Document multi-label classification - where do you get the labels? Ontology?

I am familiar with data mining techniques but not so much with text mining or Web mining. Here is a simple task: classify articles into a set of categories. Let us assume I have extracted the text content of the article and processed it. How and where do…
mel • 1,566 • 5 • 17 • 29
0 votes, 3 answers

SOLR eDISMAX product search

I'm new to SOLR and am implementing it to search our product catalog. I'm creating ngrams and edge ngrams on the brand name, display name and category fields. I'm using edismax and have qf defined as displayname_nge displayname_ng category_nge…
whitemtnelf • 181 • 2 • 10
0 votes, 1 answer

Can I create the full text as an index along with NGramFilterFactory

I defined a field type text_ngram.
buddy86 • 1,434 • 12 • 20
0 votes, 1 answer

How to search a corpus to find frequency of a string?

I'm working on an NLP project and I'd like to search through a corpus of text to try to find the frequency of a given verb-object pair. The aim would be to find which verb-object pair is most likely when given a few different possibilities. For…
user3163073 • 11 • 1 • 3
0 votes, 1 answer

sklearn CountVectorizer TypeError: refuses 'ngram_range' other than (1,1)

Is there a bug in Python 2.7.3 in sklearn CountVectorizer? A previous post mentioned an earlier bug. Here is my simple input and I get a TypeError.
    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> ngram_vectorizer =…
brendan8229 • 47 • 1 • 2 • 9
0 votes, 1 answer

grouping all Named entities in a Document

I would like to group all named entities in a given document. For example: "**Barack Hussein Obama** II is the 44th and current President of the United States, and the first African American to hold the office." I do not want to use OpenNLP APIs as…
Yogi • 1,035 • 2 • 13 • 39
0 votes, 1 answer

How to use Lucene ShingleFilter: Could not find implementing class for org.apache.lucene.analysis.tokenattributes.OffsetAttribute

Code is here: github link Error is: ren: null at []]: java.lang.IllegalArgumentException: Could not find implementing class for org.apache.lucene.analysis.tokenattributes.OffsetAttribute at…
rjurney • 4,824 • 5 • 41 • 62
0 votes, 2 answers

Elasticsearch starts with, multiple words

I'm trying to implement an autocomplete feature from phrases that contain multiple words. I want to be able to match only the beginning of words (edgeNGram?), but for every word searched. For example if I search for "monitor", I should receive all…
user3172790
0 votes, 0 answers

ElasticSearch: a gibberish query still returns results. How to ensure quality?

I implemented a custom filter which uses the EdgeGram tokenizer. The problem I face is that whether I search for something relevant or total garbage, I get a large number of hits. I suspect that this is due to the fact that I'm using an EdgeNgram…
Karan Verma • 1,721 • 1 • 15 • 24