Questions tagged [information-retrieval]

Information Retrieval is an area of study concerned with retrieving documents, information or metadata from a collection of unstructured or semi-structured data.

It usually has three parts:

Crawling: Identifying the documents to search when the document collection is not clearly defined. This is especially important for web search engines.
Indexing: Parsing and inverting the documents into an index, either as a static offline process or as a set of incremental updates for frequently changing document collections.
Searching: Retrieving the documents most relevant to a given query. This step ranks documents with scoring functions that measure how relevant each document is to the query.
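
The indexing and searching parts can be illustrated with a minimal sketch. It assumes a tiny in-memory collection (the `docs` dictionary stands in for the output of a crawler), a naive whitespace tokenizer, and a plain tf * idf score; all names here are hypothetical and real engines use far more elaborate analyzers and scoring functions.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()

def build_index(docs):
    """Indexing: invert the documents into term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def search(index, n_docs, query):
    """Searching: score each document by summed tf * idf, best first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Assumed toy collection; in practice this would come from the crawling step.
docs = {
    "d1": "lucene is a search library",
    "d2": "cosine similarity compares document vectors",
    "d3": "lucene builds an inverted index for search",
}
index = build_index(docs)
results = search(index, len(docs), "inverted index")
```

Here only `d3` contains the query terms, so it is the single result. Incremental updates would amount to calling the same per-document loop for new or changed documents instead of rebuilding the whole index.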

1163 questions

219

votes

11 answers

What is the best way to compute trending topics or tags?

Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions. I want to compute such a…

algorithm tags information-retrieval

asked Apr 24 '09 at 20:40

caw

30,999
61
181
291

112

votes

6 answers

Python: tf-idf-cosine: to find document similarity

I was following a tutorial which was available at Part 1 & Part 2. Unfortunately the author didn't have the time for the final section which involved using cosine similarity to actually find the distance between two documents. I followed the…

python machine-learning nltk information-retrieval tf-idf

asked Aug 25 '12 at 02:41

add-semi-colons

18,094
55
145
232

votes

2 answers

How to specify two Fields in Lucene QueryParser?

I read How to incorporate multiple fields in QueryParser? but I didn't get it. At the moment I have a very strange construction like: parser = New QueryParser("bodytext", analyzer) parser2 = New QueryParser("title", analyzer) query =…

java parsing lucene lucene.net information-retrieval

asked Jan 05 '10 at 09:30

Tyzak

2,430
8
38
52

votes

5 answers

Why is log used when calculating term frequency weight and IDF, inverse document frequency?

The formula for IDF is log(N / df_t) instead of just N / df_t, where N = total documents in collection, and df_t = document frequency of term t. Log is said to be used because it “dampens” the effect of IDF. What does this mean? Also, why do we…

information-retrieval tf-idf

asked Nov 21 '14 at 18:33

stevetronix

1,231
2
16
32
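
As a small numeric illustration of the "dampening" asked about in the question above, the sketch below compares the raw ratio N / df with its logarithm. The collection size and document frequencies are made-up values, and base 10 is an assumption, since the excerpt does not state the base of the log.

```python
import math

def raw_idf(n_docs, df):
    """Undamped weight: the plain ratio N / df."""
    return n_docs / df

def log_idf(n_docs, df):
    """Damped weight: log10(N / df). Base 10 is an assumption here."""
    return math.log10(n_docs / df)

N = 1_000_000  # assumed collection size, for illustration only

for df in (1, 100, 10_000, 1_000_000):
    print(f"df={df:>9,}  N/df={raw_idf(N, df):>11,.0f}  "
          f"log10(N/df)={log_idf(N, df):.1f}")
```

With these numbers, a term that appears in a single document gets a raw weight a million times that of a term appearing everywhere, whereas the log version spreads the same terms over a range of 0 to 6. That compression is what keeps one very rare term from dominating the score.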

votes

6 answers

Cosine similarity and tf-idf

I am confused by the following comment about TF-IDF and Cosine Similarity. I was reading up on both and then on wiki under Cosine Similarity I find this sentence "In case of information retrieval, the cosine similarity of two documents will…

information-retrieval vsm cosine-similarity tf-idf

asked Jun 06 '11 at 17:36

N00programmer

1,111
4
13
17

votes

3 answers

How to parse the data from Google Alerts?

Firstly, how would you get Google Alerts information into a database other than to parse the text of the email message that Google sends you? It seems that there is no Google Alerts API. If you must parse text, how would you go about parsing out…

database information-retrieval google-alerts

asked May 13 '09 at 21:08

John Scipione

2,360
4
25
28

votes

8 answers

Wikipedia text download

I am looking to download full Wikipedia text for my college project. Do I have to write my own spider to download this or is there a public dataset of Wikipedia available online? To just give you some overview of my project, I want to find out the…

text wikipedia web-crawler information-retrieval

asked Apr 21 '10 at 13:56

Boolean

14,266
30
88
129

votes

3 answers

How to calculate TF*IDF for a single new document to be classified?

I am using document-term vectors to represent a collection of documents. I use TF*IDF to calculate the term weight for each document vector. Then I could use this matrix to train a model for document classification. I am looking to classify…

machine-learning classification information-retrieval text-mining document-classification

asked Apr 01 '14 at 15:59

smwikipedia

61,609
92
309
482

votes

1 answer

What is the default list of stopwords used in Lucene's StopFilter?

Lucene has a default StopFilter (http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/core/StopFilter.html). Does anyone know which words are in the list?

java apache lucene information-retrieval stop-words

asked Jul 08 '13 at 13:20

alvas

115,346
109
446
738

votes

5 answers

What tried and true algorithms for suggesting related articles are out there?

Pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related. Let's assume very little metadata…

text machine-learning information-retrieval document-classification

asked Aug 10 '09 at 12:38

kch

77,385
46
136
148

votes

7 answers

Computing similarity between two lists

EDIT: as everyone is getting confused, I want to simplify my question. I have two ordered lists. Now, I just want to compute how similar one list is to the other. E.g., [1,7,4,5,8,9] and [1,7,5,4,9,6]. What is a good measure of similarity between these two…

algorithm search statistics probability information-retrieval

asked Feb 20 '12 at 17:03

user1221572

votes

4 answers

Information retrieval (IR) vs data mining vs Machine Learning (ML)

People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them. From people with experience in these fields, what exactly draws the line between these?

machine-learning data-mining information-retrieval

asked Aug 05 '10 at 18:04

Boris Yeltz

2,341
5
21
20

votes

2 answers

How to extract Highlighted Parts from PDF files

Is there any way to extract highlighted text from a PDF file programmatically? Any language is welcome. I have found several libraries with Python, Java, and also PHP but none of them do the job.

pdf information-retrieval

asked Feb 01 '12 at 16:32

user1183057

votes

9 answers

How can I extract only the main textual content from an HTML page?

Update Boilerpipe appears to work really well, but I realized that I don't need only the main content because many pages don't have an article, but only links with some short description to the entire texts (this is common in news portals) and I…

java html information-retrieval jsoup

asked Aug 11 '11 at 05:36

Renato Dinhani

35,057
55
139
199

votes

1 answer

Lucene's algorithm

I read the paper by Doug Cutting, "Space optimizations for total ranking". Since it was written a long time ago, I wonder what algorithms Lucene uses (regarding postings list traversal, score calculation, and ranking). Particularly, the total ranking…

algorithm indexing lucene information-retrieval inverted-index

asked Apr 25 '12 at 21:25

teddy teddy

3,025
6
31
48
