Questions tagged [information-retrieval]

Information Retrieval is an area of study concerned with retrieving documents, information or metadata from a collection of unstructured or semi-structured data.

An information retrieval system usually has three parts:

  1. Crawling: identifying the documents to search when the document collection is not clearly defined. This is especially important for web search engines.
  2. Indexing: parsing and inverting the documents into an index, either as a static offline process or as a set of incremental updates for frequently changing document collections.
  3. Searching: retrieving the documents most relevant to a given query. This step requires ranking documents with scoring functions that measure how relevant each document is to the query.
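
The indexing and searching steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the collection is a small in-memory dictionary (so no crawling is needed), tokenization is a lowercase whitespace split, and the scoring function is a plain sum of tf * idf over the query terms. The names `tokenize`, `build_index`, and `search` are hypothetical helpers, not part of any library.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace splitting is good enough for the sketch.
    return text.lower().split()


def build_index(docs):
    """Indexing: map each term to {doc_id: term frequency} (an inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Searching: rank documents by the sum of tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative collection (made up for this sketch)
docs = {
    "d1": "lucene builds an inverted index",
    "d2": "a crawler fetches web pages",
    "d3": "the index maps terms to documents",
}
index = build_index(docs)
results = search(index, len(docs), "inverted index")
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Real engines add much more (stemming, stop words, length normalization, compressed postings), but the split between an offline index build and a query-time scoring pass is the same.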
1163 questions
219 votes, 11 answers

What is the best way to compute trending topics or tags?

Many sites offer some statistics like "The hottest topics in the last 24h". For example, Topix.com shows this in its section "News Trends". There, you can see the topics which have the fastest growing number of mentions. I want to compute such a…
caw
112 votes, 6 answers

Python: tf-idf-cosine: to find document similarity

I was following a tutorial which was available at Part 1 & Part 2. Unfortunately the author didn't have the time for the final section which involved using cosine similarity to actually find the distance between two documents. I followed the…
add-semi-colons
73 votes, 2 answers

How to specify two Fields in Lucene QueryParser?

I read How to incorporate multiple fields in QueryParser? but I didn't get it. At the moment I have a very strange construction like: parser = New QueryParser("bodytext", analyzer) parser2 = New QueryParser("title", analyzer) query =…
Tyzak
48 votes, 5 answers

Why is log used when calculating term frequency weight and IDF, inverse document frequency?

The formula for IDF is log(N / df_t) instead of just N / df_t, where N = total documents in collection and df_t = document frequency of term t. Log is said to be used because it “dampens” the effect of IDF. What does this mean? Also, why do we…
stevetronix
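
To make the "dampening" in the question above concrete, here is a small sketch comparing the raw ratio N / df_t with log(N / df_t). The collection size, the document frequencies, and the choice of base 10 are assumptions made for illustration (the excerpt does not state a base), and `idf_weights` is a hypothetical helper.

```python
import math


def idf_weights(n_docs, df):
    """Return (raw ratio N/df, log10 of that ratio) for one term."""
    ratio = n_docs / df
    return ratio, math.log10(ratio)


N = 1_000_000  # assumed collection size, for illustration only
for df in (1, 10, 1_000, 100_000):
    raw, damped = idf_weights(N, df)
    print(f"df={df:>7}  N/df={raw:>9.0f}  log10(N/df)={damped:.1f}")
```

With these assumed numbers the raw ratio spans five orders of magnitude (10 to 1,000,000), while the logged value only ranges from 1 to 6, so a very rare term still outweighs a common one without completely swamping every other factor in the score.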
42 votes, 6 answers

Cosine similarity and tf-idf

I am confused by the following comment about TF-IDF and Cosine Similarity. I was reading up on both, and then on the wiki under Cosine Similarity I found this sentence: "In case of information retrieval, the cosine similarity of two documents will…
N00programmer
33 votes, 3 answers

How to parse the data from Google Alerts?

Firstly, how would you get Google Alerts information into a database other than by parsing the text of the email message that Google sends you? It seems that there is no Google Alerts API. If you must parse text, how would you go about parsing out…
John Scipione
27 votes, 8 answers

Wikipedia text download

I am looking to download full Wikipedia text for my college project. Do I have to write my own spider to download this or is there a public dataset of Wikipedia available online? To just give you some overview of my project, I want to find out the…
Boolean
25 votes, 3 answers

How to calculate TF*IDF for a single new document to be classified?

I am using document-term vectors to represent a collection of documents. I use TF*IDF to calculate the term weight for each document vector. Then I can use this matrix to train a model for document classification. I am looking to classify…
25 votes, 1 answer

What is the default list of stopwords used in Lucene's StopFilter?

Lucene has a default StopFilter (http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/core/StopFilter.html). Does anyone know which words are in the list?
alvas
25 votes, 5 answers

What tried and true algorithms for suggesting related articles are out there?

Pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related. Let's assume very little metadata…
kch
24 votes, 7 answers

Computing similarity between two lists

EDIT: As everyone is getting confused, I want to simplify my question. I have two ordered lists. Now, I just want to compute how similar one list is to the other. E.g., 1,7,4,5,8,9 and 1,7,5,4,9,6. What is a good measure of similarity between these two…
24 votes, 4 answers

Information retrieval (IR) vs data mining vs Machine Learning (ML)

People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them. From people with experience in these fields, what exactly draws the line between these?
Boris Yeltz
22 votes, 2 answers

How to extract Highlighted Parts from PDF files

Is there any way to extract highlighted text from a PDF file programmatically? Any language is welcome. I have found several libraries with Python, Java, and also PHP but none of them do the job.
user1183057
22 votes, 9 answers

How can I extract only the main textual content from an HTML page?

Update: Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with a short description of the full texts (this is common in news portals), and I…
Renato Dinhani
19 votes, 1 answer

Lucene's algorithm

I read the paper by Doug Cutting, "Space optimizations for total ranking". Since it was written a long time ago, I wonder what algorithms Lucene uses (regarding postings list traversal, score calculation, and ranking). Particularly, the total ranking…