
I have seen that in focused web crawling (a.k.a. topical web crawling), the evaluation metric called the harvest ratio is defined as: after crawling t pages, harvest_ratio = number_of_relevant_pages / pages_crawled(t).

So, for example, if after crawling 100 pages I get 80 true positives, then the harvest ratio of the crawler at that point is 0.8. But the crawler might have skipped crawling some pages that are totally relevant to the crawling domain, and these are not accounted for in the evaluation ratio. What is this called? Can we improve the evaluation metric to include the missed pages that are relevant? Is this consideration important?
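The definition above is simple enough to sketch directly. This is just an illustration of the formula from the question (the function name and the relevance-labeling step are my own; in practice the relevance count would come from a classifier or manual judgments):

```python
def harvest_ratio(relevant_pages: int, pages_crawled: int) -> float:
    """Fraction of crawled pages judged relevant to the target topic."""
    if pages_crawled == 0:
        return 0.0
    return relevant_pages / pages_crawled

# 80 relevant pages out of 100 crawled -> harvest ratio of 0.8
print(harvest_ratio(80, 100))  # 0.8
```

Note that, as the question points out, this metric only looks at what was crawled; relevant pages the crawler never fetched do not appear anywhere in the formula.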

samsamara

1 Answer


The most basic evaluation for a focused crawl is precision and recall, which can be aggregated into the F-measure.

http://en.wikipedia.org/wiki/Precision_and_recall

http://en.wikipedia.org/wiki/F1_score

If you are more interested in how relevant a page is to a specific keyword, you want to use tf–idf (term frequency–inverse document frequency):

http://en.wikipedia.org/wiki/Tf*idf
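A minimal sketch of the standard tf–idf weighting (this uses the plain term-frequency and logarithmic idf variants; real crawlers would tokenize properly and usually use a library such as scikit-learn):

```python
import math

def tf_idf(term: str, doc: list, corpus: list) -> float:
    """tf-idf weight of `term` in `doc`, relative to `corpus`.

    doc: list of tokens; corpus: list of such token lists.
    Assumes `term` occurs in at least one corpus document.
    """
    tf = doc.count(term) / len(doc)                     # term frequency in the document
    df = sum(1 for d in corpus if term in d)            # number of documents containing the term
    idf = math.log(len(corpus) / df)                    # inverse document frequency
    return tf * idf

docs = [["web", "crawl", "web"], ["topic", "crawl"], ["news"]]
# "web" is frequent in docs[0] and rare in the corpus, so it scores high there.
print(tf_idf("web", docs[0], docs))
```

A term that appears in every document gets idf = log(1) = 0, so ubiquitous words contribute nothing to the relevance score.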

Julien Bourdon
  • But they all make evaluations based on the crawled collection, right? What about relevant pages that were not crawled? I might get a high evaluation score but may have failed to crawl some pages that are very relevant. So this is a problem with the crawler that is not shown in the evaluation. What is the solution for that? – samsamara Jun 25 '12 at 07:35
  • Edited my answer to show how to evaluate the relevance of a page for a specific keyword. – Julien Bourdon Jun 26 '12 at 13:01
  • Hey, no, you didn't get my question. Please read my comment above. – samsamara Jun 26 '12 at 16:45
  • Well, I'm afraid I didn't get your question then. Try to edit it to make it clearer and you might get a more appropriate answer then. – Julien Bourdon Jun 26 '12 at 16:48