
There was a question about this, but the user was satisfied (probably?) with knowing about precision, recall and F1 score, so I'll extend it:

To compute precision & recall, you need the TP, FN, TN and FP values. Out of the box, after a crawl, you know:

  • TP + FP (those were selected as relevant)
  • TN + FN (the rest which were crawled and discarded)

The hard part seems to be splitting those sums apart, i.e. finding which of the crawled pages are truly relevant so that TP and FN can be counted individually rather than only as part of the sums above.
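Once the counts are separated, the metrics themselves are straightforward. A minimal sketch (function and example numbers are mine, not from the question):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion counts.

    tp: pages selected as relevant that really are relevant
    fp: pages selected as relevant that are not
    fn: relevant pages the crawler discarded
    (TN is not needed for these three metrics.)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical crawl: 80 kept-and-relevant, 20 kept-but-off-topic,
# 10 relevant pages discarded.
p, r, f = precision_recall_f1(tp=80, fp=20, fn=10)
print(p, r, f)  # precision = 0.8, recall = 80/90
```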

I can verify a document's relevancy manually, independently of the crawler's relevancy function, which is the thing actually under test. In my case that function is the cosine similarity between the TF-IDF vectors of the crawled page and a user-given on-topic document.
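For concreteness, a self-contained sketch of that relevancy function (my own toy implementation with a smoothed IDF, not the asker's actual code; the example documents are made up):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors (dicts of term -> weight) for a list of token lists.

    Uses a smoothed idf (1 + log(n/df)) so terms occurring in every
    document still get a non-zero weight.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * (1 + math.log(n / df[t]))
                        for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

topic = "web crawler focused crawling relevance".split()
page_on = "a focused web crawler ranks pages by relevance".split()
page_off = "recipe for chocolate cake with sugar".split()

vecs = tfidf_vectors([topic, page_on, page_off])
print(cosine(vecs[0], vecs[1]))  # shares terms with the topic -> > 0
print(cosine(vecs[0], vecs[2]))  # no shared terms -> 0.0
```

A crawled page would then be kept when its similarity to the topic document exceeds some threshold, which is exactly the decision whose precision and recall are being measured.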

As I want to test it on more than a couple hundred crawled pages, how do you evaluate correctness using precision and recall without manually verifying every crawled page? Also, is there any other way to evaluate a focused web crawler?
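One standard compromise (my suggestion, not something from the question) is to judge only a fixed-size random sample of the kept pages and use the sample's relevant fraction as an unbiased estimate of precision; recall still needs something like the corpus-based setup discussed in the comments. A sketch:

```python
import random

def sample_for_judging(crawled_pages, k, seed=42):
    """Draw a reproducible random sample of pages for manual judging.

    Judging k pages instead of the full crawl gives an unbiased
    estimate of precision over the whole kept set.
    """
    rng = random.Random(seed)
    return rng.sample(crawled_pages, min(k, len(crawled_pages)))

def estimate_precision(judgements):
    """judgements: booleans (True = manually judged relevant) for the sample."""
    return sum(judgements) / len(judgements) if judgements else 0.0

pages = [f"http://example.com/page{i}" for i in range(1000)]  # placeholder URLs
sample = sample_for_judging(pages, k=100)
# After manually judging the 100 sampled pages:
judgements = [True] * 75 + [False] * 25  # hypothetical verdicts
print(estimate_precision(judgements))  # 0.75
```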

clausavram
  • Have you considered using a pre-recorded corpus rather than running the crawler in the wild? The corpus at https://commoncrawl.org/ seems to be a good start. It is indexed, so you can compute the cosine similarity for every page; links to pages outside the corpus must be ignored. In this case you can compute recall. On the open Web, I think, you can compute only precision, since there's no way to count the number of relevant documents. – Mike Bessonov Sep 12 '15 at 23:45
  • Looks interesting. I'll take a look into it. Thanks – clausavram Sep 13 '15 at 11:02

0 Answers