0

We would like to know whether the Common Crawl dataset can be used as a source of legitimate (benign) URLs for URL classification.

desertnaut
test M

1 Answer

3

The Common Crawl archives may include all kinds of malicious content at a low rate. At present, only link spam is classified and partially blocked from being crawled.

In general, any broad sample crawl of the web may include spam, malicious sites, etc. The Common Crawl archives are also used for research on web security, cf. https://scholar.google.de/scholar?q=commoncrawl+vulnerability
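
Since the crawl is not guaranteed to be clean, one option is to screen crawled URLs against a blocklist before treating them as the "legitimate" class. The following is only a minimal sketch of that idea: `BLOCKED_HOSTS`, `is_blocked` and `split_candidates` are hypothetical names made up for illustration, and the blocklist entries are placeholders, not real data or anything provided by Common Crawl.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist of known-bad hosts (placeholder values).
# In practice this would come from a spam or malware feed of your choice.
BLOCKED_HOSTS = {"malware.example", "spam.example"}


def host_of(url):
    """Return the lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def is_blocked(url, blocked_hosts):
    """True if the URL's host, or any parent domain of it, is blocklisted."""
    parts = host_of(url).split(".")
    # For "a.b.c" check "a.b.c", then "b.c", then "c".
    return any(".".join(parts[i:]) in blocked_hosts for i in range(len(parts)))


def split_candidates(urls, blocked_hosts=BLOCKED_HOSTS):
    """Partition crawled URLs into (unflagged, flagged) lists."""
    unflagged, flagged = [], []
    for url in urls:
        (flagged if is_blocked(url, blocked_hosts) else unflagged).append(url)
    return unflagged, flagged
```

Note that a URL that passes this check is only "not flagged by the list used", not verified as benign, so some label noise would remain in the legitimate class.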

This topic has already been discussed on the Common Crawl mailing list: https://groups.google.com/d/msg/common-crawl/xmSZX85cRjg/zwi5vn4NBAAJ

Sebastian Nagel