We would like to know whether the Common Crawl database can be used as a legitimate dataset for URL classification.
We're all left guessing what you mean by "legitimate dataset" -- every sample of URLs has selection effects. – Greg Lindahl Feb 12 '19 at 16:38
1 Answer
3
The Common Crawl archives may include all kinds of malicious content at a low rate. At present, only link spam is classified and partially blocked from being crawled.
In general, any broad sample crawl of the web may include spam, malicious sites, etc. The Common Crawl archives are also used for research on web security; see https://scholar.google.de/scholar?q=commoncrawl+vulnerability
This topic has already been discussed on the Common Crawl mailing list: https://groups.google.com/d/msg/common-crawl/xmSZX85cRjg/zwi5vn4NBAAJ
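
Since the archives cannot be assumed to be free of malicious URLs, one practical consequence for URL classification is that a crawl sample should not be treated as a clean "benign" class without some filtering. The following is only a minimal sketch of that idea, not anything Common Crawl provides: the function name `split_by_blocklist`, the sample URLs, and the blocklist are all hypothetical, and a real blocklist would have to come from a source you trust.

```python
from urllib.parse import urlsplit


def split_by_blocklist(urls, blocked_hosts):
    """Partition URLs into (kept, flagged) by exact hostname match.

    Hypothetical helper: `blocked_hosts` stands in for whatever
    blocklist you trust; it is not part of Common Crawl.
    """
    blocked = {h.lower() for h in blocked_hosts}
    kept, flagged = [], []
    for url in urls:
        # hostname is None for URLs without a network location
        host = (urlsplit(url).hostname or "").lower()
        (flagged if host in blocked else kept).append(url)
    return kept, flagged


# Made-up sample standing in for URLs taken from a crawl.
sample_urls = [
    "https://example.org/index.html",
    "http://bad.example.net/login",
    "https://example.com/about",
]
blocked_hosts = {"bad.example.net"}  # assumed blocklist entry

kept, flagged = split_by_blocklist(sample_urls, blocked_hosts)
print(len(kept), "kept,", len(flagged), "flagged")
```

Exact hostname matching is the simplest possible choice here; it will miss subdomains and anything not on the list, so URLs that pass the filter are still better treated as unlabeled than as verified benign.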

Sebastian Nagel