
I recently wrote a custom web crawler/spider in Java using the JSoup (http://jsoup.org/) HTML parser. The crawler is very rudimentary: it uses JSoup's connect and get methods to fetch the source of pages and then other JSoup methods to parse the content. It randomly follows almost any links it finds, but at no point does it attempt to download files or execute scripts.
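For reference, the fetch-and-parse step looks roughly like this (a minimal sketch of the pattern described above; the class name and seed URL are placeholders):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class CrawlStep {
        public static void main(String[] args) throws Exception {
            // Fetch and parse one page. JSoup only parses the HTML;
            // it never executes scripts or loads plugins.
            Document doc = Jsoup.connect("http://example.com/").get();

            // Collect absolute link targets to follow later.
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.absUrl("href"));
            }
        }
    }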

The crawler picks seed pages from a long list of essentially random web pages, some of which probably contain adult content and/or malicious code. Recently, while I was running the crawler, my antivirus (Avast) flagged one of the requests as a "threat detected", and the offending URL did look malicious.

My question is: can my computer get a virus or any other sort of malware through my web crawler? Are there any precautions or checks I should put in place?

1 Answer


In theory, it can.

However, since you don't execute JavaScript, Flash, or similar plugins, but only parse the text data, the attack surface is small. The realistic risk is an exploitable bug in the HTML parser itself, and chances are good that JSoup has no known vulnerability of that kind.
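In that spirit, you can also shrink what the parser ever sees. A sketch using JSoup's own connection settings (the timeout and size cap are arbitrary example values, not recommendations):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class SafeFetch {
        static Document fetch(String url) throws Exception {
            return Jsoup.connect(url)
                    .timeout(5_000)           // give up on slow hosts
                    .maxBodySize(1024 * 1024) // cap each response at 1 MB
                    .followRedirects(true)
                    // ignoreContentType defaults to false, so JSoup refuses
                    // non-HTML responses (PDFs, executables, ...) with an
                    // UnsupportedMimeTypeException instead of parsing them
                    .get();
        }
    }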

Furthermore, viruses and malicious web sites target the big user groups. Very few people fetch pages with JSoup; most use Internet Exploder, for example, which is why viruses target those platforms. These days, Mac OS X is becoming more and more attractive as a target. I just read about new malware that infects Mac OS X users only, via an old Java security hole, when they visit a web site. It was found on Dalai Lama-related web sites, so it may be of Chinese origin.

If you really are paranoid, set up a "nobody" user on your system and restrict it heavily. This works best on Linux: with SELinux in particular, you can narrow the crawler's permissions down to the point where it can do nothing except fetch external web sites and write the results to a database. An attacker could then only crash your crawler, or perhaps abuse it for a DDoS attack, but not corrupt or take over your system.
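A JVM-level variant of the same idea, for what it's worth, is the classic Java policy mechanism (not something the answer above uses, and the SecurityManager behind it is deprecated since Java 17, so treat this as an illustrative sketch): grant the process nothing but outbound web connections.

    // crawler.policy -- deny everything except outbound HTTP(S);
    // add permissions for your database host/port as needed
    grant {
        permission java.net.SocketPermission "*:80",  "connect,resolve";
        permission java.net.SocketPermission "*:443", "connect,resolve";
    };

Started with (Crawler being a placeholder main class):

    java -Djava.security.manager -Djava.security.policy=crawler.policy Crawler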

Has QUIT--Anony-Mousse