I want to learn web crawling with Java EE. I don't know where to start.
What are good books or tutorials?
A web crawler, also known as a bot, is a small program that crawls web pages by following the links they contain. It parses each HTML page and extracts the links, which it then uses to traverse further pages. You can refer to this post for a basic explanation of web crawlers and how they work.
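To make the parse/extract/traverse loop concrete, here is a minimal sketch that uses only the JDK. The class name `TinyCrawler`, the `fetcher` function, and the `example.test` pages are made up for illustration; the "web" is an in-memory map so it runs offline, and the regex is a crude stand-in for a real HTML parser.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyCrawler {
    // Crude matcher for double-quoted href attributes; a real crawler should use an HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Extracts link targets from the HTML and resolves them against the page URL. */
    public static List<String> extractLinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(pageUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // resolve() turns relative links such as "b.html" into absolute URLs;
            // it throws IllegalArgumentException on malformed links (not handled in this sketch).
            links.add(base.resolve(m.group(1)).toString());
        }
        return links;
    }

    /** Breadth-first crawl starting at seed, visiting at most maxPages URLs. */
    public static Set<String> crawl(String seed, Function<String, String> fetcher, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();   // remembers visit order
        Deque<String> frontier = new ArrayDeque<>();   // URLs waiting to be fetched
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue;                              // already seen, skip
            }
            String html = fetcher.apply(url);          // hypothetical fetch step
            if (html == null) {
                continue;                              // page could not be fetched
            }
            for (String link : extractLinks(url, html)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A fake four-page site standing in for real HTTP requests.
        Map<String, String> site = Map.of(
                "http://example.test/", "<a href=\"/a.html\">A</a> <a href=\"b.html\">B</a>",
                "http://example.test/a.html", "<a href=\"/\">home</a> <a href=\"/c.html\">C</a>",
                "http://example.test/b.html", "<p>no links</p>",
                "http://example.test/c.html", "<a href=\"a.html\">A</a>");
        System.out.println(crawl("http://example.test/", site::get, 10));
    }
}
```

In a real crawler you would swap the map lookup for an HTTP request, and add politeness rules (robots.txt, rate limiting) that this sketch leaves out.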
There are various libraries available for implementing a simple web crawler. JSoup is a Java-based library and one of the most widely used for parsing HTML pages, as it provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
There are also various tutorials available on the web. Refer to this simple tutorial for some simple Java programs that demonstrate the use of JSoup in various ways.
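As a rough idea of what the JSoup API looks like, here is a small sketch that pulls the links out of an HTML string with a CSS selector. It assumes the jsoup JAR is on the classpath; the class name `JsoupLinks`, the sample HTML, and the `example.test` base URL are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLinks {
    /** Returns the absolute URLs of all <a href> elements in the given HTML. */
    public static List<String> absoluteLinks(String html, String baseUri) {
        // The base URI lets jsoup resolve relative links.
        // For a live page, Jsoup.connect(url).get() returns a Document as well.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {   // CSS selector: anchors with an href attribute
            links.add(a.absUrl("href"));            // absolute form of the href value
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head><body>"
                + "<a href=\"/x\">x</a> <a href=\"y.html\">y</a></body></html>";
        Document doc = Jsoup.parse(html, "http://example.test/dir/");
        System.out.println(doc.title());
        System.out.println(absoluteLinks(html, "http://example.test/dir/"));
    }
}
```

The returned links are what a crawler would add to its queue of pages to visit next.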
A web crawler is an application that browses the Internet to index links, pages, and so on. I can suggest crawler4j, which is Java-based and open source.
A very good book about web data mining in general is "Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data" by Bing Liu.
Besides crawler4j, which is a really handy crawler framework (and can easily be integrated into a Java EE environment), you can take a look at Apache Nutch, which is a scalable and distributed crawler framework.