
I'm interested in building a program to fetch all the latest articles in a specific domain ("computer science") from a specific set of websites ("ScienceDirect", for example). As you know, some websites publish a page for each research article, such as: http://www.sciencedirect.com/science/article/pii/S108480451400085X Each such page contains the information for a specific article.

I'd like to know what the best open-source tool is for this purpose. General web crawlers (such as Apache Nutch) provide a framework for crawling the whole web, but in my case I need a website-specific crawler.

AmirHJ
  • try [Scrapy](http://scrapy.org) – aalbahem Oct 10 '14 at 15:49
  • You can do that easily by applying a regular expression in Nutch's regex-urlfilter.txt file (of course, only if the URL format of the pages you need differs from that of other pages). – Ali Oct 18 '14 at 13:26
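The Nutch approach mentioned above amounts to editing `conf/regex-urlfilter.txt` so that only URLs on the target host pass the filter (listing pages must still be allowed through, or the crawler can never discover the article pages they link to). A sketch, with patterns that are assumptions based on the example URL in the question:

```
# Restrict the crawl to the target host so listing pages can still
# be followed to reach article pages (hypothetical rules).
+^https?://www\.sciencedirect\.com/

# Reject everything else.
-.
```

Rules are applied top to bottom; the first matching `+`/`-` prefix decides whether a URL is kept.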
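A site-specific crawler along the lines the comments suggest boils down to one idea: restrict the link filter to the URL pattern that article pages follow. Scrapy's `CrawlSpider` with a `LinkExtractor` rule is the more robust route, but the core mechanics can be sketched with nothing beyond the standard library. The regex below is an assumption inferred from the example URL in the question, and a real crawler would also need to respect robots.txt and the site's terms of use:

```python
# Minimal sketch of a website-specific crawler (stdlib only).
# The article-URL pattern is an assumption based on the example
# link in the question, not a documented ScienceDirect scheme.
import re
import urllib.request
from html.parser import HTMLParser

# Single-article pages appear to follow /science/article/pii/<id>.
ARTICLE_RE = re.compile(
    r"^https?://www\.sciencedirect\.com/science/article/pii/\w+$"
)

def is_article_url(url: str) -> bool:
    """Return True for URLs that look like single-article pages."""
    return bool(ARTICLE_RE.match(url))

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    """Breadth-first crawl from the seeds, yielding article URLs."""
    seen, queue = set(), list(seed_urls)
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable pages
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            if is_article_url(link):
                yield link          # found an article page
            elif link.startswith("http"):
                queue.append(link)  # follow listing/index pages

if __name__ == "__main__":
    # Hypothetical seed; in practice you would start from a journal
    # or subject listing page.
    for article in crawl(["https://www.sciencedirect.com/"]):
        print(article)
```

In Scrapy the same filtering would be expressed declaratively as a `Rule(LinkExtractor(allow=...))` on a `CrawlSpider`, which also gives you politeness delays, retries, and item pipelines for free.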

0 Answers