
I'm interested in building a program to fetch all the latest articles in a specific domain ("computer science") from a specific set of websites ("ScienceDirect", for example). As you know, some websites publish a page for each research article, such as: http://www.sciencedirect.com/science/article/pii/S108480451400085X Each such page contains the information for a specific article.

I'd like to know what the best open-source tool is for this purpose. General web crawlers (such as Apache Nutch) provide a framework for crawling the whole web, but in my case I need a website-specific crawler.

AmirHJ
  • try [Scrapy](http://scrapy.org) – aalbahem Oct 10 '14 at 15:49
  • You can do that easily by applying a regular expression in Nutch's regex-urlfilter.txt file (of course, only if the URL format of the pages you need differs from that of other pages). – Ali Oct 18 '14 at 13:26
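The Nutch approach mentioned above amounts to editing `conf/regex-urlfilter.txt` so that only URLs on the target host pass the filter (listing pages must still be allowed through, or the crawler can never discover the article pages they link to). A sketch, with patterns that are assumptions based on the example URL in the question:

```
# Restrict the crawl to the target host so listing pages can still
# be followed to reach article pages (hypothetical rules).
+^https?://www\.sciencedirect\.com/

# Reject everything else.
-.
```

Rules are applied top to bottom; the first matching `+`/`-` prefix decides whether a URL is kept.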
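A site-specific crawler along the lines the comments suggest boils down to one idea: restrict the link filter to the URL pattern that article pages follow. Scrapy's `CrawlSpider` with a `LinkExtractor` rule is the more robust route, but the core mechanics can be sketched with nothing beyond the standard library. The regex below is an assumption inferred from the example URL in the question, and a real crawler would also need to respect robots.txt and the site's terms of use:

```python
# Minimal sketch of a website-specific crawler (stdlib only).
# The article-URL pattern is an assumption based on the example
# link in the question, not a documented ScienceDirect scheme.
import re
import urllib.request
from html.parser import HTMLParser

# Single-article pages appear to follow /science/article/pii/<id>.
ARTICLE_RE = re.compile(
    r"^https?://www\.sciencedirect\.com/science/article/pii/\w+$"
)

def is_article_url(url: str) -> bool:
    """Return True for URLs that look like single-article pages."""
    return bool(ARTICLE_RE.match(url))

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    """Breadth-first crawl from the seeds, yielding article URLs."""
    seen, queue = set(), list(seed_urls)
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable pages
        parser = LinkCollector()
        parser.feed(html)
        for link in parser.links:
            if is_article_url(link):
                yield link          # found an article page
            elif link.startswith("http"):
                queue.append(link)  # follow listing/index pages

if __name__ == "__main__":
    # Hypothetical seed; in practice you would start from a journal
    # or subject listing page.
    for article in crawl(["https://www.sciencedirect.com/"]):
        print(article)
```

In Scrapy the same filtering would be expressed declaratively as a `Rule(LinkExtractor(allow=...))` on a `CrawlSpider`, which also gives you politeness delays, retries, and item pipelines for free.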

0 Answers