How to crawl an updated webpage with Scrapy?

Asked May 13 '16 at 07:56

Active May 14 '16 at 02:52

Viewed 813 times

I'm using Scrapy to get data from a website. The problem is that I don't know how to get the incremental data after the website has been updated on the server, or how to tell that it has been updated at all. The table on the page is what I want to crawl, and it has a column named "Add Date". So when the data is updated, I only want to fetch the rows that were added recently. A further complication is that the URL does not change after an update. It's still https://gold.jgi.doe.gov/projects.
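Since the URL stays the same, one possible way to notice an update is to compare a hash of the downloaded content between runs. The following is only a minimal sketch of that idea using the standard library; the function names and the JSON state file are assumptions made for illustration, not part of Scrapy.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest of the page (or table) bytes."""
    return hashlib.sha256(body).hexdigest()


def has_changed(body: bytes, state_file: Path) -> bool:
    """Compare the current fingerprint with the one stored by the previous run.

    Returns True on the first run or when the content differs, and in that
    case records the new fingerprint in state_file (a hypothetical JSON file).
    """
    current = fingerprint(body)
    previous = None
    if state_file.exists():
        previous = json.loads(state_file.read_text()).get("fingerprint")
    if current == previous:
        return False
    state_file.write_text(json.dumps({"fingerprint": current}))
    return True
```

In a Scrapy callback the bytes could come from `response.body`. Hashing the whole page may report a change whenever unrelated markup differs, so hashing only the extracted table content is probably more robust.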

I've read the Q&A "Strategy for how to crawl/index frequently updated webpages?" and I understand a little of the theory. But I still don't know how to implement it with Scrapy. Can anybody give an example or some more detailed information?
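To make the "only recently added rows" idea concrete, here is a rough sketch that remembers the newest "Add Date" seen so far and keeps only rows newer than it. It assumes each scraped row is a dict with a hypothetical `add_date` key in ISO `YYYY-MM-DD` form; the real page's date format and column names would need checking, and the state file is again an illustrative choice.

```python
import json
from datetime import date
from pathlib import Path
from typing import Iterable, List


def load_last_seen(state_file: Path) -> date:
    """Read the newest Add Date recorded by the previous crawl."""
    if state_file.exists():
        stored = json.loads(state_file.read_text())["last_seen"]
        return date.fromisoformat(stored)
    return date.min  # first run: treat every row as new


def select_new(rows: Iterable[dict], state_file: Path) -> List[dict]:
    """Return rows whose add_date is later than the stored high-water mark.

    The mark is advanced to the newest date among the returned rows.
    """
    last_seen = load_last_seen(state_file)
    fresh = [r for r in rows if date.fromisoformat(r["add_date"]) > last_seen]
    if fresh:
        newest = max(date.fromisoformat(r["add_date"]) for r in fresh)
        state_file.write_text(json.dumps({"last_seen": newest.isoformat()}))
    return fresh
```

A spider's parse callback could build such dicts from the table rows and yield only what `select_new` returns. Because the comparison is by whole day, rows added later on the same day as the previous crawl would be missed; comparing with `>=` and de-duplicating on a row ID is one way around that.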

edited May 23 '17 at 11:44

Community

asked May 13 '16 at 07:56

Coding_Rabbit


What exactly have you tried so far? Can you share some code? – aberna May 13 '16 at 08:50
Sorry, I don't know how to deal with this, so I'm asking for an example. – Coding_Rabbit May 13 '16 at 09:07
Follow the [tutorial](http://doc.scrapy.org/en/latest/intro/tutorial.html). – eLRuLL May 13 '16 at 14:28
I do know how to write a spider to get data from a website, but I don't know how to get the data added recently after it has been updated. – Coding_Rabbit May 13 '16 at 15:46
@Coding_Rabbit I guess you are looking for this https://stackoverflow.com/questions/10331738/strategy-for-how-to-crawl-index-frequently-updated-webpages – Weihang Jian Aug 09 '16 at 17:54
Possible duplicate of [Strategy for how to crawl/index frequently updated webpages?](https://stackoverflow.com/questions/10331738/strategy-for-how-to-crawl-index-frequently-updated-webpages) – Gallaecio Jan 31 '19 at 14:10


0 Answers