I'm using Scrapy to get data from a website. The problem is that I don't know how to fetch only the incremental data after the website has been updated on the server, or how to tell that the website has been updated at all. The table on the web page is what I want to crawl. It has a column named "Add Date", so when the data is updated, I only want to get the rows that were added recently. A further complication is that the URL doesn't change after an update: it's still https://gold.jgi.doe.gov/projects.
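
Because the URL never changes, one possible way to tell whether the page has been updated is to fingerprint the table content on each visit and compare it with the fingerprint stored from the previous visit. This is only a minimal sketch, assuming the table's HTML has already been extracted as a string; `fingerprint` and `has_changed` are hypothetical helper names, not part of Scrapy.

```python
import hashlib


def fingerprint(table_html):
    """Return a SHA-256 hex digest of the table HTML."""
    # Collapse runs of whitespace so pure re-indentation does not change the hash.
    normalized = " ".join(table_html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(table_html, previous_fingerprint):
    """True if the table differs from the one seen on the previous visit."""
    return fingerprint(table_html) != previous_fingerprint
```

A changed fingerprint only says that something in the table is different. It does not say which rows are new, so the "Add Date" column is still needed for that.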

I've read the Q&A "Strategy for how to crawl/index frequently updated webpages?" and understand a little of the theory, but I still don't know how to implement it with Scrapy. Can anybody give an example or some detailed information?
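
The idea of keeping only rows with a recent "Add Date" could be sketched roughly as follows. Everything here is an assumption for illustration: each table row is taken to be already parsed into a dict with an `add_date` key, the date is assumed to be rendered as `YYYY-MM-DD`, and the helper names and the JSON state file are hypothetical. The code uses only the standard library, so it is independent of how the spider extracts the rows.

```python
import json
from datetime import date, datetime
from pathlib import Path

DATE_FORMAT = "%Y-%m-%d"  # assumption: how the "Add Date" column is rendered


def load_last_seen(state_file):
    """Return the newest Add Date saved by the previous run, or None."""
    path = Path(state_file)
    if not path.exists():
        return None
    stored = json.loads(path.read_text())["last_add_date"]
    return datetime.strptime(stored, DATE_FORMAT).date()


def save_last_seen(state_file, newest):
    """Persist the newest Add Date so the next run can start from it."""
    payload = {"last_add_date": newest.strftime(DATE_FORMAT)}
    Path(state_file).write_text(json.dumps(payload))


def select_new_rows(rows, last_seen):
    """Return (rows added strictly after last_seen, newest Add Date seen)."""
    new_rows = []
    newest = last_seen
    for row in rows:
        added = datetime.strptime(row["add_date"], DATE_FORMAT).date()
        if last_seen is None or added > last_seen:
            new_rows.append(row)
        if newest is None or added > newest:
            newest = added
    return new_rows, newest
```

In a spider, the `parse` callback would build `rows` from the table, call `select_new_rows(rows, load_last_seen(...))`, yield only the returned rows, and then call `save_last_seen`. Note that the strict `>` comparison skips rows added later on the same day as the previous crawl. If that matters, comparing with `>=` and de-duplicating by a row identifier may be a better fit.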

Coding_Rabbit
  • What exactly have you tried so far? Can you share some code? – aberna May 13 '16 at 08:50
  • Sorry, I don't know how to deal with this, so I'm asking for an example. – Coding_Rabbit May 13 '16 at 09:07
  • Follow the [tutorial](http://doc.scrapy.org/en/latest/intro/tutorial.html) – eLRuLL May 13 '16 at 14:28
  • I do know how to write a spider to get data from a website, but I don't know how to get the data added recently after it has been updated. – Coding_Rabbit May 13 '16 at 15:46
  • @Coding_Rabbit I guess you are looking for this: https://stackoverflow.com/questions/10331738/strategy-for-how-to-crawl-index-frequently-updated-webpages – Weihang Jian Aug 09 '16 at 17:54
  • Possible duplicate of [Strategy for how to crawl/index frequently updated webpages?](https://stackoverflow.com/questions/10331738/strategy-for-how-to-crawl-index-frequently-updated-webpages) – Gallaecio Jan 31 '19 at 14:10

0 Answers