I was looking into testing the Scrapy pipeline (I already know the spider works) when it occurred to me that I could just use a local copy of a page from the target website instead of repeatedly hitting the live site with my spider. But I did not see anything suggesting that option. Is there some reason why this won't work, or why it is not a best practice?
-
How are you proposing to store and access the local copy of the page? At the moment your question is probably going to be closed as "opinion based" unless you add some more detail. – Tony Feb 24 '18 at 22:48
-
? It is just an html file. Why is storing and accessing that a mystery? Either people do this or they don't, and if they don't, it's because it doesn't work for x or y reason. – Malik A. Rumi Feb 27 '18 at 14:45
-
Without knowing which website you are scraping it’s hard to say. Many sites now use JavaScript or do processing on the server depending on what you requested. I guess you could store the resulting HTML, but it may not be rendered the same way when you scrape the live site. – Tony Feb 27 '18 at 20:39
-
The reason I said your question might be downvoted or closed is that you are asking for people’s opinions rather than asking for the answer to a specific programming problem, as defined by the Stack Overflow rules. Have you tried downloading the HTML and scraping that? If that works for you, then great. If not, you could ask for help getting it to work, if you see what I mean. – Tony Feb 27 '18 at 20:46
-
Ok. I didn't think 'how do you handle this situation' would be considered an opinion, but I see what you are saying. Not sure how I would re-word it. – Malik A. Rumi Feb 27 '18 at 22:49
-
Found this https://doc.scrapy.org/en/latest/topics/shell.html which explicitly talks about using local copies, if that helps anyone else. – Malik A. Rumi Feb 28 '18 at 19:37
-
Hey @Tony, why is this such a crazy idea to you? And do you need the website? I assume that HTML is HTML; if there's an issue with JS, then there are other tools to handle that. We're talking about Scrapy and mocking results. – A H Bensiali Feb 29 '20 at 15:26
-
@AHBensiali - I didn't say it was a crazy idea, just that questions on SO are usually about solving a specific problem and are usually supported with some research or an attempt at solving it. When first posted this question did not include either, but the OP later posted a comment with a link to documentation about local copies. This question is now 2 years old and is yet to be answered, so to me that says it needs more detail. – Tony Mar 01 '20 at 17:29