
I'm trying to scrape some information from Google and they aren't liking it. The vector contains 2487 Google pages, and from each of them I want to get the text of the first result.

I tried to create a loop to slow down the process but I'm very bad at it.

b is the vector that contains all the websites. First, I tried:

ContentScraper(b, CssPatterns = ".st") -> b

But then I tried to loop over it and slow it down, and I have no idea how:

for (i in seq_along(b)) {
  b[i] <- ContentScraper(Url = b[i], CssPatterns = ".st")
}

From the 55th page on, all I get is the error. Any thoughts on how to avoid it? Thanks.

Rodf
  • You can wrap it in a `tryCatch` and get past that error. – akrun May 23 '19 at 19:58
  • 429 is "Too Many Requests", likely due to rate limiting. Slow your request rate with an artificial `Sys.sleep(...)` or some other method of ensuring you do not exceed your quota. If you don't know what your limit is, I suggest you look at the user licensing for the website you are scraping and determine either (a) how to increase those limits, or (b) what the limits are so that you don't violate them. – r2evans May 23 '19 at 20:02
  • Well, how am I supposed to use `Sys.sleep(...)` or `tryCatch` in this loop that I created? I don't know where to put it. Thanks. – Rodf May 24 '19 at 15:31

2 Answers


Insert Sys.sleep(...) inside the loop, at the beginning of it.
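A minimal sketch of that (assuming `ContentScraper()` comes from the Rcrawler package, and reusing the `b` vector and `.st` pattern from the question; the 5-second pause is only a guess, since Google's actual limit isn't known):

library(Rcrawler)

results <- vector("list", length(b))
for (i in seq_along(b)) {
  Sys.sleep(5)  # pause before every request to stay under the rate limit (5 s is a guess)
  results[[i]] <- tryCatch(
    ContentScraper(Url = b[i], CssPatterns = ".st"),
    error = function(e) NA  # if a request still fails (e.g. another 429), record NA and keep going
  )
}

The tryCatch wrapper, as suggested in the comments, keeps the loop running if a single request still errors out.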

Nad Pat
Mariano

One way is to use

Sys.sleep(...)
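For example, a rough sketch of the Sys.sleep() route (the 2-5 second random delay is only illustrative; `b` and the `.st` pattern come from the question, and `ContentScraper()` is assumed to be Rcrawler's):

library(Rcrawler)

results <- lapply(b, function(url) {
  Sys.sleep(runif(1, min = 2, max = 5))  # random pause so requests don't arrive in a burst
  ContentScraper(Url = url, CssPatterns = ".st")
})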

Another way, if you're using Puppeteer or Playwright, is to adjust the interval between scrapes with something like Celery Beat.

Is that what you're looking for?

Yusuf Ganiyu