
I am using this list of user agents: https://developers.whatismybrowser.com/useragents/explore/hardware_type_specific/computer/

and in the meantime I am web scraping multiple pages from Amazon. I want to change the user agent during the scraping, so Amazon doesn't block me with a 503 error after about 100 pages.

The problem I am trying to solve is that the code picks only one user agent from the list and then uses it for the whole loop. I want the code to change the user agent at least 2 or 3 times during the loop, choosing those 2 or 3 user agents from the list at the link above.

Tell me if you have further questions. I leave the code below:

library(rvest)
library(tidyverse)


ua_links <- read_html(paste0("https://developers.whatismybrowser.com/useragents/explore/hardware_type_specific/computer/"))
ua <- ua_links %>% html_nodes(".code") %>% html_text(trim = TRUE)



df_monitors <- list()
for (i in 2:400) { 
  #read page
  page <- read_html(paste0("https://www.amazon.it/s?i=computers&rh=n%3A460159031&fs=true&page=", i), user_agent = Sample(ua))
  
  Sys.sleep(4)
  #read the parent nodes
  monitors <- page  %>% html_nodes(xpath= "//div[@class='a-section a-spacing-small s-padding-left-small s-padding-right-small']")
  
  # parse information from each of the parent nodes
  description <- monitors %>% html_node(xpath= ".//*[@class='a-size-base-plus a-color-base a-text-normal']") %>% html_text(trim = TRUE)
  price <- monitors %>% html_node(xpath= ".//*[@class='a-price-whole']") %>% html_text(trim = TRUE)
  
  # put the data together into a data frame add to list                  
  df_monitors[[i]] <- data.frame(description,price)
  print(paste("Page:",i))
}
#combine all data frames into 1
monitor_final <- bind_rows(df_monitors)
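To make the intent of the question concrete: one way to rotate among only 2 or 3 user agents during the loop is to pick a small pool up front and index into it based on the page number. This is a minimal sketch, not tested against Amazon; it assumes the `ua` vector scraped above, uses `httr::GET()` with `httr::user_agent()` (since `rvest::read_html()` does not take a `user_agent` argument), and the pool size and rotation interval are arbitrary choices:

```r
library(rvest)
library(httr)  # for GET() and user_agent()

# pick 3 user agents up front, then rotate every 100 pages
ua_pool <- sample(ua, 3)

for (i in 2:400) {
  # pages 2-101 use ua_pool[1], 102-201 use ua_pool[2], and so on,
  # wrapping around if the loop outlasts the pool
  ua_choice <- ua_pool[((i - 2) %/% 100) %% length(ua_pool) + 1]

  resp <- GET(
    paste0("https://www.amazon.it/s?i=computers&rh=n%3A460159031&fs=true&page=", i),
    user_agent(ua_choice)
  )
  page <- read_html(resp)

  Sys.sleep(4)
  # ... parse `page` exactly as in the original loop ...
}
```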

1 Answer

The user agent changes on every request with this method, though.

library(httr2)
library(tidyverse)
library(rvest)

ua <-
  read_html(
    paste0(
      "https://developers.whatismybrowser.com/useragents/explore/hardware_type_specific/computer/"
    )
  ) %>%
  html_nodes(".code") %>%
  html_text(trim = TRUE)

get_amazon <- function(page_number) {
  
  user_agent <- sample(ua, 1)
  
  page <- paste0(
    "https://www.amazon.it/s?i=computers&rh=n%3A460159031&fs=true&page=",
    page_number
  ) %>%  
    request() %>%  
    req_user_agent(user_agent) %>% 
    req_perform() %>%  
    resp_body_html()
  
  cat("Scraping page", page_number,
      "with user_agent", user_agent, "\n")
  
  tibble(
    title = page %>%
      html_elements(".s-card-border") %>%
      map(~ html_element(.x, ".a-size-base-plus") %>%
            html_text2) %>%
      unlist(),
    price = page %>%
      html_elements(".s-card-border") %>%
      map(~ html_element(.x, ".a-price-whole") %>%
            html_text2) %>%
      unlist()
  )
  
  
}
    
df <- map_dfr(1:200, get_amazon)
Chamkrai
  • Thank you for the reply. I ran it, and at page 143 it stopped and gave me a 503 error. I will try to use more user agents; I found a list of 1000 agents. Do you also know how I can fight reCAPTCHA when web scraping? I am trying to learn web-scraping tricks; using Sys.sleep(4) to fight reCAPTCHA doesn't work – Gio 255 Jul 08 '22 at 21:49
  • @Gio255 I have edited my answer to solve it. I scraped 200 pages without any issues. I changed the method from `rvest` to `httr2` to request the site with a different user agent every time. Hope this helps :) – Chamkrai Jul 09 '22 at 09:27
  • Now it works. I need to be careful, because it works only once per website before they detect me. – Gio 255 Jul 09 '22 at 11:49
  • @Gio255 Yes. Also, there are a lot of duplicate items, as some of them are "sponsored" and appear on almost every page. Feel free to accept my answer if it solves the problem – Chamkrai Jul 09 '22 at 13:03
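For the 503s mentioned in the comments, `httr2` also has built-in helpers that may work better than a fixed `Sys.sleep()`: `req_throttle()` caps the request rate and `req_retry()` automatically retries transient failures (503 and 429 by default). A hedged sketch of how the request inside `get_amazon()` could be wrapped; the specific rate, retry count, and backoff values here are arbitrary assumptions:

```r
library(httr2)

url <- "https://www.amazon.it/s?i=computers&rh=n%3A460159031&fs=true&page=1"

resp <- request(url) %>%
  req_user_agent(sample(ua, 1)) %>%
  # at most one request every 4 seconds, replacing the manual Sys.sleep(4)
  req_throttle(rate = 1 / 4) %>%
  # retry up to 3 times on transient errors (503/429), waiting longer each time
  req_retry(max_tries = 3, backoff = function(attempt) attempt * 10) %>%
  req_perform()
```

Note that none of this defeats reCAPTCHA: once the site serves a CAPTCHA page, the response is HTML without the product listings, so it is worth checking whether the expected nodes are present before parsing.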