This may have been covered in other posts, but I cannot find a solution to my issue. I am trying to scrape indicator data from https://tradingeconomics.com/indicators, in particular the country names and the plots shown on each country's page.

library(rvest)  # also provides the %>% pipe

tradec = function(tradelink) {
  trade_page = read_html(tradelink)
  trade_element = trade_page %>% html_nodes(".primary_photo+ td a") %>%
    html_text() %>% paste(collapse = ",")
  return(trade_element)
}

main_page <- read_html("https://tradingeconomics.com/country-list/gdp-growth-rate")
country_list <- main_page %>%
  html_nodes("#ctl00_ContentPlaceHolder1_ctl01_UpdatePanel1 a") %>%
  html_text() %>%
  trimws %>%
  gsub(" ", "-", .)


tradec_df = data.frame()

for (i in country_list) {
  link = paste0("https://tradingeconomics.com/", i, "/gdp-growth")
  page = read_html(link)

  country = page %>% html_nodes("#SelectCountries") %>% html_text()
  tradec_charts = page %>% html_nodes("#ImageChart") %>% html_text

  tradec_df = rbind(tradec_df, data.frame(country, tradec_charts, stringsAsFactors = FALSE))
  print(paste("Page:", i))  # report the country just processed
}

Ideally, I would like to print one page per country that includes the country name and the plot. I am fairly sure the plots can be scraped and displayed in some way, but I have no idea how. Any suggestions?

Il Forna

1 Answer

It's not working because each element of country_list contains leading and trailing whitespace (carriage returns, newlines and spaces):

 [1] "\r\n                                        South Africa\r\n                                    "          
 [2] "\r\n                                        Peru\r\n                                    "                  
 [3] "\r\n                                        Botswana\r\n                                    "   

So all you need to do is remove those characters with trimws(), so they look like this instead:

country_list
 [1] "South Africa"           "Peru"                   "Botswana"               "India"                  "Turkey"                
 [6] "New Zealand"            "Argentina"              "Malta"                  "Slovenia"               "El Salvador"           
[11] "Ireland"                "Rwanda"                 "Albania"                "Luxembourg"             "Nigeria"               
[16] "Canada"                 "Jamaica"                "Uruguay"                "Brazil"                 "Paraguay"  
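
As a minimal sketch of that cleanup, using the three raw strings shown above as sample input (the padding is shortened, and the names `raw_names` and `clean_names` are illustrative, not part of the original code):

```r
# Sample input modelled on the raw scrape output above (padding shortened).
raw_names <- c(
  "\r\n        South Africa\r\n    ",
  "\r\n        Peru\r\n    ",
  "\r\n        Botswana\r\n    "
)

# trimws() strips leading and trailing spaces, tabs, newlines and carriage returns;
# whitespace inside a name ("South Africa") is left alone.
clean_names <- trimws(raw_names)
print(clean_names)
```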

This works. The only line I changed was to add the pipe to trimws():

library(tidyverse)
library(rvest)


tradec = function(tradelink) {
  trade_page = read_html(tradelink)
  trade_element = trade_page %>% html_nodes(".primary_photo+ td a") %>%
    html_text() %>% paste(collapse = ",")
  return(trade_element)
}

main_page <- read_html("https://tradingeconomics.com/country-list/gdp-growth-rate")
country_list <-  main_page %>% 
  html_nodes("#ctl00_ContentPlaceHolder1_ctl01_UpdatePanel1 a") %>% 
  html_text() %>% 
  trimws


tradec_df = data.frame()

for (i in country_list) {
  link = paste0("https://tradingeconomics.com/", i , "/gdp-growth")
  page = read_html(link)
  
  country = page %>% html_nodes("#SelectCountries") %>% html_text()
  tradec_links = page %>% html_nodes("#ImageChart") %>% html_text
}
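
Note that names such as "South Africa" still contain a space, while the site's URLs separate words with dashes (see the comments below). A small sketch of building the link with that replacement, assuming the URL pattern used in the loop; `build_link` is a hypothetical helper, not part of the original code:

```r
# Hypothetical helper: turn a scraped country name into the page URL used in the loop.
build_link <- function(country) {
  # Trim surrounding whitespace, then replace inner spaces with dashes.
  slug <- gsub(" ", "-", trimws(country), fixed = TRUE)
  paste0("https://tradingeconomics.com/", slug, "/gdp-growth")
}

links <- build_link(c("South Africa", "  Peru\r\n", "El Salvador"))
print(links)
```

Keeping the URL construction in one function makes it easy to print and compare the generated links against the ones the site actually uses before requesting them.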
stevec
  • Thanks for your reply and suggestions. I am receiving this error message, though: Error in open.connection(x, "rb") : HTTP error 400. In any case, in which format is the chart downloaded this way? – Il Forna Feb 19 '21 at 15:40
  • @IlForna a 400 error means Bad Request: the server rejected the URL as malformed. So the next thing you must solve is ensuring that the URLs you generate with your code are precisely the same as the ones the website uses. – stevec Feb 19 '21 at 15:41
  • @IlForna I checked, and the countries in your list are space separated, but in the URLs, they're dash separated. So you'll need to add this after `trimws`: `%>% gsub(" ", "-", .)`. That will replace all spaces with dashes, then the scrape works. – stevec Feb 19 '21 at 15:45
  • @IlForna that will solve the problem you've outlined in this question. I can't work out exactly what you're doing in the loop, but if you think about what you're doing there, you could ask about it in another question – stevec Feb 19 '21 at 15:47
  • Thanks a lot for the help! I did not notice the dash indeed. I have edited the code following both your suggestions and my needs, and I edited the question as well. This is a domain I have never worked with in R, hence I am moving awkwardly. – Il Forna Feb 19 '21 at 19:07
  • @IlForna no problem. You are doing really well, this is a tough problem. Regarding the plots, I think you have two options: i) try to get the raw data you need to recreate the plots, or ii) take a screenshot of the page when you visit it (something like RSelenium can do that) – stevec Feb 20 '21 at 05:51
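
For the plots, one thing worth checking first is whether #ImageChart is an <img> element: html_text() returns an empty string for an image, because the chart's location lives in its src attribute rather than in its text. The sketch below assumes that it is an <img> (this has not been verified against the live site) and uses an inline HTML snippet with a placeholder URL in place of a real page; `chart_url` is a hypothetical helper.

```r
library(rvest)

# Stand-in for a country page; the real markup is an assumption.
page <- read_html('<html><body>
  <img id="ImageChart" src="https://example.com/chart.png">
</body></html>')

# Hypothetical helper: return the image address(es) of the chart element.
chart_url <- function(page) {
  page %>% html_nodes("#ImageChart") %>% html_attr("src")
}

src <- chart_url(page)
print(src)

# With a real page, the image could then be saved to disk, for example:
# download.file(src, destfile = "chart.png", mode = "wb")
```

If the chart turns out to be drawn by JavaScript instead of served as an image, this approach will not find it, and the two options mentioned in the last comment remain the way to go.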