
I would like to scrape the precipitation data from the meteogram on this page: https://www.ventusky.com/-14.868;-71.332#forecast.

I am trying to do this with rvest, because RSelenium produces an error. The code is:

library(rvest)

library(httr)

link <- read_html("https://www.ventusky.com/-14.868;-71.332")

PP1 <- link %>% 
  html_node(xpath='//*[@id="meteogram"]/div[2]/div/div[1]/svg/g[2]/text[1]') %>% 
  html_text()

I obtained the XPath for the first value by inspecting the website. However, when I run the code, it returns NA. I would appreciate your help.

Bill

1 Answer


In your browser's dev tools, start by checking the page source. If you want to use the inspector to help with CSS selectors or XPaths, first turn off JavaScript for that page. That way you work on the same content that rvest (hopefully) receives; the chart itself is rendered later by JavaScript, so an XPath copied from the rendered page matches nothing. From the chart placeholder you can extract the XPath of the <a> element that holds the data for the graph.
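To illustrate the difference between the static and the rendered page, here is a minimal offline sketch. The HTML fragment is a hypothetical stand-in for what the server sends (the real markup is certainly larger and may differ); it only assumes that the placeholder <a> carries a data-link attribute and that no <svg> exists before JavaScript runs.

```r
library(rvest)

# Hypothetical stand-in for the static HTML: a chart placeholder only,
# no <svg> element (that would be added by JavaScript in a browser).
static_page <- minimal_html('
  <div id="meteogram">
    <div id="graph_rain">
      <a data-link="rain&amp;data=0.1;0;0&amp;time=1666418400">chart</a>
    </div>
  </div>')

# An XPath that targets the rendered SVG finds nothing, so html_text() gives NA
svg_text <- static_page %>%
  html_element(xpath = '//*[@id="meteogram"]//svg') %>%
  html_text()

# The placeholder link is present and carries the chart parameters
link_attr <- static_page %>%
  html_element(xpath = '//*[@id="graph_rain"]/a') %>%
  html_attr("data-link")

svg_text
link_attr
```

If the first result is NA while the second is a non-empty string, the data has to come from the attribute rather than from the drawn chart.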

The URL in data-link provides the parameters for the chart generator. Its data parameter holds not one but several data series, one for each meteogram chart. The 14-value series are separated by ":", so we need to extract just the one we want, along with the date of the first value in the series.

library(rvest)
library(stringr)

link <- read_html("https://www.ventusky.com/-14.868;-71.332")

chart_url_params <- link %>% 
  html_element(xpath='//*[@id="graph_rain"]/a') %>% 
  html_attr("data-link") %>% 
  # split by all separators: "&", "=", ":"
  str_split("&|=|:", simplify = TRUE)

chart_url_params[1:9]
#> [1] "rain"                                       
#> [2] "data"                                       
#> [3] "-1;0;0;0;5;4;4;-1;4;5;5;5;4;4"              
#> [4] "17;19;19;18;16;18;19;18;17;17;12;16;15;17"  
#> [5] "0.1;0;0;1;10.1;0;0;0;0;0.1;9.9;0.1;10.2;0.6"
#> [6] "36;32;36;40;32;40;40;22;25;25;14;18;25;29"  
#> [7] "0;0;0;0;0;0;0;0;0;0;0;0;0;0"                
#> [8] "time"                                       
#> [9] "1666418400"

# element 9: starting date as a Unix timestamp
# element 5: precipitation data, identified by checking the rendered chart

start_date <- as.Date(as.POSIXct(as.numeric(chart_url_params[9]), origin="1970-01-01"))
precip <- chart_url_params[5] %>% 
  str_split(";", simplify = TRUE) %>% 
  as.numeric()

tibble::tibble(
  date = seq(from = start_date, by = "day",  along.with=precip), 
  precip = precip)

Result:

#> # A tibble: 14 × 2
#>    date       precip
#>    <date>      <dbl>
#>  1 2022-10-22    0.1
#>  2 2022-10-23    0  
#>  3 2022-10-24    0  
#>  4 2022-10-25    1  
#>  5 2022-10-26   10.1
#>  6 2022-10-27    0  
#>  7 2022-10-28    0  
#>  8 2022-10-29    0  
#>  9 2022-10-30    0  
#> 10 2022-10-31    0.1
#> 11 2022-11-01    9.9
#> 12 2022-11-02    0.1
#> 13 2022-11-03   10.2
#> 14 2022-11-04    0.6

Created on 2022-10-22 with reprex v2.0.2
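Picking elements 5 and 9 by position breaks if the site reorders its parameters. A slightly more defensive variant, sketched here in base R, looks up data and time by name instead. The sample string is reconstructed from the split output above and is an assumption (the real attribute may contain more parameters), and get_param is a hypothetical helper, not part of rvest or stringr.

```r
# Sample string reconstructed from the split output above (an assumption)
data_link <- paste0(
  "rain&data=",
  "-1;0;0;0;5;4;4;-1;4;5;5;5;4;4:",
  "17;19;19;18;16;18;19;18;17;17;12;16;15;17:",
  "0.1;0;0;1;10.1;0;0;0;0;0.1;9.9;0.1;10.2;0.6:",
  "36;32;36;40;32;40;40;22;25;25;14;18;25;29:",
  "0;0;0;0;0;0;0;0;0;0;0;0;0;0",
  "&time=1666418400"
)

# Hypothetical helper: value of `key` in a "k1=v1&k2=v2" style string
get_param <- function(x, key) {
  parts <- strsplit(x, "&", fixed = TRUE)[[1]]
  hit <- parts[startsWith(parts, paste0(key, "="))]
  if (length(hit) == 0) return(NA_character_)
  substring(hit[1], nchar(key) + 2)
}

# Split the data parameter into series (":"), then into numeric values (";")
series <- strsplit(get_param(data_link, "data"), ":", fixed = TRUE)[[1]]
values <- lapply(strsplit(series, ";", fixed = TRUE), as.numeric)

# Third series = precipitation, as identified from the rendered chart
precip <- values[[3]]

# Unix timestamp of the first value, converted in UTC
start_time <- as.POSIXct(as.numeric(get_param(data_link, "time")),
                         origin = "1970-01-01", tz = "UTC")
start_date <- as.Date(start_time)

forecast <- data.frame(
  date = seq(from = start_date, by = "day", along.with = precip),
  precip = precip
)
forecast
```

The position of the precipitation series within data is still taken on trust from the chart, so it is worth re-checking against the rendered meteogram from time to time.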

margusl
  • Thanks for your response. It was really useful. I just want to know how you made the XPath show up like that. Thank you. – César Carvallo Oct 23 '22 at 19:37
  • Not sure if I got it right, but it's the unchanged XPath from Chrome Dev Tools' Copy XPath - https://i.stack.imgur.com/G45BE.png . The highlight comes from a search (Ctrl+F). – margusl Oct 23 '22 at 20:07