How to extract information from a dynamic URL using R?

Question

I can't extract a value displayed at a dynamic URL. The problem seems to be that the URL is dynamic.

When I copy the relevant markup from the page source and use that as the HTML content, the extraction works. When I read the live URL instead, html_node() seems to come back empty and my code fails.

library(rvest)
library(tidyverse)

# 2 sources of html
url_source <- '<span>Earnings on <span>Thu, Aug 03</span></span><span class="Mstart(15px) Fw(500) Fz(s)"><span>1-100 of 1270 results</span></span>'
url_live <- "https://finance.yahoo.com/calendar/earnings?from=2023-07-30&to=2023-08-05&day=2023-08-03"

# HTML content to parse
#html_content <- url_source
html_content <- url_live

# Parse the HTML content
webpage <- read_html(html_content)

# Extract the value using an XPath expression
value <- webpage %>%
  html_node(xpath = '//span[contains(@class, "Mstart") and contains(@class, "Fw") and contains(@class, "Fz")]/span') %>%
  html_text()

# Extract the numeric part from the text
numeric_value <- as.numeric(str_extract(value, "\\d+(?= results)"))

# Print the extracted value
print(numeric_value)
#[1] 1270 from url_source
#[1] NA from url_live

Maybe you can try the full XPath, if you always expect that element in that position. — lasagna, Aug 03 '23 at 14:43
Could it be a variant of this issue? https://stackoverflow.com/q/76809554/20513099 — I_O, Aug 03 '23 at 14:53

Till · Accepted Answer · 2023-08-03T16:11:57.207

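It's working for me with html_nodes() (or its successor html_elements(), used in the code below) instead of html_node(). html_node() returns only the first element that matches the XPath, and on the live page the first match is apparently not the results counter, so the regex finds nothing and you get NA. The plural form returns every match, and the NAs can then be dropped.

Note: rvest 1.0.0 introduced html_element() and html_elements() to supersede html_node() and html_nodes().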

library(rvest)
library(tidyverse)

url_live <- "https://finance.yahoo.com/calendar/earnings?from=2023-07-30&to=2023-08-05&day=2023-08-03"
html_content <- url_live
webpage <- read_html(html_content)

# Extract the value using an XPath expression
value <- webpage %>%
  html_elements(xpath = '//span[contains(@class, "Mstart") and contains(@class, "Fw") and contains(@class, "Fz")]/span') %>%
  html_text()

# Extract the numeric part from the text
as.numeric(str_extract(value, "\\d+(?= results)")) |> 
  na.omit()
#> [1] 1266
#> attr(,"na.action")
#> [1] 1 2 3
#> attr(,"class")
#> [1] "omit"
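
To see the difference without hitting the live site, here is a minimal offline sketch. The HTML snippet is a made-up stand-in for the Yahoo page (an assumption, not its real markup), and the names page, xp, first_only, all_matches, hit and total are only illustrative. It assumes rvest >= 1.0.0 for minimal_html(), html_element() and html_elements().

```r
library(rvest)

# Hypothetical stand-in for the live page: two spans match the XPath,
# and only the second one holds the results counter.
page <- minimal_html('
  <span class="Mstart(15px) Fw(500) Fz(s)"><span>Earnings</span></span>
  <span class="Mstart(15px) Fw(500) Fz(s)"><span>1-100 of 1270 results</span></span>
')

xp <- '//span[contains(@class, "Mstart") and contains(@class, "Fw") and contains(@class, "Fz")]/span'

# Singular form: only the first matching element is returned
first_only <- html_text(html_element(page, xpath = xp))

# Plural form: every matching element is returned
all_matches <- html_text(html_elements(page, xpath = xp))

# Keep the match that contains the counter, then pull out the total (base R regex)
hit <- grep("[0-9]+ results", all_matches, value = TRUE)
total <- as.numeric(sub(".* ([0-9]+) results.*", "\\1", hit))
```

With this snippet, first_only holds the text of the unrelated first span, while all_matches also contains the counter text, so total comes out as 1270. Filtering with grep() before converting avoids the NA values and the na.omit() attributes shown in the output above.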

edited Aug 03 '23 at 16:11

answered Aug 03 '23 at 15:20

Till

Duh! Just an "s". Sorry Till, but thank you for the quick reply. Guess I'll use html_elements() from now on. Thanks again. – user22322433 Aug 03 '23 at 16:03
Happy to be helpful! Feel free to upvote/accept my answer. – Till Aug 03 '23 at 16:12
