I am using rvest to scrape some information off websites as a little hobby project. However, for one particular node I try to extract, the result has CSS styling code prepended to (and mixed into) the text.

library(rvest)  # also makes the %>% pipe available

URL <- 'https://www.thepioneerwoman.com/food-cooking/recipes/a41138141/apple-pie-cookies-recipe/'
recipe <- rvest::read_html(URL)
directions <- rvest::html_nodes(recipe, ".et3p2gv0") %>%
    rvest::html_text() %>%
    trimws()

This is what appears in the output:

[1] ".css-dt22uw{display:none;visibility:hidden;}Step .css-6ds1rq{border-right:thin solid #b20039;height:1rem;left:-3rem;position:absolute;top:0.45rem;width:1.4rem;}1.css-1baulvz{display:inline-block;}Melt the butter in a medium saucepan over medium-high heat. Add the apples and cook until they start to soften, 3 to 4 minutes. Stir in the brown sugar and lemon juice, bring to a simmer and cook until the apples are soft and the liquid is starting to reduce, 3 to 4 more minutes. Whisk the apple juice and cornstarch in a small bowl and add it to the pan. Cook, stirring, until the mixture thickens, about 1 more minute. Remove from the heat and let cool. "

I have tried several different nodes and CSS selectors, but the styling code still appears in the output.

I might end up just using gsub() to strip it from the string, but I would rather not.

Cole Baril

2 Answers

XPath's text() is quite handy at times: you can mix and match it with CSS selectors, or rewrite the whole selector as XPath:

library(rvest)  # also makes the %>% pipe available

URL <- 'https://www.thepioneerwoman.com/food-cooking/recipes/a41138141/apple-pie-cookies-recipe/'
recipe <- rvest::read_html(URL)

# get the <li> elements with a CSS selector, then extract the text nodes of each element with XPath
directions_1 <- rvest::html_elements(recipe, "ol.et3p2gv0 li") %>%
  rvest::html_elements(xpath = "./text()") %>%
  rvest::html_text() %>%
  trimws()

# or use only XPath
directions_2 <- rvest::html_elements(recipe, xpath='//ol[contains(@class, "et3p2gv0")]/li/text()') %>%
  rvest::html_text() %>%
  trimws()
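
To see why text() sidesteps the problem, here is a minimal offline sketch. The HTML string is a made-up stand-in for the page's markup (the class names are borrowed from the output in the question, but the exact nesting is an assumption). The idea is that `./text()` selects only the text nodes that are direct children of each `<li>`, so anything nested inside `<style>` or `<span>` is skipped.

```r
library(rvest)  # also makes the %>% pipe available

# Hypothetical markup: <style> tags and a step label sit inside each <li>
html <- '<ol class="et3p2gv0">
<li><style>.css-dt22uw{display:none;}</style><span>Step 1</span><style>.css-1baulvz{display:inline-block;}</style>Melt the butter.</li>
<li><span>Step 2</span>Let cool.</li>
</ol>'
doc <- read_html(html)

# html_text() on the <li> concatenates every descendant text node, CSS included
with_css <- html_elements(doc, "ol.et3p2gv0 li") %>%
  html_text()

# ./text() keeps only the text nodes that are direct children of each <li>
steps <- html_elements(doc, "ol.et3p2gv0 li") %>%
  html_elements(xpath = "./text()") %>%
  html_text() %>%
  trimws()

print(with_css)
print(steps)
```

Note that this also drops the "Step N" label, because it lives inside a `<span>` rather than directly under the `<li>`.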
margusl

Perhaps xml2::xml_remove() might help.

URL <- 'https://www.thepioneerwoman.com/food-cooking/recipes/a41138141/apple-pie-cookies-recipe/'
recipe <- rvest::read_html(URL)
directions <- rvest::html_nodes(recipe, ".et3p2gv0")

toremove <- directions %>%
  rvest::html_node("style")

xml2::xml_remove(toremove)

directions %>%
  rvest::html_text(trim = TRUE)
anpami

`toremove <- directions %>% rvest::html_nodes("style")` - however, you likely want to retain the whitespace between the step number and the instruction, so a replacement before removing it is a possibility. – QHarr Oct 12 '22 at 18:00
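
The difference between the singular and plural selector functions can be checked offline with a small sketch. The HTML string below is hypothetical (a simplified guess at the page's nesting, not the real markup). With the plural form every `<style>` node under the selected element is matched and removed, whereas the singular form returns at most one match per input node.

```r
library(rvest)  # also makes the %>% pipe available
library(xml2)

# Hypothetical markup with two <style> nodes under a single <ol>
html <- '<ol class="et3p2gv0"><li><span>Step <style>.a{top:0;}</style>1</span><style>.b{display:inline-block;}</style>Melt the butter.</li></ol>'
doc <- read_html(html)
directions <- html_elements(doc, ".et3p2gv0")

# Singular: one result per input node; plural: every matching descendant
n_first <- length(html_element(directions, "style"))
n_all <- length(html_elements(directions, "style"))

# Remove every <style> node from the document in place
xml_remove(html_elements(directions, "style"))

cleaned <- html_text(directions, trim = TRUE)
print(cleaned)
```

In this sketch the step label and the instruction end up directly adjacent once the `<style>` nodes are gone, which is the whitespace issue mentioned in the comment.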