Extracting innerHTML using rvest

Question

I would like to extract the html content of a tag in R. For instance, in the following HTML,

<html><body>Hi <b>name</b></body></html>

suppose I'd like to extract the content of the <body> tag, which would be:

Hi <b>name</b>

The answer to a related question (using as.character()) includes the enclosing tag, which is not what I want. For example:

library(rvest)
html = '<html><body>Hi <b>name</b></body></html>'
read_html(html) |>
    html_element('body') |>
    as.character()

returns outerHTML:

[1] "<body>Hi <b>name</b>\n</body>"

...but I want the innerHTML. How can I get the content of an HTML tag in R, without the enclosing tag?

Do you only want the text? `read_html(html) |> html_element('body') |> html_text()` — Ronak Shah, Dec 15 '22 at 09:42
That will remove the tags in my example. That won't work. — richarddmorey, Dec 15 '22 at 09:50

Ronak Shah · Answer 1 · 2022-12-15T23:17:48.000

0

I could not find an inbuilt function to do this so here's a custom one.

library(rvest)

html = '<html><body>Hi <b>name</b></body></html>'

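# Strip the opening tag (attributes included) from the start of the string
# and the closing tag from the end; any inner tags are left untouched.
turn_to_character_keeping_inner_tags <- function(x, tag) {
  gsub(sprintf('^<%s.*?>|</%s>$', tag, tag), '', as.character(x))
}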

read_html(html) |> 
  html_element('body') |> 
  turn_to_character_keeping_inner_tags('body')

[1] "Hi <b>name</b>\n"

edited Dec 15 '22 at 23:17

answered Dec 15 '22 at 09:59

Is this guaranteed to work, eg, when there are attributes? The opening body tag: "" would throw this code off. – richarddmorey Dec 15 '22 at 16:34
You are right. The updated answer should take care of that. I also added another condition to remove the tag from only the beginning and end so in case you have the same tag in inner HTML that would still remain. @richarddmorey – Ronak Shah Dec 15 '22 at 23:19
Isn't "< body>" (note space after opening "<") a valid body tag that will be missed by this? – richarddmorey Dec 16 '22 at 17:43
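
The concerns in these comments all come from pattern-matching the serialized string. A possible alternative, offered only as a sketch, is to skip the regex and serialize the element's child nodes instead. It assumes xml2 (the package rvest is built on) is installed and uses its xml_contents(), which returns all child nodes including text nodes. The helper name inner_html and the class="main" attribute are made up for illustration and are not part of either package or of the original example.

```r
library(rvest)  # read_html(), html_element()
library(xml2)   # xml_contents()

# Hypothetical helper (not part of rvest or xml2): serialize each child
# node of `node` and concatenate the pieces, so the enclosing tag and its
# attributes never appear in the output.
inner_html <- function(node) {
  # xml_contents() keeps text nodes; html_children() would drop "Hi "
  children <- xml2::xml_contents(node)
  paste(as.character(children), collapse = "")
}

# Same markup as in the question, with an attribute added to the body tag
html <- '<html><body class="main">Hi <b>name</b></body></html>'

result <- read_html(html) |>
  html_element("body") |>
  inner_html()

print(result)
```

Because the parser has already dealt with the opening tag, attributes or unusual spacing inside it should not matter here. Whitespace in the result may still differ slightly from the source, depending on how each node is serialized.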
