I would like to extract the html content of a tag in R. For instance, in the following HTML,

<html><body>Hi <b>name</b></body></html>

suppose I'd like to extract the content of the <body> tag, which would be:

Hi <b>name</b>

The answer to a related question, which uses as.character(), includes the enclosing tag, which is not what I want. For example,

library(rvest)
html = '<html><body>Hi <b>name</b></body></html>'
read_html(html) |>
    html_element('body') |>
    as.character()

returns outerHTML:

[1] "<body>Hi <b>name</b>\n</body>"

...but I want the innerHTML. How can I get the content of an HTML tag in R, without the enclosing tag?

richarddmorey

1 Answer

I could not find a built-in function for this, so here is a custom one.

library(rvest)

html = '<html><body>Hi <b>name</b></body></html>'

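# Serialize the node, then strip the opening tag (with any attributes) at the
# very start of the string and the closing tag at the very end.
turn_to_character_keeping_inner_tags <- function(x, tag) {
  gsub(sprintf('^<%s.*?>|</%s>$', tag, tag), '', as.character(x))
}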

read_html(html) |> 
  html_element('body') |> 
  turn_to_character_keeping_inner_tags('body')

[1] "Hi <b>name</b>\n"
Ronak Shah
  • Is this guaranteed to work, e.g., when there are attributes? The opening body tag: "" would throw this code off. – richarddmorey Dec 15 '22 at 16:34
  • You are right. The updated answer should take care of that. I also added another condition to remove the tag from only the beginning and end so in case you have the same tag in inner HTML that would still remain. @richarddmorey – Ronak Shah Dec 15 '22 at 23:19
  • Isn't "< body>" (note space after opening "<") a valid body tag that will be missed by this? – richarddmorey Dec 16 '22 at 17:43
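
The concerns raised in the comments (attributes, unusual spacing inside the tag) all stem from matching the serialized string with a regex. One possible way around them, offered as a sketch rather than a confirmed solution, is to serialize the children of the node instead of the node itself, so the enclosing tag never appears in the string. This assumes the xml2 package (which rvest builds on) is installed; inner_html is a hypothetical helper name, not a function from rvest or xml2.

```r
library(rvest)
library(xml2)

# Hypothetical helper (not part of rvest or xml2): serialize every child of
# `node`, text nodes included, and concatenate the pieces into one string.
inner_html <- function(node) {
  children <- xml_contents(node)   # element AND text children of the node
  paste(as.character(children), collapse = "")
}

# The opening tag carries an attribute here to exercise the case from the comments.
html <- '<html><body class="main">Hi <b>name</b></body></html>'

result <- read_html(html) |>
  html_element("body") |>
  inner_html()

print(result)
```

Because the enclosing tag is never serialized, its attributes and spacing do not matter. Note that the text is re-serialized by the parser, so special characters in text nodes come back as entities, and whitespace may differ slightly from the original source.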