Scraping HTML file in R to extract specific lines

Question

I am trying to develop an R script that can extract specific lines of downloaded HTML files. Here is a file example:

<html>
<head>
<title>ARMS Email System</title>
<meta name="record_type" content="FEDERAL  (NOTES MAIL)">
<meta name="creator" content="redacted">
<meta name="creation_date" content="2000-11-22">
<meta name="to" content="redacted">
<meta name="cc" content="   ">
<meta name="bcc" content="   ">
<meta name="subject" content=" fwd: re: fwd: Accomplishments section of Progress Report ">
</head>
<body>
[redacted]
</body>
</html>

Ideally, I would like to extract Record Type, Creator, Creation Date, Subject, and To, which all appear to be stored in meta tags. How can I scrape the "creation_date" for each record in the HTML file? My attempt so far:

html <- read_html(x = "/Users/.../A1.html")
text = html %>% 
  html_element('creation_date') %>%
  html_text2()

score 0 · Answer 1 · answered May 02 '23 at 17:27

If you want to extract the values from the meta tags, you can do

library(rvest)
html %>% 
  html_elements('meta') %>% 
  {
    data.frame(
      name = html_attr(., "name", ""),
      value = html_attr(., "content", "")
    )
  }
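
To narrow that table down to the fields named in the question, one option is to filter on the name column. The following is only a sketch: it parses an inline copy of part of the example file instead of reading A1.html, and `wanted` and `picked` are illustrative names that do not come from the original post.

```r
library(rvest)

# Inline copy of part of the example file, so the sketch runs without A1.html.
doc <- read_html('<html><head>
<title>ARMS Email System</title>
<meta name="record_type" content="FEDERAL  (NOTES MAIL)">
<meta name="creation_date" content="2000-11-22">
<meta name="cc" content="   ">
<meta name="subject" content=" fwd: re: fwd: Accomplishments section of Progress Report ">
</head><body></body></html>')

meta <- html_elements(doc, "meta")

fields <- data.frame(
  name  = html_attr(meta, "name", ""),
  value = trimws(html_attr(meta, "content", ""))  # drop padding around values
)

# Assumed list of wanted tags, taken from the names in the example file.
wanted <- c("record_type", "creator", "creation_date", "subject", "to")
picked <- fields[fields$name %in% wanted, ]
picked
```

Tags that are absent from a file (here `creator` and `to`) simply do not appear in the result, so the row count can differ from one file to the next.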

If you wanted just the creation_date, you could do something like

html %>% 
  html_element('meta[name="creation_date"]') %>% 
  html_attr("content")
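
Since the question mentions several downloaded files, the same selector can be applied to each file in turn. This is a rough sketch under some assumptions: the temporary folder, the two tiny files written into it, and the second date are made up for illustration, and in practice `files` would point at the real download folder.

```r
library(rvest)

# Hypothetical folder and files standing in for the real downloads.
dir <- tempfile("records")
dir.create(dir)
writeLines(
  '<html><head><meta name="creation_date" content="2000-11-22"></head><body></body></html>',
  file.path(dir, "A1.html")
)
writeLines(
  '<html><head><meta name="creation_date" content="2000-11-23"></head><body></body></html>',
  file.path(dir, "A2.html")
)

files <- list.files(dir, pattern = "\\.html$", full.names = TRUE)

# One creation_date per file; NA if a file has no such meta tag.
dates <- vapply(files, function(path) {
  read_html(path) %>%
    html_element('meta[name="creation_date"]') %>%
    html_attr("content")
}, character(1))

result <- data.frame(file = basename(files), creation_date = unname(dates))
result
```

Using `vapply()` with `character(1)` makes the loop fail loudly if a file ever yields something other than a single string, which is easier to debug than a silently malformed column.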
