Parse HTML into text with Div level in R

Question

library(XML)
html <- read_html("https://www.sec.gov/Archives/edgar/data/1011290/000114036105007405/body.htm")
doc.html = htmlTreeParse(html, useInternal = TRUE)
doc.text = unlist(xpathApply(doc.html, '//div', xmlValue))

The code above returns the text twice because of the nested div structure; I need each piece of text only once. Thank you for your time and help. For example:

doc.text[2] # contains all the text which repeats again in 3 to 59

Are you sure you are using XML? The function read_html is likely from rvest, not from XML. – Nicolás Velasquez, Jul 02 '18 at 21:10
I want all the text inside the document, but just once, not repeated. One approach I have seen is a regular expression that matches anything between "<" and ">". It's just a thought; I could add the regex tag if anyone can help with it. – Janjua, Jul 02 '18 at 21:12
I don't know much about R. Yes, it's my mistake; I was trying different approaches to get the desired result. @NicolásVelásquez – Janjua, Jul 02 '18 at 21:14
No worries, mate. This is the right place to learn and to make mistakes, so that we learn to correct them. XML's xmlParse and xmlTreeParse might not yield exact equivalents of what you'd get with rvest's read_html. So it would be useful for the community to know that, to reproduce your example, we need to load the object called 'html' through the function read_html from the package rvest. – Nicolás Velasquez, Jul 02 '18 at 21:22

Nicolás Velasquez · Accepted Answer · 2018-07-02T22:18:28.143

1

Try this:

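library(rvest)      # read_html(), html_nodes(), html_text()
library(tidyverse)  # loads the %>% pipe, among other packages
html <- read_html("https://www.sec.gov/Archives/edgar/data/1011290/000114036105007405/body.htm")
# Select only the <div> nodes that are direct children of <text>,
# so text inside nested divs is not returned a second time
text <- html %>% 
         html_nodes(xpath = "//text/div") %>%
         html_text(trim = TRUE) %>% 
         paste(collapse = ' ')  # join the pieces into one string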
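
To see why the narrower XPath avoids the repetition, here is a minimal sketch on a made-up page (the HTML string and the variable names `page`, `all_divs` and `top_divs` are assumptions for illustration, not taken from the SEC filing). A plain `//div` matches the outer div and every div nested inside it, so the inner text is returned more than once; anchoring the path to a parent matches only the top level.

```r
library(rvest)

# Hypothetical toy page: one outer <div> wrapping two inner <div>s
page <- read_html(
  "<html><body><div><div>first</div><div>second</div></div></body></html>"
)

# '//div' matches the outer div AND both inner divs,
# so the inner text shows up once per matching ancestor
all_divs <- page %>%
  html_nodes(xpath = "//div") %>%
  html_text(trim = TRUE)

# '//body/div' matches only divs that are direct children of <body>,
# so each piece of text is returned exactly once
top_divs <- page %>%
  html_nodes(xpath = "//body/div") %>%
  html_text(trim = TRUE)

length(all_divs)  # 3 matches: outer div plus two inner divs
length(top_divs)  # 1 match: the outer div only
```

On the real document the parent happens to be a `<text>` element rather than `<body>`, which is why the answer uses `//text/div`.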

edited Jul 02 '18 at 22:18

answered Jul 02 '18 at 21:33

Nicolás Velasquez


Much better, and it preserves the levels. Just one addition: text = paste(text, collapse = ' '). Working perfectly. – Janjua Jul 02 '18 at 22:15
Just for my own knowledge, how did you figure out xpath = "//text/div"? I don't know much about R and its packages. – Janjua Jul 02 '18 at 22:22
In this case it was just an educated guess made by exploring the page's source code. A good tutorial is here: http://www.bernhardlearns.com/2017/04/webscraping-with-r-and-rvest-how-can-i.html Just note that this tutorial employs CSS selectors rather than XPath. In most cases they are equivalent. – Nicolás Velasquez Jul 02 '18 at 22:26
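
As a small illustration of that equivalence, the sketch below selects the same nodes once with XPath and once with a CSS selector. The toy HTML and the names `page`, `by_xpath` and `by_css` are assumptions made up for this example; only `html_nodes()` and its `xpath` and `css` arguments come from rvest.

```r
library(rvest)

# Hypothetical toy page with two top-level divs and one paragraph
page <- read_html(
  "<html><body><div>a</div><p>skip</p><div>b</div></body></html>"
)

# XPath: div elements that are direct children of <body>
by_xpath <- page %>%
  html_nodes(xpath = "//body/div") %>%
  html_text()

# CSS: the child combinator '>' expresses the same relationship
by_css <- page %>%
  html_nodes(css = "body > div") %>%
  html_text()

identical(by_xpath, by_css)  # both select the same two divs
```

XPath can express a few things CSS cannot, such as selecting a parent based on its children, so the two are interchangeable for simple cases like this one rather than in general.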
