library(XML)
html <- read_html("https://www.sec.gov/Archives/edgar/data/1011290/000114036105007405/body.htm")
doc.html = htmlTreeParse(html, useInternal = TRUE)
doc.text = unlist(xpathApply(doc.html, '//div', xmlValue))

The code above returns the text twice because of the nested div structure; I need each piece of text only once. Thank you for your time and help. For example:

doc.text[2] # contains all the text which repeats again in 3 to 59

Janjua
  • What if you just read the first `div`? `'//div[1]'` – Wiktor Stribiżew Jul 02 '18 at 20:56
  • I'm not sure that this is a regex question... – emsimpson92 Jul 02 '18 at 20:59
  • Are you sure you are employing XML? The function read_html is likely from rvest, it is not from XML. – Nicolás Velasquez Jul 02 '18 at 21:10
  • I want all the text inside the document, but only once, not repeated. I have seen an approach that uses a regular expression to match anything between "<" and ">"; I added the regex tag in case anyone can help with that. – Janjua Jul 02 '18 at 21:12
  • I don't know much about R. Yes, it's my mistake; I was trying different approaches to get the desired results @NicolásVelásquez – Janjua Jul 02 '18 at 21:14
  • No worries mate. This is the right social network to come to, learn, and make mistakes so we learn to correct them. XML's xmlParse and xmlTreeParse might not yield exact equivalents of what you'd get with rvest's read_html. So it would be useful for the community to know that, to reproduce your example, we need to load the object called 'html' through the function read_html from the package rvest. – Nicolás Velasquez Jul 02 '18 at 21:22
  • thanks for your time and help @NicolásVelásquez – Janjua Jul 02 '18 at 21:39

1 Answer


Try this:

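library(rvest)      # read_html(), html_nodes(), html_text()
library(tidyverse)  # provides the %>% pipe
html <- read_html("https://www.sec.gov/Archives/edgar/data/1011290/000114036105007405/body.htm")
# Select only the <div> nodes directly under <text>, so nested divs are not repeated
text <- html %>%
  html_nodes(xpath = "//text/div") %>%
  html_text(trim = TRUE) %>%
  paste(collapse = ' ')  # join the pieces into a single string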
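
To illustrate why `'//div'` repeats the text while a narrower path does not, here is a minimal sketch on a made-up document. The `toy` HTML string and the path `/html/body/div` are assumptions for illustration only; they are not taken from the SEC page, where `//text/div` presumably plays the same role of matching only the outermost divs.

```r
library(rvest)  # html_nodes(), html_text(), %>%; read_html() comes via xml2

# Hypothetical miniature document: one outer <div> wrapping two inner <div>s
toy <- read_html("<html><body><div><div>alpha</div><div>beta</div></div></body></html>")

# '//div' matches the outer div AND each inner div,
# so the inner text is returned once per enclosing level
all_divs <- toy %>%
  html_nodes(xpath = "//div") %>%
  html_text(trim = TRUE)
print(all_divs)   # "alphabeta" "alpha" "beta"

# Anchoring the path to the parent matches only the top-level div,
# so every piece of text is returned exactly once
top_divs <- toy %>%
  html_nodes(xpath = "/html/body/div") %>%
  html_text(trim = TRUE)
print(top_divs)   # "alphabeta"
```

The design point is that `html_text()` already includes the text of all descendants, so selecting both a node and its descendants necessarily duplicates content.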
Nicolás Velasquez
  • Much better, and the levels are preserved. Just one addition: `text = paste(text, collapse = ' ')`. Working perfectly. – Janjua Jul 02 '18 at 22:15
  • Just for my own knowledge, how did you figure out xpath = "//text/div"? I don't know much about R and its packages. – Janjua Jul 02 '18 at 22:22
  • In this case it was just an educated guess made by exploring the page's source code. A good tutorial is here: http://www.bernhardlearns.com/2017/04/webscraping-with-r-and-rvest-how-can-i.html Just note that this tutorial employs CSS selectors rather than XPath. In most cases they are equivalent. – Nicolás Velasquez Jul 02 '18 at 22:26
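
As a rough sketch of the last comment's point, the snippet below runs a CSS selector and an XPath expression against the same made-up document and compares the results. The `toy` HTML, the `note` class, and both selectors are hypothetical and chosen only for illustration. The final call uses `xml2::xml_path()`, which reports the path of each matched node and can help when working out an XPath by exploration.

```r
library(rvest)  # html_nodes(), html_text(), %>%; read_html() comes via xml2

# Hypothetical document with two divs, one of them carrying a class
toy <- read_html(
  '<html><body><div class="note"><p>first</p></div><div><p>second</p></div></body></html>'
)

# CSS selector: <p> elements that are direct children of <div class="note">
css_hits <- toy %>%
  html_nodes("div.note > p") %>%
  html_text()

# XPath expression selecting the same nodes in this document
xpath_hits <- toy %>%
  html_nodes(xpath = "//div[@class='note']/p") %>%
  html_text()

print(identical(css_hits, xpath_hits))  # TRUE

# Show where every <p> sits in the tree, as a starting point for writing an XPath
print(xml2::xml_path(html_nodes(toy, "p")))
```

The two selectors agree here because the class attribute holds a single value; with several space-separated classes, the CSS form `div.note` would still match while the exact-match XPath `@class='note'` would not.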