
I am trying to extract some reviews from a secure (HTTPS) web page, as shown below:

# Attempt to extract information from an online secure page
library(rvest)
URL <- "https://www.bankbazaar.com/insurance/religare-health-insurance.html"
mainPage <- read_html(URL)
reviewsHTML <- html_nodes(mainPage, ".ellipsis_text")
reviewsHTML

The code above returns {xml_nodeset (0)}. However, when I first save the page locally (Ctrl+S) as "Religare Health Insurance.html" and then run the same extraction on the saved file, the reviews are found:

# Attempt to extract information from an offline (saved) copy of the page
library(rvest)
URL <- "Religare Health Insurance.html"
mainPage <- read_html(URL)
reviewsHTML <- html_nodes(mainPage, ".ellipsis_text")
reviewsHTML
{xml_nodeset (20)}
[1] <span itemprop="description" class="ellipsis_text">I have taken my health insurance from Religare......

Questions:

  1. Why is the behavior different when I extract the information from the online page and from the saved (offline) copy of the same page?
  2. How can I use R to extract the same information without downloading the page first?
  • That page probably executes JavaScript that modifies the HTML after it loads. Your browser runs that code, so when you save the page you get the modified version. rvest does not run it. You need something like RSelenium to run the JavaScript for you so that you can read the result into R. This is a very common misunderstanding. Just search for "rselenium" and you should find what you need. – MrFlick Sep 02 '16 at 04:05
  • Thanks MrFlick, it was really helpful – vikasnitk85 Sep 03 '16 at 06:16
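
The RSelenium approach suggested in the comment above could look roughly like the sketch below. It is not a tested solution for this particular site: the function names extract_reviews and fetch_rendered_reviews are hypothetical, and the choice of Firefox, port 4545, and a fixed 5-second wait for the page's scripts to finish are all assumptions that would likely need adjusting. The idea is to let a real browser render the page, grab the resulting HTML, and then hand it to rvest exactly as in the offline case.

```r
library(rvest)

# Hypothetical helper: parse an HTML string and return the trimmed text
# of every node matching the CSS selector (same selector as in the question).
extract_reviews <- function(html_source, selector = ".ellipsis_text") {
  page <- read_html(html_source)
  html_text(html_nodes(page, selector), trim = TRUE)
}

# Hypothetical wrapper: render the page in a real browser via RSelenium so
# that its JavaScript runs, then extract the reviews from the rendered HTML.
# Assumptions: Firefox is installed, the port is free, and a fixed wait is
# long enough for the page's scripts to finish.
fetch_rendered_reviews <- function(url, selector = ".ellipsis_text",
                                   wait_seconds = 5, port = 4545L) {
  driver <- RSelenium::rsDriver(browser = "firefox", port = port,
                                verbose = FALSE)
  # Always shut down the browser session and the Selenium server.
  on.exit({
    driver$client$close()
    driver$server$stop()
  }, add = TRUE)

  driver$client$navigate(url)
  Sys.sleep(wait_seconds)                         # crude wait for scripts
  rendered <- driver$client$getPageSource()[[1]]  # HTML after JavaScript ran
  extract_reviews(rendered, selector)
}

# Example call (needs RSelenium, Firefox, and network access):
# reviews <- fetch_rendered_reviews(
#   "https://www.bankbazaar.com/insurance/religare-health-insurance.html")
# head(reviews)
```

Keeping the parsing step in its own function means it can be checked against the saved copy of the page without starting a browser, which makes it easier to tell a wrong selector apart from a page that has not finished rendering.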
