
I am trying to scrape full reviews from this webpage (the full reviews appear after clicking the 'Read More' button). I am doing this with RSelenium. I am able to select and extract text from the first <p> element using the code

reviewNodes <- mybrowser$findElements(using = 'xpath', "//p[@id][1]")

which returns the truncated ("less text") review.

But I am not able to extract the full-text reviews using the code

reviewNodes <- mybrowser$findElements(using = 'xpath', "//p[@id][2]")

or

reviewNodes <- mybrowser$findElements(using = 'xpath', "//p[@itemprop = 'reviewBody']")

Both return blank list elements. I don't know what is wrong. Please help me.

  • What does the first query return? Is it a single node or a collection? I'd expect, based on the page structure, it would retrieve a collection of all `p` elements whose `id` attribute starts with `"lessReviewContent"`, as those are the first `p` children of their parents. Am I right? – CiaPan Apr 01 '16 at 11:17
  • yes... you are right... it retrieves the collection. – Rishabh Soni Apr 02 '16 at 04:26
  • even when I type the XPath query "//p[@id][2]" in the "xpath helper" Chrome extension, it retrieves the intended text. But the same XPath is not working in the code. Can't think of the reason... – Rishabh Soni Apr 02 '16 at 04:29

2 Answers


Drop the double slash and try to use the explicit descendant axis:

/descendant::p[@id][2]

(see the note from the W3C document on XPath that I mentioned in this answer)
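The reason this matters: in `//p[@id][2]`, the `[2]` predicate binds to the `p` step, so it selects every `p` with an `id` that is the *second such `p` among its siblings*; `/descendant::p[@id][2]` instead selects the second matching `p` in document order across the whole page. A minimal sketch of the difference, using Python's `lxml` on hypothetical markup (not the actual review page):

```python
from lxml import html

# Two separate <div>s, each containing exactly one <p id=...>,
# mimicking a page where each review lives in its own container.
doc = html.fromstring("""
<html><body>
  <div><p id="a">first</p></div>
  <div><p id="b">second</p></div>
</body></html>""")

# //p[@id][2]: the [2] applies per parent element, and each <p>
# here is the FIRST p-with-id among its siblings -> nothing matches.
print(doc.xpath("//p[@id][2]"))  # []

# /descendant::p[@id][2]: the [2] applies to the whole node set
# in document order -> the second matching <p> overall.
print([p.get("id") for p in doc.xpath("/descendant::p[@id][2]")])  # ['b']
```

This also explains the comment above: a browser extension may evaluate the expression differently or highlight something that looks right, while a strict XPath 1.0 engine returns an empty set.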

CiaPan

As you're dealing with a list, you should first find the list items, e.g. using the CSS selector

div.srm

Starting from those elements, you can then search inside each list item, e.g. using the CSS selector

p[itemprop='reviewBody']

Of course you can also do it in a single expression, but that is not quite as neat imho:

div.srm p[itemprop='reviewBody']

Or in XPath (which I wouldn't recommend):

//div[@class='srm']//p[@itemprop='reviewBody']

If neither of these works for you, then the problem must be somewhere else.

Kim Homann