
I'm trying to extract forum posts (message2) while getting rid of the blockquote (message1). Here is the HTML (post content modified/simplified):

<div class="cPost_contentWrap ipsPad">
  <div data-controller="core.front.core.lightboxedImages" class="ipsType_normal ipsType_richText ipsContained" itemprop="text" data-role="commentContent">
    <blockquote data-ipsquote-contentclass="forums_Topic" data-ipsquote-contentid="40244" data-ipsquote-contenttype="forums" data-ipsquote-contentapp="forums" data-cite="aries_gurl" data-ipsquote-username="aries_gurl" data-ipsquote-contentcommentid="584324" class="ipsQuote" data-ipsquote="">
      <div>
        (message1)
      </div>
    </blockquote>

    <p>(message2)</p>
  </div>
</div>

I am trying with the following XPath query:

//div[@class="ipsType_normal ipsType_richText ipsContained"]/p[not(@class="ipsQuote")]

For some reason, however, this query returns the matching paragraphs of every post on the page rather than just the one under the current node. So, taking the above as a reference, the returned results would be: message2 message2 message2 message2, and so on (N messages in total).

Is there a way I can get one message at a time? Thank you!

T.Gil

1 Answer


Is there a way I can get one message at a time?

Yes ;) use:

(//div[@class="ipsType_normal ipsType_richText ipsContained"]/p[not(@class="ipsQuote")])[1] 

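for the first one, and [n] with n = 1..x for the others. The parentheses matter: they apply the index to the whole result set instead of to each div's own p children.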
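As a rough illustration, here is a minimal Python sketch of both ways to get one message at a time: picking the n-th match from the whole page, and evaluating a relative path against each post node. It assumes a hypothetical two-post page reduced from the markup in the question, and the names PAGE, CONTENT, nth_message and messages_per_post are made up for this example. It uses the standard-library xml.etree.ElementTree, which supports only a subset of XPath (no not() and no parenthesized (...)[n]), so the index is applied to the Python list instead; the blockquote is excluded simply because only p children are selected.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-post page, reduced from the markup in the question and
# kept well-formed so the standard-library XML parser can read it.
PAGE = """
<root>
  <div class="cPost_contentWrap ipsPad">
    <div class="ipsType_normal ipsType_richText ipsContained">
      <blockquote class="ipsQuote"><div>(message1)</div></blockquote>
      <p>(message2)</p>
    </div>
  </div>
  <div class="cPost_contentWrap ipsPad">
    <div class="ipsType_normal ipsType_richText ipsContained">
      <p>(message3)</p>
    </div>
  </div>
</root>
"""

# Step that matches the post body div by its exact class attribute value.
CONTENT = "div[@class='ipsType_normal ipsType_richText ipsContained']"


def nth_message(page, n):
    """Return the n-th message on the whole page (1-based, like XPath)."""
    root = ET.fromstring(page)
    matches = root.findall(".//" + CONTENT + "/p")
    return matches[n - 1].text


def messages_per_post(page):
    """Return one list of message texts per post wrapper."""
    root = ET.fromstring(page)
    result = []
    for post in root.findall(".//div[@class='cPost_contentWrap ipsPad']"):
        # The leading "." keeps the search inside the current post node.
        paragraphs = post.findall("./" + CONTENT + "/p")
        result.append([p.text for p in paragraphs])
    return result


print(nth_message(PAGE, 1))
print(messages_per_post(PAGE))
```

With a full XPath 1.0 engine the original expressions, including not() and the parenthesized index, can be used directly; the point of the second function is only that a path starting with "." is resolved against the node it is called on rather than against the document root.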

hr_117
  • Thanks a lot! I've tried with .//div[@class="ipsType_normal ipsType_richText ipsContained"]/p[not(@class="ipsQuote")] and it worked! – T.Gil Apr 30 '16 at 00:16