
Here is part of the HTML I want to scrape:

<div id="hello">
  <p>abc</p>
  <center><img src="image_url"></center>
  <p align="center" style="text-align: center;"><b>def</b></p>
  <center><img src="image_url"></center>
  <p align="center" style="text-align: center;"><b>def</b></p>
  <p>abc</p>
  <p align="center" style="text-align: center;"><b>def</b></p>
  <center><img src="image_url"></center>
  <p align="center" style="text-align: center;"><b>def</b></p>
  <p>abc</p>
  <center><img src="image_url"></center>
</div>

I am trying to scrape, in document order, the text of each p and the src of each img (the image_url above). The problem is that the HTML shown above is not fixed: every page has a different structure, so sometimes there are several p tags before a center tag containing an img.

Since the p and center tags are ordered differently on each page, my idea was to select all the p tags with response.css('#hello p') and loop over them to get the text. While handling each p, I would also check whether its next sibling is a center tag and, if so, get the img src and append it.

I found something close to that: p.xpath('following-sibling::center[1]/img/@src').get(), where p is the current paragraph in the loop.

But that does not work. If there are 4 p tags before a center, I get the same img src 4 times, because p.xpath('following-sibling::center[1]/img/@src').get() does not look only at the next sibling; it searches all following siblings and returns the first center it finds.

I tried googling but could not find anything about checking only whether the next sibling is a particular tag. Does anyone know how I can get this to work so that I can save the data in sequence?

Hopefully my explanation makes sense.

Thanks in advance for any help and suggestions

Dora
  • So you want to scrape `center` node only if it is an *immediate sibling of `p`*, right? Does `p.xpath('following-sibling::*[1][name()="center"]/img/@src')` solve your problem? – JaSON Sep 10 '20 at 09:40
  • @JaSON great :D that's truly what I need, can you post it so I can mark it as an answer? But I have another question I forgot to mention which I solved using a loop but wonder if there's another way? In my script above after a `p` there MIGHT be one `center` but what if there's multiple? Is there an easy way to do it instead of looping to check if there's a next image? Right now I am incrementing `center[1]` by one to see if return is None or not – Dora Sep 11 '20 at 00:05
  • I can't provide you with generic solution. For 2 possible `center` nodes in a row it will be `following-sibling::*[(position()=1 and name()="center") or (position()=2 and name()="center" and not(preceding-sibling::*[1][name()="p"]))]/img/@src`. The more nodes there might be - the more complicated and messy XPath it would be :) – JaSON Sep 11 '20 at 08:30
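
For an arbitrary number of `center` nodes in a row, one alternative to a growing XPath is to walk the fragment once and record paragraphs and images as they appear. The sketch below is not the XPath approach from the comments; it swaps in the standard library's `html.parser` and assumes you feed it only the `#hello` fragment (for example the string from `response.css('#hello').get()`). The class name `InOrderCollector` and the sample `FRAGMENT` are made up for illustration.

```python
from html.parser import HTMLParser


class InOrderCollector(HTMLParser):
    """Collect <p> text and <img> src values in document order."""

    def __init__(self):
        super().__init__()
        self.items = []      # list of ("p", text) or ("img", src) tuples
        self._in_p = False   # True while inside an open <p>
        self._text = []      # text chunks of the current <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._text = []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.items.append(("img", src))

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.items.append(("p", "".join(self._text).strip()))
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still reported here.
        if self._in_p:
            self._text.append(data)


# Hypothetical fragment with two <center> nodes in a row.
FRAGMENT = """
<div id="hello">
  <p>abc</p>
  <center><img src="image_1"></center>
  <center><img src="image_2"></center>
  <p align="center"><b>def</b></p>
  <center><img src="image_3"></center>
</div>
"""

collector = InOrderCollector()
collector.feed(FRAGMENT)
print(collector.items)
```

Because every node is visited exactly once, consecutive images need no special handling and no image is reported twice.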

1 Answer


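Try the XPath below. It selects the first following sibling of any type and keeps it only if it is a center, so the src is returned only when the center comes immediately after the p: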

p.xpath('following-sibling::*[1][name()="center"]/img/@src')
JaSON