
I'm trying to access an attribute of a previous sibling, but it's proving difficult.

So basically the web page I'm trying to scrape is TERRIBLE and the anchor tags use crappy onclick instead of href. Stupid, I know. I'm trying to first find the anchor tag containing an onclick with the window.open('servletLinkJunkHere...'), then move to the previous sibling, which is an img tag, and extract the src attribute from it.

<IMG SRC="images/warning.gif" ALT="blah blah blah" STYLE="position:relative;top:2px;cursor:help;">
<a href="#" onclick="javascript:window.open('servletLinkJunkHere...')">

And here's the xpath I'm trying to use:

$url_pre = 'a[onclick*="'servletLinkJunkHere...'"]/preceding-sibling::img/@src'; 

Any ideas on how I can accomplish this? I know it's possible, I'm just not totally proficient in xpath queries. Also, are there any good resources for learning all the nooks and crannies of xpath? Thanks!

EDIT: So this is what I have but it doesn't seem to be returning anything but an empty array.

$url_email = "EditNotificationInfoServlet?cb=on&id=" . $id . "&sessionId=1";

$url_pre = "a[contains(@onclick,'" . $url_email . "')]/preceding-sibling::IMG/@SRC";

$final_text = $crawler->filterXPath($url_pre)->each(function($crawler, $i) {
        return $crawler->text();
});
Phil
Kenny
  • What is the context for `$crawler`? You may need to prefix the XPath expression with `//` – Phil Apr 29 '15 at 03:19
  • This is a function, and I pass it a `$crawler` object. I have many other functions, and they work just fine, so the context should be fine. – Kenny Apr 29 '15 at 03:21
  • Sorry, I meant the document context. Unless the document context for `$crawler->filterXPath` is the immediate parent of your HTML `<a>` element, you won't find it. Using `//a[contains(...` will search the entire document from whatever context it has – Phil Apr 29 '15 at 03:23
  • Hmm, I tried `//a[contains...` as well, and it didn't seem to provide a working solution. Continuing to research. – Kenny Apr 29 '15 at 03:31
  • Maybe just try `//a/preceding-sibling::img` or even `//a` and see what you get back – Phil Apr 29 '15 at 03:36
  • Can't do that, because there are many, many `a` tags throughout the document, I have to find the one that contains a certain string (that is in the onclick), then go to the preceding sibling. (Note: it doesn't seem to be entering the `each(function($crawler, ...` function at all, I tried putting an echo in. – Kenny Apr 29 '15 at 03:38
  • I realise that, I was just suggesting you start simple and work up from there. Make sure you can even find any image preceding an anchor, etc – Phil Apr 29 '15 at 03:39
  • I printed off my combined string and I get `&sessionId=1&displayMode=update¬ificationType=1&` and as you can see, there's some weird symbol. It's supposed to be `update&notification`. I think that might be the problem. – Kenny Apr 29 '15 at 03:46
  • I'd say so. FYI, assuming you're using Symfony's DOMCrawler, it should work fine given the correct values to search for ~ https://eval.in/320191 – Phil Apr 29 '15 at 03:50
  • Yes, Symfony's DOMCrawler. So then do you think it is that weird character messing stuff up? `¬` character? – Kenny Apr 29 '15 at 03:56
  • Definitely. I can't see that in your code though so you must be omitting something (the string in your question code ends at `sessionId=1`). – Phil Apr 29 '15 at 04:02

1 Answer


I think you need to use the following xpath:

a[contains(@onclick,'servletLinkJunkHere...')]/preceding-sibling::IMG/@SRC
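
To make the idea concrete, here is a minimal sketch of the same `contains()` plus `preceding-sibling` pattern. It uses PHP's built-in `DOMDocument` and `DOMXPath` rather than Symfony's DomCrawler so that it runs on its own. The markup, the `id=42` value, and the variable names are placeholders invented for illustration, not values from the real page. Two assumptions are worth checking against your own setup: the expression is prefixed with `//` so it searches the whole document, and the element and attribute names are written in lowercase, because as far as I know libxml's HTML parser lowercases them when loading the page.

```php
<?php
// Hypothetical markup modelled on the snippet in the question.
// The servlet URL, the id value, and the image path are placeholders.
$html = <<<'HTML'
<html><body><div>
<IMG SRC="images/warning.gif" ALT="blah blah blah">
<a href="#" onclick="javascript:window.open('EditNotificationInfoServlet?cb=on&amp;id=42&amp;sessionId=1')">edit</a>
</div></body></html>
HTML;

// Load the HTML; parser warnings are collected internally and then cleared.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// The string to look for inside the onclick attribute (placeholder value).
$needle = 'EditNotificationInfoServlet?cb=on&id=42&sessionId=1';

// Find the anchor whose onclick contains the needle, step back to the
// nearest preceding <img> sibling, and select its src attribute.
$query = "//a[contains(@onclick, '" . $needle . "')]"
       . "/preceding-sibling::img[1]/@src";

// Collect the value of every matching attribute node.
$sources = [];
foreach ($xpath->query($query) as $attr) {
    $sources[] = $attr->nodeValue;
}

print_r($sources);
```

If this returns an empty array on the real page, printing the search string and comparing it character by character with the actual `onclick` value is a reasonable next step, since `contains()` only matches an exact substring.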
Lingamurthy CS
  • I tried your code, no luck. Perhaps I did it wrong. I edited my original question and posted the exact code I have, if maybe that could help fine tune what my mistake is. – Kenny Apr 29 '15 at 03:15