I'm trying to "select" the link from the onclick attribute in the following portion of HTML:

<span onclick="Javascript:document.quickFindForm.action='/blah_blah'" 
 class="specialLinkType"><img src="blah"></span>

but I can't get any further than the following XPath:

//span[@class="specialLinkType"]/@onclick

which only returns:

Javascript:document.quickFindForm.action

Any ideas on how to pick out that link inside of the quickFindForm.action with an XPath?

Stephan
emish

3 Answers

I tried the XPath in a Java application and it worked OK:

    import java.io.IOException;
    import java.io.StringReader;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathFactory;

    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;

    public class Teste {

        public static void main(String[] args) throws Exception {
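            // Parse the snippet as XML (note the self-closing <img/> so it is well-formed).
            Document doc = stringToDom("<span onclick=\"Javascript:document.quickFindForm.action='/blah_blah'\" class=\"specialLinkType\"><img src=\"blah\"/></span>");
            XPath newXPath = XPathFactory.newInstance().newXPath();
            // Select the onclick attribute of the matching span.
            XPathExpression xpathExpr = newXPath.compile("//span[@class=\"specialLinkType\"]/@onclick");
            // evaluate(Object) returns the string value of the first matching node.
            String result = xpathExpr.evaluate(doc);
            System.out.println(result);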

        }

        public static Document stringToDom(String xmlSource) throws SAXException, ParserConfigurationException, IOException {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xmlSource)));
        }
    }

Result:

Javascript:document.quickFindForm.action='/blah_blah'
sother
  • That worked. Don't know what I was doing wrong; perhaps I had set up the page incorrectly. – emish Jul 06 '11 at 09:28

If Scrapy supports XPath string functions, this will work:

substring-before(
   substring-after(
      //span[@class="specialLinkType"]/@onclick,"quickFindForm.action='")
   ,"'")
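
For anyone who wants to check this expression without Scrapy, here is a minimal sketch that runs it through the XPath 1.0 engine shipped with the JDK. It assumes the markup is well-formed XML (so `<img/>` is self-closed), and the class and method names (`ActionViaXPath`, `extractAction`) are made up for illustration.

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

class ActionViaXPath {

    // XPath 1.0 string functions: keep the text after "quickFindForm.action='"
    // and cut it off at the next single quote.
    static final String EXPR =
        "substring-before("
        + "substring-after(//span[@class=\"specialLinkType\"]/@onclick,"
        + " \"quickFindForm.action='\"), \"'\")";

    // Hypothetical helper: evaluates EXPR against an XML string.
    static String extractAction(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(EXPR, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<span onclick=\"Javascript:document.quickFindForm.action='/blah_blah'\""
            + " class=\"specialLinkType\"><img src=\"blah\"/></span>";
        System.out.println(extractAction(xml)); // prints /blah_blah
    }
}
```

If the span is missing, both string functions fall back to the empty string, so the result is `""` rather than an error.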

It looks like it also supports regex. Something like this should work:

.select('//span[@class="specialLinkType"]/@onclick').re(r'quickFindForm\.action=\'(.*?)\'')

Caveat: I can't test the second solution, so you will have to check that \' is the proper escape sequence for single quotes in this case. In a Python raw string the backslash is kept, and the regex engine should read \' as a literal single quote.
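
The regex idea does not depend on Scrapy. As a rough sketch, the same pattern can be applied to the attribute value with `java.util.regex`, once the `onclick` string has been selected. The class and method names (`ActionViaRegex`, `extractAction`) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ActionViaRegex {

    // The dot is escaped so it matches literally; (.*?) lazily captures
    // everything up to the closing single quote.
    static final Pattern ACTION = Pattern.compile("quickFindForm\\.action='(.*?)'");

    // Returns the captured link, or null if the pattern does not match.
    static String extractAction(String onclick) {
        Matcher m = ACTION.matcher(onclick);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String onclick = "Javascript:document.quickFindForm.action='/blah_blah'";
        System.out.println(extractAction(onclick)); // prints /blah_blah
    }
}
```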

cordsen

I used XQuery, but it should be the same in XPath 2.0. I used the function "tokenize", which splits a string on a regular expression (http://www.xqueryfunctions.com/xq/fn_tokenize.html). Note that tokenize is not available in XPath 1.0, so this needs an XPath 2.0 processor. In this case I split the string on the single-quote character (').

        xquery version "1.0";
        let $x := //span[@class="specialLinkType"]/@onclick
        let $c := fn:tokenize( $x, '''' )
        return $c[2]

In XPath, that should be:

        fn:tokenize(//span[@class="specialLinkType"]/@onclick, '''' )[2]
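
Where only an XPath 1.0 engine is available (the one bundled with the JDK, for example), the same split can be done on the attribute value after selecting it. This is a sketch of the equivalent in plain Java, not the XPath function itself; `ActionViaSplit` and `secondToken` are hypothetical names.

```java
class ActionViaSplit {

    // Mirrors fn:tokenize($onclick, '''')[2]: split on the single quote and
    // take the second piece (index 1, since Java arrays are zero-based).
    static String secondToken(String onclick) {
        String[] parts = onclick.split("'");
        return parts.length > 1 ? parts[1] : null;
    }

    public static void main(String[] args) {
        String onclick = "Javascript:document.quickFindForm.action='/blah_blah'";
        System.out.println(secondToken(onclick)); // prints /blah_blah
    }
}
```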
Shilaghae