
I was trying to parse a page (Kaggle Competitions) with xpath on macOS, as described in another SO question:

curl "https://www.kaggle.com/competitions/search?SearchVisibility=AllCompetitions&ShowActive=true&ShowCompleted=true&ShowProspect=true&ShowOpenToAll=true&ShowPrivate=true&ShowLimited=true&DeadlineColumnSort=Descending" -o competitions.html
cat competitions.html | xpath '//*[@id="competitions-table"]/tbody/tr[205]/td[1]/div/a/@href'

That's just getting a href of a link in a table.

But instead of returning the value, xpath starts validating the .html file and returns errors like `undefined entity at line 89, column 13, byte 2964`.

Since `man xpath` doesn't exist and `xpath --help` prints nothing, I'm stuck. Also, many similar solutions relate to the xpath tool from GNU distributions, not the one on macOS.

Is there a correct way of getting HTML elements via XPath in bash?

Anton Tarasenko
  • I checked the HTML source at `Kaggle` and it is not well-formed XML, therefore XPath will probably fail. The source is HTML and _not_ XHTML. You would have to remove 'incomplete' tags like `<br>` (luckily only a few of them in that source) before processing the source as XML using XPath.
    – zx485 May 06 '16 at 13:06
  • @zx485 Is there any way to ignore errors? I did `xml sel --html -T -t -v` on the same source and it returns like 20 errors. Would `scrapy` or `lxml` do better? – Anton Tarasenko May 06 '16 at 13:36
  • Not using XML! Line 86 of the source file contains a very malformed tag. I'm not sure what creates this, but it is unparsable by an XML parser. One way would be to repair the broken parts of this HTML source with `sed`. For example, you could replace the incomplete `<br>` tags in `competitions.html` with `sed -e "s,<br>,<br/>,g" competitions.html > competitions2.html`. Then repeat that for the other "errors". After you have finished that, you can process the resulting file with an XML parser and XPath.
    – zx485 May 06 '16 at 13:59
  • 1
    The central problem with processing HTML as XML are the incomplete tags. I'm not amused that this has _not been fixed_ by standard, that means: making XHTML the default. – zx485 May 06 '16 at 14:02
  • 1
    Also notice that thee is not tbody in the file. This is added by the browser to create valid html – hr_117 May 06 '16 at 14:36
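The cleanup suggested in the comments can be sketched as a small pipeline. This is a guess at the concrete commands: `<br>` is only an assumed example of an offending tag, and the real file must be inspected for the actual parse errors.

```shell
# Sketch of the comment-thread fix (assumptions: the offenders are
# unclosed <br> tags; run in the directory holding competitions.html).
if [ -f competitions.html ]; then
  # Rewrite unclosed <br> tags as self-closing <br/> so an XML parser
  # will accept the file.
  sed -e 's,<br>,<br/>,g' competitions.html > competitions2.html

  # Query the cleaned file; note the XPath has no tbody step, since
  # tbody is added by the browser and is absent from the raw source.
  cat competitions2.html | xpath '//*[@id="competitions-table"]/tr[205]/td[1]/div/a/@href'
fi
```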

1 Answer


Getting HTML elements via XPath in bash

from an HTML file (which is not valid XML):

One possibility is to use xsltproc (it should be available on macOS). xsltproc has an --html option to accept HTML as input. But with that you need an XSLT stylesheet:

<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" /> 

  <xsl:template match="/*">
    <xsl:value-of  select="//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" />
  </xsl:template>

</xsl:stylesheet>

Notice that the XPath has changed: there is no tbody in the input file. Call xsltproc:

xsltproc --html  test.xsl competitions.html 2> /dev/null

xsltproc's complaints about errors in the HTML are ignored (stderr is sent to /dev/null).

The output is: /c/R

To use a different XPath expression from the command line, you may use an XSLT template and replace a `__xpath__` placeholder.

E.g. xslt template:

<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text" /> 
  <xsl:template match="/*">
    <xsl:value-of  select="__xpath__" />
  </xsl:template>
</xsl:stylesheet>

And use (e.g.) sed for the replacement:

 sed -e "s,__xpath__,//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href," test.xslt.tmpl > test.xsl
 xsltproc --html  test.xsl competitions.html 2> /dev/null
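The template-plus-sed steps can be bundled into a small shell function. This is only a sketch: `htxpath` is a hypothetical helper name, and the sed-based substitution assumes the expression contains no comma or `&` (those would need a different delimiter or escaping).

```shell
# Hypothetical helper: htxpath '<xpath>' <file.html> prints the value
# of the expression, evaluated against HTML via xsltproc --html.
htxpath() {
  xpath_expr=$1
  html_file=$2
  tmpxsl=$(mktemp) || return 1
  # Substitute the placeholder into the template shown above.
  # (Assumes $xpath_expr contains no ',' or '&'.)
  sed -e "s,__xpath__,$xpath_expr," > "$tmpxsl" <<'EOF'
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/*">
    <xsl:value-of select="__xpath__"/>
  </xsl:template>
</xsl:stylesheet>
EOF
  xsltproc --html "$tmpxsl" "$html_file" 2>/dev/null
  rm -f "$tmpxsl"
}
```

Usage then becomes a one-liner:

 htxpath "//*[@id='competitions-table']/tr[205]/td[1]/div/a/@href" competitions.html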
hr_117