I was trying to parse a page (Kaggle Competitions) with `xpath` on macOS, as described in another SO question:
curl "https://www.kaggle.com/competitions/search?SearchVisibility=AllCompetitions&ShowActive=true&ShowCompleted=true&ShowProspect=true&ShowOpenToAll=true&ShowPrivate=true&ShowLimited=true&DeadlineColumnSort=Descending" -o competitions.html
cat competitions.html | xpath '//*[@id="competitions-table"]/tbody/tr[205]/td[1]/div/a/@href'
That's just getting the `href` of a link in a table. But instead of returning the value, `xpath` starts validating the .html and returns errors like `undefined entity at line 89, column 13, byte 2964`.
Since `man xpath` doesn't exist and `xpath --help` prints nothing, I'm stuck. Also, many similar solutions relate to the `xpath` from GNU distributions, not the one shipped with macOS.
Is there a correct way of getting HTML elements via XPath in bash?
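One common workaround (my suggestion, not from the question itself) is `xmllint`, which ships with macOS as part of libxml2. Its `--html` flag switches to a lenient HTML parser that tolerates undefined entities, so the strict-XML errors go away. A minimal sketch against a small local file (`table.html` and the shortened XPath stand in for `competitions.html` and the question's longer expression):

```shell
# Build a tiny stand-in for competitions.html with the same shape.
cat > table.html <<'EOF'
<table id="competitions-table">
  <tbody><tr><td><div><a href="/c/titanic">Titanic</a></div></td></tr></tbody>
</table>
EOF

# --html uses libxml2's forgiving HTML parser instead of strict XML;
# string(...) unwraps the attribute so only its value is printed.
xmllint --html --xpath 'string(//*[@id="competitions-table"]//tr[1]/td[1]/div/a/@href)' table.html
```

The same two flags applied to the downloaded `competitions.html` should print the `href` instead of entity errors.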
You could replace the undefined entities (luckily only a few of them in that source) before processing the source as XML using XPath. – zx485 May 06 '16 at 13:06
For example, you can replace the offending tags in `competitions.html` with `sed -e "s/…/…/g" competitions.html > competitions2.html`. Then repeat that for the other "errors". After you have finished, you can process the resulting file with an XML parser and XPath.
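The comment's clean-up-then-parse approach can be sketched like this; the entity `&nbsp;` is my assumption for illustration, since the real culprit depends on what the parser reports at that line and column:

```shell
# Stand-in for competitions.html: contains an entity that strict
# XML parsing rejects as "undefined entity".
printf '<p>a&nbsp;b</p>\n' > competitions.html

# Replace the undefined entity with its numeric character reference,
# which any XML parser understands (note: & must be escaped as \&
# in a sed replacement, where a bare & means "the whole match").
sed -e 's/&nbsp;/\&#160;/g' competitions.html > competitions2.html

# The cleaned file now parses as XML, so plain XPath works on it.
xmllint --xpath 'string(//p)' competitions2.html
```

Repeat the `sed` substitution for each entity the parser complains about, then run the real XPath query against the cleaned file.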