
I am trying to extract the link (href) and text inside the <a> tag for a number of links in an html page.

I only want specific links, which I match by a substring.

Example of my html:

<a href="/this/dir/1234/">This should be 1234</a> some other html
<a href="/this/dir/1236/">This should be 1236</a> some other html
<a href="/about_us/">Not important link</a> some other html

I am using Xidel, which lets me avoid regular expressions. It seems to be the simplest tool for the job.

What I have so far:

xidel -e "//a/(@href[contains(.,'/this/dir')],text())"

It basically works, but two issues remain:

  • The href and the text are separated by a linefeed. I would like to have them on the same line.
  • Every link text is returned, so I get the text "Not important link" as well.

What is the recommended way to get output like

/this/dir/1234  ; This should be 1234
/this/dir/1236  ; This should be 1236

Appreciate any feedback / tips.

edit:

The solution provided by Martin was 99% there. Newlines were not output, so I am using awk to replace a dummy text with newlines.

Note: I am on Windows.

xidel myhtml.htm -e "string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), 'XXX')" | awk -F "XXX" "{$1=$1}1" "OFS=\n" 
Mr Lister
MyICQ

1 Answer


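You can move the condition into a predicate, e.g. //a[contains(@href, '/this/dir')]!(@href, string()), so that only the matching links are selected and the text of the other links is no longer returned. As for the result format, what happens if you delegate it all to XQuery with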

string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), '&#10;')
Martin Honnen
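
For readers who want to check the expected result without Xidel, the same filter-then-join logic can be sketched in Python using only the standard library. This is an illustration of the idea (keep only the links whose href contains the substring, then join href and text with ' ; ', one pair per line), not what Xidel does internally. The names `LinkCollector`, `format_links` and `SAMPLE` are hypothetical and not part of the thread.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for <a> elements whose href contains a substring."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the matching <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Same role as the predicate [contains(@href, '/this/dir')]
            if self.needle in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def format_links(html, needle="/this/dir"):
    """Return one 'href ; text' line per matching link, joined with newlines."""
    collector = LinkCollector(needle)
    collector.feed(html)
    collector.close()
    return "\n".join(f"{href} ; {text}" for href, text in collector.links)


# Sample markup copied from the question.
SAMPLE = '''<a href="/this/dir/1234/">This should be 1234</a> some other html
<a href="/this/dir/1236/">This should be 1236</a> some other html
<a href="/about_us/">Not important link</a> some other html
'''

print(format_links(SAMPLE))
```

Because the filter is applied to the element before anything is collected, the text of the non-matching link never reaches the output, which is the same effect as moving the condition into the predicate.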
  • **Thank you Martin!** That was 99% correct. See my edit to original question. I did not know about the predicate. – MyICQ Mar 07 '19 at 14:22
  • The use of `'&#10;'` is XQuery syntax, so if Xidel has any options to make sure the expression you pass in is evaluated as XQuery and not plain XPath, then try that. Or use `codepoints-to-string(10)` instead, e.g. `string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), codepoints-to-string(10))`; that should go through as XPath. – Martin Honnen Mar 07 '19 at 14:50
  • the codepoints-to-string(10) worked. You are brilliant. Thank you ! – MyICQ Mar 07 '19 at 15:05
  • @MartinHonnen, by putting the entire query inside `string-join()` you can expect the entire output to be on a single line. MyICQ likes to have every @href on a separate line, so instead `//a[contains(@href,'/this/dir')]/join((@href,.),' ; ')`, or `//a[contains(@href,'/this/dir')]/concat(@href,' ; ',.)` would be better. – Reino Mar 08 '19 at 14:29
  • @Reino, can you cite anything from the XQuery spec or XQuery functions spec that supports your claim that using `string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), '&#10;')` as I have done puts the entire output on a single line? Not sure where your expectations come from, I certainly don't share them. And I don't see why `//a[contains(@href,'/this/dir')]/concat(@href,' ; ',.)` ensures output on separate lines; you construct a sequence of strings without defining any separator between them. – Martin Honnen Mar 08 '19 at 15:12
  • I literally meant `string-join(` `[...]` `)`, so without the optional separator. String-joining and then using a newline separator, while `//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .)` already gives the desired result, isn't very efficient if you ask me. – Reino Mar 09 '19 at 01:27
  • is this possible with `--css` to show multiple values? – chovy Dec 29 '20 at 08:24
  • @chovy, try to raise that as a new question on your own. – Martin Honnen Dec 29 '20 at 08:42