How to get text from anchor tag in an HTML response using bash script

Question

My HTML content is as follows:

<html>
<head><title>Index </title></head>
<body bgcolor="white">
<h1>Index of /Test/</h1><hr><pre><a href="../">../</a>
<a href="1.0/">1.0/</a>                                              17-Mar-2018 17:36                   -
<a href="1.1/">1.1/</a>                                              19-Jun-2018 19:22                   -
<a href="1.2/">1.2/</a>                                              22-Sep-2018 00:18                   -
<a href="documents/">documents/</a>                                             25-Apr-2018 23:40                   -
<a href="samples">samples</a>                                            03-Sep-2018 16:00              403699
</pre><hr></body>
</html>

I get the above HTML output by making a request to the server.

From the HTML output, I want my final output to be as follows:

1.0
1.1
1.2
documents
samples

How can I get the above output using a bash script?

This is closely related to: https://stackoverflow.com/questions/21264626/ — kvantour, Sep 05 '18 at 09:48
Yeah that's right. But I want to get anchor tag text value not href values even though both are same in my case. — Galet, Sep 05 '18 at 09:50
@karan essentially copy pasted my answer there with a minor update to retrieve your requested values. But be aware that your output is missing the first anchor and also the `href` attribute-value and anchor value are always the same. — kvantour, Sep 05 '18 at 10:00
@kvantour https://chat.stackoverflow.com/transcript/90230?m=43837669#43837669 — Benjamin W., Sep 06 '18 at 13:29

kvantour · Accepted Answer · 2018-09-05T10:01:12.887

Parsing HTML or XML files with regular expressions is generally discouraged. Tools such as sed and awk are extremely powerful for handling text files, but when it comes to parsing complex structured data (such as XML, HTML, JSON, ...) they are nothing more than a sledgehammer. Yes, you can get the job done, but sometimes at a tremendous cost. Handling such delicate files takes more finesse and a more targeted set of tools.

For parsing XML or HTML, one can use xmlstarlet.

For an XHTML file, you can use:

xmlstarlet sel --html  -N "x=http://www.w3.org/1999/xhtml" \
               -t -m '//x:a' -v . -n

where -N gives the XHTML namespace, if any. The namespace is declared on the root element, like this:

<html xmlns="http://www.w3.org/1999/xhtml">

However, as HTML pages are often not well-formed XML, it can be handy to clean them up first using tidy. For the example above, this gives:

$ tidy -q -numeric -asxhtml --show-warnings no file.html \
  | xmlstarlet sel --html -N "x=http://www.w3.org/1999/xhtml" \
                   -t -m '//x:a' -v . -n
../
1.0/
1.1/
1.2/
documents/
samples

Same... it's usually accompanied by a comment on the question by a certain someone, though. — Benjamin W., Sep 05 '18 at 14:24

score 0 · Answer 2 · answered Sep 05 '18 at 13:42

Using the HTML-XML-utils from https://www.w3.org/Tools/HTML-XML-utils:

$ hxnormalize -x infile.html | hxselect -c -s '\n' a
../
1.0/
1.1/
1.2/
documents/
samples

The hxnormalize step is required because of the rogue <hr> tag (hxselect requires well-formed input); the -x option stands for "use XML conventions".

The hxselect a step extracts all anchor elements; the -c option prints content only, and -s '\n' separates the results with a newline.

If you really don't want the trailing /, you can pipe to tr -d '/'.
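
Note that tr -d '/' deletes every slash, so the ../ entry becomes .. instead of disappearing. If the goal is exactly the list from the question, a minimal sketch of an alternative cleanup step could look like the following. The function name clean_listing is made up for illustration, and the sketch assumes one anchor text per line on stdin, as produced by the hxselect command above:

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of html-xml-utils): reads one anchor
# text per line on stdin, drops the parent-directory entry "../" and
# strips a single trailing slash from every remaining line.
clean_listing() {
    grep -vx '\.\./' | sed 's:/$::'
}

# Sample input mirroring the hxselect output shown above.
printf '%s\n' '../' '1.0/' '1.1/' '1.2/' 'documents/' 'samples' | clean_listing
```

In the full pipeline this would be appended as the last stage: hxnormalize -x infile.html | hxselect -c -s '\n' a | clean_listing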

Benjamin W.

Nice solution. I was not aware of `hxnormalize` and the entire toolkit – kvantour Sep 05 '18 at 14:24
@kvantour The html-xml-utils are neat, but I don't think anybody still maintains them. Edit: oh, never mind, changelog has changes as recent as a few months ago. – Benjamin W. Sep 05 '18 at 14:25
