My HTML content is as follows:

<html>
<head><title>Index </title></head>
<body bgcolor="white">
<h1>Index of /Test/</h1><hr><pre><a href="../">../</a>
<a href="1.0/">1.0/</a>                                              17-Mar-2018 17:36                   -
<a href="1.1/">1.1/</a>                                              19-Jun-2018 19:22                   -
<a href="1.2/">1.2/</a>                                              22-Sep-2018 00:18                   -
<a href="documents/">documents/</a>                                             25-Apr-2018 23:40                   -
<a href="samples">samples</a>                                            03-Sep-2018 16:00              403699
</pre><hr></body>
</html>

I get the above HTML output by making a request to the server.

From the HTML output, I want my final output to be as follows:

1.0
1.1
1.2
documents
samples

How can I get the above output using a bash script?

Benjamin W.
Galet

2 Answers

Parsing HTML or XML files with regular expressions is generally discouraged. Tools such as sed and awk are extremely powerful for handling text files, but when it comes to parsing complex structured data such as XML, HTML, or JSON, they are nothing more than a sledgehammer. Yes, you can get the job done, but sometimes at a tremendous cost. Handling such delicate files takes more finesse and a more targeted set of tools.

For parsing XML or HTML, one can use xmlstarlet.

For an XHTML file, you can use:

xmlstarlet sel --html  -N "x=http://www.w3.org/1999/xhtml" \
               -t -m '//x:a' -v . -n

Here -N declares the XHTML namespace, if the document has one. You can recognize it by the root element:

<html xmlns="http://www.w3.org/1999/xhtml">

However, since HTML pages are often not well-formed XML, it can be handy to clean them up first using tidy. For the example above this gives:

$ tidy -q -numeric -asxhtml --show-warnings no file.html \
  | xmlstarlet sel --html -N "x=http://www.w3.org/1999/xhtml" \
                   -t -m '//x:a' -v . -n
../
1.0/
1.1/
1.2/
documents/
samples
kvantour

Using the HTML-XML-utils from https://www.w3.org/Tools/HTML-XML-utils:

$ hxnormalize -x infile.html | hxselect -c -s '\n' a
../
1.0/
1.1/
1.2/
documents/
samples

The hxnormalize step is required because of the rogue <hr> tag (hxselect requires well-formed input); the -x option stands for "use XML conventions".

The hxselect a step extracts all anchor elements; the -c option prints content only, and -s '\n' separates the results with a newline.

If you really don't want the trailing /, you can pipe to tr -d '/'.
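
Note that tr -d '/' removes every slash, so the parent link ../ would still show up as .. in the result. A minimal sketch of a stricter clean-up, assuming one link text per line on stdin (the function name clean_names is hypothetical, chosen just for illustration):

```shell
# Hypothetical helper: reads one link text per line on stdin and
# prints the bare entry names.
clean_names() {
  # First expression drops the parent-directory link "../";
  # second expression strips a single trailing slash, if present.
  sed -e '/^\.\.\/$/d' -e 's|/$||'
}

# Sample input: the link texts from the output shown above.
printf '%s\n' '../' '1.0/' '1.1/' '1.2/' 'documents/' 'samples' | clean_names
```

In practice this would go at the end of either pipeline, for example hxnormalize -x infile.html | hxselect -c -s '\n' a | clean_names.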

Benjamin W.
  • Nice solution. I was not aware of `hxnormalize` and the entire toolkit – kvantour Sep 05 '18 at 14:24
  • @kvantour The html-xml-utils are neat, but I don't think anybody still maintains them. Edit: oh, never mind, changelog has changes as recent as a few months ago. – Benjamin W. Sep 05 '18 at 14:25