I have a lot of HTML files with varied content, and I always extract a specific part of each with a command-line tool called pup. The extract sometimes contains tags that can look like this:

<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

... or like this:

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>

... or even like this:

<a class="someclasses"
    href="mailto:this.is.an@email.com" js-class>
    email

</a>

What I'm trying to do is to ...

  1. ... extract href value and anchor-text (the text between <a ...> and </a>).
  2. ... put both extracts on separate lines, but in reverse order: first the text, then the href value.
  3. ... put three characters in front of every href value: =>

So the result looks for example like this:

Visit Duck Duck Go!
=> https://www.duckduckgo.com

I'm able to get what I want with some chained sed commands and some regex, by creating groups/patterns and switching the order in which they are printed, as long as everything is on one line, as in the first example. But I have no clue how to get what I want if the anchor tag is spread over several lines. I tried to achieve my goal with sed alone, but I had no luck. Yesterday I read about similar problems other people had, and that sed is not meant to work across line breaks. Is this true? Could awk do this? Are there any other tools I could use?

osz
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Oct 13 '21 at 18:31
  • I have a feeling pup should be able to do this, and if not, you can always convert to JSON using pup, and then use something like jq to robustly extract it. – Benjamin W. Oct 13 '21 at 18:41

5 Answers

It could be done by parsing HTML fragments with xmllint and XPath expressions:

frag=$(cat <<EOF
<div>
<a class="someclasses"
    href="mailto:this.is.an@email.com" js-class>
    email

</a>
<a class="someclasses"
    href="http://example.com">
    URL

</a>
<a class="someclasses"
    href="http://example.com/2">
    URL 2

</a>
</div>
EOF
)


while read -r line; do
    # %% strips the longest "=..." suffix, so URLs that contain "=" are still recognized
    if [ "${line%%=*}" == 'href' ]; then
        url=$(tr -d '"' <<<"${line#*=}")
    elif [ -n "$line" ]; then
       echo "$line"
       echo "=> $url"
    fi
done < <(echo "$frag" | xmllint --recover --html --xpath "//a/text()| //a/@href" -)

Result:

email
=> mailto:this.is.an@email.com
URL
=> http://example.com
URL 2
=> http://example.com/2

xmllint can also parse HTML files directly.
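As a rough sketch of reading a file directly, the function below queries each anchor by position instead of parsing the mixed text/attribute output. The name print_links and the file argument are hypothetical, and it assumes an xmllint build that supports --xpath. It starts xmllint several times per link, so it trades speed for simplicity.

```shell
# Hypothetical helper: for every <a> in an HTML file, print its text,
# then "=> " followed by its href value.
print_links() {
    file=$1
    # count(//a) evaluates to a number, which xmllint prints on one line.
    n=$(xmllint --html --xpath 'count(//a)' "$file" 2>/dev/null) || return 1
    i=1
    while [ "$i" -le "$n" ]; do
        # normalize-space() trims the text and collapses inner line breaks.
        text=$(xmllint --html --xpath "normalize-space((//a)[$i])" "$file" 2>/dev/null)
        href=$(xmllint --html --xpath "string((//a)[$i]/@href)" "$file" 2>/dev/null)
        printf '%s\n=> %s\n' "$text" "$href"
        i=$((i + 1))
    done
}
```

It would be called as print_links followed by the path of the HTML file.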

LMC

You can try this bash script, though it may not be as efficient as the tools mentioned in the comments.

$ cat input_file
<a class="someclasses"
    href="mailto:this.is.an@email.com" js-class>
    email

</a>

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>

<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

#!/usr/bin/env bash

IFS=$'\n'
i=0
count=$(( $(sed -En 's/<.*>(.*)<.*>/\1/g;/<|>/!p' input_file | sed '/^$/d' | wc -l) - 1 ))
while [[ "$i" -le "$count" ]];
    do for f in input_file; do
        first=($(sed -En 's/<.*>(.*)<.*>/\1/g;/<|>/!p' "$f" | sed '/^$/d'))
        second=($(sed -En 's|.*href="(.[^ ]*)".*|\1|p;' "$f"))
        echo "${first[$i]}" $'\n' " => ${second[$i]}"
        ((i++))
    done
done

Output

email
  => mailto:this.is.an@email.com
Visit Duck Duck Go!
  => https://www.duckduckgo.com
anchor text
  => https://www.stackoverflow.com
HatLess

I'm able to get what I want with some chained sed commands and some regex, by creating groups/patterns and switching the order in which they are printed, as long as everything is on one line, as in the first example. But I have no clue how to get what I want if the anchor tag is spread over several lines.

If you need to keep what you already have, consider simply removing the newlines before the actual processing, for example with tr (translate or delete characters).
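A minimal sketch of that idea, assuming GNU sed (for \n in the replacement) and anchors that contain only plain text: tr joins all lines, the first sed puts each anchor on its own line again, and the second sed swaps text and href. The function name links_from_stdin is made up for illustration.

```shell
# Hypothetical helper: reads HTML on stdin, prints the link text,
# then "=> " followed by the href value, for every anchor found.
links_from_stdin() {
    tr '\n' ' ' |                    # join everything into one line
    sed 's|</a>|</a>\n|g' |          # one anchor per line (GNU sed)
    sed -En 's|.*<a[[:space:]][^>]*href="([^"]*)"[^>]*>[[:space:]]*([^<]*[^<[:space:]])[[:space:]]*</a>.*|\2\n=> \1|p'
}
```

For example, pup's output could be piped into links_from_stdin. Anchors with nested tags or without an href are skipped silently, which is one more reason to prefer a real parser for anything beyond simple fragments.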

Daweo

Using GNU awk for multi-char RS, the 3rd arg to match() and \s/\S shorthand:

$ cat tst.awk
BEGIN { RS="</a>" }
match($0,/<a[^>]+href="([^"]+).*>\s*(\S.*\S)/,a) {
    print a[2] ORS "=> " a[1]
}

e.g. given this input file:

$ cat file
The extract contains sometimes tags which can look like this:

<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

... or like this:

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>

... or even like this:

<a class="someclasses"
    href="mailto:this.is.an@email.com" js-class>
    email

</a>

$ awk -f tst.awk file
anchor text
=> https://www.stackoverflow.com
Visit Duck Duck Go!
=> https://www.duckduckgo.com
email
=> mailto:this.is.an@email.com
Ed Morton

I will suppose that the output of pup is well-formed XML, like this:

<root>
<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>

<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class="x">
Visit Duck Duck Go!
</a>

<a class="someclasses"
    href="mailto:this.is.an@email.com" js-class="x">
  email
</a>
</root>

This means that you need a root element, like the root tag in this case, and that every attribute has a value, which is the reason why I changed js-class into js-class="x".

The xmlstarlet command extracting what you want is:

xmlstarlet sel -t -m "//a" -v "normalize-space()" -n -o "== " -v "@href" -n input.xml

The output corresponding to the input above is:

anchor text
== https://www.stackoverflow.com
Visit Duck Duck Go!
== https://www.duckduckgo.com
email
== mailto:this.is.an@email.com

Since xmlstarlet is unable to output a >, as far as I know, you may want to correct the == string into => by adding a filter at the end of the command, as in:

xmlstarlet sel -t -m "//a" -v "normalize-space()" -n -o "== " -v "@href" -n input.xml | 
sed 's/==/=>/'

giving as final result:

anchor text
=> https://www.stackoverflow.com
Visit Duck Duck Go!
=> https://www.duckduckgo.com
email
=> mailto:this.is.an@email.com

But again and again: don't use regex and sed to process HTML files.

Pierre François