I have a lot of HTML files with very different content, and I always extract one specific part of each with a command-line tool called pup. The extract sometimes contains anchor tags, which can look like this:
<a href="https://www.stackoverflow.com" class="someclasses">anchor text</a>
... or like this:
<a class="someclasses" href="https://www.duckduckgo.com" target="_blank" js-class>
Visit Duck Duck Go!
</a>
... or even like this:
<a class="someclasses"
href="mailto:this.is.an@email.com" js-class>
email
</a>
What I'm trying to do is to ...
- ... extract the href value and the anchor text (the text between <a ...> and </a>).
- ... put both extracts on separate lines, but in reverse order: first the text, then the href value.
- ... put three characters in front of every href value: "=> " (equals sign, greater-than sign, space).
The result should look like this, for example:
Visit Duck Duck Go!
=> https://www.duckduckgo.com
I can get what I want with some chained sed commands and some regular expressions, by creating capture groups and switching the order in which they are printed, as long as everything is on one line, as in the first example. But I have no clue how to get there if the anchor tag is spread over several lines. I tried to achieve my goal with sed alone, but had no luck. Yesterday I read about similar problems other people had, and that sed is not meant to work across line breaks. Is this true? Could awk do this? Are there any other tools I could use?