
Input file (test):

123456<a id="id1" name="name1" href="link1">This is link1</a>789<a id="id2"
href="link2">This is link2</a>0123

Desired output:

link1
link2

What I have done:

$ sed -e '/<a/{:begin;/<\/a>/!{N;b begin};s/<a\([^<]*\)<\/a>/QQ/;/<a/b begin}' test
123456QQ789QQ0123

Question: How do you print the regex groups in sed (multiline)?

ekad
hahakubile

1 Answer


If you use sed like this:

sed -e '/<a/{:begin;/<\/a>/!{N;b begin};s/<a\([^<]*\)<\/a>/\n/;/<a/b begin}'

then it will print in different lines:

123456
789
0123

But is this what you are trying to print, or do you want to print the text in the hrefs?

Update 1: To get the href values from well-formed <a ...> ... </a> elements (this relies on GNU sed extensions: -r, \s, and the i flag):

sed -r '$!N; s~\n~~; s~(<a )~\n\1~ig; s~[^<]*<a[^>]*href\s*=\s*"([^"]*)"[^\n]*~\1\n~ig' test

output

link1
link2
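
If the aim is to print only the captured group rather than rewrite the whole line, a common sed idiom is to combine -n with the p flag on s///. The following is a minimal sketch, not a general HTML parser. It assumes GNU sed (for \n in the replacement text), double-quoted href values, and at most one href per <a tag. The function name hrefs_sed is a hypothetical helper introduced only for this illustration.

```shell
# Hypothetical helper: print the href value of every <a ...> tag read from stdin.
hrefs_sed() {
    # 1. Join all lines so an attribute split across lines sits on one line.
    # 2. Start a new line before every "<a " (GNU sed: \n in the replacement).
    # 3. With -n, print only lines where the substitution succeeded,
    #    replacing the whole line by the captured group \1.
    tr '\n' ' ' |
        sed 's/<a /\n&/g' |
        sed -n 's/^<a [^>]*href *= *"\([^"]*\)".*/\1/p'
}
```

Usage: hrefs_sed < test

Lines that do not contain a matching tag are never printed, so nothing but the captured text reaches the output.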

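Update 2: Getting the same output using bash's regex matching ([[ =~ ]] and BASH_REMATCH)

# Capture group 1 is the text between the quotes of href="..."
regex='href="([^"]*)"'
# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line; do
   [[ $line =~ $regex ]] || continue
   # BASH_REMATCH[1] holds the first captured group (first href on the line only)
   printf '%s\n' "${BASH_REMATCH[1]}"
done < test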

output

link1
link2
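
The loop above tests one line at a time, so it reports at most one href per line and cannot see an attribute whose name and value are split across lines. A possible variation, sketched here under the assumptions that the input fits in memory and that href values are double-quoted, reads the whole input into one variable and consumes it match by match. The function name hrefs_bash is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every href value found on stdin,
# including values separated from "href=" by a line break.
hrefs_bash() {
    local text regex='href[[:space:]]*=[[:space:]]*"([^"]*)"'
    text=$(cat)                      # slurp all of stdin
    while [[ $text =~ $regex ]]; do
        printf '%s\n' "${BASH_REMATCH[1]}"
        # Drop everything up to and including the match just reported,
        # so the next iteration finds the following href.
        text=${text#*"${BASH_REMATCH[0]}"}
    done
}
```

Usage: hrefs_bash < test

This is still pattern matching rather than HTML parsing, so it will also report href="..." text that appears outside of a tag.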
anubhava
  • @anubhava thx, I want to print the **text in href** out of link. – hahakubile May 12 '11 at 03:46
  • @anubhava still in trouble, there is a "\n" before link2 href in the test. Have you missed the \n? thanks very much – hahakubile May 12 '11 at 05:07
  • thanks, @anubhava It worked at last. But the way still uses \1 to replace the regex matching text. Is there a way just to print the groups? – hahakubile May 12 '11 at 06:25
  • Please check my update2 section above for printing just the captured groups using BASH. However I think in sed there is no straightforward way of doing exactly that. – anubhava May 12 '11 at 12:28
  • Yes, update2 can handle the example line by line, but if href=\n"link2", there will be a problem. So Bash and sed are not good at parsing HTML. – hahakubile May 13 '11 at 07:47