1

How can I (e)grep all content between a certain tag block?

Given the input file shown further down, I want to output everything between the B tags, like so:

<B><C>Test</C></B>
<B>Test2</B>

I tried the following grep command to search all XML files for the content between the <B> and </B> tags:

grep '<B>.*</B>' *.xml

but it did not work.

The input file looks like this:

<A>
 <B>
  <C>Test</C>
 </B>
 <D>
 </D>
 <B>
    Test2
 </B>
</A>

Any ideas?

Jeremy Stein
robert
  • http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – David Brabant May 23 '12 at 06:43
  • Regular expressions (and particularly the wildcards) only match on a single line. Why not just search for <B> and then search for </B>? But you probably want to handle nested tags, too. – PauliL May 23 '12 at 06:52
  • possible duplicate of [How can I search for a multiline pattern in a file ? Use pcregrep](http://stackoverflow.com/questions/152708/how-can-i-search-for-a-multiline-pattern-in-a-file-use-pcregrep) – Jeremy Stein May 23 '12 at 15:14
  • @PauliL: Wildcards aren't the problem, it's grep itself that confines each match to a single line. – Alan Moore May 23 '12 at 16:17

2 Answers

3

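Use awk with a range pattern, which prints every line from one that matches <B> through the next one that matches </B>: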

awk '/<B>/,/<\/B>/'
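As a rough sketch of how this behaves on the input from the question (the file name sample.xml and the temporary directory are assumptions made only for this illustration), the following compares the original grep attempt with the awk range pattern:

```shell
# Hypothetical sample file reproducing the input from the question.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.xml" <<'EOF'
<A>
 <B>
  <C>Test</C>
 </B>
 <D>
 </D>
 <B>
    Test2
 </B>
</A>
EOF

# grep works line by line, so the pattern only matches when <B> and </B>
# sit on the same line. No such line exists here, so the count is 0.
# "|| true" absorbs grep's non-zero exit status for "no match".
grep_count=$(grep -c '<B>.*</B>' "$tmpdir/sample.xml" || true)
echo "$grep_count"   # prints 0

# Range pattern: print each line from one matching /<B>/ through the
# next line matching /<\/B>/, both ends included.
awk_out=$(awk '/<B>/,/<\/B>/' "$tmpdir/sample.xml")
printf '%s\n' "$awk_out"
```

The awk output keeps the original line breaks and indentation, and the <D> block is skipped. Note that this is still a line-based match, so it would not cope well with nested <B> elements or several tags sharing one line.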
Jeremy Stein
0

When working with XML files, it is best to use XML-aware tools rather than line-based text tools.

XMLStarlet:

xmlstarlet sel -t -c '//B' file.xml

xmllint from libxml2:

xmllint --xpath '//B' file.xml
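To apply this to every XML file, as the *.xml glob in the question does, a loop along these lines might work. This is only a sketch: the sample file and directory are made up for the illustration, and the exact output layout (for example, whether matched nodes are separated by newlines) can differ between libxml2 versions.

```shell
# Hypothetical sample file; the name and location are assumptions.
workdir=$(mktemp -d)
cat > "$workdir/sample.xml" <<'EOF'
<A>
 <B>
  <C>Test</C>
 </B>
 <D>
 </D>
 <B>
    Test2
 </B>
</A>
EOF

xpath_out=""
if command -v xmllint >/dev/null 2>&1; then
  for f in "$workdir"/*.xml; do
    # xmllint may exit non-zero when the XPath matches nothing in a file;
    # "|| true" keeps the loop going in that case.
    xpath_out="$xpath_out$(xmllint --xpath '//B' "$f" 2>/dev/null || true)"
  done
  printf '%s\n' "$xpath_out"
else
  echo "xmllint not found; it ships with the libxml2 utilities" >&2
fi
```

Because the tool parses the document instead of scanning lines, this approach is not affected by how the tags are spread across lines.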
marbu