How to (e)grep XML for the content of a certain tag?

Question

How can I (e)grep all content inside a certain tag block?

Given the input file shown below, I want the output to be everything inside the B tags, including the tags themselves:

<B><C>Test</C></B>
<B>Test2</B>

I tried the following grep to search all XML files for the content between the <B> and </B> tags:

grep '<B>.*</B>' *.xml

but it did not work for the following input:

<A>
 <B>
  <C>Test</C>
 </B>
 <D>
 </D>
 <B>
    Test2
 </B>
</A>

Any ideas?

http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html — David Brabant, May 23 '12 at 06:43
Regular expressions (and particularly the wildcards) only match on a single line. Why not just search for <B> and then search for </B>? But you probably want to handle nested tags, too. — PauliL, May 23 '12 at 06:52
possible duplicate of [How can I search for a multiline pattern in a file ? Use pcregrep](http://stackoverflow.com/questions/152708/how-can-i-search-for-a-multiline-pattern-in-a-file-use-pcregrep) — Jeremy Stein, May 23 '12 at 15:14
@PauliL: Wildcards aren't the problem, it's grep itself that confines each match to a single line. — Alan Moore, May 23 '12 at 16:17
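
The point made in the comments above can be checked directly. The following is a minimal sketch, assuming a POSIX shell with grep and tr; the variable names are made up for illustration. It counts matching lines for the question's pattern, first on the sample input as written and then with the newlines stripped out.

```shell
# Sample input from the question; the variable names are illustrative.
sample='<A>
 <B>
  <C>Test</C>
 </B>
 <D>
 </D>
 <B>
    Test2
 </B>
</A>'

# grep tests each line separately, and no single line holds both <B> and </B>.
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
multi_line_hits=$(printf '%s\n' "$sample" | grep -c '<B>.*</B>' || true)

# With the newlines removed, the whole document is one line and the pattern matches.
one_line_hits=$(printf '%s\n' "$sample" | tr -d '\n' | grep -c '<B>.*</B>' || true)

echo "multi-line: $multi_line_hits, single line: $one_line_hits"
```

Flattening the file is not a fix in itself: `.*` is greedy, so the single-line match runs from the first <B> to the last </B> and includes the <D> element in between.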

score 3 · Answer 1 · answered May 23 '12 at 15:12

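Use awk with a range pattern, which prints every line from one that matches <B> through the next one that matches </B>: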

awk '/<B>/,/<\/B>/'
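
As a rough sketch of how this behaves on the question's sample input (assuming a POSIX shell and awk; the temporary file and the variable names are illustrative, and the second awk program is an addition that is not part of the answer above), the first command applies the range pattern as given and the second joins each block onto one line to match the output the question asks for:

```shell
# Sample input from the question, written to a temporary file.
# The file and variable names here are illustrative only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<A>
 <B>
  <C>Test</C>
 </B>
 <D>
 </D>
 <B>
    Test2
 </B>
</A>
EOF

# Range pattern: print every line from a <B> line through the next </B> line.
blocks=$(awk '/<B>/,/<\/B>/' "$sample")

# Same range, but trim surrounding whitespace and join each block onto one line.
joined=$(awk '/<B>/,/<\/B>/ {
    gsub(/^[ \t]+|[ \t]+$/, "")   # strip indentation and trailing blanks
    printf "%s", $0               # emit the line without a newline
    if (/<\/B>/) print ""         # finish the output line at the closing tag
}' "$sample")

printf '%s\n' "$blocks"
printf '%s\n' "$joined"
rm -f "$sample"
```

This is still line-based text matching rather than XML parsing: it assumes the opening tag is written exactly as <B> (no attributes), that B elements are not nested, and that nothing unrelated shares a line with the tags.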

Jeremy Stein

score 0 · Answer 2 · answered Jan 27 '15 at 12:16

When working with XML files, it is best to use XML-aware tools, for example xmlstarlet:

xmlstarlet sel -t -c '//B' file.xml

xmllint from libxml2:

xmllint --xpath '//B' file.xml

marbu
