23

I am working on a bash script to extract some information from an XML file. I'm using grep for this.

To find the information I need, I run:

grep -oP "<title>(.*)</title>" temp.xml

This returns a list of matches, but each match still includes the <title> tags.

How can I get a list containing only the text inside the title tag, without the tags themselves, using grep?

filype

7 Answers

43

Since you already use grep -P, why don't you use its features?

grep -oP '(?<=<title>).*?(?=</title>)' temp.xml

In the general case, XPath is the correct solution, but for toy scenarios, yes Virginia, it can be done.
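
A minimal sketch of the lookaround approach in use, assuming GNU grep built with PCRE support (`-P`). The sample file and its contents are made up for illustration and are not from the question.

```shell
# Create a throwaway sample file (contents are illustrative only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<feed>
  <title>First entry</title>
  <title>Second entry</title> <title>Third entry</title>
</feed>
EOF

# (?<=<title>) asserts that the match is preceded by <title>;
# (?=</title>) asserts that it is followed by </title>.
# Neither assertion consumes text, so -o prints only what lies in between.
titles=$(grep -oP '(?<=<title>).*?(?=</title>)' "$sample")
printf '%s\n' "$titles"

rm -f "$sample"
```

The lazy `.*?` matters when one line holds more than one title: with a greedy `.*`, the third line of the sample would produce a single match, `Second entry</title> <title>Third entry`.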

tripleee
  • 1
    but now grep -P is obsolete – Bharat Pahalwani Jul 07 '14 at 06:32
  • 2
    @Bharat Obsolete?? Can you provide a reference? – tripleee Jul 07 '14 at 07:48
  • I found that [here](http://stackoverflow.com/questions/16658333/grep-p-no-longer-works-how-can-i-rewrite-my-searches) – Bharat Pahalwani Jul 07 '14 at 07:55
  • 4
    The fact that OSX chose to remove useful functionality hardly indicates that the feature is obsolete. There is no indication that it will be removed from GNU `grep` which is easy to install on OSX if you need it, and standard on most other platforms these days. – tripleee Jun 10 '15 at 07:13
  • Is the `(?<=` construct called a lookbehind or something like that in regex? I need to learn that. – filype Jul 22 '16 at 21:24
  • 1
    [`man perlre`](http://perldoc.perl.org/perlre.html#Extended-Patterns) - `(?<=pattern)` is a lookbehind assertion and `(?=pattern)` is a lookahead assertion. – tripleee Jul 23 '16 at 06:04
  • 1
    I also don't have access to XPath on the unix system I'm using, so this is the best answer for me – deccles Mar 24 '20 at 02:37
9

I can't see why you'd want to use grep for this when it can be solved with a trivial XPath expression:

//title/text()

There are many command line tools for XPath and they're usually bundled with the OS.

Answers to this question on Stack Overflow list a number of such tools.

The problem with grep here is that it's a generic tool for text processing and it's not aware of any XML structure. For a very simple scenario, you can get it working. If the document is complex or if you're using this in a script that will survive months or years and not just a one-off job, you may end up feeling sorry for the results.

XPath makes it easy to tell the difference between similarly named tags that appear in different contexts in a document.

<article>
    <author>
        <name>Jon Doe</name>
        <title>Chief Editor</title>
    </author>
    <title>On the Benefits of grep</title>
    <publicationDate>2018-02-12</publicationDate>
    <text>blah blah blah</text>
</article>

Extracting the title of the article represented by this document with grep would fail if you used any of the other answers posted here. You could technically write the regular expression to get what you need but it's a lot easier with XPath.

/article/title/text()

If you know you're dealing with a trivial document whose format doesn't change, or if it's a one-time job where you can quickly validate the results, you can go for grep as explained by others.
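
To make the ambiguity concrete, here is a small sketch that runs a lookaround-style grep against the article document above. It assumes GNU grep with `-P`; the `xmllint` step is an extra assumption (the tool ships with libxml2 and may not be installed), so it is guarded and skipped when missing.

```shell
# The article document from this answer, written to a temporary file.
doc=$(mktemp)
cat > "$doc" <<'EOF'
<article>
    <author>
        <name>Jon Doe</name>
        <title>Chief Editor</title>
    </author>
    <title>On the Benefits of grep</title>
    <publicationDate>2018-02-12</publicationDate>
    <text>blah blah blah</text>
</article>
EOF

# grep works line by line and has no notion of nesting, so the author's
# job title is returned alongside the article title.
grep_titles=$(grep -oP '(?<=<title>).*?(?=</title>)' "$doc")
printf '%s\n' "$grep_titles"

# If xmllint happens to be available, the XPath expression selects only
# the <title> that is a direct child of <article>.
if command -v xmllint >/dev/null 2>&1; then
    xmllint --xpath '/article/title/text()' "$doc"
    echo
fi

rm -f "$doc"
```

The grep call prints both `Chief Editor` and `On the Benefits of grep`; telling them apart with a regular expression would mean encoding the document structure into the pattern by hand.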

toniedzwiedz
  • Examples of commands that support XPath are xgrep (http://wohlberg.net/public/software/xml/xgrep), xmlgrep (http://search.cpan.org/dist/XML-Twig/tools/xml_grep/xml_grep) or sgrep (http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html). – Claudi Sep 05 '14 at 06:49
  • 6
    What didn't you understand in the (clear) question that ends with : "using grep" ? – Moonchild Feb 12 '15 at 16:46
  • 3
    What did you not understand in the answer providing a useful answer to a question that addresses the core of the problem as opposed to assumptions made by the OP. Why is it bothering you? – toniedzwiedz Feb 12 '15 at 17:18
  • See also http://stackoverflow.com/questions/15461737/how-to-execute-xpath-one-liners-from-shell for a catalog of XPath tools for U*x. – tripleee Jun 10 '15 at 08:48
  • 3
    Ask a question about oranges and the accepted answer is about bananas. Nice. Here's a tip: _tips go in comments_, not answers. – Christian Mar 01 '18 at 16:44
  • 1
    I'm working on a server that doesn't have xpath, nor xmlstarlet, but it has grep. This is why I'm looking for a grep answer and cannot use xpath. – Katie Mar 01 '18 at 23:44
  • @Kayvar then feel free to use any of the other answers to this question or install the right tool for the job on your server. Sure you can hack this with `grep`, I just think it's valuable to point out that it's not a robust solution and, depending on the XML format in question and the use case, such a solution may blow up in your face. The OP seems to agree. – toniedzwiedz Mar 02 '18 at 10:15
6

It's not the best solution (I would look for an XML library that can be used from bash), but you can do:

grep -oP "<title>(.*)</title>" temp.xml | cut -d ">" -f 2 | cut -d "<" -f 1
hovanessyan
3
grep -oP "<foo>(.*)</foo>" "XML.xml" | sed -n 's/.*<foo>\([^<]*\)<\/foo>.*/\1/p' >> "foo.txt"
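
The sed expression in that pipeline already performs the extraction on its own, so the `grep -P` stage can be dropped on systems without PCRE support. A sketch, assuming at most one `<foo>` element per line; the sample input is invented for illustration.

```shell
# Hypothetical input with one element per line.
xml=$(mktemp)
cat > "$xml" <<'EOF'
<root>
  <foo>alpha</foo>
  <bar>ignored</bar>
  <foo>beta</foo>
</root>
EOF

# -n suppresses the default output; the trailing p prints only the lines
# where the substitution succeeded. \([^<]*\) captures the text up to the
# next "<", and \1 replaces the whole line with that capture.
foo_values=$(sed -n 's/.*<foo>\([^<]*\)<\/foo>.*/\1/p' "$xml")
printf '%s\n' "$foo_values"

rm -f "$xml"
```

Because the leading `.*` is greedy, a line containing several `<foo>` elements yields only the last one.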
1

You could install xgrep and use XPath, as suggested in Tom's answer:

man xgrep

Community
Yannick
1

You can use any one of the commands below to get the values between the tags.

grep -oP '(>).*?(?=</title>)' test.xml | cut -d ">" -f 2
grep -oP '(?<=title>).*(?=</title)' test.xml
awk -F "[><]" '/title/{print $3}' test.xml
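
To see why the awk variant prints `$3`, the following sketch dumps every field of a single sample line. The line itself is an assumption chosen for illustration, with `<title>` as the first tag on the line.

```shell
line='    <title>On the Benefits of grep</title>'

# With -F "[><]" every "<" or ">" is a field separator, so the line splits into:
#   $1 = leading whitespace, $2 = "title", $3 = the text,
#   $4 = "/title", $5 = empty (after the final ">")
fields=$(printf '%s\n' "$line" |
    awk -F "[><]" '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"

# The command from the answer, applied to the same line.
title=$(printf '%s\n' "$line" | awk -F "[><]" '/title/{print $3}')
printf '%s\n' "$title"
```

`$3` holds the title text only when `<title>` is the first tag on the line. For a line such as `<item><title>x</title></item>`, `$3` is the empty string between `<item>` and `<title>`.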

ARGStackOvaFlo
0

Use gawk, for example:

gawk 'BEGIN { RS="<[^>]+>" } { print RT, $0 }' myfile
S.B