2

I'm using curl to fetch the HTML from a site, and I need just one specific string from it: the text between 'standards.xml?revision=' and '&amp'. I'm using sed to do this, but I can't get the regex right and need some help.

curl website.com | sed -r 's|.*standards\.xml\?revision=([0-9]+).*|\1|'

The output I'm getting is the full HTML. Any help would be appreciated.

cakes88

3 Answers

5

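You're almost there. By default sed prints every input line, so lines that don't match pass through unchanged, which is why you see the full HTML. Use the -n option to suppress that default output, and add the p flag to s||| so that only lines where the substitution succeeded are printed: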

curl website.com | sed -n -r 's|.*standards\.xml\?revision=([0-9]+).*|\1|p'
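To see the difference without hitting a real site, here is a minimal sketch that pipes a made-up page through both versions of the command. The `sample_html` function and its contents are assumptions standing in for whatever `curl website.com` actually returns.

```shell
# Hypothetical sample standing in for the page that curl would return.
sample_html() {
  printf '%s\n' \
    '<html><body>' \
    '<a href="standards.xml?revision=12345&amp;view=markup">standards.xml</a>' \
    '</body></html>'
}

# Without -n and p: the matching line is rewritten, but the
# non-matching lines are still printed unchanged.
sample_html | sed -r 's|.*standards\.xml\?revision=([0-9]+).*|\1|'

# With -n and p: only the line where the substitution succeeded is printed.
revision=$(sample_html | sed -n -r 's|.*standards\.xml\?revision=([0-9]+).*|\1|p')
echo "$revision"
```

The first command prints three lines (two of them untouched HTML), while the second prints only the revision number.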
jkshah
    @Konnor Welcome! It seems you're new to this site. If any answer works for you, consider accepting it by clicking the hollow green tick mark beside the answer. P.S. I noticed you haven't accepted an answer on any of your 3 questions. – jkshah Oct 30 '13 at 18:47
2

You can use grep -oP (-o prints only the matched part; -P enables PCRE):

grep -oP 'standards\.xml\?revision=\K[0-9]+'

\K discards everything matched so far from the reported match, so only the later part, [0-9]+, is printed.
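A small sketch of how this behaves on sample input, assuming GNU grep built with PCRE support (the -P option is not available in every grep). The sample lines are invented for illustration; because -o prints every match on its own line, `head -n 1` is added here to keep only the first one.

```shell
# Hypothetical input with two links; a real page would come from curl.
sample_html() {
  printf '%s\n' \
    '<a href="standards.xml?revision=12345&amp;view=markup">current</a>' \
    '<a href="standards.xml?revision=12000&amp;view=markup">older</a>'
}

# Every match, one per line.
all_revisions=$(sample_html | grep -oP 'standards\.xml\?revision=\K[0-9]+')

# Only the first match.
first_revision=$(sample_html | grep -oP 'standards\.xml\?revision=\K[0-9]+' | head -n 1)

echo "$all_revisions"
echo "first: $first_revision"
```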

anubhava
1
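curl website.com | sed -n -r '/xml/ {s|.*standards\.xml\?revision=([^&]+).*|\1|p;q;}'

This builds on the previous sed answer. [0-9]+ only works if the revision is purely numeric, so [^&]+ may be more appropriate: it captures everything up to the next &. The -r option is needed so that ( ), + and \? are interpreted as extended-regex syntax. The q makes sed quit after the first line containing xml, so tighten that address if an earlier line can also match. Using single quotes and | as the delimiter is a good way to avoid escaping problems with \, so I kept that :-)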
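As a rough illustration of the difference between the two capture groups, the sketch below runs both patterns over one made-up line whose revision is not purely numeric. The value `12a7f` and the variable names are assumptions for the example only.

```shell
# Hypothetical line with a non-numeric revision identifier.
sample_line='<a href="standards.xml?revision=12a7f&amp;view=markup">standards.xml</a>'

# [0-9]+ stops at the first non-digit, so the value is truncated.
digits_only=$(printf '%s\n' "$sample_line" |
  sed -n -r 's|.*standards\.xml\?revision=([0-9]+).*|\1|p')

# [^&]+ captures everything up to the next ampersand.
up_to_amp=$(printf '%s\n' "$sample_line" |
  sed -n -r 's|.*standards\.xml\?revision=([^&]+).*|\1|p')

echo "digits only: $digits_only"
echo "up to &:     $up_to_amp"
```

If the revision is always numeric, both patterns give the same result, and [0-9]+ has the advantage of rejecting unexpected input.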

NeronLeVelu