matching a specific substring with regular expressions using awk

Question

I'm dealing with specific filenames and need to extract information from them.

The structure of the filename is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"

where RANDOMSTR is a string of at most 22 characters that may or may not contain a substring of the form "-W[0-9].[0-9]{2}.[0-9]{3}". This substring is the only part of the name that starts with "-W".

The information I need to extract is the substring of RANDOMSTR without this optional substring.

I want to implement this in a bash script, and so far the best option I found is to use gawk with a regular expression. My best attempt so far fails:

gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING-W0.40+045

The expected results are:

gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
SOME-STRING
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING

How can I get the desired effect?

Thanks.

You mentioned that the substring has the pattern `"-W[0-9].[0-9]{2}.[0-9]{3}"` yet your example input contains `...W0.40+045.raw.gz`. Do you need to cater to both? — Shawn Chin, Dec 15 '10 at 14:37
Sorry, I meant to draw attention to the plus sign which would not be covered by your pattern. — Shawn Chin, Dec 15 '10 at 15:01
Meaning the pattern was only to match the part I don't want from RANDOMSTR, not the whole string. (I could not edit my previous comment) — RogerFC, Dec 15 '10 at 15:04
Well, I was a bit lazy and just put a "." in the place of the plus sign. It matches the string, so it's ok for me. But in the end I don't really use that pattern, using "(-W.*)" is enough for me. The pattern for the substring is only provided as a reference, in case it helps. — RogerFC, Dec 15 '10 at 15:11

score 2 · Accepted Answer · answered Dec 15 '10 at 15:32

You need to be able to use look-arounds and I don't think awk/gawk supports that, but grep -P does.

$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'
$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"
SOME-STRING
$ echo "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz" | grep -Po "$pat"
OTHER-STRING

Dennis Williamson

That's definitely stronger regex-fu! +1 – Shawn Chin Dec 15 '10 at 15:34
btw, does not work for me unless I change it to `pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.+?)(?=(-W.*)?\.raw\.gz)'`, i.e. I had to use `(.+?)` instead of `(.*?)`. – Shawn Chin Dec 15 '10 at 15:41
@Shawn: `(.+?)` is probably better, but it works for me as shown. I just copied and pasted the lines from my answer to test it again and it works (either way). – Dennis Williamson Dec 15 '10 at 15:52
Thanks both! It works for me too, but only using Shawn's variant. A pity my grep-fu is not as strong as my awk-fu. If after some trials I'm not able to get the result I need (out of the scope of this issue), I'll get back to you. :) – RogerFC Dec 15 '10 at 16:36
`(.*?)` worked when I ran it on a RHEL5 box but gave an empty result on RHEL4. Strangely, the version of grep is the same (2.5.1) but the version of Bash differs (3.2 vs 3.0). I expected it to be down to the version of grep, not Bash. – Shawn Chin Dec 15 '10 at 16:37
Looks like it's down to the different versions of `libpcre` used by grep (6.6 vs 4.5). – Shawn Chin Dec 15 '10 at 16:43
Note that the `-P` option in grep is not supported in [FreeBSD](http://freebsd.org/). – ghoti Jan 31 '12 at 06:18
@ghoti: True (nor OpenBSD or NetBSD, or many others), but it is supported in OS X (BSD-based). – Dennis Williamson Jan 31 '12 at 16:36

ghoti · Answer 2 · 2012-01-31T07:17:20.747

While the grep solution is very nice indeed, the OP didn't mention an operating system, and the -P option only seems to be available in Linux. It's also pretty simple to do this in awk.

$ awk -F_ '{sub(/(-W[0-9].[0-9]+.[0-9]+)?\.raw\.gz$/,"",$NF); print $NF}' <<EOT
> 20100613_M4_28007834.005_F_SOME-STRING.raw.gz
> 20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
> EOT
SOME-STRING
OTHER-STRING
$

Note that this breaks on "20100613_M4_28007834.005_F_OTHER-STRING-W0_40+045.raw.gz". If this is a risk, and -W only shows up in the place shown above, it might be better to use something like:

$ awk -F_ '{sub(/(-W[0-9.+]+)?\.raw\.gz$/,"",$NF); print $NF}'
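If underscores can also appear elsewhere in RANDOMSTR, splitting on `_` stops being reliable. A minimal sketch of one way around that, assuming the first four underscore-separated fields are always present and that nothing but the unwanted suffix contains "-W": strip the suffix first, then strip the fixed prefix by pattern instead of by field. The function name `strip_name` is made up for illustration.

```shell
# Hypothetical helper: reads filenames on stdin, prints RANDOMSTR
# without the optional -W suffix, one per line.
strip_name() {
    awk '{
        s = $0
        sub(/(-W.*)?\.raw\.gz$/, "", s)            # drop optional -W... and .raw.gz
        sub(/^[^_]*_[^_]*_[^_]*_[^_]*_/, "", s)    # drop the first four "_" fields
        print s
    }'
}

printf '%s\n' \
    "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" \
    "20100613_M4_28007834.005_F_OTHER-STRING-W0_40+045.raw.gz" | strip_name
# SOME-STRING
# OTHER-STRING
```

The looser `-W.*` follows what the OP said in the comments was good enough; tighten it if "-W" can legitimately occur inside RANDOMSTR.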

Shawn Chin · Answer 3 · 2010-12-15T15:32:49.857

The difficulty here seems to be the fact that the (.*) before the optional (-W.*)? gobbles up the latter text. Using a non-greedy match doesn't help either. My regex-fu is unfortunately too weak to combat this.

If you don't mind a multi-pass solution, then a simpler approach would be to first sanitise the input by removing the trailing .raw.gz and possible -W*.

str="20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
echo "${str%.raw.gz}" | # remove trailing .raw.gz
     sed 's/-W.*$//' | # remove trailing -W.*, if any
     sed -nr 's/[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)/\1/p'

I used sed, but you can just as well use gawk/awk.
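Since the OP wants this inside a bash script, the same multi-pass idea can probably be done without any external process. This is only a sketch, assuming bash 3.2 or later and that "-W" never occurs inside the part of RANDOMSTR you want to keep; the function name `extract_randomstr` is invented for the example. The regex captures RANDOMSTR whole (suffix included), and a parameter expansion then trims from the first "-W" onwards.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints RANDOMSTR without the optional -W suffix,
# or returns 1 if the name does not fit the expected layout.
extract_randomstr() {
    local re='^[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)\.raw\.gz$'
    [[ $1 =~ $re ]] || return 1        # reject names that do not fit the layout
    local rest=${BASH_REMATCH[1]}      # RANDOMSTR, possibly still with the -W suffix
    printf '%s\n' "${rest%%-W*}"       # drop everything from the first "-W" on
}

extract_randomstr "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
# SOME-STRING
extract_randomstr "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
# OTHER-STRING
```

Keeping the regex in a variable and leaving it unquoted on the right-hand side of `=~` avoids the quoting differences between bash versions.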

score 0 · Answer 4 · answered Jan 31 '12 at 05:25

Wasn't able to get reluctant quantifiers going, but running through two regexes in sequence does the job:

sed -E -e 's/^.{27}(.*)\.raw\.gz$/\1/' << FOO | sed -E -e 's/-W[0-9.]+\+[0-9.]+$//'
20100613_M4_28007834.005_F_SOME-STRING.raw.gz
20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
FOO

PaulMurrayCbr

`sed -E 's/(-W[0-9].[0-9]{2}.[0-9]{3})?\.raw\.gz$//;s/.*_//'` ... You don't need multiple pipes. (For all you Linux users, use `sed -r` instead of `sed -E`.) – ghoti Jan 31 '12 at 05:49
Yes, quite right. sed -e will take a sequence of commands. I should re-write one of my scripts :) – PaulMurrayCbr Feb 03 '12 at 01:42
