I'm dealing with filenames that follow a specific format, and I need to extract information from them.

The structure of the filename is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"

where RANDOMSTR is a string of at most 22 characters that may or may not contain a substring with the format "-W[0-9].[0-9]{2}.[0-9]{3}". This substring is also the only part that starts with "-W".

The information I need to extract is the substring of RANDOMSTR without this optional substring.

I want to implement this in a bash script, and so far the best option I found is to use gawk with a regular expression. My best attempt so far fails:

gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING-W0.40+045

The expected results are:

gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
SOME-STRING
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING

How can I get the desired effect?

Thanks.

RogerFC
  • You mentioned that the substring has the pattern `"-W[0-9].[0-9]{2}.[0-9]{3}"` yet your example input contains `...W0.40+045.raw.gz`. Do you need to cater to both? – Shawn Chin Dec 15 '10 at 14:37
  • I do not include the ".raw.gz" as part of the substring. – RogerFC Dec 15 '10 at 14:54
  • Sorry, I meant to draw attention to the plus sign which would not be covered by your pattern. – Shawn Chin Dec 15 '10 at 15:01
  • Meaning the pattern was only to match the part I don't want from RANDOMSTR, not the whole string. (I could not edit my previous comment) – RogerFC Dec 15 '10 at 15:04
  • Well, I was a bit lazy and just put a "." in the place of the plus sign. It matches the string, so it's ok for me. But in the end I don't really use that pattern, using "(-W.*)" is enough for me. The pattern for the substring is only provided as a reference, in case it helps. – RogerFC Dec 15 '10 at 15:11

4 Answers

You need to be able to use look-arounds and I don't think awk/gawk supports that, but grep -P does.

$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'
$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"
SOME-STRING
$ echo "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz" | grep -Po "$pat"
OTHER-STRING
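
Since `-P` is not available in every grep build, a script may want to check for it before relying on the pattern above. The following is only a minimal sketch: the helper names `grep_has_pcre` and `extract_with_grep` are made up for illustration, and the pattern uses the `(.+?)` variant suggested in the comments.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the local grep accepts -P (PCRE).
grep_has_pcre() {
    printf 'a\n' | grep -qP 'a' 2>/dev/null
}

# Look-behind for the fixed prefix, lazy capture, look-ahead for the suffix.
pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.+?)(?=(-W.*)?\.raw\.gz)'

# Hypothetical wrapper: print the extracted part, or fail if -P is missing.
extract_with_grep() {
    if grep_has_pcre; then
        printf '%s\n' "$1" | grep -Po "$pat"
    else
        echo "grep -P is not supported by this grep" >&2
        return 2
    fi
}
```

Where `-P` is missing, the wrapper returns a non-zero status instead of silently printing nothing, so the caller can fall back to an awk or sed approach.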
Dennis Williamson
  • That's definitely stronger regex-fu! +1 – Shawn Chin Dec 15 '10 at 15:34
  • btw, does not work for me unless I change it to `pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.+?)(?=(-W.*)?\.raw\.gz)'`, i.e. I had to use `(.+?)` instead of `(.*?)`. – Shawn Chin Dec 15 '10 at 15:41
  • @Shawn: `(.+?)` is probably better, but it works for me as shown. I just copied and pasted the lines from my answer to test it again and it works (either way). – Dennis Williamson Dec 15 '10 at 15:52
  • Thanks both! It works for me too, but only using Shawn's variant. A pity my grep-fu is not as strong as my awk-fu. If after some trials I'm not able to get the result I need (out of the scope of this issue), I'll get back to you. :) – RogerFC Dec 15 '10 at 16:36
  • `(.*?)` worked when I ran it on a RHEL5 box but gave an empty result on RHEL4. Strangely, the version of grep is the same (2.5.1) but the version of Bash differs (3.2 vs 3.0). I expected it to be down to the version of grep, not Bash. – Shawn Chin Dec 15 '10 at 16:37
  • Looks like it's down to the different versions of `libpcre` used by grep (6.6 vs 4.5). – Shawn Chin Dec 15 '10 at 16:43
  • Note that the `-P` option in grep is not supported in [FreeBSD](http://freebsd.org/). – ghoti Jan 31 '12 at 06:18
  • @ghoti: True (nor OpenBSD or NetBSD, or many others), but it is supported in OS X (BSD-based). – Dennis Williamson Jan 31 '12 at 16:36

While the grep solution is very nice indeed, the OP didn't mention an operating system, and the -P option only seems to be available in Linux. It's also pretty simple to do this in awk.

$ awk -F_ '{sub(/(-W[0-9].[0-9]+.[0-9]+)?\.raw\.gz$/,"",$NF); print $NF}' <<EOT
> 20100613_M4_28007834.005_F_SOME-STRING.raw.gz
> 20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
> EOT
SOME-STRING
OTHER-STRING
$ 

Note that this breaks on "20100613_M4_28007834.005_F_OTHER-STRING-W0_40+045.raw.gz". If this is a risk, and -W only shows up in the place shown above, it might be better to use something like:

$ awk -F_ '{sub(/(-W[0-9.+]+)?\.raw\.gz$/,"",$NF); print $NF}'
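
If no external tool is wanted at all, the same idea (take what follows the last underscore, then strip the suffix) can probably be sketched with plain shell parameter expansion. The function name `strip_name` is hypothetical. The sketch shares the underscore assumption of the awk version, and it removes everything from the last "-W" onwards without checking what follows it.

```shell
#!/bin/sh
# strip_name is a hypothetical helper name used only for this sketch.
# The body runs in a subshell "( ... )" so the variable s stays local.
strip_name() (
    s=${1##*_}      # like $NF with -F_: keep what follows the last underscore
    s=${s%.raw.gz}  # drop the fixed extension
    s=${s%-W*}      # drop the shortest trailing "-W..." piece, if present
    printf '%s\n' "$s"
)

strip_name "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"             # SOME-STRING
strip_name "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"  # OTHER-STRING
```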
ghoti

The difficulty here seems to be the fact that the (.*) before the optional (-W.*)? gobbles up the latter text. Using a non-greedy match doesn't help either. My regex-fu is unfortunately too weak to combat this.

If you don't mind a multi-pass solution, then a simpler approach would be to first sanitise the input by removing the trailing .raw.gz and possible -W*.

str="20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
echo "${str%.raw.gz}" | # remove trailing .raw.gz
     sed 's/-W.*$//' | # remove trailing -W.*, if any
     sed -nr 's/[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)/\1/p'

I used sed, but you can just as well use gawk/awk.
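
For comparison, a single pass also seems possible without look-arounds if the capture group itself is written so that it can never contain "-W"; a greedy match then has nothing to gobble. This is only a sketch under assumptions: the helper name `extract_randomstr` is invented here, `sed -E` is assumed to be available (as used elsewhere in this thread), and names that end in a hyphen or contain "--W" are not handled.

```shell
#!/bin/sh
# extract_randomstr is a hypothetical helper name used only for this sketch.
# Group 1 is built from "any non-hyphen" or "a hyphen followed by a non-W",
# so it cannot run into the optional (-W.*) part that follows it.
extract_randomstr() {
    printf '%s\n' "$1" |
        sed -nE 's/^[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(([^-]|-[^W])*)(-W.*)?\.raw\.gz$/\1/p'
}

extract_randomstr "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"             # SOME-STRING
extract_randomstr "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"  # OTHER-STRING
```

Because of `-n` and the `p` flag, input that does not match the whole filename layout prints nothing.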

Shawn Chin

Wasn't able to get reluctant quantifiers going, but running through two regexes in sequence does the job:

sed -E -e 's/^.{27}(.*)\.raw\.gz$/\1/' << FOO | sed -E -e 's/-W[0-9.]+\+[0-9.]+$//'
20100613_M4_28007834.005_F_SOME-STRING.raw.gz
20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
FOO
PaulMurrayCbr
  • `sed -E 's/(-W[0-9].[0-9]{2}.[0-9]{3})?\.raw\.gz$//;s/.*_//'` ... You don't need multiple pipes. (For all you Linux users, use `sed -r` instead of `sed -E`.) – ghoti Jan 31 '12 at 05:49
  • Yes, quite right. sed -e will take a sequence of commands. I should re-write one of my scripts :) – PaulMurrayCbr Feb 03 '12 at 01:42