I have some files named as shown in the examples below:
2000_A_tim110_may112_AATT_V22_P001_R1_001_V23_P007_R2_001_comb.ext
2000_BB_tim110_may112_AAMM_V14_P002_R1_001_V45_P008_R2_001_comb.ext
2000_C_tim110_DDFF_V18_P006_R1_001.ext
2000_DD_may112_EEJJ_V88_P004_R1_001.ext
From these filenames, I would like to extract the leading 2000_[A-Z]{1,2} and ALL instances of the V[0-9]{2} regex patterns.
That is, from
2000_A_tim110_may112_AATT_V22_P001_R1_001_V23_P007_R2_001_comb.ext
I'd like to have
2000_A_V22_V23
And from
2000_DD_may112_EEJJ_V88_P004_R1_001.ext
I'd like to have
2000_DD_V88
I've been trying to achieve this with sed but I've not had any success thus far.
At first--rather naively--I tried
find *.ext | sed -r 's/^(2000_[A-Z]{1,2}).*(V{1}[0-9]{2,3}).*(V{1}[0-9]{2,3}).*\.ext/\1_\2_\3/'
And that yielded:
2000_A_V22_V23
2000_BB_V14_V45
2000_C_tim110_DDFF_V18_P006_R1_001.ext
2000_DD_may112_EEJJ_V88_P004_R1_001.ext
Which is not quite what I wanted, since two of the filenames were returned unedited.
Then, having read this post, I tried making the group being captured in the middle optional like so:
find *.ext | sed -r 's/^(2000_[A-Z]{1,2}).*(V{1}[0-9]{2})?.*(V{1}[0-9]{2}).*\.ext/\1_\2_\3/'
But this didn't seem to work either, since it returned:
2000_A__V23
2000_BB__V45
2000_C__V18
2000_DD__V88
(i.e., the capturing group in the middle seems to have been skipped entirely.)
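To make the behaviour easier to reproduce without my files, here is a minimal sketch that feeds one of the example names through the same expression. The function name second_attempt is only for illustration, and printf stands in for find. As far as I can tell, the first .* is greedy and swallows everything up to the last V match, so the optional group is left to match the empty string:

```shell
# One of the example names from above, used in place of `find *.ext`.
name='2000_A_tim110_may112_AATT_V22_P001_R1_001_V23_P007_R2_001_comb.ext'

# Same expression as the second attempt (GNU sed, extended regex via -r).
# The optional middle group may match nothing and the substitution still
# succeeds, which is why \2 comes back empty.
second_attempt() {
  sed -r 's/^(2000_[A-Z]{1,2}).*(V{1}[0-9]{2})?.*(V{1}[0-9]{2}).*\.ext/\1_\2_\3/'
}

printf '%s\n' "$name" | second_attempt   # prints 2000_A__V23
```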
My question is, how do I get the following result?
2000_A_V22_V23
2000_BB_V14_V45
2000_C_V18
2000_DD_V88
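For comparison, the output above can apparently be reached without a single sed substitution, by printing every match on its own line and then joining the lines with underscores. This is only a rough sketch, assuming grep supports -o (as GNU grep does) and that paste is available; extract_tags is a made-up helper name, and printf again stands in for find:

```shell
# Hypothetical helper: reads one filename per line on stdin and prints
# the leading 2000_XX prefix followed by every VNN token, joined by "_".
extract_tags() {
  while IFS= read -r name; do
    printf '%s\n' "$name" |
      grep -oE '^2000_[A-Z]{1,2}|V[0-9]{2}' |   # one match per line
      paste -s -d _ -                            # join the lines with "_"
  done
}

printf '%s\n' \
  '2000_A_tim110_may112_AATT_V22_P001_R1_001_V23_P007_R2_001_comb.ext' \
  '2000_BB_tim110_may112_AAMM_V14_P002_R1_001_V45_P008_R2_001_comb.ext' \
  '2000_C_tim110_DDFF_V18_P006_R1_001.ext' \
  '2000_DD_may112_EEJJ_V88_P004_R1_001.ext' | extract_tags
```

This works around the fixed number of capture groups rather than explaining it, so it does not tell me what is wrong with the expressions above.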
Where am I going wrong? Or conversely, what am I missing? I'm very new to sed and regex--and I'd like to learn to use both well--so pointers and guidance would be much appreciated.