sed: capturing a recurring regex group that happens to be optional

Question

I have some files named as shown in the examples below:

2000_A_tim110_may112_AATT_V22_P001_R1_001_V23_P007_R2_001_comb.ext
2000_BB_tim110_may112_AAMM_V14_P002_R1_001_V45_P008_R2_001_comb.ext
2000_C_tim110_DDFF_V18_P006_R1_001.ext
2000_DD_may112_EEJJ_V88_P004_R1_001.ext

From these filenames, I would like to extract the leading 2000_[A-Z]{1,2} and ALL instances of the V[0-9]{2} regex pattern.

That is,

From

2000_A_tim110_may112_AATT_V22_P001_R1_001_V23_P007_R2_001_comb.ext

I'd like to have

2000_A_V22_V23

And from

2000_DD_may112_EEJJ_V88_P004_R1_001.ext

I'd like to have

2000_DD_V88

I've been trying to achieve this with sed but I've not had any success thus far.

At first--rather naively--I tried

find *.ext | sed -r 's/^(2000_[A-Z]{1,2}).*(V{1}[0-9]{2,3}).*(V{1}[0-9]{2,3}).*\.ext/\1_\2_\3/'

And that yielded:

2000_A_V22_V23
2000_BB_V14_V45
2000_C_tim110_DDFF_V18_P006_R1_001.ext
2000_DD_may112_EEJJ_V88_P004_R1_001.ext

Which is not quite what I wanted, since two of the filenames were returned unedited.

Then, having read this post, I tried making the group being captured in the middle optional like so:

find *.ext | sed -r 's/^(2000_[A-Z]{1,2}).*(V{1}[0-9]{2})?.*(V{1}[0-9]{2}).*\.ext/\1_\2_\3/'

But this didn't work either, since it returned

2000_A__V23
2000_BB__V45
2000_C__V18
2000_DD__V88

(i.e., the capturing group in the middle seems to have been skipped entirely.)

My question is, how do I get the following result?

2000_A_V22_V23
2000_BB_V14_V45
2000_C_V18
2000_DD_V88

Where am I going wrong? Or conversely, what am I missing? I'm very new to sed and regex--and I'd like to learn to use both well--so pointers and guidance would be much appreciated.

It is at least extremely hard to do in `sed` — I'd almost be ready to say 'not possible', but that's probably not quite right. You'd probably have to repeatedly delete the bits you didn't want, which `sed` can do (labels, test and branch, though the negative patterns make life complex; the patterns can probably leverage the underscores before and after), but it is neither simple nor obvious. — Jonathan Leffler, Jul 29 '19 at 01:06
If you want an education in sed, we can give you a sed solution. If you want a simple solution that works, you should choose a different tool. — Beta, Jul 29 '19 at 01:23
@Beta, I would not mind receiving said education (so long as it isn't too much of a bother for you/the educator). As a newbie, it's rather hard to figure out which tool is appropriate for the task at hand--e.g., Ed Morton has posted a neat answer that uses `awk` which I've never really used before, but have variously encountered being touted as a tool comparable to `sed`. I feel any information you provide me here would help me understand these tools and their relative advantages/disadvantages better. — Dunois, Jul 29 '19 at 01:39
@JonathanLeffler no need for hard tools in sed, just an alternation `|` operator should do this job. See my answer. — Avinash Raj, Jul 29 '19 at 05:47

score 2 · Answer 1 · answered Jul 29 '19 at 01:00

With GNU awk for FPAT:

$ awk -v FPAT='^2000_[A-Z]{1,2}|V[0-9]{2}' '{out=$1; for (i=2; i<=NF;i++) out=out "_" $i; print out}' file
2000_A_V22_V23
2000_BB_V14_V45
2000_C_V18
2000_DD_V88
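
FPAT is specific to GNU awk. Where gawk is not available, a similar result can probably be had with a match() loop. The following is only a sketch under that assumption and is not part of the answer as posted: the function name extract_tags is made up for illustration, and [A-Z][A-Z]? stands in for [A-Z]{1,2} because some older awks do not support interval expressions.

```shell
# extract_tags: hypothetical helper for this sketch; reads names on stdin.
# Assumes only POSIX awk features (match, RSTART, RLENGTH, substr).
extract_tags() {
    awk '
    match($0, /^2000_[A-Z][A-Z]?/) {
        out  = substr($0, RSTART, RLENGTH)      # leading 2000_X or 2000_XX
        rest = substr($0, RSTART + RLENGTH)     # everything after the prefix
        # Peel off each V## tag from what is left of the line
        while (match(rest, /V[0-9][0-9]/)) {
            out  = out "_" substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
    }'
}

printf '%s\n' \
    2000_A_tim110_may112_AATT_V22_P001_R1_001_V23_P007_R2_001_comb.ext \
    2000_DD_may112_EEJJ_V88_P004_R1_001.ext | extract_tags
```

Lines that do not start with the 2000_ prefix produce no output in this sketch, which may or may not be the behaviour you want.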

Ed Morton

score 1 · Answer 2 · answered Jul 29 '19 at 05:24

As a pure bash solution (sorry, without sed), how about:

#!/bin/bash

pat='((^2000_[A-Z]{1,2})|(_V[0-9]{2}))(.*)'
while IFS= read -r -d '' line; do
    result=
    while [[ $line =~ $pat ]]; do
        result+="${BASH_REMATCH[1]}"
        line="${BASH_REMATCH[4]}"
    done
    [[ -n "$result" ]] && echo "$result"
done < <(find . -type f -name '*.ext' -printf '%f\0')

output:

2000_A_V22_V23
2000_BB_V14_V45
2000_C_V18
2000_DD_V88

Avinash Raj · Answer 3 · 2019-07-30T00:22:21.660

What's hard about basic sed? Make use of the power of the alternation | operator with sed's substitute command.

$ cat sedtets 
2000_A_tim110_may112_AATT_V22_P001_R1_001_V23_P007_R2_001_comb.ext
2000_BB_tim110_may112_AAMM_V14_P002_R1_001_V45_P008_R2_001_comb.ext
2000_C_tim110_DDFF_V18_P006_R1_001.ext
2000_DD_may112_EEJJ_V88_P004_R1_001.ext

$ sed 's/\(2000_[A-Z]\{1,2\}\|_V[0-9]\+\)\|./\1/g' sedtets
2000_A_V22_V23
2000_BB_V14_V45
2000_C_V18
2000_DD_V88

The logic here is to capture all the necessary parts using a single capturing group, and to match every remaining character with the second alternative, the lone dot.

Each match is then replaced with the contents of the capturing group. Where the group matched, the text is kept as it is; where only the dot matched, the group is empty and the character is deleted.
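
To see the alternation at work on a single short string, here is a minimal sketch using the ERE spelling of the same expression. It assumes GNU sed (the -E option, and an empty \1 when the group does not take part in a match); keep_tags is a hypothetical wrapper name, not something from the answer.

```shell
# keep_tags: hypothetical wrapper around the ERE form (GNU sed assumed).
keep_tags() {
    # Group 1 matches the parts to keep. The lone "." matches any other
    # character, in which case \1 is empty and the character is dropped.
    sed -E 's/(2000_[A-Z]{1,2}|_V[0-9]+)|./\1/g'
}

printf '%s\n' '2000_A_xx_V22_yy.ext' | keep_tags
```

With GNU sed this should print 2000_A_V22: the prefix and _V22 are matched by the group, and every other character falls through to the dot.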

edited Jul 30 '19 at 00:22

answered Jul 29 '19 at 05:45

The `sed` script shown does not work with macOS (BSD) `sed`, with or without the `-E` (extended regular expression) option. GNU `sed` does accept it without needing the `-E` option. However, given a line `2001_DD_V96`, it outputs `_V96`. Given a line `2000_BB_tim110_may112_AAMM_P002_R1_001_P008_R2_001_comb.ext` with no `_V##` in it, it outputs `2000_BB`. Granted, those weren't in the data in the question, so it isn't clear what the correct behaviour is, but it is likely that neither line should generate any output. The `2001_DD…` line certainly doesn't match the required `2000_…` prefix. – Jonathan Leffler Jul 29 '19 at 06:04
The `2000_` prefix can be handled by changing `find *.ext |` into `find 2000_*.ext |` – Walter A Jul 29 '19 at 07:53
We don't even need find if we use the same regex (without the slashes) with the rename utility – Avinash Raj Jul 29 '19 at 07:56
Could you please explain why it's necessary to use the `|` operator here? Although I accepted @Jonathan Leffler's answer as it was educational, I must confess that the solution you've presented here would be the most straightforward one. As you both can guess, the data doesn't have discrepancies such as `2001_` instead of `2000_` among the filenames and so forth. Perhaps I should have clarified this in the OP; I did not foresee people thinking so far out from what was presented in the question. : ) : ) – Dunois Jul 29 '19 at 13:29

Jonathan Leffler · Accepted Answer · 2019-07-29T06:06:07.523

As I noted in a comment, it is very hard to do the job in sed. However, with careful use of branching and testing, it can be done.

I'm using the classic sed BRE notation; if you choose to use the more modern but not necessarily as portable ERE notation, you can eliminate a fair number of backslashes. I also saved the script in a file, sed.script, and the sample data in a file data, and ran the command using:

$ sed -f sed.script data
2000_A_V22_V23
2000_BB_V14_V45
2000_C_V18
2000_DD_V88
$

The script contains:

:retry
s/^\(2000_[A-Z]\{1,2\}\(_V[0-9][0-9]\)*\)_[^_]\{1,\}$/\1/
t
s/^\(2000_[A-Z]\{1,2\}\(_V[0-9][0-9]\)*\)_[^_]\{1,\}_/\1_/
t retry

The first line sets a label retry.
The first s/// line looks for 2000_ followed by one or two upper-case letters, then a sequence of zero or more instances of an underscore, a V and two digits (this is all remembered); then an underscore and a sequence of one or more non-underscores and the end of line. This is replaced by the remembered material.
If the first s/// matched, then it branches to the end of script (t with no label name). This results in the line being printed.
The second s/// line is very similar to the first, except that instead of looking for the end of line, it looks for another underscore after the underscore and sequence of non-underscores. Note that the term that looks for _V## (where # represents a digit) finds as many of them as possible, so the _xxx_ term does not match _V##_. That is replaced by the remembered term and an underscore, so it deletes one unit of _xxx_ from the string.
If the second s/// matched, then it branches back to the start of the script.
In theory, if the second s/// does not match, then the loop is broken and what's left is printed. In practice, it is not reached with the sample data, but if an input line did not match at all (e.g. it started 2001 instead of 2000), then it would be printed unchanged after not being worked on by either of the s/// operations.
If lines that do not match the start pattern should be deleted, that can be handled by adding a line at the start of the script:
```
/^2000_[A-Z]\{1,2\}/!d
```
If lines that do not contain any _V## sequence should also be deleted, that too can be dealt with by adding more lines before the retry label. If there's a _V## at the end of a line, the first added line branches to the skip label, bypassing the check on the next line. That next line looks for _V##_ in the middle of a line and deletes the line if there's no match.
```
/_V[0-9][0-9]$/b skip
/_V[0-9][0-9]_/!d
:skip
```
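
Putting the optional guard lines together with the loop, the whole script might be assembled as below. This is a sketch: the order of the pieces follows the description above, each -e supplies one script line so that every label ends at a line break, and the wrapper name strict_tags is invented here.

```shell
# strict_tags: hypothetical wrapper; reads filenames on stdin.
# Guards first (wrong prefix, or no _V## anywhere, means delete the line),
# then the retry loop that strips one unwanted _xxx unit per pass.
strict_tags() {
    sed -e '/^2000_[A-Z]\{1,2\}/!d' \
        -e '/_V[0-9][0-9]$/b skip' \
        -e '/_V[0-9][0-9]_/!d' \
        -e ':skip' \
        -e ':retry' \
        -e 's/^\(2000_[A-Z]\{1,2\}\(_V[0-9][0-9]\)*\)_[^_]\{1,\}$/\1/' \
        -e 't' \
        -e 's/^\(2000_[A-Z]\{1,2\}\(_V[0-9][0-9]\)*\)_[^_]\{1,\}_/\1_/' \
        -e 't retry'
}

printf '%s\n' \
    2000_C_tim110_DDFF_V18_P006_R1_001.ext \
    2001_DD_V96_P004_R1_001.ext \
    2000_BB_tim110_P002_R1_001_comb.ext | strict_tags
```

Of the three sample lines, only the first should survive, printed as 2000_C_V18; the other two are made-up inputs that exercise the two guards.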

You can see how this progresses by adding p after each s/// operation, which shows the intermediate results too:

2000_A_may112_AATT_V22_P001_R1_001_V23_P007_R2_001_comb.ext
2000_A_AATT_V22_P001_R1_001_V23_P007_R2_001_comb.ext
2000_A_V22_P001_R1_001_V23_P007_R2_001_comb.ext
2000_A_V22_R1_001_V23_P007_R2_001_comb.ext
2000_A_V22_001_V23_P007_R2_001_comb.ext
2000_A_V22_V23_P007_R2_001_comb.ext
2000_A_V22_V23_R2_001_comb.ext
2000_A_V22_V23_001_comb.ext
2000_A_V22_V23_comb.ext
2000_A_V22_V23
2000_A_V22_V23
2000_BB_may112_AAMM_V14_P002_R1_001_V45_P008_R2_001_comb.ext
2000_BB_AAMM_V14_P002_R1_001_V45_P008_R2_001_comb.ext
2000_BB_V14_P002_R1_001_V45_P008_R2_001_comb.ext
2000_BB_V14_R1_001_V45_P008_R2_001_comb.ext
2000_BB_V14_001_V45_P008_R2_001_comb.ext
2000_BB_V14_V45_P008_R2_001_comb.ext
2000_BB_V14_V45_R2_001_comb.ext
2000_BB_V14_V45_001_comb.ext
2000_BB_V14_V45_comb.ext
2000_BB_V14_V45
2000_BB_V14_V45
2000_C_DDFF_V18_P006_R1_001.ext
2000_C_V18_P006_R1_001.ext
2000_C_V18_R1_001.ext
2000_C_V18_001.ext
2000_C_V18
2000_C_V18
2000_DD_EEJJ_V88_P004_R1_001.ext
2000_DD_V88_P004_R1_001.ext
2000_DD_V88_R1_001.ext
2000_DD_V88_001.ext
2000_DD_V88
2000_DD_V88

If your sed supports extensions over what POSIX sed requires, you may be able to simplify the script, for example if you can use | or +. The script as shown should work with any version of sed.

This code was tested with both macOS (BSD) sed and with GNU sed and works the same with both.
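
As a rough illustration of the simplification mentioned above, the same loop might be written in ERE notation. This sketch assumes a sed that accepts -E (current GNU and BSD seds do). It is a transliteration of the BRE script rather than something taken from the answer, it is not covered by the testing noted above, and ere_tags is a made-up name.

```shell
# ere_tags: hypothetical wrapper; same retry loop, ERE spelling.
# Each -e supplies one script line so the label ends at a line break.
ere_tags() {
    sed -E -e ':retry' \
        -e 's/^(2000_[A-Z]{1,2}(_V[0-9]{2})*)_[^_]+$/\1/' \
        -e 't' \
        -e 's/^(2000_[A-Z]{1,2}(_V[0-9]{2})*)_[^_]+_/\1_/' \
        -e 't retry'
}

printf '%s\n' 2000_DD_may112_EEJJ_V88_P004_R1_001.ext | ere_tags
```

The backslashes before the parentheses and braces disappear and \{1,\} becomes +, but the logic is unchanged.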

Thank you for the detailed answer @Jonathan Leffler. I'm really grateful for the step-by-step explanations, and I'm also impressed by the fact that you've covered outlier scenarios. — Dunois, Jul 29 '19 at 13:23
I think a simpler solution is possible: `:a; /^2000_[A-Z]*\(_V[0-9]\{2\}\)*$/b; s/\(^2000_[A-Z]*\(_V[0-9]\{2\}\)*\)_[^_]*/\1/; ba` — Beta, Jul 29 '19 at 23:47

score 1 · Answer 5 · answered Jul 29 '19 at 08:03

You can use grep with a loop:

for f in $(find 2000* -regex '2000_[A-Z].*ext'); do
    printf "%s\n" "$(grep -Eo "^2000_[A-Z]{1,2}|_V[0-9]{2}" <<<"$f" | tr -d "\n")"
done
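
A variation on the same idea, sketched here as an assumption rather than as the answer's own code: read the names line by line instead of word-splitting the output of find, and let grep -o print each match on its own line before tr joins them. grep -o is available in GNU and BSD grep but is not required by POSIX; join_matches is a hypothetical name.

```shell
# join_matches: hypothetical helper; reads one filename per line on stdin.
join_matches() {
    while IFS= read -r name; do
        # grep -o prints each match on its own line; tr removes the
        # newlines so the pieces join up, then echo ends the output line.
        printf '%s\n' "$name" |
            grep -Eo '^2000_[A-Z]{1,2}|_V[0-9]{2}' |
            tr -d '\n'
        echo
    done
}

printf '%s\n' 2000_BB_tim110_may112_AAMM_V14_P002_R1_001_V45_P008_R2_001_comb.ext | join_matches
```

Names could be fed in with something like find . -name '2000_*.ext' -exec basename {} \; piped into join_matches. Note that a name with no match at all yields an empty line in this sketch.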

Walter A
