regex in sed removing only the first occurrence from every line

Question

I have the following file I would like to clean up

cat file.txt

MNS:N+    GYPA*01 or GYPA*M   
MNS:M+    GYPA*02 or GYPA*N
MNS:Mc    GYPA*08 or GYP*Mc
MNS:Vw    GYPA*09 or GYPA*Vw
MNS:Mg    GYPA*11 or GYPA*Mg
MNS:Vr    GYPA*12 or GYPA*Vr

My desired output is:

MNS:N+  GYPA*01 or GYPA*M   
MNS:M+  GYPA*02 or GYPA*N
MNS:Mc  GYPA*08 or GYP*Mc
MNS:Vw  GYPA*09 or GYPA*Vw
MNS:Mg  GYPA*11 or GYPA*Mg
MNS:Vr  GYPA*12 or GYPA*Vr

I would like to remove everything between ":" and the first occurrence of "or".

I tried sed 's/MNS:d*?or /MNS:/g', but it removes the second "or" as well.

I tried every option in https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/ to no avail. Should I create an alias sed='perl -pe'? It seems that sed does not properly support regex.

Another approach with GNU sed: `sed -r 's/:.{14}/:/' file` – Cyrus Feb 29 '20 at 05:37

RavinderSingh13 · Accepted Answer · 2020-02-29T06:39:06.047

3

perl is more suitable here because the task needs lazy (non-greedy) matching.

perl -pe 's|(:.*?or +)(.*)|:\2|' Input_file

By using .*?or, the match stops at the nearest or after the colon, i.e. the first one in the line, instead of running on to the last one.

edited Feb 29 '20 at 06:39

answered Feb 29 '20 at 05:30


Are there any advantages to using sed over perl -pe overall? – Shahin Feb 29 '20 at 05:32
@user171558, if it's a single character then we could have used back reference logic in sed, but in perl you can do a lazy match (which means look for the nearest match first). – RavinderSingh13 Feb 29 '20 at 05:34

No downvote from me. Shorter: `perl -pe 's/:.*? or /:/' file` – Cyrus Feb 29 '20 at 07:07

score 2 · Answer 2 · answered Feb 29 '20 at 10:40

This might work for you (GNU sed):

sed '/:.*\<or\>/{s/\<or\>/\n/;s/:.*\n//}' file

If a line contains : followed by the word or, then substitute the first occurrence of the word or with a unique delimiter (e.g. \n) and then remove everything between : and the unique delimiter.
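A small sketch of this two-step idea, run on a sample line borrowed from a comment elsewhere in this thread (not the asker's real data). It assumes GNU sed, since `\<`, `\>` and `\n` in the replacement are GNU extensions. As far as I can tell, the command as posted also consumes the colon itself, so the second variant below is my own adjustment (an assumption about what the asker wants) that puts the colon back and swallows the spaces after or.

```shell
# Sample line from a comment in this thread; variable names are illustrative.
line='MNS:2 odwasdsw or M+ GYPA*02 or GYPA*N'

# Command as posted: mark the first word "or" with \n, then delete from ":" through \n.
posted=$(printf '%s\n' "$line" | sed '/:.*\<or\>/{s/\<or\>/\n/;s/:.*\n//}')

# Adjusted variant (assumption): also eat spaces after "or" and keep the colon.
kept=$(printf '%s\n' "$line" | sed '/:.*\<or\>/{s/\<or\> */\n/;s/:.*\n/:/}')

echo "$posted"   # MNS M+ GYPA*02 or GYPA*N
echo "$kept"     # MNS:M+ GYPA*02 or GYPA*N
```

The newline works as a delimiter because sed reads input one line at a time, so the pattern space cannot already contain one.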

score 2 · Answer 3 · answered Feb 29 '20 at 15:00

Regarding "I would like to remove everything between ":" and the first occurrence of "or"": no, you wouldn't. The first occurrence of or in the 2nd line of the sample input is at the start of orweqqwe. The text immediately after : looks like it could be any set of characters, so couldn't it contain a standalone or, e.g. MNS:2 or eqqwe or M+ GYPA*02 or GYPA*N?

Given that and the fact it's apparently a fixed number of characters to be removed on every line, it seems like this is what you should really be using:

$ sed 's/:.\{14\}/:/' file
MNS:N+    GYPA*01 or GYPA*M
MNS:M+    GYPA*02 or GYPA*N
MNS:Mc    GYPA*08 or GYP*Mc
MNS:Vw    GYPA*09 or GYPA*Vw
MNS:Mg    GYPA*11 or GYPA*Mg
MNS:Vr    GYPA*12 or GYPA*Vr

You should REALLY change `MNS:2 odwasdsw or M+ GYPA*02 or GYPA*N` to `MNS:2 or asdsw or M+ GYPA*02 or GYPA*N` in your example if that can occur because most solutions posted so far will fail if/when it does. — Ed Morton, Mar 01 '20 at 02:50

score 1 · Answer 4 · answered Feb 29 '20 at 05:50

If it is certain that or always occurs twice per line, as in the provided example, please try:

sed 's/\(MNS:\).\+ or \(.\+ or .*\)/\1\2/' file.txt

Result:

MNS:N+    GYPA*01 or GYPA*M   
MNS:M+    GYPA*02 or GYPA*N
MNS:Mc    GYPA*08 or GYP*Mc
MNS:Vw    GYPA*09 or GYPA*Vw
MNS:Mg    GYPA*11 or GYPA*Mg
MNS:Vr    GYPA*12 or GYPA*Vr

Otherwise perl, which supports shortest (lazy) matching, is the better solution, as in RavinderSingh13's answer.
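To show why the "twice per line" condition matters, here is a minimal sketch. The first sample line comes from a comment in this thread, the second is a made-up line with only one or, and `strip_prefix` is a hypothetical helper name. It assumes GNU sed, because `\+` is a GNU extension in basic regular expressions.

```shell
# Hypothetical helper wrapping the sed command from this answer.
strip_prefix() {
  printf '%s\n' "$1" | sed 's/\(MNS:\).\+ or \(.\+ or .*\)/\1\2/'
}

# Two "or"s: the greedy .\+ has to back off to the first " or " so that
# the second group can still find its own " or ".
with_two=$(strip_prefix 'MNS:2 odwasdsw or M+ GYPA*02 or GYPA*N')

# Only one "or": the pattern cannot match, so the line passes through unchanged.
with_one=$(strip_prefix 'MNS:M+ GYPA*02 or GYPA*N')

echo "$with_two"   # MNS:M+ GYPA*02 or GYPA*N
echo "$with_one"   # MNS:M+ GYPA*02 or GYPA*N
```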

score 1 · Answer 5 · answered Feb 29 '20 at 06:50


ex supports lazy matching with \{-}:

ex -s '+%s/:\zs.\{-}or //g|wq' input_file

The pattern :\zs.\{-}or matches any character after the first : up to the first or.

builder-7000

