
I have the following file I would like to clean up

cat file.txt

MNS:N+    GYPA*01 or GYPA*M   
MNS:M+    GYPA*02 or GYPA*N
MNS:Mc    GYPA*08 or GYP*Mc
MNS:Vw    GYPA*09 or GYPA*Vw
MNS:Mg    GYPA*11 or GYPA*Mg
MNS:Vr    GYPA*12 or GYPA*Vr

My desired output is:

MNS:N+  GYPA*01 or GYPA*M   
MNS:M+  GYPA*02 or GYPA*N
MNS:Mc  GYPA*08 or GYP*Mc
MNS:Vw  GYPA*09 or GYPA*Vw
MNS:Mg  GYPA*11 or GYPA*Mg
MNS:Vr  GYPA*12 or GYPA*Vr

I would like to remove everything between ":" and the first occurrence of "or".

I tried sed 's/MNS:d*?or /MNS:/g', but it removes the second "or" as well.

I tried every option in https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/ to no avail. Should I create alias sed='perl -pe'? It seems that sed does not properly support regex.

Shahin

5 Answers


perl is a better fit here because the task needs a lazy (non-greedy) match.

perl -pe 's|(:.*?or +)(.*)|:$2|' Input_file

The lazy pattern .*?or stops at the nearest or after the colon instead of running on to the last or in the line.

RavinderSingh13

This might work for you (GNU sed):

sed '/:.*\<or\>/{s/\<or\>/\n/;s/:.*\n */:/}' file

If a line contains : followed by the word or, substitute the first occurrence of the word or with a unique delimiter (e.g. \n), then remove everything between : and that delimiter.
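The two-step idea can be tried on one line as follows. This is only a sketch: the helper name strip_via_delimiter and the sample line are assumptions, and it needs GNU sed. The second substitution writes the colon back and also consumes the blanks after the delimiter, so the kept text follows the colon directly.

```shell
# Hypothetical helper; GNU sed is assumed for \< \> (word boundaries)
# and for \n in the replacement text.
strip_via_delimiter() {
  printf '%s\n' "$1" |
    sed '/:.*\<or\>/{s/\<or\>/\n/;s/:.*\n */:/}'
}

# "orweqqwe" starts with "or" but is not the word "or", so it is skipped.
strip_via_delimiter 'MNS:2 orweqqwe or M+ GYPA*02 or GYPA*N'
# prints: MNS:M+ GYPA*02 or GYPA*N
```

A newline works as the delimiter because sed reads input one line at a time, so the pattern space cannot already contain one.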

potong

Regarding "I would like to remove everything between ":" and the first occurrence of "or"": no, you wouldn't. The first occurrence of or in the 2nd line of the sample input is at the start of orweqqwe. The text immediately after : looks like it could be any set of characters, so couldn't it contain a standalone or, e.g. MNS:2 or eqqwe or M+ GYPA*02 or GYPA*N?

Given that and the fact it's apparently a fixed number of characters to be removed on every line, it seems like this is what you should really be using:

$ sed 's/:.\{14\}/:/' file
MNS:N+    GYPA*01 or GYPA*M
MNS:M+    GYPA*02 or GYPA*N
MNS:Mc    GYPA*08 or GYP*Mc
MNS:Vw    GYPA*09 or GYPA*Vw
MNS:Mg    GYPA*11 or GYPA*Mg
MNS:Vr    GYPA*12 or GYPA*Vr
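As a quick check of the fixed-width idea, the sketch below runs the same substitution on the two sample lines quoted in the comment under this answer. The helper name strip_fixed is made up, and the approach only holds if the real data always has exactly 14 characters between the colon and the text to keep, as these two lines do.

```shell
# Hypothetical helper: remove exactly 14 characters after the first ":".
strip_fixed() {
  printf '%s\n' "$1" | sed 's/:.\{14\}/:/'
}

strip_fixed 'MNS:2 odwasdsw or M+ GYPA*02 or GYPA*N'
# prints: MNS:M+ GYPA*02 or GYPA*N

# A standalone "or" inside the junk text makes no difference here.
strip_fixed 'MNS:2 or asdsw or M+ GYPA*02 or GYPA*N'
# prints: MNS:M+ GYPA*02 or GYPA*N
```

Because the pattern counts characters instead of searching for or, it gives the same result on both lines, whereas a match that stops at the first or would leave asdsw or behind on the second one.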
Ed Morton
  • You should REALLY change `MNS:2 odwasdsw or M+ GYPA*02 or GYPA*N` to `MNS:2 or asdsw or M+ GYPA*02 or GYPA*N` in your example if that can occur because most solutions posted so far will fail if/when it does. – Ed Morton Mar 01 '20 at 02:50

If the or is guaranteed to occur exactly twice per line, as in the provided example, please try:

sed 's/\(MNS:\).\+ or \(.\+ or .*\)/\1\2/' file.txt

Result:

MNS:N+    GYPA*01 or GYPA*M   
MNS:M+    GYPA*02 or GYPA*N
MNS:Mc    GYPA*08 or GYP*Mc
MNS:Vw    GYPA*09 or GYPA*Vw
MNS:Mg    GYPA*11 or GYPA*Mg
MNS:Vr    GYPA*12 or GYPA*Vr

Otherwise perl is the better solution, since it supports shortest (lazy) matching, as in RavinderSingh13's answer.
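For illustration, the sketch below applies the greedy sed pattern to one line with two or separators and to one line that has already been cleaned. The helper name strip_greedy and the sample lines are assumptions, and \+ requires GNU sed.

```shell
# Hypothetical helper; \+ is a GNU sed extension in basic regular expressions.
strip_greedy() {
  printf '%s\n' "$1" | sed 's/\(MNS:\).\+ or \(.\+ or .*\)/\1\2/'
}

# Two " or " separators: the text up to and including the first one is dropped.
strip_greedy 'MNS:2 odwasdsw or M+ GYPA*02 or GYPA*N'
# prints: MNS:M+ GYPA*02 or GYPA*N

# Only one " or ": the pattern does not match, so the line is left as it is.
strip_greedy 'MNS:M+ GYPA*02 or GYPA*N'
# prints: MNS:M+ GYPA*02 or GYPA*N
```

Requiring a second " or " in the kept part is what stops the greedy .\+ from swallowing the whole line, and it also makes the command safe to run twice.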

tshiono

ex supports lazy matching with \{-}:

ex -s '+%s/:\zs.\{-}or //g|wq' input_file

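The pattern :\zs.\{-}or (with a trailing space) matches the shortest run of characters after the first : up to and including the first or. \zs starts the match after the colon, so the colon itself is kept.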

builder-7000