Regular Expression: PATTERN with exception using AWK gsub

Question

I have a data file (cou.data)

USSR    8649    275 Asia
Cananda 3852    25  North America
China   3705    1032    Asia
USA 3615    237 North America
Brazil  3286    134 South America
India   1267    746 Asia
Mexico  762 78  North America
France  211 55  Europe
Japan   144 120 Asia
Germany 96  61  Europe
England 94  56  Europe
Taiwan  55  144 Asia
North Korea 44  2134    Asia

The data contains only spaces, no tabs.

I want to replace each run of spaces with ":", but leave the spaces inside multi-word names (such as "North America" or "North Korea") unchanged.

That is, my desired output should look like the following:

USSR:8649:275:Asia
Cananda:3852:25:North America
China:3705:1032:Asia
USA:3615:237:North America
Brazil:3286:134:South America
India:1267:746:Asia
Mexico:762:78:North America
France:211:55:Europe
Japan:144:120:Asia
Germany:96:61:Europe
England:94:56:Europe
Taiwan:55:144:Asia
North Korea:44:2134:Asia

I have cudgeled my brain and can only write this

awk '{ gsub(/([a-zA-Z] +[0-9]|[0-9] +[a-zA-Z]|[0-9] +[0-9])/, ":"); print }' cou.data

But the output is not right.

USS:64:7:sia
Canand:85::orth America
Chin:70:03:sia
US:61:3:orth America
Brazi:28:3:outh America
Indi:26:4:sia
Mexic:6::orth America
Franc:1::urope
Japa:4:2:sia
German:::urope
Englan:::urope
Taiwa::4:sia
North Kore::13:sia

Some parts which should not have been removed are gone.

How can my AWK code be modified, or is there an easier way to get what I want?

P.S.

awk '{ print gensub(/([a-zA-Z])( )([a-zA-Z])/, "\\1~\\3", "g", $0) }' cou.data | sed -r 's/ +/:/g; s/~/ /g'

Thanks for the heads-up – Sleeping On a Giant's Shoulder Aug 06 '18 at 13:39

Sundeep · Accepted Answer · 2018-08-06T08:23:14.647

You need capture groups and backreferences, which not all awk implementations support. GNU awk supports them through gensub. I would suggest using sed instead:

$ sed -E 's/ +([0-9])/:\1/g; s/([0-9]) +/\1:/g' ip.txt
USSR:8649:275:Asia
Cananda:3852:25:North America
China:3705:1032:Asia
USA:3615:237:North America
Brazil:3286:134:South America
India:1267:746:Asia
Mexico:762:78:North America
France:211:55:Europe
Japan:144:120:Asia
Germany:96:61:Europe
England:94:56:Europe
Taiwan:55:144:Asia
North Korea:44:2134:Asia

-E enables ERE; some sed versions need -r instead of -E.
s/ +([0-9])/:\1/g matches one or more spaces followed by a digit. Only the spaces should be replaced and the digit kept as is, so capture the digit and refer to it in the replacement section with a backreference.
s/([0-9]) +/\1:/g covers the cases where a digit is followed by spaces.
A capture group is defined by placing part of the regex inside (). Counting from left to right, \1 refers to the first such group, \2 to the second, and so on.
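To see what each of the two substitutions contributes, here is a small sketch that runs them separately on one line of the sample data. The function names first_pass and second_pass are made up for illustration, and it assumes the local sed accepts -E.

```shell
# Illustrative helpers; the names are not part of the answer above.
# First pass: spaces that come right before a digit become ':'.
first_pass()  { sed -E 's/ +([0-9])/:\1/g'; }
# Second pass: spaces that come right after a digit become ':'.
second_pass() { sed -E 's/([0-9]) +/\1:/g'; }

line='North Korea 44  2134    Asia'

printf '%s\n' "$line" | first_pass
# North Korea:44:2134    Asia

printf '%s\n' "$line" | first_pass | second_pass
# North Korea:44:2134:Asia
```

The space in "North Korea" survives both passes because it has no digit on either side.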

With perl, you could avoid having to use capture groups

perl -pe 's/ +(?=\d)|\d\K +/:/g' ip.txt

The regex ' +(?=\d)|\d\K +' (note the leading space) matches a run of spaces only if it is followed by a digit or preceded by a digit.

With GNU awk (see "String-Manipulation Functions" in the gawk manual for syntax and details):

awk '{$0=gensub(/ +([0-9])/, ":\\1", "g", $0);
      print gensub(/([0-9]) +/, "\\1:", "g", $0)}' ip.txt
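If gensub is not available, the same rule can be approximated without backreferences. The following is only a sketch under that assumption, not something from the answer: it walks over each run of spaces with match() and looks at the neighbouring characters. The function name colonize and the variable names are hypothetical.

```shell
# colonize: hypothetical helper; reads lines on stdin, writes to stdout.
colonize() {
  awk '{
    rest = $0; out = ""
    # Visit every run of spaces, left to right.
    while (match(rest, / +/)) {
      prev = (RSTART > 1) ? substr(rest, RSTART - 1, 1) : ""
      nxt  = substr(rest, RSTART + RLENGTH, 1)
      # A digit on either side means the run is a field separator;
      # otherwise keep the spaces exactly as they were.
      sep  = (prev ~ /[0-9]/ || nxt ~ /[0-9]/) ? ":" : substr(rest, RSTART, RLENGTH)
      out  = out substr(rest, 1, RSTART - 1) sep
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
  }'
}

printf '%s\n' 'USA 3615    237 North America' | colonize
# USA:3615:237:North America
```

It uses only match(), substr(), RSTART and RLENGTH, so it should not depend on GNU extensions.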

awk '{ print gensub(/([a-zA-Z])( )([a-zA-Z])/, "\\1~\\3", "g", $0) }' cou.data | sed -r 's/ +/:/g; s/~/ /g' What do you think of this one? – Sleeping On a Giant's Shoulder Aug 06 '18 at 12:46
I do not see any benefit to that approach other than as a learning attempt. Could you explain why you want to do it that way? It would fail if the input contains `~`, and the awk part could be done with sed, so why awk+sed? – Sundeep Aug 06 '18 at 12:53
Yes, you are right. I was just practising gensub. sed does the job well. Thanks – Sleeping On a Giant's Shoulder Aug 06 '18 at 13:15

Mic · Answer 2 · 2018-08-06T19:46:58.460

With GNU awk, you can use backreferences to keep the parts of the original match that you want. Using gensub and adding capture groups to your regex gives the following:

gawk '{ print gensub(/(([a-zA-Z]) +([0-9]))|(([0-9]) +([a-zA-Z]))|(([0-9]) +([0-9]))/, "\\2\\5\\8:\\3\\6\\9", "g"); }' file

see https://www.gnu.org/software/gawk/manual/gawk.html#index-substitute-in-string
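One thing to keep in mind with a single pattern that consumes a character on both sides of the spaces: a one-digit field can only take part in one match, so the spaces after it are left alone. The sample data has no one-digit fields, so it is not affected. The sketch below illustrates this with sed -E and the same alternation, used here as a stand-in so that gawk is not required. The input line and the function names are invented for the demonstration.

```shell
# single_pattern: stand-in for the one-regex approach, written for sed -E.
# Groups that did not take part in a match expand to nothing.
single_pattern() {
  sed -E 's/([a-zA-Z]) +([0-9])|([0-9]) +([a-zA-Z])|([0-9]) +([0-9])/\1\3\5:\2\4\6/g'
}
# two_pass: the two substitutions from the accepted answer, for comparison.
two_pass() {
  sed -E 's/ +([0-9])/:\1/g; s/([0-9]) +/\1:/g'
}

line='Fiji 7 2 Oceania'   # made-up record with one-digit fields

printf '%s\n' "$line" | single_pattern
# Fiji:7 2:Oceania   (the 7 was consumed by the first match)

printf '%s\n' "$line" | two_pass
# Fiji:7:2:Oceania
```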

edited Aug 06 '18 at 19:46

answered Aug 06 '18 at 08:29

Thanks Inian, I forgot to mention that it needs GNU awk to use gensub; will modify – Mic Aug 06 '18 at 08:43
Is the triple backslash a typo or some weird requirement on your system? – Sundeep Aug 06 '18 at 12:50
My mistake. I typed it without code tags at first, so the double backslash showed up as one. Thanks. I like your solution, by the way; very neat. – Mic Aug 06 '18 at 19:47
