2

I have lines like these:

ORIGINAL

sometext1 sometext2 word:A12 B34 C56 sometext3 sometext4
sometext5 sometext6 word:A123 B45 C67 sometext7 sometext8
sometext9 sometext10 anotherword:(someword1 someword2 someword3) sometext11 sometext12

EDITED

asdjfkklj lkdsjfic kdiw:A12 B34 C56 lksjdfioe sldkjflkjd
lknal niewoc kdiw:A123 B45 C678 oknes lkwid 
cnqule nkdal anotherword:(kdlklks inlqok mncvmnx) unqieo lksdnf

Desired output:

asdjfkklj lkdsjfic kdiw:A12-B34-C56 lksjdfioe sldkjflkjd
lknal niewoc kdiw:A123-B45-C678 oknes lkwid 
cnqule nkdal anotherword:(kdlklks-inlqok-mncvmnx) unqieo lksdnf

EDITED: Would this be more explicit? But frankly this is much more difficult to read and answer than writing sometext#. I do not know people's preference.

I only want to replace the whitespace with dashes after an alphabetic letter followed by some digits, AND replace the whitespace with dashes between the words inside the two parentheses, and not any other whitespace in the line. I would appreciate an explanation of the syntax too.

Thanks!

Char
  • 1
    Why is the whitespace between `t2 w` in `sometext2 word` not replaced? That meets the criterion of a letter followed by a digit. Does it have to be multiple digits? Does it have to be a single character bordered by a boundary? – 123 Oct 26 '17 at 06:30
  • Are there always three parts to be joined by dashes? – Armali Oct 26 '17 at 07:52
  • @123 sometext1 sometext2 just means a bunch of text. I'm just using the numbers to show that they hold different characters. Same for the different groups of words in the third example. – Char Oct 26 '17 at 13:03
  • @Armali No, there could be more than 3 groups. Same for the word groups in the third example. – Char Oct 26 '17 at 13:05

3 Answers

1

This code works well:

darby@Debian:~/Scrivania$ cat test.txt | sed -r 's@\s+([A-Z][0-9]+)@-\1@g' | sed ':l s/\(([^ )]*\)[ ]/\1-/;tl'
asdjfkklj lkdsjfic kdiw:A12-B34-C56 lksjdfioe sldkjflkjd
lknal niewoc kdiw:A123-B45-C678 oknes lkwid 
cnqule nkdal anotherword:(kdlklks-inlqok-mncvmnx) unqieo lksdnf

Explanation of my regex

In the first regex

Options

-r              Enable extended regular expressions

Pattern

\s+             One or more whitespace characters
([A-Z][0-9]+)   Submatch: an uppercase letter followed by one or more digits

Replace

-              Dash character
\1             Previous submatch

Note

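The g after the closing delimiter (s@...@...@g here, or s/.../.../g with the usual slashes) makes the substitution global: every match on the line is replaced, not just the first one.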

In the second regex

Commands

:l             Label; the target that t or b branches to
tl             Conditional branch: jump to label l if any substitution has been made on the pattern space since the most recent reading of an input line or execution of a t command. If no label is given, jump to the end of the script

Pattern

\(([^ )]*\)    Submatch: a literal opening parenthesis followed by any run of characters that are neither a space nor a closing parenthesis
[ ]            One space character

Replace

\1             Previous submatch
-              Add a dash
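
Following up on the comment below about dropping cat, here is a minimal sketch that runs the two substitutions from this answer in a single sed call. The function name join_codes is made up for illustration, -E stands in for -r, [[:space:]] stands in for \s, and the second expression is rewritten in extended syntax so that one flag covers both; treat it as a sketch under those assumptions rather than the answer's exact command.

```shell
#!/bin/sh
# join_codes is a hypothetical helper name, not part of the original answer.
# With no file arguments it reads standard input, like sed itself.
join_codes() {
  sed -E \
    -e 's/[[:space:]]+([A-Z][0-9]+)/-\1/g' \
    -e ':l' \
    -e 's/(\([^ )]*) /\1-/' \
    -e 'tl' \
    "$@"
}

# Sample lines taken from the question.
printf '%s\n' \
  'asdjfkklj lkdsjfic kdiw:A12 B34 C56 lksjdfioe sldkjflkjd' \
  'cnqule nkdal anotherword:(kdlklks inlqok mncvmnx) unqieo lksdnf' |
  join_codes
```

Note that the first expression is not tied to the word before the colon: it joins any whitespace that precedes an uppercase letter plus digits, anywhere on the line.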
Darby_Crash
  • Doesn't work with `sometext5 sometext6 word:A123 B45 C678 D888 sometext7 sometext8` or `sometext5 sometext6 word:A123 B45 sometext7 sometext8` – Indent Oct 26 '17 at 06:41
  • While this code snippet may solve the question, [including an explanation](http://meta.stackexchange.com/questions/114762/explaining-entirely-code-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – Dr Rob Lang Oct 26 '17 at 07:58
  • No need for cat this way: sed 's/ \([A-Z]\)/-\1/g;:l s/\(([^ )]*\) /\1-/;tl' test.txt – ctac_ Oct 26 '17 at 10:06
  • 1
    I now realise that adding a digit after sometext means to some people literally having a number after some characters. And some people take sometext literally as having a word with characters s, o, m, e, t, e, x, t. I apologise for the confusion. Pardon my newbieness in regex. My context here is that sometext# represents a string of characters that may or may not form a readable word, and most likely would be completely different from another sometext#, either having different characters or the same characters in different combinations, and may be of different length. I shall edit my question. – Char Oct 26 '17 at 13:19
  • Now I have explained my code. I hope it helps you. – Darby_Crash Oct 26 '17 at 17:24
1

This might work for you (GNU sed):

sed -r ':a;s/(A[0-9]+(-[A-Z][0-9]+)*) ([A-Z][0-9]+)/\1-\3/;ta;s/(\(\S+(-\S+)*) (\S+( \S+)*\))/\1-\3/;ta' file

Iteratively replace the space(s) in the required strings using a regexp and back references.
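
Since the question asks for an explanation of the syntax, here is the same command spread over several lines with comments. This is only a sketch: the function name join_iteratively is invented for illustration, and it assumes GNU sed, because \S (any non-whitespace character) is a GNU extension.

```shell
#!/bin/sh
# join_iteratively is a hypothetical helper name; the two s commands are
# the ones from the one-liner above. Assumes GNU sed (for \S).
join_iteratively() {
  sed -E '
    :a
    # Group 1: a run that starts with A<digits> plus any codes already
    # joined with dashes. Group 3: the next code after one space.
    s/(A[0-9]+(-[A-Z][0-9]+)*) ([A-Z][0-9]+)/\1-\3/
    # If that replaced something, go back to :a and try again.
    ta
    # Group 1: "(" plus the words joined so far. Group 3: the remaining
    # space-separated words up to the closing ")".
    s/(\(\S+(-\S+)*) (\S+( \S+)*\))/\1-\3/
    ta
  ' "$@"
}

# Sample lines taken from the question.
printf '%s\n' \
  'asdjfkklj lkdsjfic kdiw:A12 B34 C56 lksjdfioe sldkjflkjd' \
  'cnqule nkdal anotherword:(kdlklks inlqok mncvmnx) unqieo lksdnf' |
  join_iteratively
```

Each pass joins exactly one space, so a run with more than three parts just takes more trips around the loop. As written, the first pattern only starts at a code beginning with the literal letter A; a run starting with, say, B7 would be left alone.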

potong
0

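You need to capture the first alphanumeric group and the start of the second group using (). Then you can replace them using the backreferences \1 and \2:

using sed twice (with g, the letter that starts the second group is consumed by the match, so it cannot also start the next match; the second pass picks up what the first one skipped)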

sed -E 's/(\b[A-Za-z][0-9]+) ([A-Z])/\1-\2/g' | sed -E 's/(\b[A-Za-z][0-9]+) ([A-Z])/\1-\2/g' 

or using perl (with the lookahead (?=...), the regex does not consume the second group, so a single pass with g joins every group)

perl -pe 's/(\b[A-Za-z][0-9]+) (?=[A-Z])/$1-/g'


\b word boundary
[A-Za-z] 1 letter
[0-9]+ 1 or more digits

sed doesn't support lookahead and lookbehind functionality
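
If running sed twice feels fragile, one possible workaround is to let sed repeat the substitution itself with a label and a conditional branch. The sketch below reuses the pattern from this answer; the function name join_run is made up for illustration, and it assumes GNU sed because \b is a GNU extension. It only handles the letter-plus-digits codes, not the words in parentheses.

```shell
#!/bin/sh
# join_run is a hypothetical helper name. Assumes GNU sed (for \b).
join_run() {
  sed -E \
    -e ':a' \
    -e 's/(\b[A-Za-z][0-9]+) ([A-Z])/\1-\2/' \
    -e 'ta' \
    "$@"
}

# The line from the comment thread that needed a second run.
printf '%s\n' 'sometext1 word:A12 B34 C56 sometext2' | join_run
```

Each trip around the loop joins one space, and ta jumps back to :a until the substitution stops matching, so the number of groups does not need to be known in advance.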

Indent
  • Thanks. But it replaced only the first whitespace with a dash. `sometext1 word:A12-B34 C56 sometext2`. Could you refine your expression please? – Char Oct 26 '17 at 06:08
  • 1
    If I execute it a second time, then it works, replacing the second whitespace `word:A12-B34-C56`. But it shouldn't have to run a second time right? Could the replacement be done in one execution? – Char Oct 26 '17 at 06:13