5

I have a file containing multiple lines starting with "1ECLI H--- 12.345 .....". I want to remove one space between the I and the H and append R/S/T each time the H pattern repeats. For example, if H810 is repeated on three consecutive lines, it should get the letter R on the first line, S on the second, and T on the third, so the first one becomes H810R. Any help will be appreciated.
The text looks like this:

1ECLI  H813   98   7.529   8.326   9.267
1ECLI  H813   99   7.427   8.470   9.251
1ECLI  C814  100   7.621   8.513   9.263
1ECLI  H814  101   7.607   8.617   9.289
1ECLI  H814  102   7.633   8.489   9.156
1ECLI  H814  103   7.721   8.509   9.305
1ECLI   C74  104   8.164   8.733  10.740
1ECLI  H74R  105   8.247   8.690  10.799

After the change:

1ECLI H813R   98   7.529   8.326   9.267
1ECLI H813S   99   7.427   8.470   9.251
1ECLI  C814  100   7.621   8.513   9.263
1ECLI H814R  101   7.607   8.617   9.289
1ECLI H814S  102   7.633   8.489   9.156
1ECLI H814T  103   7.721   8.509   9.305
1ECLI   C74  104   8.164   8.733  10.740
1ECLI  H74R  105   8.247   8.690  10.799

Thanks.

amruta
  • Why `H74R` doesn't get anything? What happens if H repeats more than 3 times? – pogibas Nov 06 '17 at 08:24
  • It already has R... I want to add letters to the pattern "H digit digit digit" (H with 3 digits). – amruta Nov 06 '17 at 08:26
  • 1
    Removing the spaces isn't hard, but if you could answer PoGibas' questions, it would make answering much easier –  Nov 06 '17 at 08:26
  • Is there always going to be `H\d\d\d` 3 times consecutively ? – Sachin Nov 06 '17 at 08:28
  • H with two digit and a letter R is fine and as per the required naming style. The H with three digit is missing with R/S/T letters.. I have to add it... – amruta Nov 06 '17 at 08:29
  • H\d\d\d 3 times consecution is not always present. it might be found two times only and thus need to add just R and S at the end. – amruta Nov 06 '17 at 08:31

3 Answers

2

If your Input_file has the same layout as the sample shown, please try the following awk and let me know if it helps.

awk '
BEGIN{
  val[1]="R";
  val[2]="S";
  val[3]="T"
}
$2 !~ /^H[0-9]+/ || i==3{
  i=""
}
$2 ~ /^H[0-9]+$/ && /^1ECLI/{
  $2=$2val[++i]
}
1
'   Input_file  > temp_file  && mv  temp_file   Input_file

Here is the same command with an explanation of each step.

awk '
BEGIN{                        ## BEGIN block: runs once, before any input is read.
  val[1]="R";                 ## Suffix for the first consecutive match.
  val[2]="S";                 ## Suffix for the second consecutive match.
  val[3]="T"                  ## Suffix for the third consecutive match.
}
$2 !~ /^H[0-9]+/ || i==3{     ## If field 2 does not start with H followed by digits, OR i has reached 3,
  i=""                        ## reset the counter i.
}
$2 ~ /^H[0-9]+$/ && /^1ECLI/{ ## If field 2 is H followed only by digits AND the line starts with 1ECLI,
  $2=$2val[++i]               ## increment i and append val[i] to field 2 (awk then rebuilds the line with single spaces).
}
1                             ## Always-true pattern with no action: print the current line.
' Input_file   > temp_file  && mv  temp_file   Input_file                 ## Write to a temporary file, then replace Input_file with it.
RavinderSingh13
  • it is giving errors: awk: 1: unexpected character ''' awk: 14: unexpected character ''' – amruta Nov 06 '17 at 08:50
  • @amruta, without letting me know which errors, I can't help you. Please let me know the errors? Also a GUESS on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk. – RavinderSingh13 Nov 06 '17 at 08:50
  • I did some minor changes and it worked but there is no addition of R S and T letters – amruta Nov 06 '17 at 09:04
  • But the changes are not done somehow... I am sorry you missed the last part of my text. The script ran but no changes done.... – amruta Nov 06 '17 at 09:20
  • I got the output file but no changes were made.. as R, S, or T. – amruta Nov 06 '17 at 09:33
  • @amruta, did you try my edited version of command? Kindly do let me know on same. – RavinderSingh13 Nov 06 '17 at 09:35
  • Sorry. The code did not work... I did the changes as per your instructions. I am sorry as I am new to scripting... and did not know much. the code gives the same input file...let me know if I am missing something – amruta Nov 06 '17 at 09:58
1

The one-liner below also gives the desired output, provided your real input file has the same layout as the sample you posted.

awk 'BEGIN{split("R,S,T",a,/,/)}f=$2~/^H[0-9]+$/{$2 = $2 a[++c]}!f{c=0}1' infile 

Explanation

  • split("R,S,T",a,/,/) - splits the string "R,S,T" on commas and stores the pieces in array a, so a[1] = R, a[2] = S, a[3] = T.

  • f=$2~/^H[0-9]+$/ - tests the second field against the regexp /^H[0-9]+$/ and stores the result in the variable f: 1 if it matches, 0 otherwise. The assignment also serves as the pattern for the action that follows it.

  • $2 = $2 a[++c] - runs when the test above is true: c is pre-incremented, then a[c] is appended to the second field.

  • !f{c=0} - if f is false, the run of consecutive matches has ended, so the counter c is reset.

  • 1 at the end triggers the default action, which is to print the current record (print $0). To see how this works, try awk '1' infile, which prints every line, whereas awk '0' infile prints nothing. Any non-zero number is true and triggers the default action.
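For readers who find the one-liner dense, the same counter logic can be sketched in Python. This is only an illustration of the steps listed above, not part of the awk command: the function name `add_suffixes` is made up here, and it works on a plain list of second-field values rather than on whole lines. Like the awk version, it appends nothing once a run goes past the third match.

```python
import re

SUFFIXES = ["R", "S", "T"]              # same role as array a in the awk one-liner
H_DIGITS = re.compile(r"^H[0-9]+$")     # H followed only by digits

def add_suffixes(names):
    """Append R/S/T to consecutive H<digits> names (hypothetical helper)."""
    result = []
    c = 0                               # counter of consecutive matches
    for name in names:
        f = bool(H_DIGITS.match(name))  # same test as f=$2~/^H[0-9]+$/
        if f:
            c += 1
            # past the third match there is no suffix left, so append nothing
            suffix = SUFFIXES[c - 1] if c <= len(SUFFIXES) else ""
            result.append(name + suffix)
        else:
            c = 0                       # run broken: reset, like !f{c=0}
            result.append(name)
    return result

print(add_suffixes(["H813", "H813", "C814", "H814", "H814", "H814", "C74", "H74R"]))
# ['H813R', 'H813S', 'C814', 'H814R', 'H814S', 'H814T', 'C74', 'H74R']
```

Note that, as in the awk version, the counter only resets on a non-matching line, so two different H names on adjacent lines would be treated as one run.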

Test Results:

$ cat infile
1ECLI  H813   98   7.529   8.326   9.267
1ECLI  H813   99   7.427   8.470   9.251
1ECLI  C814  100   7.621   8.513   9.263
1ECLI  H814  101   7.607   8.617   9.289
1ECLI  H814  102   7.633   8.489   9.156
1ECLI  H814  103   7.721   8.509   9.305
1ECLI   C74  104   8.164   8.733  10.740
1ECLI  H74R  105   8.247   8.690  10.799

$ awk 'BEGIN{split("R,S,T",a,/,/)}f=$2~/^H[0-9]+$/{$2 = $2 a[++c]}!f{c=0}1' infile
1ECLI H813R 98 7.529 8.326 9.267
1ECLI H813S 99 7.427 8.470 9.251
1ECLI  C814  100   7.621   8.513   9.263
1ECLI H814R 101 7.607 8.617 9.289
1ECLI H814S 102 7.633 8.489 9.156
1ECLI H814T 103 7.721 8.509 9.305
1ECLI   C74  104   8.164   8.733  10.740
1ECLI  H74R  105   8.247   8.690  10.799

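If you want consistent formatting with a tab or some other character as the field separator, use the version below and set the OFS variable accordingly. The extra {$1=$1} forces awk to rebuild every line with OFS, including the lines whose second field was not changed.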

$ awk -v OFS="\t" 'BEGIN{split("R,S,T",a,/,/)}f=$2~/^H[0-9]+$/{$2 = $2 a[++c]}!f{c=0}{$1=$1}1'  infile
1ECLI   H813R   98  7.529   8.326   9.267
1ECLI   H813S   99  7.427   8.470   9.251
1ECLI   C814    100 7.621   8.513   9.263
1ECLI   H814R   101 7.607   8.617   9.289
1ECLI   H814S   102 7.633   8.489   9.156
1ECLI   H814T   103 7.721   8.509   9.305
1ECLI   C74     104 8.164   8.733   10.740
1ECLI   H74R    105 8.247   8.690   10.799
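If the goal is to keep the original fixed-width columns, as in the expected output in the question, one option is to right-justify the new label into the space the old gap and label occupied. Below is a minimal Python sketch of that idea, assuming every data line starts with the first field at column 0. The function name `relabel_keep_columns` is hypothetical, and the counting follows the same reset-on-non-match rule as the awk one-liner.

```python
import re

SUFFIXES = ["R", "S", "T"]
FIELDS = re.compile(r"^(\S+)(\s+)(\S+)")   # first field, gap, second field
H_DIGITS = re.compile(r"^H[0-9]+$")        # H followed only by digits

def relabel_keep_columns(lines):
    """Append R/S/T to consecutive H<digits> labels without shifting columns."""
    out = []
    run = 0                                # length of the current run of matches
    for line in lines:
        m = FIELDS.match(line)
        if m and H_DIGITS.match(m.group(3)):
            run += 1
            suffix = SUFFIXES[run - 1] if run <= len(SUFFIXES) else ""
            label = m.group(3) + suffix
            # width of the gap plus the old label; the new label takes one
            # space from the gap, so everything after it stays in place
            width = len(m.group(2)) + len(m.group(3))
            padded = label.rjust(width) if len(label) < width else " " + label
            out.append(m.group(1) + padded + line[m.end():])
        else:
            run = 0                        # blank or non-matching line ends the run
            out.append(line)
    return out

sample = [
    "1ECLI  H813   98   7.529   8.326   9.267",
    "1ECLI  C814  100   7.621   8.513   9.263",
]
for row in relabel_keep_columns(sample):
    print(row)
```

When the gap is only a single space, the sketch keeps that space rather than fusing the two fields, so such a line grows by one character.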
Akshay Hegde
  • 1
    Thank you. It worked perfectly well. Only one query.. how would I get the tab set up as below.. 1ECLI C814 100 7.621 8.513 9.263 1ECLI H814R 101 7.607 8.617 9.289 instead of 1ECLI C814 100 7.621 8.513 9.263 1ECLI H814R 101 7.607 8.617 9.289 – amruta Nov 07 '17 at 04:53
  • use `-v OFS="\t"`, and `{$1=$1}` awk will modify output field separator – Akshay Hegde Nov 07 '17 at 05:00
  • 1ECLI C814 100 7.621 8.513 9.263 1ECLI H814R 101 7.607 8.617 9.289 instead of 1ECLI C814 100 7.621 8.513 9.263 1ECLI H814R 101 7.607 8.617 9.289 I am not able to show the leading and lagging tab pattern due to some reason.. – amruta Nov 07 '17 at 05:07
  • how did you try ? can you post your command in comment, to make sure OFS is working, you may change `OFS='|'` for testing – Akshay Hegde Nov 07 '17 at 05:46
  • awk -v OFS="\t" 'BEGIN{split("R,S,T",a,/,/)}f=$2~/^H[0-9]+$/{$2 = $2 a[++c]}!f{c=0}{$1=$1}1' test.in > test.out , OFS is working, I checked the way you suggested. – amruta Nov 07 '17 at 05:59
  • @amruta : then its correct only, to make sure do `od -c test.out`, you will see `\t` char – Akshay Hegde Nov 07 '17 at 06:06
0

The code below reads the file into lines, a list of strings with one entry per line of the file.


from collections import defaultdict

with open('filename') as f:
    lines = f.readlines()

# how many times each keyword has been seen so far (whole file, not per run)
cntd = defaultdict(int)
suffix = ['R', 'S', 'T']
newlines = []
for line in lines:
    try:
        kwd = line.split()[1]
    except IndexError:
        # blank or single-field line: copy it through unchanged
        newlines.append(line)
        continue
    if kwd[0] == 'H' and kwd[-1].isdigit():
        # note: raises IndexError if a keyword occurs more than three times
        sfx = suffix[cntd[kwd]]
        idx = line.index(kwd)
        # drop one space before the keyword so the columns stay aligned
        nl = line[:idx - 1] + kwd + sfx + line[idx + len(kwd):]
        # nl = line[:idx + len(kwd)] + sfx + line[idx + len(kwd):] # adjust formatting to your taste
        newlines.append(nl)
        cntd[kwd] += 1
    else:
        newlines.append(line)

with open('filename', 'w') as f:
    f.writelines(newlines)
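The code above counts each keyword over the whole file. If the suffix should instead restart for every run of identical consecutive keywords, `itertools.groupby` is one way to express that. This is a rough sketch rather than a drop-in replacement: the function name `label_runs` is made up for illustration, it takes only the list of keywords, and it leaves a keyword unchanged once a run exceeds three.

```python
import re
from itertools import groupby

SUFFIXES = "RST"
H_DIGITS = re.compile(r"^H[0-9]+$")    # H followed only by digits

def label_runs(names):
    """Suffix each run of identical consecutive H<digits> names with R, S, T."""
    labelled = []
    # groupby yields one group per run of equal adjacent values
    for name, run in groupby(names):
        for position, _ in enumerate(run):
            if H_DIGITS.match(name) and position < len(SUFFIXES):
                labelled.append(name + SUFFIXES[position])
            else:
                labelled.append(name)  # not a target, or run longer than three
    return labelled

print(label_runs(["H813", "H813", "H814", "H814", "H814", "H74R"]))
# ['H813R', 'H813S', 'H814R', 'H814S', 'H814T', 'H74R']
```

Because the grouping is by value, H813 directly followed by H814 starts a new run even without another line in between.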
AGN Gazer
  • I got following error. File "./test.py", line 5, in for line in lines: NameError: name 'lines' is not defined My file name is test.dat and I put the code in test.py... should I change the 'lines' with 'test.dat'?? – amruta Nov 06 '17 at 08:57
  • @amruta Read the first sentence in my answer. You are supposed to read all lines in your file into `lines`. Do you need help with this? – AGN Gazer Nov 06 '17 at 09:00
  • Thanks for reexplaining. but now it is giving me this error, File "./test.py", line 9, in kwd = line.split()[1] IndexError: list index out of range What does that mean? – amruta Nov 06 '17 at 09:15
  • It means that some of the lines in your input file do not have the structure shown in your post. For example, you may have empty lines. I will fix the code to skip those in a minute. – AGN Gazer Nov 06 '17 at 09:18
  • I did not get any output. Output file "test.log" is empty. – amruta Nov 06 '17 at 09:26
  • @amruta So, should I understand that you expect me to write a complete solution for you and you will not even make an effort to understand what this code is doing? Try printing `newlines` if you have a slightest idea of Python. If not - find some good tutorials. – AGN Gazer Nov 06 '17 at 09:33
  • Sorry. If you get offended. I am a new to scripting. also the code you provided did not make any changes as I wanted to have. Anyways, thank you for the information and sharing the code. – amruta Nov 06 '17 at 09:37
  • @amruta _"the code you provided did not make any changes as I wanted to have"_ You did not specify what you *wanted* to have in your original post. Learning how to read from a file or write to a file should be among the first things to learn in scripting. Anyway, I have added writing to a file for your pleasure. – AGN Gazer Nov 06 '17 at 09:47