
I have a file with content in this format:

1  6  8
1  6  9
1  12 20
1  6
2  8
2  9
2  12
2  20
2  35

I want to delete every line in which a number from the 2nd or 3rd column (but not from the 1st) also appears in another line, whether in its 2nd or 3rd column, including the line where the number first occurs.

I should get this as output:

2 35

I've tried using:

awk '{for(i=2;i<=NF;i++){if($i in a){next};a[$i]}} 1' 

but it doesn't seem to work.

What is wrong?

inourss
  • OP should explain "found in the **next** lines". Does it mean "following lines"? If true, only the first 3 lines in your example should be removed. – Kent Aug 30 '18 at 09:17
  • In what way is your example not working, and what output do you get? – kvantour Aug 30 '18 at 09:17
  • For example: the first line contains 6 and 8, and these numbers are also found in lines 2, 4 and 5. Thus, lines 1, 2, 4 and 5 should be removed, etc. In this case only the last line should remain, i.e. (2, 35). – inourss Aug 30 '18 at 09:20

5 Answers


A one-pass awk that stores the records in r[NR] and keeps another array a[$i] recording the record in which each value in fields $2..$NF was first seen.

awk ' {
    for(i=2;i<=NF;i++)       # iterate fields starting from the second
        if($i in a) {        # if field value was seen before
            delete r[a[$i]]  # delete the record that first used it
            a[$i]=""         # clear the stored record number
            f=1              # flag this record as a duplicate
        } else               # if it was not seen before
            a[$i]=NR         # remember the record number
    if(f!=1)                 # if no field was a duplicate
        r[NR]=$0             # store the record under its record number
    else                     # if one was
        f=""                 # reset the flag
}
END {
    for(i=1;i<=NR;++i)       # iterate in input order
        if(i in r)
            print r[i]       # output remaining records
}' file

Output:

2  35
James Brown
  • Out of curiosity, if your data is big enough and you test all of these solutions, please let me know if one or two-pass solution was faster. – James Brown Aug 30 '18 at 09:43
  • Very nice solution. I like the clearing of the buffer to reduce memory. However, take into account that `for(i in r)` will iterate in an unspecified order, so you might not keep the order intact. You might want to write `for(i=1;i<=NR;++i) if(i in r) print r[i]` – kvantour Aug 30 '18 at 09:51
  • I'm a bit puzzled about the flag. what is its use? It looks like the `else` statement of the `if($i in a)` is already doing all the work. Or am I missing something? – kvantour Aug 31 '18 at 09:17

The simplest way is a double-pass algorithm where you read your file twice.

The idea is to store all values in an array a and count how many times each appears. If a value appears 2 or more times, it has been found in more than a single entry and you should not print the line.

awk '(NR==FNR){a[$2]++; if(NF>2) a[$3]++; next} 
     (NF==2) && (a[$2]==1);
     (NF==3) && (a[$2]==1 && a[$3]==1)' <file> <file>

In practice, you should avoid expressions such as a[var]==1 when you are not sure whether var is already in the array, as merely referencing a[var] creates that array element. However, since we never increase the counts afterwards, it is fine to proceed here.
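
For illustration, here is a minimal demonstration of that side effect (the keys "x" and "y" are arbitrary examples, not part of the solution):

awk 'BEGIN {
    if (a["x"] == 1) { }    # merely comparing a["x"] creates the key "x"
    print ("x" in a)        # prints 1
    if ("y" in a) { }       # the "in" test alone does not create a key
    print ("y" in a)        # prints 0
}'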

If you want to achieve the same thing with more than three fields, you can do:

awk '(NR==FNR){for(i=2;i<=NF;++i) a[$i]++; next }
     {for(i=2;i<=NF;++i) if(a[$i]>1) next }
     {print}' <file> <file>
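
For instance, run against the sample data from the question (assuming it is saved as file):

$ awk '(NR==FNR){for(i=2;i<=NF;++i) a[$i]++; next }
       {for(i=2;i<=NF;++i) if(a[$i]>1) next }
       {print}' file file
2  35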

While both of these solutions read the file twice, you can also store the full file in memory and read it only once. This, however, is exactly the same algorithm:

awk '{for(i=2;i<=NF;++i) a[$i]++; b[NR]=$0}
     END{ for(j=1;j<=NR;++j) {
            $0=b[j]; keep=1
            for(i=2;i<=NF;++i) if(a[$i]>1) keep=0  # any repeated value disqualifies the line
            if(keep) print $0
          }
        }' <file>

Comment: this single-pass solution is very simple but stores the full file in memory. The solution of James Brown is very clever: it removes entries from memory as soon as they are no longer needed. A somewhat shorter version of the same idea is:

awk '{ ok=1
       for(i=2;i<=NF;++i)
         if($i in a) { delete b[a[$i]]; ok=0 }  # dup: drop the first record, mark this one
         else a[$i]=NR                          # new value: remember its record number
       if(ok) b[NR]=$0 }                        # keep the record only if it had no dup
     END { for(n=1;n<=NR;++n) if(n in b) print b[n] }' <file>

Note: you should never strive for the shortest solution, but for the most readable one!

kvantour

Could you please try the following.

awk '
FNR==NR{
  for(i=2;i<=NF;i++){
    a[$i]++
  }
  next
}
(NF==2 && a[$2]==1) || (NF==3 && a[$2]==1 && a[$3]==1)
'  Input_file  Input_file

Output will be as follows.

2  35
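
For readers unfamiliar with the FNR==NR idiom used above: NR counts records across all input files while FNR resets for each file, so the condition is true only while the first copy of Input_file is being read. A tiny illustration (the two-line files f1 and f2 here are hypothetical):

$ printf 'a\nb\n' > f1; printf 'c\nd\n' > f2
$ awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' f1 f2
f1 NR=1 FNR=1
f1 NR=2 FNR=2
f2 NR=3 FNR=1
f2 NR=4 FNR=2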
RavinderSingh13
  • Maybe I'm doing something wrong but it gave me nothing as output. – inourss Aug 30 '18 at 09:26
  • @inourss, first, I am reading Input_file twice, so make sure you are copying the code correctly. If your Input_file is the same as the shown sample and `35` does not occur in it more than once, then this should work. If not, check whether your Input_file contains control-M characters by running `cat -v Input_file` and let me know. – RavinderSingh13 Aug 30 '18 at 09:28
  • @kvantour, please elaborate on why it is incorrect; I would be grateful. – RavinderSingh13 Aug 30 '18 at 09:36
  • @RavinderSingh13 It worked for me. I forgot to put the input_file twice. – inourss Aug 30 '18 at 09:38
  • If `a[$2]==1` it will print the line and never test whether `a[$3]==1`. You should, in theory, have an `&&` instead of `||`, but since the number of fields varies, that would fail too. – kvantour Aug 30 '18 at 09:41
  • @kvantour, sure, changed the code now, thanks for letting me know. – RavinderSingh13 Aug 30 '18 at 09:56
$ cat tst.awk
NR==FNR {
    cnt[$2]++
    cnt[$3]++
    next
}
cnt[$2]<2 && cnt[$NF]<2

$ awk -f tst.awk file file
2  35
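
Why does cnt[$NF] handle both 2- and 3-field lines? On a 2-field line, $3 is empty (so cnt[$3]++ merely counts the empty string) and $NF is the same field as $2. A quick illustration with a hypothetical one-line input:

$ echo '1  6' | awk '{ print "NF=" NF, "$2=" $2, "$3=[" $3 "]", "$NF=" $NF }'
NF=2 $2=6 $3=[] $NF=6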
Ed Morton
  • This is so clever as it uses two nifty ideas! 1. Always add the third field, no matter what; so if there are only 2 fields, increase the counter of the empty field! 2. Use `cnt[$NF]` instead of `cnt[$3]`; this ensures that if you only have 2 fields, you just test field `$2` twice! (If I could, I would vote this up 10 times) – kvantour Aug 30 '18 at 15:21
  • Thanks. IMHO `NF` is an under-used resource. So many problems can be solved easily by using `NF` or `$NF` instead of some other approach, e.g. https://stackoverflow.com/a/52089084/1745001. – Ed Morton Aug 30 '18 at 15:31

This might work for you (GNU sed):

sed -r 'H;s/^[0-9]+ +//;G;s/\n(.*\n)/\1/;h;$!d;s/^([^\n]*)\n(.*)/\2\n  \1/;:a;/^[0-9]+ +([0-9]+)\n(.*\n)*[^\n]*\1[^\n]*\1[^\n]*$/bb;/^[0-9]+ +[0-9]+ +([0-9]+)\n(.*\n)*[^\n]*\1[^\n]*\1[^\n]*$/bb;/\n/P;:b;s/^[^\n]*\n//;ta;d' file

This is not a serious solution; however, it demonstrates what can be achieved using only matching and substitution.

The solution makes a copy of the original file and, while doing so, accumulates all numbers from the second and possibly third fields of each record in a separate line, which it maintains at the head of the copy.

At the end of the file, the first line of the copy contains all the pertinent keys, and if there are duplicate keys, any line in the file that contains such a key is deleted. This is achieved by moving the keys (the first line) to the end of the file and matching the second (and possibly third) field of each record against those keys.
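
As a toy sketch of the hold-space accumulation idea this relies on (not potong's actual script), the following appends every line to the hold space and prints the accumulated result at end-of-file:

$ printf '1 6 8\n1 6 9\n2 35\n' | sed -n 'H; ${x; s/\n/ /g; s/^ //; p}'
1 6 8 1 6 9 2 35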

potong