
I have a file (df.txt) with 3045 rows and 900,000 columns, of which 145 rows are repeated, for example:

    1234  1111122233330000000000003333311122222............................
    1235678 00000000000000000000000111111122222............................
    4567  1122222222222222222222223333333333333............................
    3456  111111111111111122222222222222222222............................
    1234 1111122233330000000000003333311122222............................
    1235678 00000000000000000000000111111122222............................
    3423 33333333300000000011111112222222222222............................
    2211 11111111111111111111111111111111111111............................

Thus, the new file (dffinal.txt) should not have repeated values in column 1, like this:

    1234  1111122233330000000000003333311122222............................
    1235678 00000000000000000000000111111122222............................
    4567  1122222222222222222222223333333333333............................
    3456  111111111111111122222222222222222222............................
    3423 33333333300000000011111112222222222222............................
    2211 11111111111111111111111111111111111111............................

I tried with

cat df.txt | sort | uniq > dffinal.txt

but it keeps the same number of rows.

Johanna Ramirez

1 Answer


You can use awk to check for duplicates in column 1.

awk '!a[$1] { a[$1]++; print }' df.txt > dffinal.txt

This remembers the value of the first column in the a array. If that value isn't already in the array, it saves it and prints the line, so only the first instance of each repeated key is printed.
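If you prefer a more compact form, the same logic is often written with a post-increment, which behaves identically to the command above:

awk '!a[$1]++' df.txt > dffinal.txt

Here a[$1]++ evaluates to 0 (false) the first time a key is seen, so the negation is true and the default action, printing the line, runs exactly once per key; on later occurrences the counter is non-zero and the line is skipped.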

Barmar