
What is the best way of doing this? It's a 250 GB text file with one word per line.

Input:

123
123
123
456
456
874
875
875
8923
8932
8923

Output wanted:

123
456
874
875
8923
8932

I need to keep one copy of each duplicated line. I DON'T WANT to remove both lines if there are 2 of the SAME LINES; just remove one of them, always keeping one unique line.

What I do now:

$ cat final.txt | sort | uniq > finalnoduplicates.txt

This is running in a screen session. Is it working? I don't know, because when I check the size of the output file, it's 0:

123user@instance-1:~$ ls -l
total 243898460
-rw-rw-r-- 1 123user 249751990933 Sep  3 13:59 final.txt
-rw-rw-r-- 1 123user            0 Sep  3 14:26 finalnoduplicates.txt
123user@instance-1:~$

But when I check htop, the CPU usage of the process running this command is at 100%.

Am I doing something wrong?

  • I'd lose the useless `cat`, as `sort` is perfectly capable of reading files on its own. I'd also suggest you use the `-u` option to eliminate the `uniq`. – Hasturkun Sep 03 '18 at 15:44
  • Possible duplicate of [How get unique lines from a very large file in linux?](https://stackoverflow.com/questions/45357399/how-get-unique-lines-from-a-very-large-file-in-linux) – samabcde Sep 03 '18 at 15:47
  • Then "cat final.txt | sort -u | uniq > finalnoduplicates.txt" ? – Local Host Sep 03 '18 at 15:57
  • Are the lines sorted as your example suggests? – AnFi Sep 03 '18 at 16:47
  • You're probably seeing an empty file because you're looking at it before `sort` is done sorting and nothing's been outputted to it yet. Sorting that much data takes a while. And, yeah, don't use `cat` and `uniq`. No need for either in this; it should be done with a single program. And since your file looks to be all numbers, maybe tell `sort` that so it sorts the file numerically: `sort -o results.txt -nu file.txt` or the like. – Shawn Sep 03 '18 at 18:49
  • What wordlist were you using? – Hashim Aziz Sep 09 '18 at 17:27

2 Answers


You can do this using just sort.

$ sort -u final.txt > finalnoduplicates.txt

You can simplify this further and just have sort do all of it:

$ sort -u final.txt -o finalnoduplicates.txt

Finally, since your input file is purely numerical data, you can tell sort, via the -n switch, to compare numerically, which can further improve the overall performance of this task:

$ sort -nu final.txt -o finalnoduplicates.txt

From sort's man page:
   -n, --numeric-sort
          compare according to string numerical value

   -u, --unique
          with -c, check for strict ordering; without -c, output only the
          first of an equal run

   -o, --output=FILE
          write result to FILE instead of standard output
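
If disk space or run time becomes an issue at this scale, GNU sort can also be told where to put its temporary files, how much memory to use as a buffer, and how many threads to run. For example (the directory, buffer size, and thread count below are only placeholders to adapt to your machine; the temporary directory needs roughly as much free space as the input file):

$ sort -nu -S 8G --parallel=4 -T /mnt/bigdisk/tmp final.txt -o finalnoduplicates.txt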
slm

I found out about this awesome tool called Duplicut. The entire point of the project is to combine the advantages of unique sorting with getting around the memory limits you hit with huge wordlists.

It is pretty simple to install; this is the GitHub link: https://github.com/nil0x42/duplicut
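
If it helps, installation and basic usage look roughly like this (from memory of the project's README, so check the repository for the exact build target and options; the file names are just the ones from the question):

$ git clone https://github.com/nil0x42/duplicut
$ cd duplicut/ && make release
$ ./duplicut final.txt -o finalnoduplicates.txt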