
What is the best way of doing this? It's a 250 GB text file with one word per line.

Input:

123
123
123
456
456
874
875
875
8923
8932
8923

Output wanted:

123
456
874
875
8923
8932

I need to keep one copy of each duplicated line. I DON'T WANT to remove both lines if there are 2 of the SAME LINES; just remove one of them, always keeping one unique line.

What I do now:

$ cat final.txt | sort | uniq > finalnoduplicates.txt

This is running in a screen session. Is it working? I don't know, because when I check the size of the output file, it's 0:

123user@instance-1:~$ ls -l
total 243898460
-rw-rw-r-- 1 123user 249751990933 Sep  3 13:59 final.txt
-rw-rw-r-- 1 123user            0 Sep  3 14:26 finalnoduplicates.txt
123user@instance-1:~$

But when I check htop, the CPU usage of the process running this command is at 100%.

Am I doing something wrong?

  • I'd lose the useless `cat`, as `sort` is perfectly capable of reading files on its own. I'd also suggest you use the `-u` option to eliminate the `uniq`. – Hasturkun Sep 03 '18 at 15:44
  • Possible duplicate of [How get unique lines from a very large file in linux?](https://stackoverflow.com/questions/45357399/how-get-unique-lines-from-a-very-large-file-in-linux) – samabcde Sep 03 '18 at 15:47
  • Then "cat final.txt | sort -u | uniq > finalnoduplicates.txt" ? – Local Host Sep 03 '18 at 15:57
  • Are the lines sorted as your example suggests? – AnFi Sep 03 '18 at 16:47
  • You're probably seeing an empty file because you're looking at it before `sort` is done sorting and nothing's been outputted to it yet. Sorting that much data takes a while. And, yeah, don't use `cat` and `uniq`. No need for either in this; it should be done with a single program. And since your file looks to be all numbers, maybe tell `sort` that so it sorts the file numerically: `sort -o results.txt -nu file.txt` or the like. – Shawn Sep 03 '18 at 18:49
  • What wordlist were you using? – Hashim Aziz Sep 09 '18 at 17:27

2 Answers


You can do this using just sort.

$ sort -u final.txt > finalnoduplicates.txt

You can simplify this further and just have sort do all of it:

$ sort -u final.txt -o finalnoduplicates.txt

Finally, since your input file is purely numerical data, you can tell sort, via the -n switch, to compare numerically, which can further improve the overall performance of this task:

$ sort -nu final.txt -o finalnoduplicates.txt

From sort's man page:
   -n, --numeric-sort
          compare according to string numerical value

   -u, --unique
          with -c, check for strict ordering; without -c, output only the
          first of an equal run

   -o, --output=FILE
          write result to FILE instead of standard output
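
If disk space or run time becomes an issue at this scale, GNU sort can also be told where to put its temporary files, how much memory to use as a buffer, and how many threads to run. For example (the directory, buffer size, and thread count below are only placeholders to adapt to your machine; the temporary directory needs roughly as much free space as the input file):

$ sort -nu -S 8G --parallel=4 -T /mnt/bigdisk/tmp final.txt -o finalnoduplicates.txt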
slm

I found out about this awesome tool called Duplicut. The entire point of the project is to combine the advantages of unique sorting with getting around the memory limits you hit with huge wordlists.

It is pretty simple to install; this is the GitHub link: https://github.com/nil0x42/duplicut
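
If it helps, installation and basic usage look roughly like this (from memory of the project's README, so check the repository for the exact build target and options; the file names are just the ones from the question):

$ git clone https://github.com/nil0x42/duplicut
$ cd duplicut/ && make release
$ ./duplicut final.txt -o finalnoduplicates.txt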