
I have a wordlist that is 56GB and I would like to remove duplicates. I tried to approach this in Java, but I run out of space on my laptop after 2.5M words. So I'm looking for an (online) program or algorithm that would allow me to remove all duplicates.

Thanks in advance, Sir Troll

edit: What I did in Java was put the words into a TreeSet so they would be ordered and stripped of duplicates.
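For reference, a minimal sketch of roughly what that TreeSet approach looks like (file names are placeholders, not from the question); every distinct word has to fit in the heap at once, which is why it fails long before 56GB of input is exhausted:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.TreeSet;

    public class TreeSetDedup {
        public static void main(String[] args) throws IOException {
            // Every distinct word is held in memory at once, which is what
            // exhausts the heap long before the whole file has been read.
            TreeSet<String> words = new TreeSet<>();
            try (BufferedReader in = new BufferedReader(new FileReader("wordlist.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    words.add(line);
                }
            }
            // The TreeSet iterates in sorted order with duplicates already removed.
            try (PrintWriter out = new PrintWriter("deduped.txt")) {
                for (String w : words) {
                    out.println(w);
                }
            }
        }
    }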

Sir Troll
  • 56GB? Are you sure there are anywhere near that many English words? http://oxforddictionaries.com/page/93 – Mitch Wheat Jun 16 '11 at 09:30
  • "I've tried to approach this in java" - you neglect to give any clues to what you actually did. – Mitch Wheat Jun 16 '11 at 09:31
  • @Sir, why not post your current Java solution and the error message? "I run out of space" is too vague for me. – bpgergo Jun 16 '11 at 09:32
  • @Sir Troll : oh and BTW, that username won't help you. – Mitch Wheat Jun 16 '11 at 09:32
  • I just read it from the file and put it in a TreeSet. But it kinda doesn't matter what I did in Java, because 2.5M words is nothing compared to the size of the list. And the error was running out of heap space (obviously). – Sir Troll Jun 16 '11 at 09:34
  • If there were a reasonable number of different words, you would just read them into a Set, then print all the keys in the Set at the end. But we don't know anything about the "words", so this might not be the right solution. – chrisdowney Jun 16 '11 at 09:35
  • You can insert the words into MongoDB, for example, and then export them to a file. – Grzegorz Gajos Nov 05 '14 at 21:42

4 Answers


Frameworks like MapReduce or Hadoop are perfect for such tasks. You'll need to write your own map and reduce functions, although I'm sure this must have been done before. A quick search on Stack Overflow gave this.
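In case it helps, a rough sketch of what those map and reduce functions could look like with the Hadoop MapReduce API (job setup, class names, and paths below are illustrative assumptions, not from the linked answer): the mapper emits each word as a key, all identical keys meet at one reducer, and the reducer writes each word exactly once.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DedupJob {
        // Emit each word (one per input line) as a key; the value carries no information.
        public static class WordMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(line, NullWritable.get());
            }
        }

        // All occurrences of a word arrive at the same reducer; write the word once.
        public static class WordReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
            protected void reduce(Text word, Iterable<NullWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(word, NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "dedup");
            job.setJarByClass(DedupJob.class);
            job.setMapperClass(WordMapper.class);
            job.setReducerClass(WordReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }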

Kakira
  • I looked in previous questions with the tag "wordlist" but didn't find anything. Thanks for the answer :) I'll check it out – Sir Troll Jun 16 '11 at 09:36

I think the problem here is the huge amount of data. As a first step, I would try to split the data into several files: e.g. make a file per first character, putting words whose first character is 'a' into a.txt, words whose first character is 'b' into b.txt, and so on:

  • a.txt
  • b.txt
  • c.txt
  • ...

Afterwards I would try using default sorting algorithms and check whether they work with the size of the files. After sorting, cleaning out the doubles should be easy.

If the files remain too big, you can also split using more than one character, e.g. (a sketch of the whole approach follows this list):

  • aa.txt
  • ab.txt
  • ac.txt
  • ...
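A minimal sketch of this split-then-dedupe idea, assuming one word per line and using made-up file names; after the split, each bucket should be small enough to sort and deduplicate in memory on its own:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeSet;

    public class SplitDedup {
        public static void main(String[] args) throws IOException {
            // Pass 1: distribute words into bucket files keyed by their first character.
            Map<Character, PrintWriter> buckets = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader("wordlist.txt"))) {
                String word;
                while ((word = in.readLine()) != null) {
                    if (word.isEmpty()) continue;
                    char c = word.charAt(0);
                    PrintWriter out = buckets.get(c);
                    if (out == null) {
                        out = new PrintWriter(new FileWriter("bucket_" + (int) c + ".txt"));
                        buckets.put(c, out);
                    }
                    out.println(word);
                }
            }
            for (PrintWriter out : buckets.values()) out.close();

            // Pass 2: each bucket fits in memory, so a TreeSet sorts it and drops duplicates.
            // The result is duplicate-free, though not globally sorted unless the buckets
            // themselves are processed in character order.
            try (PrintWriter result = new PrintWriter(new FileWriter("deduped.txt"));
                 DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("."), "bucket_*.txt")) {
                for (Path bucket : dir) {
                    TreeSet<String> unique = new TreeSet<>(Files.readAllLines(bucket));
                    for (String w : unique) result.println(w);
                }
            }
        }
    }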
Mathias

I suggest you use a Bloom Filter for this.

For each word, check if it's already present in the filter; otherwise insert it (or rather, insert some good hash value of it).

It should be fairly efficient, and you shouldn't need to give it more than a gigabyte or two for it to have practically no false positives (a Bloom filter never yields false negatives, so real duplicates are always caught; a false positive would merely drop a word that isn't actually a duplicate). I leave it to you to work out the math.
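A sketch of how this could look using Guava's BloomFilter (the library choice, file names, and the sizing of one billion expected words at a 1% false-positive rate are assumptions, not part of the answer); words the filter claims to have seen already are skipped:

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.charset.StandardCharsets;

    public class BloomDedup {
        public static void main(String[] args) throws IOException {
            // Sized for an assumed ~1 billion distinct words at a 1% false-positive rate;
            // that needs on the order of a gigabyte of heap, not the full 56GB of input.
            BloomFilter<CharSequence> seen = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000_000, 0.01);

            try (BufferedReader in = new BufferedReader(new FileReader("wordlist.txt"));
                 PrintWriter out = new PrintWriter("deduped.txt")) {
                String word;
                while ((word = in.readLine()) != null) {
                    if (!seen.mightContain(word)) {   // definitely not seen before
                        seen.put(word);
                        out.println(word);            // a false positive here would drop a unique word
                    }
                }
            }
        }
    }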

aioobe

I do like the divide-and-conquer comments here, but I have to admit: if you're running into trouble with 2.5 million words, something is going wrong with your original approach. Even if we assume each word is unique within those 2.5 million (which basically rules out that we're talking about text in a natural language), and even if each word is on average 100 Unicode characters long, we're at roughly 500MB for storing the unique strings, plus some overhead for the set structure. Meaning: you should be doing fine, since those numbers are already heavily overestimated. Maybe before installing Hadoop, you could try increasing your heap size?
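For example (the class name and heap size here are just placeholders), the JVM's maximum heap can be raised with the -Xmx flag when starting the program:

    java -Xmx4g TreeSetDedup wordlist.txt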

Nicolas78