1

My goal is to create a database of MD5 hashes, and then check another list of hashes against it to see if they exist in the database.

I have a working solution using anydbm, which if you are unfamiliar works exactly like a dictionary in python, but you don't have to load the whole thing in memory at once. Creation of the db is rather slow (100 million entries in roughly 2 and a half hours) and retrieval speed is moderate (100000 entries in 1 to 2 seconds). Checking if the hash exists is as simple as if hash in dbm.

For a performance improvement, I attempted to make a working solution with sqlite. Creation speed is blazing fast, creating all 100 million entries in a few minutes. But retrieval took upwards of 15 seconds for one entry. This is unreasonable!

I'm not a SQL pro, so I'm wondering if I'm just using commands which overly complicate things.

They are as follows:

creation of table: c.execute('''create table keys(id integer not null primary key autoincrement, hash text, alert text) ''')

adding entries(in a loop): c.execute('''insert into keys(hash, alert) values (?,?) ''', (hash, "1"))

retrieval(also looped):

c.execute('''select * from keys where hash = ? ''', (hash,))
hits = c.fetchall() 
numhits += len(hits)
user1442389
  • 83
  • 1
  • 6
  • 2
    sqlite is not for performance. – Slater Victoroff Aug 05 '13 at 14:46
  • 7
    [Index your `hash` column.](http://www.sqlite.org/lang_createindex.html) – Carsten Aug 05 '13 at 14:48
  • 3
    Doing a `select *` from a non-indexed, non-unique field causes the database to scan every record in the queried table, whereas a hash-based lookup with unique keys like a dictionary/`anydbm` key:value store only stores each key once, so only 1 record needs to be found. As @Carsten said, you need to use an index to make your queries faster (and consider adding a unique constraint if possible). – mdscruggs Aug 05 '13 at 15:02
  • The MD5 hashes will not be unique in the source document, but I don't necessarily need to keep every entry. Is there a way I can set it up so that the hashes are unique, but if I try to add a duplicate it simply ignores the op? – user1442389 Aug 05 '13 at 15:04
  • 1
    I haven't used this clause, but it looks like what you need: http://www.sqlite.org/lang_conflict.html ... use the IGNORE statement – mdscruggs Aug 05 '13 at 15:09
  • @SlaterTyranus: as for sqlite performance, I think you should egt some benchmarks and citations. It is not for concurrence or scalability - but single accss performance it does. – jsbueno Aug 05 '13 at 15:53

1 Answers1

4

You have to create an index on your md5 table - no matter if the hashes are not unique - not having an index means a linear access throw all the records at each access.

Looking at the docs, it looks like creating your index is as easy as issuing CREATE INDEX hash ON keys (hash) on your database.

jsbueno
  • 99,910
  • 10
  • 151
  • 209