is perfect hashing without buckets possible?

Question

I've been asked to look for a perfect hash/one way function to be able to hash 10^11 numbers. However as we'll be using a embedded device it wont have the memory to store the relevant buckets so I was wondering if it's possible to have a decent (minimal) perfect hash without them?

The plan is to use the device to hash the number(s) and we use a rainbow table or a file using the hash as the offset.

Cheers

Edit:

I'll try to provide some more info :)

1) 10^11 is actually now 10^10 so that makes it easer.This number is the possible combinations. So we could get a number between 0000000001 and 10000000000 (10^10).

2) The plan is to us it as part of a one way function to make the number secure so we can send it by insecure means. We will then look up the original number at the other end using a rainbow table. The problem is that the source the devices generally have 512k-4Meg of memory to use.

3) it must be perfect - we 100% cannot have a collision .

Edit2:

4) We cant use encryption as we've been told it's not really possable on the devices and keymanigment would be a nightmare if we could.

Edit3:

As this is not sensible, Its purely academic question now (I promise)

is 10^11 your keyspace, or the number of keys actually occupied? — bdonlan, Feb 03 '11 at 11:14
So, you want to use a *one-way* function to make the number *secure* and then **crack** it on the other end? Why not just encrypt the number with a regular encryption algorithm, send it and decrypt it on the other side? — R. Martinho Fernandes, Feb 03 '11 at 14:53
@Martinho Fernandes:Basilcy yes... I've been told we cant do encryption on the devices which is why this solution has been thought up. — Dreaddan, Feb 03 '11 at 15:48
@Dreaddan: But that makes no sense at all! If you're cracking it, you're proving its insecurity! How can you secure data for transmission without using some form of encryption? If you want to convert the data to some secure form and then back, you want a *two-way* function, aka encryption. — R. Martinho Fernandes, Feb 03 '11 at 15:51
If every value is valid, what is the 'security' for? If you have a billion valid commands, then you're just changing what id each command is, if you only have a few numbers which mean something, then you don't have a big table. You also may want to consider replay attacks - if X happens when the sniffer sees value Y on the wire, then they know Y does X. — Pete Kirkham, Feb 03 '11 at 16:23
@Martinho Fernandes:Effectly yes, and no. Yes you can compute it but without knowing the hash fn and a salt which will change at least dayly it makes it more secure. — Dreaddan, Feb 03 '11 at 17:19
@pete: thats a good point.. im going to have to take this back and find out what there thinking is/was.. — Dreaddan, Feb 03 '11 at 17:24
Did you notice that you're effectively engineering a two-way function (the second part being the lookup in the rainbow table) and you're using that salt you change daily as if it was an encryption key? Do you realize that everytime you change the salt, you'll have to recompute the rainbow table? And that you don't need to store *anything* in the source device? All you need in the source is the hashing algorithm. Only the destination needs the rainbow table. — R. Martinho Fernandes, Feb 03 '11 at 17:40
@Martinho Fernandes: yes, yes, yes... not quite. I know the device doent need the table but I was told that we couldnt use bob jenkins' pmh as the device couldnt hold the buckets. Hashing(implementation) is something new to me and has been a huge learning curve.. and no theres no way I'll write one! — Dreaddan, Feb 03 '11 at 17:58
the ped is the name we call the device. My understanding is the buckets are used for genrating the minimal hash while the table will be used as the lookup. I feel some more reading is needed — Dreaddan, Feb 03 '11 at 18:03

bdonlan · Accepted Answer · 2011-02-04T03:31:17.970

Okay, since you've clarified what you're trying to do, I rewrote my answer.

To summarize: Use a real encryption algorithm.

First, let me go over why your hashing system is a bad idea.

What is your hashing system, anyway?

As I understand it, your proposed system is something like this:

Your embedded system (which I will call C) is sending some sort of data with a value space of 10^11. This data needs to be kept confidential in transit to some sort of server (which I will call S).

Your proposal is to send the value hash(salt + data) to S. S will then use a rainbow table to reverse this hash and recover the data. salt is a shared value known to both C and S.

This is an encryption algorithm

An encryption algorithm, when you boil it down, is any algorithm that gives you confidentiality. Since your goal is confidentiality, any algorithm that satisfies your goals is an encryption algorithm, including this one.

This is a very poor encryption algorithm

First, there is an unavoidable chance of collision. Moreover, the set of colliding values differs each day.

Second, decryption is extremely CPU- and memory-intensive even for the legitimate server S. Changing the salt is even more expensive.

Third, although your stated goal is avoiding key management, your salt is a key! You haven't solved key management at all; anyone with the salt will be able to crack the message just as well as you can.

Fourth, it's only usable from C to S. Your embedded system C will not have enough computational resources to reverse hashes, and can only send data.

This isn't any faster than a real encryption algorithm on the embedded device

Most secure hashing algorithm are just as computationally expensive as a reasonable block cipher, if not worse. For example, SHA-1 requires doing the following for each 512-bit block:

Allocate 12 32-bit variables.
Allocate 80 32-bit words for the expanded message
64 times: Perform three array lookups, three 32-bit xors, and a rotate operation
80 times: Perform up to five 32-bit binary operations (some combination of xor, and, or, not, and and depending on the round); then a rotate, array lookup, four adds, another rotate, and several memory loads/stores.
Perform five 32-bit twos-complement adds

There is one chunk per 512-bits of the message, plus a possible extra chunk at the end. This is 1136 binary operations per chunk (not counting memory operations), or about 16 operations per byte.

For comparison, the RC4 encryption algorithm requires four operations (three additions, plus an xor on the message) per byte, plus two array reads and two array writes. It also requires only 258 bytes of working memory, vs a peak of 368 bytes for SHA-1.

Key management is fundamental

With any confidentiality system, you must have some sort of secret. If you have no secrets, then anyone else can implement the same decoding algorithm, and your data is exposed to the world.

So, you have two choices as to where to put the secrecy. One option is to make the encipherpent/decipherment algorithms secret. However, if the code (or binaries) for the algorithm is ever leaked, you lose - it's quite hard to replace such an algorithm.

Thus, secrets are generally made easy to replace - this is what we call a key.

Your proposed usage of hash algorithms would require a salt - this is the only secret in the system and is therefore a key. Whether you like it or not, you will have to manage this key carefully. And it's a lot harder to replace if leaked than other keys - you have to spend many CPU-hours generating a new rainbow table every time it's changed!

What should you do?

Use a real encryption algorithm, and spend some time actually thinking about key management. These issues have been solved before.

First, use a real encryption algorithm. AES has been designed for high performance and low RAM requirements. You could also use a stream cipher like RC4 as I mentioned before - the thing to watch out for with RC4, however, is that you must discard the first 4 kilobytes or so of output from the cipher, or you will be vulnerable to the same attacks that plauged WEP.

Second, think about key management. One option is to simply burn a key into each client, and physically go out and replace it if the client is compromised. This is reasonable if you have easy physical access to all of the clients.

Otherwise, if you don't care about man-in-the-middle attacks, you can simply use Diffie-Hellman key exchange to negotiate a shared key between S and C. If you are concerned about MitMs, then you'll need to start looking at ECDSA or something to authenticate the key obtained from the D-H exchange - beware that when you start going down that road, it's easy to get things wrong, however. I would recommend implementing TLS at that point. It's not beyond the capabilities of an embedded system - indeed, there are a number of embedded commercial (and open source) libraries available already. If you don't implement TLS, then at least have a professional cryptographer look over your algorithm before implementing it.

The plan is to us it as part of a one way function to make the number secure so we can send it by insecure means. We will then look up the original number at the other end using a rainbow table. The problem is that the source the devices generally have 512k-4Meg of memory to use. — Dreaddan, Feb 03 '11 at 14:02
@Dreaddan, updated my answer. In short: A hash is the wrong tool to use for the job. — bdonlan, Feb 03 '11 at 14:54
AES was my first choise but we've been told it's not really possable on the devices and keymanigment would be a nightmare if we could. — Dreaddan, Feb 03 '11 at 15:53
@Dreaddan, there are other options for encryption beyond AES - and many of them aren't any more expensive than SHA1 or other hashes, but what you're trying to do is encryption, pure and simple. Doing silly things with hashes won't buy you any security at all - key management (and the existence of a key) is _fundamental_ to security in cases like this. And there will _always_ be the possibility of a collision. If you're really short on processing power, try RC4, but make sure you drop the first few kilobytes of the stream. — bdonlan, Feb 04 '11 at 01:17
Also, for more in-depth discussion of your issue, it would be helpful if you could open a new question asking what you _really_ want to get done, ie, secure communications - this one will get a bit cluttered at this rate :) — bdonlan, Feb 04 '11 at 01:18
@Dreaddan, updated my answer significantly. Please ask if you need any more clarification :) — bdonlan, Feb 04 '11 at 03:31
@bdonlan cheers - this is kinda what i expected for the implementation. I think the real answer there is that if generating a hash isn't more intensive than using a proper encryption method then we might as well do that. — Dreaddan, Feb 04 '11 at 16:13

score 1 · Answer 2 · answered Feb 03 '11 at 10:33

1

There is obviously no such thing as a "perfect" hash unless you have at least as many hash buckets as inputs; if you don't, then inevitably it will be possible for two of your inputs to share the same hash bucket.

However, it's unlikely you'll be storing all the numbers between 0 and 10^11. So what's the pattern? If there's a pattern, there may be a perfect hash function for your actual data set.

It's really not that important to find a "perfect" hash function anyway, though. Hash tables are very fast. A function with a very low collision rate - and when hashing integers, that means nearly any simple function, like modulus - is fine and you'll get O(1) average performance.

answered Feb 03 '11 at 10:33

apenwarr

10,838
6
47
58

1

+1. Perfect hashing without absolutely no knowledge about the input data is insane. – R. Martinho Fernandes Feb 03 '11 at 11:00
it must be perfect - we 100% cannot have a collision . @Martinho Fernandes: We know the data range, it's a number between 0 and 10000000000 (10^10). – Dreaddan Feb 03 '11 at 14:02
@Dreaddan: you should make that clear in the question: "I have 10^10 numbers" is not the same as "I have numbers between 0 and 10^10". Compare "I have 3 numbers: 42, 23, 17" with "I have numbers between 1 and 3: 1, 2, 3". – R. Martinho Fernandes Feb 03 '11 at 14:37