
To speed up Deep Packet Inspection, we preprocess the rule set by hashing each rule, which partitions the rules into much smaller subsets and makes the per-packet inspection considerably faster.

The hash is taken over the first 17 bits of the 104-bit rule. After preprocessing, whenever a packet arrives we hash its first 17 bits and check it only against the much smaller subset of rules selected by that hash value.

(The algorithm is actually applied twice: after hashing the first 17 bits it hashes the next 16 bits and stores those results as well, but for this question you can assume we perform a single hash over a fixed number of bits.)
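For concreteness, here is a rough sketch of the exact-match scheme (all identifiers are just illustrative, and the "hash" here is simply the raw 17-bit value):

    #include <cstdint>
    #include <vector>

    // Each rule is 104 bits; for the sketch we only keep its first 17 bits
    // (no don't-cares yet) plus an id pointing at the full rule description.
    struct Rule { uint32_t first17; int id; };

    // Preprocessing: 2^17 buckets, one per possible value of the first 17 bits.
    std::vector<std::vector<Rule>> buckets(1u << 17);

    void preprocess(const std::vector<Rule>& rules) {
        for (const Rule& r : rules)
            buckets[r.first17].push_back(r);   // the "hash" is just the raw value
    }

    // Per packet: inspect only the (much smaller) bucket for its first 17 bits.
    const std::vector<Rule>& candidates(uint32_t packetFirst17) {
        return buckets[packetFirst17 & ((1u << 17) - 1)];
    }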

The algorithm is efficient, but we cannot find a way to apply it to entries that contain don't-care bits, of which we have a great many.

We searched for a solution in numerous places and tried, for instance, the common suggestion of duplicating each rule that has don't-care bits. That did not work because of the amount of memory it requires: every don't-care bit among a rule's first 17 bits can be either 0 or 1, so a rule with k such bits expands into 2^k copies, which is exponential in the worst case.

We would very much appreciate any suggestion or insight; even a partial solution would be great.

Note: there is no limit on preprocessing time or additional space, as long as neither is exponential or otherwise impractical.

RoaaGharra

2 Answers


If you use the hash table as a cache and fall back to something slower when there is no entry for the current value, then you don't need to populate it completely. You could either build it ahead of time from an analysis of previous traffic, creating as many entries as you can afford, or populate it dynamically: create a new entry after processing a packet whose value was not found, and remove entries that have not been used for some time to reclaim storage.
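For illustration, a minimal sketch of the dynamic variant, assuming a rule layout with a care-mask over the first 17 bits, a brute-force slow path, and LRU eviction (all of these are placeholders, not part of the answer itself):

    #include <cstdint>
    #include <list>
    #include <unordered_map>
    #include <vector>

    // Illustrative rule layout: the rule's first 17 bits plus a mask saying
    // which of those bits it actually cares about.
    struct Rule { uint32_t first17; uint32_t careMask17; /* remaining fields ... */ };
    std::vector<Rule> allRules;                            // the full, slow-to-scan set

    // Slow path: linear scan that honours don't-care bits via the care mask.
    std::vector<int> slowMatch(uint32_t v) {
        std::vector<int> out;
        for (size_t i = 0; i < allRules.size(); ++i)
            if (((v ^ allRules[i].first17) & allRules[i].careMask17) == 0)
                out.push_back((int)i);
        return out;
    }

    // Cache: 17-bit value -> matching rule ids, with least-recently-used eviction.
    class PrefixCache {
        static constexpr size_t kMaxEntries = 100000;      // tune to available memory
        std::list<uint32_t> lru_;                          // most recent key at the front
        std::unordered_map<uint32_t,
            std::pair<std::vector<int>, std::list<uint32_t>::iterator>> map_;
    public:
        const std::vector<int>& lookup(uint32_t v) {
            auto it = map_.find(v);
            if (it != map_.end()) {                        // hit: refresh recency
                lru_.splice(lru_.begin(), lru_, it->second.second);
                return it->second.first;
            }
            if (map_.size() >= kMaxEntries) {              // full: drop the oldest entry
                map_.erase(lru_.back());
                lru_.pop_back();
            }
            lru_.push_front(v);                            // miss: take the slow path once
            return map_.emplace(v, std::make_pair(slowMatch(v), lru_.begin()))
                       .first->second.first;
        }
    };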

mcdowella

2^17 is 131,072: large, but not inordinately so. If you use a bit of indirection (for example, storing your rules in an array without duplication, and then building a size-2^17 table of indices into that array), then you should be able to do this in well under 1 MB.
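A minimal sketch of that indirection, assuming for simplicity that each 17-bit value maps to at most one rule (the comments below deal with values that map to several rules):

    #include <cstdint>
    #include <vector>

    struct Rule { /* the full 104-bit pattern, the action to take, ... */ };

    const uint32_t kNoRule = 0xFFFFFFFFu;

    std::vector<Rule> rules;                      // each rule stored exactly once

    // 2^17 slots of 4 bytes each = 512 KiB, regardless of how big a Rule is.
    std::vector<uint32_t> table(1u << 17, kNoRule);

    void addEntry(uint32_t value17, uint32_t ruleIndex) {
        table[value17] = ruleIndex;               // store an index, not a copy of the rule
    }

    const Rule* lookup(uint32_t value17) {
        uint32_t idx = table[value17];
        return idx == kNoRule ? nullptr : &rules[idx];
    }

A rule with don't-care bits among its first 17 would still occupy several slots, but each extra slot costs a 4-byte index rather than another copy of the full 104-bit rule.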

ruakh
  • Agreed. However, 17 is only the number of bits hashed PER RULE. There are approximately 500k rules, and after hashing the first 17 bits of each rule we repeat the operation on its next 16 bits, which makes this impractical (~500k*2^17 + 500k*2^16). – RoaaGharra Jan 15 '17 at 19:41
  • @RoaaGharra Could you just build a single hash table for the first 17+16 bits of the data? This would require a few gigabytes? – Peter de Rivaz Jan 15 '17 at 19:52
  • @RoaaGharra: If I'm picturing this correctly, then I think the question is not how many distinct rules you have, but how many distinct *sets* of rules you have. You then store the sets-of-rules in their own array that can be indexed into (see the sketch after these comments). – ruakh Jan 15 '17 at 19:53
  • @ruakh Exactly. The problem is that there are a lot of rules with don't-care bits, and after duplicating them, a single rule can end up as an entry under almost every hash key. With 2^17 keys, that gives a 500K*2^17 memory requirement. – RoaaGharra Jan 17 '17 at 10:14
  • @PeterdeRivaz I can, but the problem is the number of entries in each row of that table. Please see my comment above. – RoaaGharra Jan 17 '17 at 10:16
  • @RoaaGharra: The point of my answer is that you don't need to duplicate entire rules. – ruakh Jan 17 '17 at 15:32
  • @RoaaGharra: Let's make this less abstract. Can you upload your ruleset somewhere? (Not necessarily the bodies of rules, but just the descriptors of which bits they care about?) – ruakh Jan 17 '17 at 15:53
  • @ruakh Oh. I kind of see your point now. Let me consider all aspects and get back to you. But seriously great idea. Thank you. – RoaaGharra Jan 17 '17 at 16:16
  • @ruakh Okay, so it would have worked if not for the fact that a single 17-bit row in the table can point to multiple rules. Since the 17 bits are only part of the whole 104-bit rule, more than one rule can share those specific 17 bits, so in the worst case we are back to something like 500k*2^17 time/memory. – RoaaGharra Jan 17 '17 at 16:57
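
To make ruakh's point about distinct *sets* of rules concrete, here is a hedged sketch; the Rule layout, the careMask17 field and the brute-force build loop are assumptions for illustration only. Each 17-bit value stores a 4-byte index into an array of deduplicated rule-id sets, so memory is 2^17 indices plus one copy of each distinct set, rather than 500k entries per slot.

    #include <cstdint>
    #include <map>
    #include <vector>

    // Illustrative rule layout: the rule's first 17 bits plus a mask saying
    // which of those bits it actually cares about.
    struct Rule { uint32_t first17; uint32_t careMask17; /* remaining fields ... */ };

    // Does the 17-bit value v match rule r, treating bits outside careMask17
    // as don't-cares?
    bool matches(uint32_t v, const Rule& r) {
        return ((v ^ r.first17) & r.careMask17) == 0;
    }

    // Build a 2^17-entry table of indices into an array of *distinct* rule-id
    // sets, so identical sets are stored only once.
    void build(const std::vector<Rule>& rules,
               std::vector<uint32_t>& table,                  // 2^17 entries, 4 bytes each
               std::vector<std::vector<uint32_t>>& ruleSets)  // one entry per distinct set
    {
        table.assign(1u << 17, 0);
        ruleSets.clear();
        std::map<std::vector<uint32_t>, uint32_t> seen;       // set -> its index in ruleSets

        for (uint32_t v = 0; v < (1u << 17); ++v) {
            std::vector<uint32_t> set;
            for (uint32_t i = 0; i < rules.size(); ++i)
                if (matches(v, rules[i]))
                    set.push_back(i);

            auto it = seen.find(set);
            if (it == seen.end()) {                            // first time we see this set
                it = seen.emplace(set, (uint32_t)ruleSets.size()).first;
                ruleSets.push_back(set);
            }
            table[v] = it->second;                             // sets are shared, not copied
        }
    }

The build loop above is O(2^17 * number of rules), which the question's note on preprocessing time suggests may be acceptable; smarter builds are possible, but the point here is only the memory layout.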