I am looking for hash functions that can be used to generate batches out of an integer stream. Specifically, I want to map integers xi from a set or stream (say X) to another set of integers or strings (say Y) such that many xi map to one yj, with at most n xi mapped to any single yj. As with hashing, I need to be able to reliably find the y for a given x.

I would also like most yj to have close to n xi mapped to them (to avoid a very sparse mapping from X to Y).

One function I can think of is quotient:

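static final int BATCH_SIZE = 3;

// Maps x to a batch id. floorDiv (rather than /) keeps batches at BATCH_SIZE for negative x too.
public int map(int x) {
  return Math.floorDiv(x, BATCH_SIZE);
}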

For a stream of sequential integers this works fairly well. For example, the stream 1..9 is mapped to:

1 -> 0
2 -> 0
3 -> 1
4 -> 1
5 -> 1
6 -> 2
7 -> 2
8 -> 2
9 -> 3

and so on. However, for non-sequential large integers and a small batch size (my use case), this generates a very sparse mapping: most batches will contain only one element.

Are there any standard ways to generate such a mapping (batching)?

aoak
  • 2
    how about using `modulo` operation as the hash function? – Josnidhin Jul 19 '17 at 06:44
  • 1
    Modulo creates batches of unbounded size but a bounded number of partitions. I want the opposite: bounded batch size, with no restriction on the number of batches. – aoak Jul 19 '17 at 06:52
  • Doesn't work that well for streams, but if you read everything into an array you can sort it and make batches of n indices. – maraca Jul 19 '17 at 17:56
  • That won't work either, because at a later point, given a single `x`, I need to be able to find out what it was mapped to. – aoak Jul 19 '17 at 19:59

1 Answer

There's no way to get it to work under these assumptions.

You either need to know how many items are in the stream and how they are distributed, or you need to relax the requirement of mapping an item to its batch precisely.

Let's say you have items a and b from the stream. Are you going to put them together in the same batch or not? You can't answer this unless you know whether enough further items will arrive to fill two or more batches (if you decide to put them in separate batches).

If you know how many there will be (even approximately), you can take their distribution and build batches based on that. Say you have string hashes (uniformly distributed over 32 bits). If you know you are getting 1M items and you want batches of 100, you need 1,000,000 / 100 = 10,000 batches, so you can cut the hash space into intervals of width 2^32 / 10,000 and use the index of the interval as the batch id (yj). This doesn't guarantee batches of exactly batch_size, but they should be approximately batch_size, with some slightly over and some slightly under. If the distribution is not uniform, things are more difficult, but it can still be done.
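The interval idea might look roughly like the sketch below. It assumes the input is already a uniformly distributed 32-bit hash and that the item count is known up front; `IntervalBatcher`, `expectedItems` and `batchId` are names made up for illustration, not anything from a library.

```java
/** Hypothetical helper: derives a batch id from fixed-width intervals of a 32-bit hash. */
final class IntervalBatcher {
    // Width of one interval in the 2^32 hash space.
    private final long intervalWidth;

    /**
     * @param expectedItems approximate number of items in the stream (assumed known)
     * @param batchSize     target number of items per batch
     */
    IntervalBatcher(long expectedItems, int batchSize) {
        if (expectedItems <= 0 || batchSize <= 0) {
            throw new IllegalArgumentException("expectedItems and batchSize must be positive");
        }
        long batches = Math.max(1L, expectedItems / batchSize);
        this.intervalWidth = Math.max(1L, (1L << 32) / batches);
    }

    /** Maps a hash to its batch id; the same hash always yields the same id. */
    long batchId(int hash) {
        // Reinterpret the signed int as an unsigned value in 0 .. 2^32 - 1.
        long unsigned = Integer.toUnsignedLong(hash);
        return unsigned / intervalWidth;
    }
}
```

With `new IntervalBatcher(1_000_000L, 100)` the interval width is 2^32 / 10,000 rounded down, and `batchId(x)` can be recomputed for any `x` later without storing anything. Batch sizes are only approximately 100, and only if the hashes really are uniform.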

If you relax the ability to map item to batch, then just group items into batches of batch_size as they come out of the stream. You could keep a map from stream item to batch if you have the space.
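A minimal sketch of that second option, assuming the whole item-to-batch map fits in memory and that a repeated item should keep its first batch. `ArrivalOrderBatcher`, `assign` and `lookup` are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: fills batches in arrival order and remembers each assignment. */
final class ArrivalOrderBatcher {
    private final int batchSize;
    private final Map<Integer, Integer> batchOf = new HashMap<>();
    private int distinctSeen = 0; // number of distinct items assigned so far

    ArrivalOrderBatcher(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
    }

    /** Returns the batch id for x, placing x in the current batch the first time it is seen. */
    int assign(int x) {
        Integer existing = batchOf.get(x);
        if (existing != null) {
            return existing; // already batched: keep the original assignment
        }
        int id = distinctSeen / batchSize; // every batchSize new items start a new batch
        batchOf.put(x, id);
        distinctSeen++;
        return id;
    }

    /** Returns the batch id of a previously assigned item, or null if x was never seen. */
    Integer lookup(int x) {
        return batchOf.get(x);
    }
}
```

Every batch except possibly the last holds exactly `batchSize` items, so the cap of n is respected regardless of how the values are distributed. The cost is that finding y for a given x is a stored lookup rather than a pure function of x.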

Sorin