I have a batch of ids that I want to partition with a good linear spread function.
The ids do not contain a timestamp and are really badly spread. I'm limited to a few basic XPath operators.

Could you please propose a better function to spread the ids across 10 buckets?

public static void main(String[] args) {
    int[] buckets = new int[10];
    for (int i = 0; i < 10; i++)
        buckets[i] = 0;

    for (int i = 0; i < 1000; i++) {
        String id = format("130770%s0020", i);
        long l = parseLong(id);
        int partition = (int) f(l);
        buckets[partition] = buckets[partition] + 1;
    }

    for (int i = 0; i < 10; i++)
        out.println(buckets[i]);
}

Currently my best result is

private static long f(long l) {
    return l % 31 % 10;
}

which gives

130 96 97 96 97 97 96 98 97 96

Can you propose a better implementation?


This is what the code I'm editing looks like:

<rule id="trade_to_backet_4">
    <forall var="trade_identifier" in="/eMxML/msml/trade/systemReference[1]/@reference">
        <equal op1="translate($trade_identifier, translate($trade_identifier,'0123456789',''), '') mod 813 mod 10" op2="4"/>
    </forall>
</rule>
Mike
  • All java objects have a hashCode method in them. Use it. Most IDEs, certainly Eclipse, have functionality to generate a hashcode method for you. – Ryan Mar 05 '20 at 21:53
  • I don't have hashCode in xpath, and I don't have shift. I'm using legacy app and I have filter which accepts xpath to make a partitioning – Mike Mar 05 '20 at 21:57
  • @WJS or people do understand the complexities, and think that simply saying "my current solution is insufficient, make it arbitrarily better" isn't a well-posed question. – Andy Turner Mar 05 '20 at 22:02
  • Is that really a reason for a downvote? I provided a working example and explained what is eligible to be used. I skipped all unrelated explanation – Mike Mar 05 '20 at 22:02
  • I think you have confused everyone by mixing Java and XPath in this way. You've said you want to use XPath operators, but these don't have any direct equivalent in Java (mod in XPath isn't the same as % in Java, for example). – Michael Kay Mar 06 '20 at 08:29
  • What does this have to do with XPath? Can you please explain what your actual goal is? – Mathias Müller Mar 06 '20 at 09:51
  • I need to implement this logic inside a filter which is based on xpath syntax – Mike Mar 06 '20 at 11:16
  • @MichaelKay, why are you saying that mod in xpath is not the same as % in java? – Mike Mar 06 '20 at 12:11
  • @MykhayloAdamovych The mod operator in XPath and the % operator in Java handle negative numbers differently. I forget the detail, you'll have to look it up. – Michael Kay Mar 06 '20 at 18:23

4 Answers

I would recommend picking the same solution that has been applied to the HashMap class.

/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower.  Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.)  So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

For your code that would mean:

return (l ^ (l >>> 16)) % 10;

With your test data, that produces a spread of:

109 102 103 94 91 95 93 100 104 109

From comment:

I don't have shift

For non-negative l, the expression l >>> 16 can also be written l / 65536 (65536 is 2^16). Division is a lot slower than bit-shifting, which is why you'd usually use l >>> 16.
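A quick way to convince yourself of that equivalence, as a minimal sketch: the class name and sample values below are illustrative only, and the check assumes ids are non-negative, since the two expressions diverge for negative input.

```java
class ShiftVsDivide {
    // For non-negative l, an unsigned shift by 16 bits equals integer division by 2^16.
    static boolean sameForValue(long l) {
        return (l >>> 16) == (l / 65536);
    }

    public static void main(String[] args) {
        long[] samples = {0L, 65535L, 65536L, 13077000020L, 1307709990020L};
        for (long l : samples) {
            System.out.println(l + " -> " + (l >>> 16) + " vs " + (l / 65536));
        }
        // Negative input: the unsigned shift yields a large positive number,
        // while -1 / 65536 is 0, so the two expressions disagree.
        System.out.println(sameForValue(-1L));
    }
}
```

In XPath 1.0, where `div` is floating-point division, the counterpart would presumably need an explicit `floor(...)` around the division.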


UPDATE From another comment:

I don't have XOR operator

Use + instead of ^. Although not as good, it seems good enough here:

return (l + (l / 65536)) % 10;

Resulting spread:

101 92 92 99 105 104 105 99 97 106
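To try both variants against the test ids from the question, a small harness along these lines can be used. It is only a sketch: the class and method names are made up for illustration, and it assumes every id is non-negative so that `% 10` stays in the range 0 to 9.

```java
import java.util.Arrays;
import java.util.function.LongUnaryOperator;

class SpreadDemo {
    // Counts how many of the 1000 test ids from the question land in each of 10 buckets.
    static int[] countBuckets(LongUnaryOperator f) {
        int[] buckets = new int[10];
        for (int i = 0; i < 1000; i++) {
            long l = Long.parseLong(String.format("130770%s0020", i));
            buckets[(int) f.applyAsLong(l)]++;
        }
        return buckets;
    }

    public static void main(String[] args) {
        // XOR with the shifted value, in the style of HashMap.hash
        System.out.println(Arrays.toString(countBuckets(l -> (l ^ (l >>> 16)) % 10)));
        // Addition and division only, for environments without shift or XOR
        System.out.println(Arrays.toString(countBuckets(l -> (l + (l / 65536)) % 10)));
    }
}
```

Passing the function as a parameter makes it easy to plug in other candidates and compare their spreads on the same ids.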
Andreas

If your target is to get things equally distributed amongst the buckets, this seems to work:

return ((l / 10000) % 1000) % 10;

(This is simply extracting the i back out from the number)

Ideone demo.

Output:

100 100 100 100 100 100 100 100 100 100 
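A sketch that checks why this comes out perfectly even for the test data: dividing by 10000 drops the fixed "0020" suffix, so the last remaining digit is the last digit of i, and the bucket is always i % 10. The class name is illustrative, and the check relies on the ids having exactly the 130770%s0020 shape.

```java
import java.util.Arrays;

class LastDigitOfCounter {
    static long bucket(long l) {
        return ((l / 10000) % 1000) % 10;
    }

    public static void main(String[] args) {
        int[] buckets = new int[10];
        boolean alwaysLastDigit = true;
        for (int i = 0; i < 1000; i++) {
            long l = Long.parseLong(String.format("130770%s0020", i));
            long b = bucket(l);
            // The bucket should equal the last digit of the counter i.
            if (b != i % 10) {
                alwaysLastDigit = false;
            }
            buckets[(int) b]++;
        }
        System.out.println(alwaysLastDigit);          // prints true
        System.out.println(Arrays.toString(buckets)); // 100 in every bucket
    }
}
```

This also shows the limitation: the formula depends on knowing where the varying digits sit, so ids with a different layout would need a different divisor.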

An alternative which seems to give the same result:

// NB: abs(int) isn't always non-negative. Should really handle Integer.MIN_VALUE.
return Math.abs(Long.toString(l).hashCode()) % 10;

Ideone demo

Output:

100 100 100 100 100 100 100 100 100 100 
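One way to sidestep the Integer.MIN_VALUE caveat mentioned in the code comment is `Math.floorMod`, which always returns a non-negative result for a positive divisor. This is a sketch with illustrative names; note that for negative hash codes it assigns different buckets than the `Math.abs` version, so the two are not interchangeable once ids have been routed.

```java
class StringHashBucket {
    // Math.floorMod returns a value in [0, buckets) for a positive divisor,
    // so no Math.abs call and no Integer.MIN_VALUE special case are needed.
    static int bucketOf(int hash, int buckets) {
        return Math.floorMod(hash, buckets);
    }

    static int bucketOfId(long id) {
        return bucketOf(Long.toString(id).hashCode(), 10);
    }

    public static void main(String[] args) {
        System.out.println(bucketOfId(1307704200020L));
        // Math.abs(Integer.MIN_VALUE) is still negative, so this prints -8.
        System.out.println(Math.abs(Integer.MIN_VALUE) % 10);
        // floorMod stays in range and prints 2.
        System.out.println(bucketOf(Integer.MIN_VALUE, 10));
    }
}
```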
Andy Turner

Are you looking for a solution that works well with your particular batch of ids, or with any batch of ids, or with batches that have some particular characteristics (like all being in the form 130770%s0020)?

I think that solutions using integer arithmetic alone are always going to perform badly in some worst-case scenarios, e.g. where all the IDs are multiples of 31. You really need to do some bit-twiddling, which can't be implemented efficiently in XPath 1.0.

Having said that, I think I would try the following: choose three prime numbers P, Q, and R, and return (N mod P + N mod Q + N mod R) mod 10.
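A minimal Java sketch of that idea, for experimenting before porting it to XPath. The primes 31, 37 and 41 are arbitrary examples chosen here, not values recommended above, and the class name is hypothetical; ids are assumed to be non-negative.

```java
import java.util.Arrays;

class ThreePrimeSpread {
    // P, Q and R are arbitrary example primes (an assumption of this sketch).
    static final long P = 31, Q = 37, R = 41;

    static int bucket(long n) {
        return (int) ((n % P + n % Q + n % R) % 10);
    }

    public static void main(String[] args) {
        int[] buckets = new int[10];
        for (int i = 0; i < 1000; i++) {
            long n = Long.parseLong(String.format("130770%s0020", i));
            buckets[bucket(n)]++;
        }
        System.out.println(Arrays.toString(buckets));
    }
}
```

Only `mod` and `+` are used, so the expression should translate directly into an XPath 1.0 filter of the form `($n mod 31 + $n mod 37 + $n mod 41) mod 10`.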

It's also worth remembering that a perfect algorithm will not deliver the same number of items in each bucket; rather the result will at best reflect a random distribution, that is, it will be binomial. And you need to do some fairly smart testing on a large set of inputs to see whether you've achieved that.
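To put a number on that: with n items and k equally likely buckets, each bucket count follows a binomial distribution with p = 1/k, so its standard deviation is sqrt(n * p * (1 - p)). The sketch below (illustrative names) evaluates this for the 1000 ids and 10 buckets used in this thread.

```java
class BinomialSpread {
    // Standard deviation of a single bucket's count under a uniform random assignment.
    static double expectedStdDev(int items, int buckets) {
        double p = 1.0 / buckets;
        return Math.sqrt(items * p * (1 - p));
    }

    public static void main(String[] args) {
        // 1000 items, 10 buckets: mean 100 per bucket, standard deviation about 9.49
        System.out.printf("%.2f%n", expectedStdDev(1000, 10));
    }
}
```

By that yardstick, bucket counts anywhere in the low 90s to high 100s are what a genuinely random assignment would be expected to produce, so a spread that is much tighter than this mostly reflects the structure of the test data.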

I'm inclined at this stage to take a step back and ask: what are you actually doing that requires this hash function? Is there a different way of solving the problem that doesn't require this hash function? Can you tell us the real problem you are trying to solve?

Michael Kay
  • Thanks for your response, Michael! I need a general hash function. I'm working with a closed-source legacy application (I can't modify the existing implementation). The application is built for multiple clients, so it has some support for configuration. Configuration includes route configuration in addition to property files. One can build one's own routes and add custom logic using a proprietary application DSL. It is not JavaScript, MVEL, or Groovy; it is a truly custom DSL implemented with ANTLR, with syntax close to Google Apps Script functions. – Mike Mar 06 '20 at 11:02
  • I'm able to connect to this Java application with jvisualvm and do performance analysis and so on. So I want to parallelize slow IO across identical parallel threads, which is possible by using several identical routes of this legacy application (a matter of configuration). But I don't want to break message ordering, which is based on trade id. This can be achieved with proprietary filters that make a message be processed by a specific 'route'. Filters in this application are XPath filters. – Mike Mar 06 '20 at 11:07
  • So I was able to [extract the numbers](https://stackoverflow.com/a/7338084/448078) from the trade id, and I'm looking for a way to implement a good linear spread algorithm to partition messages and load my parallel routes evenly (each route is a Java thread under the hood) – Mike Mar 06 '20 at 11:10

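813 turns out to be a really outstanding modulus. Each row below is the bucket distribution for one of the id patterns tested by the code that follows: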

101 101 101 100 100 99 99 99 100 100
101 101 101 100 100 100 100 99 99 99
101 100 100 99 100 100 100 100 100 100
101 101 101 100 100 100 99 99 99 100
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99
101 101 101 100 100 100 100 99 99 99

private static final int GROUP_SIZE = 1000;
private static final int BUCKET_SIZE = 10;
private static final double MAX_DEVIATION = BUCKET_SIZE * 1.0;
private static final int NUMBER_TO_TEST = 813;

public static void main(String[] args) {
    List<Long> list = LongStream.range(1, 1000).boxed().parallel()
            .filter(l -> filter("005001307700020%s", l))
            .filter(l -> filter("0050013077%s00020", l))
            .filter(l -> filter("00500%s1307700020", l))
            .filter(l -> filter("%s005001307700020", l))
            .filter(l -> filter("111111111111111%s", l))
            .filter(l -> filter("1111111111%s11111", l))
            .filter(l -> filter("11111%s1111111111", l))
            .filter(l -> filter("%s111111111111111", l))
            .filter(l -> filter("222222222222222%s", l))
            .filter(l -> filter("2222222222%s22222", l))
            .filter(l -> filter("22222%s2222222222", l))
            .filter(l -> filter("%s222222222222222", l))
            .filter(l -> filter("333333333333333%s", l))
            .filter(l -> filter("3333333333%s33333", l))
            .filter(l -> filter("33333%s3333333333", l))
            .filter(l -> filter("%s333333333333333", l))
            .filter(l -> filter("444444444444444%s", l))
            .filter(l -> filter("4444444444%s44444", l))
            .filter(l -> filter("44444%s4444444444", l))
            .filter(l -> filter("%s444444444444444", l))
            .filter(l -> filter("555555555555555%s", l))
            .filter(l -> filter("5555555555%s55555", l))
            .filter(l -> filter("55555%s5555555555", l))
            .filter(l -> filter("%s555555555555555", l))
            .filter(l -> filter("666666666666666%s", l))
            .filter(l -> filter("6666666666%s66666", l))
            .filter(l -> filter("66666%s6666666666", l))
            .filter(l -> filter("%s666666666666666", l))
            .filter(l -> filter("777777777777777%s", l))
            .filter(l -> filter("7777777777%s77777", l))
            .filter(l -> filter("77777%s7777777777", l))
            .filter(l -> filter("%s777777777777777", l))
            .filter(l -> filter("888888888888888%s", l))
            .filter(l -> filter("8888888888%s88888", l))
            .filter(l -> filter("88888%s8888888888", l))
            .filter(l -> filter("%s888888888888888", l))
            .filter(l -> filter("999999999999999%s", l))
            .filter(l -> filter("9999999999%s99999", l))
            .filter(l -> filter("99999%s9999999999", l))
            .filter(l -> filter("%s999999999999999", l))
            .collect(toList());
    System.err.println(list);
}

public static boolean filter(String format, long number) {
    int[] buckets = new int[BUCKET_SIZE];
    for (int i = 0; i < BUCKET_SIZE; i++)
        buckets[i] = 0;

    for (int i = 0; i < GROUP_SIZE; i++) {
        String id = format(format, i);
        long l = parseLong(id);
        int partition = (int) (l % number % BUCKET_SIZE);
        buckets[partition] = buckets[partition] + 1;
    }

    int sum = 0;
    for (int i = 0; i < BUCKET_SIZE; i++)
        sum += buckets[i];

    int deviation = 0;
    for (int i = 0; i < BUCKET_SIZE; i++)
        deviation += abs(buckets[i] - sum / BUCKET_SIZE);

    if (number == NUMBER_TO_TEST) {
        for (int i = 0; i < BUCKET_SIZE; i++)
            System.out.println(buckets[i]);
        System.out.println("----------------------");
    }
    return deviation < MAX_DEVIATION;
}
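A possible explanation for why 813 does so well, and why 31 overloaded bucket 0 in the question: with `l % m % 10`, each bucket receives the residues 0..m-1 that end in its digit. The sketch below counts them (class and method names are illustrative). It assumes the residues themselves are hit about equally often, which is only approximately true for real ids.

```java
import java.util.Arrays;

class ResiduesPerBucket {
    // Counts how many residues r in [0, modulus) map to each bucket via r % buckets.
    static int[] residuesPerBucket(int modulus, int buckets) {
        int[] counts = new int[buckets];
        for (int r = 0; r < modulus; r++) {
            counts[r % buckets]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints [4, 3, 3, 3, 3, 3, 3, 3, 3, 3]
        System.out.println(Arrays.toString(residuesPerBucket(31, 10)));
        // prints [82, 82, 82, 81, 81, 81, 81, 81, 81, 81]
        System.out.println(Arrays.toString(residuesPerBucket(813, 10)));
    }
}
```

With 31, bucket 0 owns 4 of 31 residues (about 12.9%) against 3 of 31 (about 9.7%) for the others, which is close to the 130 versus 96-97 split in the question. With 813 the imbalance shrinks to 82 versus 81 residues, roughly matching the 101 versus 99 pattern in the rows above.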
Mike