5

I saw this code in a talk from JavaDays. The author said that this probabilistic approach is very effective for deduplicating Strings, as an analogue of the String.intern() method:

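import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

public class CHMDeduplicator<T> {
    // Threshold in int space: a random int is <= this value with
    // probability roughly equal to the constructor argument.
    private final int prob;
    private final Map<T, T> map;

    public CHMDeduplicator(double prob) {
        // Map prob in [0.0, 1.0] onto [Integer.MIN_VALUE, Integer.MAX_VALUE].
        this.prob = (int) (Integer.MIN_VALUE + prob * (1L << 32));
        this.map = new ConcurrentHashMap<>();
    }

    public T dedup(T t) {
        // Skip deduplication for this call and return the argument untouched.
        if (ThreadLocalRandom.current().nextInt() > prob) {
            return t;
        }
        // Otherwise return the canonical instance, registering t if there is none yet.
        T exist = map.putIfAbsent(t, t);
        return (exist == null) ? t : exist;
    }
}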

Could you please explain the effect of the probability in this line:

if (ThreadLocalRandom.current().nextInt() > prob) return t;

This is the original presentation from Java Days: https://shipilev.net/talks/jpoint-April2015-string-catechism.pdf (slide 56)

vsminkov
pacman

2 Answers

7

If you look at the next slide, which has a table of results for different probabilities, or listen to the talk, you will see/hear the rationale: probabilistic deduplicators balance the time spent deduplicating the Strings against the memory savings coming from the deduplication. This allows you to fine-tune the time spent processing Strings, or even to sprinkle low-prob deduplicators around the code, thus amortizing the deduplication costs.

(Source: these are my slides)

Aleksey Shipilev
  • Also, I am surprised to hear the talk is from JavaDays. I never did JavaDays. – Aleksey Shipilev Aug 24 '16 at 18:55
  • Thank you for the great explanation, it really clarified the situation. I made an error: I confused JavaDays with JPoint. Thank you for your work on the String catechism, it's amazing. – pacman Aug 28 '16 at 15:24
0

The double value passed to the constructor is intended to be a probability value in the range 0.0 to 1.0. It is converted to an integer such that the proportion of integer values below it is equal to the double value.

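The comparison nextInt() > prob is designed to evaluate to true with probability 1 - p, where p is the constructor parameter. The early return (no deduplication) therefore happens with probability 1 - p, and the map lookup happens with probability p. By using integer math it will be slightly faster than if the raw double value were used.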
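To make the conversion concrete, here is a minimal sketch that reuses the same threshold formula and measures how often a random int falls at or below it. The class name ThresholdDemo and the helper names threshold and dedupFraction are made up for illustration; they are not part of the slides.

```java
import java.util.concurrent.ThreadLocalRandom;

public class ThresholdDemo {
    // Same conversion as in the CHMDeduplicator constructor from the question.
    static int threshold(double prob) {
        return (int) (Integer.MIN_VALUE + prob * (1L << 32));
    }

    // Fraction of random ints that do NOT trigger the early return,
    // i.e. the fraction of calls that would reach the map.
    static double dedupFraction(double prob, int samples) {
        int threshold = threshold(prob);
        int reached = 0;
        for (int i = 0; i < samples; i++) {
            if (!(ThreadLocalRandom.current().nextInt() > threshold)) {
                reached++;
            }
        }
        return (double) reached / samples;
    }

    public static void main(String[] args) {
        for (double p : new double[] {0.0, 0.1, 0.5, 1.0}) {
            System.out.printf("prob=%.1f threshold=%d measured=%.3f%n",
                    p, threshold(p), dedupFraction(p, 1_000_000));
        }
    }
}
```

For p = 0.5 the threshold is 0, so about half of all random ints pass the check. For p = 1.0 the double-to-int cast saturates at Integer.MAX_VALUE, so the early return is never taken.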

The intention of the implementation is that sometimes it won't cache the String, and will just return it instead. The reason for doing this is a CPU vs memory performance trade-off: if the memory-saving caching process causes a CPU bottleneck, you can turn up the "do nothing" probability (that is, lower the constructor argument) until you find a balance.
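The trade-off can be observed with a rough sketch like the one below. It feeds many equal but distinct String objects through a copy of the deduplicator and counts how many distinct objects come back, using that count as a crude stand-in for retained memory. The class DedupTradeoffDemo and the method distinctInstances are hypothetical names for this illustration, and this is not a benchmark of the kind shown in the talk.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

public class DedupTradeoffDemo {
    // Copy of the deduplicator from the question, nested to keep the sketch self-contained.
    static class CHMDeduplicator<T> {
        private final int prob;
        private final Map<T, T> map = new ConcurrentHashMap<>();

        CHMDeduplicator(double prob) {
            this.prob = (int) (Integer.MIN_VALUE + prob * (1L << 32));
        }

        T dedup(T t) {
            if (ThreadLocalRandom.current().nextInt() > prob) {
                return t;
            }
            T exist = map.putIfAbsent(t, t);
            return (exist == null) ? t : exist;
        }
    }

    // Counts how many distinct String objects survive after passing
    // `copies` equal-but-distinct instances through the deduplicator.
    static int distinctInstances(double prob, int copies) {
        CHMDeduplicator<String> dedup = new CHMDeduplicator<>(prob);
        Set<String> identities =
                Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        for (int i = 0; i < copies; i++) {
            String fresh = new String("payload"); // deliberately a new object each time
            identities.add(dedup.dedup(fresh));
        }
        return identities.size();
    }

    public static void main(String[] args) {
        for (double p : new double[] {0.0, 0.1, 0.5, 1.0}) {
            System.out.printf("prob=%.1f distinct instances kept=%d%n",
                    p, distinctInstances(p, 100_000));
        }
    }
}
```

With prob = 1.0 every call goes through the map and a single instance survives. Lower values skip the map more often, so less time is spent on lookups but more duplicate objects stay alive.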

Bohemian