Data structure for occurrence counting in long tail distribution

Question

I have a big list of elements (tens of millions). I am trying to count the number of occurrences of several subsets of these elements. The occurrence distribution is long-tailed.

The data structure currently looks like this (in an OCaml-ish flavor):

type element_key
type element_aggr_key

type raw_data = element_key list

type element_stat =
{
    occurrence : (element_key, int) Hashtbl.t;
}

type stat =
{
    element_stat_hashtable : (element_aggr_key, element_stat) Hashtbl.t;
}

element_stat currently uses a hashtable where each key is an element and the value is its occurrence count. However, this is inefficient: when many elements have a single occurrence, the occurrence hashtable is resized many times. I cannot avoid the resizing by giving the occurrence hashtable a large initial size, because there are many element_stat instances (the hashtable in stat is itself large).

I would like to know whether there is a more efficient (memory-wise and/or insertion-wise) data structure for this use case. I found many existing data structures, such as tries, radix trees, and Judy arrays, but I have trouble understanding their differences and whether they fit my problem.

Are you just worried about resizing, or have you measured this as a real performance bottleneck? In aggregate, resizing adds a log factor, I believe. Resizing happens a lot at the beginning, but the table is small then. Later it happens almost never. — Jeffrey Scofield, Feb 08 '14 at 08:04
I have experienced a high resizing cost, although I do not have any numbers. I also know for a fact that resizing happens in special cases where a huge number of keys (around a million) have a single occurrence (cf. long-tail distribution). — Johan Mazel, Feb 09 '14 at 11:00

score 1 · Answer 1 · answered Feb 06 '15 at 05:01

What you have here is a table mapping element_aggr_key to tables that in turn map element_key to int. For all practical purposes, this is equivalent to a single table that maps element_aggr_key * element_key to int, so you could do:

type stat = (element_aggr_key * element_key, int) Hashtbl.t

Then you have a single hash table, and you can give it a huge initial size.
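
A minimal sketch of how counting could look with that single table. The question leaves element_aggr_key and element_key abstract, so the concrete types below (string and int) are assumptions made only so the example compiles, and the helper names create_stat, incr_occurrence and occurrence are hypothetical:

```ocaml
(* Assumed concrete key types; the question keeps both abstract. *)
type element_aggr_key = string
type element_key = int

(* One flat table keyed by the (aggregate key, element key) pair. *)
type stat = (element_aggr_key * element_key, int) Hashtbl.t

(* [initial_size] is a guess at the number of distinct pairs;
   the table still grows as needed if the guess is too small. *)
let create_stat (initial_size : int) : stat = Hashtbl.create initial_size

(* Record one more occurrence of [key] under [aggr]. *)
let incr_occurrence (stat : stat) (aggr : element_aggr_key)
    (key : element_key) : unit =
  let k = (aggr, key) in
  match Hashtbl.find_opt stat k with
  | Some n -> Hashtbl.replace stat k (n + 1)
  | None -> Hashtbl.add stat k 1

(* Occurrence count of [key] under [aggr]; 0 if the pair was never seen. *)
let occurrence (stat : stat) (aggr : element_aggr_key)
    (key : element_key) : int =
  match Hashtbl.find_opt stat (aggr, key) with
  | Some n -> n
  | None -> 0

let () =
  let stat = create_stat 1024 in
  List.iter
    (fun (aggr, key) -> incr_occurrence stat aggr key)
    [ ("a", 1); ("a", 1); ("a", 2); ("b", 1) ];
  Printf.printf "a/1=%d a/2=%d b/1=%d\n"
    (occurrence stat "a" 1) (occurrence stat "a" 2) (occurrence stat "b" 1)
```

One trade-off to keep in mind: with a flat table, listing every element of a given element_aggr_key means scanning the whole table, so this layout fits best when you mostly do point updates and lookups.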
