I have a big list of elements (tens of millions). I am trying to count the number of occurrences of several subsets of these elements. The occurrence distribution is long-tailed.
The data structure currently looks like this (in an OCaml-ish flavor):
    type element_key
    type element_aggr_key
    type raw_data = element_key list
    type element_stat =
      {
        occurrence : (element_key, int) Hashtbl.t;
      }
    type stat =
      {
        element_stat_hashtable : (element_aggr_key, element_stat) Hashtbl.t;
      }
element_stat currently uses a hashtable whose keys are the elements and whose values are integer counts. However, this is inefficient: because many elements occur only once, the occurrence hashtable is resized many times. I cannot avoid the resizing by choosing a large initial size, because there are many element_stat instances (the hashtable in stat is itself large).
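To make the counting step concrete, here is a minimal sketch of how one occurrence gets recorded. It assumes both key types are strings (they are abstract above), and `record` and the initial size of 8 are hypothetical choices for illustration, not my exact code:

```ocaml
(* Assumption: both key types are strings here; the real ones are abstract. *)
type element_key = string
type element_aggr_key = string

type element_stat = { occurrence : (element_key, int) Hashtbl.t }
type stat = { element_stat_hashtable : (element_aggr_key, element_stat) Hashtbl.t }

(* Hypothetical helper: count one occurrence of [key] under [aggr_key]. *)
let record (s : stat) (aggr_key : element_aggr_key) (key : element_key) : unit =
  let es =
    match Hashtbl.find_opt s.element_stat_hashtable aggr_key with
    | Some es -> es
    | None ->
      (* Small initial size, because there are many element_stat instances. *)
      let es = { occurrence = Hashtbl.create 8 } in
      Hashtbl.add s.element_stat_hashtable aggr_key es;
      es
  in
  (* Increment the count, starting from 0 for a key not seen before. *)
  let n = Option.value (Hashtbl.find_opt es.occurrence key) ~default:0 in
  Hashtbl.replace es.occurrence key (n + 1)
```

Every element that occurs only once still adds a new binding to its occurrence table, so with a long-tailed distribution the tables keep growing past their small initial size.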
I would like to know if there is a more efficient (memory-wise and/or insertion-wise) data structure for this use case. I found many existing data structures, such as tries, radix trees, and Judy arrays, but I have trouble understanding their differences and whether they fit my problem.