
Why does pickling a sklearn decision tree generate a pickle thousands of times bigger (in terms of memory) than the original estimator?

I ran into this issue at work, where a random forest estimator (with 100 decision trees) trained on a dataset with around 1_000_000 samples and 7 features produced a pickle larger than 2 GB.

I tracked the issue down to the pickling of a single decision tree and was able to replicate it with a generated dataset, as shown below.

For memory estimations I used the pympler library. The sklearn version used is 1.0.1.

# here using a regressor tree but I would expect the same issue to be present with a classification tree
import pickle
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_friedman1  # using a dataset generation function from sklearn
from pympler import asizeof

# function that creates the dataset and trains the estimator
def make_example(n_samples: int):
    X, y = make_friedman1(n_samples=n_samples, n_features=7, noise=1.0, random_state=49)
    estimator = DecisionTreeRegressor(max_depth=50, max_features='auto', min_samples_split=5)
    estimator.fit(X, y)
    return X, y, estimator

# utilities to compute and compare the size of an object and its pickled version
def readable_size(size_in_bytes: int, suffix='B') -> str:
    num = size_in_bytes
    for unit in ['', 'k', 'M', 'G', 'T', 'P', 'E', 'Z']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

def print_size(obj, skip_detail=False):
    obj_size = asizeof.asized(obj).size
    print(readable_size(obj_size))
    return obj_size

def compare_with_pickle(obj):
    size_obj = print_size(obj)
    size_pickle = print_size(pickle.dumps(obj))
    print(f"Ratio pickle/obj: {(size_pickle / size_obj):.2f}")
    
_, _, model100K = make_example(100_000)
compare_with_pickle(model100K)
_, _, model1M = make_example(1_000_000)
compare_with_pickle(model1M)

output:

1.7 kB
4.9 MB
Ratio pickle/obj: 2876.22
1.7 kB
49.3 MB
Ratio pickle/obj: 28982.84
pietroppeter
  • if I had to guess, I would say that maybe for each node in the tree pickle serializes the complete subtree, thus being highly redundant – pietroppeter Jun 09 '22 at 14:02
  • an interesting article about how to use pickletools to "disassemble" a pickle: https://rushter.com/blog/pickle-serialization-internals/ – pietroppeter Jun 10 '22 at 08:15

2 Answers


Preamble

asizeof usually produces bad estimates when it does not know how to resolve references inside objects. By default, asizeof only traverses attributes for its calculations. There are exceptions, however: reference methods for libraries such as numpy are hardcoded.

I suspect DecisionTreeRegressor has its own internal reference methods, used to build a tree/graph, that are not recognized by asizeof.

Reducing output size

Depending on your requirements (Python version, compatibility, time), you may be able to reduce the output size by changing pickle's default protocol parameter to a more space-efficient protocol.
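
For instance, a minimal sketch (not from the original post; it reuses the model1M object and the readable_size helper defined in the question) comparing the size produced by each protocol:

import pickle

# sketch: compare pickled sizes across all available protocols
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    data = pickle.dumps(model1M, protocol=protocol)
    print(f"protocol {protocol}: {readable_size(len(data))}")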

There is also a built-in module called pickletools that can be used to reduce the space used by your pickled file (pickletools.optimize). pickletools can also be used to disassemble the byte code.
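
A minimal sketch of both uses, assuming the model100K object and readable_size helper from the question:

import pickle
import pickletools

raw = pickle.dumps(model100K)

# pickletools.optimize removes unused PUT opcodes, which can shave off some bytes
optimized = pickletools.optimize(raw)
print(readable_size(len(raw)), "->", readable_size(len(optimized)))

# disassembling prints one line per opcode, so redirect the (very verbose) output to a file
with open("model100K_pickle_disassembly.txt", "w") as out:
    pickletools.dis(raw, out=out)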

Furthermore, you may compress the pickled output using built-in archiving modules.
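
For example, a rough sketch using gzip (bz2 or lzma from the same archiving docs would work the same way); as the comments below note, this roughly halves the size in the original use case:

import gzip
import pickle

# sketch: compress the pickled bytes before writing them to disk
raw = pickle.dumps(model1M)
compressed = gzip.compress(raw)
print(readable_size(len(raw)), "->", readable_size(len(compressed)))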

References

https://github.com/pympler/pympler/blob/master/pympler/asizeof.py

https://docs.python.org/3/library/pickle.html

https://docs.python.org/3/library/pickletools.html#module-pickletools

https://docs.python.org/3/library/archiving.html

pygeek
  • It might be that the memory that `asizeof` reports can be wrong (an example?) but in this case I do not think it impacts the question. In particular the size of the pickle is just the length of the byte sequence, so for sure correct. The size of the tree object is also reasonable with respect to memory consumption. – pietroppeter Jun 15 '22 at 07:45
  • Regarding protocols, I did test them but they do not seem to help, and compressing (which I was already doing in my original use case) does not improve things dramatically (it more or less halves the size, from 2GB to 1GB), as expected since, from what I understand, the pickle seems to contain a lot of redundant information – pietroppeter Jun 15 '22 at 07:46
  • I think the key point here is the suggestion to use pickletools to disassemble and understand what is going on (see also my comments on the question). I generally appreciate the answer but I do not think it answers my question yet. I am looking for an explanation of why this explosion in size happens. – pietroppeter Jun 15 '22 at 07:49
  • 1
    @pietroppeter the size reflected by `asizeof` is wrong. This is a common complaint with the library when using it against graph/tree building libraries. In order to properly size `DecisionTreeRegressor` `asizeof` would need a code change. – pygeek Jun 15 '22 at 12:50
  • 1
    ooh, right the size of the trees is indeed wrong and it is comparable to the size of the pickle (see details in my answer added below). Thanks for the help in understanding this, bounty awarded! – pietroppeter Jun 15 '22 at 14:15

As pointed out by @pygeek's answer and the subsequent comments, the wrong assumption in the question is that pickling substantially increases the size of the object. Instead, the issue lies with pympler.asizeof, which does not give a correct estimate of the size of the tree object.

Indeed, the DecisionTreeRegressor object has a tree_ attribute that holds a number of arrays of length tree_.node_count. Using help(sklearn.tree._tree.Tree) we can see that there are 8 such arrays (value, children_left, children_right, feature, impurity, threshold, n_node_samples, weighted_n_node_samples) and that the underlying type of every array (except possibly the value array, see the note below) is a 64-bit integer or a 64-bit float (the underlying Tree object is a Cython object), so a better estimate of the size of a DecisionTree is estimator.tree_.node_count*8*8.
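
As a cross-check (a sketch, not part of the original analysis), one can also sum the nbytes of the per-node arrays exposed by tree_ directly:

def tree_nbytes(tree) -> int:
    # sketch: sum the sizes of the 8 per-node arrays exposed by the Cython Tree object
    arrays = ('children_left', 'children_right', 'feature', 'threshold',
              'impurity', 'n_node_samples', 'weighted_n_node_samples', 'value')
    return sum(getattr(tree, name).nbytes for name in arrays)

print(readable_size(tree_nbytes(model1M.tree_)))  # should be close to the pickle size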

Computing this estimate for the models above:

def print_tree_estimate(tree):
    print(f"A tree with max_depth {tree.max_depth} can have up to {2**(tree.max_depth -1)} nodes")
    print(f"This tree has node_count {tree.node_count} and a size estimate is {readable_size(tree.node_count*8*8)}")
    
print_tree_estimate(model100K.tree_)
print()
print_tree_estimate(model1M.tree_)

gives as output:

A tree with max_depth 37 can have up to 68719476736 nodes
This tree has node_count 80159 and a size estimate is 4.9 MB

A tree with max_depth 46 can have up to 35184372088832 nodes
This tree has node_count 807881 and a size estimate is 49.3 MB

and indeed these estimates are in line with the sizes of the pickled objects.

Further note that the only way to be sure to bound the size of a DecisionTree is to bound max_depth, since the number of nodes of a binary tree is bounded by 2**(max_depth - 1); the specific tree realizations above, however, have a number of nodes well below this theoretical bound.

Note: the above estimate is valid for this decision tree regressor, which has a single output and no classes. estimator.tree_.value is an array of shape [node_count, n_outputs, max_n_classes], so for n_outputs > 1 and/or max_n_classes > 1 the size estimate would need to take those into account, and the correct estimate would be estimator.tree_.node_count*8*(7 + n_outputs*max_n_classes).
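
As an illustration (a sketch with a freshly fitted classifier, not part of the original example), the shape of tree_.value and the generalized estimate can be inspected like this:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# sketch: for a classifier, tree_.value has shape [node_count, n_outputs, max_n_classes],
# so the per-node byte estimate grows with the number of classes
Xc, yc = make_classification(n_samples=10_000, n_features=7, n_informative=5,
                             n_classes=3, random_state=49)
clf = DecisionTreeClassifier(min_samples_split=5).fit(Xc, yc)
n_outputs, max_n_classes = clf.tree_.value.shape[1], clf.tree_.value.shape[2]
estimate = clf.tree_.node_count * 8 * (7 + n_outputs * max_n_classes)
print(clf.tree_.value.shape, readable_size(estimate))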

pietroppeter