
When we need to copy full data from a dictionary containing primitive data types (for simplicity, let's ignore types like datetime), the most obvious choice we have is deepcopy. But deepcopy is slower than some other hackish methods of achieving the same thing, i.e. using a serialization/deserialization round trip, for example json-dump/json-load or msgpack-pack/msgpack-unpack. The difference in efficiency can be seen here:

>>> import timeit
>>> setup = '''
... import msgpack
... import json
... from copy import deepcopy
... data = {'name':'John Doe','ranks':{'sports':13,'edu':34,'arts':45},'grade':5}
... '''
>>> print(timeit.timeit('deepcopy(data)', setup=setup))
12.0860249996
>>> print(timeit.timeit('json.loads(json.dumps(data))', setup=setup))
9.07182312012
>>> print(timeit.timeit('msgpack.unpackb(msgpack.packb(data))', setup=setup))
1.42743492126

The json and msgpack (or cPickle) methods are faster than a normal deepcopy, which is expected, as deepcopy does much more work in copying all the attributes of the objects too.
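The cPickle variant mentioned above works the same way; on Python 3 the plain pickle module uses the C implementation automatically. A minimal round trip:

```python
import pickle

data = {'name': 'John Doe', 'ranks': {'sports': 13, 'edu': 34, 'arts': 45}, 'grade': 5}

# copy via a serialization round trip, analogous to the json/msgpack variants
copied = pickle.loads(pickle.dumps(data, pickle.HIGHEST_PROTOCOL))

assert copied == data                        # same content...
assert copied['ranks'] is not data['ranks']  # ...but independent objects
```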

Question: Is there a more Pythonic/built-in way to achieve just a data copy of a dictionary or list, without all the overhead that deepcopy has?

DhruvPathak
  • It's rarely useful to measure the performance on a small dataset and draw conclusions based on that. If you have a more nested or otherwise bigger data structure, is `deepcopy` still much slower? – MSeifert Aug 24 '17 at 09:43
  • @MSeifert I agree with your feedback, but my intent here is not to compare deepcopy with any particular method; my primary question is how to remove the overhead of deepcopy when my interest is just a data copy. – DhruvPathak Aug 24 '17 at 09:46
  • relevant: https://stackoverflow.com/questions/24756712/deepcopy-is-extremely-slow and https://stackoverflow.com/questions/10128351/any-alternative-to-a-very-slow-deepcopy-in-a-dfs and https://stackoverflow.com/questions/8957400/what-is-the-runtime-complexity-of-pythons-deepcopy and https://writeonly.wordpress.com/2009/05/07/deepcopy-is-a-pig-for-simple-data/ – Chris_Rands Aug 24 '17 at 09:46
  • It's worth noting that a round trip serialization with `json` is not always equivalent to `copy.deepcopy`. For instance, `deepcopy` will preserve multiple references to the same object if they are nested in a container. Consider `D = {1: 2}; L = [D, D]`. If you copy that with `deepcopy`, the new list will still contain two references to a single dict (a copy of `D`). With `json`, you'd get two independent dicts. Using `json` will also convert the integer keys in the dict into strings. I'm not familiar with `msgpack`, so I don't know if it has the same limitations as `json` or not. – Blckknght Aug 24 '17 at 10:01

5 Answers


It really depends on your needs. deepcopy was built with the intention of doing the (most) correct thing. It keeps shared references, it doesn't recurse into infinite recursive structures, and so on... It can do that by keeping a memo dictionary in which all encountered "things" are inserted by reference. That's what makes it quite slow for pure-data copies. However, I would almost always say that deepcopy is the most Pythonic way to copy data, even if other approaches could be faster.

If you have pure data and a limited set of types inside it, you can build your own deepcopy (modeled roughly on the implementation of deepcopy in CPython):

_dispatcher = {}

def _copy_list(l, dispatch):
    ret = l.copy()
    for idx, item in enumerate(ret):
        cp = dispatch.get(type(item))
        if cp is not None:
            ret[idx] = cp(item, dispatch)
    return ret

def _copy_dict(d, dispatch):
    ret = d.copy()
    for key, value in ret.items():
        cp = dispatch.get(type(value))
        if cp is not None:
            ret[key] = cp(value, dispatch)

    return ret

_dispatcher[list] = _copy_list
_dispatcher[dict] = _copy_dict

def deepcopy(sth):
    cp = _dispatcher.get(type(sth))
    if cp is None:
        return sth
    else:
        return cp(sth, _dispatcher)

This only works correctly for all immutable non-container types and for list and dict instances. You could add more dispatchers if you need them.
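For instance, a tuple dispatcher could follow the same pattern. As a sketch (the list/dispatch setup from above is repeated so the snippet runs on its own):

```python
_dispatcher = {}

def _copy_list(l, dispatch):
    ret = l.copy()
    for idx, item in enumerate(ret):
        cp = dispatch.get(type(item))
        if cp is not None:
            ret[idx] = cp(item, dispatch)
    return ret

def _copy_tuple(t, dispatch):
    # tuples are immutable: build a new tuple, copying only items that need it
    return tuple(
        dispatch[type(item)](item, dispatch) if type(item) in dispatch else item
        for item in t
    )

_dispatcher[list] = _copy_list
_dispatcher[tuple] = _copy_tuple

def deepcopy(sth):
    cp = _dispatcher.get(type(sth))
    return sth if cp is None else cp(sth, _dispatcher)
```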

# Timings done on Python 3.5.3 - Windows - on a really slow laptop :-/

import copy
import msgpack
import json

import string

data = {'name':'John Doe','ranks':{'sports':13,'edu':34,'arts':45},'grade':5}

%timeit deepcopy(data)
# 11.9 µs ± 280 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit copy.deepcopy(data)
# 64.3 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit json.loads(json.dumps(data))
# 65.9 µs ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit msgpack.unpackb(msgpack.packb(data))
# 56.5 µs ± 2.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Let's also see how it performs when copying a big dictionary containing strings and integers:

data = {''.join([a,b,c]): 1 for a in string.ascii_letters for b in string.ascii_letters for c in string.ascii_letters}

%timeit deepcopy(data)
# 194 ms ± 5.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit copy.deepcopy(data)
# 1.02 s ± 46.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit json.loads(json.dumps(data))
# 398 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit msgpack.unpackb(msgpack.packb(data))
# 238 ms ± 8.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
MSeifert

I think you can manually implement what you need by overriding object.__deepcopy__.

A Pythonic way to do this is to create your own custom dict that extends the built-in dict and implements your custom __deepcopy__.
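A minimal sketch of that idea (DataDict and _copy_value are hypothetical names; the subclass assumes it holds only plain data, i.e. immutables plus nested dicts and lists):

```python
from copy import deepcopy

def _copy_value(v):
    # plain-data assumption: values are dicts, lists, or immutables
    if isinstance(v, dict):
        return {k: _copy_value(x) for k, x in v.items()}
    if isinstance(v, list):
        return [_copy_value(x) for x in v]
    return v

class DataDict(dict):
    # copy.deepcopy calls __deepcopy__ when it is defined, so this
    # bypasses the generic memo-based machinery for DataDict instances
    def __deepcopy__(self, memo):
        return DataDict({k: _copy_value(v) for k, v in self.items()})

data = DataDict({'name': 'John Doe', 'ranks': {'sports': 13, 'edu': 34}})
cp = deepcopy(data)
```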

Sraw

@MSeifert The suggested answer is not accurate.

So far I have found ujson.loads(ujson.dumps(my_dict)) to be the fastest option, which looks strange (how can translating a dict to a string and then from a string to a new dict be faster than a pure copy?).

Here is an example of the methods I tried and their running times for a small dictionary (the results are of course clearer with a larger dictionary):

x = {'a':1,'b':2,'c':3,'d':4, 'e':{'a':1,'b':2}}

# this function only handles dicts of dicts, very similar to the suggested solution
def fast_copy(d):
    output = d.copy()
    for key, value in output.items():
        output[key] = fast_copy(value) if isinstance(value, dict) else value
    return output



from copy import deepcopy
import ujson


%timeit deepcopy(x)
13.5 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit fast_copy(x)
2.57 µs ± 31.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit ujson.loads(ujson.dumps(x))
1.67 µs ± 14.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Is there any other C extension that might work better than ujson? It seems very strange that this is the fastest method to copy a large dict.

Idok

It's always fastest to write your own copy function specific to your data structure.

Your example

data = {
    'name': 'John Doe',
    'ranks': {
        'sports': 13,
        'edu': 34,
        'arts': 45
        },
    'grade': 5
    }

is a dict whose values are strs, ints, or dicts of the same. Hence:

def copy(obj):
    out = obj.copy()  # shallow copy of the top level
    for k, v in obj.items():
        if isinstance(v, dict):
            out[k] = v.copy()  # one extra level is enough for this structure
    return out

%timeit deepcopy(data)
5.26 µs ± 88.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit json.loads(json.dumps(data))
5.11 µs ± 117 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit msgpack.unpackb(msgpack.packb(data))
2.44 µs ± 76.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit ujson.loads(ujson.dumps(data))
1.63 µs ± 25.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit copy(data)
548 ns ± 5.77 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Suuuehgi

@MSeifert's answer did not work for me, so I implemented a somewhat different approach.


from copy import copy

def myDictDeepCopy(dictToCopy) -> dict:
    '''
    Parameters
    ----------
    dictToCopy : dict
        dict that you want to copy

    Returns
    -------
    dict

    '''
    temp = dictToCopy.copy()  # shallow copy of the top level
    dictToReturn = {}
    for key, value in temp.items():
        dictToReturn[key] = copy(value)  # shallow copy of each value
    return dictToReturn
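Note that copy(value) only makes a shallow copy of each value, so this goes exactly one level deep; mutables nested further down are still shared between the copies:

```python
from copy import copy

# restating myDictDeepCopy from above so this check runs standalone
def myDictDeepCopy(dictToCopy) -> dict:
    return {key: copy(value) for key, value in dictToCopy.items()}

src = {'a': {'b': {'c': 1}}}
cp = myDictDeepCopy(src)

assert cp['a'] is not src['a']        # the first level is copied
assert cp['a']['b'] is src['a']['b']  # deeper levels are still shared
```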
Dariyoush