
I have a dictionary that I'm getting from Redis as a hash object, similar to the following:

source_data = {
   b'key-1': b'{"age":33,"gender":"Male"}', 
   b'key-2': b'{"age":20,"gender":"Female"}'
}

My goal is to extract all the values from this dictionary and have them as a list of Python dictionaries, like so:

final_data = [
   {
      'age': 33,
      'gender': 'Male'
   },

   {
      'age': 20,
      'gender': 'Female'
   }
]

I tried a list comprehension with JSON parsing:

import json
final_data = [json.loads(a) for a in source_data.values()]

It works, but for a large data set it takes too much time.

I switched to the third-party JSON module ujson, which is faster according to benchmarks, but I haven't noticed any improvement.

I tried using multiprocessing:

from multiprocessing import Pool
import ujson

pool = Pool()
final_data = pool.map(ujson.loads, source_data.values(), chunksize=500)

pool.close()
pool.join()

I played a bit with chunksize, but the result is the same: it still takes too much time.

It would be super helpful if someone could suggest another solution or an improvement on my previous attempts. Ideally I'd like to avoid using an explicit loop.

Sam
  • Might be worth trying pypy? – Tom Dalton Jul 03 '18 at 13:44
  • how long does it take and how big is your data source? – acushner Jul 03 '18 at 13:55
  • @TomDalton trying pypy at the moment is not possible for me. – Sam Jul 03 '18 at 13:56
  • @acushner it takes +35 seconds with a data source that contains ~2000 keys. – Sam Jul 03 '18 at 13:57
  • Multiprocessing is more likely to hinder than to help here. You want to deserialise a string, but once the child processes have done that, they have to serialise the object back to a string, send the string to the parent process, which then deserialises the strings into objects again... The only difference is that multiprocessing doesn't use JSON as its data exchange format. – Dunes Jul 03 '18 at 14:10
  • @Dunes thanks for the explanation, I indeed tried it and saw no improvement compared to the solution that uses list comprehension. – Sam Jul 03 '18 at 14:13
  • If it takes 35 seconds, I highly suspect there's some other bottleneck here that we're overlooking. `json.loads` should be able to handle small data like that in microseconds. Are you getting all the data from redis in a single batch? Or are you sequentially requesting entries from a remote redis server? (just speculating, but this kind of latency is more typical for network i/o) – Håken Lid Jul 03 '18 at 14:15
  • @HåkenLid I'm getting all the data from Redis in a single batch as a dictionary. – Sam Jul 03 '18 at 14:21
  • How large is the combined json data? If you follow @chepner's answer what is `len(new_json)` ? Can you include sample data (or code that generates mock data) that we may use to reproduce this issue? [mcve] – Håken Lid Jul 03 '18 at 14:27
  • @HåkenLid the data set I'm using has almost 4000 entries, I just tested again using solution provided by chepner and I'm getting 0.12 seconds. – Sam Jul 03 '18 at 14:37
  • I just realized that the +30 seconds were caused by reading the data from Redis. – Sam Jul 03 '18 at 14:46

2 Answers


Assuming the values are, indeed, valid JSON, it might be faster to build a single JSON array to decode. I think it should be safe to just join the values into a single string.

>>> new_json = b'[%s]' % b','.join(source_data.values())
>>> new_json
b'[{"age":33,"gender":"Male"},{"age":20,"gender":"Female"}]'
>>> json.loads(new_json)
[{'age': 33, 'gender': 'Male'}, {'age': 20, 'gender': 'Female'}]

This replaces the overhead of calling json.loads 2000+ times with the lesser overhead of a single call to b','.join and a single string-formatting operation.
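
As a rough sanity check, the two approaches can be compared with timeit (a sketch using mock data of a similar shape, not the original Redis payload):

import json
import timeit

# Mock data standing in for ~2000 small JSON byte strings from the Redis hash.
source_data = {
    'key-{}'.format(n).encode(): b'{"age":33,"gender":"Male"}'
    for n in range(2000)
}

per_value = timeit.timeit(
    "[json.loads(v) for v in source_data.values()]",
    number=100, globals={'json': json, 'source_data': source_data})

single_call = timeit.timeit(
    "json.loads(b'[%s]' % b','.join(source_data.values()))",
    number=100, globals={'json': json, 'source_data': source_data})

print(per_value, single_call)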

chepner
  • I already tried that, but as you see the values are byte strings. – Sam Jul 03 '18 at 13:58
  • Sorry, got lazy and tested in Python 2. The update should work in Python 3. – chepner Jul 03 '18 at 14:01
  • Thanks for the attempt, it works but still taking 34+ seconds to parse 2000+ entries. – Sam Jul 03 '18 at 14:10
  • There must be something else going on here if it takes multiple seconds to decode just a few thousand entries. Can you post a link to sample data so we may reproduce this ourselves? [mcve] – Håken Lid Jul 03 '18 at 14:12
  • This suddenly seems to be really fast with ~4000 entries, parsing them in 0.12 sec – Sam Jul 03 '18 at 14:38

For reference, I tried replicating the situation:

import json, timeit, random
source_data = { 'key-{}'.format(n).encode('ascii'): 
                '{{"age":{},"gender":"{}"}}'.format(
                    random.randint(18,75), 
                    random.choice(("Male", "Female"))
                 ).encode('ascii') 
               for n in range(45000) }
timeit.timeit("{ k: json.loads(v) for (k,v) in source_data.items() }", 
    number=1, globals={'json': json, 'source_data': source_data})

This completed in far less than a second; the 30+ seconds must be coming from something I'm not seeing.

My closest guess is that the data was held in some sort of proxy container where each key fetch turned into a remote call, for example using hscan rather than hgetall. A tradeoff between the two should be possible using the count hint to hscan.
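
With the redis-py client, for example (a minimal sketch; the connection details and the hash name 'myhash' are placeholders, not taken from the question), hgetall fetches the whole hash in one round trip, whereas hscan_iter issues repeated HSCAN calls whose batch size is guided by count:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)  # placeholder connection

# One round trip: the whole hash arrives as a dict of byte strings.
source_data = r.hgetall('myhash')

# Many round trips: each underlying HSCAN returns roughly `count` fields.
source_data = dict(r.hscan_iter('myhash', count=500))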

Proper profiling should reveal where the delays come from.
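
Even a crude split of the timing would show whether the fetch or the parse dominates (a sketch; fetch_from_redis is a hypothetical stand-in for however the data is actually read):

import json
import time

t0 = time.perf_counter()
source_data = fetch_from_redis()   # hypothetical stand-in for the actual Redis read
t1 = time.perf_counter()
final_data = [json.loads(v) for v in source_data.values()]
t2 = time.perf_counter()

print('fetch: {:.3f}s, parse: {:.3f}s'.format(t1 - t0, t2 - t1))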

Yann Vernier