How would I do a Spark explode in Dask?

Question

I'm new to dask so bear with me.

I have a JSON file where each row has the following schema:

{
 'id': 2,
 'version': 7.3,
 'participants': range(10)
}

participants is a nested field.

import json
import dask.bag as db
input_file = 'data.json'
df = db.read_text(input_file).map(json.loads)

I can do either:
df.pluck(['id', 'version'])
or
df.pluck('participants').flatten()

But how can I do the equivalent of a Spark explode, selecting id and version while flattening participants in the same step?

So the output would be:

{'id': 2, 'version': 7.3, 'participants': 0}
{'id': 2, 'version': 7.3, 'participants': 1}
{'id': 2, 'version': 7.3, 'participants': 2}
{'id': 2, 'version': 7.3, 'participants': 3}
...

For completeness, could you add the output that you expect? Not everyone is familiar with `explode`. - mdurant, Sep 11 '17 at 19:08
Any updates on this question? I would love to find out whether it's possible to use flatten this way. - Manuel G, May 22 '18 at 16:04

score 1 · Answer 1 · answered May 29 '18 at 19:48

It's possible to write a custom function that reads and transforms the rows of the file, and then load the result with dask.bag.from_sequence:

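import json
import dask.bag as db


def mapper(row, denest_field):
    # Parse one JSON line and emit one dict per element of the nested field
    js = json.loads(row)
    for v in js[denest_field]:
        yield {'id': js['id'], denest_field: v, 'version': js['version']}


def yield_unnested(fname, denest_field):
    # Stream the file line by line, un-nesting each row as it is read
    with open(fname) as f:
        for row in f:
            yield from mapper(row, denest_field)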

I've saved a file called 'data.json' with the following contents:

{"id": 2, "version": 7.3, "participants": [0,1,2,3,4,5,6,7,9,9]}

Then read it with from_sequence:

df = db.from_sequence(yield_unnested('data.json', 'participants'))
list(df) # outputs:

[{'id': 2, 'participants': 0, 'version': 7.3},
 {'id': 2, 'participants': 1, 'version': 7.3},
 {'id': 2, 'participants': 2, 'version': 7.3},
 {'id': 2, 'participants': 3, 'version': 7.3},
 {'id': 2, 'participants': 4, 'version': 7.3},
 {'id': 2, 'participants': 5, 'version': 7.3},
 {'id': 2, 'participants': 6, 'version': 7.3},
 {'id': 2, 'participants': 7, 'version': 7.3},
 {'id': 2, 'participants': 9, 'version': 7.3},
 {'id': 2, 'participants': 9, 'version': 7.3}]

Note that I'm new to dask and this may not be the most efficient way to go about things.
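
One alternative that might scale better is to keep the parsing inside the bag, so Dask can run it per partition instead of reading the whole file up front: map each record to a list of exploded rows, then flatten. This is only a minimal sketch, assuming every line of the file is a JSON object whose nested field is a list. `explode` and `explode_file` are hypothetical helper names, not part of Dask.

```python
import json


def explode(record, field):
    """Return one copy of `record` per element of record[field].

    Hypothetical helper: every other key is carried over unchanged,
    and `field` is replaced by a single element of the nested list.
    """
    return [{**record, field: value} for value in record[field]]


def explode_file(path, field):
    """Build a lazy bag of exploded rows from a JSON-lines file.

    Assumes one JSON object per line. Dask is imported here so that
    `explode` itself can be used and tested without Dask installed.
    """
    import dask.bag as db

    return (
        db.read_text(path)            # one string per line
        .map(json.loads)              # one dict per line
        .map(explode, field=field)    # one list of dicts per line
        .flatten()                    # one dict per nested element
    )
```

With the 'data.json' file from above, `explode_file('data.json', 'participants').compute()` should give one dict per participant. Unlike the from_sequence version, nothing is read until compute() is called.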
