I'm new to dask so bear with me.

I have a JSON file where each row has the following schema:

{
 'id': 2,
 'version': 7.3,
 'participants': range(10)
}

participants is a nested field.

import json
import dask.bag as db
input_file = 'data.json'
df = db.read_text(input_file).map(json.loads)

I can do either:
df.pluck(['id', 'version'])
or
df.pluck('participants').flatten()

But how can I do the equivalent of a Spark explode, selecting id and version while flattening participants at the same time?

So the output would be:

{'id': 2, 'version': 7.3, 'participants': 0}
{'id': 2, 'version': 7.3, 'participants': 1}
{'id': 2, 'version': 7.3, 'participants': 2}
{'id': 2, 'version': 7.3, 'participants': 3}
...
Jonathan Hall
louis_guitton

1 Answer

It's possible to write a custom function that reads and transforms the rows of the file, and then load the result with dask.bag.from_sequence:

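import json
import dask.bag as db

def mapper(row, denest_field):
    # Parse one JSON line and emit one record per element of the nested field.
    js = json.loads(row)
    for v in js[denest_field]:
        yield {'id': js['id'], denest_field: v, 'version': js['version']}


def yield_unnested(fname, denest_field):
    # Stream the file line by line, un-nesting each row as it is read.
    with open(fname) as f:
        for row in f:
            yield from mapper(row, denest_field)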

I've saved a file called 'data.json' with the following contents:

{"id": 2, "version": 7.3, "participants": [0,1,2,3,4,5,6,7,9,9]}

Then read it with from_sequence:

df = db.from_sequence(yield_unnested('data.json', 'participants'))
list(df) # outputs:

[{'id': 2, 'participants': 0, 'version': 7.3},
 {'id': 2, 'participants': 1, 'version': 7.3},
 {'id': 2, 'participants': 2, 'version': 7.3},
 {'id': 2, 'participants': 3, 'version': 7.3},
 {'id': 2, 'participants': 4, 'version': 7.3},
 {'id': 2, 'participants': 5, 'version': 7.3},
 {'id': 2, 'participants': 6, 'version': 7.3},
 {'id': 2, 'participants': 7, 'version': 7.3},
 {'id': 2, 'participants': 9, 'version': 7.3},
 {'id': 2, 'participants': 9, 'version': 7.3}]

Note that I'm new to dask and this may not be the most efficient way to go about things.
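
One possible alternative, sketched here under a few assumptions, keeps the work inside the bag pipeline from the question: map each parsed record to a list of exploded records, then flatten. As far as I know, from_sequence consumes the generator up front, so this variant may scale better on large files, but I haven't benchmarked it. The names explode and explode_bag are hypothetical helpers, not part of dask, and unlike mapper above, explode copies every other key of the record rather than only id and version.

```python
import json


def explode(record, field):
    """Return one dict per element of record[field], copying the other keys."""
    return [{**record, field: value} for value in record[field]]


def explode_bag(path, field):
    """Build a lazy dask bag of exploded records from a JSON-lines file."""
    # Imported here so that explode() stays usable without dask installed.
    import dask.bag as db

    return (
        db.read_text(path)            # one string per line of the file
        .map(json.loads)              # string -> dict
        .map(explode, field=field)    # dict -> list of dicts
        .flatten()                    # concatenate the lists into one bag
    )
```

Calling explode_bag('data.json', 'participants').compute() should then return the exploded records as a list. To keep only some keys, chain a further .map that builds a smaller dict from each record.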

Haleemur Ali