
Many functional languages define a flatMap function, which works like map but flattens the returned values. Spark/PySpark has it: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.flatMap

What would be the best way to do this in Dask? My code looks like this:

import dask.bag as db
import json
from tools import get_records

records = db.read_text(json_file).map(json.loads).map(get_records)

get_records returns a list of dicts. I just need to chain them into one sequence.

Yu Chem

You probably want the .flatten method:

In [1]: import dask.bag as db

In [2]: b = db.from_sequence([1, 2, 3, 4, 5])

In [3]: def f(i):
   ...:     return list(range(i))
   ...: 

In [4]: b.map(f).compute()
Out[4]: [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3], [0, 1, 2, 3, 4]]

In [5]: b.map(f).flatten().compute()
Out[5]: [0, 0, 1, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4]

So instead of a combined "flatMap" operation there are two operations, "map" and "flatten", which you can use individually or chain as you prefer.
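Applied to the pipeline in the question, that would mean appending .flatten() after the two maps. A minimal sketch, using an in-memory sample in place of read_text(json_file) and a hypothetical get_records that pulls a list of dicts out of each parsed document:

```python
import json
import dask.bag as db

# Hypothetical stand-in for tools.get_records: returns a list of dicts
def get_records(doc):
    return doc["records"]

# Sample lines standing in for the contents of json_file
lines = [
    '{"records": [{"a": 1}, {"a": 2}]}',
    '{"records": [{"a": 3}]}',
]

# map produces a bag of lists; flatten chains them into one sequence
records = db.from_sequence(lines).map(json.loads).map(get_records).flatten()
print(records.compute())  # [{'a': 1}, {'a': 2}, {'a': 3}]
```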

MRocklin