
Summary of Problem

Short Version

How do I go from a Dask Bag of Pandas DataFrames to a single Dask DataFrame?

Long Version

I have a number of files that are not readable by any of dask.dataframe's various read functions (e.g. dd.read_csv or dd.read_parquet). I do have my own function that will read them in as Pandas DataFrames (the function only works on one file at a time, akin to pd.read_csv). I would like to have all of these individual Pandas DataFrames in one large Dask DataFrame.

Minimum Working Example

Here's some example CSV data (my data isn't actually in CSVs, but I'm using it here for ease of example). To create a minimum working example, you can save this as a CSV and make a few copies, then use the code below:

"gender","race/ethnicity","parental level of education","lunch","test preparation course","math score","reading score","writing score"
"female","group B","bachelor's degree","standard","none","72","72","74"
"female","group C","some college","standard","completed","69","90","88"
"female","group B","master's degree","standard","none","90","95","93"
"male","group A","associate's degree","free/reduced","none","47","57","44"
"male","group C","some college","standard","none","76","78","75"
from glob import glob
import pandas as pd
import dask.bag as db

files = glob('/path/to/your/csvs/*.csv')
bag = db.from_sequence(files).map(pd.read_csv)
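
Each element of the resulting bag is a plain pandas DataFrame; a quick sanity check (assuming the glob matched at least one file) could be:

first, = bag.take(1)   # take() computes and returns elements of the bag as a tuple
print(type(first))     # <class 'pandas.core.frame.DataFrame'>
print(first.head())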

What I've tried so far

import pandas as pd
import dask.bag as db
import dask.dataframe as dd

# Create a Dask bag of pandas dataframes
bag = db.from_sequence(list_of_files).map(my_reader_function)

df = bag.map(lambda x: x.to_records()).to_dataframe()  # this doesn't work
df = bag.map(lambda x: x.to_dict(orient=<any option>)).to_dataframe()  # neither does this

# This gets me really close. It's a bag of Dask DataFrames,
# but I can't figure out how to concatenate them together.
df = bag.map(dd.from_pandas, npartitions=1)

df = dd.from_delayed(bag) # returns an error
natemcintosh
  • The example data you have provided can easily be converted to a pandas data frame, saved as different files and the latter can be read into a single dask data frame. So, it is not clear why you need to use a dask bag and a conversion from it to a dask data frame. – KRKirov Dec 13 '19 at 17:25
  • You ask a good question. I should clarify: I have several million of these files that I need to read in, each anywhere from a few KB to many MB. Hence, I do not have the space to create dask-readable copies of the files. – natemcintosh Dec 13 '19 at 17:34
  • If your custom read function outputs a single pandas data frame, can't you convert the output to a dask data frame, and keep doing this and appending to the dask data frame? – KRKirov Dec 13 '19 at 17:46
  • This is indeed pretty close to what I want! However, I have yet to find a way to do this that doesn't read in all of the files into memory at the same time. `bag = db.from_sequence(files).map(pd.read_csv).map(dd.from_pandas, npartitions=1)` and then `dd.concat(bag.compute())` will indeed work; but it reads everything into memory, which won't work for millions of files – natemcintosh Dec 13 '19 at 19:18

3 Answers


I recommend using dask.delayed with dask.dataframe. There is a good example of doing what you want to do here:

https://docs.dask.org/en/latest/delayed-collections.html
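
A minimal sketch of that pattern (assuming, as in the question, that my_reader_function reads one file into a pandas DataFrame):

import dask
import dask.dataframe as dd

# Turn each file read into a lazy task instead of reading eagerly
delayed_dfs = [dask.delayed(my_reader_function)(path) for path in list_of_files]

# Read one file eagerly and keep only its empty slice to describe columns/dtypes
meta = my_reader_function(list_of_files[0]).iloc[:0]

# One Dask DataFrame with one partition per file; nothing is read until compute()
df = dd.from_delayed(delayed_dfs, meta=meta)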

MRocklin

Here are two additional possible solutions:

1. Convert the bag to a list of DataFrames, then use dd.multi.concat:

bag  # a Dask Bag of pandas DataFrames
list_of_dfs = bag.compute()  # note: this materializes every DataFrame in memory
df = dd.multi.concat(list_of_dfs).compute()

2. Convert to a bag of dictionaries and use bag.to_dataframe:

bag_of_dicts = bag.map(lambda df: df.to_dict(orient='records')).flatten()
df = bag_of_dicts.to_dataframe().compute()

In my own specific use case, option #2 had better performance than option #1.


If you already have a bag of DataFrames then you can do the following:

  1. Convert the bag to its delayed partitions,
  2. turn each delayed partition (a list of pandas DataFrames) into a single delayed DataFrame by concatenating,
  3. create a Dask DataFrame from these delayed objects.

In Python code:

import dask
import dask.dataframe
import pandas

def bag_to_dataframe(bag, **concat_kwargs):
    # Each delayed partition evaluates to a list of pandas DataFrames
    partitions = bag.to_delayed()
    # Concatenate every partition into a single delayed DataFrame
    dataframes = map(
        dask.delayed(lambda partition: pandas.concat(partition, **concat_kwargs)),
        partitions,
    )
    return dask.dataframe.from_delayed(dataframes)

You might want to control the concatenation of partitions, for example to ignore the index.
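
For example, passing pandas.concat's ignore_index option through the helper might look like this (shown only as a sketch, using the bag of DataFrames from the question):

ddf = bag_to_dataframe(bag, ignore_index=True)  # drop each file's original index
result = ddf.compute()  # files are only read and concatenated at this point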

Dror Speiser