
I regularly use dask.dataframe to read multiple files, like so:

import dask.dataframe as dd

df = dd.read_csv('*.csv')

However, the origin of each row, i.e. which file the data was read from, seems to be forever lost.

Is there a way to add the source as a column, e.g. df.loc[:100, 'partition'] = 'file1.csv' if file1.csv is the first file read and contains 100 rows? This would be applied to each "partition" / file as it is read into the dataframe, when compute is triggered as part of a workflow.

The idea is that different logic can then be applied depending on the source.

jpp
    You may want to look at this example building a custom reader using dask.delayed: https://gist.github.com/mrocklin/e7b7b3a65f2835cda813096332ec73ca – MRocklin Feb 14 '18 at 20:12

2 Answers


Dask functions read_csv, read_table, and read_fwf now include a parameter include_path_column:

include_path_column : bool or str, optional
Whether or not to include the path to each particular file.
If True a new column is added to the dataframe called path.
If str, sets new column name. Default is False.
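
For example, a minimal sketch assuming your files match the *.csv pattern from the question (passing a string sets the name of the added column):

import dask.dataframe as dd

# the string value names the added column; True would name it 'path'
df = dd.read_csv('*.csv', include_path_column='partition')

# every row now carries the path of the file it came from
print(df['partition'].unique().compute())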
PeterVermont

Assuming you have, or can build, a list file_list containing the path of each csv file, and each individual file fits in RAM (you mentioned 100 rows), then this should work:

import os

import pandas as pd
import dask.dataframe as dd
from dask import delayed

def read_and_label_csv(filename):
    # read one csv file into a pandas.DataFrame
    df_csv = pd.read_csv(filename)
    # label every row with the name of its source file
    df_csv['partition'] = os.path.basename(filename)
    return df_csv

# build a list of delayed objects, each of which will produce a pandas.DataFrame
dfs = [delayed(read_and_label_csv)(fname) for fname in file_list]
# assemble the delayed pandas.DataFrames into a dask.DataFrame
ddf = dd.from_delayed(dfs)

With some customization, of course. If your csv files are bigger than RAM, then concatenating per-file dask DataFrames is probably the way to go, as sketched below.
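
A minimal sketch of that alternative, assuming the same file_list as above: each file is read lazily with dd.read_csv, labelled with its file name, and the pieces are concatenated.

import os
import dask.dataframe as dd

# one lazy dask.DataFrame per file, each labelled with its source file name
ddfs = [
    dd.read_csv(fname).assign(partition=os.path.basename(fname))
    for fname in file_list
]
# concatenate the per-file dask.DataFrames into a single dask.DataFrame
ddf = dd.concat(ddfs)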

kingfischer