using dask read_csv to read filename as a column name

Question

I am importing 4000+ CSV files, all with the same columns, columns=['Date', 'Datapoint']. Importing the CSVs into dask is straightforward and works fine for me.

file_paths = '/root/data/daily/'
df = dd.read_csv(file_paths+'*.csv',
                 delim_whitespace=True,
                 names=['Date','Datapoint'])

What I am trying to achieve is to name the 'Datapoint' column after the filename of each .csv. I know you can set a column to the path using include_path_column=True, but I am wondering if there is a simple way to use that path name as a column name without having to run a separate step down the line.

score 6 · Answer 1 · answered Oct 26 '19 at 02:01

I was able to do this fairly straightforwardly using dask's delayed function:

import glob
import os

import pandas as pd
import dask.dataframe as dd
from dask import delayed

path = r'/root/data/daily' # use your path
file_list = glob.glob(path + "/*.csv")

def read_and_label_csv(filename):
    # reads each csv file to a pandas.DataFrame
    df_csv = pd.read_csv(filename,
                         delim_whitespace=True,
                         names=['Date','Close'])
    # assumption: the column name is the file name without its extension
    path_2_column = os.path.splitext(os.path.basename(filename))[0]
    df_csv.rename(columns={'Close':path_2_column}, inplace=True)
    return df_csv

# create a list of functions ready to return a pandas.DataFrame
dfs = [delayed(read_and_label_csv)(fname) for fname in file_list]
# using delayed, assemble the pandas.DataFrames into a dask.DataFrame
ddf = dd.from_delayed(dfs)
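
Because every file contributes its own column, two files that reduce to the same name would collide. The following is a minimal sketch of a pre-check that uses only the standard library. The helper name `column_names` and the rule "file name without directory or extension" are assumptions for illustration, not part of the answer above.

```python
from collections import Counter
from pathlib import PurePosixPath


def column_names(file_list):
    """Map each path to a column name (file name without its extension).

    Raises ValueError if two paths would produce the same column name.
    """
    names = {f: PurePosixPath(f).stem for f in file_list}
    counts = Counter(names.values())
    dupes = sorted(n for n, c in counts.items() if c > 1)
    if dupes:
        raise ValueError(f"duplicate column names: {dupes}")
    return names
```

Running this on `file_list` before building the delayed objects would surface clashes early, rather than after thousands of files have been read.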

score 2 · Answer 2 · answered Oct 28 '19 at 16:04

It is unclear to me what exactly you are trying to accomplish. If you are just trying to change the name of the column that the filepaths are written to, you can set include_path_column='New Column Name'. If you are naming a column based on the path to each file, it seems like you'll get a rather sparse array once the data are concatenated and I would argue that a groupby would probably work better.

jsignell

I have 4000+ text files, all with a datetime index. Each file contains a single datapoint column. How else am I supposed to select only certain files for a calculation if I do not know their names? – blonc Oct 30 '19 at 01:06
Ah ok, that makes sense. I don't think there is a way to do that out of the box. If pivot were supported, that might be a good option. I guess to simplify your answer you could set the column name to some part of the filename in the read_csv call itself: `df_csv = pd.read_csv(filename, delim_whitespace=True, names=['Date', path_2_column])` – jsignell Nov 07 '19 at 22:55
