I regularly use dask.dataframe
to read multiple files, like so:
import dask.dataframe as dd
df = dd.read_csv('*.csv')
However, the origin of each row, i.e. which file the data was read from, seems to be forever lost.
Is there a way to add this as a column, e.g. df.loc[:100, 'partition'] = 'file1.csv', if file1.csv
is the first file and contains 100 rows? This would be applied to each "partition" / file that is read into the dataframe, when compute
is triggered as part of a workflow.
The idea is that different logic can then be applied depending on the source.
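For reference, the only workaround I've sketched so far is to read each file separately, tag its rows, and concatenate the pieces, roughly as below. It works, but gives up the convenience of the single glob call, so I'm hoping there's a built-in way. (The column name partition is just illustrative.)

import dask.dataframe as dd
from glob import glob

# Workaround sketch: read each file on its own, tag every row with its
# source path, then stitch the pieces back into one lazy dataframe.
parts = [
    dd.read_csv(path).assign(partition=path)  # 'partition' is an illustrative name
    for path in sorted(glob('*.csv'))
]
df = dd.concat(parts)

# The tag survives the laziness: it is materialised per file when compute() runs,
# so downstream logic can branch on the source.
print(df.compute().groupby('partition').size())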