dask csv reading order

Question

I have a time series whose values are stored in several CSV files. Each CSV file is sorted and contains a variable seconds that holds the timestamp.

    import dask.dataframe as dd
    df = dd.read_csv('/home/data/derived/ips_subnets.7days/*')
    df.head()

          seconds                IP        subnet
    0  1477252800  Private-10.0.0.0   10.101.15.6
    1  1477252800  Private-10.0.0.0  10.102.223.2
    2  1477252800  Private-10.0.0.0  10.104.15.43
    3  1477252800  Private-10.0.0.0  10.104.5.241
    4  1477252800  Private-10.0.0.0  10.106.15.26

How can I make sure the CSV files are read in order according to the variable seconds?

score 2 · Accepted Answer · answered Dec 03 '16 at 14:07

By default, dask.dataframe.read_csv reads files in alphabetical order, so if your filenames follow a standard naming scheme such as 2016-05-06.csv, you should be OK.
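If you are not sure whether alphabetical order matches time order for your files, one way to check is to sort the filenames by the first seconds value in each file and compare. This is only a sketch using the standard library: the helper names first_seconds and files_in_time_order are made up for illustration, and it assumes every file has a header row with a seconds column and at least one data row.

```python
import csv
import glob


def first_seconds(path):
    """Return the `seconds` value from the first data row of a CSV file.

    Assumes the file has a header row containing a `seconds` column.
    """
    with open(path, newline="") as f:
        row = next(csv.DictReader(f))
    return int(row["seconds"])


def files_in_time_order(pattern):
    """List the files matching `pattern`, ordered by their first `seconds` value."""
    return sorted(glob.glob(pattern), key=first_seconds)
```

If files_in_time_order(pattern) returns the same list as sorted(glob.glob(pattern)), then alphabetical order and time order agree for your data.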

If you want, you can customize this with dask.delayed. Here is a similar example notebook.

Finally, you can always call df = df.set_index('seconds'). However, this is much slower than the alternatives and requires a full scan of the data.
