I have a time series whose values are stored across several CSV files. Each file is sorted and contains a column seconds that holds the scan time.

    import dask.dataframe as dd

    df = dd.read_csv('/home/data/derived/ips_subnets.7days/*')
    df.head()

          seconds                IP        subnet
    0  1477252800  Private-10.0.0.0   10.101.15.6
    1  1477252800  Private-10.0.0.0  10.102.223.2
    2  1477252800  Private-10.0.0.0  10.104.15.43
    3  1477252800  Private-10.0.0.0  10.104.5.241
    4  1477252800  Private-10.0.0.0  10.106.15.26  

How can I make sure the CSV files are read in order of the seconds column?

Donbeo

1 Answer

By default, dask.dataframe.read_csv reads files in alphabetical order of their filenames, so if your filenames follow a sortable naming scheme, such as 2016-05-06.csv, you should be OK.
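To see why the naming scheme matters, here is a small plain-Python illustration of alphabetical (lexicographic) ordering. The filenames are made up for the example, and the snippet only mimics the ordering; it does not call dask.

```python
# Hypothetical filenames, for illustration only.
iso_names = ["2016-05-10.csv", "2016-05-06.csv", "2016-04-30.csv"]
unpadded_names = ["scan-10.csv", "scan-2.csv", "scan-1.csv"]

# ISO dates sort chronologically because every field is zero-padded.
print(sorted(iso_names))
# ['2016-04-30.csv', '2016-05-06.csv', '2016-05-10.csv']

# Unpadded counters do not: "10" sorts before "2" when compared as text.
print(sorted(unpadded_names))
# ['scan-1.csv', 'scan-10.csv', 'scan-2.csv']
```

If your files look like the second list, zero-padding the counter (scan-01.csv, scan-02.csv, ...) is usually enough to make alphabetical order match time order.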

If you need a different order, you can customize this with dask.delayed. Here is a similar example notebook.
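As a rough sketch of the dask.delayed route (not taken from the notebook): read the first seconds value of each file, sort the file list by it, and build the dataframe from one delayed pandas read per file. The helper names below are hypothetical, and the sketch assumes every file has a header row with a seconds column and at least one data row.

```python
import csv
import glob


def first_seconds(path):
    """Return the `seconds` value from the first data row of a CSV file."""
    with open(path, newline="") as f:
        row = next(csv.DictReader(f))
    return int(row["seconds"])


def files_in_time_order(pattern):
    """List the files matching `pattern`, ordered by their first `seconds` value."""
    return sorted(glob.glob(pattern), key=first_seconds)


def read_in_time_order(pattern):
    """Build a dask dataframe whose partitions follow the time order of the files."""
    # Imported here so the helpers above can be used without dask installed.
    import dask
    import dask.dataframe as dd
    import pandas as pd

    # One delayed pandas read per file; from_delayed keeps the list order.
    parts = [dask.delayed(pd.read_csv)(path) for path in files_in_time_order(pattern)]
    return dd.from_delayed(parts)
```

With this, df = read_in_time_order('/home/data/derived/ips_subnets.7days/*') would give one partition per file, in time order. Only the first row of each file is inspected, so this relies on each file already being sorted and on the files not overlapping in time.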

Finally, you can always call df = df.set_index('seconds'). However, this is much slower than the alternatives and requires a full scan of the data.

MRocklin