Read tail by partition from CSV file with dask.dataframe

Question

With Dask we can easily read CSV files and take the first lines with head, even across multiple partitions.

import dask.dataframe as dd
df = dd.read_csv('data.csv').head(n=100, npartitions=2)

But I would like to read the last lines of my CSV file across multiple partitions, something like this:

import dask.dataframe as dd
df = dd.read_csv('data.csv').tail(n=100, npartitions=2)

dask.dataframe doesn't seem to support an npartitions argument on the tail method.

In pandas I could manage it with skiprows, but this option doesn't seem to be available in Dask.

Since this task is not memory-heavy (and so doesn't require `dask`), why not use `pandas`? – jpp Mar 14 '18 at 10:22
My pipeline mostly uses `dask`, but that could be an option, even if I have to read millions of lines (I have a huge CSV file...). – Thomas Mar 14 '18 at 13:21

score 0 · Answer 1 · answered Mar 14 '18 at 13:23


You seem to have answered your own question. The tail method exists:

import dask.dataframe as dd
df = dd.read_csv('data.csv').tail(n=100)

answered Mar 14 '18 at 13:23

MRocklin

Sorry, I edited my post. The problem with the `tail` implementation in `dask` is that, according to the documentation, "Caveat, the only checks the last n rows of the last partition.". So I can only use the last partition, but I would like to use more than one partition. – Thomas Mar 14 '18 at 13:25

Nope, that is not currently supported. You'll have to raise an issue, or better yet, provide a fix with a pull request. – MRocklin Mar 14 '18 at 13:33
OK, thanks for the answer! I'm not sure I'll be able to provide a pull request, but I'll take a look. – Thomas Mar 14 '18 at 13:35
