With Dask we can easily read a CSV file and take its first lines with head, even across multiple partitions:

import dask.dataframe as dd
df = dd.read_csv('data.csv').head(n=100, npartitions=2)

But I would like to read the last lines of my CSV file across multiple partitions, something like this:

import dask.dataframe as dd
df = dd.read_csv('data.csv').tail(n=100, npartitions=2)

The tail method of a Dask DataFrame doesn't seem to support the npartitions argument.

In pandas I could manage it with skiprows, but this option does not seem to be available in Dask.

jpp
Thomas
  • Since this task is not memory-heavy (and so doesn't require `dask`), why not use `pandas` ? – jpp Mar 14 '18 at 10:22
  • My pipeline essentially uses `dask`, but `pandas` could be an option, even though I read millions of lines (I have a huge CSV file...). – Thomas Mar 14 '18 at 13:21

1 Answer

You seem to have answered your own question. The tail method exists:

import dask.dataframe as dd
df = dd.read_csv('data.csv').tail(n=100)

See the DataFrame API.

MRocklin
  • Sorry, I edited my post. The problem with the `tail` implementation in `dask` is that, according to the documentation, "Caveat, the only checks the last n rows of the last partition." So I can only use the last partition, but I would like to use more than one. – Thomas Mar 14 '18 at 13:25
  • Nope, that is not currently supported. You'll have to raise an issue or, better yet, provide a fix with a pull request. – MRocklin Mar 14 '18 at 13:33
  • OK, thanks for the answer! I'm not sure I'll be able to provide a pull request, but I'll take a look. – Thomas Mar 14 '18 at 13:35