I have a CSV file that pandas reads without problems, but reading it as a Dask dataframe fails. I am passing exactly the same parameters to both and still get an error with Dask.

Pandas use case:

import pandas as pd
mycols = ['id', 'tran_id', 'client_id', 'm_text', 'retry', 'tran_date']

df = pd.read_csv('s3://some_bucket/abigd/hed4.csv',
                 sep=',', header=None, names=mycols,
                 skipinitialspace=True, escapechar='\\',
                 engine='python', dtype=str)

Pandas output:

df.retry.value_counts()

1     2792174
2      907081
3      116369
6        6475
4        5598
7        1314
5        1053
8         288
16          3
13          3
Name: retry, dtype: int64

Dask code:

import dask.dataframe as dd
from dask.distributed import Client
client = Client('Dask-Scheduler.local-dask:8786') 

df = dd.read_csv('s3://some_bucket/abigd/hed4.csv',
                 sep=',', header=None, names=mycols,
                 skipinitialspace=True, escapechar='\\',
                 engine='python', dtype=str,
                 storage_options={'anon': False, 'key': 'xxx', 'secret': 'xxx'})


df_persisted = client.persist(df)

df_persisted.retry.value_counts().compute()

Dask output:

ParserError: unexpected end of data

I have opened both smaller and larger files with Dask without any issue. This particular file may contain unclosed quotation marks, but since pandas reads it with the same parameters, I cannot see why Dask is unable to read it.

shantanuo

1 Answer

Dask splits your file into blocks by looking for the line separator character b"\n". It searches for this single byte within parts of the file, so that the whole file does not need to be read beforehand. When it finds one, it has no way of knowing whether that byte is escaped or sits inside a quoted field.

Thus, the chunking up of a large file by Dask can fail, and it appears that this is happening for you: some block is finishing on a newline which is not really a line ending.
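
To make the failure mode concrete, here is a minimal standard-library sketch. It is not Dask's actual code: split_on_newline is a hypothetical helper that cuts at the first b"\n" after an offset without looking at quoting, and the sample data is made up. Python's csv module in strict mode stands in for the parser.

```python
import csv
import io

def split_on_newline(data: bytes, approx_offset: int):
    """Cut data at the first b"\\n" at or after approx_offset, ignoring quoting.

    Hypothetical stand-in for newline-based block splitting.
    """
    cut = data.find(b"\n", approx_offset)
    if cut == -1:
        return [data]
    return [data[:cut + 1], data[cut + 1:]]

def parses_cleanly(chunk: bytes) -> bool:
    """Return True if the chunk parses as CSV in strict mode without an error."""
    stream = io.StringIO(chunk.decode("utf-8"), newline="")
    reader = csv.reader(stream, strict=True)
    try:
        for _ in reader:
            pass
    except csv.Error:
        return False
    return True

# Made-up data: the first record has a newline inside a quoted field.
data = b'1,"first line\nsecond line",a\n2,"plain",b\n'

whole_ok = parses_cleanly(data)       # the file as a whole is valid CSV
chunks = split_on_newline(data, 5)    # the cut lands inside the quoted field
chunk_ok = [parses_cleanly(c) for c in chunks]

print(whole_ok, chunk_ok)
```

The whole input parses, but the first chunk ends with a quote still open, which is the kind of situation in which a parser reports an unexpected end of data. Whether a given file triggers this depends on where the block boundaries happen to fall, which would be consistent with other files of different sizes loading fine.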

mdurant
  • Can you check what exactly is the issue with this file? I can share the file if you send me an email shantanu dot oak at world's most popular email provider. – shantanuo Jan 18 '19 at 02:40
  • You need to check for line-end characters that are not really line ends – mdurant Jan 18 '19 at 03:05
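
One way to run the check suggested in the last comment is sketched below, using only the standard library. The function name find_multiline_records is hypothetical, the sample text is made up, and the parsing options only approximate the read_csv parameters from the question (skipinitialspace and a backslash escapechar). It reports every record that spans more than one physical line, which is where a newline is not really a line end.

```python
import csv
import io

def find_multiline_records(text_stream, escapechar="\\"):
    """Yield (first_line, last_line) for each CSV record spanning several physical lines."""
    reader = csv.reader(text_stream, skipinitialspace=True, escapechar=escapechar)
    previous_end = 0
    for _ in reader:
        # reader.line_num counts physical lines consumed so far.
        if reader.line_num - previous_end > 1:
            yield previous_end + 1, reader.line_num
        previous_end = reader.line_num

# Made-up sample: the second record contains a newline inside quotes.
sample = 'a,b\n1,"x\ny"\n2,z\n'
spans = list(find_multiline_records(io.StringIO(sample, newline="")))
print(spans)

# For a real file, assuming a local copy exists at a path of your choosing:
# with open("hed4.csv", newline="") as f:
#     for first, last in find_multiline_records(f):
#         print(first, last)
```

If such records turn up, one option is to pass blocksize=None to dd.read_csv, which reads each file as a single partition so that no split can land inside a record, at the cost of parallelism within that file.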