
My objective is to parallelize reading many (500+) CSV files containing measurement data. To do so, I pass a list of paths (source_files) to a synchronous client. Additionally, I have specified dtypes and column names (order_list).

df = dd.read_csv(source_files,
                 names=order_list,
                 include_path_column=True,
                 delimiter=';',
                 decimal='.',
                 dtype=dtype,
                 na_values='.',
                 assume_missing=True,
                 error_bad_lines=False)

df = CLIENT.compute(df).result()

For a corrupt line I get the following error message:

File "pandas\_libs\parsers.pyx", line 1164, in pandas._libs.parsers.TextReader._convert_tokens

ValueError: cannot safely convert passed user dtype of bool for float64 dtyped data in column 116

In rare cases the data logger messes up writing the log files, causing a float to appear where I'd expect a boolean. I am sure that the dtypes I'm passing to read_csv are correct and are satisfied by the vast majority of the CSV files.

Is there a way to identify which CSV file actually caused the error? It would also be nice to know which line of that file caused the exception.

Thank you in advance!

  • No, I don't - the list of files is submitted at once – boguspolenta Mar 20 '19 at 11:49
  • Have a look at [d6tstack](https://github.com/d6t/d6tstack/blob/master/examples-dask.ipynb) there is an open issue on [github](https://github.com/dask/dask/issues/4105) about this problem. – rpanai Mar 20 '19 at 11:55
  • @user32185 Thank you for the link to github. Seems like I'll have to check on the integrity before importing the files, as Mrocklin wrote: "If so there are some performance issues. In the general case it can be pretty expensive to read the metadata for every file upfront. Historically we've asked that people handle this ahead of time before using Dask." – boguspolenta Mar 21 '19 at 10:01
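
Following that upstream suggestion, one way to check the files before handing them to Dask is to read each one once with plain pandas using the same dtypes and to record which paths raise; pandas names the offending column in the ValueError. This is only a rough sketch: find_bad_files is a made-up helper name, and the Dask-specific arguments assume_missing and include_path_column are left out because pandas does not accept them.

import pandas as pd

def find_bad_files(source_files, order_list, dtype):
    """Return (path, error message) pairs for files that violate the expected dtypes."""
    bad_files = []
    for path in source_files:
        try:
            pd.read_csv(path,
                        names=order_list,
                        delimiter=';',
                        decimal='.',
                        dtype=dtype,
                        na_values='.')
        except ValueError as err:
            # pandas names the offending column in the message; keep it with the path.
            bad_files.append((path, str(err)))
    return bad_files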

1 Answer


Catch the exception:

Instead of

df = dd.read_csv(source_files,
                 names=order_list,
                 include_path_column=True,
                 delimiter=';',
                 decimal='.',
                 dtype=dtype,
                 na_values='.',
                 assume_missing=True,
                 error_bad_lines=False)

df = CLIENT.compute(df).result()

iterate over all of the files and catch the exception:

for source_file in source_files:
    try:
        df = dd.read_csv(source_file,
                         names=order_list,
                         include_path_column=True,
                         delimiter=';',
                         decimal='.',
                         dtype=dtype,
                         na_values='.',
                         assume_missing=True,
                         error_bad_lines=False)
    except ValueError:
        raise Exception('Could not read {}'.format(source_file))

This will tell you which file failed so you can check why. If none of them fail, just join the resulting dataframes into one big one and you are done.
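
As a rough sketch of that joining step (not part of the original code above: it collects the per-file frames in a list instead of raising, and joins them with dd.concat), it could look like the following. Note that dd.read_csv is lazy, so a dtype mismatch deep inside a file may only surface at the final compute rather than inside the try block.

import dask.dataframe as dd

parts = []
failed = []
for source_file in source_files:
    try:
        part = dd.read_csv(source_file,
                           names=order_list,
                           include_path_column=True,
                           delimiter=';',
                           decimal='.',
                           dtype=dtype,
                           na_values='.',
                           assume_missing=True,
                           error_bad_lines=False)
        parts.append(part)
    except ValueError:
        # Remember the offending path instead of aborting the whole run.
        failed.append(source_file)

print('Files that could not be read:', failed)

# Join the lazy per-file frames into one big dataframe and compute once at the end.
df = CLIENT.compute(dd.concat(parts)).result()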

E.Serra
  • This solution seems rather pragmatic, as it reduces the benefit of importing the CSV files in parallel. Could it be even faster to run pandas read_csv over the list of files, saving the overhead of spawning processes? – boguspolenta Mar 21 '19 at 10:08
  • It is not meant to be efficient, but to see which file is corrupted or to bypass the bug in pandas; usually loading the data is not the bottleneck, so this solution should be good enough. – E.Serra Mar 21 '19 at 12:24