
I am trying to create a DataFrame by reading a CSV file separated by '#####' (5 hashes).

The code is:

import dask.dataframe as dd

# raw string, since in 'D:\temp.csv' the '\t' would be read as a tab character
df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python')
res = df.compute()

Error is:

dask.async.ValueError:
Dask dataframe inspected the first 1,000 rows of your csv file to guess the
data types of your columns.  These first 1,000 rows led us to an incorrect
guess.

For example a column may have had integers in the first 1000
rows followed by a float or missing value in the 1,001-st row.

You will need to specify some dtype information explicitly using the
``dtype=`` keyword argument for the right column names and dtypes.

    df = dd.read_csv(..., dtype={'my-column': float})

Pandas has given us the following error when trying to parse the file:

  "The 'dtype' option is not supported with the 'python' engine"

Traceback
---------
  File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 263, in execute_task
    result = _execute_task(task, data)
  File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 245, in _execute_task
    return func(*args2)
  File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/dataframe/io.py", line 69, in _read_csv
    raise ValueError(msg)

How do I get rid of that?

If I follow the error message, I would have to give a dtype for every column, but with 100+ columns that is not practical.

And if I read the file without the separator, everything goes fine, but then there is ##### everywhere. So after computing it to a pandas DataFrame, is there a way to get rid of that?
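For reference, a rough sketch of what that post-processing could look like (this assumes the data itself contains no commas, so a default read leaves each row as a single unsplit column):

import pandas as pd

# read each line into one object column, then split on the 5-hash delimiter
raw = pd.read_csv(r'D:\temp.csv', header=None)
split = raw[0].str.split('#####', expand=True)
split.columns = split.iloc[0]                  # promote the first row to the header
split = split.iloc[1:].reset_index(drop=True)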

Please help me with this.

pouyan021
Satya
  • Does the engine specifically need to be python here? Won't the default be `c`, and will it just work if you set it to `c`? – EdChum Dec 14 '15 at 11:57
  • @EdChum – when I try to read the csv without specifying the engine I get the warning: /home/ec2-user/anaconda3/lib/python3.4/site-packages/pandas/io/parsers.py:648: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python' – Satya Dec 14 '15 at 11:59
  • how about setting sep to `sep=r'#####'` – EdChum Dec 14 '15 at 12:01
  • same error with sep=r'#####' – Satya Dec 14 '15 at 12:02
  • One option would be to just read the first rows and then re-read the df again passing the dtypes: `dtypes_dict = dd.read_csv('D:\temp.csv', sep='#####', engine='python', nrows=2).dtypes.to_dict()` then read it again: `df = dd.read_csv('D:\temp.csv', sep='#####', engine='python', dtype=dtypes_dict)` – EdChum Dec 14 '15 at 12:10
  • "The 'dtype' option is not supported with the 'python' engine" that's what the error says. – Satya Dec 14 '15 at 12:15
  • You could replace the errant values as a post-processing step, but your `dtypes` will be screwed up and will probably all become `str`. You could strip the duff separator out, write the file out again and read it back in to clean the data (see the sketch after this thread) – EdChum Dec 14 '15 at 12:16
  • will give that a try. – Satya Dec 14 '15 at 12:17
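A minimal sketch of that strip-the-separator idea (hypothetical file names; it assumes the data itself contains no commas, since the 5 hashes are rewritten to a comma):

import dask.dataframe as dd

# rewrite the file with a single-character separator so the fast 'c' engine,
# which supports the dtype= keyword, can parse it
with open(r'D:\temp.csv') as src, open(r'D:\temp_clean.csv', 'w') as dst:
    for line in src:
        dst.write(line.replace('#####', ','))

df = dd.read_csv(r'D:\temp_clean.csv')  # default engine='c'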

2 Answers


Read the entire file in with dtype=object, meaning all columns will be interpreted as type object. This should read in correctly, getting rid of the ##### separators in each row. From there you can turn it into a pandas DataFrame using the compute() method. Once the data is in a pandas DataFrame, you can use the pandas infer_objects method to update the dtypes without having to hard-code them.

import dask.dataframe as dd

# raw string so the Windows path is not mangled by escape sequences
df = dd.read_csv(r'D:\temp.csv', sep='#####', dtype='object').compute()
res = df.infer_objects()
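To confirm the soft conversion, compare the dtypes before and after (infer_objects leaves genuinely mixed columns as object):

print(df.dtypes)   # all object straight after the read
print(res.dtypes)  # numeric columns converted where the values allow it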
Benjamin Cohen

If you want to keep the entire file as a dask dataframe, I had some success with a dataset with a large number of columns simply by increasing the number of bytes sampled in read_csv.

For example:

import dask.dataframe as dd

df = dd.read_csv(r'D:\temp.csv', sep='#####', sample=1000000)  # increase sample to 1e6 bytes
df.head()

This can resolve some type-inference issues, although unlike Benjamin Cohen's answer, you would need to find the right value to choose for sample.
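For instance, you can sanity-check what the enlarged sample inferred before calling compute():

print(df.dtypes)  # dtypes guessed from the larger byte sample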

Will C