
I'm trying to port some Pandas code to Dask, and I'm encountering an issue when reading the CSVs - it appears that Dask prepends the local working directory to the file path in the read operation. It works fine when I read using Pandas.

I'm using Windows 10. Working directory is on my C drive; data is in my D drive.

Pandas code:

import pandas as pd

file_path = 'D:/test_data/'
item = 'filename.csv'
temp_df = pd.read_csv(file_path + item, usecols=['time', 'ticker_price'])

Output of print(temp_df.head()):

                         time  ticker_price
0  2019-05-15 09:34:09.233373       0.02843
1  2019-05-15 09:34:11.334135       0.02843
2  2019-05-15 09:34:12.147282       0.02843
3  2019-05-15 09:34:13.705145       0.02843
4  2019-05-15 09:34:14.521257       0.02843
type = <class 'pandas.core.frame.DataFrame'>

Dask code:

import dask.dataframe as dd

file_path = 'D:/test_data/'
item = 'filename.csv'
temp_dd = dd.read_csv(file_path + item, usecols=['time', 'ticker_price'])

Output of print(temp_dd.head()):

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Dan\\PycharmProjects\\project1_folder/D:/test_data/filename.csv'

It looks like Dask is prepending the path of my local working directory (the PycharmProjects folder) to the file_path of my data on the D drive, while Pandas does not. Are there any solutions for this?

A few things I tried that did not work:

(1)

import pathlib

temp_file_path_str = pathlib.Path(file_path + item)
temp_dd = dd.read_csv(temp_file_path_str, usecols=['time', 'ticker_price'])

This returns the same error:

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Dan\\PycharmProjects\\project1_folder/D:\\test_data\\filename.csv'

(2)

temp_file_path_str = 'file://' + file_path + item
temp_dd = dd.read_csv(temp_file_path_str, usecols=['time', 'ticker_price'])

This returns an error suggesting that Dask stripped the drive letter from the path:

FileNotFoundError: [WinError 3] The system cannot find the path specified: '\\test_data\\filename.csv'

(3)

temp_file_path_str = 'file://' + file_path + item
temp_file_path_str = pathlib.Path(temp_file_path_str)
temp_dd = dd.read_csv(temp_file_path_str, usecols=['time', 'ticker_price'])

This seems to add an extra `\` before the drive letter in the path:

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: '\\D:\\test_data\\filename.csv'

Update 6/1/19 - I created an issue for this: https://github.com/dask/dask/issues/4861

  • Am I wrong, or should your path in Windows be `'D:\\test_data\*'`? – rpanai May 27 '19 at 22:01
  • Thanks, @rpanai. It's odd that the way I called the file worked fine in Pandas, but in any case, I resolved this issue with the following: `temp_file_path_str = file_path + item` `temp_file_path_str = temp_file_path_str.replace('/', '\\')` – dan May 29 '19 at 01:18
  • Sometimes it is better to use `os.path.join`, as this produces the right path for the given OS. – rpanai May 29 '19 at 13:32
  • Update/Correction - I did not resolve the issue with Dask. I am working around it by reading the CSVs using Pandas and then converting the Pandas dataframes into Dask using the `from_pandas` function. – dan May 30 '19 at 13:26
  • If you think that this is a bug then I recommend raising an issue on GitHub – MRocklin Jun 02 '19 at 19:17
  • Thanks, all. I had opened an issue on GitHub (https://github.com/dask/dask/issues/4861), which I have now closed. The error appears to have been raised because a distributed worker tried to read the file. The answer here was helpful in resolving my issue: https://stackoverflow.com/questions/50987030/file-not-found-error-in-dask-program-run-on-cluster The weird file path used in the traceback seems to have been a red herring. – dan Jun 04 '19 at 07:05
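A minimal sketch of the path-handling suggestions from the comments, plus the `from_pandas` workaround described above. The drive, directory, and filename are the ones from the question; the `npartitions` value is an arbitrary example, not something from the original post:

```python
import os
import pathlib

file_path = 'D:/test_data/'
item = 'filename.csv'

# os.path.join (rpanai's suggestion) joins components using the
# separator appropriate for the OS the code is running on
joined = os.path.join(file_path, item)

# PureWindowsPath renders Windows-style backslash separators regardless
# of the OS, useful for checking what the final path string looks like
normalized = str(pathlib.PureWindowsPath(file_path) / item)
print(normalized)  # D:\test_data\filename.csv

# The workaround from the comments: read with Pandas, then convert to
# Dask (requires dask to be installed; npartitions=4 is a made-up choice):
# import pandas as pd
# import dask.dataframe as dd
# temp_df = pd.read_csv(normalized, usecols=['time', 'ticker_price'])
# temp_dd = dd.from_pandas(temp_df, npartitions=4)
```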

0 Answers