
I have imported an excel file into a pandas dataframe and have completed the data exploration and cleaning process.

I now want to write the cleaned dataframe back to a csv file in Azure Data Lake, without saving it as a local file first. I am using Python 3.

My code looks like this:

token = lib.auth(tenant_id = '', 
                 client_secret ='', 
                 client_id = '')

adl = core.AzureDLFileSystem(token, store_name)

with adl.open(path='Raw/Gold/Myfile.csv', mode='wb') as f:
    **in_xls.to_csv(f, encoding='utf-8')**
    f.close()

I get the following error on the statement in bold:

TypeError: a bytes-like object is required, not 'str'

I also tried the following, but without any luck:

with adl.open(path='Raw/Gold/Myfile.csv', mode='wb') as f:
    with io.BytesIO(in_xls) as byte_buf:
        byte_buf.to_csv(f, encoding='utf-8')
        f.close()

I am getting the below error:

TypeError: a bytes-like object is required, not 'DataFrame'

Any ideas/tips would be much appreciated.
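(As far as I can tell, the first error is plain Python behaviour rather than anything ADLS-specific: a stream opened in binary mode rejects `str` input. The same thing happens locally with `io.BytesIO` standing in for the ADLS handle:)

```python
import io

buf = io.BytesIO()  # stand-in for the binary ADLS file handle
try:
    buf.write("plain text")              # str into a binary stream
except TypeError as exc:
    print(exc)                           # a bytes-like object is required, not 'str'
buf.write("plain text".encode("utf-8"))  # bytes are accepted
```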

Juanita Smith
  • does it work without the `b` mode? ie. `adl.open(path='Raw/Gold/Myfile.csv', mode='w')` – EdChum Feb 23 '17 at 11:10
  • No, just 'w' is not supported in Python 3. Only binary... – Juanita Smith Feb 23 '17 at 11:28
  • I asked our PM for the SDK to follow up. However, can you please tell me why you would want to use client side Python scripts that download data from ADLS and then upload data again instead of using U-SQL (possibly with the Python extension) that operates directly on the data in the cloud? – Michael Rys Feb 26 '17 at 19:10
  • Have you tried just writing the bytes directly to the file handle? If so, do you get the same error? Something like `f = adl.open(path='Raw/Gold/Myfile.csv', mode='wb')` ... `f.write()` – Matt H Feb 27 '17 at 20:58
  • We want to do normalization on a raw data, and write it back to gold, this is the main aim. The program has to deal with zipped Excel files, which cannot be handled in PySpark as far as we know. Our team of data scientists know python, not U-SQL. In the end we gave up, switched back to python 2, where it is working perfectly – Juanita Smith Mar 21 '17 at 14:42

2 Answers


I got this working with pandas the other day on Python 3.x. This code runs on an on-premises machine and connects to the Azure Data Lake store in the cloud.

Assuming df is a pandas dataframe, you can use the following code:

# token is your login token, created by whatever ADLS login method you chose
# (personally I use the ServiceProvider login)
adl = core.AzureDLFileSystem(token, store_name='YOUR_ADLS_STORE_NAME')

df_str = df.to_csv()
with adl.open('/path/to/file/on/adls/newfile.csv', 'wb') as f:
    f.write(str.encode(df_str))

The key is converting the dataframe to a string and then using the str.encode() function.
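The same pattern can be sanity-checked locally, with `io.BytesIO` standing in for the `adl.open(..., 'wb')` handle (the dataframe here is made up for illustration):

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Render the whole CSV as text, then encode it to bytes for the binary handle.
csv_bytes = df.to_csv(index=False).encode("utf-8")

with io.BytesIO() as f:  # stand-in for adl.open(path, 'wb')
    f.write(csv_bytes)
    print(f.getvalue().decode("utf-8"))  # a,b / 1,x / 2,y
```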

Hope this helps.

ShowMeTheData

For anyone visiting this question after 2020: this issue was solved in pandas 1.2.0 (support-for-binary-file-handles-in-to-csv).

From pandas 1.2.0 onward, this works:

adl_fs = AzureDLFileSystem(your_adl_creds, store_name='your_store_name')
with adl_fs.open('/path/file.csv', mode='wb') as adl_file:
    some_pandas_df.to_csv(adl_file)

The earlier solution of creating an intermediate string variable was valid back in 2017 when it was written, but it carries the caveat of unnecessary memory consumption, which can be a problem when working with big files. (This is how I came across this question, BTW.)
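A quick local check of the binary-handle support, again with `io.BytesIO` standing in for the ADLS file object (requires pandas >= 1.2.0):

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# Since pandas 1.2.0, to_csv accepts a binary handle directly and handles the
# text encoding itself, avoiding a separate full-size string copy of the CSV.
with io.BytesIO() as f:  # stand-in for adl_fs.open(path, mode='wb')
    df.to_csv(f, index=False, encoding="utf-8")
    print(f.getvalue())  # b'a\n1\n2\n'
```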