I have a set of .parquet files on my local machine that I am trying to upload to a container in Data Lake Gen2.

The following does not work:

def upload_file_to_directory():
    try:
        file_system_client = service_client.get_file_system_client(file_system="my-file-system")

        directory_client = file_system_client.get_directory_client("my-directory")

        file_client = directory_client.create_file("uploaded-file.parquet")
        local_file = open("C:\\file-to-upload.parquet", 'r')

        file_contents = local_file.read()

        file_client.append_data(data=file_contents, offset=0, length=len(file_contents))

        file_client.flush_data(len(file_contents))

    except Exception as e:
        print(e)

because the .parquet file cannot be read by the .read() function (it is a binary file, and it is being opened in text mode).

When I try to do this:

def upload_file_to_directory():

    file_system_client = service_client.get_file_system_client(file_system="my-file-system")

    directory_client = file_system_client.get_directory_client("my-directory")

    file_client = directory_client.create_file("uploaded-file.parquet")
    file_client.upload_file("C:\\file-to-upload.parquet", 'r')


I get the following error:

AttributeError: 'DataLakeFileClient' object has no attribute 'upload_file'

Any suggestions?

OneCricketeer

1 Answer

You are receiving this error because DataLakeFileClient has no upload_file method; the method you want is upload_data. Both DataLakeFileClient and DataLakeServiceClient come from the azure-storage-file-datalake package:

pip install azure-storage-file-datalake
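
With that in place, your second snippet needs only one change: open the file in binary mode and pass its bytes to upload_data. A minimal sketch (local path assumed):

file_client = directory_client.create_file("uploaded-file.parquet")

# parquet is a binary format, so the local file must be opened in binary mode ('rb')
with open("C:\\file-to-upload.parquet", "rb") as local_file:
    file_client.upload_data(local_file.read(), overwrite=True)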

However, to read the .parquet file, one workaround is to use pandas. Below is the code that worked for me.

storage_account_name = '<ACCOUNT_NAME>'
storage_account_key = '<ACCOUNT_KEY>'

service_client = DataLakeServiceClient(
    account_url="{}://{}.dfs.core.windows.net".format("https", storage_account_name),
    credential=storage_account_key)

file_system_client = service_client.get_file_system_client(file_system="container")

directory_client = file_system_client.get_directory_client(directory="directory")

file_client = directory_client.create_file("uploaded-file.parquet")

# read_parquet already returns a DataFrame; to_parquet() with no path returns the parquet bytes
df = pd.read_parquet("<YOUR_FILE_NAME>.parquet").to_parquet()

file_client.upload_data(data=df, overwrite=True)  # either of these two lines works
#file_client.append_data(data=df, offset=0, length=len(df))
file_client.flush_data(len(df))  # required after append_data; upload_data commits on its own

You will need the following imports to make this work:

from azure.storage.filedatalake import DataLakeServiceClient
import pandas as pd
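
To sanity-check the upload, you can read the file back and load it with pandas. A quick sketch, reusing the file_client from above:

import io

# download the uploaded file and round-trip it through pandas
downloaded_bytes = file_client.download_file().readall()
df_check = pd.read_parquet(io.BytesIO(downloaded_bytes))
print(df_check.head())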

RESULTS:

(screenshot of the uploaded file in the container)

SwethaKandikonda
  • This is a good answer. I'll only note that the Python API may have changed since June. DataLakeServiceClient has an upload() method but for .NET, not for Python, and DataLakeFileClient now has an upload_data() method. – MisterJT Oct 26 '22 at 15:57