I am trying to upload my on-premises data to Azure Data Lake Storage. The data is about 10 GB in total and is divided into multiple folders. I have tried several ways to upload the files; each file ranges from a few KB to 56 MB, and all of them are binary data files.
First, I tried uploading them with the Python SDK for Azure Data Lake, using the following function:
def upload_file_to_directory_bulk(filesystem_name, directory_name, fname_local, fname_uploaded):
    try:
        file_system_client = service_client.get_file_system_client(file_system=filesystem_name)
        directory_client = file_system_client.get_directory_client(directory_name)
        file_client = directory_client.get_file_client(fname_uploaded)

        local_file = open(fname_local, 'r', encoding='latin-1')
        file_contents = local_file.read()

        file_client.upload_data(file_contents, length=len(file_contents), overwrite=True, validate_content=True)
    except Exception as e:
        print(e)
The problem with this function is that it either skips some of the files in the local folder entirely, or some of the uploaded files do not end up with the same size as the corresponding local file.
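For reference, this is the binary-mode variant I am considering instead. It is only a sketch (it assumes the same service_client as above, and upload_binary_file is just a name I picked); the main change is reading the file as raw bytes instead of decoding it as latin-1 text:

def upload_binary_file(filesystem_name, directory_name, fname_local, fname_uploaded):
    # Sketch only: assumes the same service_client (DataLakeServiceClient) as above.
    file_system_client = service_client.get_file_system_client(file_system=filesystem_name)
    directory_client = file_system_client.get_directory_client(directory_name)
    file_client = directory_client.get_file_client(fname_uploaded)

    # Read the file as raw bytes so the uploaded length matches the on-disk size;
    # decoding binary data as text can alter its contents and length.
    with open(fname_local, 'rb') as local_file:
        file_contents = local_file.read()

    file_client.upload_data(file_contents, length=len(file_contents),
                            overwrite=True, validate_content=True)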
The second method I tried was uploading the whole folder with Azure Storage Explorer, but Storage Explorer would crash or fail after uploading about 90 to 100 files. Is there any way I can see the logs to find out why it stopped?
Third, I uploaded the files manually through the Azure Portal, but that was a complete mess, as it also failed on some files.
Can anyone guide me on how to bulk upload data to Azure Data Lake, and what could be causing the problems in these three methods?
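For context, this is roughly the kind of loop I have in mind for walking the local folders and uploading each file. It is only a sketch: upload_binary_file is the binary-mode helper sketched above, and the filesystem name and paths would be placeholders.

import os

def upload_folder(filesystem_name, local_root, remote_root):
    # Sketch only: walk every file under local_root and upload it,
    # mirroring the relative folder structure under remote_root.
    for dirpath, _dirnames, filenames in os.walk(local_root):
        rel_dir = os.path.relpath(dirpath, local_root)
        remote_dir = remote_root if rel_dir == '.' else f"{remote_root}/{rel_dir.replace(os.sep, '/')}"
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            try:
                # Assumes the target directory already exists; if not, it may need to be
                # created first (e.g. via the directory client's create_directory()).
                upload_binary_file(filesystem_name, remote_dir, local_path, name)
            except Exception as e:
                # Log the failure and continue so one bad file does not stop the whole run.
                print(f"Failed to upload {local_path}: {e}")

A call would then look something like upload_folder('my-filesystem', '/local/data', 'data'), with the filesystem name and paths as placeholders.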