
I upload files to Azure Data Lake Store using the following function:

DataLakeStoreFileSystemManagementClient.FileSystem.UploadFile(store, filePath, key, overwrite: true);

It gives me the following error, but ONLY for files larger than ~4 MB:

"Found a record that exceeds the maximum allowed record length around offset 4194304"

Microsoft.Azure.Management.DataLake.Store.TransferFailedException:
   at Microsoft.Azure.Management.DataLake.Store.FileSystemOperations.UploadFile (Microsoft.Azure.Management.DataLake.Store, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)

Could anyone provide any insight into whether this is a setting somewhere in Azure Data Lake or something I can adjust on the client end?

Thanks!

I've googled the error and the only things that come up are Java code samples.


3 Answers


According to the Azure subscription limits and quotas:

Azure Data Lake Store is an enterprise-wide hyper-scale repository for big data analytic workloads. Data Lake Store enables you to capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics. There is no limit to the amount of data you can store in a Data Lake Store account.

But also, according to the section 'Performance and scale considerations' of 'Best practices for using Azure Data Lake Store', paragraph 'Optimize “writes” with the Data Lake Store driver buffer':

To optimize performance and reduce IOPS when writing to Data Lake Store from Hadoop, perform write operations as close to the Data Lake Store driver buffer size as possible. Try not to exceed the buffer size before flushing, such as when streaming using Apache Storm or Spark streaming workloads. When writing to Data Lake Store from HDInsight/Hadoop, it is important to know that Data Lake Store has a driver with a 4-MB buffer. Like many file system drivers, this buffer can be manually flushed before reaching the 4-MB size. If not, it is immediately flushed to storage if the next write exceeds the buffer’s maximum size. Where possible, you must avoid an overrun or a significant underrun of the buffer when syncing/flushing policy by count or time window.

Answer
According to this answer, using the DataLakeStoreUploader doesn't present you with this issue. The main reason is probably that it does the flushing for you. So you might be working too close to the metal using the FileSystem.UploadFile method ;)

According to this post, another solution is to start with an empty file and append chunks of less than 4 MB to it, flushing after each chunk.
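A minimal sketch of that chunked-append approach, assuming the Create and Append operations on an authenticated DataLakeStoreFileSystemManagementClient named client (as in the answer below); the account name and paths are placeholders, and the exact parameter names may differ between SDK versions:

 using System.IO;
 using Microsoft.Azure.Management.DataLake.Store;

 const int ChunkSize = 4 * 1024 * 1024 - 1;   // stay just under the 4 MB driver buffer
 var account = "youradlsaccount";             // placeholder account name
 var localPath = @"D:\bigfile.bin";           // placeholder local file
 var destination = "/folder/bigfile.bin";     // placeholder Data Lake path

 // Start with an empty destination file.
 client.FileSystem.Create(account, destination, new MemoryStream(), overwrite: true);

 // Append the local file in chunks smaller than the driver buffer.
 using (var source = File.OpenRead(localPath))
 {
     var buffer = new byte[ChunkSize];
     int read;
     while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
     {
         client.FileSystem.Append(account, destination, new MemoryStream(buffer, 0, read));
     }
 }

With explicit appends there is no record-boundary splitting of the source file, so the 4 MB record check should not come into play.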


Based on my understanding, there is no such limit if you use the latest Microsoft.Azure.Management.DataLake.Store 2.2.1.

I can't reproduce it on my side. I also checked it with the Fiddler tool. While uploading the file to Azure Data Lake with the Azure library, I could see that it uploads the file in append mode. You could try using my following code.

The following is the demo code and packages:

 using System.IO;
 using Microsoft.Azure.Management.DataLake.Store;
 using Microsoft.IdentityModel.Clients.ActiveDirectory;
 using Microsoft.Rest.Azure.Authentication;

 // Authenticate with the AAD application's client id and secret.
 var creds = new ClientCredential("clientId", "secretkey");
 var clientCreds = ApplicationTokenProvider.LoginSilentAsync("tenantId", creds).Result;
 var client = new DataLakeStoreFileSystemManagementClient(clientCreds);
 var source = "D:\\1.txt"; // file size > 15 MB
 var fileInfo = new FileInfo(source);
 var size = fileInfo.Length;
 var destination = "/tomtest/1.txt";
 client.FileSystem.UploadFile("tomdatalake", source, destination, overwrite: true);

Test result: (screenshot of the successful upload omitted)

Packages.config

<?xml version="1.0" encoding="utf-8"?>
<packages>
  <package id="Microsoft.Azure.Management.DataLake.Store" version="2.2.1" targetFramework="net471" />
  <package id="Microsoft.IdentityModel.Clients.ActiveDirectory" version="3.14.0" targetFramework="net471" />
  <package id="Microsoft.IdentityModel.Logging" version="1.1.2" targetFramework="net471" />
  <package id="Microsoft.IdentityModel.Tokens" version="5.1.2" targetFramework="net471" />
  <package id="Microsoft.Rest.ClientRuntime" version="2.3.11" targetFramework="net471" />
  <package id="Microsoft.Rest.ClientRuntime.Azure" version="3.3.7" targetFramework="net471" />
  <package id="Microsoft.Rest.ClientRuntime.Azure.Authentication" version="2.3.3" targetFramework="net471" />
  <package id="Newtonsoft.Json" version="9.0.1" targetFramework="net471" />
</packages>
  • Updating to 2.2.1 doesn't help, I think I have to switch to the Microsoft.Azure.DataLake.Store package and use AdlsClient. – jn1kk May 25 '18 at 14:02
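For reference, a minimal sketch of the AdlsClient route mentioned in this comment, assuming the Microsoft.Azure.DataLake.Store package and reusing the token pattern from the answer above; the account FQDN and paths are placeholders:

 using Microsoft.Azure.DataLake.Store;
 using Microsoft.IdentityModel.Clients.ActiveDirectory;
 using Microsoft.Rest.Azure.Authentication;

 // Placeholder tenant/client values; reuse whatever credentials you already have.
 var creds = ApplicationTokenProvider.LoginSilentAsync(
     "tenantId", new ClientCredential("clientId", "secretkey")).Result;

 // CreateClient takes the full account FQDN rather than just the account name.
 var adlsClient = AdlsClient.CreateClient("youraccount.azuredatalakestore.net", creds);

 // BulkUpload splits the file into chunks and uploads them in parallel.
 adlsClient.BulkUpload(@"D:\1.txt", "/tomtest/1.txt");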

What helped me was specifying

uploadAsBinary: true

I found the explanation of this parameter's behavior for AdlsClient.BulkUpload, but I guess it should be the same for this API:

If false then writes files to data lake at newline boundaries, however if the file has no newline within 4MB chunks it will throw exception. If true, then upload at new line boundaries is not guaranteed but the upload will be faster. By default false, if file has no newlines within 4MB chunks true should be passed

https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.datalake.store.adlsclient.bulkupload?view=azure-dotnet
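As a concrete illustration, here is the call from the question with the flag applied; this assumes the UploadFile overload on the management client exposes the parameter under the name uploadAsBinary, as the linked documentation suggests:

 // Same call as in the question, but uploading as binary so the transfer
 // no longer needs a newline within every 4 MB chunk.
 DataLakeStoreFileSystemManagementClient.FileSystem.UploadFile(
     store, filePath, key, overwrite: true, uploadAsBinary: true);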
