
I am in a bit of a situation. I have a stubborn client who wants to upload a ~5 GB CSV file from a web browser. The end goal is for the file to reach SQL Server, but I decided to first put it in Blob Storage and then have a process that reads it from Blob and writes it to a table.

To make matters more complicated, I need to validate whether the file structure (i.e. file format, column count, etc.) is correct before the upload to Blob starts. If it is not, I must let the user know that the file cannot be uploaded.

To achieve these requirements:

  1. Users will upload the file from the UI. Our code first checks whether the file format (column names, file extension, etc.) is correct and then streams the file to Blob Storage (see the first sketch after this list).

  2. Once the file is in Blob, I am thinking of using an Azure Function that reads 10K rows at a time, validates the data against table A, and inserts it into table B. If some of the data is invalid, the function writes those rows to another directory in Blob for users to download. This process repeats until the entire file is stored in table B (see the second sketch below).
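
Here is roughly what I have in mind for step 1 — a minimal sketch using Python and the azure-storage-blob SDK. The expected column list, connection string, and container name are placeholders, and the incoming file is assumed to be a seekable binary stream:

```python
# Minimal sketch: validate the CSV extension and header, then stream the file
# to Blob Storage. EXPECTED_COLUMNS, the container name, and conn_str are
# placeholders; file_stream is assumed to be a seekable binary stream.
import csv
import io
from azure.storage.blob import BlobClient

EXPECTED_COLUMNS = ["Id", "Name", "Amount"]   # hypothetical expected header

def validate_and_upload(file_name: str, file_stream, conn_str: str) -> None:
    if not file_name.lower().endswith(".csv"):
        raise ValueError("Only .csv files are accepted.")

    # Peek at the first line only; never load the whole ~5 GB file into memory.
    header_line = file_stream.readline().decode("utf-8-sig")
    header = next(csv.reader(io.StringIO(header_line)))
    if [c.strip() for c in header] != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected columns: {header}")

    file_stream.seek(0)  # rewind so the header row is uploaded as well
    blob = BlobClient.from_connection_string(
        conn_str, container_name="uploads", blob_name=file_name
    )
    # upload_blob reads from the stream in chunks rather than buffering the file.
    blob.upload_blob(file_stream, overwrite=True)
```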

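And for step 2, a rough sketch of what the Azure Function body could do. This is assumption-heavy: the table names (TableA, TableB), the key/column layout, and the pyodbc connection string are placeholders, and table A is assumed small enough to cache its keys in memory:

```python
# Minimal sketch of the batch validation/insert step. All table and column
# names are placeholders; error handling, logging, and retries are omitted.
import csv
import io
import pyodbc

BATCH_SIZE = 10_000

def iter_csv_lines(blob_client):
    """Yield decoded CSV lines from a large blob without downloading it all at once."""
    buffer = b""
    for chunk in blob_client.download_blob().chunks():
        buffer += chunk
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            yield line.decode("utf-8").rstrip("\r")
    if buffer:
        yield buffer.decode("utf-8").rstrip("\r")

def process_blob(sql_conn_str, blob_client, error_blob_client):
    rows = csv.reader(iter_csv_lines(blob_client))
    header = next(rows)                       # skip the header row
    bad_rows = []
    with pyodbc.connect(sql_conn_str) as cn:
        cur = cn.cursor()
        cur.fast_executemany = True           # speeds up executemany against SQL Server
        cur.execute("SELECT KeyCol FROM dbo.TableA")
        known_keys = {k for (k,) in cur.fetchall()}   # assumes table A fits in memory

        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) == BATCH_SIZE:
                bad_rows += flush(cur, batch, known_keys)
                batch = []
        if batch:
            bad_rows += flush(cur, batch, known_keys)
        cn.commit()

    # Write the rejected rows to a separate blob for the user to download.
    if bad_rows:
        out = io.StringIO()
        csv.writer(out).writerows([header] + bad_rows)
        error_blob_client.upload_blob(out.getvalue(), overwrite=True)

def flush(cur, batch, known_keys):
    """Insert rows whose key exists in table A into table B; return the rest."""
    valid = [r for r in batch if r[0] in known_keys]      # assume column 0 is the key
    invalid = [r for r in batch if r[0] not in known_keys]
    if valid:
        cur.executemany(
            "INSERT INTO dbo.TableB (KeyCol, Name, Amount) VALUES (?, ?, ?)", valid
        )
    return invalid
```
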
However, I am confused whether to take the 2-step approach above or use a 3rd approach:

  1. Use Azure Data Factory and create a pipeline that gets the file from Blob and puts it into a staging table. Once the data is in the staging table, I validate it against another table, insert the valid records into the destination table, and write the invalid records to a 3rd table (error table), whose contents the user can download through an endpoint (a sketch of the set-based validation is below).
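
For the ADF approach, the validation after staging would be set-based SQL. A minimal sketch of that step (the object names Staging, TableA, Destination, ErrorTable, and KeyCol are placeholders; in a real pipeline this would more likely be a stored procedure called from a Stored Procedure activity rather than Python):

```python
# Minimal sketch of the staging-table validation for the ADF approach.
# All table and column names are placeholders.
import pyodbc

VALIDATE_SQL = """
INSERT INTO dbo.Destination (KeyCol, Name, Amount)
SELECT s.KeyCol, s.Name, s.Amount
FROM dbo.Staging AS s
WHERE EXISTS (SELECT 1 FROM dbo.TableA AS a WHERE a.KeyCol = s.KeyCol);

INSERT INTO dbo.ErrorTable (KeyCol, Name, Amount)
SELECT s.KeyCol, s.Name, s.Amount
FROM dbo.Staging AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.TableA AS a WHERE a.KeyCol = s.KeyCol);
"""

def validate_staging(sql_conn_str: str) -> None:
    with pyodbc.connect(sql_conn_str) as cn:
        cn.cursor().execute(VALIDATE_SQL)
        cn.commit()
```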

I am confused whether to use the 2nd or 3rd approach. Please suggest which one is likely to be faster or less prone to performance issues.

If you can share comments on the overall approach, that would be great.

Thank you so much

1 Answer


It is recommended to use the AzCopy tool to upload files from on-premises or from the cloud (a command-line tool to easily copy data to and from Azure Blob Storage, Azure Files, and Table storage with optimal performance). AzCopy supports concurrency and parallelism, and the ability to resume copy operations when interrupted. It provides high performance for uploading and downloading larger files. Please consider using this tool for larger files.

To ingest data into your system, use Azure Data Factory, Storage Explorer, the AzCopy tool, PowerShell, or Visual Studio. If you use the File Upload feature to import file sizes above 2 GB, use PowerShell or Visual Studio. AzCopy supports a maximum file size of 1 TB and automatically splits data files that exceed 200 GB.

You can use a Custom Activity in Azure Data Factory to run a PowerShell script. In that PowerShell script, you can use AzCopy commands to upload the files to Blob Storage.
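
The core of such a script is a single `azcopy copy` call. Here is a minimal sketch of the command shape (shown as a Python wrapper rather than PowerShell; the local path, storage account, container, and SAS token are placeholders):

```python
# Minimal sketch: shell out to AzCopy v10 to upload a local file to Blob Storage.
# The account, container, SAS token, and local path are placeholders.
import subprocess

def upload_with_azcopy(local_path: str, account: str, container: str, sas_token: str) -> None:
    destination = f"https://{account}.blob.core.windows.net/{container}?{sas_token}"
    # "azcopy copy <source> <destination>" is the AzCopy v10 upload syntax.
    subprocess.run(["azcopy", "copy", local_path, destination], check=True)
```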

Utkarsh Pal