I have millions of files in one container and I need to copy ~100k to another container in the same storage account. What is the most efficient way to do this?

I have tried:

  1. Python API -- Using BlobServiceClient and related classes, I make a BlobClient for the source and destination and start a copy with new_blob.start_copy_from_url(source_blob.url). This runs at roughly 7 files per second.
  2. azcopy (one file per line) -- Basically a batch script with a line like azcopy copy <source w/ SAS> <destination w/ SAS> for every file. This runs at roughly 0.5 files per second due to azcopy's overhead.
  3. azcopy (1000 files per line) -- Another batch script like the above, except I use the --include-path argument to specify a bunch of semicolon-separated files at once. (The number is arbitrary but I chose 1000 because I was concerned about overloading the command. Even 1000 files makes a command with 84k characters.) Extra caveat here: I cannot rename the files with this method, which is required for about 25% due to character constraints on the system that will download from the destination container. This runs at roughly 3.5 files per second.

Surely there must be a better way to do this, probably with another Azure tool that I haven't tried. Or maybe by tagging the files I want to copy then copying the files with that tag, but I couldn't find the arguments to do that.

1 Answer

Please check the options and references below:

1. AzCopy generally gives the best performance for copying blobs within the same storage account or to other storage accounts. You can force a synchronous copy by specifying the "/SyncCopy" parameter (part of the older Windows AzCopy command-line syntax), which ensures that the copy operation runs at a consistent speed. See: azcopy sync | Microsoft Docs.

But note that AzCopy performs the synchronous copy by downloading the blobs to local memory and then uploading them to the destination Blob storage, so performance also depends on network conditions between the machine where AzCopy runs and the Azure datacenter. /SyncCopy may also generate additional egress cost compared to an asynchronous copy. The recommended approach is to run AzCopy with this option on an Azure VM in the same region as your source storage account, which avoids the egress cost. See: Choose a tool and strategy to copy blobs - Learn | Microsoft Docs.
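To avoid the very long command lines mentioned in the question, here is a minimal Python sketch that writes the blob names to a text file and builds a single AzCopy invocation around it. It assumes AzCopy v10, whose copy command accepts a --list-of-files option pointing at a file with one relative path per line. The function name build_azcopy_command and the placeholder URLs are hypothetical, and this route does not rename blobs, so files that need new names would still have to be handled separately.

```python
import tempfile
from pathlib import Path


def build_azcopy_command(source_url, dest_url, blob_names, list_path):
    """Write blob_names to list_path (one per line) and return the argv list.

    Assumes AzCopy v10 and its --list-of-files option; the command is only
    built here, not executed.
    """
    Path(list_path).write_text("\n".join(blob_names) + "\n", encoding="utf-8")
    return [
        "azcopy", "copy", source_url, dest_url,
        "--list-of-files", str(list_path),
    ]


if __name__ == "__main__":
    # Hypothetical placeholders: substitute real container URLs with SAS tokens.
    src = "https://<account>.blob.core.windows.net/<source-container>?<SAS>"
    dst = "https://<account>.blob.core.windows.net/<dest-container>?<SAS>"
    with tempfile.TemporaryDirectory() as tmp:
        cmd = build_azcopy_command(
            src, dst, ["a/1.txt", "b/2.txt"], Path(tmp) / "files.txt"
        )
        print(cmd)
        # To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list to subprocess.run, rather than as one shell string, sidesteps quoting problems with SAS tokens that contain & characters.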

2. StartCopyAsync is another option you can try for copying within a storage account.

References: 1. .net - Copying file across Azure container without using azcopy - Stack Overflow 2. Copying Azure Blobs Between Containers the Quick Way (markheath.net)
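StartCopyAsync is a .NET method; in the Python SDK the comparable call is start_copy_from_url, which the question already uses. A rate of about 7 files per second suggests the calls may be issued one at a time, so one thing to try is starting the copies from a thread pool. The sketch below is only an illustration under that assumption: start_copies, rename, and max_workers are hypothetical names, and it assumes both container clients are authorized so that the service can read the source URL (for example, an account key on the same account, or a SAS appended to the URL).

```python
from concurrent.futures import ThreadPoolExecutor


def start_copies(src_container, dst_container, names, rename=None, max_workers=32):
    """Start one server-side copy per blob name, in parallel.

    src_container / dst_container: objects with get_blob_client(name), such as
    azure.storage.blob.ContainerClient.
    rename: optional function mapping a source name to a destination name.
    Returns a dict of destination name -> result of start_copy_from_url.
    """
    rename = rename or (lambda name: name)

    def start_one(name):
        source = src_container.get_blob_client(name)
        target_name = rename(name)
        target = dst_container.get_blob_client(target_name)
        # Only starts the copy; the service completes it asynchronously.
        return target_name, target.start_copy_from_url(source.url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(start_one, names))


# Possible usage with the Azure SDK (connection string and names are placeholders):
# from azure.storage.blob import BlobServiceClient
# service = BlobServiceClient.from_connection_string(conn_str)
# src = service.get_container_client("source-container")
# dst = service.get_container_client("dest-container")
# start_copies(src, dst, names, rename=lambda n: n.replace(" ", "_"))
```

Because the rename function is applied per blob, this approach also covers the files that need different names in the destination container.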

3. You may consider Azure Data Factory when dealing with millions of files. Note that it can be expensive and occasional timeouts may occur, but it may be worthwhile for repeated work of this kind.

References: 1. Copy millions of files (andrewconnell.com) , GitHub(microsoft docs) 2. File Transfer between container to another container - Microsoft Q&A

4. Also try Azure Storage Explorer, which can copy blobs from one container to another.

kavyaS