
I am using Azure Data Lake Store (ADLS), targeted by an Azure Data Factory (ADF) pipeline that reads from Blob Storage and writes into ADLS. During execution I notice that a folder is created in the output ADLS that does not exist in the source data. The folder has a GUID for a name and contains many files, also named with GUIDs. The folder is temporary and disappears after around 30 seconds.

Is this part of the ADLS metadata indexing? Is it something used by ADF during processing? It appears in the Data Explorer in the portal, but does it also show up through the API? I am concerned it may cause issues down the line, even though it is a temporary structure.

Any insight appreciated - a Google search turned up little.

Picture of the transient folder

Murray Foxcroft

1 Answer


What you're seeing here is something Azure Data Lake Storage does regardless of the method you use to upload and copy data into it. It's not specific to Data Factory and not something you can control.

For large files it basically parallelises the read/write operation for a single file. You then get multiple smaller files appearing in the temporary directory, one for each thread of the parallel operation. Once complete, the process concatenates those parts into the single expected destination file.

For comparison, this is similar to what PolyBase does in SQL DW with its 8 external readers that hit a file in 512MB blocks.
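To make that concrete, here's a rough local simulation of the chunk-and-concatenate pattern. This is only a minimal sketch: the chunk size, GUID naming and thread count below are illustrative assumptions, not the actual ADLS internals.

```python
import os
import shutil
import uuid
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative chunk size, not the real ADLS value


def upload_chunk(source_path, offset, length, part_path):
    """Copy one byte range of the source into its own temporary part file."""
    with open(source_path, "rb") as src, open(part_path, "wb") as dst:
        src.seek(offset)
        dst.write(src.read(length))


def parallel_upload(source_path, dest_path):
    size = os.path.getsize(source_path)
    # A GUID-named temporary folder next to the destination.
    temp_dir = os.path.join(os.path.dirname(dest_path) or ".", str(uuid.uuid4()))
    os.makedirs(temp_dir)

    # One GUID-named part file per chunk, written by a pool of worker threads.
    offsets = list(range(0, size, CHUNK_SIZE))
    parts = [os.path.join(temp_dir, str(uuid.uuid4())) for _ in offsets]
    with ThreadPoolExecutor(max_workers=8) as pool:
        for offset, part in zip(offsets, parts):
            pool.submit(upload_chunk, source_path, offset, CHUNK_SIZE, part)

    # Concatenate the parts into the single expected destination file,
    # then remove the temporary GUID folder.
    with open(dest_path, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dst)
    shutil.rmtree(temp_dir)
```

If the process dies between the parallel writes and the final concatenation, the GUID folder and its part files are exactly what gets left behind, which leads on to the clean-up problem.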

I understand your concerns here. I've also done battle with this, whereby the operation fails and does not clean up the temp files. My advice would be to be explicit with your downstream services when specifying the target file path.
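If you do end up with orphaned temp folders, a scheduled clean-up is easy to script. The following is only a minimal sketch, assuming the azure-datalake-store Python SDK, a service principal for authentication, and that the orphans are GUID-named folders directly under the output path; the path, age threshold and credentials are placeholders to adapt.

```python
import time
import uuid
from azure.datalake.store import core, lib

# Assumed service-principal credentials and store name -- replace with your own.
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<app-id>",
                 client_secret="<app-secret>")
adl = core.AzureDLFileSystem(token, store_name="<adls-account-name>")

TARGET_PATH = "/output"   # folder the ADF copy writes into (assumption)
MAX_AGE_HOURS = 24        # only delete orphans older than this


def is_guid(name):
    """True if the folder name parses as a GUID."""
    try:
        uuid.UUID(name)
        return True
    except ValueError:
        return False


def cleanup_orphaned_temp_folders(dry_run=True):
    cutoff_ms = (time.time() - MAX_AGE_HOURS * 3600) * 1000
    for entry in adl.ls(TARGET_PATH, detail=True):
        name = entry["name"].rsplit("/", 1)[-1]
        if (entry["type"] == "DIRECTORY"
                and is_guid(name)
                and entry["modificationTime"] < cutoff_ms):
            print(("Would delete" if dry_run else "Deleting"), entry["name"])
            if not dry_run:
                adl.rm(entry["name"], recursive=True)


if __name__ == "__main__":
    cleanup_orphaned_temp_folders(dry_run=True)
```

Run it with dry_run=True first and keep the age threshold generous, so you never race a copy that is still in flight.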

One other thing: I've had problems when using the Visual Studio Data Lake file explorer tool to upload large files. Sometimes the parallel threads did not concatenate into the single file correctly and caused corruption in my structured dataset. This was with files in the 4 - 8GB region. Be warned!

Side note: I've found PowerShell the most reliable way of handling uploads into Data Lake Store.

Hope this helps.

Paul Andrew
  • Thanks Paul - I've been digging deeper and testing with high load, and I too am seeing temporary files that are not getting cleaned up. I am being explicit where possible, however this is not possible with an Azure Data Factory job (without building a custom activity). – Murray Foxcroft Jun 30 '17 at 07:53
  • 1
    Indeed, I considered writing a custom cleaner activity myself several times to handle this. As I'm not alone anymore I've created this as a user voice feedback article, please vote. Thanks https://feedback.azure.com/forums/327234-data-lake/suggestions/19799794-orphaned-temporary-file-auto-clean-up-operation – Paul Andrew Jun 30 '17 at 08:08
  • I've also had that exact same thought (to create a cleanup activity), but built the solution to be idempotent instead. Maybe a little cleanup custom activity that just nukes GUID folders would be ideal... – Murray Foxcroft Jun 30 '17 at 12:25
  • May I request that you share the custom script that performs the clean-up, so I can have a look at it too? Thank you in advance – Manjunath Rao Feb 25 '19 at 08:16