I have an SSIS task that reads JSON files from Azure Data Lake, deserializes them in a Script Task, and writes them out as tables in a local SQL Server.

This works, but it is very slow: it takes me 6 hours to import one directory, and there are many.

So I decided to make the load incremental.

At first I used a Conditional Split to avoid writing old records into the local SQL Server, but that still incurred the cost of reading the whole file. Conditional Split is also a strange component; I have observed it simply skip some records for no apparent reason. Not good.

So why not just skip entirely any file I have already imported? For that, though, I need the last-modified date of the file I am processing.

Now, I know there are .NET methods for doing this in plain C# (see "Filesystem operations on Azure Data Lake Storage Gen1 using .NET SDK"), but since I am using a Script Component in SSIS, I already have a valid connection to the Data Lake. It should be much easier, right?
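
For reference, the plain-C# route I mean would look roughly like this (a minimal sketch assuming the Microsoft.Azure.DataLake.Store NuGet package and a service-principal login; the tenant, app, account, and path values are placeholders):

```csharp
using System;
using Microsoft.Azure.DataLake.Store;
using Microsoft.Rest.Azure.Authentication;

class FileDateCheck
{
    static void Main()
    {
        // Placeholder values - replace with real tenant, app, and account names.
        string tenant = "<tenant-id>";
        string clientId = "<app-client-id>";
        string clientSecret = "<app-client-secret>";
        string account = "<yourstorename>.azuredatalakestore.net";

        // Authenticate with a service principal and create the ADLS Gen1 client.
        var creds = ApplicationTokenProvider
            .LoginSilentAsync(tenant, clientId, clientSecret)
            .GetAwaiter().GetResult();
        AdlsClient client = AdlsClient.CreateClient(account, creds);

        // GetDirectoryEntry returns file metadata, including the modified time.
        DirectoryEntry entry = client.GetDirectoryEntry("/mydir/myfile.json");
        Console.WriteLine(entry.LastModifiedTime);
    }
}
```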

But I couldn't find a way to do it from there, so I would appreciate any guidance.

Cheers

EnisAkin

1 Answer

If you are trying to filter the files you load by modified date with the Data Lake Store task within SSIS, you aren't going to have much luck; it doesn't have that kind of awareness.

If you are pulling the data into SSIS through your Script Task, you are in better shape. You can make use of the SDK, or you can go through the standard REST API, which follows the HDFS REST spec. A GET against https://<yourstorename>.azuredatalakestore.net/webhdfs/v1/mytempdir/myinputfile.txt?op=GETFILESTATUS will return the modified timestamp, just as the SDK would.
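
A rough sketch of that REST call from the Script Task might look like this (assuming you already have an OAuth bearer token for the store; the store name and file path are placeholders, and the response's modificationTime is milliseconds since the Unix epoch):

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;

class AdlsFileStatus
{
    // Returns the raw JSON of the GETFILESTATUS response; it contains a
    // "modificationTime" field expressed in milliseconds since the Unix epoch.
    public static string GetFileStatus(string bearerToken)
    {
        string url = "https://<yourstorename>.azuredatalakestore.net" +
                     "/webhdfs/v1/mytempdir/myinputfile.txt?op=GETFILESTATUS";

        using (var http = new HttpClient())
        {
            http.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Bearer", bearerToken);
            return http.GetStringAsync(url).GetAwaiter().GetResult();
        }
    }

    // Helper to turn the modificationTime value into a DateTime.
    public static DateTime FromUnixMilliseconds(long millis)
    {
        return DateTimeOffset.FromUnixTimeMilliseconds(millis).UtcDateTime;
    }
}
```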

You might be able to combine the two to get a similar result while sticking with the SSIS Data Lake Store Source: use a Script Task with a conditional to determine whether the source and its downstream components should run, as sketched below. That depends on how your package is architected, though.
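
For example, the Script Task could compare the file's modified timestamp to the last successful import date and set a Boolean package variable that a precedence constraint (or the data flow's Disable expression) then checks. A sketch of the Main() body, with made-up variable names that would need to exist on the package and be listed in the task's ReadOnly/ReadWrite variables:

```csharp
// Inside the Script Task's generated ScriptMain class.
// User::FileModifiedDate, User::LastImportDate and User::ShouldRunDataFlow
// are illustrative names, not ones your package necessarily has.
public void Main()
{
    DateTime fileModified = (DateTime)Dts.Variables["User::FileModifiedDate"].Value;
    DateTime lastImport = (DateTime)Dts.Variables["User::LastImportDate"].Value;

    // Only let the source and its downstream components run when the
    // file has changed since the last import.
    Dts.Variables["User::ShouldRunDataFlow"].Value = fileModified > lastImport;

    Dts.TaskResult = (int)ScriptResults.Success;
}
```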

PerfectlyPanda