
I have created an Azure Synapse Analytics pipeline that is triggered every time a new file is added to a certain directory.

It basically passes the name of the new file as an input parameter to a notebook, which then reads that file and updates a lake database. If a file is added while the pipeline is running, a second instance of the same pipeline starts running simultaneously. My question is: how do I make sure that two instances of the same pipeline are not modifying the database at the same time? Is this handled by default, or do I have to ensure it in my own code?
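
For reference, here is a minimal sketch of that flow, assuming a PySpark notebook that receives the file name through a parameters cell; the storage account, container, database, and table names below are illustrative, not my actual ones:

```python
# Parameters cell: the pipeline overrides this value with the triggering
# file's name, e.g. via @triggerBody().fileName on the notebook activity.
file_name = ""

# Read the newly arrived file (spark is the SparkSession that Synapse
# notebooks provide by default).
source_path = f"abfss://landing@mystorageaccount.dfs.core.windows.net/incoming/{file_name}"
df = spark.read.option("header", "true").csv(source_path)

# Update the lake database table with the new data. Two pipeline runs can
# reach this write at the same time, which is the problem described above.
df.write.mode("append").saveAsTable("my_lake_db.my_table")
```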


1 Answer


You absolutely have to do this yourself. The best way I know of is to create an external management system that controls when pipelines are submitted. One feature of such a system is concurrency control, so you can configure a pipeline to only permit one run at a time. I have done a lot of work on this concept over the past few years, first in ADF and now in Synapse.
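
If you prefer to enforce the mutual exclusion inside the notebook itself rather than through an orchestration framework, one common pattern (a sketch under assumed names, not the framework described above) is to wrap the database update in a distributed lock built on an Azure Storage blob lease. This assumes the azure-storage-blob package is available on the Spark pool and that an empty lock blob has been created once up front:

```python
import time
from azure.core.exceptions import HttpResponseError
from azure.storage.blob import BlobClient

# Illustrative values: point this at a small, pre-created "lock" blob.
conn_str = "<storage-connection-string>"
lock_blob = BlobClient.from_connection_string(
    conn_str, container_name="locks", blob_name="lake-db-update.lock"
)

# Try to take an infinite lease on the lock blob. While another pipeline run
# holds the lease, acquire_lease fails with a 409, so wait and retry.
lease = None
while lease is None:
    try:
        lease = lock_blob.acquire_lease(lease_duration=-1)
    except HttpResponseError:
        time.sleep(30)

try:
    # Critical section: only one notebook instance reaches this point at a time.
    # ... read the file and update the lake database here ...
    pass
finally:
    # Always release so a queued run (or a rerun after a failure) can proceed.
    lease.release()
```

Note that an infinite lease never expires, so if a run dies before releasing it you would have to break the lease manually (or use a finite lease and renew it periodically); setting the pipeline's concurrency to one run at a time, as mentioned above, avoids that problem entirely because additional runs are simply queued.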

Here is another SO answer where I discuss some of this, and here is a video recording of a presentation I gave a few years ago on the topic. Some of the details have evolved, and your implementation may be very different, but the concepts are the same.

Joel Cochran