We are testing out Azure Synapse Link for Dataverse because the Data Export Service is being deprecated. We did some initial tests on low-volume data and got the service up and running fairly easily. I've now tried setting it up for a larger dataset, and the initial sync for these entities seems extremely slow.
We tried to sync three of our larger entities: two have approximately 7 million records each and one has around 12 million. In three days it has managed to get through around 7 million records between them, so there are still around 19 million to go. At this rate it could easily take several more days to finish these three.
I'm trying to understand where the bottleneck is, but I'm struggling to find guidance for larger datasets.
The examples and documentation from Microsoft show setting up the link with a standard storage account. I've left the partitioning on the entities as "Month", and the workspace uses a serverless SQL pool. I can see in the monitor that this pool isn't doing anything, which makes sense, as I understand it's only used once we start issuing queries.
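For context, the only progress indicator I've found so far is watching what actually lands in the container. Something like the sketch below is how I'm eyeballing throughput; the container name and entity folder names are just placeholders for whatever your Synapse Link profile created.

```python
# Rough throughput check: total bytes and file count landed so far for each
# entity folder in the Synapse Link container.
# "dataverse-myorg-unq123" and the entity folder names are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("dataverse-myorg-unq123")  # assumed name

for entity in ["account", "contact", "activitypointer"]:  # assumed entity folders
    total_bytes = 0
    file_count = 0
    for blob in container.list_blobs(name_starts_with=f"{entity}/"):
        total_bytes += blob.size
        file_count += 1
    print(f"{entity}: {file_count} files, {total_bytes / 1024**3:.2f} GiB")
```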
So, I have a few questions around the setup:
- Is there a way to monitor where the bottleneck in this synchronisation is? E.g., how could I tell whether it's the workspace, the storage account, or something else? (I've sketched the metrics check I'm planning to try after this list.)
- Is "standard" tier on the storage account sufficient for these sorts of volumes? (I can't find any guidelines around this but maybe I'm looking in the wrong place)
- If I set it to Premium, I'm assuming we want a "block blob" account kind. Is that right?
- Are there any other aspects that will affect performance? For example, it's a non-production D365 instance; would that make a difference?
- Microsoft suggests not adding more than 5 entities at a time. Since there are only 3, and they're fairly large, would I see better performance syncing them one at a time?
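For the first question, my current plan is to pull the storage account metrics and see whether ingress/transactions look flat or are plateauing somewhere; a rough sketch of that is below, with the subscription, resource group, and account name as placeholders.

```python
# Pull storage-account metrics for the last day to see how hard the sync is
# actually hitting the account. Resource ID values are placeholders.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

subscription_id = "<subscription-id>"
resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<rg>"
    "/providers/Microsoft.Storage/storageAccounts/<account>"
)

client = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

end = datetime.now(timezone.utc)
start = end - timedelta(days=1)

metrics = client.metrics.list(
    resource_id,
    timespan=f"{start.isoformat()}/{end.isoformat()}",
    interval="PT1H",
    metricnames="Ingress,Transactions",
    aggregation="Total",
)

for metric in metrics.value:
    print(metric.name.localized_value)
    for series in metric.timeseries:
        for point in series.data:
            print(f"  {point.time_stamp}  total={point.total}")
```

If the hourly ingress is nowhere near what the account should handle, I'd read that as the bottleneck being upstream of the storage account, but I'd welcome a better way to tell.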
Thanks!