
I have been trying to conditionally/selectively sync millions of records from a SQL table on Azure to another SQL database within Azure.

Is there any way to configure Azure Data Sync or replication not to sync all records of a table, but rather to sync only the few rows specified by a WHERE clause?

Or is there any way to sync data generated by a SQL view into a table in another database?

P.S. I did my research and this is not available by default in the Azure Portal, so I'm looking forward to experts' opinions on this. Thanks for attempting to help out!

Abhijeet
  • Data Factory can do this. The source in a Copy Activity can be a SQL query (including a WHERE clause), or you can use a view. If you are talking about only one table, it is pretty straightforward. For a large number of tables, it is better to use a control file and a ForEach loop. – Scott Mildenberger Sep 16 '22 at 15:22
  • @ScottMildenberger Thanks, of course you have my +1. The database is huge... really huge. But Data Sync comes free as a bonus, right? The price can blow up with Data Factory to thousands of dollars per month for large data, say tens of millions of records, considering that even if nothing changed, every upsert incurs cost. – Abhijeet Sep 23 '22 at 11:09

1 Answer


A common way of syncing SQL databases in Azure is by activating replication. This will keep databases in sync for the purpose of high availability or redundancy. For your use case, this is not an option, since you only want to sync specific rows.

The recommended tool to use here is Azure Data Factory, which has a built-in copy activity. This allows you to copy data from a source to a sink; in your case, both source and sink are SQL databases. To get started with the copy activity, follow the instructions from this tutorial. You should also make use of a tumbling window trigger to run the pipeline at a specific interval.

Below you'll find a very simple example of a copy activity to give you an idea. The sqlReaderQuery property gives you a lot of flexibility, since it enables you to select exactly which data you want to sync. To sync the data incrementally, you'll have to make use of an existing field in your dataset, like LastUpdated. This way you only load the data that was created or updated since the last time the pipeline ran.

"activities":[
    {
        "name": "CopyFromSQLServer",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<SQL Server input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "SqlSource",
                "sqlReaderQuery": "SELECT * FROM Customers WHERE LastUpdated BETWEEN @{pipeline().parameters.LastModified_From} AND @{pipeline().parameters.LastModified_To}"
            },
            "sink": {
                "type": "SqlSink"
            }
        }
    }
]
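
The sqlReaderQuery above references pipeline().parameters.LastModified_From and LastModified_To, so those parameters also need to be declared on the pipeline. As a minimal sketch (the parameter names simply mirror the ones used in the query above):

"parameters": {
    "LastModified_From": {
        "type": "String"
    },
    "LastModified_To": {
        "type": "String"
    }
}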

This idea is based on this Azure tutorial, which copies blob files between two storage containers using the LastModifiedDate. In this GitHub repository you'll find the Terraform implementation.
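
To wire the incremental window into the pipeline, a tumbling window trigger can pass its window start and end times into those parameters. The sketch below follows the standard tumbling window trigger definition; the trigger name, pipeline name, start time and interval are placeholders to replace with your own values:

{
    "name": "<trigger name>",
    "properties": {
        "type": "TumblingWindowTrigger",
        "runtimeState": "Started",
        "typeProperties": {
            "frequency": "Hour",
            "interval": 1,
            "startTime": "2022-09-01T00:00:00Z",
            "maxConcurrency": 1
        },
        "pipeline": {
            "pipelineReference": {
                "referenceName": "<pipeline name>",
                "type": "PipelineReference"
            },
            "parameters": {
                "LastModified_From": "@{trigger().outputs.windowStartTime}",
                "LastModified_To": "@{trigger().outputs.windowEndTime}"
            }
        }
    }
}

Each run then only copies the rows whose LastUpdated falls inside that window, which keeps the amount of data copied (and the cost) per run small.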

Cloudkollektiv
  • Thanks, of course you have my +1. The database is huge... really huge. But Data Sync comes free as a bonus, right? The price can blow up with Data Factory to thousands of dollars per month for large data, say tens of millions of records, considering that even if nothing changed, every upsert incurs cost. – Abhijeet Sep 23 '22 at 11:08
  • That is why you should always copy based on a field with a timestamp. I recently built a pipeline for ADLS Gen2, based on this [tutorial](https://learn.microsoft.com/en-us/azure/data-factory/solution-template-copy-new-files-lastmodifieddate). You can apply the same simple logic here by setting a default start_date and end_date, and then updating those parameters using a tumbling window. – Cloudkollektiv Sep 23 '22 at 11:12
  • And I would also suggest first trying this in a test environment, on a database with the same schema but only a few records. – Cloudkollektiv Sep 23 '22 at 11:28