
I have a few tables in MS SQL; these tables get updated every second, and the query looks more or less like this:

SELECT G_ID,UpdateTime,ID,Name,T_NAME FROM TABLE1 AS table1 INNER JOIN TABLE2 AS table2 ON table1.IP=table2.ID 
WHERE table2.UpdateTime >= ${lastUpdateTime} AND table2.G_ID > ${lastG_ID}

Let's assume the SELECT ... INNER JOIN query returns the 5 records shown below.

When the query runs for the first time, ${lastUpdateTime} and ${lastG_ID} are set to 0, so it returns the 5 records below. After processing the records, the job stores max(G_ID), i.e. 5, and max(UpdateTime), i.e. 1512010479, in the etl_stat table.

 G_ID       UpdateTime   ID            Name             T_NAME 
-------------------------------------------------------------------
 1          1512010470  12591225      DUMMY_DATA       DUMMY_ID    
 2          1512096873  12591538      DUMMY_DATA       DUMMY_ID    
 3          1512096875  12591539      DUMMY_DATA       DUMMY_ID    
 4          1512010477  12591226      DUMMY_DATA       DUMMY_ID    
 5          1512010479  12591227      DUMMY_DATA       DUMMY_ID    

If the table gets another 5 new records, as shown below:

 G_ID       UpdateTime   ID            Name             T_NAME 
-------------------------------------------------------------------
 1          1512010470  12591225      DUMMY_DATA       DUMMY_ID    
 2          1512096873  12591538      DUMMY_DATA       DUMMY_ID    
 3          1512096875  12591539      DUMMY_DATA       DUMMY_ID    
 4          1512010477  12591226      DUMMY_DATA       DUMMY_ID    
 5          1512010479  12591227      DUMMY_DATA       DUMMY_ID 
 6          1512010480  12591230      DUMMY_DATA       DUMMY_ID 
 7          1512010485  12591231      DUMMY_DATA       DUMMY_ID 
 8          1512010490  12591232      DUMMY_DATA       DUMMY_ID 
 9          1512010493  12591233      DUMMY_DATA       DUMMY_ID 
 10         1512010500  12591234      DUMMY_DATA       DUMMY_ID 

The job first reads max(G_ID) and max(UpdateTime) from the etl_stat table and frames the query as follows:

SELECT G_ID,UpdateTime,ID,Name,T_NAME FROM TABLE1 AS table1 INNER JOIN TABLE2 AS table2 ON table1.IP=table2.ID WHERE table2.UpdateTime >= 1512010479 AND table2.G_ID > 5

so that it returns only the 5 delta records shown below.

G_ID        UpdateTime   ID            Name             T_NAME 
-------------------------------------------------------------------
 6          1512010480  12591230      DUMMY_DATA       DUMMY_ID 
 7          1512010485  12591231      DUMMY_DATA       DUMMY_ID 
 8          1512010490  12591232      DUMMY_DATA       DUMMY_ID 
 9          1512010493  12591233      DUMMY_DATA       DUMMY_ID 
 10         1512010500  12591234      DUMMY_DATA       DUMMY_ID 

So every time the query runs, it should first read max(G_ID) and max(UpdateTime) from the etl_stat table, frame the SELECT ... INNER JOIN query as shown above, and fetch only the delta changes.

AS-IS ARCHITECTURE USING SPARK SQL

I have implemented the above use case as follows (a rough sketch of these steps appears after the list):

1) Spark JDBC reads the Phoenix etl_stat table to get max(G_ID) and max(UpdateTime).

2) Spark JDBC frames the SELECT ... INNER JOIN query like this: SELECT G_ID,UpdateTime,ID,Name,T_NAME FROM TABLE1 AS table1 INNER JOIN TABLE2 AS table2 ON table1.IP=table2.ID WHERE table2.UpdateTime >= 1512010479 AND table2.G_ID > 5

3) Spark JDBC runs the step 2 inner join query, reads the delta records from MS SQL Server, processes them, and inserts them into HBase.

4) After a successful insert into HBase, Spark updates the etl_stat table with the latest G_ID, i.e. 10, and UpdateTime, i.e. 1512010500.

5) This job is cron-scheduled to run every minute.
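
Putting steps 1) to 4) together, a rough Spark (Scala) sketch of the job might look like the one below. The JDBC URLs, the etl_stat column names (G_ID, UPDATETIME), the class name DeltaLoadJob and the writeToHBase helper are assumptions for illustration only, not the actual implementation:

    import java.sql.DriverManager

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.max

    object DeltaLoadJob {

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("mssql-delta-to-hbase").getOrCreate()

        // Connection strings are placeholders; substitute your own hosts and credentials.
        val phoenixUrl = "jdbc:phoenix:zk-host:2181"
        val mssqlUrl   = "jdbc:sqlserver://mssql-host:1433;databaseName=mydb;user=etl;password=secret"

        // Step 1: read the last processed watermark from the etl_stat Phoenix table.
        val phoenixConn = DriverManager.getConnection(phoenixUrl)
        val rs = phoenixConn.createStatement()
          .executeQuery("SELECT MAX(G_ID) AS LAST_G_ID, MAX(UPDATETIME) AS LAST_UPDATETIME FROM ETL_STAT")
        rs.next()
        val lastGId        = rs.getLong("LAST_G_ID")       // 0 on the very first run
        val lastUpdateTime = rs.getLong("LAST_UPDATETIME") // 0 on the very first run

        // Step 2: frame the delta query with the watermark values substituted in.
        val deltaQuery =
          s"""(SELECT G_ID, UpdateTime, ID, Name, T_NAME
             | FROM TABLE1 AS table1
             | INNER JOIN TABLE2 AS table2 ON table1.IP = table2.ID
             | WHERE table2.UpdateTime >= $lastUpdateTime AND table2.G_ID > $lastGId) AS delta""".stripMargin

        // Step 3: run the query against MS SQL Server and pull only the delta records.
        val delta: DataFrame = spark.read.format("jdbc")
          .option("url", mssqlUrl)
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .option("dbtable", deltaQuery)
          .load()
          .cache()

        if (delta.count() > 0) {
          writeToHBase(delta) // existing business logic, omitted here

          // Step 4: persist the new watermark so the next run only picks up newer records.
          // Assumes ETL_STAT has numeric G_ID and UPDATETIME columns.
          val row = delta.agg(max("G_ID"), max("UpdateTime")).first()
          val upsert = phoenixConn.prepareStatement("UPSERT INTO ETL_STAT (G_ID, UPDATETIME) VALUES (?, ?)")
          upsert.setLong(1, row.getAs[Number](0).longValue())
          upsert.setLong(2, row.getAs[Number](1).longValue())
          upsert.executeUpdate()
          phoenixConn.commit()
        }

        phoenixConn.close()
        spark.stop()
      }

      // Placeholder for the existing HBase insert logic (hypothetical helper).
      def writeToHBase(df: DataFrame): Unit = {
        // e.g. df.foreachPartition { rows => /* HBase Puts via the HBase client */ }
      }
    }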

TO-BE ARCHITECTURE USING NIFI

I want to move this use case to NiFi: use NiFi to read the records from the MS SQL database and send them to Kafka.

After successfully publishing to Kafka, NiFi will save the G_ID and UpdateTime in the database.

Once messages reach Kafka, Spark Streaming will read them from Kafka and save them to HBase using the existing business logic, as sketched below.
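
A minimal Spark Streaming (Scala) sketch of this Kafka-to-HBase leg could look like the following; the broker address, topic name (mssql-delta) and consumer group are assumptions, and the HBase write is left as a stub for the existing business logic:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object KafkaToHBase {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("kafka-delta-to-hbase")
        val ssc  = new StreamingContext(conf, Seconds(60))

        // Broker address, topic and group id are assumptions; adjust to your cluster.
        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "kafka-broker:9092",
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "delta-to-hbase",
          "auto.offset.reset"  -> "earliest",
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("mssql-delta"), kafkaParams))

        stream.foreachRDD { rdd =>
          rdd.foreachPartition { records =>
            // Existing business logic goes here: parse each record and write it to HBase.
            records.foreach(record => println(record.value()))
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }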

On every run, the NiFi processor should frame the SELECT ... INNER JOIN query using max(G_ID) and max(UpdateTime) in order to get the delta records and publish them to Kafka.

I am new to NiFi/HDF and need your help and guidance to implement this using NiFi/HDF. If you have a better solution/architecture for this use case, please suggest it.

Sorry for such a long post.

nilesh1212

1 Answer


What you're describing is what the JDBC Kafka Connect connector does out of the box. Set up your configuration file, load it, off you go. Done. Kafka Connect is part of Apache Kafka. No need for extra tools and technologies.
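
For illustration, a JDBC source connector configuration for this pattern might look roughly like the one below. This assumes the Confluent JDBC source connector is installed alongside Kafka Connect; the connection URL, topic prefix and polling interval are placeholders, and exact property names can vary by connector version:

    name=mssql-delta-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:sqlserver://mssql-host:1433;databaseName=mydb;user=etl;password=secret
    mode=timestamp+incrementing
    timestamp.column.name=UpdateTime
    incrementing.column.name=G_ID
    query=SELECT G_ID, UpdateTime, ID, Name, T_NAME FROM TABLE1 AS table1 INNER JOIN TABLE2 AS table2 ON table1.IP = table2.ID
    topic.prefix=mssql-delta
    poll.interval.ms=60000

Two caveats: in query mode the query must not contain its own WHERE clause, because the connector appends the incremental predicate itself; and timestamp mode expects a datetime-typed column, so if UpdateTime is stored as an epoch integer you may need incrementing mode on G_ID alone or a cast in the query.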

You might also want to consider proper Change Data Capture (CDC). For proprietary RDBMSs (Oracle, DB2, MS SQL, etc.) you have commercial tools such as GoldenGate, Attunity, DBVisit, and so on. For open-source RDBMSs (e.g. MySQL, PostgreSQL) you should look at the open-source Debezium tool. All of these CDC tools integrate directly with Kafka.

Robin Moffatt