I have a few tables in MS SQL Server that are updated every second, and the query looks more or less like this:

SELECT G_ID, UpdateTime, ID, Name, T_NAME
FROM TABLE1 AS table1
INNER JOIN TABLE2 AS table2 ON table1.IP = table2.ID
WHERE table2.UpdateTime >= ${lastUpdateTime} AND table2.G_ID > ${lastG_ID}
Let's assume the inner join query returns the 5 records shown below. On the first run, ${lastUpdateTime} and ${lastG_ID} are both set to 0, so the query returns all 5 records. After processing the records, the job stores max(G_ID), i.e. 5, and max(UpdateTime), i.e. 1512010479, in the etl_stat table.
G_ID UpdateTime ID Name T_NAME
-------------------------------------------------------------------
1 1512010470 12591225 DUMMY_DATA DUMMY_ID
2 1512096873 12591538 DUMMY_DATA DUMMY_ID
3 1512096875 12591539 DUMMY_DATA DUMMY_ID
4 1512010477 12591226 DUMMY_DATA DUMMY_ID
5 1512010479 12591227 DUMMY_DATA DUMMY_ID
If the table then receives another 5 new records, as shown below:
G_ID UpdateTime ID Name T_NAME
-------------------------------------------------------------------
1 1512010470 12591225 DUMMY_DATA DUMMY_ID
2 1512096873 12591538 DUMMY_DATA DUMMY_ID
3 1512096875 12591539 DUMMY_DATA DUMMY_ID
4 1512010477 12591226 DUMMY_DATA DUMMY_ID
5 1512010479 12591227 DUMMY_DATA DUMMY_ID
6 1512010480 12591230 DUMMY_DATA DUMMY_ID
7 1512010485 12591231 DUMMY_DATA DUMMY_ID
8 1512010490 12591232 DUMMY_DATA DUMMY_ID
9 1512010493 12591233 DUMMY_DATA DUMMY_ID
10 1512010500 12591234 DUMMY_DATA DUMMY_ID
The next run first reads max(G_ID) and max(UpdateTime) from the etl_stat table and frames the query as follows:

SELECT G_ID, UpdateTime, ID, Name, T_NAME
FROM TABLE1 AS table1
INNER JOIN TABLE2 AS table2 ON table1.IP = table2.ID
WHERE table2.UpdateTime >= 1512010479 AND table2.G_ID > 5

so that it returns only the 5 delta records shown below.
G_ID UpdateTime ID Name T_NAME
-------------------------------------------------------------------
6 1512010480 12591230 DUMMY_DATA DUMMY_ID
7 1512010485 12591231 DUMMY_DATA DUMMY_ID
8 1512010490 12591232 DUMMY_DATA DUMMY_ID
9 1512010493 12591233 DUMMY_DATA DUMMY_ID
10 1512010500 12591234 DUMMY_DATA DUMMY_ID
So every time the query runs, it should first read max(G_ID) and max(UpdateTime) from the etl_stat table, frame the inner join query as shown above, and fetch only the delta changes.
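The watermark-based query framing can be sketched in plain Python as follows; the function name is illustrative, and the SQL string mirrors the query above:

```python
def frame_delta_query(last_update_time: int, last_g_id: int) -> str:
    """Build the incremental SELECT from the watermarks stored in etl_stat."""
    return (
        "SELECT G_ID, UpdateTime, ID, Name, T_NAME "
        "FROM TABLE1 AS table1 "
        "INNER JOIN TABLE2 AS table2 ON table1.IP = table2.ID "
        f"WHERE table2.UpdateTime >= {last_update_time} "
        f"AND table2.G_ID > {last_g_id}"
    )

# First run: both watermarks start at 0.
first_run = frame_delta_query(0, 0)
# After processing the first batch, etl_stat holds 1512010479 and 5.
second_run = frame_delta_query(1512010479, 5)
```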
AS IS ARCHITECTURE USING SPARK SQL
I have implemented the above use case as follows:
1) Spark JDBC reads the Phoenix etl_stat table to get max(G_ID) and max(UpdateTime).
2) Spark JDBC frames the inner join query: SELECT G_ID, UpdateTime, ID, Name, T_NAME FROM TABLE1 AS table1 INNER JOIN TABLE2 AS table2 ON table1.IP = table2.ID WHERE table2.UpdateTime >= 1512010479 AND table2.G_ID > 5
3) Spark JDBC runs the query from step 2, reads the delta records from MS SQL Server, processes them, and inserts them into HBase.
4) After a successful insert into HBase, Spark updates the etl_stat table with the latest G_ID, i.e. 10, and UpdateTime, i.e. 1512010500.
5) This job is cron-scheduled to run every minute.
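Steps 1-4 can be simulated in plain Python with in-memory stand-ins (no real JDBC, Phoenix, or HBase); the record layout and values come from the tables above:

```python
# Full source table as tuples of (G_ID, UpdateTime, ID, Name, T_NAME).
source = [
    (1, 1512010470, 12591225, "DUMMY_DATA", "DUMMY_ID"),
    (2, 1512096873, 12591538, "DUMMY_DATA", "DUMMY_ID"),
    (3, 1512096875, 12591539, "DUMMY_DATA", "DUMMY_ID"),
    (4, 1512010477, 12591226, "DUMMY_DATA", "DUMMY_ID"),
    (5, 1512010479, 12591227, "DUMMY_DATA", "DUMMY_ID"),
    (6, 1512010480, 12591230, "DUMMY_DATA", "DUMMY_ID"),
    (7, 1512010485, 12591231, "DUMMY_DATA", "DUMMY_ID"),
    (8, 1512010490, 12591232, "DUMMY_DATA", "DUMMY_ID"),
    (9, 1512010493, 12591233, "DUMMY_DATA", "DUMMY_ID"),
    (10, 1512010500, 12591234, "DUMMY_DATA", "DUMMY_ID"),
]

def run_batch(source, etl_stat):
    # Steps 2-3: apply the same WHERE clause the framed query would use.
    delta = [r for r in source
             if r[1] >= etl_stat["UpdateTime"] and r[0] > etl_stat["G_ID"]]
    # Step 3: processing / HBase insert would happen here in the real job.
    if delta:
        # Step 4: advance the watermarks only after a successful insert.
        etl_stat["G_ID"] = max(r[0] for r in delta)
        etl_stat["UpdateTime"] = max(r[1] for r in delta)
    return delta

etl_stat = {"G_ID": 5, "UpdateTime": 1512010479}  # Step 1: read from etl_stat.
delta = run_batch(source, etl_stat)
```

After this run, delta holds records 6-10 and etl_stat has advanced to G_ID 10 and UpdateTime 1512010500, matching the walkthrough above.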
TO BE ARCHITECTURE USING NIFI
I want to move this use case to NiFi: NiFi will read records from the MS SQL database and send them to Kafka. After a successful publish to Kafka, NiFi will save G_ID and UpdateTime in the database. Once a message reaches Kafka, Spark Streaming will read it from Kafka and save it to HBase using the existing business logic. On every run, the NiFi processor should frame the inner join query using max(G_ID) and max(UpdateTime) so that it fetches only the delta records and publishes them to Kafka.
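The publish-then-checkpoint ordering this flow needs can be sketched as follows; StubProducer stands in for a real Kafka client, and the topic name is made up for illustration:

```python
class StubProducer:
    """Stand-in for a Kafka producer; just records what was sent."""
    def __init__(self):
        self.sent = []

    def send(self, topic, value):
        self.sent.append((topic, value))

def publish_delta(producer, records, etl_stat):
    """Publish every delta record first; only then persist the new
    watermarks, so a failed publish leaves etl_stat untouched and the
    same records are picked up again on the next run."""
    for rec in records:
        producer.send("delta-records", rec)  # hypothetical topic name
    if records:
        etl_stat["G_ID"] = max(r[0] for r in records)
        etl_stat["UpdateTime"] = max(r[1] for r in records)

producer = StubProducer()
etl_stat = {"G_ID": 5, "UpdateTime": 1512010479}
publish_delta(producer,
              [(6, 1512010480, 12591230, "DUMMY_DATA", "DUMMY_ID")],
              etl_stat)
```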
I am new to NiFi/HDF. I need your help and guidance to implement this using NiFi/HDF. If you have a better solution or architecture for this use case, please suggest it.
Sorry for such a long post.