Consider 2 data flows below
1. Front End Box ----> S3 Bucket-1
2. Front End Box ----> Kafka --> Storm ---> S3 Bucket-2
The logs from the boxes are being transferred to S3 buckets. The requirement is to replace flow 1 by flow 2.
Now the data needs to validated between Bucket-1
and Bucket-2
to guarantee that flow 2 can be used.
Following technological options were tried :
1. Python : boto3 Apis
2. Qubole
Both work on a limited data set, qubole is more scalable than python script. But still it takes very long time to do it(never finished, had to kill after running overnight). We are looking at half billion entries here.
Query
SELECT
count(*)
FROM
TableA LEFT OUTER JOIN TableB
ON TableA.id = TableB.id
WHERE
TableB.id IS NULL
AND TableA.id IS NOT NULL
Question
Any suggestion for tools, ways to achieve this faster ?
Are there any ways to avoid the join ?