Comparing one day worth of data from S3 buckets faster

Question

Consider 2 data flows below

1. Front End Box ----> S3 Bucket-1

2. Front End Box ----> Kafka --> Storm ---> S3 Bucket-2

The logs from the boxes are being transferred to S3 buckets. The requirement is to replace flow 1 by flow 2.

Now the data needs to validated between Bucket-1 and Bucket-2 to guarantee that flow 2 can be used.

Following technological options were tried :

1. Python : boto3 Apis
2. Qubole

Both work on a limited data set, qubole is more scalable than python script. But still it takes very long time to do it(never finished, had to kill after running overnight). We are looking at half billion entries here.

Query

SELECT
    count(*)
FROM
    TableA LEFT OUTER JOIN TableB
        ON TableA.id = TableB.id
WHERE
    TableB.id IS NULL
    AND TableA.id IS NOT NULL

Question

Any suggestion for tools, ways to achieve this faster ?

Are there any ways to avoid the join ?

Where do you host your `Front End Box` ? If you are on AWS then you can look into AWS Lamda. — , Apr 25 '17 at 21:17
You can try to log things using AWS CloudWatch and then run your validation in AWS Lambda and create temp files which you can move to S3 with `python`. — , Apr 25 '17 at 21:24

Albatross · Accepted Answer · 2017-05-02T18:02:55.277

Finally was able to avoid the join. Following solution works fine

select sum_cat, count(*)
FROM
(
   select id, sum(category) as sum_cat 
   from 
   (
       select distinct id, 1 as category
       from Table-1

       UNION ALL 

       select distinct id, 1 as category
       from Table-2 

       UNION ALL 

       select distinct id, 2 as category
       from Table-3

       UNION ALL 

       select distinct id, 2 as category
       from Table-4

  )all_ids
   group by log_id
)a
 group by sum_cat;

Explanation

The data comparison it to be done with the data from Table-1 and Table-2 with that in Table-3 and Table-4
So we assign a category to id from these set of tables
All the ids from set A will have category = 1 and set B records have category = 2
Now we sum of the category values and group by ids. So when id is present in both sets it will have value 3. The ids present only in set A will have value 1, which are missing records.

Comparing one day worth of data from S3 buckets faster

1 Answers1