Could anyone please help me understand: after importing data from a source system (Postgres, Oracle, SQL Server) to HDFS using Sqoop, what checks do you perform to verify that all the data was imported correctly, without any discrepancies? How do you make sure the data you imported is not duplicated? What other checks do you perform?
1 Answer
For automated data quality checks after import you can, for example:
Compare the row count obtained with sqoop eval against the count in HDFS (Hive) for the loaded partition. This is the simplest check and is useful as a final step of the ETL process. It shows that, most probably, all data was loaded and without duplicates.
Compare the sum of some numeric column obtained with sqoop eval against the same sum in Hive, again for the loaded partition. This shows that, with some probability, the data was loaded correctly and the columns are in the right order (not shifted).
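The two checks above can be sketched as a small shell step at the end of the load. This is a minimal sketch: the table name `orders`, column `amount`, partition value, and JDBC URL are placeholders, and the actual `sqoop eval` / `hive` invocations are left commented out so the comparison logic is shown on its own.

```shell
#!/usr/bin/env bash
# Sketch of a post-import reconciliation check for one loaded partition.
# Table "orders", column "amount", and the connection details are assumptions.
set -euo pipefail

# Source-side count and sum via sqoop eval (placeholder JDBC URL):
# src=$(sqoop eval --connect jdbc:postgresql://dbhost/db --username user -P \
#       --query "SELECT COUNT(*), SUM(amount) FROM orders WHERE load_date='2023-01-01'")

# Target-side count and sum in Hive for the same partition:
# tgt=$(hive -S -e "SELECT COUNT(*), SUM(amount) FROM orders WHERE load_date='2023-01-01'")

# Compare the two results and fail the ETL step on any mismatch.
compare_metrics() {
  local src="$1" tgt="$2"
  if [ "$src" = "$tgt" ]; then
    echo "OK: source and target metrics match ($src)"
    return 0
  else
    echo "MISMATCH: source=$src target=$tgt" >&2
    return 1
  fi
}

# Example with hard-coded values standing in for the query results:
compare_metrics "1000 12345.67" "1000 12345.67"
```

A non-zero exit status from the comparison lets the scheduler (Oozie, Airflow, cron, etc.) mark the load as failed instead of silently accepting a partial or duplicated import.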
Applying a few such checks at a time increases the probability of discovering a bug in the data load.
Of course, it is difficult to cover all possible load bugs with simple and fast queries, but for an automated data quality check this is usually enough.

leftjoin