Could anyone please help me understand: after importing data from a source system (Postgres, Oracle, SQL Server) to HDFS using Sqoop, what checks do you perform to verify that all the data was imported correctly, without any discrepancies? How do you make sure the data you imported is not duplicated? What other checks do you perform?
1 Answer
For automated data quality checks after import you can, for example:
Compare the row count obtained with sqoop eval against the count in HDFS (Hive) for the loaded partition. This is the simplest check and is useful as a final step of the ETL process. It shows that, most probably, all data was loaded and without duplicates.
Compare the sum of some numeric column obtained with sqoop eval against the same sum in Hive, again for the loaded partition. This shows that, with some probability, the data was loaded correctly and the columns are in the right order (not shifted).
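The two checks above can be sketched as a small shell step at the end of the load. This is a minimal sketch: the table name `orders`, column `amount`, partition value, and JDBC URL are placeholders, and the actual `sqoop eval` / `hive` invocations are left commented out so the comparison logic is shown on its own.

```shell
#!/usr/bin/env bash
# Sketch of a post-import reconciliation check for one loaded partition.
# Table "orders", column "amount", and the connection details are assumptions.
set -euo pipefail

# Source-side count and sum via sqoop eval (placeholder JDBC URL):
# src=$(sqoop eval --connect jdbc:postgresql://dbhost/db --username user -P \
#       --query "SELECT COUNT(*), SUM(amount) FROM orders WHERE load_date='2023-01-01'")

# Target-side count and sum in Hive for the same partition:
# tgt=$(hive -S -e "SELECT COUNT(*), SUM(amount) FROM orders WHERE load_date='2023-01-01'")

# Compare the two results and fail the ETL step on any mismatch.
compare_metrics() {
  local src="$1" tgt="$2"
  if [ "$src" = "$tgt" ]; then
    echo "OK: source and target metrics match ($src)"
    return 0
  else
    echo "MISMATCH: source=$src target=$tgt" >&2
    return 1
  fi
}

# Example with hard-coded values standing in for the query results:
compare_metrics "1000 12345.67" "1000 12345.67"
```

A non-zero exit status from the comparison lets the scheduler (Oozie, Airflow, cron, etc.) mark the load as failed instead of silently accepting a partial or duplicated import.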
Applying a few such checks at a time increases the probability of discovering a bug in the data load.
Of course, it is difficult to cover all possible load bugs with simple and fast queries, but for an automated data quality check this is usually enough.

leftjoin