I have a question about how Sqoop partitions data into part files when the data in the split column is skewed. Could someone please help me understand this?
Let's say this is my departments table, with department_id as the primary key.
mysql> select * from departments;
+---------------+-----------------+
| department_id | department_name |
+---------------+-----------------+
|             2 | Fitness         |
|             3 | Footwear        |
|             4 | Apparel         |
|             5 | Golf            |
|             6 | Outdoors        |
|             7 | Fan Shop        |
+---------------+-----------------+
If I run sqoop import with -m 1, I know I will get only one part file, containing all the records.
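For reference, this is roughly the command I'm running (a sketch; the connection string, credentials, and target directory are placeholders for my setup):

sqoop import \
  --connect jdbc:mysql://localhost/retail_db \
  --username retail_dba \
  --password cloudera \
  --table departments \
  --target-dir /user/cloudera/departments \
  -m 1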
Then I ran the command without specifying the number of mappers. By default Sqoop should use 4 mappers, and it did create 4 part files in HDFS. Below is how the records were distributed across the part files.
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00000
2,Fitness
3,Footwear
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00001
4,Apparel
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00002
5,Golf
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00003
6,Outdoors
7,Fan Shop
As per the BoundingValsQuery (i.e. SELECT MIN(department_id), MAX(department_id) FROM departments), min(department_id) = 2, max(department_id) = 7, and 4 mappers are used by default.
On calculation, each mapper should get a range of width (7 - 2) / 4 = 1.25, which is not a whole number of records.
This is where I get lost: how does a fractional range like 1.25 map to whole records? I don't understand how two records ended up in part-m-00000, only one each in part-m-00001 and part-m-00002, and then two again in part-m-00003.
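To make my reasoning concrete, here is a small Python sketch of what I assume the splitter is doing. This is only my guess, modelled on Sqoop's IntegerSplitter; the half-open ranges and the closed final range are my assumptions, not something I found documented:

def split_ranges(lo, hi, num_mappers):
    # Divide [lo, hi] into num_mappers equal-width ranges.
    size = (hi - lo) / num_mappers          # (7 - 2) / 4 = 1.25
    bounds = [lo + i * size for i in range(num_mappers)] + [float(hi)]
    return list(zip(bounds, bounds[1:]))

ids = [2, 3, 4, 5, 6, 7]                    # department_id values from the table
ranges = split_ranges(2, 7, 4)
for m, (low, high) in enumerate(ranges):
    last = (m == len(ranges) - 1)
    # Each mapper would run: WHERE department_id >= low AND department_id < high
    # (with <= on the last split, so the maximum value is included).
    rows = [i for i in ids if i >= low and (i <= high if last else i < high)]
    print(f"part-m-0000{m}: range [{low:.2f}, {high:.2f}{']' if last else ')'} contains {rows}")

Run as-is, this model reproduces exactly the 2/1/1/2 distribution shown above. So my real question is whether this is actually how Sqoop turns fractional boundaries into per-mapper WHERE clauses, and what happens to the distribution when the split column values are skewed rather than evenly spaced.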