
I have a doubt about how the data is partitioned into part files when the data is skewed. If possible, please help me clarify this.

Let's say this is my departments table, with department_id as the primary key.

mysql> select * from departments;
+---------------+-----------------+
| department_id | department_name |
+---------------+-----------------+
|             2 | Fitness         |
|             3 | Footwear        |
|             4 | Apparel         |
|             5 | Golf            |
|             6 | Outdoors        |
|             7 | Fan Shop        |
+---------------+-----------------+

If I run sqoop import with -m 1 in the import command, I know only one part file will be generated, containing all the records.

Now I ran the command without specifying the number of mappers. By default it should use 4 mappers, and indeed it created 4 part files in HDFS. Below is how the records were distributed across the part files.

[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00000
2,Fitness
3,Footwear
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00001
4,Apparel
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00002
5,Golf
[cloudera@centsosdemo ~]$ hadoop fs -cat /user/cloudera/departments/part-m-00003
6,Outdoors
7,Fan Shop

As per the BoundingValsQuery, min(department_id) = 2, max(department_id) = 8, and 4 mappers are to be used by default.

By that calculation, each mapper should get (8 - 2)/4 = 1.5 records.

Here is what I don't get about how the data is distributed: why did 2 records end up in part-m-00000, only one each in part-m-00001 and part-m-00002, and again two in part-m-00003?
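
For reference, here is a quick sketch of the arithmetic I had in mind (this is only my assumption of even 1.5-wide ranges, not actual Sqoop code):

// My naive expectation (not Sqoop source): divide [2, 8] into 4 equal
// ranges of width (8 - 2) / 4 = 1.5 and see which ids land where.
public class NaiveSplit {
  public static void main(String[] args) {
    double min = 2, max = 8;
    int mappers = 4;
    double step = (max - min) / mappers; // 1.5
    int[] ids = {2, 3, 4, 5, 6, 7};
    for (int m = 0; m < mappers; m++) {
      double lo = min + m * step;
      double hi = lo + step;
      System.out.printf("mapper %d: [%.1f, %.1f) ->", m, lo, hi);
      for (int id : ids) {
        boolean last = (m == mappers - 1); // last range is closed at the top
        if (id >= lo && (id < hi || (last && id <= hi))) {
          System.out.print(" " + id);
        }
      }
      System.out.println();
    }
  }
}

That prints {2,3}, {4}, {5,6}, {7}, i.e. 2, 1, 2, 1 records per mapper, which still does not match the 2, 1, 1, 2 distribution I actually got.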

iamteja
  • You should explain better what you want if you want someone to help you. At least explain what you want to achieve. – Elmer Dantas Jul 14 '17 at 10:27
  • I wanted to know how the data is split among the part files, i.e., which record goes to which part file. – iamteja Jul 14 '17 at 10:29
  • Sqoop creates a SQL query for each mapper. You can check the SQL on each worker node. Can you run all the SQL queries on your SQL server and share the result? – Sandeep Singh Jul 14 '17 at 11:07
  • @SandeepSingh Firstly, I'm using MySQL. I'm aware that Sqoop creates a SQL query for each mapper, but I wanted to know how the query is prepared and how the boundaries for each mapper are decided. – iamteja Jul 14 '17 at 13:30
  • @ElmerDantas Actually I wanted to see how the queries are getting formed in your case, since the maximum value in the data is 7 but you have assigned 8. – Sandeep Singh Jul 14 '17 at 15:27

2 Answers


If you get a chance to look into the library, you'll see there is a sequence of steps involved.

A Sqoop job reads records via DBRecordReader:

 org.apache.sqoop.mapreduce.db.DBRecordReader

Two methods do the work here.

Method 1:

protected ResultSet executeQuery(String query) throws SQLException {
  Integer fetchSize = dbConf.getFetchSize();
  /* The fetch size is taken per split; splits are calculated via the
     getSplits() method of org.apache.sqoop.mapreduce.db.DBInputFormat,
     where each split covers roughly (row count of the table / number
     of mappers) rows. */
}

Split calculation:

org.apache.sqoop.mapreduce.db.DBInputFormat

public List<InputSplit> getSplits(JobContext job) throws IOException {
  // ...here splits are calculated according to the row count of the source table
  // ...query.append("SELECT COUNT(*) FROM " + tableName);
}
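
To make that concrete, here is a rough sketch of the count-based arithmetic described in the comments above (illustrative only, not the actual DBInputFormat code; I'm assuming the common pattern where the last split absorbs any remainder, and that COUNT(*) for the departments table is 6):

// Sketch: rows per split when COUNT(*) = 6 and 4 mappers are used.
public class CountBasedSplits {
  public static void main(String[] args) {
    long count = 6;       // result of SELECT COUNT(*) FROM departments
    int mappers = 4;
    long chunkSize = count / mappers; // 1
    for (int i = 0; i < mappers; i++) {
      long start = i * chunkSize;
      // The last split runs to the end of the table, picking up the remainder.
      long end = (i == mappers - 1) ? count : start + chunkSize;
      System.out.printf("split %d: rows [%d, %d)%n", i, start, end);
    }
  }
}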

Method 2:

 protected String getSelectQuery() {
    if (dbConf.getInputQuery() == null) {
      query.append("SELECT ");

      for (int i = 0; i < fieldNames.length; i++) {
        query.append(fieldNames[i]);
        if (i != fieldNames.length -1) {
          query.append(", ");
        }
      }

      query.append(" FROM ").append(tableName);
      query.append(" AS ").append(tableName); 
      if (conditions != null && conditions.length() > 0) {
        query.append(" WHERE (").append(conditions).append(")");
      }

      String orderBy = dbConf.getInputOrderBy();
      if (orderBy != null && orderBy.length() > 0) {
        query.append(" ORDER BY ").append(orderBy);
      }
    } else {
      //PREBUILT QUERY
      query.append(dbConf.getInputQuery());
    }

    try {// main logic to decide division of records between mappers.
      query.append(" LIMIT ").append(split.getLength());
      query.append(" OFFSET ").append(split.getStart());
    } catch (IOException ex) {
      // Ignore, will not throw.
    }

    return query.toString();
  }

Check out the code section under the comment "main logic to decide division of records between mappers": records are divided according to LIMIT and OFFSET, and this logic is implemented differently for every RDBMS. For example, org.apache.sqoop.mapreduce.db.OracleDBRecordReader has a slightly different implementation of the getSelectQuery() method.
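
As a hypothetical illustration, the per-mapper statements this method builds for the departments table could look roughly like the following (the column list, the split values, and the absence of an ORDER BY are my assumptions, not actual Sqoop output):

// Illustrative LIMIT/OFFSET statements per mapper; the {LIMIT, OFFSET}
// pairs are hypothetical values for a 6-row table and 4 mappers.
public class SelectQueries {
  public static void main(String[] args) {
    String base = "SELECT department_id, department_name"
        + " FROM departments AS departments";
    long[][] splits = {{1, 0}, {1, 1}, {1, 2}, {3, 3}}; // {LIMIT, OFFSET}
    for (long[] s : splits) {
      System.out.println(base + " LIMIT " + s[0] + " OFFSET " + s[1]);
    }
  }
}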

Hope this gives a quick idea of how records are divided among the different mappers.

sumitya

Sqoop finds the min and max values in the primary key column (or the split-by column) and then tries to divide the range evenly for the given number of mappers.

For example, if you had a table with a primary key column id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes, each executing a SQL statement of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
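
A minimal sketch of that range arithmetic, reusing the numbers from the example above (illustrative only):

// Even division of [0, 1000] across 4 tasks; the last range is closed at
// the top, which is why the final upper bound is 1001 in the text above.
public class EvenRanges {
  public static void main(String[] args) {
    long min = 0, max = 1000;
    int tasks = 4;
    long size = (max - min) / tasks; // 250
    for (int i = 0; i < tasks; i++) {
      long lo = min + i * size;
      long hi = (i == tasks - 1) ? max + 1 : lo + size;
      System.out.printf("task %d: WHERE id >= %d AND id < %d%n", i, lo, hi);
    }
  }
}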

Here min = 2 and max = 7, hence Sqoop will run four processes with the following ranges: [2, 4), [4, 5), [5, 6), [6, 7], which means (see the sketch after this list):

  1. records with ids 2 and 3 together
  2. the record with id 4
  3. the record with id 5
  4. records with ids 6 and 7 in the last range
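
A minimal sketch of how those uneven boundaries fall out when the range does not divide evenly; this mirrors the remainder handling in Sqoop's integer splitter as I understand it, so treat the details as an approximation:

import java.util.ArrayList;
import java.util.List;

// Sketch: splitting [2, 7] into 4 ranges when (7 - 2) / 4 leaves a
// remainder. The extra values go to the earliest splits, so the split
// points come out as 2, 4, 5, 6, 7 -> {2,3}, {4}, {5}, {6,7}.
public class SplitPoints {
  public static void main(String[] args) {
    long min = 2, max = 7;
    int numSplits = 4;
    long size = (max - min) / numSplits;      // 1
    long remainder = (max - min) % numSplits; // 1
    List<Long> points = new ArrayList<>();
    long cur = min;
    for (int i = 0; i <= numSplits && cur <= max; i++) {
      points.add(cur);
      cur += size + (i < remainder ? 1 : 0);
    }
    if (points.get(points.size() - 1) != max) {
      points.add(max); // make sure the top of the range is covered
    }
    for (int i = 1; i < points.size(); i++) {
      String op = (i == points.size() - 1) ? "<=" : "<"; // last range is closed
      System.out.printf("mapper %d: id >= %d AND id %s %d%n",
          i - 1, points.get(i - 1), op, points.get(i));
    }
  }
}

With min = 2 and max = 7 this prints the four ranges above and reproduces the observed 2, 1, 1, 2 record distribution.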
Tutu Kumari