why is the default maximum mappers are 4 in Sqoop? can we give more than 4 mappers in -m parameter?

Question

I am trying to understand the reason behind the default maximum mappers in a sqoop job. Can we set more than four mappers in a sqoop job to achieve higher parallelism.

http://stackoverflow.com/questions/43199789/sqoop-import-how-many-max-mapper-could-be-executed/43270927#43270927 --refer my answer's 1st point — Taha Naqvi, Apr 21 '17 at 11:56

score 6 · Answer 1 · answered Apr 21 '17 at 09:12

If you are using integer column in your split-by then the default number of mappers are 4. And it is strongly recomonded that you always use integer column not the string/char/Text column. see the code here for more explaination. https://github.com/apache/sqoop/blob/660f3e8ad07758aabf0a9b6ede3accdfac5fb1be/src/java/org/apache/sqoop/mapreduce/db/TextSplitter.java#L100

Yes you can give increase/decrease the parallelism by specifying -m

from Sqoop Guide Sqoop imports data in parallel from most database sources. You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. Do not increase the degree of parallelism greater than that available within your MapReduce cluster; tasks will run serially and will likely increase the amount of time required to perform the import. Likewise, do not increase the degree of parallism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.

When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.

If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.

@Sandeep. Please up vote and accept the answer if my answer resolved your query. Thanks — Aditya Agarwal, Apr 23 '17 at 03:38

score 4 · Answer 2 · answered Apr 21 '17 at 08:51

My guess is that 4 is a default number that works well on practice for most use cases. Use the parameter --num-mappers if you want Sqoop to use a different number of mappers. For example, to use 8 concurrent tasks you would use the following sqoop command:

sqoop import \ --connect jdbc:mysql://mysql.example.com/testdb \ --username abcdef \ --password 123456 \ --table test \ --num-mappers 8

Using more mappers will lead to a higher number of concurrent data transfer tasks, which can result in faster job completion. However, it will also increase the load on the database as Sqoop will execute more concurrent queries. You might want to keep this in mind of you are pulling data from your production environment.

score 0 · Answer 3 · answered May 28 '21 at 16:35

when we don't mention the number of mappers while transferring the data from RDBMS to HDFS file system sqoop will use default number of mapper 4. Sqoop imports data in parallel from most database sources. Sqoop only uses mappers as it does parallel import and export.

Shashank Seth · Answer 4 · 2022-12-25T14:44:02.467

If you're not mentioning number of mapper in sqoop it will by default use 4 mapper in parallel to do the sqoop import. If you want to use more then 4 mapper you can use --num-mappers in your sqoop command you can use any number of mapper Also, if you are not sure you have primary key or not in source table --autoreset-to-one-mapper come handy in that case. If there is primary key it will use the mentioned number of mappers to execute the job or else it will just use one mapper to import the table without primary key

sqoop-import \
-- connect jdbc:mysql://localhost/databasename \
-- username root \
-- password xxxxxxxx \
-- warehouse-dir /directory/path/from/home \
-- autoreset-to-one-mapper \
-- num-mappers 6

Also, it comes handy when you doing sqoop import-all-tables to import multiple tables for all the tables with primary key it will use the mentioned number of mappers and for all the tables without primary key it will reset the number of mappers to 1 without failing the job.

Note: The tables without Primary Key use only one mapper for sqoop import until and unless you're not giving --split-by column

why is the default maximum mappers are 4 in Sqoop? can we give more than 4 mappers in -m parameter?

4 Answers4