
Sqoop uses the table's primary key or --split-by <column> to split a transfer from an RDBMS into HDFS, and the default number of mappers is four. However, with --direct the transfer can be faster without using the mappers. My question is: if no mapper is being used, how does Sqoop handle the transfer in the Hadoop framework?
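For context, a standard (JDBC-based) parallel import might look like the following sketch; the database `shop`, table `orders`, and column `order_id` are hypothetical placeholders:

```shell
# Hypothetical example: a standard JDBC-based parallel import.
# Sqoop splits the table on --split-by (or the primary key) and
# launches 4 map tasks by default (-m / --num-mappers).
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username sqoop -P \
  --table orders \
  --split-by order_id \
  -m 4 \
  --target-dir /data/orders
```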

ARASH

2 Answers


As per the Sqoop docs:

MySQL Direct Connector allows faster import and export to/from MySQL using mysqldump and mysqlimport tools functionality instead of SQL selects and inserts.

Generally, it is faster than running range queries over multiple mappers via JDBC.
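Enabling this path is just a matter of adding the --direct flag to the import. A minimal sketch, again with a hypothetical `shop`/`orders` database:

```shell
# Hypothetical example: the same import with --direct, which hands
# the data movement to mysqldump instead of JDBC range queries.
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username sqoop -P \
  --table orders \
  --direct \
  --target-dir /data/orders
```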

Dev

Sqoop with the --direct argument internally uses the mysqldump tool to import data from MySQL. mysqldump is MySQL's built-in export tool; you could also call it a database backup program. This utility performs logical backups, producing a set of SQL statements that can be executed to reproduce the original database object definitions and table data. The mysqldump command can also generate output in CSV, other delimited-text, or XML format.
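To see what Sqoop is delegating to, you can run mysqldump directly. A sketch with a hypothetical `shop` database and `orders` table:

```shell
# Hypothetical example: a plain logical backup as SQL statements.
mysqldump -u root -p shop orders > orders.sql

# Delimited-text output instead: writes one .sql schema file and
# one .txt data file per table into the given server-side directory.
mysqldump -u root -p --tab=/tmp/dump --fields-terminated-by=',' shop orders
```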

If your delimiters exactly match the delimiters used by mysqldump, then Sqoop will use a fast-path that copies the data directly from mysqldump's output into HDFS. Otherwise, Sqoop will parse mysqldump's output into fields and transcode them into the user-specified delimiter set. This incurs additional processing, so performance may suffer. For convenience, the --mysql-delimiters argument will set all the output delimiters to be consistent with mysqldump's format.
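So to stay on the fast path described above, you can let Sqoop adopt mysqldump's own delimiters. A hedged sketch (same hypothetical `shop`/`orders` names as before):

```shell
# Hypothetical example: --mysql-delimiters makes Sqoop's output
# delimiters match mysqldump's native format, so the fast copy
# path is used instead of re-parsing and transcoding each record.
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username sqoop -P \
  --table orders \
  --direct \
  --mysql-delimiters \
  --target-dir /data/orders
```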

These links can be useful to understand it more:

http://archive.cloudera.com/docs-backup/sqoop/_direct_mode_imports.html
https://dev.mysql.com/doc/refman/5.7/en/mysqldump.html

Sandeep Singh