What is the difference between --split-by and --boundary-query in SQOOP?

Question

Assuming we don't have a column where values are equally distributed, let's say we have a command like this:

sqoop import \
...
--boundary-query "SELECT min(id), max(id) from some_table"
--split-by id
...

What's the point of using --boundary-query here while --split-by does the same thing? Is there any other way to use --boundary-query? Or any other way to split data more efficiently when there is no key(unique) column?

Dev · Answer 1 · 2016-11-28T13:29:41.270

15

--split-by id will split your data uniformly on the basis of number of mappers (default 4).

Now boundary query by default is something like this.

--boundary-query "SELECT min(id), max(id) from some_table"

But if you know id starts from val1 and ends with val2. Then there is no point to calculate min() and max() operations. This will make sqoop command execution faster.

You can specify any arbitrary query returning val1 and val2.

Edit:

Right now (1.4.7) there is no way in sqoop to specify uneven partitions for splitting.

For example, you have data like:

1,2,3,51,52,191,192,193,194,195,196,197,198,199,200

If you defined 4 mappers in the command. It will check min and max which is 1 and 200 in our case.

Then it will split it into 4 parts:

Yes, in this 3rd mapper(101-150) will get nothing from the RDBMS table.

But there is no way to define custom partition like :

1-10
51-60
190-200

For large data (billions of rows), practically it is not suitable to find exact values like this or use another tool to find data pattern first and then prepare custom partitions.

edited Nov 28 '16 at 13:29

answered Nov 28 '16 at 12:50

Dev

13,492
19
81
174

I could not understand how --split-by splits it uniformly. If I don't have a near uniformly distributed column and have to choose one goes like; {1, 25430002, 25430002, 25340003, 25520001, 27345001, ...24466002} I know this is an absurd example but what I'm trying to say is the way --split-by handles this with 4 nodes (or more, unless you have 1k nodes or something) by assigning ~a-million of those with >25000000 to a single node just because it has 1 there. How do I split uniformly with --split-by (if table does not have uniformly distributed column)? – burakongun Nov 28 '16 at 13:16
1

@burakongun check out my [answer](http://stackoverflow.com/questions/37042357/sqoop-data-splitting/37048390#37048390) to understand how data splitting works in sqoop. – Dev Nov 28 '16 at 13:24
Thanks for the clarification. That was also what I was thinking. I guess, what I can do now is directly import Oracle partitions or play with row_num etc. (?) or modify queries to get a uniform column. Custom partitions seems harder to implement for large tables, correct me if I'm wrong. – burakongun Nov 29 '16 at 13:45
1

@burakongun I am not much familiar with oracle but I guess row_num is uniformly distributed so all of your partitions will get balanced load. If you have a column having uniform values _(not necessarily primary key)_ choose that. **Custom partition is not supported by sqoop** _(as I mentioned in the answer)_. You need to modify source code of sqoop for adding that feature. – Dev Nov 29 '16 at 16:41

score 1 · Answer 2 · answered Feb 03 '18 at 11:40

--split-by For free-form query imports, you need to specify 'split-by' . When you are importing the result of any particular query, sqoop needs to know the column-name using which it will create splits. Whereas, while importing tables, if not specified, it uses the primary key of the table being imported for creating splits. In case your primary key is uneven and not consistent, you can also specify any other column using split-by.

--boundary-query During sqoop import process, it uses this query to calculate the boundary for creating splits: select min(), max() from table_name.

In some cases this query is not the most optimal so you can specify any arbitrary query returning two numeric columns using --boundary-query argument. This saves min(split-by) and max(split-by) operations and thereby is more efficient.

Eric Tabot · Answer 3 · 2018-06-10T23:17:13.010

I did not read from the answers what I was expecting.

--split-by:

I would say --split-by is mostly being used when you have a table that hasn't got a primary key, sqoop will normally spit out an error message if the table has no primary key. --split-by is being used to determine another column to be used to compute the min() & max in the absence of a pk. Some requirements are:

Table should have numeric values
Table should not contain null
etc

use --split-by only on indexed columns for performance reasons. In case you have to import data from multiple tables, its naturally hard to inspect all those tables to see which ones have primary keys and which don't have primary keys.Here, u use --autoreset-to-one-mapper alongside the mappers used with -m #of_mappers. so your command will look like this:

sqoop-import --connect jdbc:mysql://localhost/dbName --table sometable --username uname --warehouse-dir whdir --autoreset-to-one-mapper -m 5

Tables containing no primary key will use one thread ( sequentially ) and those with primary keys will use 5 threads or mappers as specified. You cannot use both --autoreset-to-one-mapper and -split-by in one command.

--boundary-queries:

If you know the min and value on a table, you can waive complex default computation to get them, you can simple hardcode them as argument to boundary queries. @burakongun explained this well.

score 0 · Answer 4 · edited Oct 27 '21 at 08:32

--split-by

we use split by mostly for the table when we have a primary key in the table, but if we don't have the primary key then by using --split-by we do give the command to distribute the numerical data/rows. the command line is as below

--sqoop import --connect jdbc:mysql://localhost:3306/database --username user --password pass --table tablename --target-dir dirname --delete-target-dir<if present already> --split-by column_name

--boundary-queries

if you know the min and max value in the table then it calculates boundary to create splits

007 · Answer 5 · 2022-09-28T04:21:13.307

-1

sqoop import
... --query "select c1, c2 from (select c1, c2, c1%100 as r from your_table)a where 1=1 AND $CONDITIONS"
--boundary-query "SELECT 0 as MIN, 99 as MAX"
--split-by r
...

it's useful to find a split key to uniform

edited Sep 28 '22 at 04:21

answered Sep 28 '22 at 03:04

007

1
2

What is the difference between --split-by and --boundary-query in SQOOP?

5 Answers5