Sqoop: Understanding how num-mappers and fetch-size work together

Question

I am trying to import a table from MySQL incrementally using the following configuration:

--split-by
date_format(updated_at, '%l')
--boundary-query
select 1, 12 from ${table}
--m
12
--incremental
lastmodified
--last-value
${lastValue}
--check-column
updated_at
--merge-key
id

When I run this, I get a Java heap space error. After some searching, I found another Sqoop option, --fetch-size <n>, which defaults to 1000 and controls the number of entries to read from the database at once.

The default container memory allocation is 1 GB, and the table I am pulling is around 100 GB.

I am trying to figure out why it throws a Java heap space error: if it pulls only 1000 rows at a time, the data for 1000 rows should not exceed 1 GB.
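As a rough sanity check of that reasoning, here is a minimal Python sketch of the arithmetic. The figures are the ones from the question. Treating the whole 1 GB container as usable heap is an assumption (the JVM heap is normally smaller than the container), and the helper name `avg_row_budget` is made up for illustration.

```python
# Rough sanity check of the memory reasoning in the question.
# Assumption: the entire container is available as heap, which overstates
# what the JVM actually gets.

CONTAINER_BYTES = 1 * 1024**3   # 1 GB container, from the question
FETCH_SIZE = 1000               # rows per fetch, the default stated in the question


def avg_row_budget(container_bytes: int, fetch_size: int) -> int:
    """Average bytes each row may occupy before one fetch fills the container."""
    return container_bytes // fetch_size


budget = avg_row_budget(CONTAINER_BYTES, FETCH_SIZE)
print(f"{FETCH_SIZE} rows fill the container only if rows average more than {budget} bytes")
```

So a single 1000-row fetch would only exhaust the container if rows averaged roughly 1 MB each, which is why the error looks surprising if the fetch size is really being honoured.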

Is the fetch-size setting being overridden by the split-by, boundary-query and mapper settings?

The idea behind this config was to ensure that the data distribution is not skewed, so that a few mappers don't end up pulling all the data. With this config, I am splitting by hour in 12-hour format, so that hours 1 and 13 are assigned to the same mapper.
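To make the intended bucketing concrete, here is a small Python sketch that mimics what MySQL's '%l' format specifier returns (the hour on a 12-hour clock, 1 to 12) and groups the 24 hours of a day accordingly. The function name `hour_12` is hypothetical, and the sketch only shows which hours share a split value; it does not model how Sqoop turns the boundary values into actual mapper splits.

```python
from collections import defaultdict


def hour_12(hour_24: int) -> int:
    """Mimic MySQL DATE_FORMAT(..., '%l'): hour on a 12-hour clock, 1..12."""
    return hour_24 % 12 or 12


# Group the 24 hours of a day by their 12-hour value.
buckets = defaultdict(list)
for h in range(24):
    buckets[hour_12(h)].append(h)

for key in sorted(buckets):
    print(key, buckets[key])
```

Each of the 12 values covers exactly two hours of the day (for example, 1 covers hours 1 and 13, and 12 covers hours 0 and 12), which is the even distribution the configuration is aiming for, provided the rows themselves are spread evenly across hours.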

Any guidance on this would be really helpful.
