How can I create dynamic partitions using Java MapReduce, similar to a GROUP BY on a country column in SQL? For example, I have a country-based dataset and need to separate the records by country (one partition per country). We can't limit the set of countries, since new country data arrives every day.

Learn Hadoop

1 Answer

You can use Hive's dynamic partitioning feature to populate partitions automatically from incoming data. The example below demonstrates auto-partitioning of raw data based on country information.

Create a raw data file (country1.csv) that contains data for multiple countries

1,USA
2,Canada
3,USA
4,Brazil
5,Brazil
6,USA
7,Canada

Upload this file to a location in HDFS

hadoop fs -mkdir /example_hive
hadoop fs -mkdir /example_hive/country
hadoop fs -put country1.csv /example_hive/country

Create a non-partitioned Hive table on top of the data

CREATE EXTERNAL TABLE country
(
id int, 
country string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 'hdfs:///example_hive/country';

Verify that the Hive table is created correctly

hive (default)> select * from country;
1   USA
2   Canada
3   USA
4   Brazil
5   Brazil
6   USA
7   Canada

Create a partitioned Hive table, with country as the partition

hive (default)> CREATE TABLE country_par
(
id int
)
PARTITIONED BY (country string);

Enable dynamic partitioning

hive (default)> SET hive.exec.dynamic.partition = true;
hive (default)> SET hive.exec.dynamic.partition.mode = nonstrict;

Populate the partitioned table, with Hive automatically putting the data in the right country partition

hive (default)> INSERT INTO TABLE country_par 
PARTITION(country)
SELECT id,country FROM country;

Verify that the partitions were created, and populated correctly

hive (default)> show partitions country_par;
country=Brazil
country=Canada
country=USA

hive (default)> select * from country_par where country='Brazil';
4   Brazil
5   Brazil

hive (default)> select * from country_par where country='USA';
1   USA
3   USA
6   USA

hive (default)> select * from country_par where country='Canada';
2   Canada
7   Canada

hive (default)> select country,count(*) from country_par group by country;
Brazil  2
Canada  2
USA 3
Jagrut Sharma
  • Is there another way to approach this using Java MapReduce? – Learn Hadoop Apr 29 '18 at 06:49
  • Using Hive makes this much easier. If you are writing a MapReduce job from scratch, you can emit [key=country, value=record] from the mapper, set the number of reducers to the number of output files desired, and do a straight write-through in the reducer. You will get n output files, each sorted by country, but a file may contain multiple countries if the number of distinct countries in the data exceeds the number of reducers. – Jagrut Sharma Apr 29 '18 at 07:52
  • Sharma, thanks a lot. – Learn Hadoop Apr 29 '18 at 15:56
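
The MapReduce approach described in the comments above (emit key=country, pick a reducer count, write through) can be illustrated with a minimal plain-Java sketch. It only simulates the map, partition, and reduce steps in memory and does not use any Hadoop classes. The names CountryPartitionSketch, partitionFor, and run are hypothetical. The partition formula follows the usual "non-negative hash modulo reducer count" idea, but a real job would hash Hadoop key objects rather than Java Strings, so the actual reducer assignments would likely differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Plain-Java simulation of map -> partition -> reduce; no Hadoop classes are used.
class CountryPartitionSketch {

    // Assumed partitioning rule: non-negative hash of the key, modulo the reducer count.
    static int partitionFor(String country, int numReducers) {
        return (country.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Returns one list of output lines per simulated reducer (one "part file" each).
    static List<List<String>> run(List<String> lines, int numReducers) {
        // One sorted map per reducer stands in for the shuffle/sort phase.
        List<TreeMap<String, List<String>>> shuffled = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            shuffled.add(new TreeMap<>());
        }

        // "Map" phase: emit key=country, value=whole record.
        for (String line : lines) {
            String[] fields = line.split(",", 2);
            if (fields.length < 2) {
                continue; // skip malformed records
            }
            String country = fields[1].trim();
            shuffled.get(partitionFor(country, numReducers))
                    .computeIfAbsent(country, k -> new ArrayList<>())
                    .add(line);
        }

        // "Reduce" phase: straight write-through, keys visited in sorted order.
        List<List<String>> partFiles = new ArrayList<>();
        for (TreeMap<String, List<String>> reducerInput : shuffled) {
            List<String> out = new ArrayList<>();
            for (List<String> records : reducerInput.values()) {
                out.addAll(records);
            }
            partFiles.add(out);
        }
        return partFiles;
    }

    public static void main(String[] args) {
        List<String> data = List.of(
                "1,USA", "2,Canada", "3,USA", "4,Brazil", "5,Brazil", "6,USA", "7,Canada");
        List<List<String>> partFiles = run(data, 2);
        for (int i = 0; i < partFiles.size(); i++) {
            System.out.println("part-" + i + ": " + partFiles.get(i));
        }
    }
}
```

In this sketch every record for a given country always lands in the same simulated part file, and each file is ordered by country. Several countries can still share one file, which is why the Hive approach in the answer is simpler when one partition per country is required.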