
I have created a table with a dynamic partition in Hive as below:

create table sample(
    uuid String, date String, Name String, EmailID String,
    Comments String, CompanyName String, country String,
    url String, keyword String, source String)
PARTITIONED BY (id String)
STORED AS PARQUET;

I have also set the following in the Hive shell:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=100000000;
set hive.exec.max.dynamic.partitions.pernode=100000000;
set hive.exec.max.created.files=100000000;

Is this good practice, given that I am setting a value of 100 million for each of the dynamic partition settings shown above?
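For reference, a dynamic-partition load into this table would look roughly like the following (a sketch only; staging_sample is a hypothetical staging table holding the same columns plus id):

-- staging_sample is hypothetical; the dynamic partition column (id) must be
-- selected last so Hive can map it to PARTITION (id). Backticks guard the
-- reserved word "date".
INSERT OVERWRITE TABLE sample PARTITION (id)
SELECT uuid, `date`, Name, EmailID, Comments, CompanyName,
       country, url, keyword, source, id
FROM staging_sample;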

wazza

1 Answer


Dynamic partitions are designed for tables that will keep receiving new partition values. If your table is populated through INSERT statements, that is fine; without dynamic partitioning you would have to run a separate query for each new partition, or know the partition values in advance:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
   SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'

You can check an example of this in the official Hive tutorial.
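For comparison, a dynamic-partition version of the same load (a sketch; it assumes the staging table exposes the country value as pvs.country) omits the constant for country and lets Hive take the partition value from the last selected column:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
   SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country

Here dt is still static while country is dynamic; the nonstrict mode set in the question is only required when every partition column is dynamic.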

Best practices for partitioning relate to the kind of data being stored. For example:

  • It is not recommended to use unique values such as IDs (if each row has a different id value, it is a bad practice).
  • The data must have enough dispersion: a partition column with very few distinct values (a boolean field or similar) is also a bad practice. A sketch of a more suitable choice follows this list.
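As an illustration of these points (a sketch; the table name and the choice of country are only examples), the table from the question could be partitioned on a lower-cardinality column instead of a per-row id, which keeps the number of partition directories and files manageable:

-- Hypothetical variant: country becomes the partition column, so it moves out
-- of the regular column list and into the directory name (country=US/...),
-- while id stays as an ordinary column inside each file.
create table sample_by_country(
    uuid String, `date` String, Name String, EmailID String,
    Comments String, CompanyName String, url String,
    keyword String, source String, id String)
PARTITIONED BY (country String)
STORED AS PARQUET;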
Miguel
  • Hi Miguel, thanks for the response, but I want to store my data in the same layout that dynamic partitioning produces, like /output/id=100/data.parquet. Is there any way to achieve this? – wazza Nov 03 '15 at 09:12
  • Hi, in theory partitioned data (whether the partition is static or dynamic) is always stored like that, i.e. partitioning_field=value/data.parquet. "Dynamic" only means that Hive does not need you to specify the partition value explicitly as a constant (a string, for example); instead Hive can take it from another field's value. – Miguel Nov 03 '15 at 09:20
  • Yes, I know that; I want to store it based on id values. You said that using id as a partition is bad practice in Hive, so I am asking whether there is another way to achieve this, maybe Java MapReduce, Pig, etc. – wazza Nov 03 '15 at 09:23
  • Only if the id value is unique (or has few related rows), for example the typical user id in a users table. It is a bad practice because every query you run will open as many files as rows retrieved, which impacts your server's performance negatively. But, as with everything in development, that doesn't mean you must never do it; it depends on the situation (I don't know yours exactly) and it may be the best choice for you. – Miguel Nov 03 '15 at 09:44
  • I have tried that with 10 GB of data and 1 million id values, but it throws a Java heap space error. – wazza Nov 03 '15 at 09:47
  • I work with much larger data sets using the approach I posted and I have never had that kind of error. That looks like a configuration problem... but at this point I can't help you further; my knowledge is not enough, sorry. – Miguel Nov 03 '15 at 09:56