1

I have a date time table, fact table and a partitioned table. Goal is to join date_time to fact and insert in to partitioned table. It is working as I can verify /apps/hive/warehouse/dbname.db/p_tbl/p_year=2016/p_month=01/p_day=01 as well as several other year, month, day folders are present. However I also have p_year=__HIVE_DEFAULT_PARTITION__/p_month=__HIVE_DEFAULT_PARTITION__/p_day=__HIVE_DEFAULT_PARTITION__ full of data that should have their own partitions. Why are some year, month, day partitions working and for other dates they are being written to HIVE_DEFAULT? And yes I renamed to generic for my post, not real db,tbl,col names...

CREATE EXTERNAL TABLE ${DB_NAME_HIVE}.date_time(
date_time_key bigint,
label_yyyy varchar(32),
label_mm varchar(32),
label_dd varchar(32)
)
row format delimited
fields terminated by ','
lines terminated by '\n'
LOCATION '${TARGET_PATH}date_time'
;

CREATE EXTERNAL TABLE ${DB_NAME_HIVE}.p_table(
date_time_key bigint,
column_0 varchar(50),
column_1 char(32),
column_2 timestamp,
column_3 bigint,
column_4 tinyint
)
PARTITIONED BY(p_year string, p_month string, p_day string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '${TARGET_PATH}my_table'
;


USE ${DB_NAME_HIVE};
INSERT INTO TABLE ${DB_NAME_HIVE}.p_tbl PARTITION(p_year, p_month, p_day)
SELECT
mt.date_time_key,
mt.col_0,
mt.col_1,
mt.col_2,
mt.col_3,
mt.col_4,
dt.label_yyyy,
dt.label_mm,
dt.label_dd
FROM my_tbl mt
LEFT OUTER JOIN date_time dt on (mt.date_time_key = dt.date_time_key)
;

INFO  :          Time taken for load dynamic partitions : 12557
INFO  :         Loading partition {p_year=2016, p_month=11, p_day=15}
INFO  :         Loading partition {p_year=2016, p_month=07, p_day=20}
INFO  :         Loading partition {p_year=2014, p_month=08, p_day=06}
INFO  :         Loading partition {p_year=2015, p_month=04, p_day=05}
INFO  :         Loading partition {p_year=2012, p_month=10, p_day=31}
INFO  :         Loading partition {p_year=2014, p_month=04, p_day=16}
INFO  :         Loading partition {p_year=2014, p_month=09, p_day=06}
INFO  :         Loading partition {p_year=2014, p_month=09, p_day=09}
INFO  :         Loading partition {p_year=2012, p_month=11, p_day=22}
INFO  :         Loading partition {p_year=2014, p_month=11, p_day=19}
INFO  :         Loading partition {p_year=2014, p_month=09, p_day=03}
INFO  :         Loading partition {p_year=2013, p_month=12, p_day=24}
INFO  :         Loading partition {p_year=2014, p_month=10, p_day=29}
INFO  :         Loading partition {p_year=2015, p_month=09, p_day=04}
INFO  :         Loading partition {p_year=2015, p_month=05, p_day=11}
INFO  :         Loading partition {p_year=2016, p_month=06, p_day=13}
INFO  :         Loading partition {p_year=2014, p_month=09, p_day=24}
INFO  :         Loading partition {p_year=2014, p_month=10, p_day=21}
INFO  :         Loading partition {p_year=2016, p_month=01, p_day=06}
INFO  :         Loading partition {p_year=2014, p_month=11, p_day=05}
INFO  :         Loading partition {p_year=2012, p_month=12, p_day=04}
INFO  :         Loading partition {p_year=2016, p_month=11, p_day=25}
INFO  :         Loading partition {p_year=2014, p_month=10, p_day=13}
INFO  :         Loading partition {p_year=2013, p_month=06, p_day=21}
INFO  :         Loading partition {p_year=2013, p_month=06, p_day=27}
INFO  :         Loading partition {p_year=2014, p_month=10, p_day=22}
INFO  :         Loading partition {p_year=2016, p_month=11, p_day=22}
INFO  :         Loading partition {p_year=2012, p_month=10, p_day=26}
INFO  :         Loading partition {p_year=2014, p_month=08, p_day=28}
INFO  :         Loading partition {p_year=2013, p_month=10, p_day=21}
INFO  :         Loading partition {p_year=2014, p_month=09, p_day=04}
INFO  :         Loading partition {p_year=2013, p_month=11, p_day=14}
INFO  :         Loading partition {p_year=2013, p_month=10, p_day=22}
INFO  :         Loading partition {p_year=2014, p_month=08, p_day=26}
INFO  :         Loading partition {p_year=2012, p_month=10, p_day=30}
INFO  :         Loading partition {p_year=2014, p_month=05, p_day=23}
INFO  :         Loading partition {p_year=2012, p_month=10, p_day=29}
INFO  :         Loading partition {p_year=2014, p_month=09, p_day=02}
INFO  :         Loading partition {p_year=__HIVE_DEFAULT_PARTITION__, p_month=__HIVE_DEFAULT_PARTITION__, p_day=__HIVE_DEFAULT_PARTITION__}
INFO  :         Loading partition {p_year=2014, p_month=09, p_day=05}
INFO  :         Loading partition {p_year=2016, p_month=07, p_day=25}
INFO  :         Loading partition {p_year=2016, p_month=08, p_day=10}
AM_Hawk
  • 661
  • 1
  • 15
  • 33

2 Answers2

2

With dynamic partitioning in Hive if null values exist then they are written to a Hive default partition. So in order to resolve this you could clean up the source data to populate the nulls or remove them.

Jared
  • 2,904
  • 6
  • 33
  • 37
  • just to clarify is it a null value in any column in the "row"? I know my date_time and fact tables are clean, there will always be a year,month,day. – AM_Hawk Nov 30 '16 at 18:40
  • it's either nulls in the source, nulls being created by the select statement in the join, or some sort of datatype error that is storing the values as null which may be the case if your source tables are good. – Jared Dec 01 '16 at 01:20
0

This issue has been resolved by NOT using a DATE_TIME table and using the built in HIVE functions YEAR();MONTH();DAY(); against a TIMESTAMP column from the data. These functions return an INT which is passed to the partition.

Example

CREATE EXTERNAL TABLE ${DB_NAME_HIVE}.my_tbl(
unique_id char(32),
some_name varchar(50),
some_group varchar(50),
in_time_start timestamp,
in_time_finish timestamp
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '${TARGET_PATH}my_tbl';

CREATE EXTERNAL TABLE ${DB_NAME_HIVE}.my_tbl_p(
unique_id char(32),
some_name varchar(50),
some_group varchar(50),
in_time_start timestamp,
in_time_finish timestamp
)
PARTITIONED BY(p_year int, p_month int, p_day int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '${TARGET_PATH}my_tbl_p'
;

INSERT INTO TABLE ${DB_NAME_HIVE}.my_tbl_p PARTITION(p_year, p_month, p_day)
SELECT
unique_id,
some_name,
some_group,
in_time_start,
in_time_finish,
year(in_time_start) as p_year,
month(in_time_start) as p_month,
day(in_time_start) as p_day
FROM my_tbl
;
AM_Hawk
  • 661
  • 1
  • 15
  • 33
  • In the question, `p_table` is created and `p_tbl` is inserted, probably a typo. This solution is correct, but the reason the original failed is the left join.A `date_time` dimension table should have every possible (practical) `date_time_key` preloaded, which implies a safe inner join. The null partition values means a row in `my_tbl` matches no row in the `date_time` table. i.e. either a null or invalid `date_time_key` in the fact table, or the required row is missing in the `date_time` dimension table. But yes, the functions are dynamic, faster and safer than the join. – Davos Aug 09 '19 at 13:08