Questions tagged [hadoop-partitioning]

Hadoop partitioning covers questions about how Hadoop decides which key/value pairs are sent to which reducer (partition).
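As a rough illustration of the tag description, here is a minimal sketch of modulo hash partitioning, in the spirit of Hadoop's default `HashPartitioner`, which computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The helper names `java_string_hash` and `get_partition` are hypothetical, and the Java-`String`-style hash is only a stand-in: real Hadoop key types such as `Text` hash their bytes differently.

```python
def java_string_hash(s: str) -> int:
    """Java-style String.hashCode for BMP text, as a signed 32-bit int.

    Illustrative stand-in only; Hadoop key types define their own hashCode.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap like a 32-bit int
    return h - 0x100000000 if h >= 0x80000000 else h


def get_partition(key: str, num_reduce_tasks: int) -> int:
    """Pick a reducer index in [0, num_reduce_tasks) for a key."""
    # Masking with 0x7FFFFFFF clears the sign bit so the modulo is non-negative.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks


if __name__ == "__main__":
    for key in ["apple", "banana", "cherry"]:
        print(key, "->", get_partition(key, 4))
```

The property that matters is that equal keys always land on the same reducer, whatever hash is used.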

339 questions
1
vote
1 answer

How to add one extra partition to an external Hive table?

I have a Hive table like the one below: create external table transaction( id int, name varchar(60), month string ) PARTITIONED BY ( year string, transaction_type_code varchar(20) ) STORED AS PARQUET LOCATION 'hdfs://xyz'; I am…
1
vote
1 answer

How to drop rows from a partitioned Hive table?

I need to drop specific rows from a partitioned Hive table. The rows to delete match certain conditions, so I cannot simply drop entire partitions. Let's say the table Table has three columns: partner, date and…
1
vote
1 answer

How do partitioning and CLUSTERED BY work in a Hive table?

I'm trying to understand how data will be laid out by the query below. CREATE TABLE mytable ( name string, city string, employee_id int ) PARTITIONED BY (year STRING, month STRING, day STRING) CLUSTERED BY…
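For the partitioning half of this question, a minimal sketch of the directory layout may help: Hive stores each partition in nested `column=value` directories under the table location, in the order the columns are listed in PARTITIONED BY. The function name `partition_path`, the `/warehouse/mytable` location, and the date values are hypothetical; the CLUSTERED BY clause is truncated in the excerpt, so bucketing is not modelled here.

```python
def partition_path(table_location: str, partition_spec: list[tuple[str, str]]) -> str:
    """Build the directory for one partition.

    partition_spec is an ordered list of (column, value) pairs, in the
    same order as the PARTITIONED BY clause.
    """
    parts = [f"{col}={val}" for col, val in partition_spec]
    return "/".join([table_location.rstrip("/"), *parts])


if __name__ == "__main__":
    # Hypothetical table location and partition values.
    spec = [("year", "2020"), ("month", "01"), ("day", "15")]
    print(partition_path("/warehouse/mytable", spec))
```

Rows whose `year`, `month` and `day` values match would all be written beneath that one directory.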
nut
1
vote
1 answer

Hive script failing with a heap space issue when processing too many partitions

My script is failing due to a heap space issue when processing too many partitions. To avoid the issue I am trying to insert all the partitions into a single partition, but I am facing the error below: FAILED: SemanticException [Error 10044]: Line 1:23 Cannot…
Never_Give_Up
1
vote
1 answer

Joining a partitioned, bucketed Hive table with a bucketed-only (not partitioned) table

I have 2 tables: q6_cms_list_key1 (bucketed by cm and se) partitioned by tr_dt ... 99 000 000 000 rows q6_cm_first_visit (bucketed by cm and se) 25 000 000 000 rows. I am making another table using the conditions below: insert into table…
1
vote
1 answer

Hive: why use PARTITION BY in SELECT statements?

I do not completely understand the partitioning concept in Hive. I understand what partitions are and how to create them. What I do not get is why people write select statements with a "partition by" clause, as is done here: SQL most…
MiamiBeach
1
vote
1 answer

Can I create buckets in a Hive External Table?

I am creating an external table that refers to ORC files in an HDFS location. The ORC files are stored in such a way that the external table is partitioned by date (date-wise folders on HDFS map to partitions). However, I am wondering if I…
1
vote
1 answer

How to insert a Hive partition column and value into a data (Parquet) file?

Request: How can I insert the partition key/value pair into each Parquet file while inserting data into a Hive/Impala table? Hive table DDL: [ create external table db.tbl_name ( col1 string, col2 string) Partitioned BY (date_col string) STORED AS…
Peace_Dude
1
vote
1 answer

Drop partitions in Hive with different date formats in the same partition column

I have 2 types of value in a partition column of string datatype: yyyyMMdd and yyyy-MM-dd. E.g. there are partition column values 20200301, 2020-03-05, 2020-05-07, 20200701, etc. I need to drop partitions less than 20200501 with a DDL statement…
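One possible approach, sketched here rather than taken from any answer, is to normalise both formats to yyyyMMdd outside Hive, pick the values below the cutoff, and generate one `ALTER TABLE ... DROP IF EXISTS PARTITION` statement per value. The table name `my_table`, the column name `dt`, and both function names are hypothetical; the partition values and the cutoff come from the question.

```python
def partitions_to_drop(values: list[str], cutoff: str) -> list[str]:
    """Return partition values whose date, normalised to yyyyMMdd, is below cutoff.

    Both formats are fixed-width once dashes are removed, so plain string
    comparison orders them chronologically.
    """
    return [v for v in values if v.replace("-", "") < cutoff]


def drop_statements(table: str, column: str, values: list[str], cutoff: str) -> list[str]:
    """Build one DROP PARTITION statement per matching value, keeping its original spelling."""
    return [
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({column}='{v}');"
        for v in partitions_to_drop(values, cutoff)
    ]


if __name__ == "__main__":
    values = ["20200301", "2020-03-05", "2020-05-07", "20200701"]
    # my_table and dt are placeholder names, not from the question.
    for stmt in drop_statements("my_table", "dt", values, "20200501"):
        print(stmt)
```

Each statement names the partition by its original string, because Hive matches the stored value, not the normalised one.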
1
vote
1 answer

HDFS: Exact meaning of dfs.block.size

In our cluster dfs.block.size is configured as 128M, but I have seen quite a few files that are 68.8M, which is a weird size. I am confused about how exactly this configuration option affects what files look like on HDFS. First…
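A small arithmetic sketch may clarify the relationship: the block size is an upper bound per block, a file is split into full blocks plus one final partial block, and the final block only occupies the bytes it actually holds. The function name `hdfs_block_sizes` is hypothetical, and the sketch does not explain why many files in that cluster happen to be 68.8M.

```python
MB = 1024 * 1024


def hdfs_block_sizes(file_size: int, block_size: int) -> list[int]:
    """Split a file of file_size bytes into block lengths of at most block_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block is not padded to block_size
    return sizes


if __name__ == "__main__":
    small_file = int(68.8 * MB)
    print(len(hdfs_block_sizes(small_file, 128 * MB)))                 # number of blocks
    print([s // MB for s in hdfs_block_sizes(300 * MB, 128 * MB)])     # block sizes in MB
```

Under this model a 68.8M file with a 128M block size fits in a single block that holds only 68.8M of data.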
Boyu Zhang
1
vote
1 answer

How to call Partitioner in Hadoop v 0.21

In my application I want to create as many reducer jobs as possible based on the keys. Now my current implementation writes all the keys and values in a single (reducer) output file. So to solve this, I have used one partitioner but I cannot call…
Kal
1
vote
1 answer

Reducer Selection in Hive

I have the following record set to process: 1000, 1001, 1002 to 1999, 2000, 2001, 2002 to 2999, 3000, 3001, 3002 to 3999. I want to process this record set using Hive in such a way that reducer-1 will process data 1000 to 1999…
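The mapping being asked for is a range partition. A minimal sketch of that arithmetic follows, assuming the truncated part of the question continues the pattern (reducer-2 for 2000 to 2999, reducer-3 for 3000 to 3999); that continuation, the function name `reducer_for`, and its parameters are assumptions. It only shows the key-to-reducer mapping, not how to make Hive apply it.

```python
def reducer_for(value: int, low: int = 1000, width: int = 1000, num_reducers: int = 3) -> int:
    """Map a record value to a 1-based reducer number by fixed-width ranges.

    With the defaults: 1000-1999 -> 1, 2000-2999 -> 2, 3000-3999 -> 3.
    Only the first range is stated in the question; the rest is assumed.
    """
    if not low <= value < low + width * num_reducers:
        raise ValueError(f"value {value} is outside the configured ranges")
    return (value - low) // width + 1


if __name__ == "__main__":
    for v in (1000, 1999, 2000, 3999):
        print(v, "-> reducer", reducer_for(v))
```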
Suvo
1
vote
1 answer

Hive date-partitioned table: streaming data in S3 with mixed dates

I have extensive experience working with Hive partitioned tables. I use Hive 2.X. I was interviewing for a Big Data Solution Architect role and was asked the question below. Question: How would you ingest streaming data into a Hive table…
1
vote
0 answers

How to accelerate a Spark GROUP BY query on a large Hive table?

I have an input table intab: create table intab ( ds string comment 'date partition filed' , id1 string comment 'id1' , id2 string comment 'id2' , n int comment 'n' ) comment 'test' partition by list(ds)(partition default); I need to…
Changwang Zhang
1
vote
0 answers

Spark save job is taking a long time

I am trying to save a DataFrame to an HDFS location, but the save is taking a long time. The action before this joins two tables using Spark SQL. I need to know why the save has four stages and how to improve the performance. I have attached…