Highest Voted 'hadoop-partitioning' Questions

3

votes

5 answers

Who will get a chance to execute first, Combiner or Partitioner?

I'm confused after reading the passage below from Hadoop: The Definitive Guide, 4th edition (page 204): Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.…

asked Aug 20 '15 at 06:26

Prashant

132
1
13

3

votes

1 answer

Which logic should a custom partitioner in MapReduce follow to solve this?

Suppose the key distribution in a file is skewed: 99% of the words start with 'A' and 1% start with 'B' to 'Z'. If you have to count the number of words starting with each letter, how would you distribute your keys efficiently?

java hadoop mapreduce load-balancing hadoop-partitioning

asked May 13 '15 at 12:07

Anjul Tiwari

55
7
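One common way to handle the skew described in the question above is key salting: the mapper appends a small random salt to each key so that a hot letter is spread over several reducers, and a second pass drops the salt and merges the partial counts. The sketch below simulates that idea in plain Python rather than with the Hadoop API. It is not taken from any answer on this page, and the names `NUM_SALTS`, `map_phase`, `partition`, and `count_by_letter` are hypothetical.

```python
import random
from collections import Counter

NUM_SALTS = 4  # assumption: number of sub-keys each letter is spread over


def map_phase(words, rng):
    """Emit ((letter, salt), 1) so a hot letter is split across NUM_SALTS keys."""
    for word in words:
        if word:
            yield (word[0].upper(), rng.randrange(NUM_SALTS)), 1


def partition(salted_key, num_reducers):
    """Choose a reducer from the letter and the salt, not the letter alone."""
    letter, salt = salted_key
    return (ord(letter) + salt) % num_reducers


def count_by_letter(words, num_reducers=4, seed=0):
    """Simulate the two stages and return (per-letter totals, per-reducer load)."""
    rng = random.Random(seed)
    # Stage 1: each simulated reducer keeps partial counts for its salted keys.
    reducers = [Counter() for _ in range(num_reducers)]
    for key, one in map_phase(words, rng):
        reducers[partition(key, num_reducers)][key] += one
    # Stage 2: drop the salt and merge the partial counts per letter.
    totals = Counter()
    for partial in reducers:
        for (letter, _salt), n in partial.items():
            totals[letter] += n
    load = [sum(partial.values()) for partial in reducers]
    return totals, load
```

In a real job the second stage would be another reduce step (or a second MapReduce job), and a combiner that pre-aggregates counts on the map side would reduce the shuffled data further. Whether salting is worth the extra stage depends on the data volume.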

3

votes

2 answers

FAILED: ParseException: cannot recognize input near 'exchange' 'string' ',' in column specification

I am using the latest AWS Hive version, 0.13.0. FAILED: ParseException: cannot recognize input near 'exchange' 'string' ',' in column specification I am getting the above error when I run the CREATE TABLE query below. CREATE EXTERNAL TABLE test ( foo…

hadoop amazon-web-services hive amazon-emr hadoop-partitioning

asked Jan 12 '15 at 09:43

Brisi

1,781
7
26
41

3

votes

0 answers

Performance tuning of HIVE tables using index - works and issues?

I have an external Hive table abc with 3 columns: c1 string, c2 int, c3 string. I created a COMPACT index on column c1 using a CREATE INDEX statement with deferred rebuild. Now I run an ALTER INDEX on abc with rebuild, so my index table…

hadoop hive hiveql hadoop-partitioning

asked Jun 13 '14 at 21:04

user3739108

31
2

3

votes

2 answers

Hadoop how to allocate to reducers to handle unbalanced load - CustomPartition

I have a MapReduce job that has to write to multiple outputs. I am using multipleOutputFormat as in this example: http://grepalex.com/2013/05/20/multipleoutputs-part1/ Here is the challenge: If my partitioner sends each reducer one key (assume…

hadoop mapreduce reduce hadoop-partitioning

asked Jan 30 '14 at 15:12

sahara

143
1
8

2

votes

2 answers

How to write to Hive table with static partition using PySpark?

I've created a Hive table with a partition like this: CREATE TABLE IF NOT EXISTS my_table (uid INT, num INT) PARTITIONED BY (dt DATE) Then, with PySpark, I have a dataframe and I've tried to write it to the Hive table like…

apache-spark pyspark hive hadoop-partitioning

asked Apr 29 '22 at 17:51

Michael

791
2
12
32

2

votes

1 answer

How to drop hive partitions with hivevar passed as partition variable?

I have been trying to run this piece of code to drop the current day's partition from a Hive table, and for some reason it does not drop the partition from the table. Not sure what's wrong. Table Name :…

hadoop hive hql hiveql hadoop-partitioning

asked Apr 25 '22 at 23:17

trougc

329
3
14

2

votes

1 answer

Is it relevant to partition by Business Date and Ingest Date for a FACT table on Delta Lake?

I am working on a data engineering case where I have a table Table_Movie partitioned by ingest date. Now, from time to time, I receive some old data, and I need to perform operations based on business date. For example: Today, I received new data…

apache-spark databricks azure-databricks delta-lake hadoop-partitioning

asked Aug 12 '21 at 12:50

OrganicMustard

1,158
1
15
36

2

votes

1 answer

Querying based on Partition and non-partition column in Hive

I have an external Hive table as follows :- CREATE external TABLE sales ( ItemNbr STRING, itemShippedQty INT, itemDeptNbr SMALLINT, gateOutUserId STRING, code VARCHAR(3), trackingId STRING, baseDivCode STRING ) PARTITIONED BY (countryCode STRING,…

hive parquet hadoop-partitioning hive-partitions

asked Jul 24 '21 at 18:06

Neer1009

304
1
5
18

2

votes

0 answers

Spark Job only processes file in a single spark container

I am reading in a CSV file from GCS, and I need to go through each row and call an API to get some data back and append it to a new dataframe. The code goes something like this: DataFrame df = sparkSession.read().option("header",…

java apache-spark hadoop2 google-cloud-dataproc hadoop-partitioning

asked Jan 13 '21 at 06:29

Darshan Kothari

63
2

2

votes

1 answer

How is MapReduce performed in this HiveQL query?

FROM ( FROM pv_users SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'python mapper.py' AS dt, uid CLUSTER BY dt) map_output INSERT OVERWRITE TABLE pv_users_reduced SELECT TRANSFORM map_output.dt, map_output.uid USING 'python…

hadoop hive mapreduce hiveql hadoop-partitioning

asked Nov 04 '20 at 05:14

Dhairya Verma

661
1
9
15

2

votes

1 answer

Kafka S3 Sink Connector - how to mark a partition as complete

I am using the Kafka sink connector to write data from Kafka to S3. The output data is partitioned into hourly buckets: year=yyyy/month=MM/day=dd/hour=hh. This data is used by a batch job downstream. So, before starting the downstream job, I need to be…

apache-spark apache-kafka batch-processing hadoop-partitioning system-design

asked Oct 20 '20 at 06:59

nish

6,952
18
74
128

2

votes

1 answer

Hive SQL force shuffle

I have a simple query: Select * from A left join b on A.b = b.b left join c on A.c = c.c left join d on A.d = d.d left join e on A.e = e.e ... ~20 tables. All tables b, c, d, e, etc. are small, and therefore all joins are broadcast joins. The problem is…

optimization hive hiveql hadoop-partitioning hint

asked Jul 01 '20 at 22:18

Saedry

23
6

2

votes

0 answers

How to merge partitions in HDFS?

Assume I have a partitioned table in HDFS that gets new information all the time. New data will be partitioned by days by default, while all of the other files are partitioned by months. How can I merge partitions so that, in this example, I would be…

apache-spark hadoop hdfs hadoop2 hadoop-partitioning

asked Feb 18 '20 at 16:53

user7551211

649
1
6
25

2

votes

1 answer

Does order of partitioning columns matter in Hive?

Let's say I have a partitioned table with multiple columns as partition keys, e.g. partitioned by (department string, year int, month int, day int). Does this specific order really matter? All the online resources refer to the advantage of scanning only…

hadoop hive azure-hdinsight hadoop-partitioning

asked Oct 07 '19 at 10:51

Dhiraj

3,396
4
41
80
