Questions tagged [hadoop-partitioning]

Hadoop partitioning covers questions about how Hadoop decides which reducer (partition) each key/value pair is sent to.
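As a rough illustration of that decision: Hadoop's stock `HashPartitioner` computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The Python sketch below mimics that formula for plain string keys. The helper names `java_string_hash` and `partition_for` are made up for this example, and the hash emulates Java's `String.hashCode()`; real jobs often use `Text` keys, whose hash is computed over bytes and will give different numbers.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode() as a signed 32-bit int.

    Assumption: characters are in the Basic Multilingual Plane, so each
    Python character corresponds to one Java char.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    return h - 0x100000000 if h >= 0x80000000 else h


def partition_for(key: str, num_reduce_tasks: int) -> int:
    """Pick a reducer index the way the default hash partitioner does."""
    # Masking with Integer.MAX_VALUE clears the sign bit, so the modulo
    # result is never negative.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks


if __name__ == "__main__":
    for word in ["apple", "banana", "cherry"]:
        print(word, "->", partition_for(word, 4))
```

Every occurrence of the same key lands on the same reducer, which is what lets a reducer see all values for a key; a custom partitioner replaces only the function that maps a key to an index.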

339 questions
3
votes
5 answers

Who will get a chance to execute first, Combiner or Partitioner?

I'm confused after reading the following passage in Hadoop: The Definitive Guide, 4th edition (page 204): Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to.…
3
votes
1 answer

Which logic should a custom partitioner follow in MapReduce to solve this?

If the key distribution in a file is such that 99% of the words start with 'A' and 1% start with 'B' to 'Z', and you have to count the number of words starting with each letter, how would you distribute your keys efficiently?
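One possible approach to this kind of skew, sketched here in plain Python rather than as a real Hadoop `Partitioner`, is to partition on the whole word instead of its first letter, let each reducer produce partial per-letter counts, and sum the partials in a second step. This is only an illustration under assumptions, not the answer given in the thread: the function names are hypothetical, `zlib.crc32` stands in for whatever hash a job would use, and words are assumed to be non-empty.

```python
import zlib
from collections import Counter


def spread_partition(word: str, num_reducers: int) -> int:
    """Hypothetical partitioner: hash the whole word, not just its first letter,
    so words starting with the hot letter 'A' can land on different reducers."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def count_by_first_letter(words, num_reducers):
    """Simulate the two steps: per-reducer partial counts, then a final merge."""
    partials = [Counter() for _ in range(num_reducers)]
    for word in words:  # assumes every word is non-empty
        partials[spread_partition(word, num_reducers)][word[0].upper()] += 1

    total = Counter()
    for partial in partials:  # second step: sum the partial counts per letter
        total.update(partial)
    return partials, total


if __name__ == "__main__":
    sample = [f"A{i}" for i in range(99)] + ["Banana"]
    partials, total = count_by_first_letter(sample, 4)
    print([p["A"] for p in partials], dict(total))
```

Because counting is associative, a combiner on the map side would also shrink the data sent for the hot letter; the sketch only shows the partitioning side of the idea.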
3
votes
2 answers

FAILED: ParseException: cannot recognize input near 'exchange' 'string' ',' in column specification

I am using the latest AWS Hive version, 0.13.0. I get the following error when I run the CREATE TABLE query below: FAILED: ParseException: cannot recognize input near 'exchange' 'string' ',' in column specification. CREATE EXTERNAL TABLE test ( foo…
Brisi
3
votes
0 answers

Performance tuning of HIVE tables using index - works and issues?

I have an external Hive table abc with 3 columns: c1 string, c2 int, c3 string. I created a COMPACT index on column c1, with deferred rebuild, as part of the create index statement. Now, I do an alter index on abc with rebuild; so my index table…
3
votes
2 answers

Hadoop how to allocate to reducers to handle unbalanced load - CustomPartition

I have a MapReduce job that has to write to multiple outputs. I am using multipleOutputFormat as in this example: http://grepalex.com/2013/05/20/multipleoutputs-part1/ Here is the challenge: if my partitioner sends each reducer one key (assume…
sahara
2
votes
2 answers

How to write to Hive table with static partition using PySpark?

I've created a Hive table with a partition like this: CREATE TABLE IF NOT EXISTS my_table (uid INT, num INT) PARTITIONED BY (dt DATE) Then, with PySpark, I have a DataFrame and I've tried to write it to the Hive table like…
Michael
2
votes
1 answer

How to drop hive partitions with hivevar passed as partition variable?

I have been trying to run this piece of code to drop the current day's partition from a Hive table, and for some reason it does not drop the partition. Not sure what's wrong. Table Name :…
trougc
2
votes
1 answer

Is it relevant to partition by Business Date and Ingest Date for a FACT table on Delta Lake?

I am working on a data engineering case where I have a table, Table_Movie, partitioned by ingest date. From time to time, I receive some old data, and I need to perform operations based on business date. For example: Today, I received new data…
2
votes
1 answer

Querying based on Partition and non-partition column in Hive

I have an external Hive table as follows: CREATE external TABLE sales ( ItemNbr STRING, itemShippedQty INT, itemDeptNbr SMALLINT, gateOutUserId STRING, code VARCHAR(3), trackingId STRING, baseDivCode STRING ) PARTITIONED BY (countryCode STRING,…
Neer1009
2
votes
0 answers

Spark Job only processes file in a single spark container

I am reading a CSV file from GCS, and I need to go through each row and call an API to get some data back, which is appended to a new DataFrame. The code goes something like this: DataFrame df = sparkSession.read().option("header",…
2
votes
1 answer

How is MapReduce performed in this HiveQL query?

FROM ( FROM pv_users SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'python mapper.py' AS dt, uid CLUSTER BY dt) map_output INSERT OVERWRITE TABLE pv_users_reduced SELECT TRANSFORM map_output.dt, map_output.uid USING 'python…
Dhairya Verma
2
votes
1 answer

Kafka S3 Sink Connector - how to mark a partition as complete

I am using a Kafka sink connector to write data from Kafka to S3. The output data is partitioned into hourly buckets - year=yyyy/month=MM/day=dd/hour=hh. This data is used by a batch job downstream. So, before starting the downstream job, I need to be…
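For readers unfamiliar with the hourly bucket layout mentioned above, the sketch below shows how a timestamp maps to a `year=yyyy/month=MM/day=dd/hour=hh` prefix. It is plain Python with a hypothetical helper name, not connector configuration, and it does not address how a partition is marked complete.

```python
from datetime import datetime


def hourly_partition_path(ts: datetime) -> str:
    """Build the hourly bucket prefix for a record timestamp.

    Zero-padded month, day and hour match the yyyy/MM/dd/hh pattern.
    """
    return ts.strftime("year=%Y/month=%m/day=%d/hour=%H")


if __name__ == "__main__":
    print(hourly_partition_path(datetime(2024, 3, 5, 7, 30)))
```

A downstream batch job would typically read one such prefix per run, which is why it needs to know when no more records will arrive for that hour.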
2
votes
1 answer

Hive SQL force shuffle

I have a simple query: Select * from A left join b on A.b = b.b left join c on A.c = c.c left join d on A.d = d.d left join e on A.e = e.e ... ~20 tables. All tables b, c, d, e, etc. are small, and therefore all joins are broadcast joins. The problem is…
Saedry
2
votes
0 answers

How to merge partitions in HDFS?

Assuming I have a partitioned table in my HDFS, that gets new information all the time. New data will be partitioned by days by default, while all of the other files are partitioned by months. How can I merge partitions so by this example I would be…
user7551211
2
votes
1 answer

Does order of partitioning columns matter in Hive?

Let's say I have a partitioned table with multiple columns as partition keys, e.g. partitioned by (department string, year int, month int, day int). Does this specific order really matter? All the online resources refer to advantage of scanning only…
Dhiraj
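As background for the question above: Hive stores each partition in nested `column=value` directories, nested in the order the columns are declared in PARTITIONED BY. The Python sketch below builds such a path; the helper name and the `/warehouse/emp` table root are hypothetical, chosen only for illustration.

```python
def partition_dir(table_root: str, partition_spec) -> str:
    """Join a table root with column=value segments.

    partition_spec is a list of (column, value) pairs given in the same
    order as the table's PARTITIONED BY clause.
    """
    segments = [f"{column}={value}" for column, value in partition_spec]
    return "/".join([table_root.rstrip("/")] + segments)


if __name__ == "__main__":
    spec = [("department", "sales"), ("year", 2020), ("month", 1), ("day", 15)]
    print(partition_dir("/warehouse/emp", spec))
```

Declaring the columns in a different order changes only the nesting of these directories, for example `year=2020/department=sales/...` instead of `department=sales/year=2020/...`.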