Hadoop partitioning covers questions about how Hadoop decides which key/value pairs are sent to which reducer (partition).
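As a rough illustration of the idea in the tag description, here is a minimal sketch of hash-based partitioning: the key's hash is masked to a non-negative value and taken modulo the number of reducers. The function names `java_string_hash` and `hash_partition` are hypothetical. The sketch assumes the key hashes like a Java `String` and that every character is in the Basic Multilingual Plane; Hadoop's own key types such as `Text` compute their hash differently, so real partition numbers may not match.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode() as a signed 32-bit integer.

    Assumption: every character is in the Basic Multilingual Plane, so
    ord(ch) equals the UTF-16 code unit that Java would hash.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap to 32 bits, as Java int arithmetic does
    # Reinterpret the unsigned 32-bit value as a signed int
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index in [0, num_reducers) for the given key."""
    # Clear the sign bit so the modulo result is never negative
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


if __name__ == "__main__":
    for key in ["ab", "hadoop", "partition"]:
        print(key, "->", hash_partition(key, 4))
```

Because the result depends only on the key and the reducer count, all pairs with the same key land on the same reducer, which is also why a few very frequent keys can overload a single partition.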
Questions tagged [hadoop-partitioning]
339 questions
0
votes
1 answer
Copying Hive managed table by copying partition directories into warehouse
I have an existing bucketed table that has YEAR, MONTH, DAY partitioning, but I want to add additional partitioning by INGESTION_KEY, a column that doesn't exist in the existing table. This is to accommodate future table inserts so that I don't have…

ktmq
0
votes
1 answer
Avoid unbalanced partitions in Spark
I have a performance problem with code I'm revising: it throws an OOM every time it performs a count.
I think I found the problem: after the keyBy transformation, an aggregateByKey is executed.
The problem lies in the fact that almost…

Giorgio
0
votes
1 answer
How to read multi-line elements in Spark, where each log record starts with a yyyy-MM-dd date and spans multiple lines?
I have implemented the logic below in Scala so far:
val hadoopConf = new Configuration(sc.hadoopConfiguration);
//hadoopConf.set("textinputformat.record.delimiter", "2016-")
hadoopConf.set("textinputformat.record.delimiter",…

Ashish Tyagi
0
votes
1 answer
I can't ping a Windows Azure VM's VIP from my local machine
I have created a Windows Azure VM and installed Hadoop on it. Now I want to access HDFS by URL from my local machine so that I can perform read and write operations. Please guide me through the steps to do this. Thanks in advance.

sourabh pandey
0
votes
3 answers
Hive: dynamic partitioning and insert into specific columns
There is a Hive table with around 100 columns, partitioned by the columns ClientNumber and Date.
I am trying to insert data from another Hive table into only 30 columns and create the Date partitions dynamically.
The issue is that all data gets…

VasiliK
0
votes
0 answers
How can I use a custom Writable in the mapper? Hadoop
I am trying to write a MapReduce program for the following problem.
Problem:
Determine the length of each tweet stored in a CSV file
Count how many times a particular tweet length occurs
Compute their averages
The custom Writable (Pair) below was…

elyon
0
votes
1 answer
What is the difference between Hadoop 2.7.3 and Hadoop 2.6.5?
I recently looked at the Hadoop releases and noticed that both 2.6.5 and 2.7.3 are being developed in parallel. Could someone please explain the difference between them?
08 October, 2016: Release 2.6.5 available
A point release…

Devendra Bhat
0
votes
1 answer
Hive select query failed on ORC table
Exception:
Failed with exception java.io.IOException:java.io.IOException: Somehow
read -1 bytes trying to skip 6257 more bytes to seek to position
6708, size: 1290047
Does anyone have any idea how to fix this on Cloud Dataproc?

Revan
0
votes
3 answers
How to check partition data sets in an Oozie workflow?
How can I check whether a partition location exists in an Oozie workflow using a decision node?
Example: /user/cloudera/year=2016/month=201609/day=20150912
In my HDFS location I will get one data set every day like…

Sai
0
votes
1 answer
Hadoop partitioning. How do you efficiently design a Hive/Impala table?
How do you efficiently design a Hive/Impala table considering the following facts?
The table receives tool data of about 100 million rows every day. The date on which it receives the data is stored in a column in the table along with its tool…

Outlander
0
votes
0 answers
Elasticsearch monthly index on nested field
How can I create a monthly index based on a field in a nested document? For example, for the document below I want to partition based on Joindate. My purging and query search logic is based on that.
{
"pkClmn": "100",
"organizationName": "Microsoft",
…

user2526641
0
votes
1 answer
Distributing Hadoop Streaming output files on the basis of keys
I have written a mapper function that parses the XML and outputs the result as columns separated by "\t", as shown below
Name Age
ABC 23
XYZ 24
ERT 25
Using the Hadoop Streaming code shown below, I am trying to partition the data on the…

Rohit Guglani
0
votes
1 answer
Hive/Hadoop: error when selecting data from a table
After I created an external table in Hive, I wanted to know the number of tweets, so I wrote the following query, but I got this error. How can I solve this problem? This is the configuration of mapred-site.xml:
…

javac
0
votes
1 answer
Aggregate queries fail in hive if partition directory doesn't exist
I am using Hive v1.2.1 with Tez. I have an external partitioned table. The partitions are hourly and of the form p=yyyy_mm_dd_hh. These partition directories in HDFS are likely to be deleted at some point. After they are deleted,…

Ankit Khettry
0
votes
1 answer
What are the advantages of increasing the partition size and decreasing the number of partitions in Spark?
I have 1 master and 3 slaves (4 cores each).
By default the minimum partition size in my Spark cluster is 32 MB and my file size is 41 GB.
So I am trying to reduce the number of partitions by changing the minsize to…

Pavan Kumar Aryasomayajulu