Questions tagged [hadoop-partitioning]

The hadoop-partitioning tag covers questions about how Hadoop decides which key/value pairs are sent to which reducer (partition).
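
As a rough illustration of the reducer-assignment rule described above, here is a minimal Python sketch of hash partitioning in the style of Hadoop's default HashPartitioner, which computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. The function names java_string_hash and hash_partition are hypothetical. The hash emulates Java's String.hashCode() for simple strings only; real Hadoop key types such as Text hash their bytes differently, so partition numbers on an actual cluster may differ.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode() as a signed 32-bit integer.

    Assumption: characters are in the Basic Multilingual Plane, so
    ord(ch) matches the UTF-16 code unit Java would use.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap like 32-bit overflow
    if h >= 0x80000000:                      # reinterpret as signed
        h -= 0x100000000
    return h


def hash_partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index the way a hash partitioner would.

    Masking with 0x7FFFFFFF clears the sign bit so the modulo result
    is never negative.
    """
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


if __name__ == "__main__":
    # Hypothetical keys, only to show that equal keys share a partition.
    for key in ["2016-09-12", "2016-09-13", "2016-09-12"]:
        print(key, "->", hash_partition(key, 4))
```

Because the partition depends only on the key's hash and the reducer count, every record with the same key lands on the same reducer, which is also why a few very frequent keys can overload a single reducer.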

339 questions
0 votes, 1 answer

Copying Hive managed table by copying partition directories into warehouse

I have an existing bucketed table that has YEAR, MONTH, DAY partitioning, but I want to add additional partitioning by INGESTION_KEY, a column that doesn't exist in the existing table. This is to accommodate future table inserts so that I don't have…
ktmq
0 votes, 1 answer

Avoiding unbalanced partitions in Spark

I have a performance problem with code I'm revising: it throws an OOM every time it performs a count. I think I found the problem: after a keyBy transformation, aggregateByKey is executed. The problem lies in the fact that almost…
Giorgio
0 votes, 1 answer

How to read multi-line records in Spark, where each log record starts with a yyyy-MM-dd date and spans multiple lines?

I have implemented the logic below in Scala so far: val hadoopConf = new Configuration(sc.hadoopConfiguration); //hadoopConf.set("textinputformat.record.delimiter", "2016-") hadoopConf.set("textinputformat.record.delimiter",…
0 votes, 1 answer

I can't ping a Windows Azure VM's VIP from my local machine

I have created a Windows Azure VM and installed Hadoop on it. Now I want to access HDFS by URL from my local machine so that I can perform read and write operations. Please guide me through the steps to do this. Thanks in advance.
0 votes, 3 answers

Hive: dynamic partitioning and inserting into specific columns

There is a Hive table with around 100 columns, partitioned by the columns ClientNumber and Date. I am trying to insert data from another Hive table into only 30 columns and to create the Date partitions dynamically. The issue is that all data gets…
VasiliK
0 votes, 0 answers

How can I use a custom Writable in the mapper? (Hadoop)

I am trying to write a MapReduce program for the following problem: determine the length of each tweet stored in a CSV file, count how many times each tweet length occurs, and compute their averages. The custom Writable (Pair) below was…
0 votes, 1 answer

What is the difference between Hadoop 2.7.3 and Hadoop 2.6.5?

I recently looked at the Hadoop releases and noticed that 2.6.5 and 2.7.3 are being developed in parallel. Could someone please explain the difference between them? 08 October, 2016: Release 2.6.5 available A point release…
0 votes, 1 answer

Hive select query failed on ORC table

Exception: Failed with exception java.io.IOException:java.io.IOException: Somehow read -1 bytes trying to skip 6257 more bytes to seek to position 6708, size: 1290047. Does anyone have any idea how to fix this on Cloud Dataproc?
Revan
0 votes, 3 answers

How to check partition data sets in an Oozie workflow?

How can I check whether a partition location exists in an Oozie workflow using a decision node? Example: /user/cloudera/year=2016/month=201609/day=20150912. In my HDFS location I will get one data set every day like…
Sai
0 votes, 1 answer

Hadoop partitioning. How do you efficiently design a Hive/Impala table?

How do you efficiently design a Hive/Impala table considering the following facts? The table receives tool data of about 100 million rows every day. The date on which it receives the data is stored in a column in the table along with its tool…
Outlander
0 votes, 0 answers

Elasticsearch monthly index on a nested field

How do I create a monthly index based on a field in a nested document? For example, for the document below I want to partition based on Joindate. My purging and query search logic is based on that. { "pkClmn": "100", "organizationName": "Microsoft", …
0 votes, 1 answer

Distributing Hadoop Streaming output files on the basis of keys

I have written a mapper function that parses the XML and outputs the result as columns separated by "\t", as shown below: Name Age; ABC 23; XYZ 24; ERT 25. Using the Hadoop Streaming code mentioned below, I am trying to partition the data on the…
0 votes, 1 answer

Hive on Hadoop: error when selecting data from a table

After I created an external table in Hive, I wanted to know the number of tweets, so I wrote the following query, but I got this error. How can I solve this problem? This is the configuration of mapred-site.xml
javac
0 votes, 1 answer

Aggregate queries fail in hive if partition directory doesn't exist

I am using Hive v1.2.1 with Tez. I have an external partitioned table. The partitions are hourly and of the form p=yyyy_mm_dd_hh. The situation is that these partition directories in HDFS are likely to be deleted at some point. After they are deleted,…
Ankit Khettry
0 votes, 1 answer

What are the advantages of increasing the partition size and decreasing the number of partitions in Spark?

I have 1 master and 3 slaves (4 cores each). By default the minimum partition size in my Spark cluster is 32 MB and my file size is 41 GB, so I am trying to reduce the number of partitions by changing the minsize to…
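
To make the arithmetic in this question concrete, here is a small Python sketch that estimates how many partitions a file yields at a given split size and how many scheduling waves that implies on 3 workers with 4 cores each. It assumes "41 Gb" means 41 GiB and that partitions map one-to-one onto fixed-size splits, which ignores block boundaries and other details of how Spark actually computes splits. The 128 MB comparison value is a hypothetical choice (the excerpt truncates the value being tried), and the function names are illustrative only.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def estimate_partitions(file_size_bytes: int, split_size_bytes: int) -> int:
    """Approximate partition count as ceil(file size / split size)."""
    if split_size_bytes <= 0:
        raise ValueError("split_size_bytes must be positive")
    return -(-file_size_bytes // split_size_bytes)  # integer ceiling division


def estimate_waves(num_partitions: int, total_cores: int) -> int:
    """Approximate task waves, assuming one task per core at a time."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    return -(-num_partitions // total_cores)


if __name__ == "__main__":
    total_cores = 3 * 4  # 3 workers with 4 cores each, from the question
    for split in (32 * MB, 128 * MB):  # 128 MB is a hypothetical alternative
        parts = estimate_partitions(41 * GB, split)
        print(split // MB, "MB ->", parts, "partitions,",
              estimate_waves(parts, total_cores), "waves")
```

Under these assumptions a 32 MB split gives 1312 partitions, and a larger split reduces both the task count and the per-task scheduling overhead, at the cost of more memory per task.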