Hadoop partitioning deals with how Hadoop decides which key/value pairs are sent to which reducer (partition).
Questions tagged [hadoop-partitioning]
339 questions
2 votes · 2 answers
How to prevent bucket creation if it does not exist in Spark on EMR
I'm running a Spark step on an EMR cluster. It gathers all the small files and accumulates them into one big file.
I receive a list of buckets to process, but before processing a bucket I want to check whether it exists and whether it contains any files. For that…

Yehor · 67 · 6
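A minimal sketch of one way to do that check from Spark, via the Hadoop FileSystem API (the helper name and bucket path are hypothetical; this assumes the cluster's S3 filesystem connector is already configured):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    // Hypothetical helper: true only if the path exists and holds at least one non-empty file.
    def bucketHasFiles(spark: SparkSession, bucketPath: String): Boolean = {
      val path = new Path(bucketPath)
      val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
      fs.exists(path) && fs.listStatus(path).exists(_.getLen > 0)
    }

Short-circuiting on exists() avoids the FileNotFoundException that listStatus would otherwise throw for a missing path.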
2 votes · 1 answer
Passing multiple dates as parameters to a Hive query
I am trying to pass a list of dates as a parameter to my Hive query.
#!/bin/bash
echo "Executing the hive query - Get distinct dates"
var=`hive -S -e "select distinct substr(Transaction_date,0,10) from test_dev_db.TransactionUpdateTable;"`
echo…

vikrant rana · 4,509 · 6 · 32 · 72
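One common approach is to build an IN list from the dates and splice it into the query text. A hedged sketch of the same idea in Scala, assuming a Hive-enabled SparkSession named spark (the date values are made up; the table name is from the question):

    val dates = Seq("2019-01-01", "2019-01-02", "2019-01-03")
    val inList = dates.map(d => s"'$d'").mkString(", ")
    val df = spark.sql(
      s"SELECT DISTINCT substr(Transaction_date, 0, 10) AS txn_date " +
      s"FROM test_dev_db.TransactionUpdateTable " +
      s"WHERE substr(Transaction_date, 0, 10) IN ($inList)")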
2 votes · 0 answers
PySpark: Partitioning and hashing multiple dataframes, then joining
Background: I am working with clinical data spread across a lot of different .csv/.txt files. All of these files are patientID-based, but with different fields. I am importing these files into DataFrames, which I will join at a later stage after first…

cph_sto · 7,189 · 12 · 42 · 78
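For joins like this, repartitioning every DataFrame on the shared key with the same partition count makes the later joins co-partitioned. A sketch under stated assumptions (labsDf, medsDf, and the partition count are invented; only patientID comes from the question):

    import org.apache.spark.sql.functions.col

    val n = 200  // pick one partition count and reuse it for every DataFrame
    val labs = labsDf.repartition(n, col("patientID"))
    val meds = medsDf.repartition(n, col("patientID"))
    val joined = labs.join(meds, Seq("patientID"))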
2 votes · 0 answers
Spark Clustered By/Bucket by dataset not using memory
I recently came across Spark bucketBy/clusteredBy here.
I tried to mimic this for a 1.1 TB source file from S3 (already in Parquet). The plan is to completely avoid shuffle, as most of the datasets are always joined on the "id" column. Here is what I am…

androboy · 817 · 1 · 12 · 24
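For reference, the writer-side wiring looks roughly like this; a shuffle-free join needs both sides saved as tables, bucketed on the join column with the same bucket count (table names and the bucket count are illustrative):

    df.write
      .bucketBy(512, "id")
      .sortBy("id")  // sorted buckets also let sort-merge joins skip the sort step
      .saveAsTable("events_bucketed")

    // Two tables bucketed identically on "id" can join without an exchange:
    val joined = spark.table("events_bucketed")
      .join(spark.table("users_bucketed"), "id")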
2 votes · 1 answer
Maximum number of partitions in hive
I have 1500 partitions in my Hive table, but queries are taking more time than expected.
What is the maximum number of partitions that can be created in a Hive table?
user6546116
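Hive has no meaningful hard cap on partitions per table; what usually bites first are the dynamic-partition limits, which are just configuration. A sketch of raising them from a Hive-enabled Spark session (the values are arbitrary examples):

    spark.sql("SET hive.exec.max.dynamic.partitions = 10000")
    spark.sql("SET hive.exec.max.dynamic.partitions.pernode = 1000")

With only 1500 partitions, slow queries are more likely due to metastore lookups and small files than to the partition count itself.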
2 votes · 1 answer
Hive insert overwrite truncates the table in a few cases
I was working on a solution and found that in some particular cases, Hive INSERT OVERWRITE truncates the table, but in a few cases it doesn't. Would someone please explain why it behaves like that?
To explain this, I am taking two tables,…

Gaurang Shah · 11,764 · 9 · 74 · 137
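The distinction usually comes down to partitioning: INSERT OVERWRITE on a non-partitioned table replaces all of its data, while on a partitioned table it replaces only the partitions being written. A minimal sketch, assuming a Hive-enabled SparkSession and made-up table names:

    spark.sql("CREATE TABLE IF NOT EXISTS t_part (v STRING) PARTITIONED BY (dt STRING)")
    spark.sql("INSERT OVERWRITE TABLE t_part PARTITION (dt = '2019-01-01') VALUES ('a')")
    // Only the dt='2019-01-01' partition is rewritten; any other partition is left intact.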
2 votes · 1 answer
Delete/Update partition with sparklyr
I'm using the spark_write_table function from sparklyr to write tables into HDFS, using the partition_by parameter to define how to store them:
R> my_table %>%
  spark_write_table(.,
    path = "mytable",
    mode = "append",
    …

dalloliogm · 8,718 · 6 · 45 · 55
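Spark 2.3+ can overwrite just the partitions present in the incoming data (dynamic partition overwrite). Sketched below is the Scala equivalent of what such a write would need; the updates DataFrame is an assumption, the table name is from the question:

    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    updates.write
      .mode("overwrite")
      .insertInto("mytable")  // rewrites only the partitions that `updates` touches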
2 votes · 0 answers
What is the correct way to know where a file is located on HDFS cluster?
I need to develop my own job executor (it is not homework), leveraging datanode locality.
I have a Hadoop 2.7.1 cluster with 2 datanodes.
(see http://jugsi.blogspot.it/2017/08/configuring-hadoop-271-on-windows-w-ssh.html)
My code:
public static…

venergiac · 7,469 · 2 · 48 · 70
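The HDFS client API exposes block placement directly. A minimal sketch using FileSystem.getFileBlockLocations (the path is a placeholder; this assumes the cluster configuration is on the classpath):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val status = fs.getFileStatus(new Path("/data/input.txt"))
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    blocks.foreach { b =>
      // Each BlockLocation lists the datanodes holding a replica of that block.
      println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
    }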
2 votes · 1 answer
How to deal with hive partitioning for performance versus over-partitioning
We have a very large Hadoop dataset with more than a decade of historical transaction data: 6.5B rows and counting. It is partitioned on year and month.
Performance is poor for a number of reasons. Nearly all of our queries can be further…

Tom Harrison · 13,533 · 3 · 49 · 77
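A common compromise is to keep the coarse year/month partitions and bucket on the high-cardinality filter column instead of adding more partition levels. A sketch, where transactions and account_id are assumptions about the data:

    transactions.write
      .partitionBy("year", "month")
      .bucketBy(64, "account_id")  // buckets narrow scans without exploding the partition count
      .sortBy("account_id")
      .saveAsTable("transactions_bucketed")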
2 votes · 0 answers
Spark repartition operation duplicated
I've got an RDD in Spark which I've cached. Before caching it, I repartition it. This works, and I can see in the Storage tab of the Spark UI that it has the expected number of partitions.
This is what the stages look like on subsequent runs:
It's…

yarrichar · 423 · 5 · 17
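One pattern that keeps the repartition from re-running in later jobs is to cache the repartitioned RDD and materialize it once with an action before anything else uses it; a minimal sketch:

    val repartitioned = rdd.repartition(100).cache()
    repartitioned.count()  // forces evaluation so later jobs read the cached partitions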
2 votes · 1 answer
AggregateByKey Partitioning?
I have:
A_RDD = anRDD.map()
B_RDD = A_RDD.aggregateByKey()
My question is: if I put partitionBy(new HashPartitioner) after A_RDD, like:
A_RDD = anRDD.map().partitionBy(new HashPartitioner(2))
B_RDD = A_RDD.aggregateByKey()
1) Will this…

Spar · 463 · 1 · 5 · 23
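For what the question describes: if the pair RDD is pre-partitioned and cached, aggregateByKey reuses the existing partitioner, so no second shuffle occurs. Sketched with made-up data (assumes a SparkContext named sc):

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val partitioned = pairs.partitionBy(new HashPartitioner(2)).cache()

    // Same partitioner as the input, so this aggregation runs without another shuffle.
    val sums = partitioned.aggregateByKey(0)(_ + _, _ + _)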
2 votes · 2 answers
Basics of Hadoop and MapReduce functioning
I have just started to learn Hadoop and MapReduce concepts, and have a few questions that I would like to clear up before moving forward:
From what I understand:
Hadoop is specifically used when there is a huge amount of data involved.…

theimpatientcoder · 1,184 · 3 · 19 · 32
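The routing question behind this tag (which reducer a given key lands on) is easiest to see in a word count: each word is hashed, and all occurrences of it are shuffled to the same partition. A sketch in Spark terms rather than the raw MapReduce API (assumes a SparkContext named sc and a placeholder input path):

    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)  // hash-partitions by key, so each word's counts meet on one partition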
2 votes · 1 answer
TotalOrderPartitioner and Partition file
I am learning Hadoop MapReduce, and I am working with the Java API. I learned about the TotalOrderPartitioner, used to 'globally' sort the output by key across the cluster, and that it needs a partition file (generated using…

Ankit Khettry · 997 · 1 · 13 · 33
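The usual wiring, sketched from the mapreduce-API classes (paths, key types, and sampling rates are placeholders, and the job's input/output formats are assumed to be configured elsewhere): InputSampler samples keys to write the cut points, and TotalOrderPartitioner reads them back to route keys to reducers in sorted ranges.

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.partition.{InputSampler, TotalOrderPartitioner}

    val job = Job.getInstance()
    job.setNumReduceTasks(4)
    job.setPartitionerClass(classOf[TotalOrderPartitioner[Text, Text]])
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration, new Path("_partitions"))

    // Sample ~10% of input keys (capped at 10000) to pick numReduceTasks - 1 cut points.
    InputSampler.writePartitionFile(job, new InputSampler.RandomSampler[Text, Text](0.1, 10000))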
2 votes · 3 answers
How to create a key/value pair in a MapReduce program if values are stored across block boundaries?
The input file that I need to process has data classified by headers and their respective records. My 200 MB file has 3 such headers, with their records split across 4 blocks (3×64 MB and 1×8 MB).
The data would be in below format
HEADER 1
Record…

Vamsinag R · 51 · 4
2 votes · 1 answer
Can't put file from local directory to HDFS
I have created a file named "file.txt" in the local directory; now I want to put it into HDFS using:
]$ hadoop fs -put file.txt abcd
I am getting a response like:
put: 'abcd': no such file or directory
I have never worked on Linux. Please…

lamiaheart · 29 · 1 · 3
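This error usually means the user's HDFS home directory (/user/<name>), against which the relative destination is resolved, does not exist yet, so creating it first fixes the put. The same steps through the FileSystem API, as a sketch with placeholder paths:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    fs.mkdirs(new Path("/user/myname"))  // placeholder user; relative paths resolve here
    fs.copyFromLocalFile(new Path("file.txt"), new Path("/user/myname/abcd"))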