Hadoop partitioning covers questions about how Hadoop decides which key/value pairs are sent to which reducer (partition).
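As a rough illustration of the decision described above, the sketch below mimics the rule used by Hadoop's default HashPartitioner: take the key's hash code, mask off the sign bit, and take the remainder modulo the number of reducers. It is written in Python for brevity and assumes keys behave like Java Strings with characters in the Basic Multilingual Plane. Real Hadoop key types such as Text compute their hash differently, and the function names here are hypothetical.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode() as a signed 32-bit integer.

    Assumption: every character is in the Basic Multilingual Plane, so
    ord(ch) matches the UTF-16 code unit that Java would hash.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap like 32-bit int arithmetic
    # Reinterpret the unsigned value as a signed 32-bit integer.
    return h - (1 << 32) if h >= (1 << 31) else h


def hash_partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index the way a hash partitioner would.

    Masking with 0x7FFFFFFF clears the sign bit, so the result of the
    modulo is always in the range [0, num_reducers).
    """
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


if __name__ == "__main__":
    # Every occurrence of the same key lands on the same reducer.
    for key in ["customer-1", "customer-2", "customer-1"]:
        print(key, "->", hash_partition(key, 4))
```

Because the partition depends only on the key and the reducer count, all values for a given key meet at one reducer. A heavily skewed key distribution will therefore produce unevenly loaded reducers.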
Questions tagged [hadoop-partitioning]
339 questions
3 votes, 0 answers
Is repartitioning necessary before saving a DataFrame in Parquet format?
I have multiple DataFrames (DFs) storing monthly customer data for the last 5 years. Some DFs store revenue information, others store complaints data, and so on. All these DataFrames are keyed by customer ID and month, as you can see in the…

cph_sto
3 votes, 1 answer
How do the Mapper and Reducer work together "without" sorting?
I know how MapReduce works and which steps it has:
Mapping
Shuffle and sorting
Reducing
Of course there are also partitioning and combiners, but that's not important right now.
The interesting thing is that when I run MapReduce jobs, it looks like mappers and…

grep
3 votes, 2 answers
Spark RDD: partitioning according to text file format
I have a text file containing tens of GBs of data, which I need to load from HDFS and parallelize as an RDD. This text file describes items with the following format. Note that the alphabetic strings are not present (the meaning of each row is…

cppstudy
3 votes, 0 answers
Where does Spark schedule the .textFile() task?
Say I want to read data from an external HDFS cluster, and I have 3 workers in my cluster (one may be a bit closer to external_host, but not on the same host).
sc.textFile("hdfs://external_host/file.txt")
I understand that Spark schedules tasks…

Joe
3 votes, 1 answer
Spark spends a long time on HadoopRDD: Input split
I'm running logistic regression with SGD on a large libsvm file. The file is about 10 GB in size with 40 million training examples.
When I run my Scala code with spark-submit, I notice that Spark spends a lot of time logging this:
18/02/07 04:44:50…

andrewmzhang
3 votes, 1 answer
Skew vs Partition in Hive
After reading about skewed tables in Hive, I am confused about how data is stored for skewed tables compared with how it is handled for partitioned tables. Can someone clearly state the differences, with concrete examples, as to where these two…

NeoWelkin
3 votes, 1 answer
Spark mapPartitionsWithIndex : Identify a partition
Identify a partition :
mapPartitionsWithIndex(index, iter)
The method applies a function to each partition. I understand that we can track the partition using the "index" parameter.
Numerous examples have used this method to remove the…

Kanav Sharma
3 votes, 1 answer
Spark: repartitioning data for a small file
I am pretty new to Spark and I am using a cluster mainly for parallelization. I have a 100 MB file, each line of which is processed by an algorithm that is computationally heavy and slow.
I want to use a 10-node cluster and parallelize…

epsilones
3 votes, 1 answer
Hive: GC Overhead or Heap space error - dynamic partitioned table
Could you please help me resolve this GC overhead and heap space error?
I am trying to insert into a partitioned table from another table (dynamic partitioning) using the query below:
INSERT OVERWRITE table tbl_part PARTITION(county)
SELECT col1,…

Aavik
3 votes, 3 answers
Unable to alter partition location in Hive
I am trying to change the partition location of my external Hive table.
The command I am trying to run:
ALTER TALBE sl_uploads PARTITION (hivetimestamp='2016-07-26 15:00:00') SET LOCATION '/data/dev/event/uploads/hivetimestamp=2016-07-26 15:00:00'
Error…

Austin
3 votes, 1 answer
Partition Location of RDD/Dataframe
I have a pretty large DataFrame (think 10e7 rows) from which I filter elements based on some property:
val res = data.filter(data(FieldNames.myValue) === 2).select(pk.name, FieldName.myValue)
My DataFrame has n partitions…

silvanheller
3 votes, 1 answer
Spark partitioning for file write is very slow
Writing a file to HDFS with Spark is quite fast without partitioning. However, when I partition the output, the write time increases by a factor of ~24.
For the same file, writing without partitioning takes…

AlexL
3 votes, 0 answers
java.io.EOFException: Premature EOF: no length prefix available
I am inserting one table's data into another table with dynamic partitions in Hive.
insert into table x partition(c) select a,b,c from y distribute by c;
I get this error while inserting data from one of the input tables.…

zniv
3 votes, 0 answers
Hadoop MapReduce: defining separators for streaming
I'm using Hadoop 2.7.1.
I'm really struggling to understand at what point in the streaming process sorts are applied, and how to change the sort order and the separator. Reading the documentation has confused me further since some config variables…

James Owers
3 votes, 0 answers
How to get rid of the suffix "-r-00xxx" when using Hadoop MultipleOutputs?
In the MR job:
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
MultipleOutputs.addNamedOutput(job, OUTPUT, TextOutputFormat.class, NullWritable.class, Text.class);
In my Reducer:
String…

frankilee