Questions tagged [hadoop-partitioning]

Hadoop partitioning covers questions about how Hadoop decides which key/value pairs are sent to which reducer (partition).
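
As a rough illustration of that decision, here is a minimal Python sketch modeled on the formula used by Hadoop's default HashPartitioner, `(hashCode & Integer.MAX_VALUE) % numReduceTasks`. The function names `java_string_hash` and `get_partition` are hypothetical, and the hash emulates Java's `String.hashCode()` for strings without supplementary characters; real Hadoop key types such as `Text` compute their own hash codes, so actual partition numbers will differ.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode() as a signed 32-bit integer.

    Assumption: the string contains only BMP characters, so each Python
    character corresponds to one Java UTF-16 code unit.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap like 32-bit int arithmetic
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return h - 0x100000000 if h >= 0x80000000 else h


def get_partition(key: str, num_reduce_tasks: int) -> int:
    """Pick a reducer index in [0, num_reduce_tasks) for a key.

    Masking with 0x7FFFFFFF (Integer.MAX_VALUE) clears the sign bit so the
    modulo result is never negative.
    """
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks


if __name__ == "__main__":
    for key in ["apple", "banana", "cherry"]:
        print(key, "->", get_partition(key, 4))
```

Because the partition depends only on the key and the number of reducers, every pair with the same key lands on the same reducer, which is what lets a reducer see all values for a key together.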

339 questions
3
votes
0 answers

Is repartitioning necessary prior to saving a DataFrame in Parquet format?

I have multiple DataFrames (DFs) storing monthly customer data for the last 5 years. Some DFs store Revenue information, others store Complaints data, and so on. All these DataFrames are keyed by Customer ID and Month, as you can see in the…
cph_sto
3
votes
1 answer

How do Mapper and Reducer work together "without" sorting?

I know how MapReduce works and what steps it has: mapping, shuffle and sorting, reducing. Of course there are also partitioning and combiners, but that's not important right now. The interesting thing is that when I run MapReduce jobs, it looks like mappers and…
grep
3
votes
2 answers

Spark RDD: partitioning according to text file format

I have a text file containing tens of GBs of data, which I need to load from HDFS and parallelize as an RDD. This text file describes items with the following format. Note that the alphabetic strings are not present (the meaning of each row is…
cppstudy
3
votes
0 answers

Where does Spark schedule .textFile() task

Say I want to read data from an external HDFS database, and I have 3 workers in my cluster (one may be a bit closer to external_host, but not on the same host): sc.textFile("hdfs://external_host/file.txt"). I understand that Spark schedules tasks…
Joe
3
votes
1 answer

Spark spends a long time on HadoopRDD: Input split

I'm running logistic regression with SGD on a large libsvm file. The file is about 10 GB in size with 40 million training examples. When I run my Scala code with spark-submit, I notice that Spark spends a lot of time logging this: 18/02/07 04:44:50…
3
votes
1 answer

Skew vs Partition in Hive

After reading about skewed tables in Hive, I am confused about how data is stored for skewed tables versus how it is treated for partitioned tables. Can someone clearly state the differences, with examples, as to where these two…
NeoWelkin
3
votes
1 answer

Spark mapPartitionsWithIndex: Identify a partition

Identify a partition: mapPartitionsWithIndex(index, iter). The method applies a function to each partition. I understand that we can track the partition using the "index" parameter. Numerous examples have used this method to remove the…
Kanav Sharma
3
votes
1 answer

Spark: repartitioning data for a small file

I am pretty new to Spark and I am using a cluster mainly for parallelization. I have a 100MB file, each line of which is processed by an algorithm that is heavy and slow. I want to use a 10-node cluster and parallelize…
epsilones
3
votes
1 answer

Hive: GC Overhead or Heap space error - dynamic partitioned table

Could you please help me resolve this GC overhead and heap space error? I am trying to insert into a partitioned table from another table (dynamic partition) using the query below: INSERT OVERWRITE table tbl_part PARTITION(county) SELECT col1,…
3
votes
3 answers

Unable to alter partition location in Hive

I am trying to change the partition location of my external Hive table. The command I am trying to run: ALTER TALBE sl_uploads PARTITION (hivetimestamp='2016-07-26 15:00:00') SET LOCATION '/data/dev/event/uploads/hivetimestamp=2016-07-26 15:00:00' Error…
Austin
3
votes
1 answer

Partition Location of RDD/Dataframe

I have a (pretty large, think 10e7 rows) DataFrame from which I filter elements based on some property: val res = data.filter(data(FieldNames.myValue) === 2).select(pk.name, FieldName.myValue) My DataFrame has n partitions…
3
votes
1 answer

Spark partitioning for file write is very slow

Writing a file to HDFS using Spark is quite fast without partitioning. When I use partitioning for the write, however, the write time increases by a factor of ~24. For the same file, writing without partition takes…
AlexL
3
votes
0 answers

java.io.EOFException: Premature EOF: no length prefix available

I am inserting a table's data into another table with dynamic partitions in Hive: insert into table x partition(c) select a,b,c from y distribute by c; I get this error while inserting data from one of the input tables.…
zniv
3
votes
0 answers

Hadoop MapReduce: defining separators for streaming

I'm using Hadoop 2.7.1. I'm really struggling to understand at what point in the streaming process sorts are applied, and how you can change the sort order and the separator. Reading the documentation has confused me further since some config variables…
James Owers
3
votes
0 answers

How to get rid of the suffix "-r-00xxx" when using Hadoop MultipleOutputs?

In the MR job FileOutputFormat.setCompressOutput(job, true); FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); MultipleOutputs.addNamedOutput(job, OUTPUT, TextOutputFormat.class, NullWritable.class, Text.class); In my Reducer String…