Highest Voted 'hadoop-partitioning' Questions

3

votes

0 answers

Repartitioning prior to saving a DataFrame into parquet format necessary?

I have multiple DataFrames (DFs) storing monthly customer data for the last 5 years. Some DFs store Revenue information, others store Complaints data, and so on. All these DataFrames are keyed by Customer ID and Month, as you can see in the…

asked Jul 24 '19 at 10:29

cph_sto

7,189
12
42
78

3

votes

1 answer

How Mapper and Reducer works together "without" sorting?

I know how MapReduce works and what steps it has: mapping, shuffle and sort, reducing. Of course there are also partitioning and combiners, but that's not important right now. The interesting thing is that when I run MapReduce jobs, it looks like mappers and…

hadoop hadoop-streaming hadoop-partitioning

asked May 29 '19 at 22:56

grep

5,465
12
60
112

3

votes

2 answers

Spark RDD: partitioning according to text file format

I have a text file containing tens of GBs of data, which I need to load from HDFS and parallelize as an RDD. This text file describes items with the following format. Note that the alphabetic strings are not present (the meaning of each row is…

apache-spark hadoop rdd hadoop-partitioning

asked Jun 28 '18 at 19:09

cppstudy

323
4
21

3

votes

0 answers

Where does Spark schedule .textFile() task

Say I want to read data from an external HDFS database, and I have 3 workers in my cluster (one maybe a bit closer to external_host - but not on the same host). sc.textFile("hdfs://external_host/file.txt") I understand that Spark schedules tasks…

apache-spark hdfs hadoop-partitioning

asked Apr 25 '18 at 14:25

Joe

31
1

3

votes

1 answer

Spark spends a long time on HadoopRDD: Input split

I'm running logistic regression with SGD on a large libsvm file. The file is about 10 GB in size with 40 million training examples. When I run my Scala code with spark-submit, I notice that Spark spends a lot of time logging this: 18/02/07 04:44:50…

scala apache-spark rdd apache-spark-mllib hadoop-partitioning

asked Feb 07 '18 at 05:02

andrewmzhang

41
6

3

votes

1 answer

Skew vs Partition in Hive

After going through Skewed tables in Hive, I got confused with the way the data is stored for Skewed tables and the way it is treated for partitioned tables. Can someone clearly state the differences with marked examples as to where these two…

hive hiveql partitioning hadoop-partitioning skew

asked Jun 27 '17 at 11:51

NeoWelkin

332
1
3
12

3

votes

1 answer

Spark mapPartitionsWithIndex : Identify a partition

Identify a partition: mapPartitionsWithIndex(index, iter) The method applies a function to each partition. I understand that we can track the partition using the "index" parameter. Numerous examples have used this method to remove the…

scala apache-spark rdd hadoop-partitioning

asked Jun 12 '17 at 14:30

Kanav Sharma

307
1
5
13

3

votes

1 answer

spark repartition data for small file

I am pretty new to Spark and I am using a cluster mainly for parallelization. I have a 100MB file, each line of which is processed by an algorithm that is quite heavy and long-running. I want to use a 10 node cluster and parallelize…

java hadoop apache-spark hadoop-partitioning

asked Dec 14 '16 at 09:55

epsilones

11,279
21
61
85

3

votes

1 answer

Hive: GC Overhead or Heap space error - dynamic partitioned table

Could you please guide me in resolving this GC overhead and heap space error? I am trying to insert into a partitioned table from another table (dynamic partition) using the query below: INSERT OVERWRITE table tbl_part PARTITION(county) SELECT col1,…

hive out-of-memory reduce memory-efficient hadoop-partitioning

asked Aug 14 '16 at 07:11

Aavik

967
19
48

3

votes

3 answers

Unable to alter partition location in hive

I am trying to change the partition location of my external hive table. Command that I try to run: ALTER TALBE sl_uploads PARTITION (hivetimestamp='2016-07-26 15:00:00') SET LOCATION '/data/dev/event/uploads/hivetimestamp=2016-07-26 15:00:00' Error…

hadoop hive hadoop-partitioning

asked Jul 28 '16 at 18:32

Austin

135
4
17

3

votes

1 answer

Partition Location of RDD/Dataframe

I have a (pretty large, think 10e7 Rows) DataFrame from which I filter elements based on some property: val res = data.filter(data(FieldNames.myValue) === 2).select(pk.name, FieldName.myValue) My DataFrame has n Partitions…

apache-spark rdd apache-spark-sql hadoop-partitioning

asked Jul 22 '16 at 09:10

silvanheller

45
7

3

votes

1 answer

Spark partitioning for file write is very slow

Writing a file to HDFS with Spark is quite fast when not using partitioning. When I do use partitioning for the write, however, the write time increases by a factor of ~24. For the same file, writing without partition takes…

hadoop apache-spark hdfs parquet hadoop-partitioning

asked Apr 01 '16 at 09:52

AlexL

761
1
6
20

3

votes

0 answers

java.io.EOFException: Premature EOF: no length prefix available

I am inserting a table's data into another table with dynamic partitions in Hive: insert into table x partition(c) select a,b,c from y distribute by c; I get this error while inserting data from one of the input tables.…

hadoop hive hadoop-partitioning

asked Jan 07 '16 at 15:24

zniv

166
1
2
12

3

votes

0 answers

Hadoop mapreduce defining separators for streaming

I'm using Hadoop 2.7.1. I'm really struggling to understand at what point in the streaming process sorts are applied, and how you can change the sort order and the separator. Reading the documentation has confused me further since some config variables…

hadoop mapreduce hadoop-streaming hadoop-partitioning

asked Dec 03 '15 at 22:09

James Owers

7,948
10
55
71

3

votes

0 answers

How to get rid of the suffix "-r-00xxx" when using Hadoop MultipleOutputs?

In the MR job FileOutputFormat.setCompressOutput(job, true); FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); MultipleOutputs.addNamedOutput(job, OUTPUT, TextOutputFormat.class, NullWritable.class, Text.class); In my Reducer String…

hadoop mapreduce hadoop2 hadoop-partitioning multipleoutputs

asked Nov 17 '15 at 19:52

frankilee

77
1
7
