Questions tagged [apache-spark-1.3]
Use for questions specific to Apache Spark 1.3. For general questions related to Apache Spark, use the tag [apache-spark].
16 questions
16 votes · 3 answers
Why is "Error communicating with MapOutputTracker" reported when Spark tries to send GetMapOutputStatuses?
I'm using Spark 1.3 to do an aggregation on a lot of data. The job consists of four steps (sketched below):
Read a big (1TB) sequence file (corresponding to 1 day of data)
Filter out most of it and get about 1GB of shuffle write
keyBy customer
aggregateByKey() to a…
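A minimal Scala sketch of the four-step pipeline described above, under stated assumptions: the sequence file's key/value types, the filter predicate, and the extractCustomer helper are all hypothetical stand-ins, and the final step simply counts records per customer.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("daily-aggregation"))

    // Hypothetical record format: comma-separated, customer id in the first field.
    def extractCustomer(record: String): String = record.split(',')(0)

    // 1. Read the day's big sequence file (key/value types are assumptions).
    val raw = sc.sequenceFile[String, String]("hdfs:///data/day=2015-06-01")

    // 2. Filter out most of the records (hypothetical predicate).
    val kept = raw.filter { case (_, record) => record.contains("purchase") }

    // 3. Key by customer.
    val byCustomer = kept.keyBy { case (_, record) => extractCustomer(record) }

    // 4. Aggregate per key; here, a simple count per customer.
    val counts = byCustomer.aggregateByKey(0L)((acc, _) => acc + 1L, _ + _)
    counts.saveAsTextFile("hdfs:///out/daily-aggregation")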

Daniel Langdon (5,899 rep; 4/28/48 badges)
13 votes · 3 answers
Pyspark dataframe: Summing over a column while grouping over another
I have a dataframe such as the following:
In [94]: prova_df.show()
order_item_order_id   order_item_subtotal
1                     299.98
2                     199.99
2                     250.0
2                     …
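The question is PySpark, but the equivalent Spark 1.3 DataFrame call reads almost identically in Scala. A hedged sketch, with a hypothetical stand-in for prova_df built from the rows shown:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.sum

    val sc = new SparkContext(new SparkConf().setAppName("grouped-sum"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Stand-in for prova_df, using the columns from the excerpt.
    val provaDf = sc.parallelize(Seq((1, 299.98), (2, 199.99), (2, 250.0)))
      .toDF("order_item_order_id", "order_item_subtotal")

    // Sum one column while grouping over the other.
    provaDf.groupBy("order_item_order_id")
      .agg(sum("order_item_subtotal"))
      .show()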

Paolo Lami (141 rep; 1/1/4 badges)
8 votes · 1 answer
Scope of 'spark.driver.maxResultSize'
I'm running a Spark job to aggregate data. I have a custom data structure called a Profile, which basically contains a mutable.HashMap[Zone, Double]. I want to merge all profiles that share a given key (a UUID), with the following code:
def merge =…
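spark.driver.maxResultSize caps the total size of serialized results that a single action can pull back to the driver, so it matters for collect()-style calls rather than for executor-side merging. A hedged sketch, with a simplified stand-in for the Profile structure described above:

    import scala.collection.mutable
    import org.apache.spark.{SparkConf, SparkContext}

    // Simplified stand-in for the Profile in the question.
    case class Profile(zones: mutable.HashMap[String, Double]) {
      def merge(other: Profile): Profile = {
        other.zones.foreach { case (z, v) => zones(z) = zones.getOrElse(z, 0.0) + v }
        this
      }
    }

    val conf = new SparkConf()
      .setAppName("profile-merge")
      .set("spark.driver.maxResultSize", "2g")  // cap on results returned to the driver
    val sc = new SparkContext(conf)

    val profilesByUuid = sc.parallelize(Seq(
      ("uuid-1", Profile(mutable.HashMap("zoneA" -> 1.0))),
      ("uuid-1", Profile(mutable.HashMap("zoneA" -> 2.0, "zoneB" -> 3.0)))))

    // reduceByKey merges on the executors; maxResultSize only bites if the
    // merged profiles are then pulled back to the driver, e.g. with collect().
    val merged = profilesByUuid.reduceByKey(_ merge _)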

Daniel Langdon (5,899 rep; 4/28/48 badges)
5 votes · 1 answer
How to view the logs of a Spark job after it has completed and the context is closed?
I am running PySpark (Spark 1.3) in standalone mode, client mode.
I am trying to investigate my Spark job by looking at jobs from the past and comparing them. I want to view their logs, the configuration settings under which the jobs were…
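A sketch of the usual approach: enable event logging so the Spark history server can replay finished applications after their context is gone. The log directory below is a hypothetical placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("logged-job")
      .set("spark.eventLog.enabled", "true")                  // record events for finished apps
      .set("spark.eventLog.dir", "hdfs:///spark-event-logs")  // hypothetical shared location
    val sc = new SparkContext(conf)

    // After sc.stop(), point the history server at the same directory
    // (spark.history.fs.logDirectory) and start it with:
    //   ./sbin/start-history-server.sh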

buzzinolops (311 rep; 1/3/7 badges)
5 votes · 2 answers
Spark taking 2 seconds to count to 10 ...?
We're just trialling Spark, and it's proving really slow. To show what I mean, I've given an example below - it's taking Spark nearly 2 seconds to load in a text file with ten rows from HDFS, and count the number of lines. My questions:
Is this…
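A sketch of that kind of micro-benchmark, with a hypothetical path; on a ten-line file, nearly all of the wall time is job-scheduling and HDFS-connection overhead rather than the count itself:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("tiny-count"))
    val lines = sc.textFile("hdfs:///tmp/ten-rows.txt")  // hypothetical path

    val t0 = System.nanoTime()
    val n = lines.count()
    println(s"counted $n lines in ${(System.nanoTime() - t0) / 1e6} ms")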
user4081921
5 votes · 1 answer
GenericRowWithSchema exception in casting ArrayBuffer to HashSet in DataFrame to RDD from Hive table
I have a Hive table in parquet format that was generated using
create table myTable (var1 int, var2 string, var3 int, var4 string, var5 array<…>) stored as parquet;
I am able to verify that it was filled -- here is a sample…
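When struct-typed array columns come back from a HiveContext query, each element is a Row (a GenericRowWithSchema), so casting the ArrayBuffer's elements directly to another type fails. A hedged sketch of the usual workaround; the table and column names follow the excerpt, but the inner element schema is an assumption:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("array-column"))
    val hiveCtx = new HiveContext(sc)

    val df = hiveCtx.sql("SELECT var1, var5 FROM myTable")

    // Read the array column as Seq[Row] and unpack its fields explicitly,
    // instead of casting the elements to a target collection type.
    val unpacked = df.map { row =>
      val var1 = row.getInt(0)
      val var5 = row.getAs[Seq[Row]](1).map(_.getString(0)).toSet  // hypothetical inner schema
      (var1, var5)
    }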

Glenn Strycker (4,816 rep; 6/31/51 badges)
4 votes · 1 answer
Spark SQL + Window + Streaming issue: Spark SQL query takes a long time to execute when run with Spark Streaming
We are looking to implement a use case using Spark Streaming (with Flume) and Spark SQL with windowing that lets us perform CEP calculations over a set of data (see below for how the data is captured and used). The idea is to use SQL to…
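A hedged sketch of the pattern described: a windowed DStream whose batches are registered as a temp table and queried with SQL. The source (a socket stand-in for Flume), schema, and query are all assumptions:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext(new SparkConf().setAppName("windowed-sql"))
    val ssc = new StreamingContext(sc, Seconds(10))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val events = ssc.socketTextStream("stream-host", 9999)  // hypothetical source

    // Keep the last 60s of data, re-evaluated every 10s, and query it with SQL.
    events.window(Seconds(60), Seconds(10)).foreachRDD { rdd =>
      val df = rdd.map(_.split(",")).map(a => (a(0), a(1).toDouble)).toDF("id", "value")
      df.registerTempTable("events_window")
      sqlContext.sql("SELECT id, avg(value) FROM events_window GROUP BY id").show()
    }

    ssc.start()
    ssc.awaitTermination()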

Prashant Agrawal (381 rep; 3/14 badges)
2 votes · 0 answers
Spark 1.3.0: ExecutorLostFailure depending on input file size
I'm trying to run a simple Python application on a two-node cluster I set up in standalone mode: a master and a worker, where the master also takes on the role of a worker.
In the following code I'm trying to count the number of cakes occurring in a…
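The application in the question is Python; a Scala equivalent of the counting step is sketched below with a hypothetical input path. An ExecutorLostFailure that scales with input size often points at executor memory, which this sketch leaves at defaults:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("cake-count"))
    val lines = sc.textFile("hdfs:///data/bakery.txt")  // hypothetical input

    val cakes = lines.flatMap(_.split("\\s+")).filter(_.equalsIgnoreCase("cake")).count()
    println(s"found $cakes occurrences of 'cake'")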

Michael Wyss (53 rep; 4 badges)
2 votes · 0 answers
Start Spark with a remote Hive metastore (Hive from Spark)
I am trying to use a remote metastore with Spark SQL (see the sketch below):
- using Spark 1.3.1
- copied hive-site.xml from hive/conf to spark/conf
- using a MySQL remote metastore
- added the MySQL JAR to compute-classpath.sh and lib
When starting spark-sql…
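A minimal sketch of verifying the remote metastore from a Spark 1.3 program built with Hive support; the metastore host is a hypothetical placeholder:

    // hive-site.xml in spark/conf must point at the remote metastore, e.g.:
    //   <property>
    //     <name>hive.metastore.uris</name>
    //     <value>thrift://metastore-host:9083</value>   (hypothetical host)
    //   </property>
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("remote-metastore-check"))
    val hiveCtx = new HiveContext(sc)

    // If the metastore is reachable, this lists the remote databases.
    hiveCtx.sql("SHOW DATABASES").collect().foreach(println)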

user1234 (41 rep; 3 badges)
2 votes · 0 answers
Can't load a Hive table through Spark
I am new to Spark and need help figuring out why my Hive databases are not accessible for performing a data load through Spark.
Background:
I am running Hive, Spark, and my Java program on a single machine. It's a Cloudera QuickStart VM, CDH5.4x,…

Mithila Joshi (21 rep; 5 badges)
1 vote · 1 answer
What are the empty files after RDD.saveAsTextFile?
I'm learning Spark by working through some of the examples in Learning Spark: Lightning-Fast Big Data Analysis and then adding my own developments in.
I created this class to get a look at basic transformations and actions.
/**
* Find errors in a log…
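saveAsTextFile writes one part file per partition, so any partition left empty by a filter produces an empty file. A small sketch of the effect and a common workaround; the output path is hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("log-errors"))

    val logs = sc.parallelize(Seq("INFO ok", "ERROR boom", "INFO fine"), numSlices = 4)
    val errors = logs.filter(_.startsWith("ERROR"))

    // Four partitions but only one matching line: three part-0000N files are empty.
    // Collapsing partitions before saving avoids the empty files.
    errors.coalesce(1).saveAsTextFile("hdfs:///out/errors")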

runnerpaul (5,942 rep; 8/49/118 badges)
1 vote · 0 answers
Running tasks in parallel on separate Hive partitions using Scala and Spark to speed up loading Hive and writing results to Hive or Parquet
This question is a spin-off from an earlier one (saving a list of rows to a Hive table in pyspark).
EDIT: please see my update edits at the bottom of this post.
I have used both Scala and now PySpark to do the same task, but I am having problems with VERY…
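A hedged sketch of the general idea in the title: submitting independent Spark jobs over separate Hive partitions from parallel driver-side threads (here, a Scala parallel collection). The table names, partition column, and query are assumptions:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("per-partition-jobs"))
    val hiveCtx = new HiveContext(sc)

    val partitions = Seq("2015-06-01", "2015-06-02", "2015-06-03")  // hypothetical keys

    // Each .par element submits its own Spark job; the scheduler runs them concurrently.
    partitions.par.foreach { day =>
      hiveCtx.sql(
        s"INSERT OVERWRITE TABLE results PARTITION (ds='$day') " +
        s"SELECT id, count(*) FROM source_table WHERE ds='$day' GROUP BY id")
    }

Submitting jobs from multiple threads of one SparkContext is supported; whether it actually speeds things up depends on how many executor cores each per-partition job leaves idle.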

KBA (191 rep; 1/5/18 badges)
1 vote · 1 answer
Incrementally adding to a Hive table w/Scala + Spark 1.3
Our cluster has Spark 1.3 and Hive.
There is a large Hive table that I need to add randomly selected rows to.
There is a smaller table that I read and check against a condition; if that condition is true, I grab the variables I need to then query for…
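A hedged sketch of incrementally appending rows with Spark 1.3's DataFrame API; the table names and sampling fraction are assumptions:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("incremental-insert"))
    val hiveCtx = new HiveContext(sc)

    // Randomly sample ~10% of the rows (without replacement) and append them.
    val sampled = hiveCtx.sql("SELECT * FROM small_table").sample(withReplacement = false, 0.1)
    sampled.insertInto("big_table", overwrite = false)  // append; don't overwrite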

KBA (191 rep; 1/5/18 badges)
0 votes · 2 answers
In Spark, how do I read a field by its name instead of by its index?
I use Spark 1.3.
My data has 50 or more attributes, so I went for a custom class.
How do I access a field of a custom class by its name rather than by its position?
Currently I have to invoke productElement(0) every time.
Also I am not supposed…
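A hedged sketch of the common alternative to positional productElement(n) access: named fields on a class. The class and field names are assumptions; a case class would also work, but Scala 2.10 (which Spark 1.3 builds against) caps case classes at 22 fields, which matters for 50+ attributes:

    import org.apache.spark.{SparkConf, SparkContext}

    // Named fields replace productElement(0), productElement(1), ...
    class Customer(val id: String, val name: String, val balance: Double
                   /* ...more fields as needed... */) extends Serializable

    val sc = new SparkContext(new SparkConf().setAppName("named-fields"))
    val rdd = sc.parallelize(Seq(new Customer("c-1", "Ada", 10.0)))

    val total = rdd.map(c => c.balance).sum()  // fields read by name, not by index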

Surender Raja (3,553 rep; 8/44/80 badges)
0 votes · 0 answers
Spark Streaming: Py4J error while obtaining a new communication channel
I am currently running a real-time Spark Streaming job on a 50-node cluster with Spark 1.3 and Python 2.7. The Spark streaming context reads from a directory in HDFS with a batch interval of 180 seconds. Below is the configuration for the Spark…
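The job in the question is Python; a Scala sketch of the described setup (an HDFS directory source polled with 180-second batches), with a hypothetical directory:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext(new SparkConf().setAppName("dir-stream"))
    val ssc = new StreamingContext(sc, Seconds(180))      // 180s batch interval

    val batches = ssc.textFileStream("hdfs:///incoming")  // hypothetical directory
    batches.foreachRDD { rdd => println(s"batch size: ${rdd.count()} lines") }

    ssc.start()
    ssc.awaitTermination()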

Nitin Singh (76 rep; 1/1/8 badges)