Questions tagged [apache-spark-1.3]

Use for questions specific to Apache Spark 1.3. For general questions related to Apache Spark, use the tag [apache-spark].

16 questions
16 votes, 3 answers

Why is "Error communicating with MapOutputTracker" reported when Spark tries to send GetMapOutputStatuses?

I'm using Spark 1.3 to do an aggregation on a lot of data. The job consists of 4 steps: read a big (1 TB) sequence file (corresponding to 1 day of data), filter out most of it and get about 1 GB of shuffle write, keyBy customer, aggregateByKey() to a…
Daniel Langdon
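For context, a minimal PySpark sketch of the four-step shape the question describes, assuming the pyspark shell's `sc`; the toy rows, filter threshold, and sum aggregation are placeholders for the question's sequence-file input and real aggregation logic:

```python
# Toy stand-in for the 1 TB sequence-file input; the shape matches the
# question's pipeline: read -> filter -> keyBy customer -> aggregateByKey.
rows = sc.parallelize([("c1", 5.0), ("c1", 7.5), ("c2", 1.0), ("c2", 0.5)])

kept = rows.filter(lambda r: r[1] > 0.75)      # "filter out most of it"
by_customer = kept.keyBy(lambda r: r[0])       # keyBy customer
totals = by_customer.aggregateByKey(
    0.0,
    lambda acc, r: acc + r[1],                 # fold one record into the accumulator
    lambda a, b: a + b)                        # merge accumulators across partitions

print(totals.collect())                        # [('c1', 12.5), ('c2', 1.0)]
```

The final step's shuffle is where reducers ask the driver's MapOutputTracker for map-output locations (the GetMapOutputStatuses call in the title), so timeouts there usually point at an overloaded driver rather than the aggregation logic itself.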
13 votes, 3 answers

Pyspark dataframe: Summing over a column while grouping over another

I have a dataframe such as the following: In [94]: prova_df.show()
order_item_order_id  order_item_subtotal
1                    299.98
2                    199.99
2                    250.0
2                    …
Paolo Lami
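A minimal PySpark 1.3 sketch of grouping on one column while summing another, with toy rows in place of the question's order items (assuming the shell's `sc`):

```python
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
rdd = sc.parallelize([(1, 299.98), (2, 199.99), (2, 250.0)])
df = sqlContext.createDataFrame(rdd, ["order_item_order_id", "order_item_subtotal"])

# Group by one column, sum the other.
df.groupBy("order_item_order_id").sum("order_item_subtotal").show()
```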
8 votes, 1 answer

Scope of 'spark.driver.maxResultSize'

I'm running a Spark job to aggregate data. I have a custom data structure called a Profile, which basically contains a mutable.HashMap[Zone, Double]. I want to merge all profiles that share a given key (a UUID), with the following code: def merge =…
Daniel Langdon
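The question's merge code is Scala; as a hedged PySpark analogue (the dict contents and keys below are invented), merging per-key profile maps looks roughly like this. Note that spark.driver.maxResultSize only caps the serialized results that an action such as collect() ships back to the driver, not the executor-to-executor shuffle:

```python
def merge_profiles(a, b):
    # Combine two zone->value maps, summing overlapping zones.
    out = dict(a)
    for zone, value in b.items():
        out[zone] = out.get(zone, 0.0) + value
    return out

profiles = sc.parallelize([
    ("uuid-1", {"zoneA": 1.0}),
    ("uuid-1", {"zoneA": 2.0, "zoneB": 0.5}),
    ("uuid-2", {"zoneB": 3.0}),
])
merged = profiles.reduceByKey(merge_profiles)   # runs on the executors
small = merged.collect()                        # only this is bounded by maxResultSize
```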
5 votes, 1 answer

How to view the logs of a spark job after it has completed and the context is closed?

I am running pyspark, spark 1.3, standalone mode, client mode. I am trying to investigate my spark job by looking at the jobs from the past and comparing them. I want to view their logs, the configuration settings under which the jobs were…
buzzinolops
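One way to keep finished jobs inspectable is Spark's event log plus the history server; a hedged sketch (the HDFS directory is a placeholder):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("logged-job")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "hdfs:///tmp/spark-events"))   # placeholder path
sc = SparkContext(conf=conf)
```

Then `sbin/start-history-server.sh`, with `spark.history.fs.logDirectory` pointed at the same directory, serves the web UI for completed applications.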
5 votes, 2 answers

Spark taking 2 seconds to count to 10 ...?

We're just trialling Spark, and it's proving really slow. To show what I mean, I've given an example below - it's taking Spark nearly 2 seconds to load in a text file with ten rows from HDFS, and count the number of lines. My questions: Is this…
user4081921
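Much of that ~2 seconds is fixed per-job overhead (scheduling, task launch, HDFS connection setup) rather than time proportional to the ten rows. A minimal reproduction of the timing test (the path is a placeholder):

```python
import time

lines = sc.textFile("hdfs:///tmp/ten_rows.txt")   # placeholder path
start = time.time()
print(lines.count())                              # first action pays the job-setup cost
print("elapsed: %.2fs" % (time.time() - start))
```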
5 votes, 1 answer

GenericRowWithSchema exception in casting ArrayBuffer to HashSet in DataFrame to RDD from Hive table

I have a Hive table in parquet format that was generated using create table myTable (var1 int, var2 string, var3 int, var4 string, var5 array<…>) stored as parquet; I am able to verify that it was filled -- here is a sample…
Glenn Strycker
4 votes, 1 answer

Spark SQL + Window + Streaming issue - Spark SQL query is taking a long time to execute when running with Spark Streaming

We are looking to implement a use case using Spark Streaming (with Flume) and Spark SQL with windowing that allows us to perform CEP calculations over a set of data. (See below for how the data is captured and used.) The idea is to use SQL to…
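A rough PySpark 1.3 sketch of the pattern, registering each windowed micro-batch as a temporary table and querying it with SQL. The socket source, window sizes, and column name are placeholders (the question's actual source is Flume):

```python
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext

sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, 10)                    # 10 s batches (placeholder)

events = ssc.socketTextStream("localhost", 9999)  # stands in for the Flume source
windowed = events.window(60, 10)                  # 60 s window sliding every 10 s

def run_query(rdd):
    if rdd.isEmpty():
        return
    df = sqlContext.createDataFrame(rdd.map(lambda line: (line,)), ["raw"])
    df.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) AS n FROM events").show()

windowed.foreachRDD(run_query)
ssc.start()
ssc.awaitTermination()
```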
2 votes, 0 answers

Spark 1.3.0: ExecutorLostFailure depending on input file size

I'm trying to run a simple python application on a 2-node-cluster I set up in standalone mode. A master and a worker, whereas the master also takes on the role of a worker. In the following code I'm trying to count the number of cakes occurring in a…
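The shape of that job is a simple filter-and-count; a sketch with a placeholder path and word. An ExecutorLostFailure that appears only for bigger inputs typically means executors are dying (often from memory pressure), not that the counting logic is wrong:

```python
text = sc.textFile("hdfs:///tmp/input.txt")            # placeholder path
cakes = (text.flatMap(lambda line: line.split())
             .filter(lambda word: word == "cake")       # placeholder word
             .count())
print(cakes)
```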
2 votes, 0 answers

Start Spark with a remote metastore -- Hive from Spark

I am trying to use a remote metastore when using Spark SQL --> using Spark 1.3.1 --> copied hive-site.xml from hive/conf to spark/conf --> using a MySQL remote metastore --> added the MySQL jar to compute-classpath.sh and lib when starting spark-sql…
user1234
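Once hive-site.xml is on Spark's classpath (the conf/ directory), a HiveContext should talk to the configured remote MySQL-backed metastore; a minimal smoke test, assuming the shell's `sc`:

```python
from pyspark.sql import HiveContext

hive = HiveContext(sc)
hive.sql("SHOW TABLES").show()   # should list tables from the remote metastore
```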
2 votes, 0 answers

Can't load a Hive table through Spark

I am new to Spark and need help in figuring out why my Hive databases are not accessible to perform a data load through Spark. Background: I am running Hive, Spark, and my Java program on a single machine. It's a Cloudera QuickStart VM, CDH5.4x,…
1 vote, 1 answer

What are the empty files after RDD.saveAsTextFile?

I'm learning Spark by working through some of the examples in Learning Spark: Lightning-Fast Big Data Analysis and then adding my own developments in. I created this class to get a look at basic transformations and actions. /** * Find errors in a log…
runnerpaul
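saveAsTextFile writes one part-NNNNN file per partition, so partitions left with no records after a filter produce empty files. A sketch with placeholder data and output path that coalesces to avoid them:

```python
logs = sc.parallelize(["INFO ok", "ERROR boom", "INFO fine", "ERROR bad"], 4)
errors = logs.filter(lambda line: line.startswith("ERROR"))

# Four input partitions would yield up to four part files, some empty;
# coalescing first writes a single non-empty file.
errors.coalesce(1).saveAsTextFile("hdfs:///tmp/errors-out")   # placeholder path
```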
1 vote, 0 answers

Running tasks in parallel on separate Hive partitions using Scala and Spark to speed up loading Hive and writing results to Hive or Parquet

This question is a spin-off from [this one](saving a list of rows to a Hive table in pyspark). EDIT: please see my update edits at the bottom of this post. I have used both Scala and now Pyspark to do the same task, but I am having problems with VERY…
KBA
1 vote, 1 answer

Incrementally adding to a Hive table w/Scala + Spark 1.3

Our cluster has Spark 1.3 and Hive. There is a large Hive table that I need to add randomly selected rows to. There is a smaller table that I read and check a condition against; if that condition is true, then I grab the variables I need to then query for…
KBA
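A hedged PySpark 1.3 analogue of the append step (the question's code is Scala, and the table, column, and predicate names below are invented): DataFrame.insertInto appends to an existing Hive table.

```python
from pyspark.sql import HiveContext

hive = HiveContext(sc)
picked = hive.sql(
    "SELECT * FROM small_table WHERE some_flag = true")  # invented names
picked.insertInto("large_table")   # appends; pass overwrite=True to replace
```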
0 votes, 2 answers

In Spark, how do I read a field by its name instead of by its index?

I use Spark 1.3. My data has 50 or more attributes, and hence I went for a custom class. How do I access a field from a custom class by its name, not by its position? Here, every time I need to invoke the method productElement(0). Also I am not supposed…
Surender Raja
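The question is about a Scala custom class (productElement(n) is positional); as a hedged PySpark analogue of by-name access, a Row factory gives named fields (the field names here are invented):

```python
from pyspark.sql import Row

Person = Row("name", "age")    # stands in for the 50-attribute custom class
p = Person("alice", 30)
print("%s is %d" % (p.name, p.age))   # by field name, not p[0] / productElement(0)
```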
0 votes, 0 answers

Spark Streaming. Issues with Py4j: Error while obtaining a new communication channel

I am currently running a real-time Spark Streaming job on a cluster with 50 nodes on Spark 1.3 and Python 2.7. The Spark streaming context reads from a directory in HDFS with a batch interval of 180 seconds. Below is the configuration for the Spark…
Nitin Singh