Questions tagged [apache-spark-1.3]

Use for questions specific to Apache Spark 1.3. For general questions related to Apache Spark, use the tag [apache-spark].

16 questions
16 votes, 3 answers

Why is "Error communicating with MapOutputTracker" reported when Spark tries to send GetMapOutputStatuses?

I'm using Spark 1.3 to do an aggregation on a lot of data. The job consists of 4 steps: read a big (1 TB) sequence file (corresponding to 1 day of data), filter out most of it and get about 1 GB of shuffle write, keyBy customer, aggregateByKey() to a…
Daniel Langdon
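For context, a minimal PySpark sketch of the four-step shape the question describes, assuming the pyspark shell's `sc`; the toy rows, filter threshold, and sum aggregation are placeholders for the question's sequence-file input and real aggregation logic:

```python
# Toy stand-in for the 1 TB sequence-file input; the shape matches the
# question's pipeline: read -> filter -> keyBy customer -> aggregateByKey.
rows = sc.parallelize([("c1", 5.0), ("c1", 7.5), ("c2", 1.0), ("c2", 0.5)])

kept = rows.filter(lambda r: r[1] > 0.75)      # "filter out most of it"
by_customer = kept.keyBy(lambda r: r[0])       # keyBy customer
totals = by_customer.aggregateByKey(
    0.0,
    lambda acc, r: acc + r[1],                 # fold one record into the accumulator
    lambda a, b: a + b)                        # merge accumulators across partitions

print(totals.collect())                        # [('c1', 12.5), ('c2', 1.0)]
```

The final step's shuffle is where reducers ask the driver's MapOutputTracker for map-output locations (the GetMapOutputStatuses call in the title), so timeouts there usually point at an overloaded driver rather than the aggregation logic itself.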
13 votes, 3 answers

Pyspark dataframe: Summing over a column while grouping over another

I have a dataframe such as the following: In [94]: prova_df.show()
order_item_order_id  order_item_subtotal
1                    299.98
2                    199.99
2                    250.0
2                    …
Paolo Lami
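A minimal PySpark 1.3 sketch of grouping on one column while summing another, with toy rows in place of the question's order items (assuming the shell's `sc`):

```python
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
rdd = sc.parallelize([(1, 299.98), (2, 199.99), (2, 250.0)])
df = sqlContext.createDataFrame(rdd, ["order_item_order_id", "order_item_subtotal"])

# Group by one column, sum the other.
df.groupBy("order_item_order_id").sum("order_item_subtotal").show()
```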
8 votes, 1 answer

Scope of 'spark.driver.maxResultSize'

I'm running a Spark job to aggregate data. I have a custom data structure called a Profile, which basically contains a mutable.HashMap[Zone, Double]. I want to merge all profiles that share a given key (a UUID), with the following code: def merge =…
Daniel Langdon
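The question's merge code is Scala; as a hedged PySpark analogue (the dict contents and keys below are invented), merging per-key profile maps looks roughly like this. Note that spark.driver.maxResultSize only caps the serialized results that an action such as collect() ships back to the driver, not the executor-to-executor shuffle:

```python
def merge_profiles(a, b):
    # Combine two zone->value maps, summing overlapping zones.
    out = dict(a)
    for zone, value in b.items():
        out[zone] = out.get(zone, 0.0) + value
    return out

profiles = sc.parallelize([
    ("uuid-1", {"zoneA": 1.0}),
    ("uuid-1", {"zoneA": 2.0, "zoneB": 0.5}),
    ("uuid-2", {"zoneB": 3.0}),
])
merged = profiles.reduceByKey(merge_profiles)   # runs on the executors
small = merged.collect()                        # only this is bounded by maxResultSize
```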
5 votes, 1 answer

How to view the logs of a spark job after it has completed and the context is closed?

I am running pyspark, spark 1.3, standalone mode, client mode. I am trying to investigate my spark job by looking at the jobs from the past and comparing them. I want to view their logs, the configuration settings under which the jobs were…
buzzinolops
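One way to keep finished jobs inspectable is Spark's event log plus the history server; a hedged sketch (the HDFS directory is a placeholder):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("logged-job")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "hdfs:///tmp/spark-events"))   # placeholder path
sc = SparkContext(conf=conf)
```

Then `sbin/start-history-server.sh`, with `spark.history.fs.logDirectory` pointed at the same directory, serves the web UI for completed applications.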
5 votes, 2 answers

Spark taking 2 seconds to count to 10 ...?

We're just trialling Spark, and it's proving really slow. To show what I mean, I've given an example below - it's taking Spark nearly 2 seconds to load in a text file with ten rows from HDFS, and count the number of lines. My questions: Is this…
user4081921
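Much of that ~2 seconds is fixed per-job overhead (scheduling, task launch, HDFS connection setup) rather than time proportional to the ten rows. A minimal reproduction of the timing test (the path is a placeholder):

```python
import time

lines = sc.textFile("hdfs:///tmp/ten_rows.txt")   # placeholder path
start = time.time()
print(lines.count())                              # first action pays the job-setup cost
print("elapsed: %.2fs" % (time.time() - start))
```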
5 votes, 1 answer

GenericRowWithSchema exception in casting ArrayBuffer to HashSet in DataFrame to RDD from Hive table

I have a Hive table in parquet format that was generated using create table myTable (var1 int, var2 string, var3 int, var4 string, var5 array<…>) stored as parquet; I am able to verify that it was filled -- here is a sample…
Glenn Strycker
4 votes, 1 answer

Spark SQL + Window + Streaming issue - Spark SQL query is taking a long time to execute when running with Spark Streaming

We are looking to implement a use case using Spark Streaming (with Flume) and Spark SQL with windowing that allows us to perform CEP calculations over a set of data. (See below for how the data is captured and used.) The idea is to use SQL to…
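A rough PySpark 1.3 sketch of the pattern, registering each windowed micro-batch as a temporary table and querying it with SQL. The socket source, window sizes, and column name are placeholders (the question's actual source is Flume):

```python
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext

sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, 10)                    # 10 s batches (placeholder)

events = ssc.socketTextStream("localhost", 9999)  # stands in for the Flume source
windowed = events.window(60, 10)                  # 60 s window sliding every 10 s

def run_query(rdd):
    if rdd.isEmpty():
        return
    df = sqlContext.createDataFrame(rdd.map(lambda line: (line,)), ["raw"])
    df.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) AS n FROM events").show()

windowed.foreachRDD(run_query)
ssc.start()
ssc.awaitTermination()
```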
2 votes, 0 answers

Spark 1.3.0: ExecutorLostFailure depending on input file size

I'm trying to run a simple python application on a 2-node-cluster I set up in standalone mode. A master and a worker, whereas the master also takes on the role of a worker. In the following code I'm trying to count the number of cakes occurring in a…
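The shape of that job is a simple filter-and-count; a sketch with a placeholder path and word. An ExecutorLostFailure that appears only for bigger inputs typically means executors are dying (often from memory pressure), not that the counting logic is wrong:

```python
text = sc.textFile("hdfs:///tmp/input.txt")            # placeholder path
cakes = (text.flatMap(lambda line: line.split())
             .filter(lambda word: word == "cake")       # placeholder word
             .count())
print(cakes)
```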
2 votes, 0 answers

Start Spark with a remote metastore -- Hive from Spark

I am trying to use a remote metastore when using Spark SQL --> using Spark 1.3.1 --> copied hive-site.xml from hive/conf to spark/conf --> using a MySQL remote metastore --> added the MySQL jar to compute-classpath.sh and lib when starting spark-sql…
user1234
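Once hive-site.xml is on Spark's classpath (the conf/ directory), a HiveContext should talk to the configured remote MySQL-backed metastore; a minimal smoke test, assuming the shell's `sc`:

```python
from pyspark.sql import HiveContext

hive = HiveContext(sc)
hive.sql("SHOW TABLES").show()   # should list tables from the remote metastore
```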
2 votes, 0 answers

Can't load a Hive table through Spark

I am new to Spark and need help in figuring out why my Hive databases are not accessible to perform a data load through Spark. Background: I am running Hive, Spark, and my Java program on a single machine. It's a Cloudera QuickStart VM, CDH5.4x,…
1 vote, 1 answer

What are the empty files after RDD.saveAsTextFile?

I'm learning Spark by working through some of the examples in Learning Spark: Lightning-Fast Big Data Analysis and then adding my own developments in. I created this class to get a look at basic transformations and actions. /** * Find errors in a log…
runnerpaul
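saveAsTextFile writes one part-NNNNN file per partition, so partitions left with no records after a filter produce empty files. A sketch with placeholder data and output path that coalesces to avoid them:

```python
logs = sc.parallelize(["INFO ok", "ERROR boom", "INFO fine", "ERROR bad"], 4)
errors = logs.filter(lambda line: line.startswith("ERROR"))

# Four input partitions would yield up to four part files, some empty;
# coalescing first writes a single non-empty file.
errors.coalesce(1).saveAsTextFile("hdfs:///tmp/errors-out")   # placeholder path
```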
1 vote, 0 answers

Running tasks in parallel on separate Hive partitions using Scala and Spark to speed up loading Hive and writing results to Hive or Parquet

This question is a spin-off from [this one](saving a list of rows to a Hive table in pyspark). EDIT: please see my update edits at the bottom of this post. I have used both Scala and now Pyspark to do the same task, but I am having problems with VERY…
KBA
1 vote, 1 answer

Incrementally adding to a Hive table w/Scala + Spark 1.3

Our cluster has Spark 1.3 and Hive. There is a large Hive table that I need to add randomly selected rows to. There is a smaller table that I read and check a condition against; if that condition is true, then I grab the variables I need to then query for…
KBA
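A hedged PySpark 1.3 analogue of the append step (the question's code is Scala, and the table, column, and predicate names below are invented): DataFrame.insertInto appends to an existing Hive table.

```python
from pyspark.sql import HiveContext

hive = HiveContext(sc)
picked = hive.sql(
    "SELECT * FROM small_table WHERE some_flag = true")  # invented names
picked.insertInto("large_table")   # appends; pass overwrite=True to replace
```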
0 votes, 2 answers

In Spark, how do I read a field by its name instead of by its index?

I use Spark 1.3. My data has 50 or more attributes, and hence I went for a custom class. How do I access a field from a custom class by its name, not by its position? Here, every time I need to invoke the method productElement(0). Also I am not supposed…
Surender Raja
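The question is about a Scala custom class (productElement(n) is positional); as a hedged PySpark analogue of by-name access, a Row factory gives named fields (the field names here are invented):

```python
from pyspark.sql import Row

Person = Row("name", "age")    # stands in for the 50-attribute custom class
p = Person("alice", 30)
print("%s is %d" % (p.name, p.age))   # by field name, not p[0] / productElement(0)
```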
0 votes, 0 answers

Spark Streaming. Issues with Py4j: Error while obtaining a new communication channel

I am currently running a real-time Spark Streaming job on a cluster with 50 nodes on Spark 1.3 and Python 2.7. The Spark streaming context reads from a directory in HDFS with a batch interval of 180 seconds. Below is the configuration for the Spark…
Nitin Singh