Questions tagged [apache-spark-1.4]
31 questions

Use for questions specific to Apache Spark 1.4. For general questions related to Apache Spark, use the tag [apache-spark].
1 vote · 1 answer
Spark: DecoderException: java.lang.OutOfMemoryError
I am running a Spark streaming application on a cluster with 3 worker nodes. Once in a while, jobs fail due to the following exception:
Job aborted due to stage failure: Task 0 in stage 4508517.0 failed 4 times, most recent failure: Lost task…

user3646174 · 45 · 5

1 vote · 1 answer
Slow or incomplete saveAsParquetFile from EMR Spark to S3
I have a piece of code that creates a DataFrame and persists it to S3. Below creates a DataFrame of 1000 rows and 100 columns, populated by math.Random. I'm running this on a cluster with 4 x r3.8xlarge worker nodes, and configuring plenty of…

Kirk Broadhurst · 27,836 · 16 · 104 · 169

1 vote · 1 answer
Spark 1.4 Mllib LDA topicDistributions() returning wrong number of documents
I have an LDA model running on corpus size of 12,054 documents with vocab size of 9,681 words and 60 clusters. I am trying to get the topic distribution over documents by calling .topicDistributions() or .javaTopicDistributions(). Both of these…

smannan · 136 · 1 · 1 · 4

1 vote · 2 answers
Spark SQL + Streaming issues
We are trying to implement a use case using Spark Streaming and Spark SQL that allows us to run user-defined rules against some data (See below for how the data is captured and used). The idea is to use SQL to specify the rules and return the…

Subhash Vaddiparty · 11 · 1

1 vote · 2 answers
Spark grouping and custom aggregation
I have data as below,
n1 d1 un1 mt1 1
n1 d1 un1 mt2 2
n1 d1 un1 mt3 3
n1 d1 un1 mt4 4
n1 d2 un1 mt1 3
n1 d2 un1 mt3 3
n1 d2 un1 mt4 4
n1 d2 un1 mt5 6
n1 d2 un1 mt2 3
I want to get the output as below
n1 d1 un1 0.75
n1 d2 un1…

Akash · 355 · 4 · 11
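The excerpt above truncates before the aggregation rule is fully specified, but its general shape — group rows by the first three fields and reduce each group's values with a custom function — can be sketched outside Spark in plain Python. The mean aggregator below is only a placeholder, not the asker's actual formula, and the sample rows are copied from the question:

```python
from collections import defaultdict

def group_aggregate(rows, agg):
    """Group rows of (n, d, un, metric, value) by (n, d, un) and
    reduce each group's values with the supplied agg function."""
    groups = defaultdict(list)
    for n, d, un, metric, value in rows:
        groups[(n, d, un)].append(value)
    return {key: agg(vals) for key, vals in groups.items()}

rows = [
    ("n1", "d1", "un1", "mt1", 1),
    ("n1", "d1", "un1", "mt2", 2),
    ("n1", "d1", "un1", "mt3", 3),
    ("n1", "d1", "un1", "mt4", 4),
    ("n1", "d2", "un1", "mt1", 3),
]
# Placeholder aggregation: the mean of each group's values.
result = group_aggregate(rows, agg=lambda vals: sum(vals) / len(vals))
print(result[("n1", "d1", "un1")])  # 2.5
```

In Spark this corresponds to a groupBy on the key columns followed by a custom aggregation; the pluggable `agg` parameter is where the asker's rule would go.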
1 vote · 1 answer
Compile error while calling updateStateByKey
Compile Error:
The method updateStateByKey(Function2<…, Optional<…>, Optional<…>>…

dexter · 451 · 1 · 4 · 19
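The contract behind `updateStateByKey` — a function receiving the current batch's new values for a key plus that key's previous (optional) state, and returning the new state — can be sketched outside Spark in plain Python. The names and the running-sum state below are illustrative, not Spark API:

```python
def update_state(new_values, prev_state):
    """Mimic updateStateByKey's update function: combine this batch's
    values for a key with its previous state (None if absent)."""
    if not new_values and prev_state is None:
        return None
    return (prev_state or 0) + sum(new_values)

def apply_batch(state, batch):
    """Apply one micro-batch of (key, value) pairs to the state dict."""
    per_key = {}
    for key, value in batch:
        per_key.setdefault(key, []).append(value)
    for key, values in per_key.items():
        state[key] = update_state(values, state.get(key))
    return state

state = apply_batch({}, [("a", 1), ("a", 2), ("b", 5)])
state = apply_batch(state, [("a", 3)])
print(state)  # {'a': 6, 'b': 5}
```

In the Java API the same function is typed as `Function2<List<V>, Optional<S>, Optional<S>>`, with `Optional` standing in for the `None`/value distinction above; compile errors like the one quoted usually come from mismatched type parameters in that signature.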
1 vote · 1 answer
CaseWhen in spark DataFrame
I'd like to understand how to use the CaseWhen expression with the new DataFrame API.
I can't see any reference to it in the documentation, and the only place I saw it was in the…

lev · 3,986 · 4 · 33 · 46
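The semantics of CaseWhen — an ordered list of (condition, value) branches plus a default, like SQL's CASE WHEN — can be sketched in plain Python. In PySpark this is exposed through `pyspark.sql.functions.when(...).otherwise(...)`; the helper and the grading example below are purely illustrative:

```python
def case_when(branches, default=None):
    """Return a function evaluating SQL-style CASE WHEN logic:
    branches is a list of (predicate, value) pairs tried in order;
    the first matching predicate wins, else default is returned."""
    def evaluate(row):
        for predicate, value in branches:
            if predicate(row):
                return value
        return default
    return evaluate

# Illustrative example: map a numeric score to a letter grade.
grade = case_when(
    [(lambda r: r["score"] >= 90, "A"),
     (lambda r: r["score"] >= 80, "B")],
    default="C",
)
print(grade({"score": 85}))  # B
```

Branch order matters: a row with score 95 satisfies both predicates but takes "A" because its branch is tried first, exactly as in SQL CASE WHEN.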
0 votes · 1 answer
pyspark 1.4 how to get list in aggregated function
I want to get a list of a column's values in an aggregate function in pyspark 1.4. collect_list is not available. Does anyone have a suggestion for how to do it?
Original columns:
ID, date, hour, cell
1, 1030, 01, cell1
1, 1030, 01, cell2
2, 1030, 01,…

Helen Z · 21 · 1 · 8
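Since `collect_list` only appeared in later Spark releases, the usual 1.4-era workaround is to drop to the underlying RDD and group values per key. The equivalent grouping can be sketched in plain Python; the column names come from the question, while the third row's cell value (truncated in the excerpt) is a made-up placeholder:

```python
from collections import defaultdict

def collect_cells(rows):
    """Group cell values per (ID, date, hour) key, emulating what a
    collect_list aggregation would produce for these columns."""
    grouped = defaultdict(list)
    for id_, date, hour, cell in rows:
        grouped[(id_, date, hour)].append(cell)
    return dict(grouped)

rows = [
    (1, "1030", "01", "cell1"),
    (1, "1030", "01", "cell2"),
    (2, "1030", "01", "cellX"),  # placeholder: truncated in the question
]
print(collect_cells(rows)[(1, "1030", "01")])  # ['cell1', 'cell2']
```

On an RDD this corresponds to keying each row by (ID, date, hour) and calling groupByKey, then converting the grouped values to lists.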
0 votes · 1 answer
Python versions in worker node and master node vary
Running spark 1.4.1 on CentOS 6.7. Have both python 2.7 and python 3.5.1 installed on it with anaconda.
Made sure that the PYSPARK_PYTHON env var is set to python3.5, but when I open the pyspark shell and execute a simple rdd transformation, it errors out…

Abhi · 1,153 · 1 · 23 · 38

0 votes · 1 answer
Spark worker node removed but not gone
I am using Spark standalone with a master and a single worker just to test. At first I used one worker box but now I decided to use a different worker box. To do this, I stopped the Master that was running, I changed the IP in the conf/slave file,…

user1342645 · 655 · 3 · 8 · 13

0 votes · 1 answer
Select values from a dataframe column
I would like to calculate the difference between two values from within the same column. Right now I just want the difference between the last value and the first value, however using last(column) returns a null result. Is there a reason last()…

the3rdNotch · 637 · 2 · 8 · 18
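The computation the question describes — the last value of a column minus the first — is trivial once the rows have a well-defined order, which is the usual catch in Spark: DataFrame rows carry no inherent ordering, so first/last are only meaningful after a sort. A plain-Python sketch of the intended arithmetic, with illustrative data:

```python
def last_minus_first(values):
    """Difference between the last and first values of an
    already-ordered sequence (the ordering must be established
    upstream, e.g. by sorting on a timestamp column)."""
    if not values:
        raise ValueError("empty column")
    return values[-1] - values[0]

print(last_minus_first([10, 12, 15, 19]))  # 9
```

The null result mentioned in the question is consistent with asking for "last" over an unordered or partitioned collection, where the notion of a last element is not well defined.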
0 votes · 1 answer
Databricks - How to create a Library with updated maven artifacts
We initially created a library in Databricks using a Maven artifact; all the jars are present in the library, and please note that this Maven artifact is ours.
We found a few issues with the artifact, fixed them, and updated it in Maven Central…

sag · 5,333 · 8 · 54 · 91

0 votes · 1 answer
Apache Spark 1.4.1 Build Failed
I have downloaded Apache Spark 1.4.1 from the official site.
I don't have Hadoop installed on my machine.
Apache provides a build command, so I tried to start building the project using the following command:
build/mvn -Pyarn -Phadoop-2.4…

Avinash Mishra · 1,346 · 3 · 21 · 41

0 votes · 1 answer
Spark 1.4 image for Google Cloud?
With bdutil, the latest tarball version I can find is for Spark 1.3.1:
gs://spark-dist/spark-1.3.1-bin-hadoop2.6.tgz
There are a few new DataFrame features in Spark 1.4 that I want to use. Is there any chance a Spark 1.4 image will be available for bdutil, or…

Haiying Wang · 652 · 7 · 10

0 votes · 4 answers
Why does insertInto fail when working with tables in non-default database?
I'm using Spark 1.4.0 (PySpark). I have a DataFrame loaded from a Hive table using this query:
sqlContext = HiveContext(sc)
table1_contents = sqlContext.sql("SELECT * FROM my_db.table1")
When I attempt to insert data from table1_contents after some…

oikonomiyaki · 7,691 · 15 · 62 · 101