Questions tagged [apache-spark-1.6]
Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark, use the tag [apache-spark].
111 questions
2
votes
1 answer
pyspark memory issue: Caused by: java.lang.OutOfMemoryError: Java heap space
Folks,
I'm running PySpark code to read a 500 MB file from HDFS and construct a NumPy matrix from its contents.
Cluster info:
9 datanodes
128 GB memory / 48 vCores per node
Job config
conf = SparkConf().setAppName('test') \
…
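Heap-space errors like this usually mean the data is being materialized on the driver, e.g. by collecting the RDD to build a single NumPy matrix. As a hedged sketch (the memory values are illustrative, not from the question), the job config would raise the relevant limits:

from pyspark import SparkConf, SparkContext

# Illustrative values only -- tune to the cluster. Note that in client
# mode 'spark.driver.memory' must be set before the JVM starts (e.g. via
# spark-submit --driver-memory), not from inside the application.
conf = (SparkConf()
        .setAppName('test')
        .set('spark.executor.memory', '8g')
        .set('spark.driver.memory', '8g')
        .set('spark.driver.maxResultSize', '4g'))
sc = SparkContext(conf=conf)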
2
votes
0 answers
Spark temp tables not found
I'm trying to run a pySpark job with custom inputs, for testing purposes.
The job has three sets of input, each read from a table in a different metastore database.
The data is read in Spark with: hiveContext.table('myDb.myTable')
The test inputs…

summerbulb
- 5,709
- 8
- 37
- 83
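In Spark 1.6, temp tables are scoped to the SQLContext/HiveContext that registered them, so a plausible cause is that the test inputs were registered on a different context than the one running the job. A minimal sketch of the rule (table and column names are made up):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName='temp-table-scope')
hiveContext = HiveContext(sc)

# Register the test input on the SAME context the job queries with;
# a temp table registered on a second HiveContext is invisible here.
test_df = hiveContext.createDataFrame([(1, 'a')], ['id', 'val'])
test_df.registerTempTable('myTable')  # Spark 1.6 API

hiveContext.sql('SELECT * FROM myTable').show()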
2
votes
3 answers
Pyspark: How to return a tuple list of existing non-null columns as one of the column values in a dataframe
I'm working with a PySpark dataframe which is:
+----+----+---+---+---+----+
| a| b| c| d| e| f|
+----+----+---+---+---+----+
| 2|12.3| 5|5.6| 6|44.7|
|null|null| 9|9.3| 19|23.5|
| 8| 4.3| 7|0.5| 21| 8.2|
| 9| 3.8| 3|6.5| 45|…

Mia21
- 119
- 2
- 10
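One way to approach this in Spark 1.6 is a Python UDF that zips column names with row values and keeps the non-null pairs. A hedged sketch, where df is the dataframe above and the 'name=value' output format is only one possible choice:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

cols = ['a', 'b', 'c', 'd', 'e', 'f']

# Collect 'name=value' strings for the columns that are not null.
def non_null_pairs(*values):
    return [name + '=' + str(v) for name, v in zip(cols, values) if v is not None]

pairs_udf = F.udf(non_null_pairs, ArrayType(StringType()))
result = df.withColumn('non_null', pairs_udf(*[F.col(c) for c in cols]))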
2
votes
1 answer
Exception in thread "main" java.lang.NoClassDefFoundError: org/ejml/simple/SimpleBase
It seems it's missing the Efficient Java Matrix Library (EJML), so I have downloaded it from the sources here. I'm creating an executable JAR with Maven and running it on an OpenStack EDP Spark environment.
I'm having trouble figuring out how to…

Dheeraj Chitara
- 31
- 1
- 5
2
votes
1 answer
Why does importing SparkSession in spark-shell fail with "object SparkSession is not a member of package org.apache.spark.sql"?
I use Spark 1.6.0 on my VM, a Cloudera machine.
I'm trying to enter some data into a Hive table from the Spark shell.
To do that, I am trying to use SparkSession, but the import below is not working.
scala> import…

Metadata
- 2,127
- 9
- 56
- 127
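SparkSession was introduced in Spark 2.0, so the import can never resolve on 1.6; the 1.6-era entry points are SQLContext and HiveContext (spark-shell already provides one as sqlContext). A PySpark sketch of the equivalent write path, with hypothetical data and table names:

from pyspark import SparkContext
from pyspark.sql import HiveContext  # Spark 1.6 entry point; SparkSession arrives in 2.0

sc = SparkContext(appName='hive-write')
hiveContext = HiveContext(sc)

# Hypothetical data, just to show the 1.6-era route into a Hive table.
df = hiveContext.createDataFrame([(1, 'a')], ['id', 'val'])
df.write.mode('append').saveAsTable('my_hive_table')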
2
votes
1 answer
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SQLContext
I am using IntelliJ IDEA 2016.3.
import sbt.Keys._
import sbt._
object ApplicationBuild extends Build {
  object Versions {
    val spark = "1.6.3"
  }
  val projectName = "example-spark"
  val common = Seq(
    version := "1.0",
    …

Mahesh
- 178
- 3
- 14
2
votes
2 answers
How to use different Hive metastore for saveAsTable?
I am using Spark SQL (Spark 1.6.1) via PySpark, and I need to load a table from one Hive metastore and write the resulting dataframe into a different Hive metastore.
I am wondering how I can use two different metastores for…

Srinivas Bandaru
- 311
- 1
- 4
- 16
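A HiveContext binds to the single metastore named by hive.metastore.uris at the time it is created, so one hedged workaround is to export the dataframe to a shared path and register it as an external table from a session pointed at the second metastore (all names and paths below are illustrative):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName='two-metastores')
hiveContext = HiveContext(sc)

# Read from the metastore this context is bound to.
df = hiveContext.table('sourceDb.sourceTable')

# Persist to a shared location; a second job (or beeline session) bound
# to the other metastore can then CREATE EXTERNAL TABLE over this path.
df.write.parquet('hdfs:///shared/exported_table')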
2
votes
2 answers
How to read a space-delimited text file and save it to Hive?
I have a string like the one below. The first row is the header, and the rest are the column values.
I want to create a dataframe (Spark 1.6 and Java 7) from the string, and convert the values under col3 and col4 to DOUBLE.
col1 col2 col3 col4 col5
val1…

John Thomas
- 212
- 3
- 21
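The question targets Java 7, but for consistency with the other sketches on this page, here is the same approach in PySpark 1.6 (the path and final table name are illustrative): split on whitespace, take the first row as the header, cast, and save.

from pyspark.sql import functions as F

raw = sc.textFile('hdfs:///path/to/file.txt')
header = raw.first()
rows = raw.filter(lambda l: l != header).map(lambda l: l.split())
df = sqlContext.createDataFrame(rows, header.split())

# Cast col3/col4 to DOUBLE, then write to Hive (needs a HiveContext in 1.6).
df = (df.withColumn('col3', F.col('col3').cast('double'))
        .withColumn('col4', F.col('col4').cast('double')))
df.write.saveAsTable('my_table')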
2
votes
2 answers
How to do GROUP BY on an exploded field in Spark SQL?
Zeppelin 0.6
Spark 1.6
SQL
I am trying to find the 20 most frequent words in some tweets. The filtered column contains an array of words for each tweet. The following:
select explode(filtered) AS words from tweettable
lists each word as you would expect,…

schoon
- 2,858
- 3
- 46
- 78
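The alias introduced by explode is not visible to GROUP BY in the same SELECT; wrapping the explode in a subquery (or using a LATERAL VIEW) is the usual fix. A sketch, wrapped in PySpark for runnability, using the tweettable and filtered names from the question:

top_words = sqlContext.sql("""
    SELECT words, COUNT(*) AS cnt
    FROM (SELECT explode(filtered) AS words FROM tweettable) t
    GROUP BY words
    ORDER BY cnt DESC
    LIMIT 20
""")
top_words.show()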
1
vote
1 answer
Convert a (String, List[(String, String)]) to a JSON object
I have the data as:
(ID001,List((BookType,[text]),(author,xyz abc),(time,01/12/2019[22:00] CST/PM))),(ID002,List((BookType,[text]),(author,klj fgh),(time,19/02/2019[12:00] CST/AM)))
I need to convert this to a JSON object:
{"ID001":{
…

chris
- 43
- 2
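The question is Scala, but the shape of the transformation is language-independent: fold each (key, value) list into a map keyed by the outer ID, then serialize. A plain-Python sketch with abbreviated stand-in data:

import json

# Stand-in mirror of the Scala structure: (id, [(key, value), ...]).
records = [('ID001', [('BookType', '[text]'), ('author', 'xyz abc')]),
           ('ID002', [('BookType', '[text]'), ('author', 'klj fgh')])]

# Turn each pair list into a dict, keyed by the record ID.
obj = {rid: dict(pairs) for rid, pairs in records}
print(json.dumps(obj, indent=2))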
1
vote
1 answer
How to display a mismatch report with a label in Spark 1.6 using Scala's except function?
Consider two dataframes, df1 and df2.
df1 has the data below:
A | B
-------
1 | m
2 | n
3 | o
df2 has the data below:
A | B
-------
1 | m
2 | n
3 | p
df1.except(df2) returns
A | B
-------
3 | o
3 | p
How to display the result as…

voidpro
- 1,652
- 13
- 27
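PySpark 1.6 exposes the same operation as subtract; running it in both directions and tagging each side with a literal label column yields a mismatch report. A hedged sketch (the label values are made up):

from pyspark.sql import functions as F

only_in_df1 = df1.subtract(df2).withColumn('label', F.lit('missing_in_df2'))
only_in_df2 = df2.subtract(df1).withColumn('label', F.lit('missing_in_df1'))
report = only_in_df1.unionAll(only_in_df2)  # unionAll is the 1.6 name
report.show()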
1
vote
2 answers
Repartition() causes Spark job to fail
I have a Spark job that runs fine with the code below. However, this step creates several files in the output folder.
sampledataframe.write.mode('append').partitionBy('DATE_FIELD').save(FILEPATH)
So I started using the line of code below to…

Bob
- 335
- 1
- 4
- 16
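Without the actual error text this is only a guess, but when the goal is fewer output files, coalesce is the usual alternative: it narrows partitions without the full shuffle that repartition triggers and that often tips a job over its memory limits. A sketch reusing the question's names (the partition count is illustrative):

(sampledataframe
    .coalesce(10)
    .write.mode('append')
    .partitionBy('DATE_FIELD')
    .save(FILEPATH))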
1
vote
1 answer
Pyspark - DataFrame persist() errors out with java.lang.OutOfMemoryError: GC overhead limit exceeded
A PySpark job fails when I try to persist a DataFrame that was created from a table of size ~270 GB, with the error
Exception in thread "yarn-scheduler-ask-am-thread-pool-9"
java.lang.OutOfMemoryError: GC overhead limit exceeded
This issue happens only…

Sam
- 17
- 5
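A common mitigation for GC thrash when persisting a table this size is a serialized, disk-spilling storage level, which trades some CPU for far less heap pressure than deserialized in-memory caching. A hedged sketch (the table name is illustrative):

from pyspark.storagelevel import StorageLevel

df = hiveContext.table('big_db.big_table')
df.persist(StorageLevel.MEMORY_AND_DISK_SER)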
1
vote
0 answers
Spark 1.6 - Overwrite directory with avro files failing using dataframes
I have a directory in HDFS which contains Avro files. When I try to overwrite the directory with a dataframe, it fails.
Syntax: avroData_df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("")
The error is:
Caused by:…

Mnav505
- 13
- 3
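In Spark 1.6, overwriting the same directory a dataframe was read from fails because the input is deleted before the write finishes. One hedged workaround, sketched in PySpark for consistency with the other examples here, is to write to a staging path and swap it into place afterwards (the staging path is illustrative):

# Write somewhere else first, then rename in HDFS once the job succeeds.
(avroData_df.write
    .mode('overwrite')
    .format('com.databricks.spark.avro')
    .save('/data/avro_staging'))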
1
vote
1 answer
Spark Streaming 1.6 + Kafka: Too many batches in "queued" status
I'm using Spark Streaming to consume messages from a Kafka topic that has 10 partitions. I'm using the direct approach to consume from Kafka, and the code can be found below:
def createStreamingContext(conf: Conf): StreamingContext = {
val…

Jorge Cespedes
- 547
- 1
- 11
- 21
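Batches queue up when each one takes longer to process than the batch interval; in Spark 1.6 the usual levers are backpressure and a per-partition rate cap, so ingestion cannot outrun processing. A hedged config sketch (the cap value is illustrative):

from pyspark import SparkConf

conf = (SparkConf()
        .set('spark.streaming.backpressure.enabled', 'true')
        .set('spark.streaming.kafka.maxRatePerPartition', '1000'))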