Questions tagged [apache-spark-1.6]

Use for questions specific to Apache Spark 1.6. For general questions related to Apache Spark, use the tag [apache-spark].

111 questions
2 votes, 1 answer

pyspark memory issue: Caused by: java.lang.OutOfMemoryError: Java heap space

Folks, I am running pyspark code to read a 500 MB file from HDFS and construct a numpy matrix from the contents of the file. Cluster info: 9 datanodes, 128 GB memory / 48 vCore CPU per node. Job config: conf = SparkConf().setAppName('test') \ …
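
The usual culprit here is collecting raw file contents onto the driver. A minimal sketch of the standard remedy, assuming the matrix is assembled on the driver with collect(); the path and memory size are hypothetical:

```python
from pyspark import SparkConf, SparkContext
import numpy as np

# Give the executors more heap; note that in client mode the driver's heap
# must be set via spark-submit --driver-memory, since the driver JVM is
# already running by the time SparkConf is read.
conf = (SparkConf()
        .setAppName('test')
        .set('spark.executor.memory', '8g'))
sc = SparkContext(conf=conf)

# Parse lines into numeric arrays on the executors so that only compact
# float arrays, not raw strings, are shipped back to the driver.
rows = (sc.textFile('hdfs:///path/to/file.txt')   # hypothetical path
          .map(lambda line: np.array(line.split(), dtype=float)))
matrix = np.vstack(rows.collect())
```
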
2 votes, 0 answers

Spark temp tables not found

I'm trying to run a pySpark job with custom inputs, for testing purposes. The job has three sets of input, each read from a table in a different metastore database. The data is read in Spark with hiveContext.table('myDb.myTable'). The test inputs…
summerbulb • 5,709 • 8 • 37 • 83
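
In Spark 1.6, a temporary table is only visible to the exact SQLContext/HiveContext instance that registered it, and it is never qualified by a database name. A minimal sketch of injecting a test fixture, assuming the job and the test share one context (names are hypothetical):

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName='temp-table-test')
hiveContext = HiveContext(sc)

# Register the fixture on the SAME context the job reads from; a temp
# table created on another context instance is simply not found.
fixture = hiveContext.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'val'])
fixture.registerTempTable('myTable')

# Temp tables have no database, so hiveContext.table('myDb.myTable')
# will not resolve them; the job must read the unqualified name.
df = hiveContext.table('myTable')
```
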
2 votes, 3 answers

Pyspark: How to return a list of tuples of the existing non-null columns as one of the column values in a dataframe

I'm working with a pyspark dataframe which is:
+----+----+---+---+---+----+
|   a|   b|  c|  d|  e|   f|
+----+----+---+---+---+----+
|   2|12.3|  5|5.6|  6|44.7|
|null|null|  9|9.3| 19|23.5|
|   8| 4.3|  7|0.5| 21| 8.2|
|   9| 3.8|  3|6.5| 45|…
Mia21 • 119 • 2 • 10
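
One way to express this in PySpark 1.6 is a UDF that walks all columns of a row and keeps the non-null (name, value) pairs. A sketch, assuming string-rendered tuples are acceptable and df is the dataframe above:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

COLS = ['a', 'b', 'c', 'd', 'e', 'f']   # column names from the question

# Keep a "(name, value)" entry for every cell that is not null.
def non_null_pairs(*cells):
    return [str((name, str(v))) for name, v in zip(COLS, cells) if v is not None]

non_null_udf = F.udf(non_null_pairs, ArrayType(StringType()))

result = df.withColumn('non_null', non_null_udf(*[F.col(c) for c in COLS]))
result.show(truncate=False)
```
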
2 votes, 1 answer

Exception in thread "main" java.lang.NoClassDefFoundError: org/ejml/simple/SimpleBase

It seems it's missing the Java library Efficient Java Matrix Library (ejml), so I have downloaded it from the sources here. I'm building an executable JAR with Maven and running it on an OpenStack EDP Spark environment. I'm having trouble figuring out how to…
2 votes, 1 answer

Why does importing SparkSession in spark-shell fail with "object SparkSession is not a member of package org.apache.spark.sql"?

I use Spark 1.6.0 on my VM, a Cloudera machine. I'm trying to insert some data into a Hive table from the Spark shell. To do that, I am trying to use SparkSession. But the import below is not working. scala> import…
Metadata • 2,127 • 9 • 56 • 127
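
SparkSession does not exist in Spark 1.6 (it was introduced in 2.0), so this import can never resolve; the 1.6 entry point for Hive tables is HiveContext. The question is in Scala, but the same steps in PySpark terms look roughly like this (the table name is hypothetical):

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

# Spark 1.6: HiveContext, not SparkSession, gives access to Hive tables.
sc = SparkContext(appName='hive-insert')
hiveContext = HiveContext(sc)

df = hiveContext.createDataFrame([(1, 'x')], ['id', 'val'])
df.write.mode('append').saveAsTable('mydb.mytable')   # hypothetical table
```
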
2 votes, 1 answer

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SQLContext

I am using IntelliJ IDEA 2016.3.

import sbt.Keys._
import sbt._

object ApplicationBuild extends Build {
  object Versions {
    val spark = "1.6.3"
  }
  val projectName = "example-spark"
  val common = Seq(
    version := "1.0",
    …
2 votes, 2 answers

How to use a different Hive metastore for saveAsTable?

I am using Spark SQL (Spark 1.6.1) with PySpark, and I have a requirement to load a table from one Hive metastore and write the resulting dataframe into a different Hive metastore. I am wondering how I can use two different metastores for…
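
A HiveContext in Spark 1.6 is bound to the single metastore in its hive-site.xml, so a common workaround is to stage the data as files and register them on the other side. A rough sketch, with hypothetical paths and table names:

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName='two-metastores')
hive = HiveContext(sc)   # bound to the metastore from hive-site.xml

# Read from the first metastore and stage the result as plain files;
# files are the neutral hand-off between the two metastores.
df = hive.table('sourcedb.sourcetable')
df.write.mode('overwrite').parquet('hdfs:///tmp/staging/sourcetable')

# Then, in a job configured against the second metastore, register them:
# CREATE EXTERNAL TABLE targetdb.targettable (...) STORED AS PARQUET
#   LOCATION 'hdfs:///tmp/staging/sourcetable'
```
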
2 votes, 2 answers

How to read a space-delimited text file and save it to Hive?

I have a string like below. The first row is the header, and the rest are the column values. I want to create a dataframe (Spark 1.6 and Java 7) from the String, and convert the values under col3 and col4 to DOUBLE. col1 col2 col3 col4 col5 val1…
John Thomas • 212 • 3 • 21
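
The question asks for Java 7, but the shape of the answer is the same in any Spark 1.6 API: split off the header, split each line on whitespace, cast the two columns, and save. A PySpark sketch, assuming sc and sqlContext already exist (path and table name are hypothetical):

```python
from pyspark.sql import functions as F

raw = sc.textFile('hdfs:///path/input.txt')   # hypothetical path
header = raw.first()
cols = header.split()

# Drop the header row and split the rest into string cells.
rows = raw.filter(lambda l: l != header).map(lambda l: l.split())
df = sqlContext.createDataFrame(rows, cols)

# Cast col3 and col4 to DOUBLE, then persist to Hive.
df = (df.withColumn('col3', F.col('col3').cast('double'))
        .withColumn('col4', F.col('col4').cast('double')))
df.write.saveAsTable('mydb.space_delimited')   # hypothetical table
```
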
2 votes, 2 answers

How to do GROUP BY on an exploded field in Spark SQL?

Zeppelin 0.6, Spark 1.6, SQL. I am trying to find the top 20 most common words in some tweets. filtered contains an array of words for each tweet. The following: select explode(filtered) AS words from tweettable lists each word as you would expect,…
schoon • 2,858 • 3 • 46 • 78
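
The trick is to explode in a subquery (or a registered temp table) and group by the alias; grouping directly on explode(...) in the same SELECT is what fails. A sketch, assuming a sqlContext as in the Zeppelin notebook:

```python
# Explode into a derived table first, then aggregate the alias.
top20 = sqlContext.sql("""
    SELECT words, COUNT(*) AS cnt
    FROM (SELECT explode(filtered) AS words FROM tweettable) t
    GROUP BY words
    ORDER BY cnt DESC
    LIMIT 20
""")
top20.show()
```
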
1 vote, 1 answer

Convert a (String, List[(String, String)]) to JSON object

I have the data as: (ID001,List((BookType,[text]),(author,xyz abc),(time,01/12/2019[22:00] CST/PM))),(ID002,List((BookType,[text]),(author,klj fgh),(time,19/02/2019[12:00] CST/AM))) I need to convert this to a JSON object: {"ID001":{ …
chris • 43 • 2
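
The question is Scala, but the transformation itself is just "fold each (key, value) list into a map keyed by the record ID". A Python sketch of that shape, using the sample records from the question:

```python
import json

records = [
    ('ID001', [('BookType', '[text]'), ('author', 'xyz abc'),
               ('time', '01/12/2019[22:00] CST/PM')]),
    ('ID002', [('BookType', '[text]'), ('author', 'klj fgh'),
               ('time', '19/02/2019[12:00] CST/AM')]),
]

# Fold each key/value list into a dict keyed by the record ID.
obj = {rid: dict(kvs) for rid, kvs in records}
print(json.dumps(obj))
# {"ID001": {"BookType": "[text]", "author": "xyz abc", ...}, ...}
```
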
1 vote, 1 answer

How to display a mismatch report with a label in Spark 1.6 using Scala's except function?

Consider there are 2 dataframes, df1 and df2. df1 has the below data:

A | B
-------
1 | m
2 | n
3 | o

df2 has the below data:

A | B
-------
1 | m
2 | n
3 | p

df1.except(df2) returns:

A | B
-------
3 | o
3 | p

How to display the result as…
voidpro • 1,652 • 13 • 27
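
PySpark 1.6's subtract() is the counterpart of Scala's except(), so the labelled report can be built from two subtractions and a union. A sketch, with df1 and df2 as in the question:

```python
from pyspark.sql import functions as F

# Rows only in df1 and rows only in df2, each tagged with its origin.
only_in_df1 = df1.subtract(df2).withColumn('source', F.lit('df1'))
only_in_df2 = df2.subtract(df1).withColumn('source', F.lit('df2'))

report = only_in_df1.unionAll(only_in_df2)
report.show()
# +---+---+------+
# |  A|  B|source|
# +---+---+------+
# |  3|  o|   df1|
# |  3|  p|   df2|
# +---+---+------+
```
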
1 vote, 2 answers

Repartition() causes spark job to fail

I have a Spark job that runs fine with the below code. However, this step creates several files in the output folder. sampledataframe.write.mode('append').partitionBy('DATE_FIELD').save(FILEPATH) So I have started to use the below line of code to…
Bob • 335 • 1 • 4 • 16
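
If the goal is merely fewer output files, coalesce() is usually the safer tool: it narrows to fewer partitions without the full shuffle that repartition() triggers. A sketch, with the dataframe and path from the question (the partition count is hypothetical):

```python
# coalesce(10) reduces the number of output files without
# repartition()'s cluster-wide shuffle.
(sampledataframe
    .coalesce(10)
    .write.mode('append')
    .partitionBy('DATE_FIELD')
    .save(FILEPATH))
```
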
1 vote, 1 answer

Pyspark - DataFrame persist() errors out with java.lang.OutOfMemoryError: GC overhead limit exceeded

A Pyspark job fails when I try to persist a DataFrame that was created on a table of size ~270 GB, with the error: Exception in thread "yarn-scheduler-ask-am-thread-pool-9" java.lang.OutOfMemoryError: GC overhead limit exceeded. This issue happens only…
Sam • 17 • 5
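
For inputs in this size range, a serialized, disk-spilling storage level often avoids the GC death spiral of keeping everything on the heap. A sketch, with df as the dataframe from the question:

```python
from pyspark import StorageLevel

# MEMORY_AND_DISK_SER stores serialized blocks and spills what does not
# fit to disk, instead of forcing ~270 GB worth of objects into the heap.
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
df.count()   # materialize the cache once, then reuse df
```
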
1 vote, 0 answers

Spark 1.6 - Overwriting a directory of Avro files with dataframes fails

I have a directory in HDFS which contains Avro files. When I try to overwrite the directory with a dataframe, it fails. Syntax: avroData_df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("") The error is: Caused by:…
Mnav505 • 13 • 3
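
A frequent cause is reading from and overwriting the same directory in one job, which Spark 1.6 refuses because the overwrite deletes the input mid-read. A sketch of the common staging workaround, with hypothetical paths and an assumed sqlContext:

```python
df = sqlContext.read.format('com.databricks.spark.avro').load('/data/avro/in')

result = df.filter(df['some_col'].isNotNull())   # hypothetical transformation

# Write to a staging directory, then swap it over the original afterwards
# (e.g. with hdfs dfs -mv); overwriting the input path directly fails.
(result.write
       .mode('overwrite')
       .format('com.databricks.spark.avro')
       .save('/data/avro/in_tmp'))
```
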
1 vote, 1 answer

Spark Streaming 1.6 + Kafka: Too many batches in "queued" status

I'm using Spark Streaming to consume messages from a Kafka topic which has 10 partitions. I'm using the direct approach to consume from Kafka, and the code can be found below: def createStreamingContext(conf: Conf): StreamingContext = { val…
Jorge Cespedes • 547 • 1 • 11 • 21
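
Batches pile up in "queued" when each batch takes longer to process than the batch interval. Enabling backpressure and capping the per-partition intake are the usual first steps; a PySpark sketch of the same direct-stream setup (broker, topic, rate cap, and interval are hypothetical):

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

conf = (SparkConf()
        .setAppName('kafka-direct')
        .set('spark.streaming.backpressure.enabled', 'true')
        .set('spark.streaming.kafka.maxRatePerPartition', '1000'))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)   # 10-second batch interval

# Direct stream over the 10-partition topic; Spark creates one RDD
# partition per Kafka partition, so parallelism follows the topic.
stream = KafkaUtils.createDirectStream(
    ssc, ['mytopic'], {'metadata.broker.list': 'broker:9092'})
```
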