Questions tagged [apache-spark-2.0]

Use for questions specific to Apache Spark 2.0. For general questions related to Apache Spark use the tag [apache-spark].

464 questions
0 votes • 0 answers

Find maximum for a timestamp through Spark groupBy dataset

I would like to find the last record for an ID in a typed Dataset. I found a DataFrame-based solution: "Find minimum for a timestamp through Spark groupBy dataframe". But how to do the…
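For a typed Dataset the usual idiom is `groupByKey` followed by `reduceGroups`, keeping whichever record carries the later timestamp. The reduction itself can be sketched in plain Python (the record shape and sample values below are illustrative assumptions, not from the question):

```python
from functools import reduce
from collections import defaultdict

def latest_per_id(rows):
    """Keep the record with the maximum timestamp per id, mirroring
    Dataset.groupByKey(_.id).reduceGroups(pick the later record)."""
    groups = defaultdict(list)
    for rid, ts in rows:
        groups[rid].append((rid, ts))
    # reduceGroups: pairwise reduction that keeps the later record
    return {rid: reduce(lambda a, b: a if a[1] >= b[1] else b, recs)
            for rid, recs in groups.items()}

# Hypothetical (id, timestamp) records
records = [("a", 3), ("a", 7), ("b", 5), ("b", 2)]
print(latest_per_id(records))  # {'a': ('a', 7), 'b': ('b', 5)}
```

The Spark equivalent reduces each group with the same comparison function, so the result is one full record per key rather than just the aggregated column.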
0 votes • 1 answer

Getting the first row of the element from an array

I want to get the first row from a Spark 2 Dataset. The dataset is as follows: |arrayValue | +-------------------------------------------------------------+ |[1.47527718E12, 134535353E12] …
Luckylukee • 575 • 2 • 9 • 27
0 votes • 1 answer

Apache Spark with Java, converting to Date Type from Varchar2 in Oracle fails

I have a use case where I want to read data from one Oracle table, where all fields are of varchar type, and save it to another Oracle table with similar fields but ideally with the correct datatypes. This has to be done only in Java. So I want to read Dataset…
abhihello123 • 1,668 • 1 • 22 • 38
0 votes • 1 answer

Spark Dataset or Dataframe for Aggregation

We have a MapR cluster with Spark version 2.0. We are trying to measure the performance difference of a Hive query that currently runs on the TEZ engine versus running it on Spark SQL, simply by writing the SQL query in an .hql file and then calling…
0 votes • 1 answer

Spark on EMR "exceeding memory limits" for checkpointed/cached job

Is my understanding of caching wrong? The resulting RDD after all my transformations is incredibly small, like 1 GB. The data it was computed from is quite large, ~700 GB in size. I have to run logic to read in thousands of pretty big files, all to…
0 votes • 0 answers

Spark 2.11 with Java, Saving DataFrame in Oracle creates columns with double quotes

Using the following code in Spark (Java), we save a dataframe to Oracle; it also creates the table if it doesn't exist. Dataset someAccountDF = sparkSession.createDataFrame(impalaAccountsDF.toJavaRDD(),…
0 votes • 3 answers

Spark job fails connecting to Oracle on first attempt

We are running a Spark job which connects to Oracle and fetches some data. Attempt 0 or 1 of the JDBCRDD task always fails with the error below; on a subsequent attempt the task completes. As suggested on a few portals, we even tried with…
Rishi Saraf • 1,644 • 2 • 14 • 27
0 votes • 1 answer

Why is there no support for sparkSession with namedObject in Spark Job Server?

I am trying to build an application with the Spark Job Server API (for Spark 2.2.0), but I found that there is no support for namedObject with sparkSession. My code looks like: import com.typesafe.config.Config import org.apache.spark.sql.SparkSession import…
arglee • 1,374 • 4 • 17 • 30
0 votes • 0 answers

Spark streaming saving dataframe fails

I am using Spark 2.2 to write to Redshift on an AWS cluster, and it is failing with the error below. I am using CDH 5.10 and Scala 2.11.8. Any ideas on how to fix this? Is it missing the snappy dependency? WARN TaskSetManager:66 - Lost task 0.0 in…
0 votes • 1 answer

Specify Azure key in Spark 2.x version

I'm trying to access a wasb (Azure Blob Storage) file in Spark and need to specify the account key. How do I specify the account key in the spark-env.sh file? fs.azure.account.key.test.blob.core.windows.net …
user1050619 • 19,822 • 85 • 237 • 413
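One common approach (assuming the hadoop-azure connector is on the classpath) is to pass the Hadoop property through Spark's `spark.hadoop.` prefix rather than through spark-env.sh. A minimal config sketch — the account name comes from the question, the key value is a placeholder:

```
# spark-defaults.conf (or --conf on spark-submit); YOUR_ACCOUNT_KEY is a placeholder
spark.hadoop.fs.azure.account.key.test.blob.core.windows.net  YOUR_ACCOUNT_KEY
```

Alternatively the same property can be set at runtime on the Hadoop configuration, e.g. `spark.sparkContext.hadoopConfiguration.set(...)` in Scala, before reading any `wasb://` path.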
0 votes • 3 answers

How to mask columns using Spark 2?

I have some tables in which I need to mask some of the columns. The columns to be masked vary from table to table, and I am reading them from an application.conf file. For example, for the employee table shown below +----+------+-----+---------+ |…
Shekhar • 11,438 • 36 • 130 • 186
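The masking rule itself is plain string work; in Spark 2 it would typically be wrapped in a UDF and applied with `withColumn` for each column listed in application.conf. A minimal plain-Python sketch of such a rule (the function name and the keep-last-4 convention are illustrative assumptions):

```python
def mask(value, visible=4, char="*"):
    """Replace all but the last `visible` characters with `char`;
    None passes through unchanged, as a Spark UDF would handle nulls."""
    if value is None:
        return None
    s = str(value)
    return char * max(len(s) - visible, 0) + s[-visible:]

print(mask("1234567890"))  # ******7890
print(mask("abc"))         # abc (shorter than the visible window)
```

Registered as a UDF, the same function could then be applied in a loop over the configured column names, replacing each column with its masked version.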
0 votes • 1 answer

Spark on HBase Jars

I am trying to run the SparkOnHbase example mentioned here -> Spark On Hbase. I am just trying to compile and run the code on my local Windows machine. My build.sbt snippet is below: scalaVersion := "2.11.8" libraryDependencies…
AJm • 993 • 2 • 20 • 39
0 votes • 0 answers

Unable to save RDD to HDFS in Apache Spark

I am getting the following error while trying to save the RDD to HDFS 17/09/13 17:06:42 WARN TaskSetManager: Lost task 7340.0 in stage 16.0 (TID 100118, XXXXXX.com, executor 2358): java.io.IOException: Failing write. Tried pipeline recovery 5 times…
vdep • 3,541 • 4 • 28 • 54
0 votes • 2 answers

Checkpointing With NOT Serializable

I want to understand a basic issue. Here is my code: def createStreamingContext(sparkCheckpointDir: String, batchDuration: Int) = { val ssc = new StreamingContext(spark.sparkContext, Seconds(batchDuration)) ssc } val ssc =…
Ayan Guha • 750 • 3 • 10
0 votes • 2 answers

Kudu with PySpark2: Error with KuduStorageHandler

I am trying to read data stored in Kudu using PySpark 2.1.0: >>> from os.path import expanduser, join, abspath >>> from pyspark.sql import SparkSession >>> from pyspark.sql import Row >>> spark = SparkSession.builder \ .master("local") \ …