Questions tagged [apache-spark-2.2]
36 questions
1
vote
1 answer
How to create emptyRDD using SparkSession (since HiveContext got deprecated)
In Spark version 1.*
I created an emptyRDD like below:
var baseDF = hiveContextVar.createDataFrame(sc.emptyRDD[Row], baseSchema)
While migrating to Spark 2.0 (since hiveContext got deprecated, I am using SparkSession)
I tried:
var baseDF =…

Shabhana
- 59
- 2
- 2
- 12
1
vote
0 answers
How to prevent Apache Spark from reading a JDBC DataFrame multiple times?
I have to read data from an Oracle Database using JDBC with Spark (2.2). To minimize the transferred data, I use a pushdown query, which already filters the data to be loaded. That data is then appended to an existing Hive table.
To log what has been…

Hanebambel
- 109
- 11
1
vote
0 answers
writeStream on an aggregated, windowed, watermarked DataFrame doesn't work
I am working with a CSV dataset as input, read by readStream as below:
inDF = spark \
    .readStream \
    .option("sep", ",") \
    .option("maxFilesPerTrigger", 1) \
    .schema(rawStreamSchema) \
    .csv(rawEventsDir)
Below the schema:
inDF…

Roberto Patrizi
- 21
- 3
1
vote
0 answers
How to use table bucketing with ephemeral EMR clusters?
I'm using Spark 2.2 with ephemeral clusters on EMR.
I'd like to use Spark bucketing and I don't care about Hive (Spark-only workloads).
Can I use spark.sql.warehouse.dir with an S3 bucket to save metastore information in order to make them not…

Yann Moisan
- 8,161
- 8
- 47
- 91
1
vote
1 answer
Why does elasticsearch-spark 5.5.0 fail with AbstractMethodError when submitting to YARN cluster?
I wrote a Spark job whose main goal is to write into ES. The issue is that when I submit it to a Spark cluster, Spark returns:
[ERROR][org.apache.spark.deploy.yarn.ApplicationMaster] User class threw exception:…

no123ff
- 307
- 5
- 16
0
votes
1 answer
How does Spark get the size of a DataFrame for broadcast?
I set this setting: --conf spark.sql.autoBroadcastJoinThreshold=209715200 // 200 MB
I want to decrease this amount to be just a little higher than the size of a specific DataFrame (let's call it bdrDf).
I tried to estimate the size of bdrDf:
import…

Marwan02
- 45
- 6
0
votes
1 answer
Optimized way to apply transformation on several columns of a Spark DataFrame
In my Spark jobs, I have to apply transformations to multiple columns for 2 use cases:
Casting columns
In my personal use case, I use it on a DataFrame of 150 columns
def castColumns(inputDf: DataFrame, columnsDefs: Array[(String, DataType)]): DataFrame…

Marwan02
- 45
- 6
0
votes
2 answers
How to test a Spark application in Scala
I have a Spark Application, which receives data from files as RDD and sends it to another service (MyService). The processing scheme looks like this:
object Sender {
  def handle(myService: MyService) = {
    val rdd = getRdd()
    …

Aliona
- 76
- 3
0
votes
1 answer
How to convert a Map of Struct type to JSON in Spark2
I have a map field in a dataset with the schema below:
 |-- party: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- partyName: string (nullable = true)
 |    |    |-- cdrId:…

Monika
- 143
- 2
- 3
- 11
0
votes
0 answers
Spark program twice as slow in Spark 2.2 as in Spark 1.6
We're migrating our Scala Spark programs from 1.6.3 to 2.2.0. The program in question has four parts: let's call them sections A, B, C and D. Section A parses the input (parquet files) and then caches the DF and creates a table. Then sections B, C…

Doug T
- 21
- 2
0
votes
0 answers
Apache Spark ORC read performance when reading a large number of small files
When reading a large number of ORC files from a directory on HDFS, Spark doesn't launch any tasks for some time, and I don't see any tasks running during that time. I'm using the command and spark.sql configs below to read the ORC files.
What spark is…

Giri
- 1
- 2
0
votes
3 answers
How to sort each line of an RDD in Spark using Scala?
My text file has the data below:
10,14,16,19,52
08,09,12,20,45
55,56,70,78,53
I want to sort each row in descending order. I have tried the code below:
val file = sc.textFile("Maximum values").map(x=>x.split(","))
val sorted = file.sortBy(x=>…

abdul rahim
- 5
- 4
0
votes
1 answer
Spark2 Kafka Structured Streaming Java doesn't know the from_json function
I've got a question with regard to Spark Structured Streaming on a Kafka stream.
I have a schema of type:
StructType schema = new StructType()
    .add("field1", StringType)
    .add("field2", StringType)
    …

user2450954
- 19
- 2
- 4
0
votes
1 answer
How to load a specific row and column from an Excel sheet through PySpark into a Hive table?
I have an Excel file with 4 worksheets. In each worksheet the first 3 rows are blank, i.e. the data starts at row 4 and continues for thousands of rows.
Note: As per the requirement I am not supposed to delete the blank rows.
My…

learner
- 73
- 2
- 9
0
votes
1 answer
Issue with executing a Spark SQL job using an Oozie action
I am facing a weird issue: I am trying to execute a spark-sql (Spark2) job using an Oozie action, but its behavior is inconsistent. At times it executes fine, but sometimes it stays in the "Running" state forever. On checking the logs, I got the…

Sumit Khurana
- 159
- 1
- 10