Questions tagged [apache-spark-2.2]

36 questions
1
vote
1 answer

How to create an emptyRDD using SparkSession (since HiveContext got deprecated)

In Spark 1.x I created an empty DataFrame like this: var baseDF = hiveContextVar.createDataFrame(sc.emptyRDD[Row], baseSchema). While migrating to Spark 2.0 (since HiveContext got deprecated, I'm using SparkSession) I tried: var baseDF =…
Shabhana • 59
1
vote
0 answers

How to prevent Apache Spark from reading a JDBC DataFrame multiple times?

I have to read data from an Oracle database using JDBC with Spark (2.2). To minimize the transferred data, I use a pushdown query, which already filters the data to be loaded. That data is then appended to an existing Hive table. To log what has been…
1
vote
0 answers

writeStream on an aggregated, windowed, watermarked DataFrame doesn't work

I am working with a CSV dataset as input, read by readStream as below: inDF = spark.readStream.option("sep", ",").option("maxFilesPerTrigger", 1).schema(rawStreamSchema).csv(rawEventsDir). Below is the schema: inDF…
1
vote
0 answers

How to use table bucketing with ephemeral EMR clusters?

I'm using Spark 2.2 with ephemeral clusters on EMR. I'd like to use Spark bucketing, and I don't care about Hive (Spark-only workloads). Can I use spark.sql.warehouse.dir with an S3 bucket to save metastore information in order to make them not…
Yann Moisan • 8,161
1
vote
1 answer

Why does elasticsearch-spark 5.5.0 fail with AbstractMethodError when submitting to YARN cluster?

I wrote a Spark job whose main goal is to write into ES, and submitted it. The issue is that when I submit it to the Spark cluster, Spark gives back [ERROR][org.apache.spark.deploy.yarn.ApplicationMaster] User class threw exception:…
0
votes
1 answer

How does Spark get the size of a DataFrame for broadcast?

I set this setting: --conf spark.sql.autoBroadcastJoinThreshold=209715200 // 200 MB, and I want to decrease this amount to be just a little higher than a specific DataFrame (let's call it bdrDf). I tried to estimate bdrDf: import…
Marwan02 • 45
0
votes
1 answer

Optimized way to apply transformation on several columns of a Spark DataFrame

In my Spark jobs, I have to apply transformations on multiple columns for 2 use cases: casting columns. In my personal use case, I use it on a DataFrame of 150 columns. def castColumns(inputDf: DataFrame, columnsDefs: Array[(String, DataType)]): DataFrame…
Marwan02 • 45
0
votes
2 answers

How to test a Spark application in Scala

I have a Spark application which receives data from files as an RDD and sends it to another service (MyService). The processing scheme looks like this: object Sender { def handle(myService: MyService) = { val rdd = getRdd() …
Aliona • 76
0
votes
1 answer

How to convert a map of struct type to JSON in Spark 2

I have a map field in a dataset with the below schema:
|-- party: map (nullable = true)
|    |-- key: string
|    |-- value: struct (valueContainsNull = true)
|    |    |-- partyName: string (nullable = true)
|    |    |-- cdrId:…
Monika • 143
0
votes
0 answers

Spark Program twice as slow in Spark 2.2 than Spark 1.6

We're migrating our Scala Spark programs from 1.6.3 to 2.2.0. The program in question has four parts: let's call them sections A, B, C and D. Section A parses the input (parquet files) and then caches the DF and creates a table. Then sections B, C…
0
votes
0 answers

Apache Spark ORC read performance when reading a large number of small files

When reading a large number of ORC files from HDFS under a directory, Spark doesn't launch any tasks for some time, and I don't see any tasks running during that period. I'm using the below command to read ORC, and spark.sql configs. What Spark is…
Giri • 1
0
votes
3 answers

How to sort each line of an RDD in Spark using Scala?

My text file has the below data:
10,14,16,19,52
08,09,12,20,45
55,56,70,78,53
I want to sort each row in descending order. I have tried the below code: val file = sc.textFile("Maximum values").map(x=>x.split(",")) val sorted = file.sortBy(x=>…
0
votes
1 answer

Spark 2 Kafka Structured Streaming: Java doesn't know the from_json function

I've got a question regarding Spark Structured Streaming on a Kafka stream. I have a schema of type: StructType schema = new StructType().add("field1", StringType).add("field2", StringType)…
0
votes
1 answer

How to load specific rows and columns from an Excel sheet through PySpark into a Hive table?

I have an Excel file with 4 worksheets. Each worksheet has the first 3 rows blank, i.e. the data starts from row 4 and continues for thousands of rows further. Note: as per the requirement, I am not supposed to delete the blank rows. My…
learner
  • 73
  • 2
  • 9
0
votes
1 answer

Issue with executing a Spark SQL job using an Oozie action

Facing a weird issue trying to execute a spark-sql (Spark 2) job using an Oozie action. The behavior is quite erratic: at times it executes fine, but sometimes it stays in the "Running" state forever. On checking the logs, I got the…