Questions tagged [apache-spark-2.2]

36 questions
1
vote
1 answer

How to create an emptyRDD using SparkSession (since HiveContext got deprecated)

In Spark 1.x I created an empty DataFrame like this: var baseDF = hiveContextVar.createDataFrame(sc.emptyRDD[Row], baseSchema). While migrating to Spark 2.0 (since HiveContext got deprecated, I'm using SparkSession) I tried: var baseDF =…
Shabhana • 59
1
vote
0 answers

How to prevent Apache Spark from reading a JDBC DataFrame multiple times?

I have to read data from an Oracle database using JDBC with Spark (2.2). To minimize the transferred data, I use a pushdown query, which already filters the data to be loaded. That data is then appended to an existing Hive table. To log what has been…
1
vote
0 answers

writeStream on an aggregated, windowed, watermarked DataFrame doesn't work

I am working with a CSV dataset as input, read by readStream as below: inDF = spark.readStream.option("sep", ",").option("maxFilesPerTrigger", 1).schema(rawStreamSchema).csv(rawEventsDir). Below is the schema: inDF…
1
vote
0 answers

How to use table bucketing with ephemeral EMR clusters?

I'm using Spark 2.2 with ephemeral clusters on EMR. I'd like to use Spark bucketing, and I don't care about Hive (Spark-only workloads). Can I use spark.sql.warehouse.dir with an S3 bucket to save metastore information in order to make them not…
Yann Moisan • 8,161
1
vote
1 answer

Why does elasticsearch-spark 5.5.0 fail with AbstractMethodError when submitting to YARN cluster?

I wrote a Spark job whose main goal is to write into ES, and submitted it. The issue is that when I submit it to the Spark cluster, Spark gives back [ERROR][org.apache.spark.deploy.yarn.ApplicationMaster] User class threw exception:…
0
votes
1 answer

How does Spark get the size of a DataFrame for broadcast?

I set this setting: --conf spark.sql.autoBroadcastJoinThreshold=209715200 // 200 MB, and I want to decrease this amount to be just a little higher than a specific DataFrame (let's call it bdrDf). I tried to estimate bdrDf: import…
Marwan02 • 45
0
votes
1 answer

Optimized way to apply transformation on several columns of a Spark DataFrame

In my Spark jobs, I have to apply transformations on multiple columns for 2 use cases: casting columns. In my personal use case, I use it on a DataFrame of 150 columns. def castColumns(inputDf: DataFrame, columnsDefs: Array[(String, DataType)]): DataFrame…
Marwan02 • 45
0
votes
2 answers

How to test a Spark application in Scala

I have a Spark application which receives data from files as an RDD and sends it to another service (MyService). The processing scheme looks like this: object Sender { def handle(myService: MyService) = { val rdd = getRdd() …
Aliona • 76
0
votes
1 answer

How to convert a map of struct type to JSON in Spark 2

I have a map field in a dataset with the below schema:
|-- party: map (nullable = true)
|    |-- key: string
|    |-- value: struct (valueContainsNull = true)
|    |    |-- partyName: string (nullable = true)
|    |    |-- cdrId:…
Monika • 143
0
votes
0 answers

Spark Program twice as slow in Spark 2.2 than Spark 1.6

We're migrating our Scala Spark programs from 1.6.3 to 2.2.0. The program in question has four parts: let's call them sections A, B, C and D. Section A parses the input (parquet files) and then caches the DF and creates a table. Then sections B, C…
0
votes
0 answers

Apache Spark ORC read performance when reading a large number of small files

When reading a large number of ORC files from HDFS under a directory, Spark doesn't launch any tasks for some time, and I don't see any tasks running during that period. I'm using the below command to read ORC, and spark.sql configs. What Spark is…
Giri • 1
0
votes
3 answers

How to sort each line of an RDD in Spark using Scala?

My text file has the below data:
10,14,16,19,52
08,09,12,20,45
55,56,70,78,53
I want to sort each row in descending order. I have tried the below code: val file = sc.textFile("Maximum values").map(x=>x.split(",")) val sorted = file.sortBy(x=>…
0
votes
1 answer

Spark 2 Kafka Structured Streaming: Java doesn't know the from_json function

I've got a question regarding Spark Structured Streaming on a Kafka stream. I have a schema of type: StructType schema = new StructType().add("field1", StringType).add("field2", StringType)…
0
votes
1 answer

How to load specific rows and columns from an Excel sheet through PySpark into a Hive table?

I have an Excel file with 4 worksheets. Each worksheet has the first 3 rows blank, i.e. the data starts from row 4 and continues for thousands of rows further. Note: as per the requirement, I am not supposed to delete the blank rows. My…
learner
  • 73
  • 2
  • 9
0
votes
1 answer

Issue with executing a Spark SQL job using an Oozie action

Facing a weird issue trying to execute a spark-sql (Spark 2) job using an Oozie action. The behavior is quite erratic: at times it executes fine, but sometimes it stays in the "Running" state forever. On checking the logs, I got the…