Questions tagged [apache-spark-2.3]

39 questions
8 votes • 2 answers

When is it not practical, performance-wise, to use persist() on a Spark DataFrame?

While working on improving code performance, as I had many jobs fail (aborted), I thought about using the persist() function on a Spark DataFrame whenever I need to reuse that same DataFrame in many other operations. When doing it and following the jobs,…

SarahData • 769 • 1 • 12 • 38
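The trade-off the question asks about can be sketched as below (a minimal outline, assuming pyspark is available and `df` is a DataFrame reused by several actions; names are illustrative):

```python
def reuse_with_persist(df):
    """Cache a DataFrame that several actions reuse, then release it.

    persist() only pays off when the plan would otherwise be recomputed
    more than once; for a single action the extra memory/serialization
    cost can make jobs slower or even fail.
    """
    from pyspark import StorageLevel  # imported lazily; requires pyspark

    df.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if it doesn't fit
    total = df.count()                # first action materializes the cache
    preview = df.limit(10).collect()  # later actions reuse the cached data
    df.unpersist()                    # free executor memory when done
    return total, preview
```

MEMORY_AND_DISK is usually the safer default here: MEMORY_ONLY silently drops partitions that don't fit, forcing recomputation.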
7 votes • 3 answers

Writing a CSV file using Spark and Java: handling empty values and quotes

Initial data is in a Dataset and I am trying to write to a pipe-delimited file, and I want each non-empty, non-null cell value to be placed in quotes. Empty or null values should not get quotes. result.coalesce(1).write() …

Ram Grandhi • 397 • 10 • 27
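The question is in Java, but the writer options are the same in PySpark; a minimal sketch of the closest built-in behaviour (quoting every written value) follows — selectively quoting only non-empty cells would need pre-formatting the columns instead:

```python
def write_piped_csv(df, path):
    """Write `df` pipe-delimited with every written value wrapped in quotes.

    quoteAll quotes each value the writer emits; nullValue controls what
    null cells become. The asked-for behaviour (quotes only on non-empty,
    non-null cells) is not a single option in Spark 2.3.
    """
    (df.coalesce(1)
       .write
       .option("delimiter", "|")
       .option("quoteAll", "true")  # quote every written value
       .option("nullValue", "")     # string used for null cells
       .option("header", "true")
       .csv(path))
```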
6 votes • 2 answers

Convert PySpark DataFrame to pandas DataFrame

I have a PySpark DataFrame whose dimensions are (28002528, 21) and I tried to convert it to a pandas DataFrame using the following line: pd_df = spark_df.toPandas() I got this error: Py4JJavaError: An error occurred while calling…

Ahmad Senousi • 613 • 2 • 12 • 24
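A sketch of the usual mitigations, assuming the Py4JJavaError is driver memory exhaustion (the common cause at ~28M rows); the conf name is the Spark 2.3-era one, and the sampling fraction is illustrative:

```python
def to_pandas_safely(spark, spark_df, fraction=0.01):
    """Convert a large PySpark DataFrame to pandas without OOM-ing the driver.

    toPandas() collects every row onto the driver. Arrow makes the transfer
    much cheaper; sampling keeps the result small. For the full dataset,
    spark.driver.memory must be large enough to hold it.
    """
    # Arrow-based columnar transfer (renamed in later Spark versions).
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    # Collect only a sample instead of all ~28M rows.
    return spark_df.sample(fraction=fraction, seed=42).toPandas()
```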
4 votes • 2 answers

Spark (2.3) not able to identify new columns in Parquet table added via Hive ALTER TABLE command

I have a Hive Parquet table which I am creating using the Spark 2.3 API df.saveAsTable. There is a separate Hive process that alters the same Parquet table to add columns (based on requirements). However, the next time I try to read the same Parquet…

user2717470 • 257 • 1 • 5 • 15
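A minimal sketch of the usual first step, assuming the issue is Spark's cached table metadata hiding columns added outside Spark (table name is a placeholder):

```python
def read_after_hive_alter(spark, table):
    """Pick up columns added to a Parquet table outside Spark.

    Spark caches table metadata and file listings, so columns added via
    Hive's ALTER TABLE stay invisible until the cached entry is refreshed.
    """
    spark.catalog.refreshTable(table)  # drop cached metadata for this table
    return spark.sql("SELECT * FROM {}".format(table))
```

If the files themselves carry differing schemas, reading the path directly with `.option("mergeSchema", "true").parquet(path)` is the other lever to try.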
3 votes • 1 answer

Writing DataFrame as Parquet creates empty files

I am trying to do some performance optimization for a Spark job using the bucketing technique. I am reading .parquet and .csv files and doing some transformations. After that I bucket and join two DataFrames. Then I write the joined DF to Parquet, but…
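A minimal sketch of a bucketed write, assuming a hypothetical join-key column `id`; empty part files usually come from empty shuffle partitions, which repartitioning on the bucket key avoids:

```python
def write_bucketed(df, table_name):
    """Write a DataFrame bucketed by a join key.

    Bucketed output must go through saveAsTable (a metastore table);
    a plain .parquet(path) write ignores bucketBy.
    """
    (df.repartition("id")       # align data with the bucket key first
       .write
       .bucketBy(8, "id")       # 8 buckets on the (hypothetical) key 'id'
       .sortBy("id")
       .mode("overwrite")
       .saveAsTable(table_name))
```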
2 votes • 1 answer

Airflow: use LivyBatchOperator for submitting PySpark applications in YARN

I have encountered something called LivyBatchOperator, but I am unable to find a good example of using it to submit PySpark applications in Airflow. Any info on this would be appreciated. Thanks in advance.

kavya • 75 • 1 • 10
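LivyBatchOperator comes from a third-party Airflow plugin, so rather than guess its exact signature, here is a sketch of the Livy REST call it wraps — a POST to the `/batches` endpoint; `livy_url` and `app_path` are placeholders:

```python
import json
from urllib import request

def build_batch_payload(app_path):
    """Body for Livy's POST /batches call (what a Livy batch operator sends)."""
    return {"file": app_path,  # e.g. an HDFS path to the .py application
            "conf": {"spark.submit.deployMode": "cluster"}}

def submit_pyspark_batch(livy_url, app_path):
    """Submit a PySpark application to YARN through the Livy REST API."""
    req = request.Request(
        "{}/batches".format(livy_url),
        data=json.dumps(build_batch_payload(app_path)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # Livy replies with the batch id/state
        return json.loads(resp.read())
```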
2 votes • 0 answers

Apache Spark not connecting to Hive metastore (database not found)

I have Java Spark code where I'm trying to connect to a Hive database, but it has only the default database and gives me a NoSuchDatabaseException. I tried the following to set the Hive metastore: add a Spark conf in code with the Hive metastore URI; add a Spark…

Gowtham • 87 • 1 • 14
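The symptom (only a `default` database) typically means Spark fell back to its built-in catalog instead of the external metastore. A minimal PySpark sketch of the session setup, with a placeholder metastore URI:

```python
def hive_session(metastore_uri):
    """Build a SparkSession that talks to an external Hive metastore.

    Without enableHiveSupport() and a reachable metastore URI, Spark uses
    its in-memory/Derby catalog, which only contains `default` - hence
    the NoSuchDatabaseException.
    """
    from pyspark.sql import SparkSession  # requires pyspark

    return (SparkSession.builder
            .appName("hive-test")
            .config("hive.metastore.uris", metastore_uri)  # e.g. thrift://host:9083
            .enableHiveSupport()
            .getOrCreate())
```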
2 votes • 2 answers

Optimizing reading data into Spark from Azure Blob Storage

We have data for a table residing in Azure Blob Storage, which acts as a data lake. Data is ingested every 30 minutes, forming time partitions as below in…
2 votes • 1 answer

PySpark self-join fails with "Resolved attribute(s) missing"

While doing a PySpark DataFrame self-join I got an error message: Py4JJavaError: An error occurred while calling o1595.join. : org.apache.spark.sql.AnalysisException: Resolved attribute(s) un_val#5997 missing from day#290,item_listed#281,filename#286…

Maviles • 3,209 • 2 • 25 • 39
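This AnalysisException usually appears because both sides of a self-join share the same attribute ids; aliasing each side gives the analyzer distinct ones. A minimal sketch with a hypothetical key column:

```python
def self_join(df, key):
    """Self-join a DataFrame without the 'Resolved attribute(s) missing' error.

    Joining a DataFrame with a derived copy of itself reuses identical
    column references; .alias() on each side disambiguates them.
    """
    from pyspark.sql import functions as F  # requires pyspark

    left = df.alias("l")
    right = df.alias("r")
    return left.join(right, F.col("l." + key) == F.col("r." + key))
```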
2 votes • 0 answers

Spark spark.sql.session.timeZone doesn't work with JSON source

Does Spark v2.3.1 depend on the local timezone when reading from a JSON file? My src/test/resources/data/tmp.json: [ { "timestamp": "1970-01-01 00:00:00.000" } ] and Spark code: SparkSession.builder() .appName("test") .master("local") …

VB_ • 45,112 • 42 • 145 • 293
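A sketch of pinning the timezone explicitly when reading such a file (PySpark rather than the question's Java, format string illustrative); in Spark 2.3.x some code paths still consulted the JVM default timezone, so setting `-Duser.timezone=UTC` on driver and executors is the usual belt-and-braces addition:

```python
def read_json_utc(spark, path):
    """Read JSON timestamps with an explicit session timezone.

    Pins spark.sql.session.timeZone and the reader's timestampFormat so
    parsing does not silently depend on the machine's local timezone.
    """
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    return (spark.read
            .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS")
            .json(path))
```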
1 vote • 1 answer

Spark 2.3 stream-stream join loses left table key

I'm trying to implement a stream-stream join toy example with Spark 2.3.0. The stream joins work fine when the condition matches, but lose the left stream value when the condition does not match, even when using leftOuterJoin. Thanks in advance. Here is my source code…
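In Spark 2.3 a stream-stream left outer join only emits unmatched left rows if both sides have watermarks and the join condition bounds event time; otherwise those rows look "lost". A minimal sketch with hypothetical column names and intervals:

```python
def stream_left_join(left, right):
    """Stream-stream left outer join that can actually emit unmatched rows.

    Both inputs need watermarks, and the condition needs a time range, so
    the engine knows when an unmatched left row can never match and may be
    emitted (only after the watermark passes).
    """
    from pyspark.sql import functions as F  # requires pyspark

    l = left.withWatermark("leftTime", "10 minutes")
    r = right.withWatermark("rightTime", "10 minutes")
    cond = ((F.col("leftKey") == F.col("rightKey"))
            & (F.col("rightTime") >= F.col("leftTime"))
            & (F.col("rightTime") <= F.col("leftTime")
               + F.expr("interval 5 minutes")))
    return l.join(r, cond, "leftOuter")
```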
1 vote • 2 answers

Write PySpark DataFrame to CSV without outer quotes

I have a DataFrame with a single column, as below, and I am using PySpark 2.3 to write it to CSV. 18391860-bb33-11e6-a12d-0050569d8a5c,48,24,44,31,47,162,227,0,37,30,28 18391310-bc74-11e5-9049-005056b996a7,37,0,48,25,72,28,24,44,31,52,27,30,4 In…

kavya • 75 • 1 • 10
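The CSV writer quotes any value that contains its delimiter, and each row here is one comma-bearing string; a sketch of the simplest workaround is to bypass the CSV writer entirely:

```python
def write_unquoted(df, path):
    """Write a single-string-column DataFrame without surrounding quotes.

    .text() emits the column's value verbatim, one row per line, so the
    embedded commas never trigger CSV quoting.
    """
    df.coalesce(1).write.text(path)  # df must have exactly one string column
```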
1 vote • 0 answers

Should I enable ShuffleHashJoin when the left side is large (~1B records) with a power-law distribution and the right side is small (but > 2GB)?

I have a dataset that is very large: 350 million to 1 billion records, depending on the batch. On the right side I have a much smaller dataset, usually around 10 million records, not more. I cannot simply broadcast the right side (sometimes it grows…

milos • 261 • 1 • 6
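A sketch of the conf knobs involved, assuming the goal is to let the planner pick shuffle hash join while keeping broadcast off for the >2GB side:

```python
SHJ_CONF = {
    # Let the planner consider shuffle hash join instead of sort-merge.
    "spark.sql.join.preferSortMergeJoin": "false",
    # Disable broadcast so the >2GB right side is never forced onto executors.
    "spark.sql.autoBroadcastJoinThreshold": "-1",
}

def apply_shj(spark):
    """Nudge Spark toward a shuffle hash join.

    Caveat: shuffle hash join must still build each right-side *partition*
    in memory, so with power-law skew on the keys, raising
    spark.sql.shuffle.partitions (or salting hot keys) matters as much.
    """
    for key, value in SHJ_CONF.items():
        spark.conf.set(key, value)
```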
1 vote • 2 answers

SparkSubmitOperator vs SSHOperator for submitting PySpark applications in Airflow

I have Spark and Airflow on different servers, and I don't have the Spark binaries on the Airflow servers. I am able to use SSHOperator and run the Spark jobs in cluster mode perfectly well. I would like to know which would be better to use, SSHOperator or…

kavya • 75 • 1 • 10
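The key constraint: SparkSubmitOperator shells out to `spark-submit` on the Airflow worker itself, so without Spark binaries there, SSHOperator (or Livy) is the pragmatic choice. A sketch of the operator for comparison, with placeholder paths and the Airflow 1.x import path:

```python
def make_spark_task(dag):
    """SparkSubmitOperator task - only works if the Airflow worker has
    spark-submit installed and a configured Spark connection."""
    from airflow.contrib.operators.spark_submit_operator import (
        SparkSubmitOperator,
    )

    return SparkSubmitOperator(
        task_id="submit_job",
        application="/path/to/app.py",  # placeholder application path
        conn_id="spark_default",        # Spark connection on the worker
        dag=dag,
    )
```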
1 vote • 2 answers

Quotes not displayed in CSV output file

Initial data is in a Dataset and I am trying to write to a CSV file with each cell value placed in quotes. result.coalesce(1).write() .option("delimiter", "|") .option("header", "true") .option("nullValue",…

Ram Grandhi • 397 • 10 • 27