Questions tagged [apache-spark-2.3]
39 questions
8
votes
2 answers
When is it not practical, performance-wise, to use persist() on a Spark DataFrame?
While working on improving code performance, as I had many jobs fail (aborted), I thought about using the persist() function on a Spark DataFrame whenever I need to reuse that same DataFrame in many other operations. When doing it and following the jobs,…

SarahData
- 769
- 1
- 12
- 38
7
votes
3 answers
Writing CSV file using Spark and java - handling empty values and quotes
Initial data is in a Dataset and I am trying to write it to a pipe-delimited file. I want each non-empty, non-null value to be placed in quotes; empty or null values should not be quoted.
result.coalesce(1).write()
…

Ram Grandhi
- 397
- 10
- 27
6
votes
2 answers
Convert pyspark dataframe to pandas dataframe
I have a pyspark dataframe with dimensions (28002528, 21) and tried to convert it to a pandas dataframe using the following line of code:
pd_df=spark_df.toPandas()
I got this error:
Py4JJavaError: An error occurred while calling…

Ahmad Senousi
- 613
- 2
- 12
- 24
4
votes
2 answers
Spark (2.3) not able to identify new columns in Parquet table added via Hive ALTER TABLE command
I have a Hive Parquet table which I am creating using the Spark 2.3 API df.saveAsTable. There is a separate Hive process that alters the same Parquet table to add columns (based on requirements).
However, next time when I try to read the same parquet…

user2717470
- 257
- 1
- 5
- 15
3
votes
1 answer
Writing DataFrame as parquet creates empty files
I am trying to do some performance optimization for a Spark job using the bucketing technique. I read .parquet and .csv files and do some transformations, then bucket and join two DataFrames. Then I write the joined DF to Parquet but…

Niko
- 373
- 3
- 13
2
votes
1 answer
Airflow: Use LivyBatchOperator for submitting pyspark applications in yarn
I have encountered something called LivyBatchOperator but am unable to find a good example of using it to submit pyspark applications in Airflow. Any info on this would be appreciated. Thanks in advance.

kavya
- 75
- 1
- 10
2
votes
0 answers
Apache Spark not connecting to Hive meta store (Database not found)
I have Java Spark code where I'm trying to connect to a Hive database, but it sees only the default database and gives me NoSuchDatabaseException. I tried the following to set the Hive metastore:
Add Spark conf in code with Hive Metastore URI
Add Spark…

Gowtham
- 87
- 1
- 14
2
votes
2 answers
Optimizing reading data to spark from Azure blob
We have data for a table residing in Azure Blob storage, which acts as a data lake. Data is ingested every 30 minutes, forming time partitions as below in…

Kedar
- 63
- 4
2
votes
1 answer
Pyspark self-join with error "Resolved attribute(s) missing"
While doing a pyspark dataframe self-join I got an error message:
Py4JJavaError: An error occurred while calling o1595.join.
: org.apache.spark.sql.AnalysisException: Resolved attribute(s) un_val#5997 missing from day#290,item_listed#281,filename#286…

Maviles
- 3,209
- 2
- 25
- 39
2
votes
0 answers
Spark spark.sql.session.timeZone doesn't work with JSON source
Does Spark v2.3.1 depend on the local time zone when reading from a JSON file?
My src/test/resources/data/tmp.json:
[
{
"timestamp": "1970-01-01 00:00:00.000"
}
]
and Spark code:
SparkSession.builder()
.appName("test")
.master("local")
…

VB_
- 45,112
- 42
- 145
- 293
1
vote
1 answer
Spark 2.3 stream-stream join loses left table key
I'm trying to implement a toy stream-stream join with Spark 2.3.0.
The join works fine when the condition matches, but the left stream's value is lost when the condition does not match, even when using leftOuterJoin.
Thanks in advance.
Here is my source code…

Xu Yan
- 13
- 4
1
vote
2 answers
Write pyspark dataframe to CSV without outer quotes
I have a dataframe with a single column, as below, and am using pyspark version 2.3 to write it to CSV.
18391860-bb33-11e6-a12d-0050569d8a5c,48,24,44,31,47,162,227,0,37,30,28
18391310-bc74-11e5-9049-005056b996a7,37,0,48,25,72,28,24,44,31,52,27,30,4
In…

kavya
- 75
- 1
- 10
1
vote
0 answers
Should I enable shuffle hash join when the left side is large (~1B records, power-law distributed) and the right side is small (but > 2GB)?
I have a dataset that is very large: 350 million to 1 billion records, depending on the batch.
On the right side I have a much smaller dataset, usually around 10 million records, not more.
I cannot simply broadcast the right side (sometimes it grows…

milos
- 261
- 1
- 6
1
vote
2 answers
SparkSubmitOperator vs SSHOperator for submitting pyspark applications in airflow
My Spark and Airflow servers are separate, and I don't have the Spark binaries on the Airflow servers. I am able to use SSHOperator and run the Spark jobs in cluster mode perfectly well. I would like to know which would be better, using either SSHOperator or…

kavya
- 75
- 1
- 10
1
vote
2 answers
Quotes not displayed in CSV output file
Initial data is in a Dataset and I am trying to write to a CSV file with each cell value placed in quotes.
result.coalesce(1).write()
.option("delimiter", "|")
.option("header", "true")
.option("nullValue",…

Ram Grandhi
- 397
- 10
- 27