Questions tagged [apache-spark-2.3]

39 questions
8 votes • 2 answers

When is it not practical, performance-wise, to use persist() on a Spark DataFrame?

While working on improving code performance, as I had many jobs fail (aborted), I thought about using the persist() function on a Spark DataFrame whenever I need to reuse that same DataFrame in many other operations. When doing it and following the jobs,…

SarahData • 769 • 1 • 12 • 38
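The trade-off the question asks about can be sketched as below (a minimal outline, assuming pyspark is available and `df` is a DataFrame reused by several actions; names are illustrative):

```python
def reuse_with_persist(df):
    """Cache a DataFrame that several actions reuse, then release it.

    persist() only pays off when the plan would otherwise be recomputed
    more than once; for a single action the extra memory/serialization
    cost can make jobs slower or even fail.
    """
    from pyspark import StorageLevel  # imported lazily; requires pyspark

    df.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk if it doesn't fit
    total = df.count()                # first action materializes the cache
    preview = df.limit(10).collect()  # later actions reuse the cached data
    df.unpersist()                    # free executor memory when done
    return total, preview
```

MEMORY_AND_DISK is usually the safer default here: MEMORY_ONLY silently drops partitions that don't fit, forcing recomputation.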
7 votes • 3 answers

Writing a CSV file using Spark and Java: handling empty values and quotes

Initial data is in a Dataset and I am trying to write to a pipe-delimited file, and I want each non-empty, non-null cell value to be placed in quotes. Empty or null values should not get quotes. result.coalesce(1).write() …

Ram Grandhi • 397 • 10 • 27
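The question is in Java, but the writer options are the same in PySpark; a minimal sketch of the closest built-in behaviour (quoting every written value) follows — selectively quoting only non-empty cells would need pre-formatting the columns instead:

```python
def write_piped_csv(df, path):
    """Write `df` pipe-delimited with every written value wrapped in quotes.

    quoteAll quotes each value the writer emits; nullValue controls what
    null cells become. The asked-for behaviour (quotes only on non-empty,
    non-null cells) is not a single option in Spark 2.3.
    """
    (df.coalesce(1)
       .write
       .option("delimiter", "|")
       .option("quoteAll", "true")  # quote every written value
       .option("nullValue", "")     # string used for null cells
       .option("header", "true")
       .csv(path))
```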
6 votes • 2 answers

Convert PySpark DataFrame to pandas DataFrame

I have a PySpark DataFrame whose dimensions are (28002528, 21) and I tried to convert it to a pandas DataFrame using the following line: pd_df = spark_df.toPandas() I got this error: Py4JJavaError: An error occurred while calling…

Ahmad Senousi • 613 • 2 • 12 • 24
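A sketch of the usual mitigations, assuming the Py4JJavaError is driver memory exhaustion (the common cause at ~28M rows); the conf name is the Spark 2.3-era one, and the sampling fraction is illustrative:

```python
def to_pandas_safely(spark, spark_df, fraction=0.01):
    """Convert a large PySpark DataFrame to pandas without OOM-ing the driver.

    toPandas() collects every row onto the driver. Arrow makes the transfer
    much cheaper; sampling keeps the result small. For the full dataset,
    spark.driver.memory must be large enough to hold it.
    """
    # Arrow-based columnar transfer (renamed in later Spark versions).
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    # Collect only a sample instead of all ~28M rows.
    return spark_df.sample(fraction=fraction, seed=42).toPandas()
```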
4 votes • 2 answers

Spark (2.3) not able to identify new columns in Parquet table added via Hive ALTER TABLE command

I have a Hive Parquet table which I am creating using the Spark 2.3 API df.saveAsTable. There is a separate Hive process that alters the same Parquet table to add columns (based on requirements). However, the next time I try to read the same Parquet…

user2717470 • 257 • 1 • 5 • 15
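A minimal sketch of the usual first step, assuming the issue is Spark's cached table metadata hiding columns added outside Spark (table name is a placeholder):

```python
def read_after_hive_alter(spark, table):
    """Pick up columns added to a Parquet table outside Spark.

    Spark caches table metadata and file listings, so columns added via
    Hive's ALTER TABLE stay invisible until the cached entry is refreshed.
    """
    spark.catalog.refreshTable(table)  # drop cached metadata for this table
    return spark.sql("SELECT * FROM {}".format(table))
```

If the files themselves carry differing schemas, reading the path directly with `.option("mergeSchema", "true").parquet(path)` is the other lever to try.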
3 votes • 1 answer

Writing DataFrame as Parquet creates empty files

I am trying to do some performance optimization for a Spark job using the bucketing technique. I am reading .parquet and .csv files and doing some transformations. After that I bucket and join two DataFrames. Then I write the joined DF to Parquet, but…
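A minimal sketch of a bucketed write, assuming a hypothetical join-key column `id`; empty part files usually come from empty shuffle partitions, which repartitioning on the bucket key avoids:

```python
def write_bucketed(df, table_name):
    """Write a DataFrame bucketed by a join key.

    Bucketed output must go through saveAsTable (a metastore table);
    a plain .parquet(path) write ignores bucketBy.
    """
    (df.repartition("id")       # align data with the bucket key first
       .write
       .bucketBy(8, "id")       # 8 buckets on the (hypothetical) key 'id'
       .sortBy("id")
       .mode("overwrite")
       .saveAsTable(table_name))
```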
2 votes • 1 answer

Airflow: use LivyBatchOperator for submitting PySpark applications in YARN

I have encountered something called LivyBatchOperator, but I am unable to find a good example of using it to submit PySpark applications in Airflow. Any info on this would be appreciated. Thanks in advance.

kavya • 75 • 1 • 10
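LivyBatchOperator comes from a third-party Airflow plugin, so rather than guess its exact signature, here is a sketch of the Livy REST call it wraps — a POST to the `/batches` endpoint; `livy_url` and `app_path` are placeholders:

```python
import json
from urllib import request

def build_batch_payload(app_path):
    """Body for Livy's POST /batches call (what a Livy batch operator sends)."""
    return {"file": app_path,  # e.g. an HDFS path to the .py application
            "conf": {"spark.submit.deployMode": "cluster"}}

def submit_pyspark_batch(livy_url, app_path):
    """Submit a PySpark application to YARN through the Livy REST API."""
    req = request.Request(
        "{}/batches".format(livy_url),
        data=json.dumps(build_batch_payload(app_path)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # Livy replies with the batch id/state
        return json.loads(resp.read())
```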
2 votes • 0 answers

Apache Spark not connecting to Hive metastore (database not found)

I have Java Spark code where I'm trying to connect to a Hive database, but it has only the default database and gives me a NoSuchDatabaseException. I tried the following to set the Hive metastore: add a Spark conf in code with the Hive metastore URI; add a Spark…

Gowtham • 87 • 1 • 14
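The symptom (only a `default` database) typically means Spark fell back to its built-in catalog instead of the external metastore. A minimal PySpark sketch of the session setup, with a placeholder metastore URI:

```python
def hive_session(metastore_uri):
    """Build a SparkSession that talks to an external Hive metastore.

    Without enableHiveSupport() and a reachable metastore URI, Spark uses
    its in-memory/Derby catalog, which only contains `default` - hence
    the NoSuchDatabaseException.
    """
    from pyspark.sql import SparkSession  # requires pyspark

    return (SparkSession.builder
            .appName("hive-test")
            .config("hive.metastore.uris", metastore_uri)  # e.g. thrift://host:9083
            .enableHiveSupport()
            .getOrCreate())
```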
2 votes • 2 answers

Optimizing reading data into Spark from Azure Blob Storage

We have data for a table residing in Azure Blob Storage, which acts as a data lake. Data is ingested every 30 minutes, forming time partitions as below in…
2 votes • 1 answer

PySpark self-join fails with "Resolved attribute(s) missing"

While doing a PySpark DataFrame self-join I got an error message: Py4JJavaError: An error occurred while calling o1595.join. : org.apache.spark.sql.AnalysisException: Resolved attribute(s) un_val#5997 missing from day#290,item_listed#281,filename#286…

Maviles • 3,209 • 2 • 25 • 39
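This AnalysisException usually appears because both sides of a self-join share the same attribute ids; aliasing each side gives the analyzer distinct ones. A minimal sketch with a hypothetical key column:

```python
def self_join(df, key):
    """Self-join a DataFrame without the 'Resolved attribute(s) missing' error.

    Joining a DataFrame with a derived copy of itself reuses identical
    column references; .alias() on each side disambiguates them.
    """
    from pyspark.sql import functions as F  # requires pyspark

    left = df.alias("l")
    right = df.alias("r")
    return left.join(right, F.col("l." + key) == F.col("r." + key))
```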
2 votes • 0 answers

Spark spark.sql.session.timeZone doesn't work with JSON source

Does Spark v2.3.1 depend on the local timezone when reading from a JSON file? My src/test/resources/data/tmp.json: [ { "timestamp": "1970-01-01 00:00:00.000" } ] and Spark code: SparkSession.builder() .appName("test") .master("local") …

VB_ • 45,112 • 42 • 145 • 293
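A sketch of pinning the timezone explicitly when reading such a file (PySpark rather than the question's Java, format string illustrative); in Spark 2.3.x some code paths still consulted the JVM default timezone, so setting `-Duser.timezone=UTC` on driver and executors is the usual belt-and-braces addition:

```python
def read_json_utc(spark, path):
    """Read JSON timestamps with an explicit session timezone.

    Pins spark.sql.session.timeZone and the reader's timestampFormat so
    parsing does not silently depend on the machine's local timezone.
    """
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    return (spark.read
            .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS")
            .json(path))
```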
1 vote • 1 answer

Spark 2.3 stream-stream join loses left table key

I'm trying to implement a stream-stream join toy example with Spark 2.3.0. The stream joins work fine when the condition matches, but lose the left stream value when the condition does not match, even when using leftOuterJoin. Thanks in advance. Here is my source code…
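In Spark 2.3 a stream-stream left outer join only emits unmatched left rows if both sides have watermarks and the join condition bounds event time; otherwise those rows look "lost". A minimal sketch with hypothetical column names and intervals:

```python
def stream_left_join(left, right):
    """Stream-stream left outer join that can actually emit unmatched rows.

    Both inputs need watermarks, and the condition needs a time range, so
    the engine knows when an unmatched left row can never match and may be
    emitted (only after the watermark passes).
    """
    from pyspark.sql import functions as F  # requires pyspark

    l = left.withWatermark("leftTime", "10 minutes")
    r = right.withWatermark("rightTime", "10 minutes")
    cond = ((F.col("leftKey") == F.col("rightKey"))
            & (F.col("rightTime") >= F.col("leftTime"))
            & (F.col("rightTime") <= F.col("leftTime")
               + F.expr("interval 5 minutes")))
    return l.join(r, cond, "leftOuter")
```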
1 vote • 2 answers

Write PySpark DataFrame to CSV without outer quotes

I have a DataFrame with a single column, as below, and I am using PySpark 2.3 to write it to CSV. 18391860-bb33-11e6-a12d-0050569d8a5c,48,24,44,31,47,162,227,0,37,30,28 18391310-bc74-11e5-9049-005056b996a7,37,0,48,25,72,28,24,44,31,52,27,30,4 In…

kavya • 75 • 1 • 10
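The CSV writer quotes any value that contains its delimiter, and each row here is one comma-bearing string; a sketch of the simplest workaround is to bypass the CSV writer entirely:

```python
def write_unquoted(df, path):
    """Write a single-string-column DataFrame without surrounding quotes.

    .text() emits the column's value verbatim, one row per line, so the
    embedded commas never trigger CSV quoting.
    """
    df.coalesce(1).write.text(path)  # df must have exactly one string column
```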
1 vote • 0 answers

Should I enable ShuffleHashJoin when the left side is large (~1B records) with a power-law distribution and the right side is small (but > 2GB)?

I have a dataset that is very large: 350 million to 1 billion records, depending on the batch. On the right side I have a much smaller dataset, usually around 10 million records, not more. I cannot simply broadcast the right side (sometimes it grows…

milos • 261 • 1 • 6
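A sketch of the conf knobs involved, assuming the goal is to let the planner pick shuffle hash join while keeping broadcast off for the >2GB side:

```python
SHJ_CONF = {
    # Let the planner consider shuffle hash join instead of sort-merge.
    "spark.sql.join.preferSortMergeJoin": "false",
    # Disable broadcast so the >2GB right side is never forced onto executors.
    "spark.sql.autoBroadcastJoinThreshold": "-1",
}

def apply_shj(spark):
    """Nudge Spark toward a shuffle hash join.

    Caveat: shuffle hash join must still build each right-side *partition*
    in memory, so with power-law skew on the keys, raising
    spark.sql.shuffle.partitions (or salting hot keys) matters as much.
    """
    for key, value in SHJ_CONF.items():
        spark.conf.set(key, value)
```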
1 vote • 2 answers

SparkSubmitOperator vs SSHOperator for submitting PySpark applications in Airflow

I have Spark and Airflow on different servers, and I don't have the Spark binaries on the Airflow servers. I am able to use SSHOperator and run the Spark jobs in cluster mode perfectly well. I would like to know which would be better to use, SSHOperator or…

kavya • 75 • 1 • 10
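The key constraint: SparkSubmitOperator shells out to `spark-submit` on the Airflow worker itself, so without Spark binaries there, SSHOperator (or Livy) is the pragmatic choice. A sketch of the operator for comparison, with placeholder paths and the Airflow 1.x import path:

```python
def make_spark_task(dag):
    """SparkSubmitOperator task - only works if the Airflow worker has
    spark-submit installed and a configured Spark connection."""
    from airflow.contrib.operators.spark_submit_operator import (
        SparkSubmitOperator,
    )

    return SparkSubmitOperator(
        task_id="submit_job",
        application="/path/to/app.py",  # placeholder application path
        conn_id="spark_default",        # Spark connection on the worker
        dag=dag,
    )
```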
1 vote • 2 answers

Quotes not displayed in CSV output file

Initial data is in a Dataset and I am trying to write to a CSV file with each cell value placed in quotes. result.coalesce(1).write() .option("delimiter", "|") .option("header", "true") .option("nullValue",…

Ram Grandhi • 397 • 10 • 27