Questions tagged [spark2.4.4]

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited for interactive as well as iterative algorithms in machine learning or graph computing.
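A minimal sketch of the load-once, query-repeatedly pattern described above (the file path and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load once and keep the data in cluster memory; "events.parquet" and the
# "status" column are placeholders for illustration.
events = spark.read.parquet("events.parquet").cache()

events.filter(events.status == "ERROR").count()  # first action materializes the cache
events.groupBy("status").count().show()          # later queries reuse the in-memory data
```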

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).
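A short Structured Streaming sketch of the micro-batch and windowing approaches mentioned above; the built-in rate source stands in for a real source such as Kafka:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# The "rate" source just generates (timestamp, value) rows for demonstration.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Windowed aggregation over the stream; micro-batch execution is the default,
# and on 2.3+ a continuous trigger can be used instead for supported queries.
counts = stream.groupBy(window("timestamp", "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```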

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (a.k.a. MVCE), when applicable. You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Latest stable version:

Recommended reference sources:

17 questions
2
votes
1 answer

How to set Spark timeout (application killing itself)

I want to add a security measure to my Spark jobs: if they don't finish after X hours, they kill themselves (using Spark 2.4.3 in cluster mode on YARN). I didn't find any configuration in Spark that helps me with what I wanted. I tried to do it this…
Ilya Brodezki
  • 336
  • 2
  • 15
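As far as I know there is no single built-in "kill after X hours" setting for a batch job, so here is a hedged sketch of one driver-side workaround for the question above: a daemon timer that stops the application once a deadline passes (the 4-hour budget is a placeholder):

```python
import threading
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timeout-demo").getOrCreate()

MAX_RUNTIME_SECONDS = 4 * 3600  # placeholder deadline

def _kill_after_deadline():
    # Cancel running jobs and stop the context; on YARN the driver then exits
    # and the application is reported as finished.
    spark.sparkContext.cancelAllJobs()
    spark.stop()

timer = threading.Timer(MAX_RUNTIME_SECONDS, _kill_after_deadline)
timer.daemon = True
timer.start()

# ... the actual job runs here ...

timer.cancel()  # reached only if the job finished before the deadline
```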
1
vote
1 answer

Missing methods in PySpark 2.4's pyspark.sql.functions but they still work in the local environment

I'm using PySpark 2.4 and I noticed that the pyspark.sql.functions module is missing some methods like trim and col. In PyCharm, it shows as undefined. However, I have written some tasks using these methods and they run correctly in the local…
Simon Mau
  • 11
  • 1
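For the question above: in PySpark 2.4 many entries in pyspark.sql.functions (including col and trim) are generated dynamically when the module is imported rather than written as plain function definitions, which is why an IDE's static analysis can flag them even though they resolve fine at runtime. A small sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("functions-demo").getOrCreate()

df = spark.createDataFrame([("  hello ",)], ["raw"])

# F.col and F.trim work at runtime even if PyCharm marks them as undefined,
# because they are created dynamically at import time.
df.select(F.trim(F.col("raw")).alias("clean")).show()
```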
1
vote
1 answer

Extension of compressed parquet file in Spark

In my Spark job, I write a compressed parquet file like this: df.repartition(numberOutputFiles).write.option("compression","gzip").mode(saveMode).parquet(avroPath) Then, my files have this extension: file_name.gz.parquet. How can I…
Marwan02
  • 45
  • 6
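For the question above, a hedged note: the Parquet writer appends the codec name to the file name, which is where the .gz.parquet suffix comes from; both ways of requesting compression below produce it. Here df, numberOutputFiles and avroPath are taken from the question, and `spark` is an existing SparkSession:

```python
# Session-wide default codec (equivalent to the per-write option used in the question).
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

(df.repartition(numberOutputFiles)
   .write
   .option("compression", "gzip")   # per-write codec, same effect
   .mode("overwrite")
   .parquet(avroPath))              # output files are named part-*.gz.parquet
```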
1
vote
0 answers

Can we set up both Spark 2.4 and Spark 3.0 in a single system?

I have a Spark 2.4 installation on my Windows machine. This is required as my production env. uses Spark 2.4. Now, I want to test Spark 3.0 features also. So can I install the Spark 3.0 binaries on the same Windows machine without disturbing the Spark 2.4 installation…
HimanshuSPaul
  • 278
  • 1
  • 4
  • 19
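A hedged sketch for the question above, assuming the two distributions are simply unpacked into separate folders and each Python process picks one by pointing SPARK_HOME before importing pyspark (here via the third-party findspark helper; all paths are placeholders):

```python
import findspark  # third-party helper: pip install findspark

# Placeholder install locations; each Spark version lives in its own folder.
SPARK_24_HOME = r"C:\spark\spark-2.4.4-bin-hadoop2.7"
SPARK_30_HOME = r"C:\spark\spark-3.0.3-bin-hadoop2.7"

# Point this process at the 3.0 install; other processes (and the production
# jobs) can keep using the 2.4 install untouched.
findspark.init(SPARK_30_HOME)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("spark3-test").getOrCreate()
print(spark.version)
```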
1
vote
1 answer

Spark2.4 Unable to overwrite table from same table

I am trying to insert data into a table using an insert overwrite statement, but I am getting the below error: org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also being read from.; The command is as below: spark.sql("INSERT OVERWRITE…
Ratan
  • 65
  • 8
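Spark raises this AnalysisException when the plan behind the overwrite still reads from the table being replaced. A hedged sketch of one workaround for the question above: materialize the result first so the final write no longer reads from its own target (table names are placeholders, and `spark` is an existing SparkSession):

```python
# Placeholder table name; the result is built from the same table it will overwrite.
result = spark.sql("SELECT * FROM db.target_table WHERE load_date = '2020-01-01'")

# Option A: truncate the plan so the overwrite no longer references the source path.
result = result.localCheckpoint()
result.write.insertInto("db.target_table", overwrite=True)

# Option B (alternative): stage into a temporary table, then overwrite from the stage.
# result.write.mode("overwrite").saveAsTable("db.target_table_stage")
# spark.table("db.target_table_stage").write.insertInto("db.target_table", overwrite=True)
```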
1
vote
1 answer

Change Spark's _temporary directory path to avoid deletion of parquet files

When two or more Spark jobs have the same output directory, mutual deletion of files will be inevitable. I'm writing a dataframe in append mode with Spark 2.4.4 and I want to add a timestamp to Spark's tmp dir to avoid these deletions.…
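As far as I know the file output committer always stages under &lt;output&gt;/_temporary, so concurrent jobs appending to the same directory can clobber each other's staging files. A hedged sketch of one workaround for the question above: write to a private, timestamped staging directory and then move the part files into the shared target (all paths are placeholders; `spark` and `df` come from the job):

```python
import time

final_dir = "hdfs:///data/events"                                          # shared target (placeholder)
staging_dir = "hdfs:///data/_staging/events_{}".format(int(time.time()))   # private per-job dir

df.write.mode("append").parquet(staging_dir)

# Move the finished part files into the shared directory via the Hadoop FileSystem API.
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
fs.mkdirs(hadoop.fs.Path(final_dir))
for status in fs.listStatus(hadoop.fs.Path(staging_dir)):
    name = status.getPath().getName()
    if name.startswith("part-"):
        fs.rename(status.getPath(), hadoop.fs.Path(final_dir + "/" + name))
fs.delete(hadoop.fs.Path(staging_dir), True)
```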
1
vote
1 answer

Output Spark application name in driver log

I need to output the Spark application name (spark.app.name) in each line of the driver log (along with other attributes like message and date). So far I failed to find the correct log4j configuration or any other hints. How could it be done? I…
Valentina
  • 518
  • 7
  • 18
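Not the log4j-configuration-only answer the question above asks for, but a hedged driver-side workaround: log through the JVM's log4j with the application name as the logger name, so the standard %c conversion pattern prints it on every such line (`spark` is an existing SparkSession):

```python
app_name = spark.sparkContext.appName

# Use the app name as the logger name; the default ConversionPattern's %c then
# shows it alongside the date and message on each line.
log4j = spark._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger(app_name)

logger.info("starting ingestion step")
logger.warn("this line carries the application name via the logger name")
```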
1
vote
1 answer

UDFs with Dictionaries on Spark 2.4

I am using Pyspark 2.4.4, and I need to use a UDF to create my desired output. This UDF uses a broadcasted dictionary. First, it looks like I need to modify the code for the UDF to accept the dictionary. Second, I am not sure that what I am doing…
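A hedged sketch for the question above: the UDF does not need to accept the dictionary as an argument; it can capture the broadcast variable in its closure and read .value on the executors (the lookup table and column names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("broadcast-udf-demo").getOrCreate()

# Placeholder lookup table, broadcast once to all executors.
country_names = {"US": "United States", "DE": "Germany"}
bc_countries = spark.sparkContext.broadcast(country_names)

@udf(returnType=StringType())
def to_country_name(code):
    # The broadcast variable is captured by the closure; only .value is read here.
    return bc_countries.value.get(code, "unknown")

df = spark.createDataFrame([("US",), ("FR",)], ["code"])
df.select("code", to_country_name("code").alias("country")).show()
```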
0
votes
0 answers

Spark 3.3.1 picking up current date automatically in data frame if date is missing from given timestamp and not marking it as _corrupt_record

I am using Spark 3.3.1 to read an input CSV file having the below header and value: ID, CREATE_DATE and 1, 14:42:23.0. I'm passing only the time (HH:MM:SS.SSS) whereas the date (YYYY-MM-DD) is missing in the CREATE_DATE field, and reading the CREATE_DATE field as…
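A hedged sketch of one way to make such rows surface for the question above, assuming the goal is to flag time-only values rather than let them inherit the current date: pin a timestampFormat that requires the date part and keep a corrupt-record column (the file name is a placeholder, `spark` is an existing SparkSession, and this is only one possible approach):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType, StringType

# Explicit schema plus a corrupt-record column; the pinned timestampFormat
# requires a date part, so a bare "14:42:23.0" no longer parses silently.
schema = StructType([
    StructField("ID", IntegerType()),
    StructField("CREATE_DATE", TimestampType()),
    StructField("_corrupt_record", StringType()),
])

df = (spark.read
      .option("header", True)
      .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("input.csv"))            # placeholder path

df.show(truncate=False)             # malformed rows land in _corrupt_record
```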
0
votes
1 answer

PySpark: split the file while writing with a specific size limit

I'm looking at a specific size limit (4 GB) to be passed while writing the dataframe to CSV in PySpark. I have already tried using maxPartitionBytes, but it is not working as expected. Below is the one I have used and tested on a 90 GB table from…
Vikas T
  • 9
  • 2
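For the question above, a hedged note: maxPartitionBytes controls how input files are split for reading, not the size of output files. One sketch of approximating a per-file size cap is to estimate the total size and repartition before the write (the sizes and path are placeholders; `df` is the dataframe from the question):

```python
import math

TARGET_BYTES = 4 * 1024 ** 3   # desired ~4 GB per output file
TOTAL_BYTES = 90 * 1024 ** 3   # rough estimate for the 90 GB table; in practice this
                               # could come from table statistics or HDFS file sizes

num_files = max(1, math.ceil(TOTAL_BYTES / TARGET_BYTES))

(df.repartition(num_files)     # roughly one ~4 GB file per task
   .write
   .option("header", True)
   .mode("overwrite")
   .csv("output_path"))        # placeholder path

# Alternative: cap rows per file if an average row size is known.
# df.write.option("maxRecordsPerFile", 1000000).csv("output_path")
```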
0
votes
1 answer

Hive Beeline and Spark load counts don't match for Hive tables

I am using Spark 2.4.4 and Hive 2.3... Using Spark, I am loading a dataframe as a Hive table using DF.insertInto(hiveTable), whether a new table is created during the run (of course before insertInto, through spark.sql) or existing tables were created by Spark 2.4.4 -…
VimalK
  • 65
  • 1
  • 8
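A hedged first check for the question above: count differences between Beeline and Spark after writes often come down to stale metadata (Hive answering COUNT(*) from statistics, or Spark caching file listings), so refreshing both sides is worth trying before digging deeper (the table name is a placeholder, `spark` is an existing SparkSession):

```python
# Refresh Spark's cached metadata/file listing for the table after the insert ...
spark.catalog.refreshTable("db.target_table")

# ... and recompute Hive statistics so Beeline's COUNT(*) is not answered from
# stale stats (relevant when hive.compute.query.using.stats is enabled).
spark.sql("ANALYZE TABLE db.target_table COMPUTE STATISTICS")

print(spark.table("db.target_table").count())
```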
0
votes
0 answers

Specific Spark write operation gradually increases with time in streaming application

I have a long-running Spark streaming job. The execution time is gradually, linearly increasing, and within 60 minutes the processing goes from 40 seconds to 90 seconds. This increase happens at an HDFS write statement: def write_checkpoint(self, df,…
ponthu
  • 311
  • 1
  • 3
  • 14
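A hedged sketch for the question above, assuming the creeping batch time comes from a query plan/lineage that keeps growing across batches (one common cause); truncating the lineage before the write is one thing to try. The path parameter and the Parquet format are assumptions, since the original method body is truncated:

```python
def write_checkpoint(self, df, path):
    # Truncate the lineage so the plan does not keep growing from batch to batch.
    df = df.localCheckpoint()

    # The write itself stays as before; format and path are placeholders here.
    df.write.mode("overwrite").parquet(path)
    return df
```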
0
votes
0 answers

Convert Spark 2.2's UDAF to a 3.0 Aggregator

I have an already written UDAF in Scala using Spark 2.4. Since our Databricks cluster was on the 6.4 runtime, whose support is no longer there, we need to move to 7.3 LTS, which has long-term support and uses Spark 3. UDAF is deprecated in Spark 3 and will…
0
votes
1 answer

spark not downloading hive_metastore jars

Environment: I am using Spark v2.4.4 via the Python API. Problem: According to the Spark documentation I can force Spark to download all the Hive jars for interacting with my hive_metastore by setting the following…
Arran Duff
  • 1,214
  • 2
  • 11
  • 23
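The configurations the question above refers to are spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars; a hedged sketch of setting them from the Python API, noting they must be in place before the first Hive-enabled session is created (the metastore version is a placeholder):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-metastore-demo")
         .config("spark.sql.hive.metastore.version", "2.3.0")  # placeholder version
         .config("spark.sql.hive.metastore.jars", "maven")     # download matching jars from Maven
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()
```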
0
votes
1 answer

Reading HDFS small size partitions?

Our data loads into HDFS daily, with date as the partition column. The issue is that each partition has small files, less than 50 MB in size. So reading the data from all these partitions to load the data into the next table takes hours. How can we address this…
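A hedged sketch of two common levers for the question above: pack many small files into fewer read tasks, and compact while writing so the next table does not inherit the problem (paths, sizes, and the target file count are placeholders; `spark` is an existing SparkSession):

```python
# Pack several small files into each read task instead of one task per tiny file.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)  # 256 MB splits
spark.conf.set("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)      # cheapen extra file opens

df = spark.read.parquet("hdfs:///warehouse/source_table")               # placeholder path

# Compact on write so the next table is made of a reasonable number of files.
(df.coalesce(200)                                                        # placeholder file count
   .write
   .mode("overwrite")
   .parquet("hdfs:///warehouse/next_table"))                             # placeholder path
```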