Questions tagged [spark2.4.4]

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited for interactive as well as iterative algorithms in machine learning or graph computing.
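A minimal sketch of the load-once, query-repeatedly pattern described above (the file path and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load once and keep the data in cluster memory; "events.parquet" and the
# "status" column are placeholders for illustration.
events = spark.read.parquet("events.parquet").cache()

events.filter(events.status == "ERROR").count()  # first action materializes the cache
events.groupBy("status").count().show()          # later queries reuse the in-memory data
```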

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).
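A short Structured Streaming sketch of the micro-batch and windowing approaches mentioned above; the built-in rate source stands in for a real source such as Kafka:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# The "rate" source just generates (timestamp, value) rows for demonstration.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Windowed aggregation over the stream; micro-batch execution is the default,
# and on 2.3+ a continuous trigger can be used instead for supported queries.
counts = stream.groupBy(window("timestamp", "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```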

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (a.k.a. MVCE), when applicable. You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Latest stable version:

Recommended reference sources:

17 questions
2
votes
1 answer

How to set Spark timeout (application killing itself)

I want to add a security measure to my Spark jobs: if they don't finish after X hours, they kill themselves (using Spark 2.4.3 in cluster mode on YARN). I didn't find any configuration in Spark that helps me with what I wanted. I tried to do it this…
Ilya Brodezki
  • 336
  • 2
  • 15
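As far as I know there is no single built-in "kill after X hours" setting for a batch job, so here is a hedged sketch of one driver-side workaround for the question above: a daemon timer that stops the application once a deadline passes (the 4-hour budget is a placeholder):

```python
import threading
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timeout-demo").getOrCreate()

MAX_RUNTIME_SECONDS = 4 * 3600  # placeholder deadline

def _kill_after_deadline():
    # Cancel running jobs and stop the context; on YARN the driver then exits
    # and the application is reported as finished.
    spark.sparkContext.cancelAllJobs()
    spark.stop()

timer = threading.Timer(MAX_RUNTIME_SECONDS, _kill_after_deadline)
timer.daemon = True
timer.start()

# ... the actual job runs here ...

timer.cancel()  # reached only if the job finished before the deadline
```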
1
vote
1 answer

Missing methods in PySpark 2.4's pyspark.sql.functions but they still work in the local environment

I'm using PySpark 2.4 and I noticed that the pyspark.sql.functions module is missing some methods like trim and col. In PyCharm, it shows as undefined. However, I have written some tasks using these methods and they run correctly in the local…
Simon Mau
  • 11
  • 1
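For the question above: in PySpark 2.4 many entries in pyspark.sql.functions (including col and trim) are generated dynamically when the module is imported rather than written as plain function definitions, which is why an IDE's static analysis can flag them even though they resolve fine at runtime. A small sketch:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("functions-demo").getOrCreate()

df = spark.createDataFrame([("  hello ",)], ["raw"])

# F.col and F.trim work at runtime even if PyCharm marks them as undefined,
# because they are created dynamically at import time.
df.select(F.trim(F.col("raw")).alias("clean")).show()
```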
1
vote
1 answer

Extension of compressed parquet file in Spark

In my Spark job, I write a compressed parquet file like this: df.repartition(numberOutputFiles).write.option("compression","gzip").mode(saveMode).parquet(avroPath) Then, my files have this extension: file_name.gz.parquet. How can I…
Marwan02
  • 45
  • 6
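For the question above, a hedged note: the Parquet writer appends the codec name to the file name, which is where the .gz.parquet suffix comes from; both ways of requesting compression below produce it. Here df, numberOutputFiles and avroPath are taken from the question, and `spark` is an existing SparkSession:

```python
# Session-wide default codec (equivalent to the per-write option used in the question).
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

(df.repartition(numberOutputFiles)
   .write
   .option("compression", "gzip")   # per-write codec, same effect
   .mode("overwrite")
   .parquet(avroPath))              # output files are named part-*.gz.parquet
```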
1
vote
0 answers

Can we set up both Spark 2.4 and Spark 3.0 in a single system?

I have a Spark 2.4 installation on my Windows machine. This is required as my production env. uses Spark 2.4. Now, I want to test Spark 3.0 features also. So can I install the Spark 3.0 binaries on the same Windows machine without disturbing the Spark 2.4 installation…
HimanshuSPaul
  • 278
  • 1
  • 4
  • 19
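A hedged sketch for the question above, assuming the two distributions are simply unpacked into separate folders and each Python process picks one by pointing SPARK_HOME before importing pyspark (here via the third-party findspark helper; all paths are placeholders):

```python
import findspark  # third-party helper: pip install findspark

# Placeholder install locations; each Spark version lives in its own folder.
SPARK_24_HOME = r"C:\spark\spark-2.4.4-bin-hadoop2.7"
SPARK_30_HOME = r"C:\spark\spark-3.0.3-bin-hadoop2.7"

# Point this process at the 3.0 install; other processes (and the production
# jobs) can keep using the 2.4 install untouched.
findspark.init(SPARK_30_HOME)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("spark3-test").getOrCreate()
print(spark.version)
```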
1
vote
1 answer

Spark2.4 Unable to overwrite table from same table

I am trying to insert data into a table using an insert overwrite statement, but I am getting the below error: org.apache.spark.sql.AnalysisException: Cannot overwrite a path that is also being read from.; The command is as below: spark.sql("INSERT OVERWRITE…
Ratan
  • 65
  • 8
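Spark raises this AnalysisException when the plan behind the overwrite still reads from the table being replaced. A hedged sketch of one workaround for the question above: materialize the result first so the final write no longer reads from its own target (table names are placeholders, and `spark` is an existing SparkSession):

```python
# Placeholder table name; the result is built from the same table it will overwrite.
result = spark.sql("SELECT * FROM db.target_table WHERE load_date = '2020-01-01'")

# Option A: truncate the plan so the overwrite no longer references the source path.
result = result.localCheckpoint()
result.write.insertInto("db.target_table", overwrite=True)

# Option B (alternative): stage into a temporary table, then overwrite from the stage.
# result.write.mode("overwrite").saveAsTable("db.target_table_stage")
# spark.table("db.target_table_stage").write.insertInto("db.target_table", overwrite=True)
```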
1
vote
1 answer

Change Spark's _temporary directory path to avoid deletion of parquet files

When two or more Spark jobs have the same output directory, mutual deletion of files will be inevitable. I'm writing a dataframe in append mode with Spark 2.4.4 and I want to add a timestamp to Spark's tmp dir to avoid these deletions.…
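As far as I know the file output committer always stages under &lt;output&gt;/_temporary, so concurrent jobs appending to the same directory can clobber each other's staging files. A hedged sketch of one workaround for the question above: write to a private, timestamped staging directory and then move the part files into the shared target (all paths are placeholders; `spark` and `df` come from the job):

```python
import time

final_dir = "hdfs:///data/events"                                          # shared target (placeholder)
staging_dir = "hdfs:///data/_staging/events_{}".format(int(time.time()))   # private per-job dir

df.write.mode("append").parquet(staging_dir)

# Move the finished part files into the shared directory via the Hadoop FileSystem API.
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
fs.mkdirs(hadoop.fs.Path(final_dir))
for status in fs.listStatus(hadoop.fs.Path(staging_dir)):
    name = status.getPath().getName()
    if name.startswith("part-"):
        fs.rename(status.getPath(), hadoop.fs.Path(final_dir + "/" + name))
fs.delete(hadoop.fs.Path(staging_dir), True)
```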
1
vote
1 answer

Output Spark application name in driver log

I need to output the Spark application name (spark.app.name) in each line of the driver log (along with other attributes like message and date). So far I failed to find the correct log4j configuration or any other hints. How could it be done? I…
Valentina
  • 518
  • 7
  • 18
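Not the log4j-configuration-only answer the question above asks for, but a hedged driver-side workaround: log through the JVM's log4j with the application name as the logger name, so the standard %c conversion pattern prints it on every such line (`spark` is an existing SparkSession):

```python
app_name = spark.sparkContext.appName

# Use the app name as the logger name; the default ConversionPattern's %c then
# shows it alongside the date and message on each line.
log4j = spark._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger(app_name)

logger.info("starting ingestion step")
logger.warn("this line carries the application name via the logger name")
```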
1
vote
1 answer

UDFs with Dictionaries on Spark 2.4

I am using Pyspark 2.4.4, and I need to use a UDF to create my desired output. This UDF uses a broadcasted dictionary. First, it looks like I need to modify the code for the UDF to accept the dictionary. Second, I am not sure that what I am doing…
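A hedged sketch for the question above: the UDF does not need to accept the dictionary as an argument; it can capture the broadcast variable in its closure and read .value on the executors (the lookup table and column names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("broadcast-udf-demo").getOrCreate()

# Placeholder lookup table, broadcast once to all executors.
country_names = {"US": "United States", "DE": "Germany"}
bc_countries = spark.sparkContext.broadcast(country_names)

@udf(returnType=StringType())
def to_country_name(code):
    # The broadcast variable is captured by the closure; only .value is read here.
    return bc_countries.value.get(code, "unknown")

df = spark.createDataFrame([("US",), ("FR",)], ["code"])
df.select("code", to_country_name("code").alias("country")).show()
```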
0
votes
0 answers

Spark 3.3.1 picking up current date automatically in data frame if date is missing from given timestamp and not marking it as _corrupt_record

I am using Spark 3.3.1 to read an input CSV file having the below header and value: ID, CREATE_DATE and 1, 14:42:23.0. I'm passing only the time (HH:MM:SS.SSS) whereas the date (YYYY-MM-DD) is missing in the CREATE_DATE field, and reading the CREATE_DATE field as…
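A hedged sketch of one way to make such rows surface for the question above, assuming the goal is to flag time-only values rather than let them inherit the current date: pin a timestampFormat that requires the date part and keep a corrupt-record column (the file name is a placeholder, `spark` is an existing SparkSession, and this is only one possible approach):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType, StringType

# Explicit schema plus a corrupt-record column; the pinned timestampFormat
# requires a date part, so a bare "14:42:23.0" no longer parses silently.
schema = StructType([
    StructField("ID", IntegerType()),
    StructField("CREATE_DATE", TimestampType()),
    StructField("_corrupt_record", StringType()),
])

df = (spark.read
      .option("header", True)
      .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS")
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("input.csv"))            # placeholder path

df.show(truncate=False)             # malformed rows land in _corrupt_record
```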
0
votes
1 answer

PySpark: split the file while writing with a specific size limit

I'm looking at a specific size limit (4 GB) to be passed while writing the dataframe to CSV in PySpark. I have already tried using maxPartitionBytes, but it is not working as expected. Below is the one I have used and tested on a 90 GB table from…
Vikas T
  • 9
  • 2
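For the question above, a hedged note: maxPartitionBytes controls how input files are split for reading, not the size of output files. One sketch of approximating a per-file size cap is to estimate the total size and repartition before the write (the sizes and path are placeholders; `df` is the dataframe from the question):

```python
import math

TARGET_BYTES = 4 * 1024 ** 3   # desired ~4 GB per output file
TOTAL_BYTES = 90 * 1024 ** 3   # rough estimate for the 90 GB table; in practice this
                               # could come from table statistics or HDFS file sizes

num_files = max(1, math.ceil(TOTAL_BYTES / TARGET_BYTES))

(df.repartition(num_files)     # roughly one ~4 GB file per task
   .write
   .option("header", True)
   .mode("overwrite")
   .csv("output_path"))        # placeholder path

# Alternative: cap rows per file if an average row size is known.
# df.write.option("maxRecordsPerFile", 1000000).csv("output_path")
```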
0
votes
1 answer

Hive Beeline and Spark load counts don't match for Hive tables

I am using Spark 2.4.4 and Hive 2.3... Using Spark, I am loading a dataframe as a Hive table using DF.insertInto(hiveTable), whether a new table is created during the run (of course before insertInto, through spark.sql) or existing tables were created by Spark 2.4.4 -…
VimalK
  • 65
  • 1
  • 8
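A hedged first check for the question above: count differences between Beeline and Spark after writes often come down to stale metadata (Hive answering COUNT(*) from statistics, or Spark caching file listings), so refreshing both sides is worth trying before digging deeper (the table name is a placeholder, `spark` is an existing SparkSession):

```python
# Refresh Spark's cached metadata/file listing for the table after the insert ...
spark.catalog.refreshTable("db.target_table")

# ... and recompute Hive statistics so Beeline's COUNT(*) is not answered from
# stale stats (relevant when hive.compute.query.using.stats is enabled).
spark.sql("ANALYZE TABLE db.target_table COMPUTE STATISTICS")

print(spark.table("db.target_table").count())
```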
0
votes
0 answers

Specific Spark write operation gradually increases with time in streaming application

I have a long-running Spark streaming job. The execution time is gradually, linearly increasing, and within 60 minutes the processing goes from 40 seconds to 90 seconds. This increase happens at an HDFS write statement: def write_checkpoint(self, df,…
ponthu
  • 311
  • 1
  • 3
  • 14
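A hedged sketch for the question above, assuming the creeping batch time comes from a query plan/lineage that keeps growing across batches (one common cause); truncating the lineage before the write is one thing to try. The path parameter and the Parquet format are assumptions, since the original method body is truncated:

```python
def write_checkpoint(self, df, path):
    # Truncate the lineage so the plan does not keep growing from batch to batch.
    df = df.localCheckpoint()

    # The write itself stays as before; format and path are placeholders here.
    df.write.mode("overwrite").parquet(path)
    return df
```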
0
votes
0 answers

Convert Spark 2.2's UDAF to a 3.0 Aggregator

I have an already written UDAF in Scala using Spark 2.4. Since our Databricks cluster was on the 6.4 runtime, whose support is no longer there, we need to move to 7.3 LTS, which has long-term support and uses Spark 3. UDAF is deprecated in Spark 3 and will…
0
votes
1 answer

spark not downloading hive_metastore jars

Environment: I am using Spark v2.4.4 via the Python API. Problem: According to the Spark documentation I can force Spark to download all the Hive jars for interacting with my hive_metastore by setting the following…
Arran Duff
  • 1,214
  • 2
  • 11
  • 23
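The configurations the question above refers to are spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars; a hedged sketch of setting them from the Python API, noting they must be in place before the first Hive-enabled session is created (the metastore version is a placeholder):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-metastore-demo")
         .config("spark.sql.hive.metastore.version", "2.3.0")  # placeholder version
         .config("spark.sql.hive.metastore.jars", "maven")     # download matching jars from Maven
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()
```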
0
votes
1 answer

Reading HDFS small size partitions?

Our data loads into HDFS daily, with date as the partition column. The issue is that each partition has small files, less than 50 MB in size. So reading the data from all these partitions to load the data into the next table takes hours. How can we address this…
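A hedged sketch of two common levers for the question above: pack many small files into fewer read tasks, and compact while writing so the next table does not inherit the problem (paths, sizes, and the target file count are placeholders; `spark` is an existing SparkSession):

```python
# Pack several small files into each read task instead of one task per tiny file.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)  # 256 MB splits
spark.conf.set("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)      # cheapen extra file opens

df = spark.read.parquet("hdfs:///warehouse/source_table")               # placeholder path

# Compact on write so the next table is made of a reasonable number of files.
(df.coalesce(200)                                                        # placeholder file count
   .write
   .mode("overwrite")
   .parquet("hdfs:///warehouse/next_table"))                             # placeholder path
```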