Questions tagged [spark2.4.4]

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.
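As an illustration of this model, here is a minimal Scala sketch (the input file and its contents are hypothetical): transformations only build a lazy operator graph, and caching keeps the computed result in memory for later queries.

    import org.apache.spark.sql.SparkSession

    object RddSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // Hypothetical text input; any RDD behaves the same way.
        val lines = sc.textFile("access.log")

        // Transformations build a lazy operator graph; nothing runs yet.
        val errors = lines.filter(_.contains("ERROR"))

        errors.cache()                  // keep the result in memory once computed
        println(errors.count())         // first action executes the graph and fills the cache
        println(errors.filter(_.contains("timeout")).count()) // served from memory

        spark.stop()
      }
    }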

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allows user programs to load data into a cluster's memory and query it repeatedly, making it well suited for interactive as well as iterative algorithms in machine learning or graph computing.
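A classic example of such an iterative workload is PageRank-style graph computation. The Scala sketch below (with a tiny hard-coded graph, purely for illustration) caches the link structure so that every iteration reads it from memory instead of recomputing it:

    import org.apache.spark.sql.SparkSession

    object IterativeSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("iterative-sketch").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // Hypothetical link graph (page -> neighbour), purely for illustration.
        val links = sc.parallelize(Seq((1, 2), (2, 1), (2, 3), (3, 1)))
          .groupByKey()
          .cache() // reused on every iteration, so keep it in cluster memory

        var ranks = links.mapValues(_ => 1.0)

        // Each iteration re-reads the cached graph instead of recomputing it.
        for (_ <- 1 to 10) {
          val contribs = links.join(ranks).values.flatMap {
            case (neighbours, rank) => neighbours.map(dest => (dest, rank / neighbours.size))
          }
          ranks = contribs.reduceByKey(_ + _).mapValues(r => 0.15 + 0.85 * r)
        }

        ranks.collect().foreach(println)
        spark.stop()
      }
    }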

Spark can also be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).
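For instance, here is a minimal Structured Streaming word count in Scala, assuming a local socket source (e.g. one fed by nc -lk 9999); the default trigger gives micro-batch processing:

    import org.apache.spark.sql.SparkSession

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("streaming-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // Hypothetical socket source, e.g. fed by `nc -lk 9999`.
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()

        // A plain SQL-style aggregation over the unbounded stream.
        val counts = lines.as[String]
          .flatMap(_.split("\\s+"))
          .groupBy($"value")
          .count()

        // The default trigger runs this as a series of micro-batches;
        // "complete" mode re-emits the full aggregate each batch.
        val query = counts.writeStream
          .outputMode("complete")
          .format("console")
          .start()

        query.awaitTermination()
      }
    }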

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (a.k.a. MVCE) when applicable. You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.


17 questions
0 votes, 1 answer

Error on Spark 2.4.4 metrics properties in BinaryClassificationMetrics

I am trying to replicate this Spark/Scala example, but when I try to extract some metrics from a processed .csv file I get an error. My code snippet: val splitSeed = 5043 val Array(trainingData, testData) = df3.randomSplit(Array(0.7, 0.3),…
Joe Taras • 15,166 • 7 • 42 • 55
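A common cause of errors here is that BinaryClassificationMetrics lives in the older mllib API and expects an RDD of (score, label) pairs rather than a DataFrame. Below is a minimal Scala sketch of the expected wiring; the input file and the choice of LogisticRegression are assumptions for illustration, not the asker's actual pipeline:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.sql.SparkSession

    object MetricsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("metrics-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // Hypothetical input with "label" and "features" columns (libsvm format).
        val df3 = spark.read.format("libsvm").load("sample_binary.txt")

        val splitSeed = 5043
        val Array(trainingData, testData) = df3.randomSplit(Array(0.7, 0.3), splitSeed)

        val model = new LogisticRegression().fit(trainingData)
        val predictions = model.transform(testData)

        // The mllib-side metrics class wants an RDD[(score, label)],
        // so extract P(label = 1) from the probability vector first.
        val scoreAndLabels = predictions
          .select($"probability", $"label")
          .rdd
          .map(row => (row.getAs[Vector](0)(1), row.getDouble(1)))

        val metrics = new BinaryClassificationMetrics(scoreAndLabels)
        println(s"Area under ROC = ${metrics.areaUnderROC()}")

        spark.stop()
      }
    }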
-1 votes, 1 answer

In pyspark 2.4, how to handle columns with the same name resulting from a self-join?

Using pyspark 2.4, I am doing a left join of a dataframe on itself. df = df.alias("t1") \ .join(df.alias("t2"), col(t1_anc_ref) == col(t2_anc_ref), "left") The resulting structure of this join is the following: root |-- anc_ref_1:…
Itération 122442 • 2,644 • 2 • 27 • 73
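The usual way out is to alias both sides of the self-join and qualify every column reference. The question is about PySpark, but the DataFrame API has the same shape in Scala; here is a minimal Scala sketch with a made-up parent/child table (the column names are stand-ins for the asker's):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object SelfJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("self-join-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        // Made-up parent/child table standing in for the asker's data.
        val df = Seq((1, None: Option[Int]), (2, Some(1)), (3, Some(2)))
          .toDF("anc_ref", "parent_ref")

        // Alias both sides, then qualify every column reference so the
        // duplicated names stay unambiguous after the join.
        val joined = df.alias("t1")
          .join(df.alias("t2"), col("t1.parent_ref") === col("t2.anc_ref"), "left")
          .select(
            col("t1.anc_ref").as("anc_ref"),
            col("t2.anc_ref").as("parent_anc_ref") // rename before further use
          )

        joined.show()
        spark.stop()
      }
    }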