
The following code reads the same CSV file twice even though only one action is called.

End-to-end runnable example:

import pandas as pd
import numpy as np

# Write a small CSV file to disk so that the Spark code below has something to read
df1 = pd.DataFrame(np.arange(1_000).reshape(-1, 1))
df1.index = np.random.choice(range(10), size=1_000)
df1.to_csv("./df1.csv", index_label="index")
############################################################################

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, StructField

# Disable broadcast joins and AQE so the physical plan stays easy to read
spark = (SparkSession.builder
         .config("spark.sql.autoBroadcastJoinThreshold", "-1")
         .config("spark.sql.adaptive.enabled", "false")
         .getOrCreate())

schema = StructType([StructField('index', StringType(), True),
                     StructField('0', StringType(), True)])

df1 = spark.read.csv("./df1.csv", header=True, schema=schema)

# Aggregate per index, then join the aggregate back onto the original DataFrame
df2 = df1.groupby("index").agg(F.mean("0"))
df3 = df1.join(df2, on='index')

df3.explain()
df3.count()  # the only action

The SQL tab in the Spark web UI shows the following:

[Screenshot of the SQL tab: the query plan contains two separate file scans of df1.csv]

As you can see, the df1.csv file is read twice. Is this the expected behavior? Why is it happening? I call only one action, so the same part of the pipeline should not run multiple times.

I have read the answer here. That question is almost the same, but it uses RDDs, whereas I am using DataFrames through the PySpark API. The answer there suggests that the DataFrame API would help if multiple file scans are to be avoided, and that this is the reason the DataFrame API exists in the first place.

However, as it turns out, I am facing exactly the same issue with DataFrames as well. It seems rather odd for Spark, which is celebrated for its efficiency, to be this inefficient (most likely I am just missing something, so that is not a valid criticism :)).

figs_and_nuts
  • Is it the same if you combine them into a single line? `df3=df1.join(df1.groupby("index").agg(F.mean("0")),on='index')`? – BeRT2me May 06 '23 at 21:13

2 Answers


Yes. This is typical for JOIN and UNION operations, for both DataFrames and RDDs, when reading from the same underlying source, unless a .cache() is used. With caching applied, the Spark UI still shows the reads, but with a green dot indicating that the cache was used. Nothing to see here, as they say.
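As a rough sketch (reusing the variable names from the question; not tested against it), adding an explicit cache before the plan branches should make Spark scan the CSV only once and serve the second branch from the cached partitions:

df1 = spark.read.csv("./df1.csv", header=True, schema=schema)
df1.cache()  # lazy: the cache is populated by the first scan triggered by the action

df2 = df1.groupby("index").agg(F.mean("0"))
df3 = df1.join(df2, on='index')

df3.count()  # one file scan; the other branch reads from the in-memory cache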

thebluephantom
  • Thank you for your time and help :). Is there any principled way to think about this? So far the dictum I followed was "Nothing in the DAG is rerun for 1 action. Use `.persist` if you want to use DAG intermediate results in more than one action." But, as you say, that is not necessarily true – figs_and_nuts May 08 '23 at 06:02
  • As your case shows, there are in fact two situations: the first that we all see, and the second as you present. Some will think it odd, but there is .cache(), and we also have to consider the lineage needed for worker failure of an independent query, regardless of whether there is a single action or multiple actions. – thebluephantom May 08 '23 at 08:18
  • I'm sorry, but I didn't get what you are trying to say starting from "Some will think it odd..." – figs_and_nuts May 08 '23 at 18:42

The way I see this is in terms of stages. Each stage is a collection of identical tasks that run in parallel on different partitions of your data.

Your query has 4 stages:

[Screenshot of the Spark UI showing the query's 4 stages]

So, as you have noticed, Stage 0 and Stage 1 both read your CSV file. But that makes sense: Stage 0 and Stage 1 are completely independent of one another. They may be reading the same data, but they are doing different things.

In general, all the tasks in Stage 0 will be executed before any task in Stage 1 starts. So if you wanted to avoid reading the data twice, you would need either of these:

  • Have a stage that computes 2 outputs (in this case, for both inputs of Stage 2). This would significantly change the Spark architecture, since currently a stage always has exactly 1 output.

  • Spark could indeed (as thebluephantom says) decide under the hood to .cache this dataset for you, but that would mean filling up your storage memory without you even having asked for it (and risking that further computations become less performant). It would also make it hard to know how full your storage memory is, if underlying processes silently started caching your data. That is why caching stays an explicit choice, as sketched below.
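For illustration only (a sketch built on the question's variables; the storage level is an assumption, not a recommendation), making the caching explicit keeps that trade-off in your hands:

from pyspark import StorageLevel

df1 = spark.read.csv("./df1.csv", header=True, schema=schema)
df1.persist(StorageLevel.MEMORY_AND_DISK)  # you choose what gets cached and whether it may spill to disk

df2 = df1.groupby("index").agg(F.mean("0"))
df3 = df1.join(df2, on='index')
df3.count()      # the CSV is scanned once; both branches reuse the persisted data

df1.unpersist()  # and you choose when the storage memory is released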

Koedlt