
I'm trying to persist a temp view so that I can query it again via SQL:

val df = spark.sqlContext.read.option("header", true).csv("xxx.csv")
df.createOrReplaceTempView("xxx")

persist/cache:

df.cache()                                // or
spark.sqlContext.cacheTable("xxx")        // or
df.persist(StorageLevel.MEMORY_AND_DISK)  // needs import org.apache.spark.storage.StorageLevel, or
spark.sql("CACHE TABLE xxx")

Then I move the underlying xxx.csv, and:

spark.sql("select * from xxx")

Upon which, I find that only CACHE TABLE xxx stores a copy. What am I doing wrong, and how can I persist (e.g. DISK_ONLY) a queryable view/table?
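For what it's worth, it can help to check what Spark has actually cached at each step. A minimal sketch, assuming Spark 2.x with a local session and the same `xxx.csv` from above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.read.option("header", true).csv("xxx.csv")
df.createOrReplaceTempView("xxx")

df.persist(StorageLevel.DISK_ONLY)      // lazy: only *marks* the plan for caching
println(df.storageLevel)                // the level df is marked with
println(spark.catalog.isCached("xxx"))  // whether the view's plan is registered in the cache manager

df.count()                              // an action: this is what actually materializes the cache
```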

Xavier Guihot
darnok

1 Answer


First cache it with df.cache, then register it with df.createOrReplaceTempView("dfTEMP"). Now every time you query dfTEMP, such as val df1 = spark.sql("select * from dfTEMP"), you will read it from memory (the first action on df1 is what actually caches it). Do not worry about persistence levels for now: if df does not fit into memory, Spark will spill the rest to disk.
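A minimal end-to-end sketch of the steps above, assuming Spark 2.x and a local session (`xxx.csv` is the path from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.read.option("header", true).csv("xxx.csv")

df.cache()                       // lazy: marks df for caching (MEMORY_AND_DISK by default for Datasets)
df.createOrReplaceTempView("dfTEMP")

df.count()                       // action: materializes the cache while xxx.csv is still readable

// From here on, queries against the view are served from the cache:
val df1 = spark.sql("select * from dfTEMP")
df1.show()
```

This also explains the observation in the question: `CACHE TABLE xxx` is eager by default (it materializes the cache immediately, unless you write `CACHE LAZY TABLE`), whereas `cache()`/`persist()` are lazy and store nothing until an action runs. If the file is moved before any action, the lazy variants have nothing cached to fall back on.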

elcomendante
  • I'm doing this in notebooks, under Azure Synapse Analytics (for which there is still much less information online) -- I'm defining dataframes and temp tables in `%%pyspark` cells, and then referencing them in `%%sql` cells -- and I have 2 questions re' your answer: (1) I'm seeing other posts that say to cache the *temp table*, which in your example might be `spark.table("dfTEMP").cache` -- do you know if there's any advantage or disadvantage to that approach? (2) With a table that you will repeatedly use, does it matter whether you use `cache` or `persist`? – Doug_Ivison Dec 26 '22 at 18:00
  • 1
    The same logic applies to cloud based notebooks, using `cache` is preferabkle but data will be memory only, and when notebok closes you will loose access to it, when you presist cloud provider will alocate data frame with some storage which then, you can access and cashed it again when rebooting notebook session – elcomendante Jan 05 '23 at 09:43
  • Thanks, Elcomendante. And re' the other question: do you know anything about caching temp table, versus caching dataframe? – Doug_Ivison Jan 05 '23 at 15:08
  • `temp table` is an actual `DF` that spark persisted already. (I think so, double check it pls ) – elcomendante Jan 05 '23 at 17:17
  • yes. I'm asking how an example like you gave -- caching a dataframe -- compares with an example like the one I gave -- caching a temp table.... Because I've seen that recommendation elsewhere. – Doug_Ivison Jan 06 '23 at 18:50
  • Not working. The view never uses cached data. Tried to create a view on a DF that has called cache(), and various permutations of persist(). It seems like the view is being very deliberate about ignoring my cache. The DF behaves as I would expect but someone needs to explain why the view ignores cache and goes all the way back to ADLS storage... – David Beavon May 11 '23 at 15:57