I have to read data from an Oracle Database using JDBC with Spark (2.2). To minimize the amount of transferred data, I use a pushdown query, which already filters the data to be loaded. That data is then appended to an existing Hive table. To log what has been loaded, I count the records loaded via JDBC. My code basically looks like this:
import org.apache.spark.sql.SaveMode

// Pushdown query: the filter runs in Oracle; Spark treats the aliased subquery as the table.
val query = "(select a, b from table where c = 1) t"

// Partitioned read over 10 parallel JDBC connections, marked for caching.
val myDF = spark.read.jdbc(jdbc_url, query, partitionColumn, lowerBound, upperBound, 10, connectionProperties).cache()

myDF.write.mode(SaveMode.Append)
  .format("parquet")
  .saveAsTable("my_table_name")

val numRecs = myDF.count()
My assumption was that, due to the cache(), the DataFrame is read once via JDBC, saved, and then used to count the records. But when I look at the sessions created in Oracle, it looks like the count operation itself creates 10 sessions on the Oracle database. At first I see 10 sessions, which basically look like this:
SELECT * FROM (select a, b from table where c = 1) t WHERE a >= x AND a < y
And after that is done, another 10 sessions appear, like this:
SELECT 1 FROM (select a, b from table where c = 1) t WHERE a >= x AND a < y
So it looks like Spark is loading the data from the JDBC source again just to count the records, when it should be sufficient to use the data that was already loaded. How can this be prevented, and Spark forced to read the data from the JDBC source only once?
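What I already tried (a quick sanity-check sketch, not output from the actual job): verifying that the cache is really registered and that the plan references it. As far as I understand, after cache() the physical plan should contain an InMemoryTableScan node, and the storage level should no longer be NONE:

myDF.explain()             // physical plan; I'd expect an InMemoryTableScan node after cache()
println(myDF.storageLevel) // StorageLevel.NONE would mean the DataFrame is not marked for caching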
UPDATE
Turns out I was blind: there was another count() in my code before saveAsTable was called. So it makes total sense that the first action called on the DataFrame was indeed count(). After eliminating it, everything behaved as expected.
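For anyone who hits the same symptom: the fixed flow is just the code from the question, with the write as the first action. A sketch with my understanding annotated:

val myDF = spark.read.jdbc(jdbc_url, query, partitionColumn, lowerBound, upperBound, 10, connectionProperties).cache() // lazy: no JDBC read happens yet

myDF.write.mode(SaveMode.Append) // first action: performs the single partitioned
  .format("parquet")             // JDBC read (10 sessions) and fills the cache
  .saveAsTable("my_table_name")

val numRecs = myDF.count()       // answered from the cache, no new Oracle sessions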