I have to read data from an Oracle Database using JDBC with Spark (2.2). To minimize the amount of transferred data, I use a pushdown query, which already filters the data to be loaded. That data is then appended to an existing Hive table. To log what has been loaded, I count the records loaded via JDBC. My code basically looks like this:
import org.apache.spark.sql.SaveMode

// Pushdown query: the filter runs in Oracle; Spark treats the aliased subquery as the table.
val query = "(select a, b from table where c = 1) t"

// Partitioned read over 10 parallel JDBC connections, marked for caching.
val myDF = spark.read.jdbc(jdbc_url, query, partitionColumn, lowerBound, upperBound, 10, connectionProperties).cache()

myDF.write.mode(SaveMode.Append)
  .format("parquet")
  .saveAsTable("my_table_name")

val numRecs = myDF.count()
My assumption was that, due to the cache(), the DataFrame is read once via JDBC, saved, and then used to count the records. But when I look at the sessions created in Oracle, it looks like the count operation itself creates 10 sessions on the Oracle database. At first I see 10 sessions, which basically look like this:
SELECT * FROM (select a, b from table where c = 1) t WHERE a >= x AND a < y
And after that is done, another 10 sessions appear, like this:
SELECT 1 FROM (select a, b from table where c = 1) t WHERE a >= x AND a < y
So it looks like Spark is loading the data from the JDBC source again just to count the records, when it should be sufficient to use the data that was already loaded. How can this be prevented, and Spark forced to read the data from the JDBC source only once?
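What I already tried (a quick sanity-check sketch, not output from the actual job): verifying that the cache is really registered and that the plan references it. As far as I understand, after cache() the physical plan should contain an InMemoryTableScan node, and the storage level should no longer be NONE:

myDF.explain()             // physical plan; I'd expect an InMemoryTableScan node after cache()
println(myDF.storageLevel) // StorageLevel.NONE would mean the DataFrame is not marked for caching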
UPDATE
Turns out I was blind: there was another count() in my code before saveAsTable was called. So it makes total sense that the first action called on the DataFrame was indeed count(). After eliminating it, everything behaved as expected.
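For anyone who hits the same symptom: the fixed flow is just the code from the question, with the write as the first action. A sketch with my understanding annotated:

val myDF = spark.read.jdbc(jdbc_url, query, partitionColumn, lowerBound, upperBound, 10, connectionProperties).cache() // lazy: no JDBC read happens yet

myDF.write.mode(SaveMode.Append) // first action: performs the single partitioned
  .format("parquet")             // JDBC read (10 sessions) and fills the cache
  .saveAsTable("my_table_name")

val numRecs = myDF.count()       // answered from the cache, no new Oracle sessions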