I explored using multiple SparkSessions (to connect to different data sources/clusters) a bit, and I found some weird behavior.
First I created a SparkSession to read and write an Iceberg table, and everything worked.
Then I created a new SparkSession with a deliberately incorrect parameter (spark.sql.catalog.mycatalog.uri pointing at a non-existent address) and used it to access the table created by the previous SparkSession.
If I go through (1) spark.read()...load() first, and then (2) run some SQL on that table as well, everything still works, even with the incorrect URI.
The full test is given below:
// Test: a new SparkSession accesses the dataset created by the previous SparkSession,
// using spark.read()...load() first, then sql(). The whole test passes.
@Test
public void multipleSparkSessions() throws AnalysisException {
  // Create the 1st SparkSession
  String endpoint = String.format("http://localhost:%s/metastore", port);
  ctx = SparkSession
      .builder()
      .master("local")
      .config("spark.ui.enabled", false)
      .config("spark.sql.catalog.mycatalog", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.mycatalog.type", "hive")
      .config("spark.sql.catalog.mycatalog.uri", endpoint)
      .config("spark.sql.catalog.mycatalog.cache-enabled", "false")
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .getOrCreate();

  // Create a table with the SparkSession
  String tableName = String.format("%s.%s", "test", Integer.toHexString(RANDOM.nextInt()));
  ctx.sql(String.format("CREATE TABLE mycatalog.%s USING iceberg "
      + "AS SELECT * FROM VALUES ('michael', 31), ('david', 45) AS (name, age)", tableName));

  // Create a new SparkSession and point its catalog URI at a non-existent address
  SparkSession newSession = ctx.newSession();
  newSession.conf().set("spark.sql.catalog.mycatalog.uri", "http://non_exist_address");

  // Access the dataset created above with the new SparkSession through session.read()...load(), which succeeds
  List<Row> dataset2 = newSession.read()
      .format("iceberg")
      .load(String.format("mycatalog.%s", tableName))
      .collectAsList();
  dataset2.forEach(System.out::println);

  // Access the dataset through SQL, which succeeds as well
  newSession.sql(String.format("select * from mycatalog.%s", tableName)).collectAsList();
}
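Incidentally, the override itself is definitely applied to the new session. Here is a minimal sanity-check sketch (not part of the test above; the printed values are what I'd expect in my setup) showing that each session reports its own value for the catalog URI:

// Sanity-check sketch: each session sees its own value of the catalog URI,
// so the session-level override is in effect even though reads still succeed.
System.out.println(ctx.conf().get("spark.sql.catalog.mycatalog.uri"));        // the real metastore endpoint
System.out.println(newSession.conf().get("spark.sql.catalog.mycatalog.uri")); // http://non_exist_address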
But if I use the new SparkSession to access the table through (1) newSession.sql() first, the execution fails, and then (2) newSession.read()...load() fails as well, with the error java.lang.RuntimeException: Failed to get table info from metastore test.3d79f679.
The updated test is given below; note the assertThrows calls, which verify that the exception is thrown.
IMO this behavior makes more sense: since I provided an incorrect catalog URI, the SparkSession shouldn't be able to locate the table.
@Test
public void multipleSparkSessions() throws AnalysisException {
  ..same as above...

  // Access the dataset through SQL first; the exception is thrown
  assertThrows(java.lang.RuntimeException.class, () -> newSession.sql(
      String.format("select * from mycatalog.%s", tableName)).collectAsList());

  // Then access the dataset with the new SparkSession through session.read()...load(); the exception is thrown as well
  assertThrows(java.lang.RuntimeException.class, () -> newSession.read()
      .format("iceberg")
      .load(String.format("mycatalog.%s", tableName))
      .collectAsList());
}
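To see what actually fails underneath, I also unwrapped the exception chain with a small diagnostic sketch (not part of the test; just plain Java around the same call):

// Diagnostic sketch: unwrap the exception chain to find the underlying failure.
// In my run the top-level error was java.lang.RuntimeException:
// Failed to get table info from metastore test.3d79f679
try {
  newSession.sql(String.format("select * from mycatalog.%s", tableName)).collectAsList();
} catch (RuntimeException e) {
  Throwable root = e;
  while (root.getCause() != null) {
    root = root.getCause();
  }
  System.out.println("root cause: " + root);
}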
Any idea what could lead to these two different behaviors with spark.read().load() versus spark.sql(), depending on the order they are called in?