
I have been exploring the use of multiple SparkSessions (to connect to different data sources/clusters), and I found some weird behavior.

First, I created a SparkSession to read and write an Iceberg table, and everything works.

Then, if I use a new SparkSession (with some incorrect parameters, such as a bad spark.sql.catalog.mycatalog.uri) to access the table created by the previous SparkSession via (1) spark.read().*.load("*") first and then (2) some SQL on that table, everything still works (even with the incorrect parameter).

The full test is given below:

// Imports assumed for this snippet (not shown in the original post; JUnit 5 assumed):
import static org.junit.jupiter.api.Assertions.assertThrows;

import java.util.List;
import java.util.Random;

import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.junit.jupiter.api.Test;

// Test-class fields assumed: SparkSession ctx; int port; Random RANDOM;

// The test uses the new SparkSession to access the dataset created by the
// previous SparkSession, calling spark.read().*.load(*) first, then SQL.
// The whole test still passes.
@Test
public void multipleSparkSessions() throws AnalysisException {
    // Create the 1st SparkSession
    String endpoint = String.format("http://localhost:%s/metastore", port);

    ctx = SparkSession
        .builder()
        .master("local")
        .config("spark.ui.enabled", false)
        .config("spark.sql.catalog.mycatalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.mycatalog.type", "hive")
        .config("spark.sql.catalog.mycatalog.uri", endpoint)
        .config("spark.sql.catalog.mycatalog.cache-enabled", "false")
        .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .getOrCreate();

    // Create a table with the SparkSession
    String tableName = String.format("%s.%s", "test", Integer.toHexString(RANDOM.nextInt()));
    ctx.sql(String.format("CREATE TABLE mycatalog.%s USING iceberg "
        + "AS SELECT * FROM VALUES ('michael', 31), ('david', 45) AS (name, age)", tableName));


    // Create a new SparkSession
    SparkSession newSession = ctx.newSession();
    newSession.conf().set("spark.sql.catalog.mycatalog.uri", "http://non_exist_address");

    // Access the dataset created above with the new SparkSession through
    // newSession.read()...load(), which succeeds
    List<Row> dataset2 = newSession.read()
        .format("iceberg")
        .load(String.format("mycatalog.%s", tableName))
        .collectAsList();
    dataset2.forEach(System.out::println);

    // Access the dataset through SQL, which succeeds as well.
    newSession.sql(
        String.format("select * from mycatalog.%s", tableName)).collectAsList();
  }
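
As a quick sanity check (a diagnostic sketch, reusing the fixture from the test above), the runtime conf of each session does report different URIs, so the override itself is in effect:

    // Diagnostic sketch: print what each session's runtime conf holds for the
    // catalog URI. The new session carries the bogus address, yet load() above
    // still succeeded against the real table.
    System.out.println(newSession.conf().get("spark.sql.catalog.mycatalog.uri"));
    // -> http://non_exist_address
    System.out.println(ctx.conf().get("spark.sql.catalog.mycatalog.uri"));
    // -> the real http://localhost:<port>/metastore endpoint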

But if I use the new SparkSession to access the table through (1) newSession.sql() first, the execution fails, and then (2) read().*.load() fails as well, with the error java.lang.RuntimeException: Failed to get table info from metastore test.3d79f679.

The updated test is given below; note the assertThrows calls, which verify that the exception is thrown.

IMO this makes more sense: given that I provided an incorrect catalog URI, the SparkSession shouldn't be able to locate that table.

@Test
public void multipleSparkSessions() throws AnalysisException {
    // ...same setup as above...


    // Access the dataset through SQL first; the exception is thrown
    assertThrows(RuntimeException.class, () -> newSession.sql(
        String.format("select * from mycatalog.%s", tableName)).collectAsList());

    // Access the dataset created above with the new SparkSession through
    // newSession.read()...load(); the exception is thrown as well
    assertThrows(RuntimeException.class, () -> newSession.read()
        .format("iceberg")
        .load(String.format("mycatalog.%s", tableName)).collectAsList());
  }
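
As a further sanity check, one variant I considered (a sketch only; the catalog name "othercatalog" is hypothetical and was not part of my original test) is to register the unreachable endpoint as a separate catalog in the new session, instead of overriding mycatalog. Since Spark instantiates catalog plugins lazily from the session conf on first reference, this should isolate the new session from the working catalog definition entirely:

    // Sketch (hypothetical "othercatalog"): a second Iceberg catalog defined
    // only in the new session, pointing at the unreachable endpoint.
    newSession.conf().set("spark.sql.catalog.othercatalog", "org.apache.iceberg.spark.SparkCatalog");
    newSession.conf().set("spark.sql.catalog.othercatalog.type", "hive");
    newSession.conf().set("spark.sql.catalog.othercatalog.uri", "http://non_exist_address");
    newSession.conf().set("spark.sql.catalog.othercatalog.cache-enabled", "false");

    // Any access through othercatalog should fail regardless of call order.
    assertThrows(RuntimeException.class, () -> newSession.sql(
        String.format("select * from othercatalog.%s", tableName)).collectAsList());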


Any idea what could lead to these two different behaviors with spark.read().load() versus spark.sql(), depending on the order in which they are called?
