
I can read the table just after it is created, but how do I read it again in another Spark session?

Given code:

```
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .getOrCreate()

df = spark.read.parquet("examples/src/main/resources/users.parquet")
(df
 .write
 .saveAsTable("people_partitioned_bucketed"))

# retrieve rows from the table as expected
spark.sql("select * from people_partitioned_bucketed").show()

spark.stop()

# open a Spark session again
spark = SparkSession \
    .builder \
    .getOrCreate()

# the table does not exist this time
spark.sql("select * from people_partitioned_bucketed").show()
```

Execution result:

```
+------+----------------+--------------+
|  name|favorite_numbers|favorite_color|
+------+----------------+--------------+
|Alyssa|  [3, 9, 15, 20]|          null|
|   Ben|              []|           red|
+------+----------------+--------------+

Traceback (most recent call last):
  File "/home//workspace/spark/examples/src/main/python/sql/datasource.py", line 246, in <module>
    spark.sql("select * from people_partitioned_bucketed").show()
  File "/home//virtualenvs/spark/local/lib/python2.7/site-packages/pyspark/sql/session.py", line 603, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/home//virtualenvs/spark/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home//virtualenvs/spark/local/lib/python2.7/site-packages/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Table or view not found: people_partitioned_bucketed; line 1 pos 14'
```
petertc

2 Answers


I had the same problem as you, and the solution was nowhere to be found. Then I read this and figured out a way.

Change the initialization of both of your SparkSession objects to:

```
from os.path import abspath

warehouse_location = abspath('spark-warehouse')

spark = SparkSession.builder \
            .config("spark.sql.warehouse.dir", warehouse_location) \
            .enableHiveSupport() \
            .getOrCreate()
```

This initialization explicitly tells Spark where to look for the Hive tables and also enables Hive support. You can change the location of the Hive tables (i.e. spark-warehouse) if you want, by changing the argument inside the abspath() function.
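For example, re-reading the table from a completely new session could look like the sketch below (assuming the script runs from the same working directory both times, so abspath('spark-warehouse') resolves to the same location, and using the table name from the question):

```
from os.path import abspath

from pyspark.sql import SparkSession

warehouse_location = abspath('spark-warehouse')

# New session, started after the one that created the table was stopped;
# it points at the same warehouse directory and has Hive support enabled,
# so the previously saved table is visible again.
spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("select * from people_partitioned_bucketed").show()
```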

P.S.: I don't know why one needs to explicitly enable Hive support, since saveAsTable() saves the DataFrame to a Hive table by default, nor do I know why one needs to explicitly define the spark-warehouse location, since the default location is the current directory. Nevertheless, the above solution just works :) (is it a bug?)

pgmank

Looking at the documentation:

> For file-based data source, e.g. text, parquet, json, etc. you can specify a custom table path via the path option, e.g. df.write.option("path", "/some/path").saveAsTable("t"). When the table is dropped, the custom table path will not be removed and the table data is still there. If no custom table path is specified, Spark will write data to a default table path under the warehouse directory. When the table is dropped, the default table path will be removed too.

In other words, it is necessary to specify a path when saving the table, using the `path` option. If no path is specified, the table will be removed when you close the Spark session.
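Applied to the question's example, that might look like the following sketch ("/some/path" is just the placeholder from the quoted documentation, and people_with_path is a hypothetical table name):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("examples/src/main/resources/users.parquet")

# Save to an explicit, custom table path; per the quoted documentation, data
# written to a custom path is kept even if the table is later dropped.
df.write.option("path", "/some/path").saveAsTable("people_with_path")
```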

Shaido
    It's not the case. _When the table is dropped, the default table path will be removed too._ I did not drop the table and I confirm that the table data is still exist in `spark-warehose` directory after the script executed. – petertc Jan 24 '18 at 11:39
  • @okwap: It seems like when you close the session, Spark forgets the table. Can you load it using `spark.read`? – Shaido Jan 24 '18 at 11:47
  • @okwap: Did you check whether specifying a path helped or not? Or could you load the table with `spark.read` instead of using the sql command with the table name? – Shaido Jan 26 '18 at 07:48
  • @Shaido Yes, it is possible to read the table with `spark.read.parquet("spark-warehouse-path-of-table")`, but not through the `spark.table` API or the `spark.sql("select * from table")` method. – Apoorve Dave Feb 27 '19 at 15:50
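A minimal sketch of the workaround Apoorve Dave describes, reading the table's files directly from the warehouse directory (the path below is an assumption based on the default spark-warehouse layout and the table name from the question):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the saved table's Parquet files straight from the warehouse directory,
# bypassing the catalog entry that the new session no longer has.
df = spark.read.parquet("spark-warehouse/people_partitioned_bucketed")
df.show()
```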