I have a SparkSQL connection to an external database:

from pyspark.sql import SparkSession

spark = SparkSession \
  .builder \
  .appName("Python Spark SQL basic example") \
  .getOrCreate()

If I know the name of a table, it's easy to query.

users_df = spark \
  .read.format("jdbc") \
  .options(dbtable="users", **db_config) \
  .load()

But is there a good way to list/discover tables?

I want the equivalent of `SHOW TABLES` in MySQL, or `\dt` in Postgres.

I'm using pyspark v2.1, in case that makes any difference.

Jacek Laskowski
Abe
    Would this be ok for your use case? http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext.tables – Adonis Mar 19 '17 at 13:35
  • Yes, that's exactly what I was looking for! – Abe Mar 19 '17 at 21:02
  • If your remote DB has a way to query its metadata with SQL, such as `INFORMATION_SCHEMA.TABLE` (Postgres) or `INFORMATION_SCHEMA.TABLES` (MySQL, SQL Server) or `SYS.ALL_TABLES` (Oracle), then you can just use it from Spark to retrieve the list of local objects that you can access. You can also query for columns, primary keys, etc. – Samson Scharfrichter Mar 19 '17 at 21:06
  • My mistake: I scanned the docs quickly---didn't realize that `tables` reports on the metastore, not a remote db. I'll upvote your answer, but it doesn't solve the issue. – Abe Mar 20 '17 at 14:50
  • @SamsonScharfrichter Yes, this works. (That's what I'm doing currently.) I was hoping to find a unified API. Sounds like Spark hasn't implemented one yet. – Abe Mar 20 '17 at 14:51
  • You could build your own Python wrapper around `spark._jvm` to invoke the generic JDBC API in Java, cf. http://www.java2s.com/Code/Java/Database-SQL-JDBC/GetAvailableTableNameInADatabase.htm >> `javaCnx = spark._jvm.java.sql.DriverManager.getConnection(url, username, password)` etc. etc. – Samson Scharfrichter Mar 20 '17 at 17:31
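  • A hedged sketch of the generic-JDBC route from the last comment: open a plain `java.sql.Connection` through Spark's py4j gateway and read the table list from `DatabaseMetaData.getTables()`. The url/user/password values are placeholders, and this assumes a reachable database plus the JDBC driver on Spark's classpath; treat it as an untested sketch.

```python
# Untested sketch: list tables via the generic JDBC metadata API,
# reached through Spark's py4j gateway (spark._jvm).
# url/user/password are placeholders for your own connection details.
url = "jdbc:mysql://localhost:3306/"
user, password = "root", "root"

conn = spark._jvm.java.sql.DriverManager.getConnection(url, user, password)
try:
    # getTables(catalog, schemaPattern, tableNamePattern, types)
    rs = conn.getMetaData().getTables(None, None, "%", None)
    tables = []
    while rs.next():
        tables.append(rs.getString("TABLE_NAME"))
    print(tables)
finally:
    conn.close()
```

This sidesteps the Spark metastore entirely and works against whatever database the JDBC URL points at, which is the "unified API" the JDBC standard itself provides.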

1 Answer

The answer to this question isn't actually Spark-specific: you just need to load `information_schema.tables` over JDBC, like any other table.

The information schema consists of a set of views that contain information about the objects defined in the current database. The information schema is defined in the SQL standard and can therefore be expected to be portable and remain stable, unlike the system catalogs, which are specific to each RDBMS and are modelled after implementation concerns.

I'll be using MySQL for my code snippet; the server contains an enwiki database whose tables I want to list:

# read information_schema.tables over JDBC
spark.read.format('jdbc'). \
     options(
         url='jdbc:mysql://localhost:3306/',  # database URL (local or remote)
         dbtable='information_schema.tables',
         user='root',
         password='root',
         driver='com.mysql.jdbc.Driver'). \
     load(). \
     filter("table_schema = 'enwiki'"). \
     show()  # keep only the tables of the enwiki database
# +-------------+------------+----------+----------+------+-------+----------+----------+--------------+-----------+---------------+------------+----------+--------------+--------------------+-----------+----------+---------------+--------+--------------+-------------+
# |TABLE_CATALOG|TABLE_SCHEMA|TABLE_NAME|TABLE_TYPE|ENGINE|VERSION|ROW_FORMAT|TABLE_ROWS|AVG_ROW_LENGTH|DATA_LENGTH|MAX_DATA_LENGTH|INDEX_LENGTH| DATA_FREE|AUTO_INCREMENT|         CREATE_TIME|UPDATE_TIME|CHECK_TIME|TABLE_COLLATION|CHECKSUM|CREATE_OPTIONS|TABLE_COMMENT|
# +-------------+------------+----------+----------+------+-------+----------+----------+--------------+-----------+---------------+------------+----------+--------------+--------------------+-----------+----------+---------------+--------+--------------+-------------+
# |          def|      enwiki|      page|BASE TABLE|InnoDB|     10|   Compact|   7155190|           115|  828375040|              0|   975601664|1965031424|      11359093|2017-01-23 08:42:...|       null|      null|         binary|    null|              |             |
# +-------------+------------+----------+----------+------+-------+----------+----------+--------------+-----------+---------------+------------+----------+--------------+--------------------+-----------+----------+---------------+--------+--------------+-------------+
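If you want the names as a plain Python list rather than a printed table, a small (hypothetical) follow-up can collect them from the same DataFrame; the connection values below are the same placeholders as above and assume a reachable MySQL server:

```python
# Same JDBC source as above; collect the table names into a Python
# list instead of calling show(). Connection values are placeholders.
tables_df = spark.read.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/',
    dbtable='information_schema.tables',
    user='root',
    password='root',
    driver='com.mysql.jdbc.Driver').load()

table_names = [row['TABLE_NAME']
               for row in tables_df.filter("table_schema = 'enwiki'")
                                   .select('TABLE_NAME')
                                   .collect()]
```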

Note: This solution carries over to Scala and Java, subject to each language's own constraints.

eliasah