
There are many ways to verify the schema of two data frames in Spark, like here. But I want to verify the schemas of two data frames only in SQL, I mean Spark SQL.

Sample query 1:

SELECT DISTINCT target_person FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME IN ('columnA','ColumnB') AND TABLE_SCHEMA='ad_facebook'

Sample query 2:

SELECT count(*) FROM information_schema.columns WHERE table_name = 'ad_facebook'

I know that there is no concept of a database (schema) in Spark, but I read that the metastore contains schema information, etc.

Can we write SQL queries like the above in Spark SQL?

EDIT:

I am just checking why SHOW CREATE TABLE is not working in Spark SQL. Is it because it's a temp table?

scala> val df1=spark.sql("SHOW SCHEMAS")
df1: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df1.show
+------------+
|databaseName|
+------------+
|     default|
+------------+


scala> val df2=spark.sql("SHOW TABLES in default")
df2: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]

scala> df2.show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|        |       df|       true|
+--------+---------+-----------+


scala> val df3=spark.sql("SHOW CREATE TABLE default.df")
org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'df' not found in database 'default';
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:180)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:398)
  at org.apache.spark.sql.execution.command.ShowCreateTableCommand.run(tables.scala:834)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:182)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
  ... 48 elided

3 Answers


The schema can be queried using DESCRIBE [EXTENDED] [db_name.]table_name

See https://docs.databricks.com/spark/latest/spark-sql/index.html#spark-sql-language-manual
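
For instance, a minimal sketch (not from the original answer) of using DESCRIBE to compare the schemas of the two DataFrames from the question, assuming they are available as df1 and df2; the temp view names dfA and dfB are placeholders:

// Register both DataFrames as temp views so they can be described in SQL.
df1.createOrReplaceTempView("dfA")
df2.createOrReplaceTempView("dfB")

// DESCRIBE returns one row per column: (col_name, data_type, comment).
// Nullability is not part of this output, so only names and types are compared.
val schemaA = spark.sql("DESCRIBE dfA").select("col_name", "data_type")
val schemaB = spark.sql("DESCRIBE dfB").select("col_name", "data_type")

// The symmetric difference is empty only when column names and types match.
val mismatches = schemaA.except(schemaB).union(schemaB.except(schemaA))
if (mismatches.count() == 0) println("matching") else println("not matching")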

DW.
  • Hi @DW, can you please check my question? I just updated it. When I query `show create table`, for some reason Spark SQL is not working. Can you please help me understand why that is happening? –  Sep 05 '18 at 07:51
  • @JumpMan Please check this: in "default.df", default is the database name and df is the table name. It says there is no such table by the name "df". – Dasarathy D R Sep 05 '18 at 08:24
  • If you check the above query, ‘show tables’ is returning the df table, right? –  Sep 05 '18 at 16:40

Try this code, which extracts each schema and compares them. It compares each column's name, datatype, and nullability.

val x = df1.schema.sortBy(_.name) // get DataFrame 1's schema, sorted by column name
val y = df2.schema.sortBy(_.name) // get DataFrame 2's schema, sorted by column name

// Pair up the fields of both schemas position by position and keep only the
// pairs that differ in name, datatype, or nullability.
val out = x.zip(y).filter(pair => pair._1 != pair._2)

if (out.size == 0) { // `out` is empty when the schemas match
  println("matching")
} else {
  println("not matching")
}
Praveen L
  • Hi Praveen, I am only interested in SQL in Spark, not the Scala code; sortBy and zip are Scala functions, not Spark SQL functions. –  Sep 05 '18 at 07:52
  • As we are comparing only column names and datatypes (x and y, which are struct types), and their count will be in the tens or hundreds, it can easily be done in any language. I just did not understand why you want to use only Spark SQL for comparing two struct-type values. – Praveen L Sep 05 '18 at 08:54

We can get the schema in two ways in Spark SQL.

Method 1:

spark.sql("desc db_name table_name").show()

This will display only the top 20 rows, exactly like the DataFrame behaviour of df.show()

(meaning, for any table with more than 20 columns, the schema will be shown only for the first 20 columns)

For Ex:

+--------------------+---------+-------+
|            col_name|data_type|comment|
+--------------------+---------+-------+
|                col1|   bigint|   null|
|                col2|   string|   null|
|                col3|   string|   null|
+--------------------+---------+-------+
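
If the table has more than 20 columns, show() also accepts an explicit row count and a truncate flag; a minimal sketch (the figure of 100 is an arbitrary assumption):

// show up to 100 rows of the DESCRIBE output, without truncating long values
spark.sql("desc db_name.table_name").show(100, truncate = false)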

Method 2:

spark.sql("desc db_name table_name").collect().foreach(println)

This will display the complete schema of all the columns.

For Ex:

[col1,bigint,null]
[col2,string,null]
[col3,string,null]
Dasarathy D R