
When joining two tables, I would like to select all columns except two of them from a large table with many columns, using PySpark SQL on Databricks.

My PySpark SQL:

 %sql
 set hive.support.quoted.identifiers=none;
 select a.*, '?!(b.year|b.month)$).+'
 from MY_TABLE_A as a
 left join 
      MY_TABLE_B as b
 on a.year = b.year and a.month = b.month 

I followed *hive: select all column exclude two* and *Hive How to select all but one column?*

but it does not work for me: all columns still appear in the results. I would like to remove the duplicated columns (year and month) from the result.

thanks

user3448011
  • Will you be exceeding maximum query length by writing down all columns? You can query system tables to get your columns but It might not be a good idea. It might be lengthy to write them all down but execution will be faster than using some system tables to figure out your columns. – Martin Kumecký Jul 28 '20 at 05:14
  • There are too many columns to type them in the query. – user3448011 Jul 28 '20 at 05:50
  • 1
    @user3448011, If the answer helped you to resolve issue.. Could you **`upvote and accept`** the answer to close this thread! :-) https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work – notNull Aug 17 '20 at 02:43

3 Answers


`set hive.support.quoted.identifiers=none` is not supported in Spark.

Instead, in Spark, set `spark.sql.parser.quotedRegexColumnNames=true` to get the same behavior as Hive.

Example:

df = spark.createDataFrame([(1, 2, 3, 4)], ['id', 'a', 'b', 'c'])
df.createOrReplaceTempView("tmp")
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")

# select all columns except a and b
spark.sql("select `(a|b)?+.+` from tmp").show()
#+---+---+
#| id|  c|
#+---+---+
#|  1|  4|
#+---+---+
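Applied to the join in the question, the same regex trick could look like the sketch below (table and column names taken from the question; the regex `(year|month)?+.+` matches every column of `b` except those named exactly `year` or `month`):

```sql
SET spark.sql.parser.quotedRegexColumnNames=true;

select a.*, b.`(year|month)?+.+`
from MY_TABLE_A as a
left join MY_TABLE_B as b
  on a.year = b.year and a.month = b.month
```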
notNull

As of Databricks Runtime 9.0, you can use the `* except()` syntax like this:

df = spark.sql("select a.* except(col1, col2, col3) from my_table_a...")

or, if just using %sql as in your example:

select a.* except(col1, col2, col3) from my_table_a...
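Using the table names from the question, a hedged sketch of the original join with this syntax might be (assuming your Databricks Runtime supports `except()` on a qualified star, here applied to `b.*` to drop its duplicate join keys):

```sql
select a.*, b.* except (year, month)
from MY_TABLE_A as a
left join MY_TABLE_B as b
  on a.year = b.year and a.month = b.month
```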
David Maddox

In PySpark, you can do something like this:

df.select([col for col in df.columns if col not in {'col1', 'col2', 'col3'}])

where df is the resulting DataFrame after the join is performed.
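The filtering step itself is plain Python over `df.columns`; a minimal standalone sketch of the comprehension (the column names here are made up for illustration):

```python
# Hypothetical list of column names, standing in for df.columns after the join
columns = ["id", "year", "month", "sales", "region"]

# Columns to exclude, as in the answer above
excluded = {"year", "month"}

# Keep every column whose name is not in the exclusion set
kept = [c for c in columns if c not in excluded]
print(kept)  # ['id', 'sales', 'region']
```

The resulting list is then passed to `df.select(...)` as shown above.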

Frosty