
When joining two tables, I would like to select all columns except two of them from a large table with many columns, using PySpark SQL on Databricks.

My PySpark SQL:

 %sql
 set hive.support.quoted.identifiers=none;
 select a.*, '?!(b.year|b.month)$).+'
 from MY_TABLE_A as a
 left join 
      MY_TABLE_B as b
 on a.year = b.year and a.month = b.month 

I followed *hive: select all column exclude two* and *Hive How to select all but one column?*

but it does not work for me: all columns still appear in the results. I would like to remove the duplicated columns (year and month) from the result.

thanks

user3448011
  • Will you be exceeding maximum query length by writing down all columns? You can query system tables to get your columns but It might not be a good idea. It might be lengthy to write them all down but execution will be faster than using some system tables to figure out your columns. – Martin Kumecký Jul 28 '20 at 05:14
  • There are too many columns to type them in the query. – user3448011 Jul 28 '20 at 05:50
  • 1
    @user3448011, If the answer helped you to resolve issue.. Could you **`upvote and accept`** the answer to close this thread! :-) https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work – notNull Aug 17 '20 at 02:43

3 Answers


`set hive.support.quoted.identifiers=none` is not supported in Spark.

Instead, in Spark, set `spark.sql.parser.quotedRegexColumnNames=true` to get the same behavior as Hive.

Example:

df = spark.createDataFrame([(1, 2, 3, 4)], ['id', 'a', 'b', 'c'])
df.createOrReplaceTempView("tmp")
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")

# select all columns except a and b
spark.sql("select `(a|b)?+.+` from tmp").show()
#+---+---+
#| id|  c|
#+---+---+
#|  1|  4|
#+---+---+
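Applied to the join in the question, the same regex trick could look like the sketch below (table and column names taken from the question; the regex `(year|month)?+.+` matches every column of `b` except those named exactly `year` or `month`):

```sql
SET spark.sql.parser.quotedRegexColumnNames=true;

select a.*, b.`(year|month)?+.+`
from MY_TABLE_A as a
left join MY_TABLE_B as b
  on a.year = b.year and a.month = b.month
```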
notNull

As of Databricks Runtime 9.0, you can use the `* except()` syntax like this:

df = spark.sql("select a.* except(col1, col2, col3) from my_table_a...")

or, if just using %sql as in your example:

select a.* except(col1, col2, col3) from my_table_a...
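Using the table names from the question, a hedged sketch of the original join with this syntax might be (assuming your Databricks Runtime supports `except()` on a qualified star, here applied to `b.*` to drop its duplicate join keys):

```sql
select a.*, b.* except (year, month)
from MY_TABLE_A as a
left join MY_TABLE_B as b
  on a.year = b.year and a.month = b.month
```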
David Maddox

In PySpark, you can do something like this:

df.select([col for col in df.columns if col not in {'col1', 'col2', 'col3'}])

where df is the resulting DataFrame after the join is performed.
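The filtering step itself is plain Python over `df.columns`; a minimal standalone sketch of the comprehension (the column names here are made up for illustration):

```python
# Hypothetical list of column names, standing in for df.columns after the join
columns = ["id", "year", "month", "sales", "region"]

# Columns to exclude, as in the answer above
excluded = {"year", "month"}

# Keep every column whose name is not in the exclusion set
kept = [c for c in columns if c not in excluded]
print(kept)  # ['id', 'sales', 'region']
```

The resulting list is then passed to `df.select(...)` as shown above.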

Frosty