80

In pandas, this can be done with column.name.

But how do you do the same when it's a column of a Spark dataframe?

E.g. the calling program has a Spark dataframe: spark_df

>>> spark_df.columns
['admit', 'gre', 'gpa', 'rank']

This program calls my function: my_function(spark_df['rank'])
In my_function, I need the name of the column, i.e. 'rank'.

If it were a pandas dataframe, we could use this:

>>> pandas_df['rank'].name
'rank'
aynber
Kaushik Acharya

8 Answers

118

You can get the names from the schema by doing

spark_df.schema.names

Printing the schema can be useful to visualize it as well

spark_df.printSchema()
David
  • What I wanted to know is the name of the column which is the input parameter to my function. The calling program will call my_function by my_function(spark_df['rank']). Inside my_function, how would I know the name of the column that was passed? – Kaushik Acharya Sep 29 '16 at 04:05
  • You can use `pyspark.sql.functions.col` to access a column by name. E.g., `df.filter( col(var_name) > 1)` – shuaiyuancn Sep 29 '16 at 10:17
  • @ShuaiYuan, that's not what I want. Inside my_function(col), how would I know the col name? The calling function calls my_function(spark_df['rank']); inside my_function, I want to extract 'rank' as the column name from the input parameter col. – Kaushik Acharya Sep 30 '16 at 04:40
  • That seems like an odd request. But you could change your function to take a string for the name of the column – David Sep 30 '16 at 13:24
  • You can change your function to `myfunc(df, name)`; then you have access to `name` in your function. When you need to use that column in the dataframe, do `df[name]`. – shuaiyuancn Sep 30 '16 at 13:58
  • Or, you could make use of [`Dataframe.Column.__repr__`](https://github.com/apache/spark/blob/master/python/pyspark/sql/column.py#L430). – shuaiyuancn Sep 30 '16 at 14:00
  • In my case I did a check for the name of the column: `if 'col_name' in df.schema.names: # do something`. Thanks for the answer. – Pablo Adames Sep 12 '20 at 04:43
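The pattern suggested in the comments can be sketched minimally: pass the column name alongside the dataframe, so the function always knows it. The dict below is a stand-in for a dataframe (an assumption for a self-contained run); `df[name]` works the same way on a Spark DataFrame.

```python
def my_function(df, name):
    # 'name' is the column name, known directly; df[name] gives the column
    print(f"working on column: {name}")
    return df[name]

# A plain dict standing in for a dataframe so this snippet runs anywhere;
# with Spark you would call my_function(spark_df, 'rank') instead.
fake_df = {"rank": [3, 1, 2]}
col = my_function(fake_df, "rank")
```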
25

The only way is to go down a level, to the underlying JVM object.

df.col._jc.toString().encode('utf8')

This is also how it is converted to a str in the pyspark code itself.

From pyspark/sql/column.py:

def __repr__(self):
    return 'Column<%s>' % self._jc.toString().encode('utf8')
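Given the `__repr__` shown above, the bare name can be unwrapped from the string form with a regex. The exact repr format varies by Spark and Python version (under Python 3 the `.encode('utf8')` produces a `b'...'` wrapper), so the sample strings below are assumptions, and `name_from_repr` is a hypothetical helper:

```python
import re

def name_from_repr(s):
    # Strip the Column<...> wrapper and an optional b'...' byte-literal wrapper
    m = re.match(r"Column<(?:b'(?P<q>[^']*)'|(?P<p>[^>]*))>$", s)
    return m.group("q") or m.group("p")

# Sample repr strings modelled on the __repr__ shown above
print(name_from_repr("Column<b'rank'>"))  # rank
print(name_from_repr("Column<rank>"))     # rank
```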
numeral
  • This won't pull out the alias if there is one, unfortunately. – santon Feb 05 '18 at 23:21
  • True, but you can easily parse that out if there's an alias: `re.search('AS `(\S*)`', col.alias('some_alias')._jc.toString()).group(1)` -> `'some_alias'`. Of course this isn't perfect, since we're doing some regex parsing, but I would hope it's unlikely you have a column name with "AS `bad`" in it. – numeral Feb 08 '18 at 20:30
  • @numeral does the underlying JVM code expose any kind of parser logic that can be used instead of hand-rolling it? – shadowtalker Jan 07 '19 at 19:49
  • @shadowtalker It doesn't seem like it after checking https://spark.apache.org/docs/2.2.0/api/java/index.html?org/apache/spark/sql/Column.html – numeral Jan 08 '19 at 20:15
  • The alias can also be extracted without using any regex: str(column).split(' AS ')[1].split('`')[1] – ciurlaro Nov 29 '20 at 16:42
10

Python

As @numeral correctly said, column._jc.toString() works fine in the case of unaliased columns.

In the case of aliased columns (i.e. column.alias("whatever")), the alias can be extracted without regular expressions: str(column).split(" AS ")[1].split("`")[1].

I don't know Scala syntax, but I'm sure it can be done the same way.
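The split trick can be shown on a plain string. The exact format of str(column) depends on the Spark version, so the sample below is an assumption modelled on a Spark 2.x-style repr of an aliased column:

```python
# Hypothetical str(column) for column.alias("whatever"); the exact
# repr format depends on the Spark version.
s = "Column<'rank AS `whatever`'>"

# Take everything after " AS ", then the text between the backticks
alias = s.split(" AS ")[1].split("`")[1]
print(alias)  # whatever
```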

ciurlaro
6

If you want the column names of your dataframe, you can use the pyspark.sql module. I'm not sure if it supports explicitly indexing a DataFrame by column name; I received this traceback:

>>> df.columns['High']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str

However, accessing the columns attribute on your dataframe, as you have done, will return a list of column names:

df.columns will return ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

If you want the column datatypes, you can access the dtypes attribute:

df.dtypes will return [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]

If you want a particular column, you'll need to access it by index:

df.columns[2] will return 'High'
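Since df.columns is a plain Python list, the usual list operations also work in the other direction, from a name to its position (the column names below are taken from the example output above):

```python
# df.columns is just a Python list, so standard list operations apply
cols = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

print(cols[2])             # 'High'  (index -> name, as above)
print(cols.index('High'))  # 2       (name -> index)
print('High' in cols)      # True    (membership test)
```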

JoeyC
Pat
4

I found the answer is very very simple...

// This is Java, but it should be the same in PySpark
Column col = ds.col("colName"); // the column object
String theNameOftheCol = col.toString();

The variable theNameOftheCol now holds "colName".

ZygD
Yucci Mel
2

This should cover even the strangest cases:

  • column with/ without an alias
  • several aliases
  • aliases containing multiple words
  • column names surrounded with backticks
  • intentional backticks in aliases
def get_col_name(col):
    if str(col)[-3] != '`':
        return str(col).split("'")[-2].split(" AS ")[-1]
    return str(col).replace('``', '`').split(" AS `")[-1].split("`'")[-2]
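A usage sketch of the function. The strings below are assumptions about what str(column) produces (Spark 2.x-style reprs), since the function only ever sees the string form; the definition is repeated so the snippet runs standalone:

```python
# get_col_name repeated from the answer above so this snippet is self-contained
def get_col_name(col):
    if str(col)[-3] != '`':
        return str(col).split("'")[-2].split(" AS ")[-1]
    return str(col).replace('``', '`').split(" AS `")[-1].split("`'")[-2]

# Hypothetical str(Column) forms; the exact repr depends on the Spark version
print(get_col_name("Column<'rank'>"))                # rank
print(get_col_name("Column<'rank AS `my alias`'>"))  # my alias
```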
ZygD
0

# Example when you have multiple tables
loc = '/mnt/'  # base path, e.g. '/mnt/' or 'whatever_location/', in case of an external table or any folder
table_names = ['customer', 'department']

for name in table_names:
    print(name)  # the existing table name

    # create a dataframe from the table's parquet files
    df = spark.read.format('parquet').load(f"{loc}{name.lower()}/")
    for col in df.dtypes:
        print(col[0])  # column name
        print(col[1])  # datatype of the respective column
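Since df.dtypes is just a list of (name, type) string tuples, the names and types can also be pulled out with plain Python. The sample values below are made up for a self-contained run:

```python
# df.dtypes returns a plain list of (column_name, type_string) tuples;
# these values are a made-up sample standing in for a real dataframe's dtypes.
dtypes = [('admit', 'int'), ('gre', 'double'), ('gpa', 'double'), ('rank', 'int')]

names = [name for name, _ in dtypes]  # just the column names
types = dict(dtypes)                  # name -> type lookup

print(names)         # ['admit', 'gre', 'gpa', 'rank']
print(types['gpa'])  # double
```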
mc-user
0

Since none of the answers has been marked as accepted, I may be over-simplifying the OP's ask, but:

my_list = spark_df.schema.fields
for field in my_list:
    print(field.name)
Le Poissons