80

In pandas, this can be done with column.name.

But how do you do the same when it's a column of a Spark dataframe?

E.g. the calling program has a Spark dataframe: spark_df

>>> spark_df.columns
['admit', 'gre', 'gpa', 'rank']

This program calls my function: my_function(spark_df['rank'])
In my_function, I need the name of the column, i.e. 'rank'.

If it were a pandas dataframe, we could use this:

>>> pandas_df['rank'].name
'rank'
aynber
Kaushik Acharya

8 Answers

118

You can get the names from the schema by doing

spark_df.schema.names

Printing the schema can be useful to visualize it as well

spark_df.printSchema()
David
  • What I wanted to know is the name of the column which is the input parameter to my function. The calling program will call my_function by my_function(spark_df['rank']). Inside my_function, how would I know the name of the column that was passed? – Kaushik Acharya Sep 29 '16 at 04:05
  • You can use `pyspark.sql.functions.col` to access a column by name. E.g., `df.filter( col(var_name) > 1)` – shuaiyuancn Sep 29 '16 at 10:17
  • @ShuaiYuan, that's not what I want. Inside my_function(col), how would I know the col name? The calling function calls my_function(spark_df['rank']); inside my_function, I want to extract 'rank' as the column name from the input parameter col. – Kaushik Acharya Sep 30 '16 at 04:40
  • That seems like an odd request. But you could change your function to take a string for the name of the column – David Sep 30 '16 at 13:24
  • You can change your function to `myfunc(df, name)`; then you have access to `name` in your function. When you need to use that column in the dataframe, do `df[name]`. – shuaiyuancn Sep 30 '16 at 13:58
  • Or, you could make use of [`Dataframe.Column.__repr__`](https://github.com/apache/spark/blob/master/python/pyspark/sql/column.py#L430). – shuaiyuancn Sep 30 '16 at 14:00
  • In my case I did a check for the name of the column: `if 'col_name' in df.schema.names: # do something`. Thanks for the answer. – Pablo Adames Sep 12 '20 at 04:43
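The pattern suggested in the comments can be sketched minimally: pass the column name alongside the dataframe, so the function always knows it. The dict below is a stand-in for a dataframe (an assumption for a self-contained run); `df[name]` works the same way on a Spark DataFrame.

```python
def my_function(df, name):
    # 'name' is the column name, known directly; df[name] gives the column
    print(f"working on column: {name}")
    return df[name]

# A plain dict standing in for a dataframe so this snippet runs anywhere;
# with Spark you would call my_function(spark_df, 'rank') instead.
fake_df = {"rank": [3, 1, 2]}
col = my_function(fake_df, "rank")
```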
25

The only way is to go down a level, to the underlying JVM object.

df.col._jc.toString().encode('utf8')

This is also how it is converted to a str in the pyspark code itself.

From pyspark/sql/column.py:

def __repr__(self):
    return 'Column<%s>' % self._jc.toString().encode('utf8')
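Given the `__repr__` shown above, the bare name can be unwrapped from the string form with a regex. The exact repr format varies by Spark and Python version (under Python 3 the `.encode('utf8')` produces a `b'...'` wrapper), so the sample strings below are assumptions, and `name_from_repr` is a hypothetical helper:

```python
import re

def name_from_repr(s):
    # Strip the Column<...> wrapper and an optional b'...' byte-literal wrapper
    m = re.match(r"Column<(?:b'(?P<q>[^']*)'|(?P<p>[^>]*))>$", s)
    return m.group("q") or m.group("p")

# Sample repr strings modelled on the __repr__ shown above
print(name_from_repr("Column<b'rank'>"))  # rank
print(name_from_repr("Column<rank>"))     # rank
```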
numeral
  • This won't pull out the alias if there is one, unfortunately. – santon Feb 05 '18 at 23:21
  • True, but you can easily parse that out if there's an alias: `re.search('AS `(\S*)`', col.alias('some_alias')._jc.toString()).group(1)` -> `'some_alias'`. Of course this isn't perfect, since we're doing some regex parsing, but I would hope it's unlikely you have a column name with "AS `bad`" in it. – numeral Feb 08 '18 at 20:30
  • @numeral does the underlying JVM code expose any kind of parser logic that can be used instead of hand-rolling it? – shadowtalker Jan 07 '19 at 19:49
  • @shadowtalker It doesn't seem like it after checking https://spark.apache.org/docs/2.2.0/api/java/index.html?org/apache/spark/sql/Column.html – numeral Jan 08 '19 at 20:15
  • The alias can also be extracted without using any regex: str(column).split(' AS ')[1].split('`')[1] – ciurlaro Nov 29 '20 at 16:42
10

Python

As @numeral correctly said, column._jc.toString() works fine in the case of unaliased columns.

In the case of aliased columns (i.e. column.alias("whatever")), the alias can be extracted without regular expressions: str(column).split(" AS ")[1].split("`")[1].

I don't know Scala syntax, but I'm sure it can be done the same way.
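The split trick can be shown on a plain string. The exact format of str(column) depends on the Spark version, so the sample below is an assumption modelled on a Spark 2.x-style repr of an aliased column:

```python
# Hypothetical str(column) for column.alias("whatever"); the exact
# repr format depends on the Spark version.
s = "Column<'rank AS `whatever`'>"

# Take everything after " AS ", then the text between the backticks
alias = s.split(" AS ")[1].split("`")[1]
print(alias)  # whatever
```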

ciurlaro
6

If you want the column names of your dataframe, you can use the pyspark.sql module. I'm not sure if it supports explicitly indexing a DataFrame by column name; I received this traceback:

>>> df.columns['High']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: list indices must be integers, not str

However, accessing the columns attribute on your dataframe, as you have done, will return a list of column names:

df.columns will return ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

If you want the column datatypes, you can access the dtypes attribute:

df.dtypes will return [('Date', 'timestamp'), ('Open', 'double'), ('High', 'double'), ('Low', 'double'), ('Close', 'double'), ('Volume', 'int'), ('Adj Close', 'double')]

If you want a particular column, you'll need to access it by index:

df.columns[2] will return 'High'
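Since df.columns is a plain Python list, the usual list operations also work in the other direction, from a name to its position (the column names below are taken from the example output above):

```python
# df.columns is just a Python list, so standard list operations apply
cols = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']

print(cols[2])             # 'High'  (index -> name, as above)
print(cols.index('High'))  # 2       (name -> index)
print('High' in cols)      # True    (membership test)
```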

JoeyC
Pat
4

I found the answer is very very simple...

// This is Java, but it should be the same in PySpark
Column col = ds.col("colName"); // the column object
String theNameOftheCol = col.toString();

The variable theNameOftheCol now holds "colName".

ZygD
Yucci Mel
2

This should cover even the strangest cases:

  • column with/ without an alias
  • several aliases
  • aliases containing multiple words
  • column names surrounded with backticks
  • intentional backticks in aliases
def get_col_name(col):
    if str(col)[-3] != '`':
        return str(col).split("'")[-2].split(" AS ")[-1]
    return str(col).replace('``', '`').split(" AS `")[-1].split("`'")[-2]
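A usage sketch of the function. The strings below are assumptions about what str(column) produces (Spark 2.x-style reprs), since the function only ever sees the string form; the definition is repeated so the snippet runs standalone:

```python
# get_col_name repeated from the answer above so this snippet is self-contained
def get_col_name(col):
    if str(col)[-3] != '`':
        return str(col).split("'")[-2].split(" AS ")[-1]
    return str(col).replace('``', '`').split(" AS `")[-1].split("`'")[-2]

# Hypothetical str(Column) forms; the exact repr depends on the Spark version
print(get_col_name("Column<'rank'>"))                # rank
print(get_col_name("Column<'rank AS `my alias`'>"))  # my alias
```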
ZygD
0

# Example when you have multiple tables
loc = '/mnt/'  # base path, e.g. '/mnt/' or 'whatever_location/', in case of an external table or any folder
table_names = ['customer', 'department']

for name in table_names:
    print(name)  # the existing table name

    # create a dataframe from the table's parquet files
    df = spark.read.format('parquet').load(f"{loc}{name.lower()}/")
    for col in df.dtypes:
        print(col[0])  # column name
        print(col[1])  # datatype of the respective column
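Since df.dtypes is just a list of (name, type) string tuples, the names and types can also be pulled out with plain Python. The sample values below are made up for a self-contained run:

```python
# df.dtypes returns a plain list of (column_name, type_string) tuples;
# these values are a made-up sample standing in for a real dataframe's dtypes.
dtypes = [('admit', 'int'), ('gre', 'double'), ('gpa', 'double'), ('rank', 'int')]

names = [name for name, _ in dtypes]  # just the column names
types = dict(dtypes)                  # name -> type lookup

print(names)         # ['admit', 'gre', 'gpa', 'rank']
print(types['gpa'])  # double
```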
mc-user
0

Since none of the answers has been marked as accepted, I may be over-simplifying the OP's ask, but:

my_list = spark_df.schema.fields
for field in my_list:
    print(field.name)
Le Poissons