Remove spaces from all column names in pyspark

Question

I am new to PySpark. I have received a CSV file with around 1000 columns, and I am using Databricks. Most of the column names contain spaces, e.g. "Total Revenue", "Total Age", etc. I need to replace the spaces in all column names with an underscore ('_').

I have tried this

foreach(DataColumn c in cloned.Columns)
    c.ColumnName = String.Join("_", c.ColumnName.Split());

but it didn't work in PySpark on Databricks.

score 20 · Answer 1 · answered Aug 02 '19 at 00:52

I would use select in conjunction with a list comprehension:

from pyspark.sql import functions as F

renamed_df = df.select([F.col(col).alias(col.replace(' ', '_')) for col in df.columns])

gmds

Excellent response (+1). If you only want to remove the leading and trailing spaces, you can do: `renamed_df = df.select([F.col(col).alias(col.strip()) for col in df.columns])` – nam Jun 11 '22 at 01:46
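
The two ideas above (replacing inner spaces and stripping outer ones) can be combined. Below is a minimal sketch of one way to do it, mirroring the `String.Join("_", ...Split())` attempt from the question: Python's `str.split()` with no arguments ignores leading and trailing whitespace and splits on runs of whitespace, so joining the pieces with `_` covers both cases. The helper name `clean_name` and the sample column names are made up for illustration.

```python
def clean_name(name: str) -> str:
    """Trim outer whitespace and join the remaining words with underscores."""
    # split() with no arguments splits on any run of whitespace and
    # ignores leading/trailing whitespace.
    return "_".join(name.split())


# Sample names standing in for df.columns (made up for illustration).
columns = ["Total Revenue", " Total Age ", "Net  Profit", "id"]
new_columns = [clean_name(c) for c in columns]
print(new_columns)
```

On a DataFrame, the same helper could presumably be dropped into the answer's pattern: `df.select([F.col(c).alias(clean_name(c)) for c in df.columns])`.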

score 2 · Answer 2 · answered Aug 02 '19 at 00:44

There are two ways to remove the spaces from the column names.

1. Use a schema while importing the data into a Spark DataFrame, for example:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType
Schema1 = StructType([StructField('field1', IntegerType(), True),
                     StructField('field2', StringType(), True),
                     StructField('field3', IntegerType(), True)])
df = spark.read.csv('/path/to/your/file.csv', header=True, schema=Schema1)

2. If the data has already been loaded into a DataFrame, use the DataFrame.withColumnRenamed function to rename a column:

df = df.withColumnRenamed("field name", "fieldName")
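
Since the question involves around 1000 columns, calling `withColumnRenamed` by hand for each one is impractical. Here is a rough sketch of how the call could be applied to every column in a loop, assuming `df` is a PySpark DataFrame. The helper name `rename_spaces` and its `sep` parameter are hypothetical, and the function relies only on `df.columns` and `df.withColumnRenamed`.

```python
from functools import reduce


def rename_spaces(df, sep="_"):
    """Rename every column of df, replacing spaces with sep.

    df is assumed to be a PySpark DataFrame; only its `columns`
    attribute and `withColumnRenamed` method are used.
    """
    # Start from df and apply one withColumnRenamed call per column name.
    return reduce(
        lambda acc, name: acc.withColumnRenamed(name, name.replace(" ", sep)),
        df.columns,
        df,
    )
```

Usage would be `df = rename_spaces(df)`. Each `withColumnRenamed` call adds a step to the query plan, so for very wide tables a single `select` or `toDF` call, as in the other answers, may be the lighter option.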

Interesting.. when I did `df =df.withColumnRenamed("field name", "fieldname")` , it did not work for me on databricks. This was the reason I was looking to replace "field name" with "field_name". — Balki, Oct 03 '20 at 11:11

score 2 · Answer 3 · answered Jan 22 '22 at 08:07

# Replace each space with an underscore (or any other separator you prefer)
NewColumns = (column.replace(' ', '_') for column in df.columns)
df = df.toDF(*NewColumns)
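
One thing worth checking before renaming this way: `toDF` assigns the new names by position, so two different original names could end up identical after the replacement, which tends to make later column references ambiguous. The following is a small sketch of such a check in plain Python; the helper name `find_collisions` and the sample names are invented for illustration.

```python
from collections import Counter


def find_collisions(columns, sep="_"):
    """Return new names that more than one original column would map to."""
    counts = Counter(c.replace(" ", sep) for c in columns)
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up sample: "Total Revenue" and "Total_Revenue" both become "Total_Revenue".
sample = ["Total Revenue", "Total_Revenue", "Total Age"]
print(find_collisions(sample))
```

Running `find_collisions(df.columns)` before `toDF` and getting an empty list would suggest the rename is safe.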

Techno_Eagle

score 0 · Answer 4 · answered Sep 28 '22 at 10:27

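This also works; I've been using it for a very long time. You just have to import re. Note that it removes spaces and other special characters rather than replacing them with underscores.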

import re
# Drop every character that is not a letter, digit, or comma (spaces included)
schema1 = [re.sub("[^a-zA-Z0-9,]", "", i) for i in df1.columns]
df2 = df1.toDF(*schema1)

jun41D

While this answer is technically correct, it is overkill. `[i.replace(" ", "") for i in df.columns]` will do the job too, without needing to import `re` – CheTesta Oct 05 '22 at 15:13
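
If the goal is the underscore-separated names asked for in the question, rather than deleting characters, the regex approach can be adapted. This is a sketch under that assumption, not the answer author's code: each run of non-alphanumeric characters becomes a single underscore, and underscores left at either end are trimmed. The helper name `sanitize` is hypothetical.

```python
import re


def sanitize(name: str) -> str:
    """Replace each run of non-alphanumeric characters with one underscore."""
    # "+" collapses consecutive special characters into a single "_";
    # strip("_") then removes underscores left at either end.
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")


print(sanitize("Total Revenue ($)"))
```

With a DataFrame this would slot into the same pattern as the answer: `df1.toDF(*[sanitize(c) for c in df1.columns])`.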

score 0 · Answer 5 · answered Mar 15 '23 at 07:52

You can use the strip function, which removes leading and trailing spaces from the column names. You may use

from pyspark.sql import functions as F
df = df.select([F.col(c).alias(c.strip()) for c in df.columns])

Instead of strip, you can also use Python's lstrip or rstrip to remove spaces from only the left or right side of each name.
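
To make the difference between the three string methods concrete, here is a short plain-Python illustration on a made-up column name. None of them touches the space inside the name, so this only helps with padding around it.

```python
name = "  Total Revenue  "

# repr() is used so the remaining spaces are visible in the output.
print(repr(name.strip()))   # both sides trimmed
print(repr(name.lstrip()))  # left side only
print(repr(name.rstrip()))  # right side only
```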

subro
