7

I am new to PySpark. I have received a CSV file with around 1000 columns, and I am using Databricks. Most of the column names contain spaces, e.g. "Total Revenue", "Total Age", etc. I need to replace the spaces in all of the column names with an underscore ('_').

I have tried this

foreach(DataColumn c in cloned.Columns)
    c.ColumnName = String.Join("_", c.ColumnName.Split());

but it didn't work in PySpark on Databricks.

camille

5 Answers

20

I would use select in conjunction with a list comprehension:

from pyspark.sql import functions as F

renamed_df = df.select([F.col(col).alias(col.replace(' ', '_')) for col in df.columns])
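
To see what the comprehension produces before touching the DataFrame, the renaming can be tried on a plain list of names. This is only a minimal sketch: `underscore_names` and the sample column names are made up for illustration, and `df` in the last comment is assumed to be your existing DataFrame.

```python
def underscore_names(columns):
    """Return the column names with every space replaced by an underscore."""
    return [name.replace(" ", "_") for name in columns]

# Hypothetical sample of the kind of header described in the question.
sample_columns = ["Total Revenue", "Total Age", "id"]
new_names = underscore_names(sample_columns)
print(new_names)  # ['Total_Revenue', 'Total_Age', 'id']

# With a real DataFrame (assumed to be named df), the same helper can feed toDF:
# renamed_df = df.toDF(*underscore_names(df.columns))
```

Names without spaces pass through unchanged, so the helper can be applied to every column without filtering first.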
gmds
  • Excellent response (+1). If you want to remove only the leading and trailing spaces you can do: `renamed_df = df.select([F.col(col).alias(col.strip()) for col in df.columns])` – nam Jun 11 '22 at 01:46
2

Two ways to remove the spaces from the column names:

  1. Use a schema while importing the data into a Spark DataFrame, for example:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType
Schema1 = StructType([StructField('field1', IntegerType(), True),
                     StructField('field2', StringType(), True),
                     StructField('field3', IntegerType(), True)])
df = spark.read.csv('/path/to/your/file.csv', header=True, schema=Schema1)
  2. If you have already imported the data into a DataFrame, use the DataFrame.withColumnRenamed function to change the name of a column:

    df=df.withColumnRenamed("field name","fieldName")
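
With around 1000 columns, calling `withColumnRenamed` by hand for each one is impractical, so the call can be folded over `df.columns` instead. The following is a sketch under the assumption that `df` is an existing DataFrame; `rename_spaces` is a hypothetical helper name, not a PySpark function. Keep in mind that `withColumnRenamed` is a no-op when the old name is not found in the schema, so a stray leading or trailing space in the header can make a rename silently do nothing.

```python
from functools import reduce

def rename_spaces(df, replacement="_"):
    """Chain withColumnRenamed over every column whose name contains a space."""
    return reduce(
        lambda acc, name: acc.withColumnRenamed(name, name.replace(" ", replacement)),
        [name for name in df.columns if " " in name],
        df,
    )

# Usage, assuming df already exists:
# df = rename_spaces(df)
```

Each chained call adds another step to the query plan, so for very wide DataFrames a single `select` or `toDF` call is probably the lighter option.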

Kishan Vyas
  • Interesting.. when I did `df =df.withColumnRenamed("field name", "fieldname")` , it did not work for me on databricks. This was the reason I was looking to replace "field name" with "field_name". – Balki Oct 03 '20 at 11:11
2
# Replace each space with the character of your choice (an underscore here)
new_columns = [column.replace(' ', '_') for column in df.columns]
df = df.toDF(*new_columns)
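
Replacing characters can make two different headers collapse into the same name (for instance `Total Age` and `Total_Age`), which can make later references to that column ambiguous. A small check before calling `toDF` can catch this. This is only a sketch; `find_collisions` and the sample names are hypothetical.

```python
from collections import Counter

def find_collisions(old_names, new_names):
    """Map each duplicated new name to the original names that produced it."""
    counts = Counter(new_names)
    collisions = {}
    for old, new in zip(old_names, new_names):
        if counts[new] > 1:
            collisions.setdefault(new, []).append(old)
    return collisions

old_names = ["Total Age", "Total_Age", "Total Revenue"]  # made-up header
new_names = [name.replace(" ", "_") for name in old_names]
print(find_collisions(old_names, new_names))
# {'Total_Age': ['Total Age', 'Total_Age']}
```

An empty dictionary means every new name is unique and the list can be passed to `toDF` as is.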
Techno_Eagle
0

This also works; I've been using it for a very long time. You just have to import re.

import re
# Drop every character that is not a letter, a digit, or a comma (spaces included)
schema1 = [re.sub("[^a-zA-Z0-9,]", "", i) for i in df1.columns]
df2 = df1.toDF(*schema1)
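
If the goal is underscores instead of removal, as in the question, one variation is to replace each run of unwanted characters with a single `_`. A minimal sketch; `sanitize` and the sample names are illustrative and not part of the answer above.

```python
import re

def sanitize(name):
    """Replace each run of non-alphanumeric characters with one underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_")  # drop underscores left at either end

print(sanitize("Total Revenue"))       # Total_Revenue
print(sanitize(" Total  Age (yrs) "))  # Total_Age_yrs

# Assuming df1 is the DataFrame from the answer:
# df2 = df1.toDF(*[sanitize(c) for c in df1.columns])
```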
jun41D
  • While this answer is technically correct, it is actually an overreach. `[i.replace(" ", "") for i in df.columns]` will do the work too, without the need of importing `re` – CheTesta Oct 05 '22 at 15:13
0

You can use the strip method, which removes leading and trailing spaces from the column names (spaces inside a name are left as they are). For example:

from pyspark.sql import functions as F
df = df.select([F.col(c).alias(c.strip()) for c in df.columns])

Instead of strip, you can also use Python's lstrip or rstrip to trim only the leading or only the trailing spaces.
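
The difference between the three methods is easy to check on a single name. A quick sketch with a made-up column name:

```python
name = "  Total Revenue "  # hypothetical header with padding on both sides

print(repr(name.strip()))   # 'Total Revenue'
print(repr(name.lstrip()))  # 'Total Revenue '
print(repr(name.rstrip()))  # '  Total Revenue'

# Combining strip with replace handles both the edges and the inner space:
print(repr(name.strip().replace(" ", "_")))  # 'Total_Revenue'
```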

subro