
I am reading a CSV file with 350 columns; all columns are of type string. After reading it into a DataFrame, I want to substring every column's value read from the CSV file (characters 1 to 100 at most) while writing to a Delta table. Can someone kindly guide me on how to perform this task? Thanks.

for field in df.schema.fields:
    # collect() pulls all rows to the driver; [0] looks at only the first row
    vField = df.collect()[0][field.name]
    if vField is not None:
        # this only renames the schema field object; it never changes the data in df
        field.name = vField[0:20]

It did not work.

Wasim Syed

1 Answer


Use Spark's substring(str, pos, len) function and iterate through all the columns, keeping only the first 100 characters of each. Note that pos is 1-based and the third argument is a length, not an end position. Your original attempt doesn't work because collect() only pulls the rows to the driver, and reassigning field.name renames the schema field without touching the DataFrame itself.

Example:

from pyspark.sql.functions import substring, col

df = spark.createDataFrame([('a', '1', 'a')], ['i', 'j', 'k'])

# substring() is 1-based and takes a length, so (1, 100) keeps the first 100 characters
df.select([substring(col(f), 1, 100).alias(f) for f in df.columns]).show(10, False)
#+---+---+---+
#|i  |j  |k  |
#+---+---+---+
#|a  |1  |a  |
#+---+---+---+
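
Applying the same idea to your case end to end, a minimal sketch might look like the following. This assumes a SparkSession with Delta Lake support (e.g. Databricks); the file path "/mnt/data/input.csv" and the table name "my_delta_table" are placeholders for your own values:

from pyspark.sql.functions import substring, col

# read the CSV; with no explicit schema all columns come in as strings
df = spark.read.csv("/mnt/data/input.csv", header=True)

# truncate every one of the 350 columns to its first 100 characters
truncated = df.select([substring(col(c), 1, 100).alias(c) for c in df.columns])

# write the result out as a Delta table
truncated.write.format("delta").mode("overwrite").saveAsTable("my_delta_table")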
notNull