
I am reading a CSV file with 350 columns; all columns are of type string. After reading it into a DataFrame, I want to substring every column's value read from the CSV file (characters 1 to 100 at most) while writing to a Delta table. Can someone kindly guide me on how to perform this task? Thanks.

for field in df.schema.fields:
    # collect() pulls all rows to the driver; [0] looks at only the first row
    vField = df.collect()[0][field.name]
    if vField is not None:
        # this only renames the schema field object; it never changes the data in df
        field.name = vField[0:20]

It did not work.

Wasim Syed

1 Answer


Use Spark's substring(str, pos, len) function and iterate through all the columns, keeping only the first 100 characters of each. Note that pos is 1-based and the third argument is a length, not an end position. Your original attempt doesn't work because collect() only pulls the rows to the driver, and reassigning field.name renames the schema field without touching the DataFrame itself.

Example:

from pyspark.sql.functions import substring, col

df = spark.createDataFrame([('a', '1', 'a')], ['i', 'j', 'k'])

# substring() is 1-based and takes a length, so (1, 100) keeps the first 100 characters
df.select([substring(col(f), 1, 100).alias(f) for f in df.columns]).show(10, False)
#+---+---+---+
#|i  |j  |k  |
#+---+---+---+
#|a  |1  |a  |
#+---+---+---+
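
Applying the same idea to your case end to end, a minimal sketch might look like the following. This assumes a SparkSession with Delta Lake support (e.g. Databricks); the file path "/mnt/data/input.csv" and the table name "my_delta_table" are placeholders for your own values:

from pyspark.sql.functions import substring, col

# read the CSV; with no explicit schema all columns come in as strings
df = spark.read.csv("/mnt/data/input.csv", header=True)

# truncate every one of the 350 columns to its first 100 characters
truncated = df.select([substring(col(c), 1, 100).alias(c) for c in df.columns])

# write the result out as a Delta table
truncated.write.format("delta").mode("overwrite").saveAsTable("my_delta_table")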
notNull