select from Pyspark dataframe using variable

Question

I am trying to run the substring function on a column (CompleteLine), using a variable (StringStartPoint) for the start position.

I tried the two options below, but each fails for a different reason. How can I use a variable inside the select function?

StringStartPoint=10

df2 = df1.select(f.substring(f.col("CompleteLine"),StringStartPoint,f.col("StringLength"))).alias('MySubString')

TypeError: Column is not iterable. The 3rd parameter is not recognized as a value.

df2 = df1.select(f.expr("substring(col(CompleteLine),StringStartPoint,col(StringLength))").alias('MySubString'))

AnalysisException: Cannot resolve StringStartPoint given input column. Here the 2nd parameter is treated as a DataFrame field rather than as a variable.

Does this answer your question? [use length function in substring in spark](https://stackoverflow.com/questions/46353360/use-length-function-in-substring-in-spark) — notNull, Sep 03 '21 at 20:07

score 0 · Answer 1 · answered Sep 03 '21 at 20:03

The 3rd parameter of substring() should be of type integer, not Column.

Pass the length of the column as the argument for the 3rd parameter.

StringStartPoint=10

df2 = df1.select(f.substring(f.col("CompleteLine"),StringStartPoint,f.length(f.col("CompleteLine"))).alias('MySubString'))

answered Sep 03 '21 at 20:03

Mohana B C

StringLength is an integer type field in df1. It is already evaluated through the length function. – iamaj Sep 03 '21 at 20:29
The value you are passing is of type `Column`, even though the column's data type is `int`. Do you need the substring from the starting point to the end of the main string? If so, there is no need to pass the length at all: `df.select(expr('substring(CompleteLine, 10)'))` – Mohana B C Sep 03 '21 at 20:48
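
The `expr` idea from the comment above can also cover the case where the length comes from the StringLength column and the start position comes from a Python variable. The following is a minimal sketch, not a definitive solution. It assumes a PySpark version in which `f.substring()` accepts only Python integers for position and length, and that `df1` has the columns `CompleteLine` and `StringLength` as described in the question. The helper names `substring_sql`, `with_expr` and `with_substr` are hypothetical and only serve the illustration.

```python
def substring_sql(column: str, start: int, length_column: str) -> str:
    """Build a Spark SQL substring() call with the start position inlined as a literal."""
    # int() ensures that only a number is interpolated into the SQL text.
    return f"substring({column}, {int(start)}, {length_column})"


def with_expr(df, start: int):
    """Variant 1: pass the generated SQL text to f.expr()."""
    from pyspark.sql import functions as f  # local import: only needed when Spark is used

    return df.select(
        f.expr(substring_sql("CompleteLine", start, "StringLength")).alias("MySubString")
    )


def with_substr(df, start: int):
    """Variant 2: Column.substr() with both arguments given as Columns."""
    from pyspark.sql import functions as f

    # substr() expects start and length to be of the same kind, so the
    # Python integer is wrapped in f.lit() to match the StringLength column.
    return df.select(
        f.col("CompleteLine").substr(f.lit(start), f.col("StringLength")).alias("MySubString")
    )


StringStartPoint = 10
sql_text = substring_sql("CompleteLine", StringStartPoint, "StringLength")
print(sql_text)  # substring(CompleteLine, 10, StringLength)
```

With a running Spark session, either `df2 = with_expr(df1, StringStartPoint)` or `df2 = with_substr(df1, StringStartPoint)` should give a single-column DataFrame named MySubString. In the first variant the variable is resolved by Python before Spark parses the expression, which avoids the "Cannot resolve StringStartPoint" error from the question.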
