
I have a dataframe to which I'm trying to add a column that takes the logarithm of an existing column in the same dataframe. I am trying this:

    df = df.withColumn("logvalue", log(df["prediction_column"]))

I have already checked the schema of the dataframe, and the prediction column is of float type. But I keep getting the error `TypeError: a float is required`.

What am I missing here? Any suggestions would be a great help.

arnab_0017
  • try `log("prediction_column")`, just the column name – gaw Oct 22 '18 at 13:08
  • You're using the wrong `log` function; my guess is you're trying `numpy.log` or `math.log`. Try adding `from pyspark.sql.functions import log` (this will be natural log). – pault Oct 22 '18 at 13:55
  • any idea why `np.log` does not work? I was not able to figure this out... – Mr. Hobo Dec 03 '20 at 09:38
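
A minimal, self-contained sketch of the fix pault describes, using hypothetical sample data in place of the asker's dataframe:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import log

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample data standing in for the asker's dataframe
    df = spark.createDataFrame([(2.5,), (7.0,)], ["prediction_column"])

    # pyspark.sql.functions.log builds a Column expression (natural log);
    # math.log expects a plain Python float, which is why it fails with
    # "TypeError: a float is required" when handed a Column
    df = df.withColumn("logvalue", log(df["prediction_column"]))
    df.show()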

2 Answers


You can try the following; it worked for me:

    from pyspark.sql.functions import col, log10

    df = df.withColumn("logvalue", log10(col("prediction_column")))
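
Note that `log10` is the base-10 logarithm, while the bare `log` from `pyspark.sql.functions` used in the question is the natural logarithm. If you need an explicit base, `log` also accepts the base as an optional first argument; a small sketch, reusing the question's `df` and column name:

    from pyspark.sql.functions import col, log

    # Two-argument form: log(base, column). With base 10.0 this is
    # equivalent to log10(col("prediction_column")) above.
    df = df.withColumn("logvalue", log(10.0, col("prediction_column")))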
Natty

Just use the column name without the dataframe, or use the function `col`; in that case you have to import it with `from pyspark.sql.functions import col` and then write `log(col("double_col"))`:

    from pyspark.sql.functions import log

    df = spark.createDataFrame([
        (1.3, "s"),
        (10.3, "t"),
        (3.3, "x"),
        (1.5, "u"),
        (1.3, "v")
    ], ("double_col", "char"))

    print(df.schema)
    df.withColumn("bla", log("double_col")).show()

Output:

    StructType(List(StructField(double_col,DoubleType,true),StructField(char,StringType,true)))
    +----------+----+-------------------+
    |double_col|char|                bla|
    +----------+----+-------------------+
    |       1.3|   s|0.26236426446749106|
    |      10.3|   t|   2.33214389523559|
    |       3.3|   x| 1.1939224684724346|
    |       1.5|   u| 0.4054651081081644|
    |       1.3|   v|0.26236426446749106|
    +----------+----+-------------------+
gaw