
I have a dataframe to which I'm trying to add a column that takes the logarithm of an existing column in the same dataframe. I am trying this:

    df = df.withColumn("logvalue", log(df["prediction_column"]))

I have already checked the schema of the dataframe, and the prediction column is of float type. But I keep getting the error `TypeError: a float is required`.

What am I missing here? Any suggestions would be a great help.

arnab_0017
  • try `log("prediction_column")`, just the column name – gaw Oct 22 '18 at 13:08
  • You're using the wrong `log` function; my guess is you're trying `numpy.log` or `math.log`. Try adding `from pyspark.sql.functions import log` (this will be natural log). – pault Oct 22 '18 at 13:55
  • any idea why `np.log` does not work? I was not able to figure this out... – Mr. Hobo Dec 03 '20 at 09:38
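
A minimal, self-contained sketch of the fix pault describes, using hypothetical sample data in place of the asker's dataframe:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import log

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample data standing in for the asker's dataframe
    df = spark.createDataFrame([(2.5,), (7.0,)], ["prediction_column"])

    # pyspark.sql.functions.log builds a Column expression (natural log);
    # math.log expects a plain Python float, which is why it fails with
    # "TypeError: a float is required" when handed a Column
    df = df.withColumn("logvalue", log(df["prediction_column"]))
    df.show()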

2 Answers


You can try the following; it worked for me:

    from pyspark.sql.functions import col, log10

    df = df.withColumn("logvalue", log10(col("prediction_column")))
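
Note that `log10` is the base-10 logarithm, while the bare `log` from `pyspark.sql.functions` used in the question is the natural logarithm. If you need an explicit base, `log` also accepts the base as an optional first argument; a small sketch, reusing the question's `df` and column name:

    from pyspark.sql.functions import col, log

    # Two-argument form: log(base, column). With base 10.0 this is
    # equivalent to log10(col("prediction_column")) above.
    df = df.withColumn("logvalue", log(10.0, col("prediction_column")))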
Natty

Just use the column name without the dataframe, or use the function `col`; in that case you have to import it with `from pyspark.sql.functions import col` and then write `log(col("double_col"))`:

    from pyspark.sql.functions import log

    df = spark.createDataFrame([
        (1.3, "s"),
        (10.3, "t"),
        (3.3, "x"),
        (1.5, "u"),
        (1.3, "v")
    ], ("double_col", "char"))

    print(df.schema)
    df.withColumn("bla", log("double_col")).show()

Output:

    StructType(List(StructField(double_col,DoubleType,true),StructField(char,StringType,true)))
    +----------+----+-------------------+
    |double_col|char|                bla|
    +----------+----+-------------------+
    |       1.3|   s|0.26236426446749106|
    |      10.3|   t|   2.33214389523559|
    |       3.3|   x| 1.1939224684724346|
    |       1.5|   u| 0.4054651081081644|
    |       1.3|   v|0.26236426446749106|
    +----------+----+-------------------+
gaw