
I'm new to the Spark world and I would like to compute an extra column with integer modulo in PySpark. I could not find this operator among the built-in operators.

Does anyone have any idea?

ZygD

1 Answer


You can simply use the % operator between columns, as you would in plain Python:

from pyspark.sql.functions import col

df = spark.createDataFrame([(6, 3), (7, 3), (13, 6), (5, 0)], ["x", "y"])
df.withColumn("mod", col("x") % col("y")).show()

#+---+---+----+
#|  x|  y| mod|
#+---+---+----+
#|  6|  3|   0|
#|  7|  3|   1|
#| 13|  6|   1|
#|  5|  0|null|
#+---+---+----+

Alternatively, you can use the Spark built-in mod function or the % operator via SQL syntax:

from pyspark.sql.functions import expr

# using mod function
df.withColumn("mod", expr("mod(x, y)")).show()

# using SQL %
df.withColumn("mod", expr("x % y")).show()
pault
blackbishop
    Warning for other users that modulo in pyspark can return negative results; the same behavior as with SQL, but different from the mathematical definition and the python behavior. See: https://stackoverflow.com/q/10472783/5675094 – mimocha Feb 07 '23 at 09:20
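To make the sign difference concrete, here is a small plain-Python sketch (no Spark session needed). Python's % takes the sign of the divisor, while Spark/SQL's % and mod follow the sign of the dividend, like C; math.fmod reproduces that dividend-sign behavior for comparison. If you need a nonnegative result in Spark, the SQL function pmod (e.g. expr("pmod(x, y)")) is one option.

```python
import math

# Python's %: result takes the sign of the divisor
print(-7 % 3)                 # 2

# Spark/SQL-style modulo: result takes the sign of the dividend,
# emulated here with math.fmod
print(int(math.fmod(-7, 3)))  # -1

# A nonnegative "pmod"-style result can be recovered in Python with:
print((-7 % 3 + 3) % 3)       # 2
```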