7

I'm writing Spark code in Python. I have a col(execution_date) that is a timestamp. How would I turn that into a column that is called is_weekend, that has a value of 1 if the date is a weekend and 0 if it's a week day?

pault
  • 41,343
  • 15
  • 107
  • 149
Daniel Kaplan
  • 62,768
  • 50
  • 234
  • 356

1 Answers1

9

Suppose you had the following spark DataFrame:

df.show()
#+-------------------+
#|     execution_date|
#+-------------------+
#|2019-08-01 00:00:00|
#|2019-08-02 00:00:00|
#|2019-08-03 00:00:00|
#|2019-08-04 00:00:00|
#|2019-08-05 00:00:00|
#|2019-08-06 00:00:00|
#|2019-08-07 00:00:00|
#|2019-08-08 00:00:00|
#|2019-08-09 00:00:00|
#|2019-08-10 00:00:00|
#+-------------------+

Spark Version 2.3+

You can use pyspark.sql.functions.dayofweek, which will return a number between 1 and 7 (Sunday = 1, Saturday = 7).

from pyspark.sql.functions import dayofweek

df.withColumn("is_weekend", dayofweek("execution_date").isin([1,7]).cast("int")).show()
#+-------------------+----------+
#|     execution_date|is_weekend|
#+-------------------+----------+
#|2019-08-01 00:00:00|         0|
#|2019-08-02 00:00:00|         0|
#|2019-08-03 00:00:00|         1|
#|2019-08-04 00:00:00|         1|
#|2019-08-05 00:00:00|         0|
#|2019-08-06 00:00:00|         0|
#|2019-08-07 00:00:00|         0|
#|2019-08-08 00:00:00|         0|
#|2019-08-09 00:00:00|         0|
#|2019-08-10 00:00:00|         1|
#+-------------------+----------+

Spark Versions 1.5+

You can use pyspark.sql.functions.date_format with format = 'EEE'

from pyspark.sql.functions import date_format

df.withColumn(
    "is_weekend", 
    date_format("execution_date", 'EEE').isin(["Sat", "Sun"]).cast("int")
).show()
#+-------------------+----------+
#|     execution_date|is_weekend|
#+-------------------+----------+
#|2019-08-01 00:00:00|         0|
#|2019-08-02 00:00:00|         0|
#|2019-08-03 00:00:00|         1|
#|2019-08-04 00:00:00|         1|
#|2019-08-05 00:00:00|         0|
#|2019-08-06 00:00:00|         0|
#|2019-08-07 00:00:00|         0|
#|2019-08-08 00:00:00|         0|
#|2019-08-09 00:00:00|         0|
#|2019-08-10 00:00:00|         1|
#+-------------------+----------+

For completeness, the intermediate results of both are shown here:

df.withColumn("dow", dayofweek("execution_date"))\
    .withColumn("day", date_format("execution_date", 'EEE'))\
    .show()
#+-------------------+---+---+
#|     execution_date|dow|day|
#+-------------------+---+---+
#|2019-08-01 00:00:00|  5|Thu|
#|2019-08-02 00:00:00|  6|Fri|
#|2019-08-03 00:00:00|  7|Sat|
#|2019-08-04 00:00:00|  1|Sun|
#|2019-08-05 00:00:00|  2|Mon|
#|2019-08-06 00:00:00|  3|Tue|
#|2019-08-07 00:00:00|  4|Wed|
#|2019-08-08 00:00:00|  5|Thu|
#|2019-08-09 00:00:00|  6|Fri|
#|2019-08-10 00:00:00|  7|Sat|
#+-------------------+---+---+
pault
  • 41,343
  • 15
  • 107
  • 149