I'm writing Spark code in Python. I have a col(execution_date)
that is a timestamp. How would I turn that into a column that is called is_weekend
, that has a value of 1
if the date is a weekend and 0
if it's a week day?
Asked
Active
Viewed 4,846 times
7

pault
- 41,343
- 15
- 107
- 149

Daniel Kaplan
- 62,768
- 50
- 234
- 356
1 Answers
9
Suppose you had the following spark DataFrame:
df.show()
#+-------------------+
#| execution_date|
#+-------------------+
#|2019-08-01 00:00:00|
#|2019-08-02 00:00:00|
#|2019-08-03 00:00:00|
#|2019-08-04 00:00:00|
#|2019-08-05 00:00:00|
#|2019-08-06 00:00:00|
#|2019-08-07 00:00:00|
#|2019-08-08 00:00:00|
#|2019-08-09 00:00:00|
#|2019-08-10 00:00:00|
#+-------------------+
Spark Version 2.3+
You can use pyspark.sql.functions.dayofweek
, which will return a number between 1 and 7 (Sunday = 1, Saturday = 7).
from pyspark.sql.functions import dayofweek
df.withColumn("is_weekend", dayofweek("execution_date").isin([1,7]).cast("int")).show()
#+-------------------+----------+
#| execution_date|is_weekend|
#+-------------------+----------+
#|2019-08-01 00:00:00| 0|
#|2019-08-02 00:00:00| 0|
#|2019-08-03 00:00:00| 1|
#|2019-08-04 00:00:00| 1|
#|2019-08-05 00:00:00| 0|
#|2019-08-06 00:00:00| 0|
#|2019-08-07 00:00:00| 0|
#|2019-08-08 00:00:00| 0|
#|2019-08-09 00:00:00| 0|
#|2019-08-10 00:00:00| 1|
#+-------------------+----------+
Spark Versions 1.5+
You can use pyspark.sql.functions.date_format
with format = 'EEE'
from pyspark.sql.functions import date_format
df.withColumn(
"is_weekend",
date_format("execution_date", 'EEE').isin(["Sat", "Sun"]).cast("int")
).show()
#+-------------------+----------+
#| execution_date|is_weekend|
#+-------------------+----------+
#|2019-08-01 00:00:00| 0|
#|2019-08-02 00:00:00| 0|
#|2019-08-03 00:00:00| 1|
#|2019-08-04 00:00:00| 1|
#|2019-08-05 00:00:00| 0|
#|2019-08-06 00:00:00| 0|
#|2019-08-07 00:00:00| 0|
#|2019-08-08 00:00:00| 0|
#|2019-08-09 00:00:00| 0|
#|2019-08-10 00:00:00| 1|
#+-------------------+----------+
For completeness, the intermediate results of both are shown here:
df.withColumn("dow", dayofweek("execution_date"))\
.withColumn("day", date_format("execution_date", 'EEE'))\
.show()
#+-------------------+---+---+
#| execution_date|dow|day|
#+-------------------+---+---+
#|2019-08-01 00:00:00| 5|Thu|
#|2019-08-02 00:00:00| 6|Fri|
#|2019-08-03 00:00:00| 7|Sat|
#|2019-08-04 00:00:00| 1|Sun|
#|2019-08-05 00:00:00| 2|Mon|
#|2019-08-06 00:00:00| 3|Tue|
#|2019-08-07 00:00:00| 4|Wed|
#|2019-08-08 00:00:00| 5|Thu|
#|2019-08-09 00:00:00| 6|Fri|
#|2019-08-10 00:00:00| 7|Sat|
#+-------------------+---+---+

pault
- 41,343
- 15
- 107
- 149