I am trying to run a PySpark query by declaring my date variables and using those variables in the query itself. However, the output does not reflect the date filter. My existing code is below:
# date bounds, quoted so they drop into the SQL string as literals
strt_dt = "'2018-01-01'"
end_dt = "'2019-12-31'"
df = sqlc.sql('Select * from tbl where dt > {0} and dt < {1}'.format(strt_dt, end_dt))
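I check the resulting date range roughly like this (a minimal sketch, only inspecting the upper bound of dt):

from pyspark.sql import functions as F

# look at the largest dt value that survived the query
df.agg(F.max('dt')).show()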
The max(dt) comes back greater than 2019-12-31, which should not be the case given the filter in the query. I can get the required date range by further filtering this Spark DataFrame with the code below (adapted from Pyspark: Filter dataframe based on multiple conditions):
from pyspark.sql import functions as F
from pyspark.sql.functions import col

strt_dt = '2018-01-01'
end_dt = '2019-12-31'
df = sqlc.sql('Select * from tbl')
# keep only rows inside the required date range
df = df.filter((col('dt') >= F.lit(strt_dt)) & (col('dt') < F.lit(end_dt)))
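Checking max(dt) the same way on this filtered DataFrame gives a date inside the expected range, so the DataFrame-level filter itself behaves correctly (same sketch as above):

df.agg(F.max('dt')).show()  # within 2018-01-01 .. 2019-12-31 here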
I want to avoid filtering the Spark DataFrame this way because I do not want to create a DataFrame with all of the data first. Please let me know what I am doing wrong in the first block of code.
PS: When I substitute variables of types other than date in the first block of code, the filter works for that column, i.e. the problem is specific to dates stored as strings in my Hive table.
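For example, something along these lines filters as expected (the column name and value here are made up purely for illustration):

# hypothetical non-date column; the substituted variable works here
cust_id = "'C123'"
df = sqlc.sql('Select * from tbl where cust_id = {0}'.format(cust_id))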
Thanks in advance