I'm trying to work out the number of working days between two dates for each row. The table (dfDates) is laid out as follows:
Key | StartDateKey | EndDateKey |
---|---|---|
1 | 20171227 | 20180104 |
2 | 20171227 | 20171229 |
I have another table (dfDimDate) with all the relevant date keys and whether the date key is a working day or not:
DateKey | WorkDayFlag |
---|---|
20171227 | 1 |
20171228 | 1 |
20171229 | 1 |
20171230 | 0 |
20171231 | 0 |
20180101 | 0 |
20180102 | 1 |
20180103 | 1 |
20180104 | 1 |
I'm expecting a result like this:
Key | WorkingDays |
---|---|
1 | 6 |
2 | 3 |
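For reference, this is roughly how the sample frames above can be built (I'm assuming both the date keys and the flag are stored as integers):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Date ranges to evaluate, one row per Key
dfDates = spark.createDataFrame(
    [(1, 20171227, 20180104), (2, 20171227, 20171229)],
    ["Key", "StartDateKey", "EndDateKey"],
)

# Date dimension with a working-day flag per date key
dfDimDate = spark.createDataFrame(
    [
        (20171227, 1), (20171228, 1), (20171229, 1),
        (20171230, 0), (20171231, 0), (20180101, 0),
        (20180102, 1), (20180103, 1), (20180104, 1),
    ],
    ["DateKey", "WorkDayFlag"],
)
```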
So far, I've written this (I realise it isn't complete enough to get me to the result above):
workingdays = []
for i in range(0, dfDates.count()):
    value = dfDimDate.filter(
        (dfDimDate.DateKey >= dfDates.collect()[i][1])
        & (dfDimDate.DateKey <= dfDates.collect()[i][2])
    ).agg({'WorkDayFlag': 'sum'})
    workingdays.append(value.collect())
However, only null values are being returned. I've also noticed this is very slow; it ran for 54 seconds before it errored.
I think I understand what the error is about, but I'm not sure how to fix it. I'm also not sure how to optimise the code so it runs faster. I'm looking for a solution in PySpark or Spark SQL (whichever is easiest).
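One thing I've been wondering about (just an untested sketch, so it may well be off) is whether a range join plus a group-by could replace the loop entirely, something like this:

```python
from pyspark.sql import functions as F

# Sketch only: join every DateKey that falls inside each Key's date range,
# then add up the WorkDayFlag values per Key
workingDays = (
    dfDates.join(
        dfDimDate,
        (dfDimDate.DateKey >= dfDates.StartDateKey)
        & (dfDimDate.DateKey <= dfDates.EndDateKey),
    )
    .groupBy("Key")
    .agg(F.sum("WorkDayFlag").alias("WorkingDays"))
)
workingDays.show()
```

But I don't know whether that's the idiomatic way to do it, or how it would behave on larger tables.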
Many thanks,
Carolina
Edit: The error below was resolved thanks to a suggestion from @samkart, who said to put the agg after the filter:
AnalysisException: Resolved attribute(s) DateKey#17075 missing from sum(WorkDayFlag)#22142L in operator !Filter ((DateKey#17075 <= 20171228) AND (DateKey#17075 >= 20171227)).;
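For anyone else hitting this: the version that raised the exception aggregated first and filtered afterwards, roughly like the sketch below. Once the agg has run, the only column left is sum(WorkDayFlag), so DateKey can no longer be resolved; filtering before aggregating (as in the loop above) made the error go away.

```python
# This ordering reproduces the AnalysisException (bounds taken from the error message):
# after .agg() the DataFrame only contains sum(WorkDayFlag), so DateKey is gone.
value = (
    dfDimDate.agg({'WorkDayFlag': 'sum'})
    .filter((dfDimDate.DateKey >= 20171227) & (dfDimDate.DateKey <= 20171228))
)

# Filtering first, then aggregating, works:
value = (
    dfDimDate.filter((dfDimDate.DateKey >= 20171227) & (dfDimDate.DateKey <= 20171228))
    .agg({'WorkDayFlag': 'sum'})
)
```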