I am trying to derive DOB in the date datatype format (YYYY-MM-DD) in pyspark from a column that has the info reversed, as a large integer, and with single digit day/month's leading zeroes removed. This last point means the data varies between 6-8 digits in length. There are two circumstances for 7 digit numbers where deriving isn't possible due to ambiguity, I'm happy to output null values for these cases.
Examples:
- 831945 = 8th March 1945
- 1232000 = 12th March 2000
- 11102000 = 11th Oct 2000
Ambiguous examples:
- 111YYYY = 1st Nov or 11th Jan
- 112YYYY = 1st Dec or 11th Feb
It's quite complicated logic to code which is a bit above me. I'm thinking I could derive new year, month and day columns before deriving the DOB. First derive day and year cols whilst simultaneously dropping these digits from the merged digits col to leave the month.
Year = last 4 digits.
Day:
- if len 6 then day = first digit
- if len 7 and 2nd digit = 0 or >=2 then day = first two digits
- if len 7 and 2nd digit = 1 and 3rd digit ==0 then day = first digit
- if len 7 and 2nd digit = 1 and 3rd digit ==1,2 then output null
- if len 7 and 2nd digit = 1 and 3rd digit >=3 then day = first two digits
- if len 7 and 2nd digit =>2 then day = first two digits
- if len 8 then day = first two digits
Before:
merged digits | Day | Year
1232000 | |
After:
merged digits | Day | Year
3 | 12 | 2000
Just an idea. Thanks for your help and ideas!