Derive dob / date of birth from merged numbers in pyspark or python?

Question

I am trying to derive DOB in the date datatype format (YYYY-MM-DD) in pyspark from a column that has the info reversed, as a large integer, and with single digit day/month's leading zeroes removed. This last point means the data varies between 6-8 digits in length. There are two circumstances for 7 digit numbers where deriving isn't possible due to ambiguity, I'm happy to output null values for these cases.

Examples:

831945 = 8th March 1945
1232000 = 12th March 2000
11102000 = 11th Oct 2000

Ambiguous examples:

111YYYY = 1st Nov or 11th Jan
112YYYY = 1st Dec or 11th Feb

It's quite complicated logic to code which is a bit above me. I'm thinking I could derive new year, month and day columns before deriving the DOB. First derive day and year cols whilst simultaneously dropping these digits from the merged digits col to leave the month.

Year = last 4 digits.

Day:

if len 6 then day = first digit
if len 7 and 2nd digit = 0 or >=2 then day = first two digits
if len 7 and 2nd digit = 1 and 3rd digit ==0 then day = first digit
if len 7 and 2nd digit = 1 and 3rd digit ==1,2 then output null
if len 7 and 2nd digit = 1 and 3rd digit >=3 then day = first two digits
if len 7 and 2nd digit =>2 then day = first two digits
if len 8 then day = first two digits

Before:

merged digits  | Day | Year
1232000        |     |

After:

merged digits  | Day | Year
3              | 12  | 2000

Just an idea. Thanks for your help and ideas!

what have you tried so far ? You can do that simply with [`when`](https://spark.apache.org/docs/2.4.7/api/python/pyspark.sql.html#pyspark.sql.functions.when) and [`substring`](https://spark.apache.org/docs/2.4.7/api/python/pyspark.sql.html#pyspark.sql.functions.substring). I do not see what is blocking you. — Steven, Aug 02 '21 at 12:21

score 1 · Accepted Answer · answered Aug 02 '21 at 12:35

Just following your logic :

from pyspark.sql import functions as F


F.when(
    F.size("merged_digits") == 6,
    F.Array(
        F.lpad(F.substring("merged_digits", 1, 1), 2, "0"),
        F.lpad(F.substring("merged_digits", 2, 1), 2, "0"),
        F.substring("merged_digits", 3, 4),
    ),
).when(
    F.size("merged_digits") == 8,
    F.Array(
        F.substring("merged_digits", 1, 2),
        F.substring("merged_digits", 3, 2),
        F.substring("merged_digits", 5, 4),
    ),
).when(
    F.substring("merged_digits", 1, 1) == "0",
    F.Array(
        F.substring("merged_digits", 1, 2),
        F.lpad(F.substring("merged_digits", 3, 1), 2, "0"),
        F.substring("merged_digits", 4, 4),
    ),
).when(
    F.substring("merged_digits", 2, 1).cast("int") >= 2,
    F.Array(
        F.substring("merged_digits", 1, 2),
        F.lpad(F.substring("merged_digits", 3, 1), 2, "0"),
        F.substring("merged_digits", 4, 4),
    ),
).when(
    (F.substring("merged_digits", 2, 1) == "1")
    & (F.substring("merged_digits", 3, 1) == "0"),
    F.Array(
        F.lpad(F.substring("merged_digits", 1, 1), 2, "0"),
        F.substring("merged_digits", 2, 2),
        F.substring("merged_digits", 4, 4),
    ),
)

The output I created is an array with first element as day, second as month, and last as year, all left-padded with 0 to have the same format.

Derive dob / date of birth from merged numbers in pyspark or python?

1 Answers1