I'm working on a transformation and stuck with a common problem. Any assist is well appreciated.
Scenario: Step-1: Reading from a delta table.
+--------+------------------+
| emp_id | str |
+--------+------------------+
| 1 | name=qwerty. |
| 2 | age=22 |
| 3 | job=googling |
| 4 | dob=12-Jan-2001 |
| 5 | weight=62.7. |
+--------+------------------+
Step-2: I'm refining the data and outputting it into another delta table dynamically (No predefined schema). Let's say I'm adding null if the column name is not found.
+--------+--------+------+----------+-------------+--------+
| emp_id | name | age | job | dob | weight |
+--------+--------+------+----------+-------------+--------+
| 1 | qwerty | null | null | null | null |
| 2 | null | 22 | null | null | null |
| 3 | null | null | googling | null | null |
| 4 | null | null | null | 12-Jan-2001 | null |
| 5 | null | null | null | null | 62.7 |
+--------+--------+------+----------+-------------+--------+
Is there a way to apply validation in step-2 based on the column name? I'm splitting it by = while deriving the above table. Or do I have to do validation in step-3 while working on the new df?
Second question: Is there a way to achieve the following table?
+--------+--------+------+----------+-------------+--------+---------------------+
| emp_id | name | age | job | dob | weight | missing_attributes |
+--------+--------+------+----------+-------------+--------+---------------------+
| 1 | qwerty | null | null | null | null | age,job,dob,weight |
| 2 | null | 22 | null | null | null | name,job,dob,weight |
| 3 | null | null | googling | null | null | name,age,dob,weight |
| 4 | null | null | null | 12-Jan-2001 | null | name,age,job,weight |
| 5 | null | null | null | null | 62.7 | name,age,job,dob |
+--------+--------+------+----------+-------------+--------+---------------------+