I have a list of valid values that a cell can have: a cell may contain only one of the three values ['Messi', 'Ronaldo', 'Virgil']. If even one cell in a column is invalid, I need to drop the whole column. I know there are answers about dropping rows that contain invalid values in a particular column, but here I want to drop the entire column instead, even if only one of its cells is invalid.
I tried reading about filtering, but all I could find was selecting columns and dropping rows, for instance in this question. I have also read that one should avoid too much scanning and shuffling in Spark, which I agree with.
I am not just looking for a code solution, but rather for off-the-shelf functionality provided by PySpark, if such exists. I hope this doesn't go beyond the scope of an SO answer.
For the following input dataframe:
| Column 1 | Column 2 | Column 3 | Column 4 | Column 5 |
| -------- | -------- | -------- | -------- | -------- |
| Ronaldo  | Salah    | Messi    |          | Salah    |
| Ronaldo  | Messi    | Virgil   | Messi    | null     |
| Ronaldo  | Ronaldo  | Messi    | Ronaldo  | null     |
I expect the following output:
| Column 1 | Column 3 |
| -------- | -------- |
| Ronaldo  | Messi    |
| Ronaldo  | Virgil   |
| Ronaldo  | Messi    |

(Only Column 1 and Column 3 survive: Column 2 contains 'Salah', Column 4 has an empty cell, and Column 5 contains 'Salah' and nulls.)