Currently, I'm working on data preprocessing for a data mining project. Specifically, I want to perform data cleaning with PySpark on data stored in HDFS. I'm very new to these tools, so I'd like to ask how to do that.
For example, there is a table in the HDFS containing the following entries:
attrA  attrB  attrC  label
1      a      abc    0
2             abc    0
4      b      abc    1
4      b      abc    1
5      a      abc    0
After cleaning, row 2 <2, , abc, 0> should have a default or imputed value for attrB, and one of the duplicate rows 3 and 4 should be eliminated. How can I implement that with PySpark?