2

Currently I'm working on data preprocessing for a Data Mining project. Specifically, I want to do data cleaning with PySpark on data stored in HDFS. I'm very new to these tools, so I'd like to ask how to do that.

For example, there is a table in the HDFS containing the following entries:

attrA   attrB   attrC      label
1       a       abc        0
2               abc        0
4       b       abc        1
4       b       abc        1
5       a       abc        0

After cleaning all the entries, row 2 <2, , abc, 0> should have a default or imputed value for attrB, and one of the duplicate rows 3 and 4 should be eliminated. How can I implement that with PySpark?

argenisleon
Walden Lian

2 Answers

4

This is a very common problem in any data-driven solution. The best tool I can recommend for data cleansing with PySpark is Optimus.

Let's see. First, let's assume you already have this DataFrame in memory:

df.show()

+-----+-----+-----+-----+
|attrA|attrB|attrC|label|
+-----+-----+-----+-----+
|    1|    a|  abc|    0|
|    2|     |  abc|    0|
|    4|    b|  abc|    1|
|    4|    b|  abc|    1|
|    5|    a|  abc|    0|
+-----+-----+-----+-----+

To start, let's instantiate the DataFrameTransformer:

import optimus as op

transformer = op.DataFrameTransformer(df)

  1. Set default value for empty cells:

df_default = transformer.replace_col(search='', change_to='new_value', columns='attrB').df

df_default.show()

+-----+---------+-----+-----+
|attrA|    attrB|attrC|label|
+-----+---------+-----+-----+
|    1|        a|  abc|    0|
|    2|new_value|  abc|    0|
|    4|        b|  abc|    1|
|    4|        b|  abc|    1|
|    5|        a|  abc|    0|
+-----+---------+-----+-----+
  2. Eliminate duplicate records:

df_clean = transformer.remove_duplicates(["attrA","attrB"]).df
df_clean.show()

 +-----+---------+-----+-----+
 |attrA|    attrB|attrC|label|
 +-----+---------+-----+-----+
 |    4|        b|  abc|    1|
 |    5|        a|  abc|    0|
 |    1|        a|  abc|    0|
 |    2|new_value|  abc|    0|
 +-----+---------+-----+-----+
Favio Vázquez
2

Based on what you asked, there are two things you want to achieve. First, remove duplicate rows, which can be done with the distinct function:

df2 = df.distinct()
df2.show()

This gives you the distinct rows of the DataFrame.

Second, impute missing values, which can be done with the fillna function:

df2 = df.na.fill({'attrB': 'm'})
df2.show()
Gaurav Dhama