2

Currently I'm working on data preprocessing for a Data Mining project. Specifically, I want to do data cleaning with PySpark on data stored in HDFS. I'm very new to these tools, so I'd like to ask how to do that.

For example, there is a table in the HDFS containing the following entries:

attrA   attrB   attrC      label
1       a       abc        0
2               abc        0
4       b       abc        1
4       b       abc        1
5       a       abc        0

After cleaning all the entries, row 2 <2, , abc, 0> should have a default or imputed value for attrB, and one of the duplicate rows 3 and 4 should be eliminated. How can I implement that with PySpark?

argenisleon
Walden Lian

2 Answers

4

This is a very common problem in any data-driven solution. The best tool I can recommend for data cleansing with PySpark is Optimus.

Let's see. First, let's assume you already have this DataFrame in memory:

df.show()

+-----+-----+-----+-----+
|attrA|attrB|attrC|label|
+-----+-----+-----+-----+
|    1|    a|  abc|    0|
|    2|     |  abc|    0|
|    4|    b|  abc|    1|
|    4|    b|  abc|    1|
|    5|    a|  abc|    0|
+-----+-----+-----+-----+

To start, let's instantiate the DataFrameTransformer:

import optimus as op

transformer = op.DataFrameTransformer(df)

  1. Set default value for empty cells:

df_default = transformer.replace_col(search='', change_to='new_value', columns='attrB').df

df_default.show()

+-----+---------+-----+-----+
|attrA|    attrB|attrC|label|
+-----+---------+-----+-----+
|    1|        a|  abc|    0|
|    2|new_value|  abc|    0|
|    4|        b|  abc|    1|
|    4|        b|  abc|    1|
|    5|        a|  abc|    0|
+-----+---------+-----+-----+
  2. Eliminate duplicate records:

df_clean = transformer.remove_duplicates(["attrA","attrB"]).df
df_clean.show()

 +-----+---------+-----+-----+
 |attrA|    attrB|attrC|label|
 +-----+---------+-----+-----+
 |    4|        b|  abc|    1|
 |    5|        a|  abc|    0|
 |    1|        a|  abc|    0|
 |    2|new_value|  abc|    0|
 +-----+---------+-----+-----+
Favio Vázquez
2

Based on what you asked, there are two things you want to achieve. First, remove duplicate rows, which can be done with the distinct function:

df2 = df.distinct()
df2.show()

This gives you the distinct rows of the DataFrame.

Second, impute missing values, which can be done with the fillna function:

df2 = df.na.fill({'attrB': 'm'})
df2.show()
Gaurav Dhama