I have imported a data set into a Jupyter notebook / PySpark to process through EMR, for example:
I want to clean up the data with the filter function before using it. This includes:
- Removing rows where the cost or date is blank, '0', or NA. I think the filter would be something like `.filter(lambda (a, b, c, d): b == ?, c % 1 == c, d == ?)`, but I'm unsure how to filter fruit and store here (see the sketch after this list).
- Removing incorrect values, e.g. '3' is not a fruit name. This is easy for numbers (just test `number % 1 == number`), but I'm unsure how the equivalent check would work for the word columns.
- Removing rows that are statistical outliers, i.e. more than 3 standard deviations from the mean. Here, cell C4 would clearly need to be removed, but I'm unsure how to incorporate this logic into a filter.
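For reference, here is a rough sketch of what I imagine the DataFrame-API version of all three cleanups would look like. The column names (fruit, store, cost, date) and the input path are my assumptions, since the example above doesn't show the schema:

```python
from pyspark.sql import functions as F

# Assumed schema: fruit (string), store (string), cost (double), date (date).
# The path is hypothetical; `spark` is the session EMR's Jupyter kernel provides.
df = spark.read.csv("s3://my-bucket/fruit.csv", header=True, inferSchema=True)

# Cleanup 1: drop rows with a blank/NA/zero cost or a missing date.
cleaned = df.filter(
    F.col("cost").isNotNull() & (F.col("cost") != 0)
    & F.col("date").isNotNull()
)

# Cleanup 2: drop rows whose fruit name is a bare number such as '3'.
cleaned = cleaned.filter(~F.col("fruit").rlike(r"^[0-9]+$"))

# Cleanup 3: keep only rows within 3 standard deviations of the mean cost.
stats = cleaned.select(F.mean("cost").alias("mu"), F.stddev("cost").alias("sigma")).first()
cleaned = cleaned.filter(F.abs(F.col("cost") - stats["mu"]) <= 3 * stats["sigma"])
```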
Do I need to apply the filters one at a time, or is there a way to filter the data set (in lambda notation) all in one go?
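In other words, something like the sketch below as a single RDD filter. Note that Python 3 dropped tuple unpacking in lambda parameters, so I'd index into the row instead, and the mean and standard deviation would have to be computed up front (`rdd` and the tuple layout are my assumptions):

```python
# Assuming each row is a tuple (fruit, store, cost, date) with numeric cost.
costs = rdd.map(lambda row: row[2]).filter(lambda c: c not in (None, 0))
mu, sigma = costs.mean(), costs.stdev()

cleaned_rdd = rdd.filter(
    lambda row: row[2] not in (None, 0)      # cost is present and non-zero
    and row[3] is not None                   # date is present
    and not str(row[0]).isdigit()            # fruit is not a bare number
    and abs(row[2] - mu) <= 3 * sigma        # within 3 standard deviations
)
```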
Or, would it be easier to write a Spark SQL query instead, with many filters in the WHERE clause? (Though the third point above, the outlier check, still seems difficult to write in SQL.)
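If SQL turns out to be the way to go, I'd guess the outlier check could be pushed into the query with a subquery that computes the mean and standard deviation, roughly like this sketch (the view name fruit_sales is my assumption):

```python
df.createOrReplaceTempView("fruit_sales")

cleaned = spark.sql("""
    SELECT f.*
    FROM fruit_sales f
    CROSS JOIN (SELECT AVG(cost) AS mu, STDDEV(cost) AS sigma FROM fruit_sales) s
    WHERE f.cost IS NOT NULL AND f.cost <> 0
      AND f.date IS NOT NULL
      AND NOT (f.fruit RLIKE '^[0-9]+$')
      AND ABS(f.cost - s.mu) <= 3 * s.sigma
""")
```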