
I have imported a data set into a Jupyter notebook / PySpark to process through EMR, for example:

(data sample: a screenshot of a table with fruit, date, cost and store columns)

I want to clean up the data before using it using the filter function. This includes:

  1. Removing rows where the cost or date is blank, '0' or NA. I think the filter would be something like .filter(lambda (a,b,c,d): b = ?, c % 1 == c, d = ?), but I'm unsure how to filter fruit and store here.
  2. Removing incorrect values, e.g. '3' is not a fruit name. This is easy for numbers (just test number % 1 == number) but I'm unsure how to filter out the words.
  3. Removing rows that are statistical outliers, i.e. more than 3 standard deviations from the mean. Here cell C4 would clearly need to be removed, but I am unsure how to incorporate this logic into a filter.

Do I need to perform one filter at a time, or is there a way to filter the data set (in lambda notation) all in one go?

Or would it be easier to write a Spark SQL query instead, with many filters in the 'where' clause (though #3 above is still difficult to write in SQL)?

lseactuary
  • Multiple conditions can be given. See this https://stackoverflow.com/questions/37707305/pyspark-multiple-conditions-in-when-clause – mayank agrawal Oct 05 '17 at 13:52
  • Yup that is for the 'where' clause. But I was thinking to be more traditional and use the filter/map/reduce method. – lseactuary Oct 05 '17 at 14:08

1 Answer


If you read the documentation at http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.filter, it says that

where() is an alias for filter().

So, you can safely use filter() instead of where() for multiple conditions too.
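For example, several conditions can be combined in a single filter() call (a minimal sketch, assuming the column names date, fruit, cost and store from your sample data):

import pyspark.sql.functions as F

df_clean = df.filter(
    F.col('cost').isNotNull() & (F.col('cost') != '0') &   # drop blank/zero/NA cost
    F.col('date').isNotNull() &                            # drop missing dates
    F.col('fruit').isNotNull() &                           # drop missing fruit names
    F.col('store').isNotNull()                             # drop missing store names
)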

EDIT: If you want to filter on many conditions for many columns, I would prefer this method.

from dateutil.parser import parse
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def is_date(string):
    # True if the string can be parsed as a date
    try:
        parse(string)
        return True
    except (ValueError, TypeError):
        return False

def check_date(d):
    # keep valid dates, null out everything else so dropna() removes the row
    if is_date(d):
        return d
    else:
        return None

date_udf = F.udf(check_date, StringType())

def check_fruit(name):
    # create a list of valid fruit names (can easily find one on google);
    # filtering arbitrary words otherwise is difficult, so check against
    # what you want and the rest will be filtered
    fruits_list = ['apple', 'banana', 'orange']  # extend as needed
    if name in fruits_list:
        return name
    else:
        return None

fruit_udf = F.udf(check_fruit, StringType())

def check_cost(value):
    # mean and std are calculated beforehand (e.g. with an aggregate over the cost column)
    threshold_upper = mean + (3 * std)
    threshold_lower = mean - (3 * std)

    try:
        cost = float(value)
    except (ValueError, TypeError):
        return None
    if threshold_lower < cost < threshold_upper:
        return value
    else:
        return None

cost_udf = F.udf(check_cost, StringType())

# similarly create store_udf

df = df.select([date_udf(F.col('date')).alias('date'),
                fruit_udf(F.col('fruit')).alias('fruit'),
                cost_udf(F.col('cost')).alias('cost'),
                store_udf(F.col('store')).alias('store')]).dropna()

This way you work on all the columns together in one pass.
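The mean and std referenced in check_cost have to be computed up front. A rough sketch of one way to do that (assuming the dirty cost strings can be cast to float, which turns unparseable values into nulls that the aggregates simply ignore):

import pyspark.sql.functions as F

# compute the mean and standard deviation of cost once, before defining cost_udf
stats = df.select(
    F.mean(F.col('cost').cast('float')).alias('mean'),
    F.stddev(F.col('cost').cast('float')).alias('std')
).collect()[0]

mean, std = stats['mean'], stats['std']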

mayank agrawal
  • Thank you for the link. The issue is mainly how to filter (or use where) above for #1/2/3 as I'm unsure how to clean up the data using a multiple where/filter clause. – lseactuary Oct 05 '17 at 16:17
  • The main issue here is that I have a big data set that I need to clean up before using. If I infer the schema, it will import incorrect values, but then I have to write a complex SQL query to remove those instances (and this is not scalable because someone could add additional data to the same file with more bugs). I have tried to cast the values to a 'type' with the .withColumn statement but it rejects this too as there is missing/unclean data (e.g. customer_count in a numeric field, so it won't get cast properly). Therefore I am stuck on how to approach this issue using what is available in Spark. – lseactuary Oct 08 '17 at 12:04
  • Yes, inferring the schema will be inappropriate. Since you have many conditions on many columns to filter, I would suggest using UDFs for all columns. Check the EDIT in the answer. – mayank agrawal Oct 09 '17 at 11:04
  • Changing the schema (through cast or otherwise) and then using df2 = df.filter (condition for column 1 | condition for column 2 ...) would also work right? – lseactuary Oct 14 '17 at 09:57
  • Changing the schema when your column has unclean values will result in error. For example, in column cost, if you have characters you will not be able to convert it to FloatType. It is better to clean all values as StringType and then convert/cast to the desired schema. – mayank agrawal Oct 14 '17 at 10:56
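A hedged sketch of that last step, once the UDF pass has nulled out the bad values and dropna() has removed those rows (column name and target type assumed from the example above):

import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

# the remaining cost values are all parseable now, so this cast
# no longer produces unexpected nulls or errors
df_typed = df.withColumn('cost', F.col('cost').cast(FloatType()))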