Currently we're using a DataFrame to store information on data we've collected. Prior to submitting the data, we need to validate it against a list of rules. We're trying to set up these validations in Python, and part of the problem is readability vs. performance.

For example (each item is a column in the DataFrame): Rule: Dir_Through_Lanes must exist where F_System = 1 and Facility_Type in (1;2;6) and RoadEventCollectionMethods Pav_Rep_Method = 2

Two methods of validating the data:

  1. Using an apply

    def sjpm201a(self):
        def sjpm201_check(row):
            print(row)
            if row['F_SYSTEM'] == 1 and row['FACILITY_TYPE'] in [1, 2, 6] and row['PAVE_REP_METHOD'] == 2:
                # The rule applies; the row fails when DIR_THROUGH_LANES is missing
                if pd.isna(row['DIR_THROUGH_LANES']):
                    row['SJPM201a'] = False
                    return row
            row['SJPM201a'] = True
            return row
        self.df = self.df.apply(sjpm201_check, axis=1)
  2. Using boolean indexing
    def sjpm201b(self):
        df = self.df
        self.df['SJPM201b'] = ((df['DIR_THROUGH_LANES'].notna()) | ((df['F_SYSTEM'] != 1) | (~df['FACILITY_TYPE'].isin([1,2,6])) | (df['PAVE_REP_METHOD'] != 2)))

The problem is that the first method is easier to read/write/update, whereas the second method is a bit harder just due to the grouping of ands/ors, especially on longer rules. However, the second method runs in about 0.01 seconds vs. the first method's roughly 22 seconds. This is a problem, as we have hundreds of rules that need to be run.

Is there a way to achieve the same performance as the second method, but using if/else if blocks to filter the data?
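
For reference, one way to keep the vectorized speed while spelling the conditions out separately would be to name the intermediate masks, as in the rough sketch below (the helper name sjpm201_mask is just illustrative, not code we currently use), though ideally we'd like something even closer to plain if/else blocks:

import pandas as pd

def sjpm201_mask(df: pd.DataFrame) -> pd.Series:
    # Rows the rule applies to: F_SYSTEM = 1, FACILITY_TYPE in (1, 2, 6), PAVE_REP_METHOD = 2
    rule_applies = (
        (df['F_SYSTEM'] == 1)
        & df['FACILITY_TYPE'].isin([1, 2, 6])
        & (df['PAVE_REP_METHOD'] == 2)
    )
    # A row fails only when the rule applies and DIR_THROUGH_LANES is missing
    return ~(rule_applies & df['DIR_THROUGH_LANES'].isna())

# e.g. df['SJPM201b'] = sjpm201_mask(df)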

1 Answer


Instead of checking for (!condition1 or !condition2) as in your second example, you could use (condition1 and condition2) as in the first example, on separate lines. E.g.:

def sjpm201b(self):
    self.df = self.df[self.df['DIR_THROUGH_LANES'].notna()]
    self.df = self.df[self.df['F_SYSTEM'] == 1]
    self.df = self.df[self.df['FACILITY_TYPE'].isin([1,2,6])]
    self.df = self.df[self.df['PAVE_REP_METHOD'] == 2]

Edit: I am slightly confused. Do you want to retain rows where df['F_SYSTEM'] == 1, or do you only want to keep rows where it is != 1? The above function retains only the rows that are within your specification and deletes the rest.

Edit 2: The following will return a dataframe listing the rows that conform to your specifications but have NA values in the DIR_THROUGH_LANES column.

def sjpm201b(self):
    df = self.df
    df = df[df['F_SYSTEM'] == 1]
    df = df[df['FACILITY_TYPE'].isin([1,2,6])]
    df = df[df['PAVE_REP_METHOD'] == 2]
    df = df[df['DIR_THROUGH_LANES'].isna()]
    return df

In case you want a simple True/False result of your validation you could use:

return df['DIR_THROUGH_LANES'].isna().any()

Edit 3: Improve readability by adding line breaks:

def sjpm201b(self):
    df = self.df
    self.df['SJPM201b'] = (
        df['DIR_THROUGH_LANES'].notna()
        | (df['F_SYSTEM'] != 1)
        | (~df['FACILITY_TYPE'].isin([1, 2, 6]))
        | (df['PAVE_REP_METHOD'] != 2)
    )
Arothas
  • So the validation depends on the entire statement. Dir_Through_Lanes can't be missing when all of the following conditions hold (F_System = 1, Facility_Type = 1, 2 or 6, and Pav_Rep_Method = 2); otherwise it doesn't matter whether Dir_Through_Lanes has a value or not. The second method is just one logical statement to find lines that meet this criteria (the logic might not read exactly the same, but should give the same results). Using individual column filters won't work in most cases, unfortunately, as whether or not the data is valid depends on multiple conditions, not just a single condition on each column – Sea Bacon Jun 22 '23 at 13:46
  • In response to edit 2: The problem with this method is we don't want to filter out rows where the individual conditions aren't met. That's still valid data in most cases. The only time it's invalid is when ALL 3 conditions are met but DIR_THROUGH_LANES is null. If any of the conditions are false, then the row is essentially valid regardless of anything else. – Sea Bacon Jun 22 '23 at 14:04
  • I see. If you want to retain performance you should keep to the boolean indexing method. You can still put your statements on separate lines which marginally improves readability, see edit3. – Arothas Jun 22 '23 at 14:12
  • Your previous method might work if I use it to filter down to the rows where the rule could potentially be false (fail), then check the initial condition and return a truth series based on that. – Sea Bacon Jun 22 '23 at 14:22
  • I started using the method from edit 2 with some slight modifications. But it's working out well and WAY faster than trying to go row by row using if statements. TY @Arothas – Sea Bacon Jun 22 '23 at 17:55
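
For reference, here is a minimal sketch of the adaptation described in the last two comments: use the filters from edit 2 to narrow the frame down to the rows the rule applies to, then build a full-length truth series from the index of the failing rows. The helper name sjpm201_flags and the index-based mapping are assumptions, not code from the thread.

import pandas as pd

def sjpm201_flags(df: pd.DataFrame) -> pd.Series:
    # Narrow down to the rows the rule applies to (edit 2's filters)
    candidates = df[df['F_SYSTEM'] == 1]
    candidates = candidates[candidates['FACILITY_TYPE'].isin([1, 2, 6])]
    candidates = candidates[candidates['PAVE_REP_METHOD'] == 2]

    # Among those, the failures are the rows missing DIR_THROUGH_LANES
    failures = candidates[candidates['DIR_THROUGH_LANES'].isna()]

    # Every row is valid except the failures; map back via the index
    flags = pd.Series(True, index=df.index)
    flags.loc[failures.index] = False
    return flags

# e.g. df['SJPM201b'] = sjpm201_flags(df)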