Currently using a dataframe
to store information on data we've collected. Prior to submitting the data, we need to validate the data based off a list of rules. Trying to set up these validations in python, and part of the problem is readability vs. performance.
For example: (Each item is a column in the data frame) Rule: Dir_Through_Lanes must exist where F_System = 1 and Facility_Type in (1;2;6) and RoadEventCollectionMethods Pav_Rep_Method = 2
Two methods of validating the data:
- Using an apply
def sjpm201a(self):
def sjpm201_check(row):
print(row)
if row['F_SYSTEM'] == 1 and row['FACILITY_TYPE'] in [1,2,6] and row['PAVE_REP_METHOD'] == 2:
if row['DIR_THROUGH_LANES'] == np.nan:
row['SJPM201'] = False
return row
row['SJPM201a'] = True
return row
self.df = self.df.apply(sjpm201_check, axis=1)
- Using
boolean
indexing
def sjpm201b(self):
df = self.df
self.df['SJPM201b'] = ((df['DIR_THROUGH_LANES'].notna()) | ((df['F_SYSTEM'] != 1) | (~df['FACILITY_TYPE'].isin([1,2,6])) | (df['PAVE_REP_METHOD'] != 2)))
The problem is that the first method is easier to read/write/update whereas the second method is a bit harder just due to the grouping of ands/ors especially on longer rules. However the second method runs in about .01 seconds vs the first method taking about 22 seconds to run. This is a problem as we have hundreds of rules that need to be ran.
Is there a way to achieve the same performance as the second method, but using if/else
if blocks to filter the data?