I have a large dataframe I would like to filter using Python. The dataframe is a concatenation of 4 separate dataframes and now has 40,000 rows and 36 columns.

For filtering, I am looking specifically at four columns (IJC_S1, IJC_S2, SJC_S1, SJC_S2), each of which contains a string of two numbers separated by a ','. My filtering criterion is to keep all rows where the numbers in IJC_S1 plus the numbers in SJC_S1 sum to >= 10, or the numbers in IJC_S2 plus the numbers in SJC_S2 sum to >= 10.

Where I am getting stuck is keeping the data in the dataframe, in the correct order, so it still matches the ID number from the original dataframe. I am also trying to make this a function, as I have many other datasets I will need to do this with. Here is what I have so far; I would really appreciate some guidance, as I am relatively new to pandas and coding in general.
import pandas as pd
import glob
def get_JC_counts(df):
    new_df = df
    for col in df:
        if col.startswith('IJC') or col.startswith('SJC'):
            new_df[] = new_df[col].str.split(',', expand=True)  # I am not sure how to make this work the way I intend
    for col in new_df:  # I feel like I can figure this out once I fix the above error
        if col _______:
            new_df[col] = new_df[col].apply(pd.to_numeric)
    sum_S1 = new_df[] + new_df[] + new_df[] + new_df[]
    sum_S2 = new_df[] + new_df[] + new_df[] + new_df[]
    new_df['S1'] = sum_S1
    new_df['S2'] = sum_S2
    new_df = new_df.loc[new_df['S1'] > 10 or new_df['S2'] > 10]
    return new_df
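To show what I'm aiming for, here is a rough sketch of the behaviour I have in mind, not code I'm confident in. It assumes the columns are literally named IJC_S1, IJC_S2, SJC_S1, SJC_S2 and each holds exactly two comma-separated numbers; the function name filter_by_jc_counts and the threshold parameter are just placeholders I made up.

import pandas as pd

def filter_by_jc_counts(df, threshold=10):
    # Work on a copy so the original dataframe (and its ID order) stays untouched.
    out = df.copy()
    totals = {}
    for sample in ('S1', 'S2'):
        # Split the comma-separated strings, convert to numbers, and total IJC + SJC per row.
        ijc = out[f'IJC_{sample}'].str.split(',', expand=True).apply(pd.to_numeric).sum(axis=1)
        sjc = out[f'SJC_{sample}'].str.split(',', expand=True).apply(pd.to_numeric).sum(axis=1)
        totals[sample] = ijc + sjc
    # Keep rows where either sample reaches the threshold; because this is a boolean
    # mask, the surviving rows keep their original index/ID order.
    keep = (totals['S1'] >= threshold) | (totals['S2'] >= threshold)
    return out.loc[keep]

My understanding is that filtering with a boolean mask like this avoids any re-sorting, so the remaining rows should still line up with the IDs from the original dataframe.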
Edit:

I have changed my approach to getting the columns, because the number of values in the IJC/SJC columns may vary between future runs. Instead, I want a loop that creates however many columns are needed (depending on the number of samples) and names them accordingly.
# Separate columns that contain commas and alter datatype
sep_df = merged_df

# Find how many samples were input
index = sep_df.columns.get_loc('IJC_SAMPLE_1')
val1 = sep_df.iat[0, index]
samples = val1.count(',') + 1

for i in range(samples):
    sep_df[[f'IJC_1_{i}']] = sep_df.IJC_SAMPLE_1.str.split(pat=',', expand=True)
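And here is a rough sketch of the kind of loop I mean, again just my own guess: split_jc_column and the IJC_1_{i} column names are placeholders I made up, and I'm assuming every row of IJC_SAMPLE_1 has the same number of comma-separated values as the first row.

import pandas as pd

def split_jc_column(df, col='IJC_SAMPLE_1', prefix='IJC_1'):
    # Split the comma-separated string column into one numeric column per sample.
    parts = df[col].str.split(pat=',', expand=True).apply(pd.to_numeric)
    # parts has integer column labels 0..n-1, one per value in the string.
    for i in range(parts.shape[1]):
        df[f'{prefix}_{i}'] = parts[i]
    return df

I would then call this once per IJC/SJC column, e.g. sep_df = split_jc_column(merged_df.copy()), but I'm not sure whether this is the right way to generalise it.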