Large data frame filtering in python using pandas

Question

I have a large df I would like to filter using python. The dataframe is a concatenation of 4 separate dataframes and now has 40,000 rows and 36 columns. For filtering, I am looking specifically at four columns (IJC_S1, IJC_S2, SJC_S1, SJC_S2) which contain a string of two numbers in each (separated by a ','). My filtering criteria will be to keep all rows that contain an IJC_S1 + SJC_S1 >= 10 or IJC_S2 + SJC_S2 >= 10. Where I am getting stuck is keeping the data in the dataframe and in the correct order so it matches the ID number from the original df. Furthermore, I am trying to make this a function, as I have many other datasets I will need to do this with. Here is what I have so far, but I would really appreciate some guidance/help in doing this as I am relatively new to pandas and coding in general.

import pandas as pd 
import glob 

def get_JC_counts(df):
    new_df = df 
    for col in df:
        if col.startswith('IJC') or col.startswith('SJC'):
            new_df[] = new_df[col].str.split(',', expand = True) #I am not sure how to make this work the way I  intend 
            
    for col in new_df:  #I feel like I can figure this out once I fix the above error
        if col _______:
            new_df[col] = new_df[col].apply(pd.to_numeric) 
        
    sum_S1 = new_df[] + new_df[] + new_df[] + new_df[]
    sum_S2 = new_df[] + new_df[] + new_df[] + new_df[]
    
    new_df['S1'] = sum_S1
    new_df['S2'] = sum_S2
    
    new_df = new_df.loc[new_df['S1']>10 or new_df['S2']>10]
    
    
    return new_df

edit:

I have changed my thoughts on how to get the columns because the number of values in the IJC/SJC columns may be variable for future runs, so I want to make a loop that will make x amount of columns (depending on samples) and name them accordingly.

#Separate columns that contain commas and alter datatype 
sep_df = merged_df 
    
#Find how many samples on were input
index = sep_df.columns.get_loc('IJC_SAMPLE_1')
val1 = sep_df.iat[0, index] 
samples = val1.count(',') + 1 

for i in range(samples):
     sep_df[[f'IJC_1_{i}']] = sep_df.IJC_SAMPLE_1.str.split(pat=',', expand =True)

Does this answer your question? [Efficient way to apply multiple filters to pandas DataFrame or Series](https://stackoverflow.com/questions/13611065/efficient-way-to-apply-multiple-filters-to-pandas-dataframe-or-series) — noah, Dec 01 '20 at 20:46

score 0 · Accepted Answer · answered Dec 02 '20 at 10:41

I'm not really sure what should happen with the 2, comma separated values in each field. Do you want to add them up? Take the largest/smallest? Anyway, this will do the trick:

import pandas
import random

# Your data
df = pandas.DataFrame([{
    "IJC_S2": f"{random.randint(0, 5)},{random.randint(0, 5)}",
    "SJC_S1": f"{random.randint(0, 5)},{random.randint(0, 5)}",
    "IJC_S1": f"{random.randint(0, 5)},{random.randint(0, 5)}",
    "SJC_S2": f"{random.randint(0, 5)},{random.randint(0, 5)}",
} for _ in range(40000)])


def my_filter(row: pandas.Series) -> bool:

    # Convert the values to numbers and sum(?) them
    IJC_S2 = sum([int(element) for element in row["IJC_S2"].split(",")])
    SJC_S1 = sum([int(element) for element in row["SJC_S1"].split(",")])
    IJC_S1 = sum([int(element) for element in row["IJC_S1"].split(",")])
    SJC_S2 = sum([int(element) for element in row["SJC_S2"].split(",")])

    # Return a boolean value indicating if the row should be kept
    return IJC_S1 + SJC_S1 >= 10 or IJC_S2 + SJC_S2 >= 10


# Apply the filter to each row, use the result to filter the dataframe
df = df.loc[df.apply(my_filter, axis=1)]

Output:

print(df)
      IJC_S2 SJC_S1 IJC_S1 SJC_S2
0        3,4    5,4    5,4    1,5
1        5,2    2,2    2,1    3,5
2        5,2    5,1    0,2    1,4
4        3,3    5,5    4,4    1,5
5        3,3    2,3    0,4    4,0
...      ...    ...    ...    ...
39995    2,0    4,3    4,4    2,1
39996    3,1    4,5    0,1    1,2
39997    5,0    2,3    2,5    4,3
39998    4,2    0,3    5,1    4,1
39999    0,5    2,5    2,2    4,3

[32047 rows x 4 columns]

Thanks for you response, however that is not necessarily what I want to do, my original question is a little confusing. I am trying more to separate the IJC columns into two new columns. I want to use a loop to do this, as I will be entering different dataframes into the function that may contain more than two values within the column, and get it so the values can split into however many columns they need to split into. I added a few lines of code to get the number of samples in the row, and want to make a loop that depending on this value, makes new columns for each sample. Check edits^ — nsh1998, Dec 03 '20 at 18:46

Large data frame filtering in python using pandas

1 Answers1