2

I need to write a function to filter a dataset based on some hierarchical conditions. The purpose of this function is to get one annotation for each protein among a bunch of them.

The function needs to do the following,

  1. Group the proteins by "id" column
  2. Check each analysis in the protein group
  3. Pick the analysis in a prioritized way (It's not a sorting issue so the analysis should be given in a conditional order so that it could be re-ordered later on). First, check analysis_1 if it doesn't exist then take analysis_2 or analysis_3, etc.
  4. Get the row into a new data frame.

Here is an example of data that will be filtered,

df=pd.DataFrame({
'id':      ['Protein_1', 'Protein_1', 'Protein_1', 
          'Protein_2','Protein_2','Protein_2'],
'analysis': ['analysis_6', 'analysis_4', 'analysis_1', 
                 'analysis_3','analysis_2','analysis_5'],
'annotation':['annotation_1', 'annotation_2', 'annotation_3',
              'annotation_1','annotation_2','annotation_3'] })

and this is the output I'd like to see,

df_filtered= pd.DataFrame({
'id':      ['Protein_1','Protein_2'],
'analysis': ['analysis_1', 'analysis_2'],
'annotation':['annotation_3', 'annotation_2'] })

The code in the following is working but I'd like to do it by using pandas groupby, apply, and iterrows functions.

new_df =pd.DataFrame(columns=df.columns)
protein_id=list(df.id.unique())

for protein in protein_id:

data=df[df["id"] == protein]

if len(data[data["analysis"] =="analysis_1"]) == 0:
    
    if len(data[data["analysis"] =="analysis_2"]) == 0:
        
        if len(data[data["analysis"] =="analysis_3"]) == 0:
            pass
        else:
            data2=data[data["analysis"] =="analysis_3"]
            new_df = pd.concat([new_df,data2])
        
    else:
        data2=data[data["analysis"] =="analysis_2"]
        new_df = pd.concat([new_df,data2])
    
else:
    data2=data[data["analysis"] =="analysis_1"]
    new_df = pd.concat([new_df,data2])
    
new_df

Appreciate any help!!

zkrt
  • 49
  • 6
  • Why do you state, that it is no sorting issue? Why do you think sorting is not the right way (see my proposed answer below). – jottbe Nov 30 '20 at 16:25
  • The sorting can only be done by prioritizing the analysis, not by their names or orders. I wrote analysis_1 and analysis_2 just to give an example. – zkrt Nov 30 '20 at 18:33
  • Ok I see, but then I think you should provide a mapping from actual analysis value to its priority in case they are not numerically ordered. You need to specify what is better and how it is identified that it is better, otherwise it is not possible to provide a solution, that selects the best result. – jottbe Nov 30 '20 at 18:41

2 Answers2

1

you can use vectorized methods to do this , if analysis column follow the same pattern, you could split by _ and get the last integer hierarchy:

Method 1: (If there can be duplicate minimum hierarchy per group)

helper_s = df['analysis'].str.split("_").str[-1].astype(int)
out = df[helper_s.eq((df.assign(helper_s=helper_s)
                 .groupby("id")['helper_s'].transform("min")))]

          id    analysis    annotation
2  Protein_1  analysis_1  annotation_3
4  Protein_2  analysis_2  annotation_2

Method 2:

helper_s = df['analysis'].str.split("_").str[-1].astype(int)
out = df.loc[df.assign(helper_s=helper_s).groupby("id")['helper_s'].idxmin()]

          id    analysis    annotation
2  Protein_1  analysis_1  annotation_3
4  Protein_2  analysis_2  annotation_2
anky
  • 74,114
  • 11
  • 41
  • 70
  • Unfortunately, the analysis column is filled with several databases like PANTHER, Pfam, Gene3D etc. It's not possible to sort them based on their strings. – zkrt Nov 30 '20 at 18:30
1

You could temporarily sort the dataframe, then drop all but one entries for each id. It looks like this:

df.sort_values('analysis').drop_duplicates(['id'], keep='first')

Note, that this doesn't change the order in your original dataframe. The result looks like this:

          id    analysis    annotation
2  Protein_1  analysis_1  annotation_3
4  Protein_2  analysis_2  annotation_2

In case you have a function that returns the priority of an analysis, you can use it in combination with the method above:

def prio_function(analysis):
    # return a low number for a better result
    # and a high number for a worse result
    return int(analysis.split('_')[1])  # replace this row by your code

df_work= df.assign(_prio=df['analysis'].apply(prio_function))
df_work.sort_values('_prio').drop_duplicates(['id'], keep='first').drop(columns='_prio')

If priorization is simpler, you can also pass a dicitionary to apply instead of a function.

jottbe
  • 4,228
  • 1
  • 15
  • 31
  • 1
    Yeah, that makes total sense. I've changed the analysis names df["analysis"] = df["analysis"].str.replace('PANTHER', 'PANTHER_1') and then used your prio_function as a sorting method. Now it works perfectly fine. Thank you very much! – zkrt Dec 01 '20 at 10:50
  • Thank you for your feedback. I'm glad it worked. If you like you can mark it as answer. – jottbe Dec 01 '20 at 10:53
  • Now, maybe using an apply function to do the string change for a dozen different analyses would be great but I don't know how to do that. – zkrt Dec 01 '20 at 11:01
  • What do you mean by "string change for a dozen of different analyses"? I think you should create a function that returns exactly one score, which is an ordinal number that helps you to identify the most relevant analysis. So in case you can identify them by checking if they contain keywords, you just have to check for the presence of these keywords in the specific order and return the resulting score. – jottbe Dec 01 '20 at 11:06
  • 1
    I meant replacing the string for example "PANTHER" to "PANTHER_1" as I've done it with the str.replace() function but as you've mentioned I need to score them anyway and I think manually changing the analysis string is a lot easier for that problem. – zkrt Dec 01 '20 at 15:40