I need to write a function to filter a dataset based on some hierarchical conditions. The purpose of this function is to get one annotation for each protein among a bunch of them.
The function needs to do the following,
- Group the proteins by "id" column
- Check each analysis in the protein group
- Pick the analysis in a prioritized way (It's not a sorting issue so the analysis should be given in a conditional order so that it could be re-ordered later on). First, check analysis_1 if it doesn't exist then take analysis_2 or analysis_3, etc.
- Get the row into a new data frame.
Here is an example of data that will be filtered,
df=pd.DataFrame({
'id': ['Protein_1', 'Protein_1', 'Protein_1',
'Protein_2','Protein_2','Protein_2'],
'analysis': ['analysis_6', 'analysis_4', 'analysis_1',
'analysis_3','analysis_2','analysis_5'],
'annotation':['annotation_1', 'annotation_2', 'annotation_3',
'annotation_1','annotation_2','annotation_3'] })
and this is the output I'd like to see,
df_filtered= pd.DataFrame({
'id': ['Protein_1','Protein_2'],
'analysis': ['analysis_1', 'analysis_2'],
'annotation':['annotation_3', 'annotation_2'] })
The code in the following is working but I'd like to do it by using pandas groupby, apply, and iterrows functions.
new_df =pd.DataFrame(columns=df.columns)
protein_id=list(df.id.unique())
for protein in protein_id:
data=df[df["id"] == protein]
if len(data[data["analysis"] =="analysis_1"]) == 0:
if len(data[data["analysis"] =="analysis_2"]) == 0:
if len(data[data["analysis"] =="analysis_3"]) == 0:
pass
else:
data2=data[data["analysis"] =="analysis_3"]
new_df = pd.concat([new_df,data2])
else:
data2=data[data["analysis"] =="analysis_2"]
new_df = pd.concat([new_df,data2])
else:
data2=data[data["analysis"] =="analysis_1"]
new_df = pd.concat([new_df,data2])
new_df
Appreciate any help!!