I have a dataframe with pairwise comparisons (comparing "query"and "reference"), a distance column and some columns with metadata. I made a selection and created a new dataframe from the main dataframe.
I want all pairwise comparisons of the 19 elements that are in the "query" column of the new dataframe. Then I want to get those elements where one element is present in the "query" column and another that is present in the "reference" column of the main dataframe.
So, what I tried is (in Python 3.10.6, Pandas 1.5.0):
def metadata_def_epi(df_metadata, df_selection_ur):
df_selection_ur_list = df_selection_ur["query"].values.tolist()
mask = df_metadata["query"].isin(df_selection_ur_list) & df_metadata["reference"].isin(df_selection_ur_list)
df_out = df_metadata[mask]
return df_out
It works as intended, BUT I only get 136 pairwise comparisons while I'm expecting 171 (pairwise comparisons = n(n-1)/2).
Am I missing something? Any help would be greatly appreciated!
EDIT
@itprorh66, here is an example of my data: Main dataframe:
FIELD1 | query | reference | Core | Accessory | pmatch | patient_id_query | Collection_date_query | instelling_query | locatie_query | afdeling_query | patient_id_reference | Collection_date_reference | instelling_reference | locatie_reference | afdeling_reference | Cdate_query | Cdate_reference | epi_link_id | snp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ASM1421713v1_genomic | ASM1527755v1_genomic | 0,000312388 | 0,052548587 | 0,938905646 | pt4 | 17-aug-2020 | AP | 00 | 00 | pt5 | 7-nov-2020 | HK | 00 | 00 | 17-aug-2020 | 7-nov-2020 | Unrelated | 70 |
1 | ASM1421713v1_genomic | ASM1930313v1_genomic | 0,002514303 | 0,15178698 | 0,788494326 | pt4 | 17-aug-2020 | AP | 00 | 00 | pt6 | 21-jul-2021 | LA | 00 | 00 | 17-aug-2020 | 21-jul-2021 | Unrelated | 3608 |
2 | ASM1421713v1_genomic | ASM194037v1_genomic | 0,000102997 | 0,080629885 | 0,916627997 | pt4 | 17-aug-2020 | AP | 00 | 00 | pt2 | 9-jan-2017 | CP | 00 | 00 | 17-aug-2020 | 9-jan-2017 | Unrelated | 103 |
3 | ASM1421713v1_genomic | ASM2008964v1_genomic | 0,004278481 | 0,124020875 | 0,773557592 | pt4 | 17-aug-2020 | AP | 00 | 00 | pt7 | 21-sep-2021 | NO | 00 | 00 | 17-aug-2020 | 21-sep-2021 | Unrelated | 7313 |
4 | ASM1421713v1_genomic | ASM2491733v1_genomic | 0,000498295 | 0,08597058 | 0,900912922 | pt4 | 17-aug-2020 | AP | 00 | 00 | pt8 | 4-sep-2022 | TX | 00 | 00 | 17-aug-2020 | 4-sep-2022 | Unrelated | 145 |
5 | ASM1421713v1_genomic | ASM2539267v1_genomic | 0,000156939 | 0,10040438 | 0,895510335 | pt4 | 17-aug-2020 | AP | 00 | 00 | pt9 | 23-sep-2022 | RA | 00 | 00 | 17-aug-2020 | 23-sep-2022 | Unrelated | 83 |
6 | ASM1421713v1_genomic | ASM972006v1_genomic | 0,000523388 | 0,08677614 | 0,899463797 | pt4 | 17-aug-2020 | AP | 00 | 00 | pt3 | 27-nov-2017 | US | 00 | 00 | 17-aug-2020 | 27-nov-2017 | Unrelated | 96 |
Input of a list of query IDs = [ASM1527755v1_genomic, ASM1930313v1_genomic, ASM194037v1_genomic, ASM2008964v1_genomic]
Expected output: Similar table as the main dataframe, but than only the selection where one of the elements of my input list is in "query" and another one in the "reference" column.