0

I have a dataframe with pairwise comparisons (comparing "query"and "reference"), a distance column and some columns with metadata. I made a selection and created a new dataframe from the main dataframe.

I want all pairwise comparisons of the 19 elements that are in the "query" column of the new dataframe. Then I want to get those elements where one element is present in the "query" column and another that is present in the "reference" column of the main dataframe.

So, what I tried is (in Python 3.10.6, Pandas 1.5.0):

def metadata_def_epi(df_metadata, df_selection_ur):

    df_selection_ur_list = df_selection_ur["query"].values.tolist()
    
    mask = df_metadata["query"].isin(df_selection_ur_list) & df_metadata["reference"].isin(df_selection_ur_list)
    
    df_out = df_metadata[mask]

return df_out

It works as intended, BUT I only get 136 pairwise comparisons while I'm expecting 171 (pairwise comparisons = n(n-1)/2).

Am I missing something? Any help would be greatly appreciated!

EDIT

@itprorh66, here is an example of my data: Main dataframe:

FIELD1 query reference Core Accessory pmatch patient_id_query Collection_date_query instelling_query locatie_query afdeling_query patient_id_reference Collection_date_reference instelling_reference locatie_reference afdeling_reference Cdate_query Cdate_reference epi_link_id snp
0 ASM1421713v1_genomic ASM1527755v1_genomic 0,000312388 0,052548587 0,938905646 pt4 17-aug-2020 AP 00 00 pt5 7-nov-2020 HK 00 00 17-aug-2020 7-nov-2020 Unrelated 70
1 ASM1421713v1_genomic ASM1930313v1_genomic 0,002514303 0,15178698 0,788494326 pt4 17-aug-2020 AP 00 00 pt6 21-jul-2021 LA 00 00 17-aug-2020 21-jul-2021 Unrelated 3608
2 ASM1421713v1_genomic ASM194037v1_genomic 0,000102997 0,080629885 0,916627997 pt4 17-aug-2020 AP 00 00 pt2 9-jan-2017 CP 00 00 17-aug-2020 9-jan-2017 Unrelated 103
3 ASM1421713v1_genomic ASM2008964v1_genomic 0,004278481 0,124020875 0,773557592 pt4 17-aug-2020 AP 00 00 pt7 21-sep-2021 NO 00 00 17-aug-2020 21-sep-2021 Unrelated 7313
4 ASM1421713v1_genomic ASM2491733v1_genomic 0,000498295 0,08597058 0,900912922 pt4 17-aug-2020 AP 00 00 pt8 4-sep-2022 TX 00 00 17-aug-2020 4-sep-2022 Unrelated 145
5 ASM1421713v1_genomic ASM2539267v1_genomic 0,000156939 0,10040438 0,895510335 pt4 17-aug-2020 AP 00 00 pt9 23-sep-2022 RA 00 00 17-aug-2020 23-sep-2022 Unrelated 83
6 ASM1421713v1_genomic ASM972006v1_genomic 0,000523388 0,08677614 0,899463797 pt4 17-aug-2020 AP 00 00 pt3 27-nov-2017 US 00 00 17-aug-2020 27-nov-2017 Unrelated 96

Input of a list of query IDs = [ASM1527755v1_genomic, ASM1930313v1_genomic, ASM194037v1_genomic, ASM2008964v1_genomic]

Expected output: Similar table as the main dataframe, but than only the selection where one of the elements of my input list is in "query" and another one in the "reference" column.

  • It is very difficult to answer your question without seeing any of your data nor any of the code that you have written that produces your problem. Please edit your question to show a minimal reproducible set consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [Minimal Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example "Minimal Reproducible Example") for details on how to best help us help you. – itprorh66 Jan 05 '23 at 14:12
  • @itprorh66 I added a few example rows to the question, I hope this clarifies my question? – Freek de Kreek Jan 05 '23 at 15:21
  • So do I understand that you desire an output which selects from the input df if the selection criterion is in the query column or the reference column? – itprorh66 Jan 05 '23 at 18:05
  • Yes basically I want the pairwise comparisons with an ID that is in my selection (a list of ID's). – Freek de Kreek Jan 06 '23 at 08:05

1 Answers1

0

I think your problem is related to the & operator used in your mask statement. This will result in selecting rows which contain one or more of the selection criteria in both the query and reference columns.

to fix this issue rewrite your function as shown below:

def metadata_def_epi(df_metadata, df_selection_ur):

    df_selection_ur_list = df_selection_ur["query"].values.tolist()
    # Note change in the mask statement from & to | operator
    mask = df_metadata["query"].isin(df_selection_ur_list) | df_metadata["reference"].isin(df_selection_ur_list)
    
    df_out = df_metadata[mask]

return df_out
itprorh66
  • 3,110
  • 4
  • 9
  • 21
  • Sorry, maybe I phrased my question wrong. But I do not want one of my ID's to be in query or in reference, but for example in my list I have: [ID1, ID2, ID3]. Then I want ID1 in query and ID2 in reference, ID1 in query and ID3 in reference, ID2 in query and ID3 in reference – Freek de Kreek Jan 09 '23 at 14:32
  • 1
    So given a list of selection options ['ID1', 'ID2', 'ID3'] you want the combinations ('ID1', 'ID2'), ('ID1', 'ID3'), ('ID2', 'ID3') where the first tuple entry refers to the Query column and the second refers to the Reference column, but you don't want the inverse where the first refers to the Query column and the second refers to the Reference column? – itprorh66 Jan 09 '23 at 16:02
  • It doesn't really matter if the tuples are inversed, as long as I get all the possible combinations of the list of selection options ['ID1', 'ID2', 'ID3'] – Freek de Kreek Jan 10 '23 at 07:44
  • Oay, I am officially confused. Can you provide an example that illustrates what your end frame looks like? Hopefully that will clarify – itprorh66 Jan 10 '23 at 14:15