Getting pairwise comparisons from a pandas dataframe from a list

Question

I have a dataframe with pairwise comparisons (comparing "query"and "reference"), a distance column and some columns with metadata. I made a selection and created a new dataframe from the main dataframe.

I want all pairwise comparisons of the 19 elements that are in the "query" column of the new dataframe. Then I want to get those elements where one element is present in the "query" column and another that is present in the "reference" column of the main dataframe.

So, what I tried is (in Python 3.10.6, Pandas 1.5.0):

def metadata_def_epi(df_metadata, df_selection_ur):

    df_selection_ur_list = df_selection_ur["query"].values.tolist()
    
    mask = df_metadata["query"].isin(df_selection_ur_list) & df_metadata["reference"].isin(df_selection_ur_list)
    
    df_out = df_metadata[mask]

return df_out

It works as intended, BUT I only get 136 pairwise comparisons while I'm expecting 171 (pairwise comparisons = n(n-1)/2).

Am I missing something? Any help would be greatly appreciated!

EDIT

@itprorh66, here is an example of my data: Main dataframe:

FIELD1	query	reference	Core	Accessory	pmatch	patient_id_query	Collection_date_query	instelling_query	patient_id_reference	Collection_date_reference	instelling_reference	Cdate_query	Cdate_reference	epi_link_id	snp
0	ASM1421713v1_genomic	ASM1527755v1_genomic	0,000312388	0,052548587	0,938905646	pt4	17-aug-2020	AP	pt5	7-nov-2020	HK	17-aug-2020	7-nov-2020	Unrelated	70
1	ASM1421713v1_genomic	ASM1930313v1_genomic	0,002514303	0,15178698	0,788494326	pt4	17-aug-2020	AP	pt6	21-jul-2021	LA	17-aug-2020	21-jul-2021	Unrelated	3608
2	ASM1421713v1_genomic	ASM194037v1_genomic	0,000102997	0,080629885	0,916627997	pt4	17-aug-2020	AP	pt2	9-jan-2017	CP	17-aug-2020	9-jan-2017	Unrelated	103
3	ASM1421713v1_genomic	ASM2008964v1_genomic	0,004278481	0,124020875	0,773557592	pt4	17-aug-2020	AP	pt7	21-sep-2021	NO	17-aug-2020	21-sep-2021	Unrelated	7313
4	ASM1421713v1_genomic	ASM2491733v1_genomic	0,000498295	0,08597058	0,900912922	pt4	17-aug-2020	AP	pt8	4-sep-2022	TX	17-aug-2020	4-sep-2022	Unrelated	145
5	ASM1421713v1_genomic	ASM2539267v1_genomic	0,000156939	0,10040438	0,895510335	pt4	17-aug-2020	AP	pt9	23-sep-2022	RA	17-aug-2020	23-sep-2022	Unrelated	83
6	ASM1421713v1_genomic	ASM972006v1_genomic	0,000523388	0,08677614	0,899463797	pt4	17-aug-2020	AP	pt3	27-nov-2017	US	17-aug-2020	27-nov-2017	Unrelated	96

Input of a list of query IDs = [ASM1527755v1_genomic, ASM1930313v1_genomic, ASM194037v1_genomic, ASM2008964v1_genomic]

Expected output: Similar table as the main dataframe, but than only the selection where one of the elements of my input list is in "query" and another one in the "reference" column.

It is very difficult to answer your question without seeing any of your data nor any of the code that you have written that produces your problem. Please edit your question to show a minimal reproducible set consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [Minimal Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example "Minimal Reproducible Example") for details on how to best help us help you. — itprorh66, Jan 05 '23 at 14:12
@itprorh66 I added a few example rows to the question, I hope this clarifies my question? — Freek de Kreek, Jan 05 '23 at 15:21
So do I understand that you desire an output which selects from the input df if the selection criterion is in the query column or the reference column? — itprorh66, Jan 05 '23 at 18:05
Yes basically I want the pairwise comparisons with an ID that is in my selection (a list of ID's). — Freek de Kreek, Jan 06 '23 at 08:05

score 0 · Answer 1 · answered Jan 06 '23 at 15:32

0

I think your problem is related to the & operator used in your mask statement. This will result in selecting rows which contain one or more of the selection criteria in both the query and reference columns.

to fix this issue rewrite your function as shown below:

def metadata_def_epi(df_metadata, df_selection_ur):

    df_selection_ur_list = df_selection_ur["query"].values.tolist()
    # Note change in the mask statement from & to | operator
    mask = df_metadata["query"].isin(df_selection_ur_list) | df_metadata["reference"].isin(df_selection_ur_list)
    
    df_out = df_metadata[mask]

return df_out

answered Jan 06 '23 at 15:32

itprorh66

3,110
4
9
21

Sorry, maybe I phrased my question wrong. But I do not want one of my ID's to be in query or in reference, but for example in my list I have: [ID1, ID2, ID3]. Then I want ID1 in query and ID2 in reference, ID1 in query and ID3 in reference, ID2 in query and ID3 in reference – Freek de Kreek Jan 09 '23 at 14:32
1

So given a list of selection options ['ID1', 'ID2', 'ID3'] you want the combinations ('ID1', 'ID2'), ('ID1', 'ID3'), ('ID2', 'ID3') where the first tuple entry refers to the Query column and the second refers to the Reference column, but you don't want the inverse where the first refers to the Query column and the second refers to the Reference column? – itprorh66 Jan 09 '23 at 16:02
It doesn't really matter if the tuples are inversed, as long as I get all the possible combinations of the list of selection options ['ID1', 'ID2', 'ID3'] – Freek de Kreek Jan 10 '23 at 07:44
Oay, I am officially confused. Can you provide an example that illustrates what your end frame looks like? Hopefully that will clarify – itprorh66 Jan 10 '23 at 14:15

Getting pairwise comparisons from a pandas dataframe from a list

EDIT

1 Answers1