
I have a list (list_to_match = ['a','b','c','d']) and a dataframe like this one below:

Index  One  Two  Three  Four
1      a    b    d      c
2      b    b    d      d
3      a    b    d
4      c    b    c      d
5      a    b    c      g
6      a    b    c
7      a    s    c      f
8      a    f    c
9      a    b
10     a    b    t      d
11     a    b    g
...    ...  ...  ...    ...
100    a    b    c      d

My goal is to filter for the rows with the most matches with the list in the corresponding position (e.g. position 1 in the list has to match column 1, position 2 column 2, etc.). In this specific case, excluding row 100, rows 5 and 6 would be the ones selected, since they match 'a', 'b' and 'c'; but if row 100 were included, row 100 and all the other rows matching all four elements would be the ones selected. Also, the list might change in length, e.g. list_to_match = ['a','b'].

Thanks for your help!

Dario Bani
  • so do you need to filter rows that have a number of matches equal to the max number of matches among all rows? – ansev Jan 10 '23 at 17:46
  • I was thinking maybe a loop where it first filters the dataframe on the first column for 'a'; then, if the length of the resulting sliced dataframe is non-zero, it goes on and filters the (already 'a'-filtered) dataframe by 'b'; if the length of this new filtered dataframe is above 0 it continues, etc. The loop breaks whenever the length of the filtered dataframe is zero, picking the previous level as the good one (see the sketch after these comments). – Dario Bani Jan 10 '23 at 18:13
  • I think in that case you only need a solution with cummin or cumprod like mozway solution or my updated solution @DarioBani – ansev Jan 10 '23 at 18:20
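For reference, here is a minimal sketch of the loop described in the second comment (assuming df holds the dataframe above; untested, variable names are illustrative):

list_to_match = ['a', 'b', 'c', 'd']

# filter column by column, stopping at the last non-empty level
filtered = df
for col, value in zip(df.columns[:len(list_to_match)], list_to_match):
    next_level = filtered[filtered[col] == value]
    if len(next_level) == 0:
        break  # nothing matches this far; keep the previous level
    filtered = next_level

print(filtered)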

2 Answers


I would use:

list_to_match = ['a','b','c','d']

# compute a mask of identical values
mask = df.iloc[:, :len(list_to_match)].eq(list_to_match)
# ensure we match values in order
mask2 = mask.cummin(axis=1).sum(axis=1)
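# (cummin propagates the first False across the row, so the sum counts
#  how many leading positions match the list in order)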

# get the rows with max matches
out = df[mask2.eq(mask2.max())]
# or
# out = df.loc[mask2.nlargest(1, keep='all').index]

print(out)

Output (ignoring the input row 100):

      One Two Three  Four
Index                    
5       a   b     c     g
6       a   b     c  None
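
Because the mask only looks at the first len(list_to_match) columns, the same code handles a shorter list without changes; for example, a quick sketch with the data above:

list_to_match = ['a', 'b']

# identical steps, just with a two-element list
mask = df.iloc[:, :len(list_to_match)].eq(list_to_match)
mask2 = mask.cummin(axis=1).sum(axis=1)
out = df[mask2.eq(mask2.max())]  # all rows whose first two columns are 'a', 'b'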
mozway
  • I see this as a possible solution; the problem is that the list is variable in length, e.g. ['a','b'], and in that case I believe this one breaks. – Dario Bani Jan 10 '23 at 18:18
  • I've also changed the main post, apologies for the amendment – Dario Bani Jan 10 '23 at 18:19
  • @DarioBani easy to handle, see update – mozway Jan 10 '23 at 18:26
  • I believe we are almost there: mask2 creates a mask where the desirable rows in the output are the ones with the highest value, but then "out" doesn't show the df filtered for those rows but for others... really odd – Dario Bani Jan 10 '23 at 18:37
  • @Dario it was my understanding that you wanted to exclude row100 as it contained all matches, if this is incorrect simplify my solution with: `out = df[mask2.eq(mask2.max())]` or `out = df.loc[mask2.nlargest(1, keep='all').index]` – mozway Jan 10 '23 at 18:41
  • I'm definitely abusing your patience here, but is there a way that I can slice the df the way you did, add e.g. +100 to all the values of one column (e.g. "Six") of the sliced dataframe, and have this impact the original df? @mozway – Dario Bani Jan 10 '23 at 18:53
  • @Dario not sure to get what you mean, do you want to perform boolean indexing? Maybe worth opening a follow-up question? – mozway Jan 10 '23 at 19:14

Here is my approach; the steps are explained in the comments below.

import pandas as pd
import numpy as np
from scipy.spatial.distance import cosine



data = {'One':  ['a', 'a', 'a', 'a'], 
        'Two':  ['b', 'b', 'b', 'b'],
        'Three':  ['c', 'c', 'y', 'c'], 
        'Four': ['g', 'g', 'z', 'd']}

dataframe_ = pd.DataFrame(data)


#encoding Letters into numerical values so we can compute the cosine similarities
dataframe_[:] = dataframe_.to_numpy().astype('<U1').view(np.uint32)-64
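# (view(np.uint32) exposes each character's code point; 'a' is 97, so after
#  subtracting 64 the letters become 33, 34, 35, ...)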

#Our input data which we are going to compare with other rows
input_data = np.array(['a', 'b', 'c', 'd'])

#encode input data into numerical values
input_data = input_data.astype('<U1').view(np.uint32)-64

#compute cosine similarity for each row
dataframe_out = dataframe_.apply(lambda row: 1 - cosine(row, input_data), axis=1)
print(dataframe_out)

output:

0    0.999343
1    0.999343
2    0.973916
3    1.000000

Filtering rows based on their cosine similarities:

# dataframe_out is a Series, so mask it directly; where() keeps NaN for non-matches
df_filtered = dataframe_out.where(dataframe_out > 0.99)
print(df_filtered)
0  0.999343
1  0.999343
2       NaN
3  1.000000

From here on you can easily find the rows with non-NaN values by their indexes.
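
For example (a minimal sketch of that last step):

# indexes of the rows above the similarity threshold
matching_rows = df_filtered.dropna().index
print(dataframe_.loc[matching_rows])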

Ali