Count matches between two lists in Python and return the matches

Question

I am trying to add a count of all the matches between DataFrames a and b:

df2['Count'] = len(set(a) & set(b))
df2.head(5)

But it only returns "0"

Data for a:

Result	Column1	Column2	Column3	D-level	R-level
numpy	de	LA	11060303	8	NaN
FRA	Paris	YouTube	56764332	1	4.0

Here is the data for b:

numpy
File Edit View —Insert_ © Cell~—«KKemmel_«- Widgets. Help tT | Python 3 (ipykernel) @
@ B & % mR MC PM Code » 2

YouTube

import numpy

Desired output should be a total of matches between a and b appended to the dataframe:

Result	Column1	Column2	Column3	D-level	R-level	No of matches?
numpy	de	LA	11060303	8	NaN	(1 unique match)
FRA	Paris	YouTube	56764332	1	4.0	(1 unique match)

Best,

Can you provide a sample of `a` and `b`? Do a and b have the same number of columns? — Corralien, Mar 26 '22 at 13:40
@Corralien Hi, a is a .csv file similar to the table above and b is random data. b does not have columns. — William_b, Mar 28 '22 at 12:58
Your code is not reproducible. Try to update your post with data for `a` and `b` as plain text (not image) please — Corralien, Mar 28 '22 at 13:00
@Phoenix Hi, any value found both in b (which is an unordered list) and in a (which is a .csv file) counts. So if b contains "YouTube", this should count as 1 unique match. — William_b, Mar 28 '22 at 13:13
If there is 'Python 3' in dataframe 1, do you consider this as match? — Hamzah, Mar 28 '22 at 14:06
@Phoenix If it was found in a, yes. At the moment no, because it is not in a. — William_b, Mar 28 '22 at 17:18
However, 'Python 3' does appear inside one line of b. Do you mean it should only count as a match when it is an entry on its own, like YouTube and numpy? — Hamzah, Mar 28 '22 at 17:26
A last question: Is `NaN` a string to match or we have to ignore it? — Corralien, Mar 29 '22 at 06:00
@Corralien In principle no, but ideally we want to count all matches across all rows. — William_b, Mar 29 '22 at 06:32

score 0 · Answer 1 · answered Mar 26 '22 at 13:54

Let's consider the DataFrame

import pandas as pd

df = pd.DataFrame([['a', 'c'], ['a', 'b']])

Running set(df) results in {0, 1}, because iterating over a DataFrame yields its column labels rather than its cell values. That is not the set of entries you want. What you need is a flattened list of the entries (see "How to make a flat list out of a list of lists?"):

def flatten_df_values(df):
    return [item for sublist in df.values for item in sublist]

Then, if you have a second DataFrame

df2 = pd.DataFrame([['f','c'],['a','b']])

you can perform your operation and get the values common to both:

set(flatten_df_values(df)) & set(flatten_df_values(df2))  # {'a', 'b', 'c'}
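Applied to the data in the question, the same idea could look like the sketch below. It assumes that `b` is a plain Python list with one entry per line (only three of the lines are used here), that a match means a whole cell of `a` equals an entry of `b`, and that a per-row count is what the "No of matches" column should hold. `DataFrame.isin` is swapped in for the per-row part; it is not part of the approach above.

```python
import pandas as pd

# Sample of `a` copied from the question's table.
a = pd.DataFrame(
    {
        "Result": ["numpy", "FRA"],
        "Column1": ["de", "Paris"],
        "Column2": ["LA", "YouTube"],
        "Column3": [11060303, 56764332],
        "D-level": [8, 1],
        "R-level": [float("nan"), 4.0],
    }
)

# Assumption: `b` is a list of strings, one per line of the question's data.
b = ["numpy", "YouTube", "import numpy"]

# Unique values that are both a cell of `a` and an entry of `b`.
cell_values = set(a.to_numpy().ravel())
unique_matches = cell_values & set(b)
total_matches = len(unique_matches)

# Per-row count: isin() flags every cell found in `b`,
# and sum(axis=1) counts the flagged cells in each row.
a["Count"] = a.isin(b).sum(axis=1)

print(unique_matches, total_matches)
print(a)
```

Assigning `len(...)` directly, as in the question, would write the same total into every row; the `isin` version gives each row its own count instead.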

If you instead want the rows that appear in both DataFrames, you can use merge, whose default is how='inner':

df.merge(df2, on=list(df.columns))

This returns a DataFrame containing only the rows common to both. In our example:

   0  1
0  a  b

Note that you can modify the on parameter to include only the columns you want.
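As a rough illustration of changing `on`, the sketch below uses the same values as the example but with hypothetical column names `first` and `second` (the original frames use the default integer labels), so that the suffixed output columns are easier to read.

```python
import pandas as pd

# Hypothetical column names, used only for this illustration.
left = pd.DataFrame([["a", "c"], ["a", "b"]], columns=["first", "second"])
right = pd.DataFrame([["f", "c"], ["a", "b"]], columns=["first", "second"])

# Match on every column: only rows identical in both frames remain.
full_match = left.merge(right, on=["first", "second"])

# Match on "second" only: rows pair up whenever "second" agrees, and the
# non-key column "first" appears twice, suffixed "_x" (left) and "_y" (right).
partial_match = left.merge(right, on=["second"])

print(full_match)
print(partial_match)
```

With fewer key columns, more rows can pair up, so the result of the partial merge is no longer a list of fully duplicated rows.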
