3

I'm trying to figure out how to count a given combination of 2 strings regardless of which string is first / second.

Here is my code:

import pandas as pd

mylist = [[('Smith JR', 'Kim YY'), ('Smith JR', 'Ron AA'), ('Kim YY', 'Ron AA')],
          [('Kim YY', 'Smith JR')], [('Smith JR', 'Ron AA')]]

flat_list = [item for sublist in mylist for item in sublist]

df = pd.DataFrame(flat_list, columns=["From", "To"])
df_graph = df.groupby(["From", "To"]).size().reset_index()
df_graph.columns = ["From", "To", "Count"]

print(df_graph)

which gives:

       From        To  Count
0    Kim YY    Ron AA      1
1    Kim YY    Smith JR    1
2  Smith JR    Kim YY      1
3  Smith JR    Ron AA      2

but since Kim YY Smith JR and Smith JR Kim YY form a connection between the same two people I want it to give:

       From        To  Count
0    Kim YY    Ron AA      1
1    Kim YY    Smith JR    2
2  Smith JR    Ron AA      2

I have seen a number of solutions which remove the duplicated row but don't combine the Counts from each row as I desire. I can't seem to figure out how to combine the

1    Kim YY    Smith JR    1
2  Smith JR    Kim YY      1

rows such that only the Kim YY - Smith JR row remains and the Count is 2. Also, in my actual data the count can be greater than 1 for a given row.

ALollz
  • 57,915
  • 7
  • 66
  • 89
bjg
  • 115
  • 6

2 Answers2

5

Sort the two columns together before adding to the dataframe so that you are guaranteed that a pair will only appear in a certain order. Only then apply your counting method. Using the method from link for sorting:

import pandas as pd
import networkx as nx

mylist = [[('Smith JR','Kim YY'),('Smith JR','Ron AA'),('Kim YY','Ron AA')],[('Kim YY','Smith JR')],[('Smith JR','Ron AA')]]

flat_list = [item for sublist in mylist for item in sublist]

df = pd.DataFrame(flat_list, columns=["From", "To"])
#create a new dataframe with the value pairs sorted. You can also sort earlier if you prefer.
df = pd.DataFrame(np.sort(df[["From", "To"]]), columns = ["From", "To"])
#now, just apply the groupby.
df_graph = df.groupby(["From", "To"], axis=0).size().reset_index()
#Output:
     From        To  0
0  Kim YY    Ron AA  1
1  Kim YY  Smith JR  2
2  Ron AA  Smith JR  2
Paritosh Singh
  • 6,034
  • 2
  • 14
  • 33
3

Quick 'n Dirty

But not that dirty

pd.value_counts([*map(frozenset, zip(df.From, df.To))])

(Smith JR, Ron AA)    2
(Kim YY, Smith JR)    2
(Kim YY, Ron AA)      1
dtype: int64
piRSquared
  • 285,575
  • 57
  • 475
  • 624