4

I have a single column Pandas dataframe:

s = 
      VALUE
INDEX
A     12
B     21
C     7
...
Y     21
Z     7

I want to make it into a square matrix mask with the same index and columns as s.index, with each element either True if the value of column and index are the same in s or False otherwise.

mask = 
      A     B     C ...      Y     Z 
A  True False False ...  False False
B False  True False ...   True False
C False False  True ...  False  True
...
Y False  True False ...   True False
Z False False  True ...  False  True

My actual s has 10K+ rows. What is the fastest way to generate that mask DataFrame?

One way I've tried is to create a 2-level dictionary with two for loops. (eg. dict['A']['B'] = dict['B']['A'] = True if s.loc['A'] == s.loc['B'] else False, etc.). Then convert bottom level of dict to a Pandas Series (eg. row = pd.Series(dict[A])), then append that series to mask. mask is constructed iteratively row by row.

That takes really long time, and has to explicitly loop through 10K x 10K / 2 = 50M elements... Not ideal?

Zhang18
  • 4,800
  • 10
  • 50
  • 67

1 Answers1

3

Use numpy broadcasting

v = s.VALUE.values
pd.DataFrame(v == v[:, None], s.index, s.index)

INDEX      A      B      C      Y      Z
INDEX                                   
A       True  False  False  False  False
B      False   True  False   True  False
C      False  False   True  False   True
Y      False   True  False   True  False
Z      False  False   True  False   True
piRSquared
  • 285,575
  • 57
  • 475
  • 624