I want to compute correlation percentages between multiple items that appear in log files. To do so, I take the number of times an item appears and divide it by the number of times it appears while another item is present. I won't go too far into the details, but this correlation is not symmetrical (the correlation between A and B is not the same as between B and A).

As an output I have a dictionary with a format like this one:

{
    itemA:  {
        itemB: 0.85,
        itemC: 0.12
    },
    itemB:  {
        itemA: 0.68,
        itemC: 0.24
    },
    itemC:  {
        itemA: 0.28
    }
}

I have tried working with DictVectorizer from sklearn but it doesn't work since it requires a list of dictionaries.

I would like the output to be a matrix that I can visualise with matplotlib, something like this:

[[1, 0.85, 0.12],
 [0.68, 1, 0.24],
 [0.28, 0, 1]]

If possible, I would also like the matplotlib visualisation to have a label for each row and column, since my dict has way more than 3 items.

I hope that everything is clear. Thank you for your help.

Dan
  • Do you want a list of dictionaries like your text says, or a list of lists like in your example output? – Dan Jul 19 '19 at 09:02

2 Answers

1

You can do this efficiently with pandas and numpy:

import pandas as pd

d = {
    'itemA':  {
        'itemB': 0.85,
        'itemC': 0.12
    },
    'itemB':  {
        'itemA': 0.68,
        'itemC': 0.24
    },
    'itemC':  {
        'itemA': 0.28
    }
}

df = pd.DataFrame(d)

# both axes must list the same items in the same order so that the
# diagonal lines up: sort rows and columns alphabetically
df = df.sort_index(axis=0)
df = df.sort_index(axis=1)

# outer keys are the columns of the dataframe, so transpose to make them rows;
# copy so the array is writable and changes do not leak back into df
a = df.to_numpy(copy=True).T

# if needed, fill the diagonal with 1 and replace NaN with 0
import numpy as np

np.fill_diagonal(a, 1)
a[np.isnan(a)] = 0

The matrix now is:

array([[1.  , 0.85, 0.12],
       [0.68, 1.  , 0.24],
       [0.28, 0.  , 1.  ]])
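
If the keys are not in a consistent order, or an item only ever shows up as an inner key, sorting alone may not produce a square matrix. Below is a minimal sketch that reindexes both axes with the union of all labels instead. The helper name `dict_to_matrix` is hypothetical, and it assumes (as in the question) that missing pairs should be 0 and the diagonal 1.

```python
import numpy as np
import pandas as pd


def dict_to_matrix(d):
    """Hypothetical helper: nested dict -> (labels, square ndarray)."""
    # orient="index" makes the outer keys the rows directly (no transpose needed)
    df = pd.DataFrame.from_dict(d, orient="index")
    # every item that appears as an outer or an inner key, in sorted order
    labels = sorted(set(df.index) | set(df.columns))
    # same labels on both axes -> square matrix, NaN where a pair is missing
    df = df.reindex(index=labels, columns=labels)
    a = df.to_numpy(dtype=float, copy=True)
    np.fill_diagonal(a, 1.0)                   # assumption: diagonal is 1
    return labels, np.nan_to_num(a, nan=0.0)   # assumption: missing pairs are 0


# same values as the example data, with the keys deliberately out of order
d = {
    'itemC': {'itemA': 0.28},
    'itemB': {'itemC': 0.24, 'itemA': 0.68},
    'itemA': {'itemB': 0.85, 'itemC': 0.12},
}
labels, a = dict_to_matrix(d)
print(labels)
print(a)
```

Returning `labels` alongside the array keeps the row and column names available for plotting later.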

To visualize this matrix:

import matplotlib.pyplot as plt
plt.matshow(a)
plt.show()

By default, the tick labels are the integer row and column indices of the array, not the item names.
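
To get the item names on both axes, the ticks can be set explicitly. This is only a sketch: the matrix and the `labels` list are typed in by hand from the example above, and the fixed colour range of 0 to 1 assumes the values are ratios.

```python
import matplotlib.pyplot as plt
import numpy as np

# matrix and labels taken from the example above
labels = ['itemA', 'itemB', 'itemC']
a = np.array([[1.0, 0.85, 0.12],
              [0.68, 1.0, 0.24],
              [0.28, 0.0, 1.0]])

fig, ax = plt.subplots()
im = ax.matshow(a, vmin=0, vmax=1)   # assumption: values are ratios in [0, 1]

# one tick per item, labelled with the item name
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=90)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)

fig.colorbar(im, ax=ax)              # colour scale for the values
```

Call `plt.show()` afterwards to display the figure, or `fig.savefig(...)` to write it to a file. With many items, a larger `figsize` in `plt.subplots()` keeps the labels readable.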

vpekar
  • Hi. Thanks a lot. Your solution is great yet it requires the column of the dataframe to be sorted (here it only works because items are alphabetically sorted in the example). Might want to add `df = df.sort_index(axis=1)` – Benoît Carlier Jul 19 '19 at 09:30
  • @BenoîtCarlier Yes, if you need to sort the data by rows or columns, you can use `df.sort_index()`. – vpekar Jul 19 '19 at 09:59
  • You should edit your answer to add that, imo. Otherwise, if `'itemA'` and `'itemB'` are inverted, you fill the wrong diagonal and overwrite the percentages. – Benoît Carlier Jul 19 '19 at 11:57
  • @BenoîtCarlier I edited the answer to say both columns and rows need to be sorted alphabetically. – vpekar Jul 19 '19 at 16:14
0

Here is code that builds the matrix as a list of lists; you can easily adapt it to whatever sequence type you want to use.

dictionary = {
    'itemA':  {
        'itemB': 0.85,
        'itemC': 0.12
    },
    'itemB':  {
        'itemA': 0.68,
        'itemC': 0.24
    },
    'itemC':  {
        'itemA': 0.28
    }
}

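# collect every label, whether it appears as an outer or an inner key
labels = sorted(set(dictionary) | {k for inner in dictionary.values() for k in inner})

matrix = []
for row in labels:
    tmp_mat = []
    for col in labels:
        if row == col:
            # diagonal: an item compared with itself
            tmp_mat.append(1)
        else:
            # look the pair up by key; missing pairs default to 0
            tmp_mat.append(dictionary.get(row, {}).get(col, 0))
    matrix.append(tmp_mat)

print(matrix)

Output:

[[1, 0.85, 0.12], [0.68, 1, 0.24], [0.28, 0, 1]]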

unpacking keys and values of a dictionary

Dorian Turba