I wrote a Python script that takes query-log data, transforms it into a list of nested dictionaries, and writes the result to a new text file.
This is a sample output of my script:
[{'ip_address': '10.10.80.209', 'domain_names': {'google.com': 2}},
{'ip_address': '10.10.25.188', 'domain_names': {'fbcdn-profile-a.akamaihd.net': 1}},
{'ip_address': '10.10.50.195', 'domain_names': {'googleads.g.doubleclick.net': 2, '0-edge-chat.facebook.com': 2, 'gg.google.com': 2, 'content.googleapis.com': 1, 'accounts.google.com': 1}}]
As you can see, I have a list of user transactions. Each transaction contains two entries: the key-value pair ip_address, and the key domain_names, whose value is a dictionary mapping each domain name to its visit count (e.g. 'google.com': 2).
I need to transform this file into a co-occurrence matrix, as shown in this image, where t is the user transactions, d is the domain names, and each value is the visit count (the visit count is 0 if the user didn't visit that domain name).
The data I created is already close to this concept. The problem is that I have to transform it into a matrix and save it as a .mat file. For every domain name that a user transaction did not visit, the value must be 0, but my list of nested dictionaries only provides the "visited" values.
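To make the target layout concrete, here is a minimal sketch of the matrix I am after, written in plain Python. The function name build_matrix is hypothetical, the third transaction is shortened from my sample output, and the choice to sort the domain names alphabetically for the column order is an assumption; any fixed order would do.

```python
# Sample transactions in the same shape as the script's output
# (the third entry is shortened for readability).
transactions = [
    {'ip_address': '10.10.80.209', 'domain_names': {'google.com': 2}},
    {'ip_address': '10.10.25.188', 'domain_names': {'fbcdn-profile-a.akamaihd.net': 1}},
    {'ip_address': '10.10.50.195', 'domain_names': {'gg.google.com': 2, 'accounts.google.com': 1}},
]


def build_matrix(transactions):
    """Return (domains, matrix): one row per transaction, one column per domain."""
    # Column order (assumption): every distinct domain name, sorted alphabetically.
    domains = sorted({d for t in transactions for d in t['domain_names']})
    # .get(d, 0) fills in 0 for every domain the transaction did not visit.
    matrix = [[t['domain_names'].get(d, 0) for d in domains] for t in transactions]
    return domains, matrix


domains, matrix = build_matrix(transactions)
for row in matrix:
    print(row)
```

Each row corresponds to one transaction t and each column to one domain name d, so the first row here has a 2 in the google.com column and 0 everywhere else.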
It needs to be a .mat file because the script that clusters this data requires that file type. As far as I know, .mat is a MATLAB file type, and I have no prior knowledge of that language.
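From what I can tell, a .mat file can be written from Python without touching MATLAB, using scipy.io.savemat. The sketch below assumes NumPy and SciPy are installed. The file name transactions.mat and the variable names matrix and domains are placeholders, since I don't know which names the clustering script expects. The matrix values are a made-up two-by-two example.

```python
import numpy as np
from scipy.io import savemat, loadmat

# Hypothetical example values: rows are transactions, columns are domains.
domains = ['gg.google.com', 'google.com']
matrix = np.array([[0, 2],
                   [2, 0]])

# Each dictionary key becomes a variable name inside the .mat file.
# An object array of strings is stored as a MATLAB cell array.
savemat('transactions.mat', {
    'matrix': matrix,
    'domains': np.array(domains, dtype=object),
})

# Read the file back to check that the matrix survived the round trip.
loaded = loadmat('transactions.mat')
print(loaded['matrix'])
```

If the clustering script looks for a specific variable name, the dictionary key passed to savemat would have to match it.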
So how do I do this?