Identify most common combinations of values in 2 or more columns in a CSV file

Question

How to find the most common combinations of values in 2 or more columns for rows in a CSV file. example:

event,rack,role,dc
network,north,mobile,africa
network,east,mobile,asia
oom,south,desktop,europe
cpu,east,web,northamerica
oom,north,mobile,europe
cpu,south,web,northamerica
cpu,west,web,northamerica

I have tried to create lists for some of the possible combinations I'm looking at, and then use the most_common() method in Collections.Counter to find the common patterns. but I need an algorithm to find common records for any possible combination of 2 or more columns.

My code so far:

import csv
from collections import Counter

class Alert:
    def __init__(self, event, rack, role, dc):
        self.event = event
        self.rack = rack
        self.role = role
        self.dc = dc

    def __str__(self):
        return(",".join([self.event, self.rack, self.role, self.dc]))


alerts = []
with open('data.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)

    for row in csv_reader:
        alert = Alert(row['event'], row['rack'], row['role'], row['dc'])
        alerts.append(alert)
dcevent= []
dceventrole = []
dcrole = []
dcrolerack = []

for alert in alerts:
    dcevent.append(alert.dc + '-' + alert.event)
    dceventrole.append(alert.dc+'-'+alert.event+'-'+alert.role)
    dcrole.append(alert.dc+'-'+alert.role)
    dcrolerack.append(alert.dc+'-'+alert.role+'-'+alert.rack)


masterlist = Counter(dcevent).most_common() + Counter(dceventrole).most_common() + Counter(dcrole).most_common() + Counter(dcrolerack).most_common()

for item in sorted(masterlist, key=lambda x: x[1], reverse=True):
    print(item)

This is the output for the records mentioned above:

('northamerica-web-cpu', 3) # there are 3 rows matching the values northamerica,web and cpu
('northamerica-web', 3) # there are 3 rows matching just the values northamerica and web
('northamerica-cpu', 3) # there are 3 rows matching northamerica and cpu
('europe-oom', 2) # there are 2 rows matching europe and oom
('africa-mobile-network', 1)
('asia-mobile-network', 1)
('europe-desktop-oom', 1)
('europe-mobile-oom', 1)
('africa-mobile-north', 1)
('asia-mobile-east', 1)
('europe-desktop-south', 1)
('northamerica-web-east', 1)
('europe-mobile-north', 1)
('northamerica-web-south', 1)
('northamerica-web-west', 1)
('africa-mobile', 1)
('asia-mobile', 1)
('europe-desktop', 1)
('europe-mobile', 1)
('africa-network', 1)
('asia-network', 1)

You'll have to explain your output (maybe a smaller dataset and a hand-created version of what the correct output should be?). Also, why is `('northamerica-network', 4)` but `('northamerica-cpu', 3)`? — thebjorn, Mar 25 '19 at 16:24
To the close-voters: this question is no longer unclear (and it's definitely not too broad). — thebjorn, Mar 26 '19 at 22:16

score 1 · Accepted Answer · answered Mar 26 '19 at 22:15

Let me start by defining the data structures in-situ, since csv-reading is orthogonal to the real issue:

lines = [line.split(',') for line in """\
event,rack,role,dc
network,north,mobile,africa
network,east,mobile,asia
oom,south,desktop,europe
cpu,east,web,northamerica
oom,north,mobile,europe
cpu,south,web,northamerica
cpu,west,web,northamerica
""".splitlines()]

for line in lines:
    print line

which prints:

['event', 'rack', 'role', 'dc']
['network', 'north', 'mobile', 'africa']
['network', 'east', 'mobile', 'asia']
['oom', 'south', 'desktop', 'europe']
['cpu', 'east', 'web', 'northamerica']
['oom', 'north', 'mobile', 'europe']
['cpu', 'south', 'web', 'northamerica']
['cpu', 'west', 'web', 'northamerica']

now, lets create all possible combinations of 2 or more words from each line. There are 11 ways to choose 2, 3, or 4 from 4 (4C2 + 4C3 + 4C4 == 6 + 4 + 1 == 11).

The algorithm I'm using to find the combinations looks at the binary numbers with 4 digits (i.e. 0000, 0001, 0010, 0011, 0100, etc.) and for each such number creates the combination of words depending on if the respective binary digit is 1. E.g. for 0101 the second and fourth word are chosen:

def find_combinations(line):
    combinations = []
    for i in range(2**len(line)):
        bits = bin(i)[2:].zfill(len(line))
        if bits.count('1') < 2:  # skip numbers with less than two 1-bits
            continue
        combination = set()
        for bit, word in zip(bits, line):
            if bit == '1':
                combination.add(word)
        combinations.append('-'.join(sorted(combination)))
    return combinations

now we can loop through all combinations and count their frequency:

from collections import defaultdict
counter = defaultdict(int)
for line in lines:
    for c in find_combinations(line):
        counter[c] += 1

and finally we can sort (descending) on frequency

for combination_freq in sorted(counter.items(), key=lambda item: item[1], reverse=True):
    print combination_freq

to get:

('cpu-northamerica', 3)
('northamerica-web', 3)
('cpu-northamerica-web', 3)
('cpu-web', 3)
('mobile-north', 2)
('mobile-network', 2)
('europe-oom', 2)
('east-network', 1)
('asia-east-mobile', 1)
('asia-east-network', 1)
('cpu-south-web', 1)
('east-northamerica-web', 1)
('europe-north', 1)
('cpu-east', 1)
...etc.

Thank you @thebjorn that was actually a pretty simple way of doing it. — Sanjan, Mar 27 '19 at 00:27

Identify most common combinations of values in 2 or more columns in a CSV file

1 Answers1