
This is my first question so please forgive any mistakes.

I have a large CSV file with many (~10,000,000+) lines of information like the following example:

date;box_id;box_length;box_width;box_height;weight;type
--snip--
1999-01-01 00:00:20;nx1124;10;4;5.5;2.1;oversea
1999-01-01 00:00:20;np11r4;8;3.25;2;4.666;local
--snip--

My objective is to read through each line, calculate each box's volume, and, within a 1 hour window (for example, 00:00:00 - 00:00:59), record whenever 2 or more boxes have a similar volume (+-10% difference), along with their timestamps and types.

At the moment, I am using a brute-force approach:

  • run through each line
  • calculate volume
  • go to next line and compute volume
  • compare
  • repeat till 1 hr time-difference is detected
  • remove the first box from list
  • add another box to the list
  • repeat the process with second box

For example, if my 1 hour window contains boxes 1, 2, 3, 4, ..., I'm doing this

1
2 == 1
3 == 1 then == 2
4 == 1 then == 2 then == 3
5 == 2 then == 3 then == 4 # removed 1 from list(1hr window moved down)
6 == 2 then == 3 then == 4 then == 5
7 == 2 then == 3 then == 4 then == 5 then == 6
.... so on ....
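The steps above can be sketched roughly like this (illustrative only; `similar_pairs` and the `(timestamp, volume)` tuples are made-up names for this sketch, not my real code):

```python
from collections import deque

WINDOW_SECONDS = 3600  # 1 hour window

def similar_pairs(boxes):
    """boxes: iterable of (unix_timestamp, volume) pairs, in time order."""
    window = deque()
    pairs = []
    for ts, vol in boxes:
        # drop boxes that fell out of the 1-hour window
        while window and ts - window[0][0] >= WINDOW_SECONDS:
            window.popleft()
        # compare the new box against every box still in the window
        for w_ts, w_vol in window:
            if abs(vol - w_vol) / max(vol, w_vol) <= 0.1:
                pairs.append(((w_ts, w_vol), (ts, vol)))
        window.append((ts, vol))
    return pairs
```

The deque makes the eviction step cheap, but the inner comparison loop is still quadratic within each window, which is what I am trying to get rid of.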

This is the best I can think of, since I have to compare each and every box with the others within a given time window. But it is very slow at the moment.

I am looking for a better algorithm, but I am unsure which direction to go. I am trying to learn some excellent tools (so far, Pandas is my favorite), but I am under the assumption that I need to implement some algorithm first to allow these tools to deal with the data in the way I need.
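For example, with Pandas the parsing and volume step might look like this (a sketch only; `csv_text` stands in for the real file here, and I have not benchmarked it):

```python
import io
import pandas as pd

# Inline stand-in for the real CSV file (same header as in the question).
csv_text = """date;box_id;box_length;box_width;box_height;weight;type
1999-01-01 00:00:20;nx1124;10;4;5.5;2.1;oversea
1999-01-01 00:00:20;np11r4;8;3.25;2;4.666;local
"""

df = pd.read_csv(io.StringIO(csv_text), sep=';', parse_dates=['date'])
df['volume'] = df.box_length * df.box_width * df.box_height
# Sorting by volume turns the all-pairs check into a neighbour scan.
df = df.sort_values('volume').reset_index(drop=True)
print(df[['box_id', 'volume']])
```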

If it helps, I will post my Python code (source).

Update: Following is my code. I have omitted several lines (such as the try/except block for invalid file path/format, type-conversion error handling, etc.). I have customized the code a bit to work with a 5-second window.

Following is the Box class

from datetime import datetime
from time import mktime

class Box(object):
    """ Box model """

    def __init__(self,data_set):
        self.date = data_set[0]
        self.timestamp = self.__get_time()
        self.id = data_set[1]
        self.length = float(data_set[2])
        self.width = float(data_set[3])
        self.height = float(data_set[4])
        self.weight = float(data_set[5])  # weights like 2.1 are floats; int() would raise ValueError
        self.volume = self.__get_volume()

    def __get_time(self):
        """ convert from date string to unix-timestamp """
        str_format = '%Y-%m-%d %H:%M:%S'
        t_tuple = datetime.strptime(self.date, str_format).timetuple()
        return mktime(t_tuple)

    def __get_volume(self):
        """ calculate volume of the box """
        return (self.length * self.width * self.height)

Following is the actual program performing the comparison. I combined my utility file and main.py file together for convenience.

from csv import reader
from io import open as open_file
from os import path
from sys import argv, exit
from time import time

# custom lib
from Box import Box

def main():

    file_name = str.strip(argv[1])
    boxes_5s = []
    diff = 0

    similar_boxes = []

    for row in get_file(file_name):
        if row:
            box = Box(row)

            if len(boxes_5s) > 0:
                diff = box.timestamp - boxes_5s[0].timestamp
                if diff < 6:
                    boxes_5s.append(box)
                else:
                    similar_boxes += get_similar(boxes_5s)
                    del boxes_5s[0] # remove the oldest box
                    boxes_5s.append(box)
            else:
                boxes_5s.append(box)

        print(similar_boxes)


def get_file(file_name):
    """ open and return csv file pointer line by line """
    with open_file(file_name,'rb') as f:
        header = f.readline()
        print(header)
        rows = reader(f, delimiter=';')

        for r in rows:
            yield r
        else:
            yield ''


def get_similar(box_list):
    """ compare boxes for similar volume """    

    num_boxes = len(box_list)

    similar_boxes = []
    record_str = "Box#{} Volm:{} and #{} Volm:{}"
    for i in xrange(num_boxes):
        box_1 = box_list[i]

        for j in xrange(i+1, num_boxes):
            box_2 = box_list[j]

            vol_diff = abs((box_1.volume - box_2.volume)/box_1.volume) <= 0.1


            if vol_diff: similar_boxes.append(record_str.format(box_1.id,box_1.volume,box_2.id, box_2.volume))

    return similar_boxes

if __name__ == "__main__":
    main()

Thank you.

  • you should probably order them first and then find the similar boxes – armonge Jun 24 '16 at 18:26
  • it helps. see [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) – rll Jun 24 '16 at 18:26
  • I am sorry for the vague nature of my question, I'll post some code later when I am back home – A.I. Jun 24 '16 at 18:39
  • Thank you. I think that is a great idea; what I understand is that I should calculate the volumes first, then sort them, and finally pick out the ones with similar values. I shall try it. @armonge – A.I. Jun 24 '16 at 18:39
  • does "within 1 hour window" mean the hour after your current box's timestamp, or the hour within your current box's timestamp? looks like your bruteforce version is using the first case. – Fabricator Jun 24 '16 at 18:43
  • The sample denotes a one-minute window; as in the usual ISO 8601 rep, time is given (here) as `HH:MM:SS`, so the half-open interval for the hour "related" to 00:00:00 would be [00:00:00, 01:00:00), or the closed interval [00:00:00, 00:59:59] with second resolution. Or am I wrong in understanding the concept? – Dilettant Jun 24 '16 at 19:02
  • w.r.t. re-ordering strategies: the idea of collecting 1e7 or more items and filtering by some similarity criterion smells like O(n^2), while the moving time-window kernel over already time-ordered data (the 1-hour skip criterion is helpful), which the OP labels as brute force, might actually lead to O(n) time complexity. Maximal storage needs are triggered by many items per hour that are all similar. Curiously awaiting the code ... – Dilettant Jun 24 '16 at 19:08
  • The 1 hour window means the difference between the first box and the last box in the collection should be 1 hour, or what Dilettant said: [00:00:00, 00:59:59]. @Fabricator – A.I. Jun 24 '16 at 22:46

1 Answer


Taking the first timestamp as the start of a one hour window (instead of clock-hour bins always starting at hour:00:00), I think a quite feasible implementation for data amounts as small as a few tens of millions of lines might be (it expects time-ordered entries in the file):

#! /usr/bin/env python
from __future__ import print_function

import csv
import datetime as dt
import math
import collections


FILE_PATH_IN = './box_data_time_ordered_100k_sparse.csv'
TS_FORMAT = '%Y-%m-%d %H:%M:%S'
TS_TOKEN = 'date'
SIMILAR_ENOUGH = 0.1
BoxEntry = collections.namedtuple(
    'BoxEntry', ['start_ts', 'a_ts', 't_type', 'b_volume'])


def box_volume(box_length, box_width, box_height):
    """Volume in cubic of length units given."""
    return box_length * box_width * box_height


def filter_similar_box_volumes(box_entries):
    """Ordered binary similarity comparator using complex algorithm
    on a medium large slice of data."""

    def _key(r):
        """sort on volume."""
        return r.b_volume

    entries_volume_ordered = sorted(box_entries, key=_key)
    collector = []
    for n, box_entry in enumerate(entries_volume_ordered[1:], start=1):
        one = box_entry.b_volume
        prev_box_entry = entries_volume_ordered[n - 1]  # predecessor in volume order
        previous = prev_box_entry.b_volume
        if one and math.fabs(one - previous) / one < SIMILAR_ENOUGH:
            if box_entry not in collector:
                collector.append(box_entry)
            if prev_box_entry not in collector:
                collector.append(prev_box_entry)
    return collector


def hourly_boxes_gen(file_path):
    """Simplistic generator, yielding hour slices of parsed
    box data lines belonging to 1 hour window per yield."""

    csv.register_dialect('boxes', delimiter=';', quoting=csv.QUOTE_NONE)
    start_ts = None
    cx_map = None
    hour_data = []
    an_hour = dt.timedelta(hours=1)
    with open(file_path, 'rt') as f_i:
        for row in csv.reader(f_i, 'boxes'):
            if cx_map is None and row and row[0] == TS_TOKEN:
                cx_map = dict(zip(row, range(len(row))))
                continue
            if cx_map and row:
                a_ts = dt.datetime.strptime(row[cx_map[TS_TOKEN]], TS_FORMAT)
                t_type = row[cx_map['type']]
                b_length = float(row[cx_map['box_length']])
                b_width = float(row[cx_map['box_width']])
                b_height = float(row[cx_map['box_height']])
                b_volume = box_volume(b_length, b_width, b_height)
                if start_ts is None:
                    start_ts = a_ts
                    hour_data.append(
                        BoxEntry(start_ts, a_ts, t_type, b_volume))
                elif a_ts - an_hour < start_ts:
                    hour_data.append(
                        BoxEntry(start_ts, a_ts, t_type, b_volume))
                else:
                    yield filter_similar_box_volumes(hour_data)
                    hour_data = [BoxEntry(start_ts, a_ts, t_type, b_volume)]
                    start_ts = a_ts
        if hour_data:
            yield filter_similar_box_volumes(hour_data)


def main():
    """Do the thing."""
    for box_entries in hourly_boxes_gen(FILE_PATH_IN):
        for box_entry in box_entries:
            print(box_entry.start_ts, box_entry.a_ts, box_entry.t_type)

if __name__ == '__main__':
    main()

With sample input file:

date;box_id;box_length;box_width;box_height;weight;type
1999-01-01 00:00:20;nx1124;10;4;5.5;2.1;oversea
1999-01-01 00:00:20;np11r4;8;3.25;2;4.666;local
1999-01-01 00:10:20;np11r3;8;3.25;2.1;4.665;local
1999-01-01 00:20:20;np11r2;8;3.25;2.05;4.664;local
1999-01-01 00:30:20;np11r1;8;3.23;2;4.663;local
1999-01-01 00:40:20;np11r0;8;3.22;2;4.662;local
1999-01-01 00:50:20;dp11r4;8;3.24;2;4.661;local
1999-01-01 01:00:20;cp11r3;8;3.25;2;4.666;local
1999-01-01 01:01:20;bp11r2;8;3.26;2;4.665;local
1999-01-01 01:02:20;ap11r1;8;3.22;2;4.664;local
1999-01-01 01:03:20;zp11r0;12;3.23;2;4.663;local
1999-01-01 02:00:20;yp11r4;8;3.24;2;4.662;local
1999-01-01 04:00:20;xp11r4;8;3.25;2;4.661;local
1999-01-01 04:00:21;yy11r4;8;3.25;2;4.661;local
1999-01-01 04:00:22;xx11r4;8;3.25;2;4.661;oversea
1999-01-01 04:59:19;zz11r4;8;3.25;2;4.661;local

yields:

1999-01-01 00:00:20 1999-01-01 00:30:20 local
1999-01-01 00:00:20 1999-01-01 00:50:20 local
1999-01-01 00:00:20 1999-01-01 00:00:20 local
1999-01-01 00:00:20 1999-01-01 00:20:20 local
1999-01-01 00:00:20 1999-01-01 00:10:20 local
1999-01-01 00:00:20 1999-01-01 00:00:20 oversea
1999-01-01 00:00:20 1999-01-01 01:00:20 local
1999-01-01 01:00:20 1999-01-01 01:01:20 local
1999-01-01 01:00:20 1999-01-01 01:03:20 local
1999-01-01 04:00:20 1999-01-01 04:00:21 local
1999-01-01 04:00:20 1999-01-01 04:00:22 oversea
1999-01-01 04:00:20 1999-01-01 04:59:19 local

Some notes:

  1. csv module used for reading, with a specific dialect (as semicolon is not default delimiter)

  2. import datetime with alias, to access datetime class for strptime method without overriding the module name - YMMV

  3. encapsulate the chunked hour window reader in a generator function

  4. volume and similarity calculation in separate functions.

  5. volume-ordered simple filter algorithm that should be roughly O(m log m) (dominated by the sort), for m being the number of entries per hour window.

  6. Use named tuple for compact storage but also meaningful addressing.

  7. To implement a clock-adjusted 1 hour window (not using the first timestamp to bootstrap), one needs to adjust the code a bit, but it should be trivial
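A sketch of the adjustment for note 7 (a hypothetical helper, not part of the code above): map each timestamp to the start of its clock hour and group on that, instead of bootstrapping the window from the first timestamp:

```python
import datetime as dt

def clock_hour_bucket(a_ts):
    """Start of the clock hour containing a_ts, usable as a grouping key."""
    return a_ts.replace(minute=0, second=0, microsecond=0)
```

For example, `clock_hour_bucket(dt.datetime(1999, 1, 1, 0, 59, 59))` falls into the `1999-01-01 00:00:00` bucket.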

Otherwise curiously awaiting the code sample from the OP ;-)

Updated the similar-enough filtering algorithm, so that event-rich hours do not make an O(n^2) algorithm eat all our time ... (the naive one with a nested loop was removed).
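The core of that filtering step, reduced to bare volumes (a simplified sketch of the sorted adjacent-comparison idea in `filter_similar_box_volumes`; the names here are illustrative):

```python
import math

SIMILAR_ENOUGH = 0.1

def similar_after_sort(volumes):
    """Sort once, then compare each volume only to its predecessor."""
    ordered = sorted(volumes)
    hits = set()
    for prev, cur in zip(ordered, ordered[1:]):
        # guard against zero volume before the relative-difference division
        if cur and math.fabs(cur - prev) / cur < SIMILAR_ENOUGH:
            hits.update((prev, cur))
    return hits
```

Because similar volumes end up adjacent after sorting, one linear scan suffices instead of comparing all pairs.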

Adding a day full of one-entry-per-second data to the sample, with 3600 candidates per hour for the similarity check, took approx. 10 seconds for these approx. 100k lines (86400+).

Dilettant
  • Thanks! Yes, this is very helpful. I am using something along this line but at the moment it is very crude. I shall try out this solution tomorrow. :) – A.I. Jun 24 '16 at 22:30