-1

I don't know even how to name my problem - I have a list of tuples in python:

(int, str, datetime, float)

There is a bunch of rows in that list, they are sorted by datetime and I'd like to count how much rows in a 5 minute time span, which have floats in given range, i.e. from 0 to 2, from 2 to 5, from 5 to 10 and so on. I mean, by given such data (date is not string, but datetime.datetime):

(1, 'abc', '2014-09-10 17:50:34', 5.5)
(2, 'abc', '2014-09-10 17:51:34', 1.5)
(3, 'abc', '2014-09-10 17:52:14', 7.1)
(4, 'abc', '2014-09-10 17:59:34', 9.5)
(5, 'abc', '2014-09-10 17:59:54', 9.2)

I'd like to receive some kind of dictionary:

{ the_end_of_time_interval1: {'0to2': int, '2to5': int, '5to10': int, ... },
  the_end_of_time_interval2: {'0to2': int, '2to5': int, '5to10': int, ... }, 
...}

for example:

{ '2014-09-10 17:52:34': { '0to2': 1, '2to5': 0, '5to10': 2, '10to15': 0 },
  '2014-09-10 17:59:54': { '0to2': 0, '2to5': 0, '5to10': 2, '10to15' : 0 } }

My question is - is there some elegant way to do that? I'd like to save it to the file and send it to some database for purpose of monitoring.

bartekmp
  • 403
  • 3
  • 9
  • 21
  • 2
    Some suggestions to get you started: you can use itertools.groupby to divide into groups where all elements are within 5 minutes of the first (or within 5 minutes of the first rounded down, I'm not sure what you want). Then, within each group, you can use sorted to sort by the float value and groupby again, or you can just loop explicitly and increment a Counter value based on the float's range. If you don't understand part of this, or try it and get stuck somewhere, you can write a more specific question. – abarnert Sep 10 '14 at 16:26
  • 1
    Alternatively, if you're going to store this in a database anyway, and that database is going to be relational, you can use SQL GROUPBY and aggregate functions in a similar way. Or you can use Pandas with its grouping and aggregate functions. Or, of course, you can just write an explicit loop around a very simple state machine (when you reach a 5-minute boundary, output the current counter and start a new one). – abarnert Sep 10 '14 at 16:27
  • I don't have an access to that database, I just have to aggregate some interesting data and insert it there. – bartekmp Sep 10 '14 at 16:28

1 Answers1

1

you can make use of

itertools.groupby

you just need to function to do the grouping/sorting

one that will get your dates in order:

the_date = None
def split_5_min(data_row):
    global the_date

    row_date = datetime.datetime.strptime(data_row[2], '%Y-%m-%d %H:%M:%S')
    if the_date is None or row_date - the_date > datetime.timedelta(minutes=5):
        the_date = row_date
        return the_date

something along this lines should do it

then you need to get another one that will put your floats into buckets:

def bucket_floats(data_row):
    float_data = data_row[3]
    if float_data > 0 and float_data <= 2:
        return 1
    elif float_data > 2 and float_data <= 5:
        return 2
    elif float_data > 5 and float_data <= 10:
        return 3
    ...

so now to the meat of things. you want to have the data in a list

then:

final_dict = {}

# always sort as groupby creates group each time the group value changes
data.sort(key=split_5_min)
for period, data_tuples in itertools.groupby(data, split_5_min):
    group_data = list(data_tuples)
    # sort again
    group_data.sort(key=bucket_floats)
    final_dict[period] = {}
    # do another grouping witin 5 min group
    for bucket, float_tuples in itertools.groupby(data, bucket_floats):
        # pack into a dict
        final_dict[period][bucket] = len(list(float_tuples))

think the datetime bucketing may be off here... you probably want something more complicated. or just a straight sort on the dates before you run the slit_5_min function on it

user3012759
  • 1,977
  • 19
  • 22
  • Thank you very much. But in line: data.sort(key=split_5_min) I get an error: TypeError: unorderable types: NoneType() < datetime.datetime() – bartekmp Sep 10 '14 at 17:25
  • it's probably because of that: row_date - the_date > datetime.timedelta(minutes=5): you can split it in two ifs if you prefer and see if it goes any better – user3012759 Sep 11 '14 at 08:21