I have a Python script that I use to scrape data from the web. The data is then stored in a MongoDB database, in the following format:

{
  "id": "abcd",
  "value": 100.0,
  "timestamp": "2011-07-14 19:43:37"
}

I have a bunch of documents like this. What I would like to do is group the data by hour and get the average of the values, along with the min and max. From what I can see in the MongoDB docs, both the aggregation pipeline and map-reduce can group by hour or compute an average, which I could then store back to the DB and rerun the aggregation pipeline or map-reduce on the intermediate data.

Is there any way to do it in one step, without storing the data in temporary collections and running a new iteration?

Ivan Kvolik
  • I don't quite understand the part where you say storing it back to MongoDB. Why do we have to store anything in a temporary collection? Aggregation can output the data the way you like. What would be an example of a new iteration? How is it relevant? You serve the data on demand by running an aggregation query based on user input or some static criteria. – s7vr Jan 07 '18 at 13:48
  • @Veeram I stated that I would store the data in a temporary collection because I have been unable to find an example of an aggregation that performs several actions at once. The examples in the docs show only one operation. I need the query to group the data by an hour (or some other time period) and calculate the average, min, and max of the data grouped in that time interval. If you can provide a working example I would be most grateful. I am not an expert in MongoDB, which is why I am asking the question. – Ivan Kvolik Jan 07 '18 at 22:16
  • Please take a look [here](https://stackoverflow.com/questions/26814427/group-result-by-15-minutes-time-interval-in-mongodb). Let me know what you think. – s7vr Jan 08 '18 at 04:16
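For reference, the one-step approach s7vr alludes to can be sketched as a single `$group` stage. This is a sketch, not code from the thread: it assumes the `timestamp` field is stored as the zero-padded string shown in the question, so its first 13 characters ("YYYY-MM-DD HH") identify the date and hour. The pipeline below is plain Python data; with PyMongo it would be run as `db.readings.aggregate(pipeline)` (the collection name `readings` is hypothetical):

```python
# Group documents by the "YYYY-MM-DD HH" prefix of the timestamp string
# and compute avg/min/max of "value" per hour, all in one query.
pipeline = [
    {"$group": {
        # the prefix works as a bucket key only because the timestamp
        # format is zero-padded and therefore lexicographically sortable
        "_id": {"$substrBytes": ["$timestamp", 0, 13]},
        "avgVal": {"$avg": "$value"},
        "minVal": {"$min": "$value"},
        "maxVal": {"$max": "$value"},
    }},
    {"$sort": {"_id": 1}},  # chronological order of hour buckets
]

# with a live connection (hypothetical names):
# results = list(db.readings.aggregate(pipeline))
```

No intermediate collection is needed: the grouping and the three accumulators run in the same stage, and the result comes back directly from `aggregate`.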

1 Answer


This might help:

from datetime import datetime
from itertools import groupby
from pprint import pprint

# assuming that collection of data objects is a list
datas = [
    {
        "id": "abcd",
        "value": 100.0,
        "timestamp": "2011-07-14 19:43:37"
    },
    {
        "id": "abcd",
        "value": 500.0,
        "timestamp": "2011-07-15 20:30:37"
    },
    {
        "id": "abcd",
        "value": 400.0,
        "timestamp": "2011-07-15 20:30:38"
    }
]

decorated_datas = []
# first we tag each record with a (date, hour) key,
# which will be used for sorting and grouping
for data in datas:
    timestamp = datetime.strptime(data["timestamp"], "%Y-%m-%d %H:%M:%S") # assuming your timestamp is in this format only
    decorated_datas.append((timestamp.date(), timestamp.hour, data))

# then we sort the data created in the last step using the date and hour
sorted_decorated_datas = sorted(decorated_datas, key=lambda x: (x[0], x[1]))


# function for calculating statistics of a given collection of numbers
def calculate_stats(collection_of_numbers):
    maxVal = max(collection_of_numbers)
    minVal = min(collection_of_numbers)
    avgVal = sum(collection_of_numbers) / len(collection_of_numbers)
    return (maxVal, minVal, avgVal)

results = []

# then we group our sorted data by date and hour, and then we calculate
# statistics for the group and append result to our final results
for key, group_iter in groupby(sorted_decorated_datas, lambda x: (x[0], x[1])):
    group_values = [data[2]["value"] for data in group_iter]
    maxValue, minValue, avgValue = calculate_stats(group_values)

    result = {"date": key[0], "hour": key[1], "minVal":
              minValue, "maxVal": maxValue, "avgVal": avgValue}
    results.append(result)


pprint(results)

And the output was:

[{'avgVal': 100.0,
  'date': datetime.date(2011, 7, 14),
  'hour': 19,
  'maxVal': 100.0,
  'minVal': 100.0},
 {'avgVal': 450.0,
  'date': datetime.date(2011, 7, 15),
  'hour': 20,
  'maxVal': 500.0,
  'minVal': 400.0}]

EDIT: After giving it some thought, I realized that the string format you use for timestamps is the perfect candidate for skipping the conversion to datetime objects entirely: the strings sort correctly on their own, so here is the updated code:

from itertools import groupby
from pprint import pprint

# assuming that collection of data objects is a list
datas = [
    {
        "id": "abcd",
        "value": 100.0,
        "timestamp": "2011-07-14 19:43:37"
    },
    {
        "id": "abcd",
        "value": 500.0,
        "timestamp": "2011-07-15 20:30:37"
    },
    {
        "id": "abcd",
        "value": 400.0,
        "timestamp": "2011-07-15 20:30:38"
    }
]


def get_date_and_hour(timestamp_str):
    date, time = timestamp_str.split()
    date = tuple(map(int, date.split('-')))
    hour = int(time.split(':')[0])
    return (*date, hour)


def calculate_stats(collection_of_numbers):
    maxVal = max(collection_of_numbers)
    minVal = min(collection_of_numbers)
    avgVal = sum(collection_of_numbers) / len(collection_of_numbers)
    return (maxVal, minVal, avgVal)

results = []

sorted_datas = sorted(datas, key=lambda x: x["timestamp"])

for key, group_iter in groupby(sorted_datas, lambda x: get_date_and_hour(x["timestamp"])):
    group_values = [data["value"] for data in group_iter]
    maxValue, minValue, avgValue = calculate_stats(group_values)

    result = {"date": key[0:3], "hour": key[3], "minVal":
              minValue, "maxVal": maxValue, "avgVal": avgValue}
    results.append(result)
pprint(results)

and the output was:

[{'avgVal': 100.0,
  'date': (2011, 7, 14),
  'hour': 19,
  'maxVal': 100.0,
  'minVal': 100.0},
 {'avgVal': 450.0,
  'date': (2011, 7, 15),  
  'hour': 20,  
  'maxVal': 500.0,
  'minVal': 400.0}]

This version is much shorter and more maintainable than the previous one.
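As a side note, sorting is not strictly required for grouping: a dict keyed by the hour prefix of the timestamp string does the same bucketing in one pass. A sketch, reusing the same sample data shape as above:

```python
from collections import defaultdict
from pprint import pprint

datas = [
    {"id": "abcd", "value": 100.0, "timestamp": "2011-07-14 19:43:37"},
    {"id": "abcd", "value": 500.0, "timestamp": "2011-07-15 20:30:37"},
    {"id": "abcd", "value": 400.0, "timestamp": "2011-07-15 20:30:38"},
]

# bucket values by the "YYYY-MM-DD HH" prefix of the timestamp string
buckets = defaultdict(list)
for data in datas:
    buckets[data["timestamp"][:13]].append(data["value"])

# compute the statistics per bucket; sorted() here only orders the output
results = [
    {"hour": hour,
     "minVal": min(vals),
     "maxVal": max(vals),
     "avgVal": sum(vals) / len(vals)}
    for hour, vals in sorted(buckets.items())
]
pprint(results)
```

This avoids the O(n log n) sort that `itertools.groupby` requires before grouping, at the cost of holding all buckets in memory at once.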

tkhurana96
  • Thank you for the reply, and it indeed works. What I had in mind was for MongoDB to do the work so my code can be as minimalistic as possible. – Ivan Kvolik Jan 07 '18 at 10:29
  • But you stated in your question that you don't want the DB to do the work; anyway, I updated my answer with a smaller version of the code, hope that helps. – tkhurana96 Jan 07 '18 at 10:52