
I'm trying out Spark with Java and MongoDB, and I want to aggregate several documents into a single one based on their timestamps. For example, I want to aggregate X documents into a single one:

{
    "_id" : ObjectId("598c32f455f0353f9e69ebf1"),
    "_class" : "...",
    "timestamp" : ISODate("2017-08-10T10:17:00.000Z"),
    "value" : 10.1
}
...
{
    "_id" : ObjectId("598c32f455f0353f9e69ebz2"),
    "_class" : "...",
    "timestamp" : ISODate("2017-08-10T10:18:00.000Z"),
    "value" : 2.1
}

Let's say I have 60 documents like this, with timestamps inside a 1-minute window (from 10:17:00 to 10:18:00), and I want to obtain one document:

{
    "_id" : ObjectId("598c32f455f0353f9e69e231"),
    "_class" : "...",
    "start_timestamp" : ISODate("2017-08-10T10:17:00.000Z"),
    "end_timestamp" : ISODate("2017-08-10T10:18:00.000Z"),
    "average_value" : **average value of those documents**
}

Is it possible to perform this kind of transformation? Can I retrieve one 1-minute window of data at a time?

An approach that takes all the documents and compares their timestamps looks slow and inefficient.

Thanks in advance.

  • Can you be more specific? So you want to aggregate all documents that fall inside one minute? What counts as a minute: from 0-60 s, or, say, from 1:30 to 2:30? – jojo_Berlin Sep 06 '17 at 07:36
  • Yes, all documents that are inside one minute. A minute is from 1:30 to 2:30 – Razvan Sep 06 '17 at 07:44
  • the solution is here https://stackoverflow.com/questions/41711716/how-to-aggregate-over-rolling-time-window-with-groups-in-spark ; basically: define a time window, load the data into a DataFrame, order it by timestamp, and apply the window (a sketch along these lines follows below this list) – jojo_Berlin Sep 06 '17 at 08:19
  • You can also do all of this with the Mongo aggregation framework; there is no need for Spark to compute an average value of 60 documents (see the second sketch below).... – Bameza Sep 06 '17 at 08:30
  • Yeah, I agree with @Bameza's answer – jojo_Berlin Sep 06 '17 at 08:40
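
Following jojo_Berlin's suggestion, here is a minimal sketch of the DataFrame approach in Java. It assumes the MongoDB Spark connector 2.x (hence the "com.mongodb.spark.sql.DefaultSource" source name and the spark.mongodb.input.uri property); the URI, database, and collection names are placeholders:

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MinuteWindowAverage {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("MinuteWindowAverage")
                // Placeholder URI: point this at your own database/collection.
                .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.measurements")
                .getOrCreate();

        // Load the collection as a DataFrame through the MongoDB Spark connector.
        Dataset<Row> docs = spark.read()
                .format("com.mongodb.spark.sql.DefaultSource")
                .load();

        // Group the rows into 1-minute tumbling windows keyed on "timestamp"
        // and average "value" within each window; Spark does the bucketing,
        // so no pairwise timestamp comparison is needed.
        Dataset<Row> averaged = docs
                .groupBy(window(col("timestamp"), "1 minute"))
                .agg(avg(col("value")).alias("average_value"))
                .select(
                        col("window.start").alias("start_timestamp"),
                        col("window.end").alias("end_timestamp"),
                        col("average_value"));

        averaged.show(false);
        spark.stop();
    }
}

Note that window() also accepts a slide duration and a start-time offset as optional third and fourth arguments, so minute buckets that are offset from the full minute (the 1:30-to-2:30 case from the comments) can be expressed as window(col("timestamp"), "1 minute", "1 minute", "30 seconds").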
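
And here is a sketch of the pure-MongoDB route Bameza mentions, using the aggregation framework through the Java driver (3.x API). Host, database, and collection names are placeholders again, and truncating the timestamp to the minute with $dateToString is just one possible choice of group key:

import static com.mongodb.client.model.Accumulators.avg;
import static com.mongodb.client.model.Accumulators.max;
import static com.mongodb.client.model.Accumulators.min;
import static com.mongodb.client.model.Aggregates.group;

import java.util.Arrays;

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MinuteBucketAverage {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost");
        MongoCollection<Document> coll =
                client.getDatabase("mydb").getCollection("measurements");

        // Group key: the timestamp truncated to the minute, rendered as a
        // string by $dateToString. Every document in the same minute gets
        // the same key.
        Document minuteKey = new Document("$dateToString",
                new Document("format", "%Y-%m-%dT%H:%M")
                        .append("date", "$timestamp"));

        // One output document per minute: the earliest and latest timestamps
        // actually seen in the bucket, plus the average value.
        for (Document d : coll.aggregate(Arrays.asList(
                group(minuteKey,
                        min("start_timestamp", "$timestamp"),
                        max("end_timestamp", "$timestamp"),
                        avg("average_value", "$value"))))) {
            System.out.println(d.toJson());
        }

        client.close();
    }
}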

0 Answers