Structuring time-series data in ArangoDB

Question

I have some time-series data (roughly 1-5 points per day) that I need to access quickly from a web app using ArangoDB. The data is associated with a particular profile, but a single collection holds the data for all profiles. Between the profile node and the data node, there is an event node and a report node. A report is simply a group of data points from a given event. The existing graph structure looks like this:

profile =====> event1 ========> reportA =======> data1
     \             \                   \=======> data2
      \             \
       \             \========> reportB =======> data3
        \                              \=======> data4
         \
          \==> event2 ========> reportA =======> data1    
                   \                   \=======> data2
                    \
                     \========> reportB =======> data3
                                       \=======> data4

The chart I want would present data1 sequentially, one point per associated event, sorted by an attribute of the event. In tabular form, the result set I am after looks like this:

event      dataAttr     value
-------------------------------
event1     data1        42
event2     data1        6
event3     data1        7
event4     data1        343

I am likely to run this query for every dataAttr in a given report, to effectively create a time-series result set for each dataAttr on a particular profile for the last 10-20 events.

When I investigated this problem in Neo4j, the recommendation was to connect sequential events directly to each other. I'm wondering whether this is also a better approach in ArangoDB.

This would mean creating an additional graph that looks something like this:

data1 (of event1) => data1 (of event2) => data1 (of event3) => data1 (of event4)
data2 (of event1) => data2 (of event2) => data2 (of event3) => data2 (of event4)

Etc.

Each dataAttr is connected to its cousin in the previous event. After traversing to the most recent event in the first graph, the second graph would be used to traverse n layers back to past events (in practice 10-20).

Is this the best way to structure the data for a query like this? Performance will be critical, as I will potentially be loading 20 charts on a page, each fed by this query.

Would this query be faster if it simply ran against a document collection with indices rather than via graph traversal? The document collection could have a hash index on dataAttr and a skiplist index on the event (events will be sequentially ordered under string sorting).

I'm assuming that traversing down to data1 of event1, back up to the profile, back down to data1 of event2, and so on would be very inefficient.

score 3 · Accepted Answer · answered Nov 15 '16 at 13:42

If performance is critical, then handling as much of the query as possible with indexes is paramount. Traversals are superior if you have an unknown path length, which is not your use case.

I would recommend denormalizing the data stored in the data node. You want to return all data nodes belonging to a profile and a given dataAttr, sorted by a timestamp attribute timeStamp, right? In this case I would at least add the profile identifier to the data node and use a skiplist index on profileId, dataAttr and timeStamp.
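As a rough illustration of why such an index fits this query, here is a minimal Python sketch that stands in for a sorted index on (profileId, dataAttr, timeStamp) using a sorted list and binary search. It is not ArangoDB code. The sample documents, the `latest_points` helper, and the collection name `data` in the AQL string are all hypothetical, and the sketch assumes the event's timestamp has been copied onto each data document as a string that sorts chronologically.

```python
from bisect import bisect_left

# Hypothetical denormalized data documents: each one carries profileId and the
# event's timeStamp, so no traversal is needed at query time.
docs = [
    {"profileId": "p1", "dataAttr": "data1", "timeStamp": "2016-11-01", "value": 42},
    {"profileId": "p1", "dataAttr": "data1", "timeStamp": "2016-11-02", "value": 6},
    {"profileId": "p1", "dataAttr": "data2", "timeStamp": "2016-11-01", "value": 9},
    {"profileId": "p1", "dataAttr": "data1", "timeStamp": "2016-11-03", "value": 7},
    {"profileId": "p2", "dataAttr": "data1", "timeStamp": "2016-11-01", "value": 1},
    {"profileId": "p1", "dataAttr": "data1", "timeStamp": "2016-11-04", "value": 343},
]

# Stand-in for a sorted index on (profileId, dataAttr, timeStamp):
# a list of (key, document position) pairs kept in key order.
index = sorted(
    ((d["profileId"], d["dataAttr"], d["timeStamp"]), i) for i, d in enumerate(docs)
)
keys = [key for key, _ in index]


def latest_points(profile_id, data_attr, limit):
    """Return the last `limit` points for one profile/dataAttr, oldest first."""
    # All matching keys form one contiguous run in the sorted index.
    lo = bisect_left(keys, (profile_id, data_attr, ""))
    # data_attr + "\0" is the smallest string greater than data_attr,
    # so this finds the first key past the run.
    hi = bisect_left(keys, (profile_id, data_attr + "\0", ""))
    start = max(lo, hi - limit)
    return [docs[pos] for _, pos in index[start:hi]]


# A sketch of the corresponding AQL (assumed collection name `data`);
# note that this variant returns the newest points first.
AQL_SKETCH = """
FOR d IN data
  FILTER d.profileId == @profileId AND d.dataAttr == @dataAttr
  SORT d.timeStamp DESC
  LIMIT @limit
  RETURN d
"""

values = [d["value"] for d in latest_points("p1", "data1", 10)]
```

Because every document for one profile and dataAttr sits in a single contiguous, already sorted run, fetching the last 10-20 points is a range lookup rather than a walk through event and report nodes.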

fceller

So upon traversing to the data desired initially, the time-series query is purely a document search? And due to the indices on documents, that'll be faster than traversing down the adjacent data? – Nate Gardner Nov 16 '16 at 02:58
Yes, that would be the idea. Basically, the Neo4j approach points in the same direction: in that case it is a poor man's implementation of a sorted index using a linked list. – fceller Nov 18 '16 at 13:57
Cool! This is working well for us. Thank you! Another great benefit of ArangoDB's multi-model functionality! – Nate Gardner Nov 19 '16 at 01:02
