7

NOTE: I am using a graph database (OrientDB to be specific). This gives me the freedom to write a server-side function in javascript or groovy rather than limit myself to SQL for this issue.*

NOTE 2: Since this is a graph database, the arrows below are simply describing the flow of data. I do not literally need the arrows to be returned in the query. The arrows represent relationships.*

I have data that is represented in a time-flow manner; i.e. EventC occurs after EventB which occurs after EventA, etc. This data is coming from multiple sources, so it is not completely linear. It needs to be congregated together, which is where I'm having the issue.

Currently the data looks something like this:

#     |  event   |  next
--------------------------
12:0  |  EventA  |  12:1
12:1  |  EventB  |  12:2
12:2  |  EventC  |  
12:3  |  EventA  |  12:4
12:4  |  EventD  |  

Where "next" is the out() edge to the event that comes next in the time-flow. On a graph this comes out to look like:

EventA-->EventB-->EventC
EventA-->EventD

Since this data needs to be congregated together, I need to merge duplicate events but preserve their edges. In other words, I need a select query that will result in:

        -->EventB-->EventC
EventA--|
        -->EventD

In this example, since EventB and EventD both occurred after EventA (just at different times), the select query will show two branches off EventA as opposed to two separate time-flows.


EDIT #2

If an additional set of data were to be added to the data above, with EventB->EventE, the resulting data/graph would look like:

#     |  event   |  next
--------------------------
12:0  |  EventA  |  12:1
12:1  |  EventB  |  12:2
12:2  |  EventC  |  
12:3  |  EventA  |  12:4
12:4  |  EventD  |  
12:5  |  EventB  |  12:6
12:6  |  EventE  |  

EventA-->EventB-->EventC
EventA-->EventD
EventB-->EventE

I need a query to produce a tree like:

                    -->EventC
        -->EventB--|
        |           -->EventE
EventA--|
        -->EventD

EDIT #3 and #4

Here is the data with edges shown as opposed to the "next" column above. I also added a couple additional columns here to hopefully clear up any confusion about the data:

#     |  event   |    ip_address    |       timestamp     |   in   |  out  |
----------------------------------------------------------------------------
12:0  |  EventA  |  123.156.189.18  | 2015-04-17 12:48:01 |        |  13:0 |
12:1  |  EventB  |  123.156.189.18  | 2015-04-17 12:48:32 |  13:0  |  13:1 |
12:2  |  EventC  |  123.156.189.18  | 2015-04-17 12:48:49 |  13:1  |       |
12:3  |  EventA  |  103.145.187.22  | 2015-04-17 14:03:08 |        |  13:2 |
12:4  |  EventD  |  103.145.187.22  | 2015-04-17 14:05:23 |  13:2  |       |
12:5  |  EventB  |  96.109.199.184  | 2015-04-17 21:53:00 |        |  13:3 |
12:6  |  EventE  |  96.109.199.184  | 2015-04-17 21:53:07 |  13:3  |       |

The data is saved like this to preserve each individual event and the flow of a session (labeled by the ip address).

TL;DR

Got lots of events, some duplicates, and need them all organized into one neat time-flow graph.

Hossein Narimani Rad
  • 31,361
  • 18
  • 86
  • 116
Devin Young
  • 841
  • 7
  • 21
  • Do you need to start from this kind of data or do you have the option of reorganizing the data (i.e. when events arrive write them differently to the DB)? In the second case, you may not create a different node for the second EventA but keep a list/set for next events and put EventD into that collection. Also, does your output graph is smt like a single tree or a set of trees. What happens when another EventB comes which has a next EventE? Shall it be merged into 12:1 EventB (like Event A is merged) or become a seperate instance (i.e a new tree with root EventB 12:5)? – pembeci Apr 14 '15 at 01:05
  • @pembeci in short, yes the data needs to be organized this way to track ALL occurrences of the event. I *could* create a separate class that would be solely in charge of tracking the tree, but I would like to keep all of this in one class to avoid data getting out of sync. To answer your second question, I edited the question to show what would happen in that scenario. – Devin Young Apr 14 '15 at 02:56
  • Hmm. At least having next not as a field/column but as an edge to another event would be a better design. If that was the case you should be able utilize the [traverse](http://orientdb.com/docs/last/SQL-Traverse.html) command for better performance. I'll try to solve this by writing a server side Javascript command for reducing roundtrips but I'll need some time. – pembeci Apr 14 '15 at 13:08
  • @pembeci "next" is an edge. That is noted above. The in() and out() edges are used to order the graph. – Devin Young Apr 14 '15 at 13:31
  • @pembeci I edited again to explicitly show the edges. Hopefully this clears it up a bit. – Devin Young Apr 14 '15 at 13:40
  • OK, thanks. Is my other assumption also correct? There is only one tree with root Event A. So each new event is either Event A, or some Event X that has some Event A as its parent or grandparent. – pembeci Apr 15 '15 at 10:09
  • @pembeci Yes, unless Event X has an edge that puts it before Event A--in which case Event X would move to the front of the tree. Think of it like people collaborating on a family tree: everyone inputs what they know, and the query puts it all together to make sense of it. – Devin Young Apr 15 '15 at 12:02
  • Based on what condition `in` and `out` columns have to be generated? – Maciej Los Apr 15 '15 at 16:25
  • @MaciejLos "in" and "out" aren't columns, they're edges (OrientDB). It's essentially a foreign key from out->in where out=in – Devin Young Apr 15 '15 at 17:35
  • I just want to confirm. It even has to display that way. Using just one sql query? – Luigi Apr 17 '15 at 10:17
  • why does the row `12:5 | EventB | | 13:3` not have a 13:0 in? Am I missing something? Or shouldnt you have 8 rows for your last example? This would make it easier. – Koryu Apr 17 '15 at 10:32
  • @LuigiMackenzieC.Brito Yes it has to output like the tree shown, but I know this will likely be a server-side function rather than one sql query. – Devin Young Apr 17 '15 at 18:07
  • @Koryu That row does not have a 13:0 in because it is a completely separate occurrence of EventB. I edited the table at the end to hopefully clear up the structure. There needs to be a trail of individual events, but the ability to query to determine all possible paths--sort of like a flowchart. – Devin Young Apr 17 '15 at 18:07
  • @DevinYoung, I really don't like to say that it's impossible. But in this case, I would really like to say that it's impossible on a single SQL Query. If anyone comes up with an answer I will kneel down and bow. – Luigi Apr 17 '15 at 19:35
  • How do you want to handle recursion? if EventE gets followed by EventA. just record the one EventA, or try to do something else with it? – Alan Hoover Apr 17 '15 at 22:03
  • @AlanHoover Great question. I would like to keep the one EventA since the output is reflecting all events that could occur before and after EventX. I *think* OrientDB will handle this fine in the graph, but I haven't gotten far enough to find out... – Devin Young Apr 17 '15 at 22:56
  • Maybe I should have posted this on Programming Challenges... :) – Devin Young Apr 17 '15 at 22:56
  • I could be misunderstanding the issue. It looks to me like you are wanting to map how people traverse a website (pageA-> PageB->PageD). and want to accumulate all visitors to one map of all paths people travel. My question involves when EventA-> EventB -> EventE->EventA. My thoughts are to show that second EventA as a child leaf (no descendents) but marked somehow to reflect that it refers to another node already on the graph. – Alan Hoover Apr 17 '15 at 23:04
  • @AlanHoover I think that's reasonable if it winds up being easier and more performent. I guess the main priority is that the second EventA does not contain any descendants like you said, otherwise you would have a large recursive mess :) – Devin Young Apr 17 '15 at 23:23
  • I can't do anything for you as far as drawing a picture, but I have some ideas floating around on organizing data. Let me fool around over the weekend. – Alan Hoover Apr 17 '15 at 23:40
  • The rub here, if I understand correctly what you're trying to do, is when EventA points to EventC and when EventB points to EventD. These cause a many to many relationship that is not really representable in the format you are asking for. – Alan Hoover Apr 20 '15 at 14:40

1 Answers1

0

Holy cow.

After wrestling with this for over a week I think I FINALLY have a working function. This isn't optimized for performance (oh the loops!), but gets the job done for the time being while I can work on performance. The resulting OrientDB server-side function (written in javascript):

The function:

// Clear previous runs
db.command("truncate class tmp_Then");
db.command("truncate class tmp_Events");

// Get all distinct events
var distinctEvents = db.query("select from Events group by event");

// Send 404 if null, otherwise proceed
if (distinctEvents == null) {
  response.send(404, "Events not found", "text/plain", "Error: events not found" );
} else {
  var edges = [];

  // Loop through all distinct events
  distinctEvents.forEach(function(distinctEvent) {
    var newEvent = [];
    var rid = distinctEvent.field("@rid");
    var eventType = distinctEvent.field("event");

    // The main query that finds all *direct* descendents of the distinct event
    var result = db.query("select from (traverse * from (select from Events where event = ?) where $depth <= 2) where @class = 'Events' and $depth > 1 and @rid in (select from Events group by event)", [eventType]);

    // Save the distinct event in a temp table to create temp edges
    db.command("create vertex tmp_Events set rid = ?, event = ?", [rid, event]);
      edges.push(result);
    });

  // The edges array defines which edges should exist for a given event
  edges.forEach(function(edge, index) {
    edge.forEach(function(e) {
      // Create the temp edge that corresponds to its distinct event
      db.command("create edge tmp_Then from (select from tmp_Events where rid = " + distinctEvents[index].field("@rid") + ") to (select from tmp_Events where rid = " + e.field("@rid") + ")");
    });
  });

  var result = db.query("select from tmp_Events");
  return result;
}

Takeaways:

  • Temp tables appeared to be necessary. I tried to do this without temp tables (classes), but I'm not sure it could be done. I needed to mock edges that didn't exist in the raw data.
  • Traverse was very helpful in writing the main query. Traversing through an event to find its direct, unique descendents was fairly simple.
  • Having the ability to write stored procs in Javascript is freaking awesome. This would have been a nightmare in SQL.
  • omfg loops. I plan to optimize this and continue to make it better so hopefully other people can find some use for it.
Devin Young
  • 841
  • 7
  • 21
  • I think, this is the way to go, creating a server side JS. But since you have all the power of JS you don't need all of those tmp_ nodes, edges. You can directly create your tree as an JS object and return it. Still looking for a time slot to toy with this challange. – pembeci Apr 21 '15 at 09:42