I have a list of events, each with a type, structured like the following example:

{
 createdAt: 123123132,
 type: STARTED,
 metadata: {
     emailAddress: "foo@bar.com"
 }
}

The set of types is predefined (START, STOP, REMOVE...). Each user produces one or more events over time.

I want to get the following aggregation:

For each user, calculate the number of events for each type.

My AQL query looks like this:

FOR event IN events
  COLLECT
    email = event.metadata.emailAddress,
    type = event.type WITH COUNT INTO count
  LIMIT 10
  RETURN {
      email,
      t: {type, count}
  }

This produces the following output:

{ email: '_84@example.com', t: { type: 'CREATE', count: 203 } }
{ email: '_84@example.com', t: { type: 'DEPLOY', count: 214 } }
{ email: '_84@example.com', t: { type: 'REMOVE', count: 172 } }
{ email: '_84@example.com', t: { type: 'START', count: 204 } }
{ email: '_84@example.com', t: { type: 'STOP', count: 187 } }
{ email: '_95@example.com', t: { type: 'CREATE', count: 189 } }
{ email: '_95@example.com', t: { type: 'DEPLOY', count: 173 } }
{ email: '_95@example.com', t: { type: 'REMOVE', count: 194 } }
{ email: '_95@example.com', t: { type: 'START', count: 213 } }
{ email: '_95@example.com', t: { type: 'STOP', count: 208 } }
...

i.e. I get one row per user and type. But I want results like this:

{ email: foo@bar.com, count1: 203, count2: 214, count3: 172 ...}
{ email: aaa@fff.com, count1: 189, count2: 173, count3: 194 ...}
...

OR

 { email: foo@bar.com, CREATE: 203, DEPLOY: 214, ... }
 ...

i.e. the results grouped again, with one row per user.

I also need to sort the results (not the events) by one of the counts, e.g. to return the top 10 users with the highest number of CREATE events.

How can I do that?

ONE SOLUTION

Here is one solution; see the accepted answer for more.

FOR a in (FOR event IN events
  COLLECT
    emailAddress = event.metadata.emailAddress,
    type = event.type WITH COUNT INTO count
  COLLECT email = emailAddress INTO perUser KEEP type, count
  RETURN MERGE(PUSH(perUser[* RETURN {[LOWER(CURRENT.type)]: CURRENT.count}], {email})))
SORT a.create desc
LIMIT 10
RETURN a
igr
  • Are you sure about the desired result format? It doesn't carry the type information anymore, and I don't see how "count" + number as key is doing anything good. Shouldn't it be more like `{ email: foo@bar.com, CREATE: 203, DEPLOY: 214, ... }`? Not sure about your last sentence regarding sort by count. At what stage do you want to sort? How is the output supposed to look like? – CodeManX Apr 18 '18 at 19:28
  • @CoDEmanX updated. It can be both ways, since by column naming I will know the type. But I agree, your example is better. Sorting is done after the aggregation and need to sort by the count of certain type. – igr Apr 18 '18 at 20:07

1 Answer

You could group by user and event type, then group again by user keeping only the type and already calculated event type counts. In the second aggregation, it is important to know into which groups the events fall to construct the result. An array inline projection can be used for that to keep the query short:

FOR event IN events
  COLLECT
    emailAddress = event.metadata.emailAddress,
    type = event.type WITH COUNT INTO count
  COLLECT email = emailAddress INTO perUser KEEP type, count
    RETURN MERGE(PUSH(perUser[* RETURN {[CURRENT.type]: CURRENT.count}], {email}))
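
To make the two-stage grouping easier to follow, here is a minimal sketch in plain Python that models what the query above does. It is not AQL and does not talk to ArangoDB; the sample events and the function name `counts_per_user` are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample events, shaped like the documents in the question.
events = [
    {"type": "CREATE", "metadata": {"emailAddress": "a@example.com"}},
    {"type": "CREATE", "metadata": {"emailAddress": "a@example.com"}},
    {"type": "STOP", "metadata": {"emailAddress": "a@example.com"}},
    {"type": "CREATE", "metadata": {"emailAddress": "b@example.com"}},
]


def counts_per_user(events):
    # Stage 1: count events per (email, type) pair,
    # like COLLECT emailAddress, type WITH COUNT INTO count.
    pair_counts = Counter(
        (e["metadata"]["emailAddress"], e["type"]) for e in events
    )
    # Stage 2: regroup by email and merge the per-type counts into one
    # object per user, like the second COLLECT plus MERGE(PUSH(...)).
    per_user = defaultdict(dict)
    for (email, event_type), count in pair_counts.items():
        per_user[email][event_type] = count
    return [{**types, "email": email} for email, types in per_user.items()]


result = counts_per_user(events)
```

Each element of `result` is one object per user whose keys are the event types plus `email`, which mirrors the shape the AQL query returns.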

Another way would be to group by user and keep event types, then group the types in a subquery. But it is significantly slower in my test (without any indexes defined at least):

FOR event IN events
  LET type = event.type
  COLLECT
    email = event.metadata.emailAddress INTO groups KEEP type
  LET byType = (
    FOR t IN groups[*].type
      COLLECT t2 = t WITH COUNT INTO count
      RETURN {[t2]: count}
  )
  RETURN MERGE(PUSH(byType, {email}))

Returning the top 10 users with the most CREATE events is much simpler. Filter for CREATE event type, then group by user and count the number of events, sort by this number in descending order and return the first 10 results:

FOR event IN events
    FILTER event.type == "CREATE"
    COLLECT email = event.metadata.emailAddress WITH COUNT INTO count
    SORT count DESC
    LIMIT 10
    RETURN {email, count}
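
The filter, count, sort and limit steps can also be sketched in plain Python. This only models the logic of the query above, it is not AQL; the sample data and the name `top_users_for_type` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical sample events, shaped like the documents in the question.
events = [
    {"type": "CREATE", "metadata": {"emailAddress": "a@example.com"}},
    {"type": "CREATE", "metadata": {"emailAddress": "b@example.com"}},
    {"type": "CREATE", "metadata": {"emailAddress": "b@example.com"}},
    {"type": "STOP", "metadata": {"emailAddress": "a@example.com"}},
]


def top_users_for_type(events, event_type, limit=10):
    # FILTER by type, then COLLECT email WITH COUNT INTO count.
    counts = Counter(
        e["metadata"]["emailAddress"]
        for e in events
        if e["type"] == event_type
    )
    # SORT count DESC, LIMIT limit.
    return [
        {"email": email, "count": count}
        for email, count in counts.most_common(limit)
    ]


top_creators = top_users_for_type(events, "CREATE")
```

Because the filter runs first, users without any event of the requested type do not appear in the result at all.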

EDIT1: Return one document per user with event types grouped and counted (like in the first query), but capture the MERGE result in a variable, sort by the count of one particular event type (here: CREATE) and return the top 10 users for this type. The result is the same as with the solution given in the question, apart from the attribute keys keeping their original case. It avoids the wrapping subquery (FOR a IN (FOR event IN events ...) ... RETURN a), however:

FOR event IN events
  COLLECT
    emailAddress = event.metadata.emailAddress,
    type = event.type WITH COUNT INTO count
  COLLECT email = emailAddress INTO perUser KEEP type, count
  LET ret = MERGE(PUSH(perUser[* RETURN {[CURRENT.type]: CURRENT.count}], {email}))
  SORT ret.CREATE DESC
  LIMIT 10
  RETURN ret
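
As a rough model of the final sort and limit, the following plain-Python sketch orders already merged per-user objects by one event type. It is not AQL; the rows and the name `top_by_type` are hypothetical. In AQL a missing attribute evaluates to null, which sorts before any number, so users without CREATE events end up last under DESC. The sketch imitates that by treating a missing count as 0, which is safe here because real counts are at least 1.

```python
# Hypothetical merged rows, shaped like the output of the first query.
rows = [
    {"email": "a@example.com", "CREATE": 3, "STOP": 7},
    {"email": "b@example.com", "STOP": 9},
    {"email": "c@example.com", "CREATE": 5},
]


def top_by_type(rows, event_type, limit=10):
    # SORT ret.<event_type> DESC, then LIMIT limit.
    # A missing key is treated as 0 so that it sorts last.
    ordered = sorted(rows, key=lambda r: r.get(event_type, 0), reverse=True)
    return ordered[:limit]


top = top_by_type(rows, "CREATE")
```

All other counts stay in each object, so the output still shows every type while being ordered by just one of them.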

EDIT2: Query to generate example data (requires a collection events to exist):

FOR i IN 1..100
    LET email = CONCAT(RANDOM_TOKEN(RAND()*4+4), "@example.com")
    FOR j IN SPLIT("CREATE,DEPLOY,REMOVE,START,STOP", ",")
        FOR k IN 1..RAND()*150+50
            INSERT {metadata: {emailAddress: email}, type: j} INTO events RETURN NEW
CodeManX
  • The first query does the trick! Two things I need differently: How do I sort the results by the new columns (created by the merge)? I have to sort these results, not perform the sort in a separate query. The second thing is limiting the sorted results, but I guess LIMIT will do the trick once I have the sort. – igr Apr 19 '18 at 06:52
  • I figured it out using a subquery, but had to lowercase the attribute keys – igr Apr 19 '18 at 07:09
  • There are no columns. The returned result is an array of objects, and the attribute keys email, CREATE etc. are top level members of these objects. There is no notion of order in objects in JSON and ArangoDB. If you need to maintain an order, you have to use arrays, which would mean something like `{ email: …, events: [ { type: CREATE, count: 123 }, { type: REMOVE, count: 456 } ] }`. The last query shown returns the top 10 users as requested. You can remove the LIMIT to return all. Or was it meant as top 10 events per user? – CodeManX Apr 19 '18 at 07:18
  • Please check my post, I've added a solution based on yours. Also, please forgive my wrong usage of terms. Now, I need to sort the whole result set from your first query by some count. So, I need to show all other counts as well, but sort by just a single one of them. I did that by wrapping your AQL with another FOR-IN (and adding LOWER, as I didn't know how to address a capitalized key otherwise) – igr Apr 19 '18 at 07:21
  • I see, and nice that you were able to work out a solution for that! I added a variant without a subquery to my answer. I also removed the conversion to lower case. You can address the key `CREATE` like `a.CREATE`. AQL is case-sensitive for the most part. It is only the operations and functions which are case-insensitive. Case does matter in variable names, attribute keys and so on. – CodeManX Apr 19 '18 at 07:50