0

I've been learning Hive over the past 2 months, but I'm having trouble figuring out how to do certain sequence based queries. Take this example:

  • I have a huge log consisting of user actions
  • Every user action has a date field but obviously may not be scanned in that order due to multiple log files from different machines
  • Each log can log a variety of different events. For this example, I will denote these as letters: A, B, C, D...

The problem: How would I write a query which asks, "On average, how many times does event A happen before B occurs"?

I know how to group the users and only take users which have done A, and B and average the number of A which occurred, but limiting by the first occurrence of B seems like it would be difficult. I think I might actually be able to do this by stringing together 10 or so nasty looking queries, but I was wanting to know if there was an easier way to do it that I don't know of.

Thanks!

David
  • 1,648
  • 1
  • 16
  • 31

1 Answers1

2

This is not something easy to map to SQL, especially in Hive, taking into account that Hive do not support joins different from equi-joins.
In the same time we can leave a lot of heavy lifting to Hive, while doing the small part by ourselves.
Idea I would suggest is: let hive to group together all events for the specific user. We can create our own agregation UDF which will calculate min date for A and B and give output 0,1,2,3 - if only A happens, Only B Happens, A happens before B , B happens before A.
From there the rest of calculations is again trivial to do in Hive.

David Gruzman
  • 7,900
  • 1
  • 28
  • 30
  • Thanks! I hadn't explored UDF, but it looks like an aggregate function would be able to this. – David Jan 26 '13 at 15:48