I've been learning Hive over the past 2 months, but I'm having trouble figuring out how to do certain sequence based queries. Take this example:
- I have a huge log consisting of user actions
- Every user action has a date field but obviously may not be scanned in that order due to multiple log files from different machines
- Each log can log a variety of different events. For this example, I will denote these as letters: A, B, C, D...
The problem: How would I write a query which asks, "On average, how many times does event A happen before B occurs"?
I know how to group the users and only take users which have done A, and B and average the number of A which occurred, but limiting by the first occurrence of B seems like it would be difficult. I think I might actually be able to do this by stringing together 10 or so nasty looking queries, but I was wanting to know if there was an easier way to do it that I don't know of.
Thanks!