
I have an Elasticsearch index that stores documents for user state history. The data looks like this:

  {
    "session_id": "yunus",
    "state_name": "start",
    "entry_time": "2016-11-09 15:27:03"
  },
  {
    "session_id": "yunus",
    "state_name": "end",
    "entry_time": "2016-11-09 16:30:00"
  },
  {
    "session_id": "can",
    "state_name": "start",
    "entry_time": "2016-11-09 12:01:00"
  },
  {
    "session_id": "rick",
    "state_name": "start",
    "entry_time": "2016-11-09 09:00:00"
  },
  {
    "session_id": "rick",
    "state_name": "end",
    "entry_time": "2016-11-10 10:00:00"
  }

I want to aggregate by state name in a date histogram, but counting only the most recent state of each session as of each bucket. For the data above, the result would be:

2016-11-08 
start = 0
end = 0

2016-11-09 
start = 2
end = 1

2016-11-10 
start = 1
end = 2

The plan is to draw a grouped bar chart over a timeline that shows how the states change over time.

I tried several approaches, such as pipeline aggregations and top hits, but couldn't make any progress.

Any help is appreciated.

Fatih Donmez

1 Answer


For anyone interested, I solved it with Spark. I used the elasticsearch-spark connector to read from Elasticsearch and then write the results back to Elasticsearch.

Here is the read from Elasticsearch as an RDD:

import org.elasticsearch.spark._ // adds esRDD and saveToEs to SparkContext and RDD
val allData = sc.esRDD(s"states_${id}/log", query)

Then I group by session id and, within each group, keep the document with the greatest timestamp, which gives the latest state of each session:

val latestStates = allData
  .groupBy { case (_, doc) => doc("session_id") }
  .map { case (_, docs) =>
    // Keep whichever (id, document) pair carries the larger timestamp.
    docs.reduceLeft { (d1, d2) =>
      if (d1._2("timestamp").asInstanceOf[Long] > d2._2("timestamp").asInstanceOf[Long]) d1 else d2
    }
  }
  .map(_._2)
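
The "latest document per session" step can be tried without a cluster. The following is a minimal sketch on plain Scala collections, assuming the sample documents from the question; the `Event` case class, the `LatestState` object and the `latestPerSession` helper are illustrative names, not part of the connector. On a real RDD, keying by session id and using `reduceByKey` would likely be a cheaper way to express the same reduction than `groupBy`.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object LatestState {
  // Hypothetical record type mirroring the sample documents in the question.
  final case class Event(sessionId: String, stateName: String, entryTime: LocalDateTime)

  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  val sample: List[Event] = List(
    ("yunus", "start", "2016-11-09 15:27:03"),
    ("yunus", "end",   "2016-11-09 16:30:00"),
    ("can",   "start", "2016-11-09 12:01:00"),
    ("rick",  "start", "2016-11-09 09:00:00"),
    ("rick",  "end",   "2016-11-10 10:00:00")
  ).map { case (s, n, t) => Event(s, n, LocalDateTime.parse(t, fmt)) }

  // For every session, keep only the event with the greatest entry time.
  def latestPerSession(events: Seq[Event]): Map[String, Event] =
    events.groupBy(_.sessionId).map { case (id, es) =>
      id -> es.reduce((a, b) => if (a.entryTime.isAfter(b.entryTime)) a else b)
    }
}
```

Calling `LatestState.latestPerSession(LatestState.sample)` yields one event per session, for example the `end` event for `yunus`.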

Once I have the latest state of each session, I filter out the exit states and then count by value:

val stateSummary = latestStates
  .filter(s => s.isDefinedAt("state_id") && s("state_id").asInstanceOf[Long] != -1)
  .map(s => (s("state_id"), s("state_name")))
  .countByValue()
  .map { case ((stateId, stateName), count) =>
    Map(
      "state_id" -> stateId.asInstanceOf[Long],
      "state_name" -> stateName.asInstanceOf[String],
      "count" -> count
    )
  }
  .toList

Now we have the current number of sessions in each state. ("Current" is configurable, so the count can be computed as of a specific time.) The only thing left is to write the result back to Elasticsearch:

sc.makeRDD(Seq(finalElasticDoc)).saveToEs(s"states_${id}/analytic_daily")
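
Because "current" can be set to any point in time, repeating the count once per day gives the histogram asked for in the question. Below is a rough sketch of that idea on plain Scala collections rather than Spark; `DailySnapshot`, `Event`, `countsAt` and `histogram` are hypothetical names chosen for illustration, and the cutoff is assumed to be the end of each calendar day.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter

object DailySnapshot {
  // Hypothetical record type; field names follow the question's sample documents.
  final case class Event(sessionId: String, stateName: String, entryTime: LocalDateTime)

  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  val sample: List[Event] = List(
    ("yunus", "start", "2016-11-09 15:27:03"),
    ("yunus", "end",   "2016-11-09 16:30:00"),
    ("can",   "start", "2016-11-09 12:01:00"),
    ("rick",  "start", "2016-11-09 09:00:00"),
    ("rick",  "end",   "2016-11-10 10:00:00")
  ).map { case (s, n, t) => Event(s, n, LocalDateTime.parse(t, fmt)) }

  // Count sessions per state, using each session's last event on or before `day`.
  def countsAt(events: Seq[Event], day: LocalDate): Map[String, Int] = {
    val visible = events.filter(e => !e.entryTime.toLocalDate.isAfter(day))
    val latest = visible.groupBy(_.sessionId).values.map { es =>
      es.reduce((a, b) => if (a.entryTime.isAfter(b.entryTime)) a else b)
    }
    latest.groupBy(_.stateName).map { case (state, es) => state -> es.size }
  }

  // One snapshot per day in the inclusive range [from, to].
  def histogram(events: Seq[Event], from: LocalDate, to: LocalDate): List[(LocalDate, Map[String, Int])] =
    Iterator.iterate(from)(_.plusDays(1))
      .takeWhile(d => !d.isAfter(to))
      .map(d => d -> countsAt(events, d))
      .toList
}
```

A state that no session is in on a given day is simply absent from that day's map, so a chart would treat a missing key as zero. In Spark the same per-day loop could be run over the RDD, at the cost of one pass per bucket.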
Fatih Donmez