Relevant questions
This question is quite relevant, but is 2 years old: In memory OLAP engine in Java
Background
I would like to create a pivot-table like matrix from a given tabular dataset, in memory
e.g. an age by marital status count (rows are age, columns are marital status).
The input: List of People, with age and some Boolean property (e.g. married),
The desired output: count of People, by age (row) and isMarried (column)
What I've tried (Scala)
case class Person(val age:Int, val isMarried:Boolean)
...
val people:List[Person] = ... //
val peopleByAge = people.groupBy(_.age) //only by age
val peopleByMaritalStatus = people.groupBy(_.isMarried) //only by marital status
I managed to do it the naive way, first grouping by age, then map
which is doing a count
by marital status, and outputs the result, then I foldRight
to aggregate
TreeMap(peopleByAge.toSeq: _*).map(x => {
val age = x._1
val rows = x._2
val numMarried = rows.count(_.isMarried())
val numNotMarried = rows.length - numMarried
(age, numMarried, numNotMarried)
}).foldRight(List[FinalResult]())(row,list) => {
val cumMarried = row._2+
(if (list.isEmpty) 0 else list.last.cumMarried)
val cumNotMarried = row._3 +
(if (list.isEmpty) 0 else l.last.cumNotMarried)
list :+ new FinalResult(row._1, row._2, row._3, cumMarried,cumNotMarried)
}.reverse
I don't like the above code, it's not efficient, hard to read, and I'm sure there is a better way.
The question(s)
How do I groupBy "both"? and how do I do a count for each subgroup, e.g.
How many people are exactly 30 years old and married?
Another question, is how do I do a running total, to answer the question:
How many people above 30 are married?
Edit:
Thank you for all the great answers.
just to clarify, I would like the output to include a "table" with the following columns
- Age (ascending)
- Num Married
- Num Not Married
- Running Total Married
- Running Total Not Married
Not only answering those specific queries, but to produce a report that will allow answering all such type of questions.