
I have a list of case class instances. The output requires aggregating on different fields of the case class, and I am looking for a more efficient way to do it.

Example:

case class Students(city: String, college: String, group: String,
                    name: String, fee: Int, age: Int)

object GroupByStudents {
  val studentsList= List(
    Students("Mumbai","College1","Science","Jony",100,30),
    Students("Mumbai","College1","Science","Tony", 200, 25),
    Students("Mumbai","College1","Social","Bony",250,30),
    Students("Mumbai","College2","Science","Gony", 240, 28),
    Students("Bangalore","College3","Science","Hony", 270, 28))
}

Now, to get the details of students from a city, I need to first aggregate by city, then break those totals down by college, then by group.

The output is a list of the case class in the format below.

Students(Mumbai,,,,790,0) -- aggregate city wise
Students(Mumbai,College1,,,550,0)  -- aggregate college wise
Students(Mumbai,College1,Social,,250,0)
Students(Mumbai,College1,Science,,300,0)
Students(Mumbai,College2,,,240,0)
Students(Mumbai,College2,Science,,240,0)
Students(Bangalore,,,,270,0)
Students(Bangalore,College3,,,270,0)
Students(Bangalore,College3,Science,,270,0)

Two methods to achieve this:

1) Loop over the whole list, create a map for each combination (3 combinations in the case above), aggregate the data, then create a new result list and append the data to it.

2) Use groupBy with foldLeft:

studentsList.groupBy(d=>(d.city))
  .mapValues(_.foldLeft(Students("","","","",0,0))
    ((r,c) => Students(c.city,"","","",r.fee+c.fee,0)))

studentsList.groupBy(d=>(d.city,d.college))
  .mapValues(_.foldLeft(Students("","","","",0,0))
    ((r,c) => Students(c.city,c.college,"","",r.fee+c.fee,0)))

studentsList.groupBy(d=>(d.city,d.college,d.group))
  .mapValues(_.foldLeft(Students("","","","",0,0))
    ((r,c) => Students(c.city,c.college,c.group,"",r.fee+c.fee,0)))

In both cases the list is traversed more than once. Is there a way to achieve this in a single pass, in an optimized way?

Łukasz
jony

2 Answers


With GroupBy

The code looks a little nicer, but I don't think it is faster. With groupBy you always have 2 "loops": one to build the groups and one to sum the fees.

studentsList.groupBy(d=>(d.city)).map { case (k,v) =>
    Students(v.head.city,"","","",v.map(_.fee).sum, 0)
}
studentsList.groupBy(d=>(d.city,d.college)).map { case (k,v) =>
    Students(v.head.city,v.head.college,"","",v.map(_.fee).sum, 0)
}    
studentsList.groupBy(d=>(d.city,d.college,d.group)).map { case (k,v) =>
    Students(v.head.city,v.head.college,v.head.group,"",v.map(_.fee).sum, 0)
}

You then get something like this:

List(Students(Bangalore,,,,270,0),
     Students(Mumbai,,,,790,0))
List(Students(Mumbai,College2,,,240,0),
     Students(Bangalore,College3,,,270,0),  
     Students(Mumbai,College1,,,550,0))
List(Students(Bangalore,College3,Science,,270,0), 
     Students(Mumbai,College2,Science,,240,0), 
     Students(Mumbai,College1,Social,,250,0), 
     Students(Mumbai,College1,Science,,300,0))

The order is not exactly the same as in your example, but the content is the desired output: a list of the Students case class.

With a for comprehension

You can avoid the second loop by doing the grouping yourself. Only the city example is shown; the others are straightforward.

var m = Map[String, Students]()
for (v <- studentsList) {
    // Add this student's fee to the running total for the city (0 if the city is new)
    m += v.city -> Students(v.city,"","","",v.fee + m.getOrElse(v.city, Students("","","","",0,0)).fee, 0)
}
m

Output

It is the same result as your groupBy version for cities, but the list is traversed only once per output map (here a Map[String,Students]).

Map(Mumbai -> Students(Mumbai,,,,790,0), Bangalore -> Students(Bangalore,,,,270,0))

With FoldLeft

Each foldLeft goes over the complete list in a single loop.

val emptyStudent = Students("","","","",0,0);
studentsList.foldLeft(Map[String, Students]()) { case (m, v) =>
    m + (v.city -> Students(v.city,"","","",
                            v.fee + m.getOrElse(v.city, emptyStudent).fee, 0))
}
studentsList.foldLeft(Map[(String,String), Students]()) { case (m, v) =>
    m + ((v.city,v.college) -> Students(v.city,v.college,"","",
                                        v.fee + m.getOrElse((v.city,v.college), emptyStudent).fee, 0))
}
studentsList.foldLeft(Map[(String,String,String), Students]()) { case (m, v) =>
    m + ((v.city,v.college,v.group) -> Students(v.city,v.college,v.group,"",
                                                v.fee + m.getOrElse((v.city,v.college,v.group), emptyStudent).fee, 0))
}

Output

The results match your groupBy versions, but the list is traversed only once for each output map.

Map(Mumbai -> Students(Mumbai,,,,790,0), 
    Bangalore -> Students(Bangalore,,,,270,0))
Map((Mumbai,College1) -> Students(Mumbai,College1,,,550,0), 
    (Mumbai,College2) -> Students(Mumbai,College2,,,240,0), 
    (Bangalore,College3) -> Students(Bangalore,College3,,,270,0))
Map((Mumbai,College1,Science) -> Students(Mumbai,College1,Science,,300,0), 
    (Mumbai,College1,Social) -> Students(Mumbai,College1,Social,,250,0), 
    (Mumbai,College2,Science) -> Students(Mumbai,College2,Science,,240,0), 
    (Bangalore,College3,Science) -> Students(Bangalore,College3,Science,,270,0))

With FoldLeft One Loop

You can generate just one big Map that holds all three aggregations.

val emptyStudent = Students("","","","",0,0);
studentsList.foldLeft(Map[(String,String,String), Students]()) { case (m, v) =>
  {
    var t = m + ((v.city,"","") -> Students(v.city,"","","",
      v.fee + m.getOrElse((v.city,"",""), emptyStudent).fee, 0))
    t = t + ((v.city,v.college,"") -> Students(v.city,v.college,"","",
      v.fee + m.getOrElse((v.city,v.college,""), emptyStudent).fee, 0))
    t + ((v.city,v.college,v.group) -> Students(v.city,v.college,v.group,"",
      v.fee + m.getOrElse((v.city,v.college,v.group), emptyStudent).fee, 0))
  }
}

Output

In this case you loop once and get back the results for all aggregations, but only in one Map. This would work with a for comprehension, too.

Map((Mumbai,College1,Science) -> Students(Mumbai,College1,Science,,300,0), 
    (Bangalore,,) -> Students(Bangalore,,,,270,0), 
    (Mumbai,College2,Science) -> Students(Mumbai,College2,Science,,240,0), 
    (Mumbai,College2,) -> Students(Mumbai,College2,,,240,0), 
    (Mumbai,College1,Social) -> Students(Mumbai,College1,Social,,250,0), 
    (Mumbai,,) -> Students(Mumbai,,,,790,0), 
    (Bangalore,College3,) -> Students(Bangalore,College3,,,270,0), 
    (Mumbai,College1,) -> Students(Mumbai,College1,,,550,0), 
    (Bangalore,College3,Science) -> Students(Bangalore,College3,Science,,270,0))
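
If you need a list in the "total first, then break-up" shape from the question, one option is to sort the big Map by its key. This is only a sketch: `FlattenAggregates` and `toReport` are names made up for illustration, and it relies on the empty string sorting before any non-empty string, so each total comes before its own break-up. Cities, colleges and groups come out in alphabetical order, which differs from the ordering shown in the question.

```scala
object FlattenAggregates {
  case class Students(city: String, college: String, group: String,
                      name: String, fee: Int, age: Int)

  // Hypothetical helper: orders the aggregated rows by their (city, college, group) key.
  // Tuples of Strings are compared field by field, and "" sorts first,
  // so (city,"","") precedes (city,college,"") which precedes (city,college,group).
  def toReport(m: Map[(String, String, String), Students]): List[Students] =
    m.toList
      .sortBy { case (key, _) => key }
      .map { case (_, total) => total }
}
```

This adds a sort over the aggregated keys, not another pass over the original student list.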

The Map is copied on every step, so this could have some performance and memory issues. To address this, use a for comprehension.

For Comprehension One Loop

This generates one Map with the 3 aggregate types.

val emptyStudent = Students("","","","",0,0);
var m = Map[(String,String,String), Students]()
for (v <- studentsList) {
  m +=  ((v.city,"","") -> Students(v.city,"","","", v.fee + m.getOrElse((v.city,"",""), emptyStudent).fee, 0))
  m += ((v.city,v.college,"") -> Students(v.city,v.college,"","", v.fee + m.getOrElse((v.city,v.college,""), emptyStudent).fee, 0))
  m += ((v.city,v.college,v.group) -> Students(v.city,v.college,v.group,"", v.fee + m.getOrElse((v.city,v.college,v.group), emptyStudent).fee, 0))
}
m

This should be better in terms of memory consumption cause you aren't copy the maps like in the foldLeft example
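
Strictly speaking, `m += ...` on a `var` that holds an immutable Map still creates a new Map on each step (with structural sharing, so the copy is cheap). If you want real in-place updates, one option is `scala.collection.mutable.Map`. A minimal sketch that tracks only the fee totals per key; `MutableAggregation` and `feeTotals` are names made up for illustration:

```scala
import scala.collection.mutable

object MutableAggregation {
  case class Students(city: String, college: String, group: String,
                      name: String, fee: Int, age: Int)

  // Hypothetical helper: one pass over the list, updating a mutable map in place.
  def feeTotals(students: List[Students]): Map[(String, String, String), Int] = {
    // Missing keys read as 0, so `+=` works for the first student of each key.
    val totals = mutable.Map.empty[(String, String, String), Int].withDefaultValue(0)
    for (s <- students) {
      totals((s.city, "", "")) += s.fee                 // city total
      totals((s.city, s.college, "")) += s.fee          // college total
      totals((s.city, s.college, s.group)) += s.fee     // group total
    }
    totals.toMap // immutable snapshot for the caller
  }
}
```

Whether this is measurably faster than the immutable version depends on the data size, so it is worth benchmarking before switching.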

Output

Map((Mumbai,College1,Science) -> Students(Mumbai,College1,Science,,300,0), 
(Bangalore,,) -> Students(Bangalore,,,,270,0), 
(Mumbai,College2,Science) -> Students(Mumbai,College2,Science,,240,0), 
(Mumbai,College2,) -> Students(Mumbai,College2,,,240,0), 
(Mumbai,College1,Social) -> Students(Mumbai,College1,Social,,250,0), 
(Mumbai,,) -> Students(Mumbai,,,,790,0), 
(Bangalore,College3,) -> Students(Bangalore,College3,,,270,0), 
(Mumbai,College1,) -> Students(Mumbai,College1,,,550,0), 
(Bangalore,College3,Science) -> Students(Bangalore,College3,Science,,270,0))

In all cases you could shorten the code by giving the parameters of your Students case class default values, because then you can write something like Students(city = v.city, fee = v.fee + m.getOrElse(v.city, emptyStudent).fee) during grouping.
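
A minimal sketch of that idea for the city grouping. It assumes a variant of the case class with default values (the one in the question has none); `DefaultsExample` and `byCity` are names made up for illustration:

```scala
object DefaultsExample {
  // Assumption: every field gets a default, so callers name only the fields they set.
  case class Students(city: String = "", college: String = "", group: String = "",
                      name: String = "", fee: Int = 0, age: Int = 0)

  val emptyStudent: Students = Students()

  // One foldLeft over the list, keeping a running fee total per city.
  def byCity(students: List[Students]): Map[String, Students] =
    students.foldLeft(Map.empty[String, Students]) { (m, v) =>
      m + (v.city -> Students(city = v.city,
                              fee = v.fee + m.getOrElse(v.city, emptyStudent).fee))
    }
}
```

The college and group variants would only differ in the key and in the named fields passed to Students.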

Kordi
  • Sorry. Actually i am looking for the mentioned output and aggregation happens on some other params also. Not able to find clean way to generate mentioned output. – jony Mar 09 '16 at 20:14
  • @jony ok thought the fee is enough cause the other parameter are obvious. – Kordi Mar 09 '16 at 20:15
  • @jony now I have the exact Output and the code is a little bit prettier, but I dont think Its faster, I only have the first example group by city, the other are straight forward, cause the grouping is the same like in your example – Kordi Mar 09 '16 at 22:51
  • @jony So now I have some better methods with only going 1 time over the list. And the output is exactly like in your code example. – Kordi Mar 09 '16 at 23:43
  • Great. Thank you for your time. – jony Mar 10 '16 at 10:36
  • Please explain me this - Just going in one loop over the complete list for FoldLeft. Here you are parsing list 3 times (How can it be one parse), making a map of 1 combination, 2 and 3 combinations. For the desired output then i need to go through map3 first, loop on each key, take first two parameters and ask for map2 and then for map1. Here almost we are running the list 4 times if half of the list has unique values. – jony Mar 10 '16 at 11:31
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/105895/discussion-between-kordi-and-jony). – Kordi Mar 10 '16 at 11:49
  • @jony For one foldLeft I go one time over the list. In your code you go one time via groupby over the list and then one time via the foldLeft where you are sum up the fee. So I create 3 Maps, so i loop 3 times over the List. You can avoid that if you create one Big Map I try to do this. But then the output is only one Big Map – Kordi Mar 10 '16 at 11:57
  • @jony I have now one chapter where I create a Big Map with all the 3 grouping in it. There I go only one time over the complete list. – Kordi Mar 10 '16 at 12:16
  • @Kordi regarding your comment "This should be better in terms of memory consumption cause you aren't copy the maps like in the foldLeft example". I think this is a false assumption. When dealing with immutable data structure, copy is cheap, indeed you benefit from structural sharing., there is no need for deep copy, pointers to existing data can be safely used by the system. – Patrick Refondini Jun 01 '18 at 08:55

Use a foldLeft

First, let's define some type aliases to make the syntax easier

object GroupByStudents {

type City = String
type College = String
type Group = String
type Name = String

type Aggregate = Map[City, Map[College, Map[Group, List[Students]]]]
def emptyAggregate: Aggregate = Map.empty

case class Students(city: City, college: College, group: Group,
                  name: Name, fee: Int, age: Int)
}

You can aggregate the students list into an Aggregate map in a single foldLeft

object Test {

  import GroupByStudents._

  def main(args: Array[String]): Unit = {
    val studentsList = List(
      Students("Mumbai","College1","Science","Jony",100,30),
      Students("Mumbai","College1","Science","Tony", 200, 25),
      Students("Mumbai","College1","Social","Bony",250,30),
      Students("Mumbai","College2","Science","Gony", 240, 28),
      Students("Bangalore","College3","Science","Hony", 270, 28))

    val aggregated = studentsList.foldLeft(emptyAggregate) { (agg, students) =>
      // Look up the existing bins; explicit type parameters on the empty
      // defaults keep type inference unambiguous.
      val cityBin = agg.getOrElse(students.city, Map.empty[College, Map[Group, List[Students]]])
      val collegeBin = cityBin.getOrElse(students.college, Map.empty[Group, List[Students]])
      val groupBin = collegeBin.getOrElse(students.group, List.empty[Students])

      // Rebuild the path from the innermost bin outwards.
      val nextGroupBin = students :: groupBin
      val nextCollegeBin = collegeBin + (students.group -> nextGroupBin)
      val nextCityBin = cityBin + (students.college -> nextCollegeBin)
      agg + (students.city -> nextCityBin)
    }
  }
}

aggregated can then be mapped over to calculate fees. If you really want, you can calculate the fees in the foldLeft itself, but this would make the code harder to read.
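
As a rough sketch of that mapping step, the nested structure can be flattened into rows with the city total first, then each college total, then each group total. `FeesFromAggregate` and `report` are names made up for illustration, and the aliases are written out as plain String so the block stands alone:

```scala
object FeesFromAggregate {
  case class Students(city: String, college: String, group: String,
                      name: String, fee: Int, age: Int)

  // Same shape as the Aggregate alias above: city -> college -> group -> students.
  type Aggregate = Map[String, Map[String, Map[String, List[Students]]]]

  // Hypothetical helper: walks the nested maps and emits one summary row per level.
  def report(agg: Aggregate): List[Students] =
    agg.toList.flatMap { case (city, colleges) =>
      val cityFee = colleges.values.flatMap(_.values).flatten.map(_.fee).sum
      val collegeRows = colleges.toList.flatMap { case (college, groups) =>
        val collegeFee = groups.values.flatten.map(_.fee).sum
        val groupRows = groups.toList.map { case (group, members) =>
          Students(city, college, group, "", members.map(_.fee).sum, 0)
        }
        Students(city, college, "", "", collegeFee, 0) :: groupRows
      }
      Students(city, "", "", "", cityFee, 0) :: collegeRows
    }
}
```

This sums the fees separately at each level for readability, so it visits the grouped students more than once; the order of cities, colleges and groups follows the Map iteration order.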

Note that you can also try monocle's lenses to put the students value in the aggregated structure.

Swifter
  • Thanks for the neat answer. As mentioned, output requires one more pass through all map values. – jony Mar 10 '16 at 17:37