Your vector is not valid Scala syntax. I think you mean something like this:
val items = Seq(
  Seq("B", "D", "A", "P", "F"),
  Seq("B", "A", "F"),
  Seq("B", "D", "A", "F"),
  Seq("B", "D", "A", "T", "F"),
  Seq("B", "A", "P", "F"),
  Seq("B", "D", "A", "P", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F")
)
It sounds like what you are trying to accomplish is two `group by` clauses. First, you want all combinations from each list. Then you want to count how often each combination occurs across the lists, and finally do another `group by` on that count, so that combinations occurring at the same frequency are merged together.
For this you will need the following function to perform the double reduction after the double groupby.
Steps:
- Collect all the combinations. For each list in `items`, we generate every combination of its elements, which gives a `Seq[Seq[String]]` where each `Seq[String]` is a unique combination. The inner `flatMap` is needed because the `(1 to group.length)` operation would otherwise produce a `Seq` of `Seq[Seq[String]]`. The outer `flatMap` then flattens the results across all the lists in your vector into a single `Seq[Seq[String]]`.
- The first `groupMapReduce` counts how often each combination appears: every occurrence is mapped to 1, and the ones are summed per combination. This gives the frequency of each combination.
- The counts are grouped again, but this time by the number of occurrences. So if "A" and "B" both occur 10 times, they will be grouped together.
- The final `map` reduces each accumulated group with the reduction function.
val combos = items
  .flatMap(group => (1 to group.length).flatMap(i => group.combinations(i).map(_.sorted)).distinct) // Seq[Seq[String]]
  .groupMapReduce(identity)(_ => 1)(_ + _) // Map[Seq[String], Int]
  .groupMapReduce(_._2)(v => Seq(v))(_ ++ _) // Map[Int, Seq[(Seq[String], Int)]]
  .map { case (total, groups) => (groupReduction(groups), total) } // Map[String, Int]; groupReduction decides how each group is merged
I've defined this double reduction function as follows. It converts a group like `Seq("A","B")` into `"A","B"`, and if `Seq("A","B")` has the same count as another group `Seq("C")`, the groups are concatenated together as `"A","B","C"`.
def groupReduction(groups: Seq[(Seq[String], Int)]): String = {
  // Quote each element, join each combination with commas, then join the combinations
  groups.map(_._1.map(v => "\"" + v + "\"").sorted.mkString(",")).sorted.mkString(",")
}
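To see what the two grouping passes do, here is a minimal sketch on a much smaller input. The data and the names `toy`, `allCombos`, `counts` and `byCount` are made up for illustration, and it collects into a `Set` rather than a `Seq` so that the result does not depend on ordering. It assumes Scala 2.13 or later, where `groupMapReduce` is available.

```scala
// Toy input: three small lists (hypothetical data, not from the question).
val toy = Seq(Seq("A", "B"), Seq("A"), Seq("B"))

// Every non-empty combination of each list, sorted so element order does not matter.
val allCombos: Seq[Seq[String]] =
  toy.flatMap(g => (1 to g.length).flatMap(i => g.combinations(i).map(_.sorted)).distinct)

// First pass: how many lists contain each combination.
val counts: Map[Seq[String], Int] =
  allCombos.groupMapReduce(identity)(_ => 1)(_ + _)

// Second pass: collect the combinations that share the same count.
val byCount: Map[Int, Set[Seq[String]]] =
  counts.groupMapReduce(_._2)(kv => Set(kv._1))(_ ++ _)
```

Here `Seq("A")` and `Seq("B")` each appear in two lists, so they end up together under key 2, while `Seq("A", "B")` appears once and sits alone under key 1.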
The combination sizes of interest can be adjusted in the `(1 to group.length)` clause. If limited to `3 to 3`, the groups would be
List(List(B, D, P), List(A, D, P), List(D, F, P)): 2
List(List(A, B, F)): 10
List(List(B, D, F), List(A, D, F), List(A, B, D)): 4
List(List(A, F, P), List(B, F, P), List(A, B, P)): 3
List(List(B, D, T), List(A, F, T), List(B, F, T), List(A, D, T), List(A, B, T), List(D, F, T)): 1
As you can see in your example, `List(B, D, F)` and `List(A, D, F)` are also associated with your second line "A,B,D".
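The size-3 grouping above can be reproduced with a small parameterized variant of the pipeline. This is only a sketch: `combosByFrequency`, `minSize` and `maxSize` are hypothetical names introduced here, and the groups are kept as a `Set[Seq[String]]` instead of being merged into a string, so the order inside each group is not significant.

```scala
val items = Seq(
  Seq("B", "D", "A", "P", "F"),
  Seq("B", "A", "F"),
  Seq("B", "D", "A", "F"),
  Seq("B", "D", "A", "T", "F"),
  Seq("B", "A", "P", "F"),
  Seq("B", "D", "A", "P", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F"),
  Seq("B", "A", "F")
)

// Hypothetical helper: group the combinations of size minSize..maxSize
// by the number of lists that contain them.
def combosByFrequency(data: Seq[Seq[String]], minSize: Int, maxSize: Int): Map[Int, Set[Seq[String]]] =
  data
    .flatMap(g => (minSize to maxSize).flatMap(i => g.combinations(i).map(_.sorted)).distinct)
    .groupMapReduce(identity)(_ => 1)(_ + _)               // combination -> count
    .groupMapReduce(_._2)(kv => Set(kv._1))(_ ++ _)        // count -> combinations

// Only combinations of exactly three elements.
val triples = combosByFrequency(items, 3, 3)
```

With the data above, `triples(10)` contains only `Seq("A", "B", "F")`, and `triples(4)` holds the three combinations `B, D, F`, `A, D, F` and `A, B, D`, matching the listing.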