Input:
val input = List(("a", "10 Inches"), ("a", "10.00 inches"), ("a", "15 in"), ("b", "2 cm"), ("b", "2.00 CM"))
I would like to get this output:
val output = List(("a", "10 Inches", 0.66), ("b", "2 cm", 1.0))
I also have a utility function, fuzzyMatch(s1, s2), that returns true when two values are approximately the same, e.g. ("10 Inches", "10.00 inches").
fuzzyMatch(s1, s2) returns:
true for s1 = "10 Inches" and s2 = "10.00 inches"
false for s1 = "10 Inches" and s2 = "15 in"
false for s1 = "10.00 inches" and s2 = "15 in"
true for s1 = "2 cm" and s2 = "2.00 CM"
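For readers who want something concrete to run, here is a minimal sketch of a matcher that satisfies the four cases above. It is not the actual utility from the question: the unitAliases table and the normalize helper are assumptions made for illustration, and the real fuzzyMatch may use different rules.

```scala
object FuzzyMatchSketch {
  // Hypothetical unit aliases; the real utility may recognise a different set.
  private val unitAliases: Map[String, String] = Map(
    "in" -> "in", "inch" -> "in", "inches" -> "in",
    "cm" -> "cm", "centimeter" -> "cm", "centimeters" -> "cm"
  )

  // Splits "10.00 inches" into (10.0, "in"); None if the string does not parse.
  private def normalize(s: String): Option[(Double, String)] =
    s.trim.split("\\s+") match {
      case Array(number, unit) =>
        for {
          n <- scala.util.Try(number.toDouble).toOption
          u <- unitAliases.get(unit.toLowerCase)
        } yield (n, u)
      case _ => None
    }

  // Two strings match when both parse to the same number and the same unit.
  def fuzzyMatch(s1: String, s2: String): Boolean =
    (normalize(s1), normalize(s2)) match {
      case (Some(a), Some(b)) => a == b
      case _                  => false
    }
}
```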
The output is a list of (unique name, most frequent value, number of occurrences of that value / total occurrences for the name).
How can I reduce the input above to this output?
What I have so far:
val tupleMap = input.groupBy(identity).mapValues(_.size)
val totalOccurrences = input.groupBy(_._1).mapValues(_.size)
val maxNumberOfValueOccurrences = tupleMap.groupBy(_._1._1).mapValues(_.values.max)
val processedInput = tupleMap
  .filter {
    case (k, v) => v == maxNumberOfValueOccurrences(k._1)
  }
  .map {
    case (k, v) => (k._1, k._2, v.toDouble / totalOccurrences(k._1))
  }.toSeq
This gives ratios for exact matches only. How can I fit my fuzzy match in so that all similar values are grouped together before the ratio is calculated? The value reported for a fuzzy group can be any of its members.
It's essentially a custom groupBy using my fuzzyMatch(...) method, but I can't think of a solution.
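One way to picture such a custom groupBy is a fold that keeps a list of clusters, each identified by the first value that landed in it. The sketch below is only an illustration under that assumption: fuzzyGroup and mostFrequent are hypothetical names, and the matcher is passed in as a parameter so the block does not depend on the real fuzzyMatch.

```scala
object FuzzyGroupSketch {
  // Puts each value into the first cluster whose representative (the first
  // value seen for that cluster) matches it; otherwise starts a new cluster.
  def fuzzyGroup(values: Seq[String],
                 matches: (String, String) => Boolean): Vector[(String, Int)] =
    values.foldLeft(Vector.empty[(String, Int)]) { (clusters, value) =>
      clusters.indexWhere { case (rep, _) => matches(rep, value) } match {
        case -1 => clusters :+ ((value, 1))
        case i =>
          val (rep, count) = clusters(i)
          clusters.updated(i, (rep, count + 1))
      }
    }

  // For every name: (representative of the largest cluster, cluster size / total).
  def mostFrequent(input: Seq[(String, String)],
                   matches: (String, String) => Boolean): Map[String, (String, Double)] =
    input.groupBy(_._1).map { case (name, pairs) =>
      val (rep, count) = fuzzyGroup(pairs.map(_._2), matches).maxBy(_._2)
      (name, (rep, count.toDouble / pairs.size))
    }
}
```

Called as FuzzyGroupSketch.mostFrequent(input, fuzzyMatch), this would yield one (value, ratio) pair per name. Note that a fuzzy matcher is not necessarily transitive, so which cluster a value joins can depend on the order of the input.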
After some more thinking I came up with the code below. Better solutions would be appreciated.
import scala.collection.mutable

// Values grouped by name, e.g. "a" -> Seq("10 Inches", "10.00 inches", "15 in")
val tupleMap: Map[String, Seq[String]] = input.groupBy(_._1).mapValues(_.map(_._2))
val result = tupleMap mapValues { list =>
  // Representative value -> number of values that fuzzy-matched it
  val valueCountsMap = mutable.Map.empty[String, Int]
  list foreach { value =>
    // findBestMatch (uses fuzzyMatch) returns Some(key) if a similar key
    // already exists, otherwise None
    findBestMatch(value, valueCountsMap.keySet.toSeq) match {
      case Some(key) => valueCountsMap(key) += 1
      case None      => valueCountsMap(value) = 1
    }
  }
  // (most frequent value, its count); dividing the count by list.size
  // would give the ratio asked for above
  valueCountsMap.maxBy(_._2)
}
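The findBestMatch helper is not shown in the question. A minimal sketch of what it might look like follows, assuming "best" simply means the first existing key the matcher accepts. The third parameter is an addition made here so the block is self-contained; the original takes only the value and the keys.

```scala
object FindBestMatchSketch {
  // Returns the first existing key that the matcher accepts for `value`, if any.
  // `fuzzyMatch` is passed in explicitly here; in the question it is a utility
  // that is already in scope.
  def findBestMatch(value: String,
                    keys: Seq[String],
                    fuzzyMatch: (String, String) => Boolean): Option[String] =
    keys.find(key => fuzzyMatch(key, value))
}
```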