
Input:

val input = List(("a", "10 Inches"), ("a", "10.00 inches"), ("a", "15 in"), ("b", "2 cm"), ("b", "2.00 CM"))

I would like to get this output:

val output = List(("a", "10 Inches", 0.66), ("b", "2 cm", 1.0))

I also have a utility function that returns true for fuzzy matching ("10 Inches", "10.00 inches")

fuzzyMatch(s1, s2) returns

true for s1 = "10 Inches" and s2 = "10.00 inches"
false for s1 = "10 Inches" and s2 = "15 in"
false for s1 = "10.00 inches" and s2 = "15 in"
true for s1 = "2 cm" and s2 = "2.00 CM"

Output = List of (unique_name, max occurred string value, (max number of occurrences/total occurrences))

How can I reduce the input above to that output?

What I have so far:

val tupleMap = input.groupBy(identity).mapValues(_.size)
val totalOccurrences = input.groupBy(_._1).mapValues(_.size)
val maxNumberOfValueOccurrences = tupleMap.groupBy(_._1._1).mapValues(_.values.max)
val processedInput = tupleMap
      .filter {
        case (k, v) => v == maxNumberOfValueOccurrences(k._1)
      }
      .map {
        case (k, v) => (k._1, k._2, v.toDouble / totalOccurrences(k._1))
      }.toSeq

This gives ratios for exact matches only. How can I fit my fuzzy match in there so that it groups all similar values and calculates the ratio? The reported value can be any one of the fuzzy-matched strings.

It's essentially a custom groupBy using my fuzzyMatch(...) method. But I can't think of a solution here.

After some more thinking I came up with something like the code below. Better solutions would be appreciated.

import scala.collection.mutable

val tupleMap: Map[String, Seq[String]] = input.groupBy(_._1).mapValues(_.map(_._2))

val result = tupleMap mapValues { list =>
  val valueCountsMap: mutable.Map[String, Int] = mutable.Map[String, Int]()

  list foreach { value =>
    // Using fuzzy match to find the best match
    // findBestMatch (uses fuzzyMatch) returns the Option(key)
    // if there exists a similar key, if not returns None
    val bestMatch = findBestMatch(value, valueCountsMap.keySet.toSeq)
    if (bestMatch.isDefined) {
      val newValueCount = valueCountsMap.getOrElse(bestMatch.get, 0) + 1
      valueCountsMap(bestMatch.get) = newValueCount
    } else {
      valueCountsMap(value) = 1
    }
  }

  // (most frequent value, its count) for this key
  val maxOccurredValueNCount: (String, Int) = valueCountsMap.maxBy(_._2)
  (maxOccurredValueNCount._1, maxOccurredValueNCount._2)
}
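The loop above relies on `findBestMatch`, which is not shown. A minimal sketch of what it might look like, assuming it simply returns the first existing key that fuzzy-matches the value, is given here. The `fuzzyMatch` body is only a stand-in for the real utility, and `countFuzzy` is a hypothetical name for the same counting logic written as an immutable fold, extended with the ratio that the code above does not yet compute.

```scala
// Stand-in for the real fuzzyMatch (an assumption): ignores case, ".00" and "inches" vs "in"
def fuzzyMatch(s1: String, s2: String): Boolean = {
  def norm(s: String): String = s.toLowerCase.replace(".00", "").replace("inches", "in")
  norm(s1) == norm(s2)
}

// Guess at findBestMatch: the first candidate that fuzzy-matches the value, if any
def findBestMatch(value: String, candidates: Seq[String]): Option[String] =
  candidates.find(fuzzyMatch(_, value))

// Hypothetical helper: counts values under the first key they fuzzy-match
def countFuzzy(values: Seq[String]): Map[String, Int] =
  values.foldLeft(Map.empty[String, Int]) { (counts, v) =>
    val key = findBestMatch(v, counts.keys.toSeq).getOrElse(v)
    counts.updated(key, counts.getOrElse(key, 0) + 1)
  }

val counts = countFuzzy(Seq("10 Inches", "10.00 inches", "15 in"))
val (best, n) = counts.maxBy(_._2)        // most frequent key and its count
val ratio = n.toDouble / counts.values.sum // share of all values for this name
```

With the stand-in matcher, `best` is "10 Inches" and `ratio` is 2/3. The result depends on `fuzzyMatch` behaving transitively; if it does not, the grouping can vary with the order of the input.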
yalkris
  • Your code doesn't match your example data. Particularly, how do you find the "max occurred value"? And if you already can extract numerical values, why do you need `fuzzyMatch` at all? Just convert string to the numerical value and match by it. – SergGr Feb 02 '18 at 01:41
  • One of the requirements is to find max occurred value using fuzzyMatch. In 15 inches, 15.00 inches & 10 in fuzzyMatch says 15 Inches & 15.00 inches are similar and 10 in is not. With that we can tell "15 Inches/15.00 inches" is the "max occurred value". – yalkris Feb 02 '18 at 22:26

2 Answers


If for some reason the approach of converting to numerical values doesn't work for you, here is some code that seems to do what you want:

import scala.collection.breakOut // required by (breakOut) below; available up to Scala 2.12

def fuzzyMatch(s1: String, s2: String): Boolean = {
  // fake implementation
  val matches = List(("15 Inches", "15.00 inches"), ("2 cm", "2.00 CM"))
  s1.equals(s2) || matches.exists({
    case (m1, m2) => (m1.equals(s1) && m2.equals(s2)) || (m1.equals(s2) && m2.equals(s1))
  })
}

def test(): Unit = {
  val input = List(("a", "15 Inches"), ("a", "15.00 inches"), ("a", "10 in"), ("b", "2 cm"), ("b", "2.00 CM"))
  val byKey = input.groupBy(_._1).mapValues(l => l.map(_._2))
  val totalOccurrences = byKey.mapValues(_.size)
  val maxByKey = byKey.mapValues(_.head) //random "max" selection logic

  val processedInput: List[(String, String, Double)] = maxByKey.map({
    case (mk, mv) =>
      val matchCount = byKey(mk).count(tv => fuzzyMatch(tv, mv))
      (mk, mv, matchCount.toDouble / totalOccurrences(mk))
  })(breakOut)

  println(processedInput)
}

This prints

List((b,2 cm,1.0), (a,15 Inches,0.6666666666666666))
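The `maxByKey` line above is only a placeholder. One possible way to replace it, sketched under the assumption that "max occurred value" means the value with the most fuzzy matches within its own key, is to count the matches for every candidate and keep the best one. `mostFrequent` is a hypothetical helper name, and the matcher is the fake one from this answer.

```scala
// Fake matcher from the answer above (a placeholder for the real fuzzyMatch)
def fuzzyMatch(s1: String, s2: String): Boolean = {
  val matches = List(("15 Inches", "15.00 inches"), ("2 cm", "2.00 CM"))
  s1 == s2 || matches.exists { case (m1, m2) =>
    (m1 == s1 && m2 == s2) || (m1 == s2 && m2 == s1)
  }
}

// Hypothetical helper: the value that fuzzy-matches the most values in its group,
// paired with that number of matches (quadratic in the group size)
def mostFrequent(values: Seq[String]): (String, Int) =
  values.map(v => (v, values.count(fuzzyMatch(v, _)))).maxBy(_._2)

// Input for which taking the head of each group picks the wrong value
val input = List(("a", "10 in"), ("a", "15.00 inches"), ("a", "15 Inches"), ("b", "2 cm"), ("b", "2.00 CM"))

val result: List[(String, String, Double)] =
  input.groupBy(_._1).toList.sortBy(_._1).map { case (key, pairs) =>
    val values = pairs.map(_._2)
    val (value, matchCount) = mostFrequent(values)
    (key, value, matchCount.toDouble / values.size)
  }
```

For key `a` this reports one of the two 15-inch spellings with a ratio of 2/3 rather than "10 in" with 1/3. Which spelling is returned depends on how ties are broken, which the question says is acceptable.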

SergGr
  • The solution above doesn't work for this input: `val input = List(("a", "10 in"), ("a", "15.00 inches"), ("a", "15 Inches"), ("b", "2 cm"), ("b", "2.00 CM"))`. The problem is your random max selection logic. The max occurred value should be based on the fuzzy-matched value, not chosen at random. – yalkris Feb 02 '18 at 22:05
  • @yalkris, obviously you should put your real logic into `maxByKey` calculation. That's why I put a comment on that line that I used essentially a random selection logic instead of your real one as you didn't specify your real one in the question. – SergGr Feb 02 '18 at 22:06
  • Thanks. My requirement is in (a,15 Inches), (a,15.00 inches), (a,10 inches) 15 inches occurred twice so the result should be (a, 15.00 inches or 15 Inches, 0.6666). How would you do that using your fuzzyMatch method which matches 15.00 inches and 15 Inches? – yalkris Feb 02 '18 at 22:09
  • @yalkris, am I right that when you start with `List(("a", "10 in"), ("a", "15.00 inches"), ("a", "15 Inches"), ("b", "2 cm"), ("b", "2.00 CM"))` you get `(a,10 in,0.33333)`? If this is so, the problem is **_not_** in the grouping logic, it is in the logic to select the max value. Your question implies that you already know how to select the max value from those strings. If this is not the case - you should change your question because that problem has nothing to do with "fuzzy matching". – SergGr Feb 02 '18 at 22:12
  • No. I only have a fuzzyMatch method which takes two strings and returns true if they are similar. I want to return the max occurred value and its ratio of the total occurrences. – yalkris Feb 02 '18 at 22:14
  • @yalkris, you obviously can't return max value using `fuzzyMatch` only. There is nothing that specifies order of different values so how the max values should be selected? – SergGr Feb 02 '18 at 22:15
  • @yalkris, I have a question. Maybe I misunderstood your example. Assume that the starting list is `List(("a", "10 in"), ("a", "10.00 inches"), ("a", "15 Inches"))`. What answer do you expect `("a", "10 in", 0.666)` or `("a", "15 Inches", 0.333)`? If this is the first one, you can get it with `fuzzyMatch` only. If it is the second one - you've got no chances. – SergGr Feb 02 '18 at 22:24
  • @yalkris, sorry have you noticed that in my last comment the example is different? There are two `"10 in"` and `"10.00 inches"` and only one `"15 Inches"`? If the answer is still `("a", "15 Inches", 0.666)`, could you describe by what logic that value should be calculated? – SergGr Feb 02 '18 at 22:29
  • I am sorry. Yes. As per your input I want `("a", "10.00 inches",0.66)` – yalkris Feb 02 '18 at 22:32
  • @yalkris, Aha! Your original example is confusing as to what exactly "max" means. This can be done but it will take me some time to update the answer. – SergGr Feb 02 '18 at 22:33
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/164437/discussion-between-yalkris-and-serggr). – yalkris Feb 02 '18 at 22:38

Here's an approach to preprocess your input with fuzzy-match, which will then be used as input by your existing code.

The idea is to first generate 2-combinations of your input tuples, fuzzy-match them to create a Map of distinct Sets consisting of the matched values per key, and finally use the Map to fuzzy-match your original input.

To make sure more arbitrary cases are covered, I've expanded your input:

val input = List(
  ("a", "10 in"), ("a", "15 in"), ("a", "10 inches"), ("a", "15 Inches"), ("a", "15.00 inches"),
  ("b", "2 cm"), ("b", "4 cm"), ("b", "2.00 CM"),
  ("c", "7 cm"), ("c", "7 in")
)

// Trivialized fuzzy match
def fuzzyMatch(s1: String, s2: String): Boolean = {
  val st1 = s1.toLowerCase.replace(".00", "").replace("inches", "in")
  val st2 = s2.toLowerCase.replace(".00", "").replace("inches", "in")
  st1 == st2
}

// Create a Map of Sets of fuzzy-matched values from all 2-combinations per key
val fuzMap = input.combinations(2).foldLeft( Map[String, Seq[Set[String]]]() ){
  case (m, Seq(t1: Tuple2[String, String], t2: Tuple2[String, String])) =>
    if (fuzzyMatch(t1._2, t2._2)) {
      val fuzSets = m.getOrElse(t1._1, Seq(Set(t1._2, t2._2))).map(
        x => if (x.contains(t1._2) || x.contains(t2._2)) x ++ Set(t1._2, t2._2) else x
      )
      if (!fuzSets.flatten.contains(t1._2) && !fuzSets.flatten.contains(t2._2))
        m + (t1._1 -> (fuzSets :+ Set(t1._2, t2._2)))
      else
        m + (t1._1 -> fuzSets)
    }
    else
      m
}
// fuzMap: scala.collection.immutable.Map[String,Seq[Set[String]]] = Map(
//   a -> List(Set(10 in, 10 inches), Set(15 in, 15 Inches, 15.00 inches)),
//   b -> List(Set(2 cm, 2.00 CM))
// )

Note that for large input, it might make sense to first groupBy key and generate 2-combinations per key.
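A rough sketch of that per-key variant follows. It assumes the same trivialized `fuzzyMatch`, and `fuzzySets` and `fuzMapPerKey` are hypothetical names introduced for illustration. Besides generating fewer pairs, grouping first means that values belonging to different keys are never compared with each other.

```scala
val input = List(
  ("a", "10 in"), ("a", "15 in"), ("a", "10 inches"), ("a", "15 Inches"), ("a", "15.00 inches"),
  ("b", "2 cm"), ("b", "4 cm"), ("b", "2.00 CM"),
  ("c", "7 cm"), ("c", "7 in")
)

// Trivialized fuzzy match, as above
def fuzzyMatch(s1: String, s2: String): Boolean = {
  def norm(s: String): String = s.toLowerCase.replace(".00", "").replace("inches", "in")
  norm(s1) == norm(s2)
}

// Hypothetical helper: builds the sets of matching values for one key,
// looking only at 2-combinations of that key's distinct values
def fuzzySets(values: Seq[String]): Seq[Set[String]] =
  values.distinct.combinations(2).foldLeft(Vector.empty[Set[String]]) {
    case (sets, Seq(v1, v2)) if fuzzyMatch(v1, v2) =>
      // merge every existing set that already holds v1 or v2 with the new pair
      val (touching, rest) = sets.partition(s => s.contains(v1) || s.contains(v2))
      rest :+ touching.foldLeft(Set(v1, v2))(_ ++ _)
    case (sets, _) => sets
  }

// Group by key first, then build the sets per key; keys without any match are dropped
val fuzMapPerKey: Map[String, Seq[Set[String]]] =
  input.groupBy(_._1)
    .map { case (k, pairs) => k -> fuzzySets(pairs.map(_._2)) }
    .filter(_._2.nonEmpty)
```

For this input the result holds the same sets per key as `fuzMap` above, with no entry for `c`.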

Next step would be to fuzzy-match the original input using the created Map:

// Fuzzy-match original input using fuzMap
val fuzInput = input.map{ case (k, v) => 
  if (fuzMap.get(k).isDefined) {
    val fuzValues = fuzMap(k).map{
      case x => if (x.contains(v)) Some(x.min) else None
    }.flatten
    if (!fuzValues.isEmpty)
      (k, fuzValues.head)
    else
      (k, v)
  }
  else
    (k, v)
}
// fuzInput: List[(String, String)] = List(
//   (a,10 in), (a,15 Inches), (a,10 in), (a,15 Inches), (a,15 Inches),
//   (b,2 cm), (b,4 cm), (b,2 cm),
//   (c,7 cm), (c,7 in)
// )
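To close the loop, the normalized list can be fed into the exact-match pipeline from the question. The sketch below copies the `fuzInput` values printed above and restates that pipeline with plain `map` instead of `mapValues`, purely so that it behaves the same on Scala 2.12 and 2.13. The names `tupleCounts`, `maxCountPerKey` and `ratios` are illustrative.

```scala
// Normalized input, copied from the output shown above
val fuzInput = List(
  ("a", "10 in"), ("a", "15 Inches"), ("a", "10 in"), ("a", "15 Inches"), ("a", "15 Inches"),
  ("b", "2 cm"), ("b", "4 cm"), ("b", "2 cm"),
  ("c", "7 cm"), ("c", "7 in")
)

// Occurrences of each exact (key, value) pair
val tupleCounts: Map[(String, String), Int] =
  fuzInput.groupBy(identity).map { case (t, ts) => t -> ts.size }

// Total number of values per key
val totalOccurrences: Map[String, Int] =
  fuzInput.groupBy(_._1).map { case (k, ts) => k -> ts.size }

// Highest count reached by any single value of a key
val maxCountPerKey: Map[String, Int] =
  tupleCounts.groupBy(_._1._1).map { case (k, m) => k -> m.values.max }

// Keep the most frequent value(s) per key together with their share of the total
val ratios: List[(String, String, Double)] =
  tupleCounts.toList.collect {
    case ((k, v), n) if n == maxCountPerKey(k) => (k, v, n.toDouble / totalOccurrences(k))
  }.sortBy(t => (t._1, t._2))
```

Here `a` yields "15 Inches" with 3/5 and `b` yields "2 cm" with 2/3. Key `c` has a tie, so both of its values are kept with 1/2 each; picking a single winner would need an extra tie-breaking rule.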
Leo C