5

I'm using some Java code to do fast prefix lookups, using java.util.TreeSet, could I be using scala's TreeSet instead? Or a different solution?

/** A class that uses a TreeSet to do fast prefix matching
 */
class PrefixMatcher {
  private val _set = new java.util.TreeSet[String]

  def add(s: String) = _set.add(s)

  def findMatches(prefix: String): List[String] = {
    val matches = new ListBuffer[String]
    val tailSet = _set.tailSet(prefix)
    for ( tail <- tailSet.toArray ) {
      val tailString = tail.asInstanceOf[String]
      if ( tailString.startsWith(prefix) ) 
        matches += tailString
      else
        return matches.toList    
    }

    matches.toList
  }
}
waterlooalex
  • 13,642
  • 16
  • 78
  • 99
  • While tries are tree-structured, I don't see how a TreeSet is going to help you implement a trie. The use of the tree in a TreeSet is just to exploit the ordering on the elements to speed contains testing and insertion. The tree aspect is not a part of its API. – Randall Schulz Dec 09 '09 at 16:01
  • The fact that the class is named Trie is probably a mistake, I copied it from somewhere else. So, ignoring the name, I think TreeSet is a reasonable approach for matching prefixes. – waterlooalex Dec 09 '09 at 16:13

5 Answers5

14

Use a Trie. Nobody's actually posted a Trie here yet, despite the fact that some people have posted sorted TreeMap data structures that they have misnamed as tries. Here is a fairly representative sample of a Trie implementation in Java. I don't know of any in Scala. See also an explanation of Tries on Wikipedia.

Ken Bloom
  • 57,498
  • 14
  • 111
  • 168
  • 3
    There is a Patricia Trie implementation in scala used by **scala-wurfl** project. The last source [here](https://github.com/ff-dev/scala-wurfl/blob/b0e3101826e90f86b20fa96c9379b88672593cda/src/main/scala/org/ffdev/swurfl/matching/trie/PatriciaTrie.scala) – Filippo De Luca Apr 08 '12 at 22:45
6

The from & takeWhile approach:

class PrefixMatcher {
    private val _set = new TreeSet[String]
    def add(s: String) = _set.add(s)
    def findMatches(prefix: String): Iterable[String] =
        _set from prefix takeWhile(_ startsWith prefix)
}

An alternative is to select a subset from prefix to prefix++ (the smallest string after the prefix). This selects only the range of the tree that actually starts with the given prefix. Filtering of entries is not necessary. The subSet method will create a view of the underlying set.

There's still some work (overflow and empty strings won't work) in the increment method but the intent should be clear.

class PrefixMatcher {
  private val _set = new java.util.TreeSet[String]

  def add(s: String) = _set.add(s)

  def findMatches(prefix: String) : Set[String] = {
    def inc(x : String) = { //ignores overflow
       assert(x.length > 0) 
       val last = x.length - 1
       (x take last) + (x(last) + 1).asInstanceOf[Char]
    }
   _set.subSet(prefix, inc(prefix))
  }
}

The same works with the scala jcl.TreeSet#range implementation.

Thomas Jung
  • 32,428
  • 9
  • 84
  • 114
  • Thanks for the suggestion, I think the from syntax looks simpler for this scenario than subSet. – waterlooalex Dec 09 '09 at 16:00
  • Iterating over the set is even simpler. :-) The difference is the runtime behavior. Selecting a subset for the prefix is quite selective and will skip most of the tree. – Thomas Jung Dec 09 '09 at 16:05
  • Heh. I'm comparing using the method 'from' to get a subset with no end (so you have to check to see if you've reached the end), to your suggestion of using subSet, both skip most of the tree I think. – waterlooalex Dec 09 '09 at 16:08
  • 1
    Yes you're right. I've added a from version that does abort when the end of prefix is reached (using takeWhile). I suppose this has the same runtime behavior as your implementation and the subSet implementation. It's quite readable. – Thomas Jung Dec 09 '09 at 16:17
  • Great. I tried to upvote you, and turns out I'd already upvoted you, so it downvoted you, and now it says too old to change, sorry!! – waterlooalex Dec 09 '09 at 16:26
  • Maybe if you edit the answer one more time I'll be able to change my vote back to up. – waterlooalex Dec 09 '09 at 16:29
  • 5 downvotes? how do you mean? (sorry If I am missing the sarcasm or something!) – waterlooalex Dec 09 '09 at 16:30
  • 5 * 2 ~ 10. It was supposed to be ironic. Does not always work between different cultures. – Thomas Jung Dec 09 '09 at 16:33
  • Ah, makes sense now.. My paranoia that maybe I'd downvoted you 5 times, not just 1 time, prevented me from catching what you meant :) – waterlooalex Dec 09 '09 at 16:39
3

As I understand it, the Scala TreeSet is backed by the Java TreeSet, but using the Scala variant would allow you to shorten up the code using a sequence comprehension (http://www.scala-lang.org/node/111) giving you an implementation that looked something like (for Scala 2.7):

import scala.collection.jcl.TreeSet;

class PrefixMatcher 
{
    private val _set = new TreeSet[String]

    def add(s: String) = _set.add(s)

    def findMatches(prefix: String): Iterable[String] =
        for (s <- _set.from(prefix) if s.startsWith(prefix)) yield s
}

object Main
{
    def main(args: Array[String]): Unit =
    {
        val pm = new PrefixMatcher()

        pm.add("fooBar")
        pm.add("fooCow")
        pm.add("barFoo")

        pm.findMatches("foo").foreach(println)
    }
}

Apologies for any bad Scala style on my part, I'm just getting used to the language myself.

Ken Bloom
  • 57,498
  • 14
  • 111
  • 168
Derek Mortimer
  • 1,061
  • 1
  • 8
  • 11
  • @pyo: It *looks* like your for loop loops through all strings in the set, is that right? In the java code I call tailSet, which does something like: return an iterator to the nodes that start with the prefix specified in order, allowing us to find the matching items much faster than comparing every item in the set. – waterlooalex Dec 09 '09 at 15:09
  • @Alex Black: indeed, I missed out a call to .from in the sequence comprehension, edited in now. – Derek Mortimer Dec 09 '09 at 15:17
  • @pyo: one more question, will your implementation break when s.startsWith(prefix) returns false as the one I posted does? – waterlooalex Dec 09 '09 at 15:52
  • Nope, it will test the startsWith on every String 'from' prefix to the end of the set – Derek Mortimer Dec 09 '09 at 16:37
  • Truth be told I was unaware of the significance of the term, I merely kept it in from the original question. I shall duly edit it now ;) – Derek Mortimer Dec 10 '09 at 11:37
2

I blogged about finding matches for a combination of prefixes a while ago. It's a harder problem, as you don't know when one prefix ends and the other begins. It might interest you. I'll even post below the code that I did not blog (yet, hopefully :), though it is stripped of all comments, none of which were made in English:

package com.blogspot.dcsobral.matcher.DFA

object DFA {
  type Matched = List[(String, String)]
  def words(s : String) = s.split("\\W").filter(! _.isEmpty).toList
}

import DFA._
import scala.runtime.RichString

class DFA {
  private val initialState : State = new State(None, "")
  private var currState : State = initialState
  private var _input : RichString = ""
  private var _badInput : RichString = ""
  private var _accepted : Boolean = true

  def accepted : Boolean = _accepted
  def input : String = _input.reverse + _badInput.reverse 

  def transition(c : Char) : List[(String, Matched)] = {
    if (c == '\b') backtrack
    else {
      if (accepted) {
        val newState = currState(c)
        newState match {
          case Some(s) => _input = c + _input; currState = s
          case None => _badInput = c + _badInput; _accepted = false
        }
      } else {
        _badInput = c + _badInput
      }
      optionList
    }
  }

  def transition(s : String) : List[(String, Matched)] = {
    s foreach (c => transition(c))
    optionList
  }

  def apply(c : Char) : List[(String, Matched)] = transition(c)
  def apply(s : String) : List[(String,Matched)] = transition(s)

  def backtrack : List[(String, Matched)] = {
    if(_badInput isEmpty) {
      _input = _input drop 1
      currState.backtrack match {
        case Some(s) => currState = s
        case None =>
      }
    } else {
      _badInput = _badInput drop 1
      if (_badInput isEmpty) _accepted = true
    }
    optionList
  }

  def optionList : List[(String, Matched)] = if (accepted) currState.optionList else Nil

  def possibleTransitions : Set[Char] = if (accepted) (currState possibleTransitions) else Set.empty

  def reset : Unit = {
    currState = initialState
    _input = ""
    _badInput = ""
    _accepted = true
  }

  def addOption(s : String) : Unit = {
    initialState addOption s
    val saveInput = input
    reset
    transition(saveInput)
  }
  def removeOption(s : String) : Unit = {
    initialState removeOption s
    val saveInput = input
    reset
    transition(saveInput)
  }
}

class State (val backtrack : Option[State],
             val input : String) {
  private var _options : List[PossibleMatch] = Nil
  private val transitions : scala.collection.mutable.Map[Char, State] = scala.collection.mutable.Map.empty
  private var _possibleTransitions : Set[Char] = Set.empty

  private def computePossibleTransitions = {
    if (! options.isEmpty)
      _possibleTransitions = options map (_.possibleTransitions) reduceLeft (_++_)
    else
      _possibleTransitions = Set.empty
  }

  private def computeTransition(c : Char) : State = {
    val newState = new State(Some(this), input + c)
    options foreach (o => if (o.possibleTransitions contains c) (o computeTransition (newState, c)))
    newState
  }

  def options : List[PossibleMatch] = _options
  def optionList : List[(String, Matched)] = options map (pm => (pm.option, pm.bestMatch))
  def possibleTransitions : Set[Char] = _possibleTransitions

  def transition(c : Char) : Option[State] = {
    val t = c.toLowerCase
    if (possibleTransitions contains t) Some(transitions getOrElseUpdate (t, computeTransition(t))) else None
  }

  def apply(c : Char) : Option[State] = transition(c)

  def addOption(option : String) : Unit = {
    val w = words(option)
    addOption(option, w.size, List(("", w.head)), w)
  }

  def addOption(option : String, priority : Int, matched : Matched, remaining : List[String]) : Unit = {
    options find (_.option == option) match {
      case Some(pM) => 
        if (!pM.hasMatchOption(matched)) {
          pM.addMatchOption(priority, matched, remaining)
          if (priority < pM.priority) {
            val (before, _ :: after) = options span (_ != pM)
            val (highPriority, lowPriority) = before span (p => p.priority < priority || 
              (p.priority == priority && p.option < option))
            _options = highPriority ::: (pM :: lowPriority) ::: after
          }
          transitions foreach (t => pM computeTransition (t._2, t._1))
        }
      case None => 
        val (highPriority, lowPriority) = options span (p => p.priority < priority || 
              (p.priority == priority && p.option < option))
        val newPM = new PossibleMatch(option, priority, matched, remaining)
        _options = highPriority ::: (newPM :: lowPriority)
        transitions foreach (t => newPM computeTransition (t._2, t._1))
    }
    computePossibleTransitions
  }

  def removeOption(option : String) : Unit = {
    options find (_.option == option) match {
      case Some(possibleMatch) =>
        val (before, _ :: after) = options span (_ != possibleMatch)
        (possibleMatch.possibleTransitions ** Set(transitions.keys.toList : _*)).toList foreach (t => 
          transition(t).get.removeOption(option))
        _options = before ::: after
        computePossibleTransitions
      case None =>
    }
  }
}

class PossibleMatch (val option : String, 
                     thisPriority : Int, 
                     matched : Matched, 
                     remaining : List[String]) {
  private var _priority = thisPriority
  private var matchOptions = List(new MatchOption(priority, matched, remaining))
  private var _possibleTransitions = matchOptions map (_.possibleTransitions) reduceLeft (_++_)
  private def computePossibleTransitions = {
    _possibleTransitions = matchOptions map (_.possibleTransitions) reduceLeft (_++_)
  }

  def priority : Int = _priority
  def hasMatchOption(matched : Matched) : Boolean = matchOptions exists (_.matched == matched)
  def addMatchOption(priority : Int, matched : Matched, remaining : List[String]) : Unit = {
    if (priority < _priority) _priority = priority
    val (highPriority, lowPriority) = matchOptions span (_.priority < priority)
    val newMO = new MatchOption(priority, matched, remaining)
    matchOptions = highPriority ::: (newMO :: lowPriority)
    computePossibleTransitions
  }
  def bestMatch : Matched = matchOptions.head.matched.reverse.map(p => (p._1.reverse.toString, p._2)) ::: 
    remaining.tail.map(w => ("", w))
  def possibleTransitions : Set[Char] = _possibleTransitions

  def computeTransition(s: State, c : Char) : Unit = {
    def computeOptions(state : State,
                       c : Char, 
                       priority : Int, 
                       matched : Matched, 
                       remaining : List[String]) : Unit = {
      remaining match {
        case w :: ws => 
          if (!w.isEmpty && w(0).toLowerCase == c.toLowerCase) {
            val newMatched = (w(0) + matched.head._1, matched.head._2.substring(1)) :: matched.tail
            val newPriority = if (matched.head._1 isEmpty) (priority - 1) else priority

            if (w.drop(1) isEmpty)
              s.addOption(option, newPriority - 1, ("", ws.head) :: newMatched , ws)
            else
              s.addOption(option, newPriority, newMatched, w.substring(1) :: ws)
          }
          if (ws != Nil) computeOptions(s, c, priority, ("", ws.head) :: matched, ws)
        case Nil =>
      }
    }

    if(possibleTransitions contains c)
      matchOptions foreach (mO => computeOptions(s, c, mO.priority, mO.matched, mO.remaining))
  }
}

class MatchOption (val priority : Int,
                   val matched : Matched,
                   val remaining : List[String]) {
  lazy val possibleTransitions : Set[Char] = Set( remaining map (_(0) toLowerCase) : _* )
}

It really needs some refactoring, though. I always do it when I'm start to explain it for the blog.

Daniel C. Sobral
  • 295,120
  • 86
  • 501
  • 681
2

Ok, I just realized what you want is pretty much what a friend of mine suggested for another problem. So, here is his answer, simplified for your needs.

class PrefixMatcher {
  // import scala.collection.Set // Scala 2.7 needs this -- and returns a gimped Set
  private var set = new scala.collection.immutable.TreeSet[String]()
  private def succ(s : String) = s.take(s.length - 1) + ((s.charAt(s.length - 1) + 1)).toChar

  def add(s: String) = set += s

  def findMatches(prefix: String): Set[String] = 
    if (prefix.isEmpty) set else set.range(prefix, succ(prefix))
}
Daniel C. Sobral
  • 295,120
  • 86
  • 501
  • 681