Performance of Longest Substring Without Repeating Characters in Haskell

Question

Upon reading this Python question and proposing a solution, I tried to solve the same challenge in Haskell.

I've come up with the code below, which seems to work. However, since I'm pretty new to this language, I'd like some help in understand whether the code is good performancewise.

lswrc :: String -> String
lswrc s = reverse $ fst $ foldl' step ("","") s
  where
    step ("","") c = ([c],[c])
    step (maxSubstr,current) c
      | c `elem` current = step (maxSubstr,init current) c
      | otherwise = let candidate = (c:current)
                        longerThan = (>) `on` length
                        newMaxSubstr = if maxSubstr `longerThan` candidate
                                       then maxSubstr
                                       else candidate
                    in (newMaxSubstr, candidate)

Some points I think could be better than they are

I carry on a pair of strings (the longest tracked, and the current candidate) but I only need the former; thinking procedurally, there's no way to escape this, but maybe FP allows another approach?
I construct (c:current) but I use it only in the else; I could make a more complicated longerThan to add 1 to the lenght of its second argument, so that I can apply it to maxSubstr and current, and construct (c:current) in the else, without even giving it a name.
I drop the last element of current when c is in the current string, because I'm piling up the strings with :; I could instead pattern match when checking c against the string (as in c `elem` current@(a:as)), but then when adding the new character I should do current ++ [c], which I know is not as performant as c:current.
I use foldl' (as I know foldl doesn't really make sense); foldr could be an alternative, but since I don't see how laziness enters this problem, I can't tell which one would be better.

`length` runs in *O(n)*, so usually that is not a good idea. You can for example create 2-tuples with the list and the length. — Willem Van Onsem, May 16 '20 at 21:05
You want to be carrying round indices instead of the actual strings. This also helps with the `length` issue mentioned. `elem` is `O(n)`, so it's better to use a map counting character occurrences (or indices with `elem` if optimizing for space usage instead, though then you need the input as a character vector to index cheaply). `init` is expensive, Haskell lists are good as stacks not queues, `Data.Sequence` contains a suitable deque (but again, indices suffice here). — moonGoose, May 17 '20 at 02:37

Daniel Wagner · Accepted Answer · 2020-05-17T13:51:02.527

Running elem on every iteration makes your algorithm Ω(n^2) (for strings with no repeats). Running length on, in the worst case, every iteration makes your algorithm Ω(n^2) (for strings with no repeats). Running init a lot makes your algorithm Ω(n*sqrt(n)) (for strings that are sqrt(n) repetitions of a sqrt(n)-long string, with every other one reversed, and assuming an O(1) elem replacement).

A better way is to pay one O(n) cost up front to copy into a data structure with constant-time indexing, and to keep a set (or similar data structure) of seen elements rather than a flat list. Like this:

import Data.Set (Set)
import Data.Vector (Vector)
import qualified Data.Set as S
import qualified Data.Vector as V

lswrc2 :: String -> String
lswrc2 "" = ""
lswrc2 s_ = go S.empty 0 0 0 0 where
    s = V.fromList s_
    n = V.length s
    at = V.unsafeIndex s
    go seen lo hi bestLo bestHi
        | hi == n = V.toList (V.slice bestLo (bestHi-bestLo+1) s)
        -- it is probably faster (possibly asymptotically so?) to use findIndex
        -- to immediately pick the correct next value of lo
        | at hi `S.member` seen = go (S.delete (at lo) seen) (lo+1) hi bestLo bestHi
        | otherwise = let rec = go (S.insert (at hi) seen) lo (hi+1) in
            if hi-lo > bestHi-bestLo then rec lo hi else rec bestLo bestHi

This should have O(n*log(n)) worst-case performance (achieving that worst case on strings with no repeats). There may be ways that are better still; I haven't thought super hard about it.

On my machine, lswrc2 consistently outperforms lswrc on random strings. On the string ['\0' .. '\100000'], lswrc takes about 40s and lswrc2 takes 0.03s. lswrc2 can handle [minBound .. maxBound] in about 0.4s; I gave up after more than 20 minutes of letting lswrc chew on that list.

A similar implementation idea can probably be achieved with an efficient purely-functional queue and no vector/indexing operations: instead of tracking an upper and lower bound on the current candidate slice, keep a queue representing the slice. Push as you increment the high side, pop as you increment the low side. Then you can represent the algorithm as a fold again. — Daniel Wagner, May 17 '20 at 04:23
Sorry, I haven't had time to look at this answer yet. I'll do it in the weekend, most likely. Sorry for this. — Enlico, May 20 '20 at 21:07

Performance of Longest Substring Without Repeating Characters in Haskell

1 Answers1