
I've been dabbling in Haskell - so still very much a beginner.

I've been thinking about counting the frequency of items in a list. In languages with mutable data structures, this is typically solved using a hash table - a dict in Python or a HashMap in Java, for example. The complexity of such a solution is O(n) - assuming the hash table fits entirely in memory.
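
A minimal sketch of that hash-table approach written in Haskell, using a mutable table inside `ST` (this assumes the `hashtables` and `hashable` packages; `frequencies` is just a made-up name):

```haskell
import           Control.Monad.ST        (runST)
import           Data.Hashable           (Hashable)
import qualified Data.HashTable.ST.Basic as H

-- Frequency counting with a mutable hash table: one expected-O(1)
-- lookup/insert per element, so the whole pass is O(n).
-- Assumes the 'hashtables' and 'hashable' packages.
frequencies :: (Eq a, Hashable a) => [a] -> [(a, Int)]
frequencies xs = runST $ do
  table <- H.new
  mapM_ (\x -> do
            old <- H.lookup table x
            H.insert table x (maybe 1 (+ 1) old))
        xs
  H.toList table
```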

In Haskell, there seem to be two (mainstream) choices - sort the data, then group and count it, or use a Data.Map. If a sort is used, it dominates the run-time of the solution, so the complexity is O(n log n). Likewise, Data.Map uses a balanced tree, so inserting n elements into it will also have complexity O(n log n).
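
Roughly, the two pure approaches look like this (just a sketch; `countBySort` and `countByMap` are made-up names):

```haskell
import           Data.List       (group, sort)
import qualified Data.Map.Strict as M

-- Approach 1: sort, group equal elements, count each group. O(n log n).
countBySort :: Ord a => [a] -> [(a, Int)]
countBySort = map (\g -> (head g, length g)) . group . sort

-- Approach 2: n inserts into a balanced tree (Data.Map). Also O(n log n).
countByMap :: Ord a => [a] -> M.Map a Int
countByMap = M.fromListWith (+) . map (\x -> (x, 1))
```

For example, `countBySort "mississippi"` gives `[('i',4),('m',1),('p',2),('s',4)]`.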

If my analysis is correct, then I assume that this particular problem is most efficiently solved by resorting to a mutable data structure. Are there other types of problems where this is also true? How in general do people using Haskell approach something like this?

  • 2
    Your benchmark for efficiency is _Python_? – AndrewC Feb 27 '14 at 17:26
  • 8
    For all n, log n < 30. – Daniel Wagner Feb 27 '14 at 17:32
  • 1
    Short answer: there are a _few_, but fewer than you're clearly worried about, and you can do mutable arrays in Haskell if you need to. [Haskell's better at some things than other languages (e.g. lightweight threads make server programming fast; it's easier, but still not simple, to parallelise stuff because of immutability).] – AndrewC Feb 27 '14 at 17:35
  • 6
    See [Can I always convert mutable-only algorithms to single-assignment and still be efficient?](http://stackoverflow.com/questions/6883005/can-i-always-convert-mutable-only-algorithms-to-single-assignment-and-still-be-e/21963347#21963347) for a guy who was convinced immutability would be a problem, felt he'd found an example that would necessitate exponential time, but I wrote a linear Haskell solution. – AndrewC Feb 27 '14 at 17:37
  • I want to evaluate algorithms rather than particular languages AndrewC. The hash table approach could be implemented in C or any language with mutable data structures. – Eric Fredine Feb 27 '14 at 17:38
  • I'm not especially worried about it - just thought it was an interesting case. One thing I really enjoy is that the functions in the standard libraries are often very short and readable. – Eric Fredine Feb 27 '14 at 17:41
  • 1
    @EricFredine But some algorithms make more sense with mutability and others don't. See [Shortening Knuth's algorithm M (mixed-radix numbers) in Haskell](http://stackoverflow.com/questions/21967212/shortening-knuths-algorithm-m-mixed-radix-numbers-in-haskell) for someone who was insisting he wanted imperative when the functional was much easier, simpler and _at least_ no slower. – AndrewC Feb 27 '14 at 17:43
  • 1
    People think mutability is a big deal because they're used to having it around and panic when it's gone. We don't get many non-beginner questions like this. – AndrewC Feb 27 '14 at 17:44
  • Well, good I got it out of my system then @AndrewC - thanks. – Eric Fredine Feb 27 '14 at 17:52
  • @DanielWagner - not sure I understand your comment. – Eric Fredine Feb 27 '14 at 17:54
  • 6
    He means that unless you have more than 2^30 pieces of data (a billion) a factor of log n is essentially a constant. – AndrewC Feb 27 '14 at 17:59
  • 7
    (It's a humourous way of pointing out that constant factors can outweigh logarithmic factors for real problems.) – AndrewC Feb 27 '14 at 18:05
  • 2
    @EricFredine The point I'm trying to make is that log n grows really, really slowly. It's very hard to get a visceral understanding of this, but there's basically three approaches that have helped me with that in the past. The first is the one I said. The second is to notice that you can pick any exponent you like -- say, e=0.00000001 -- and log n is still O(n^e) (hence n*log n is still O(n^1.00000001)). And the third is to just [stare at this graph a little bit](https://www.wolframalpha.com/input/?i=%28log%5Bn%5D%2C+n%29+for+n+%3D+1+to+100). – Daniel Wagner Feb 27 '14 at 19:01
  • 1
    possible duplicate of [Efficiency of purely functional programming](http://stackoverflow.com/questions/1990464/efficiency-of-purely-functional-programming) – Daniel Wagner Feb 27 '14 at 19:09
  • Thanks @DanielWagner - that link on purely functional programming probably gets at the heart of what I was wondering about. But I take your point about the practical considerations. Once n gets really big (i.e. multiple billions that can't fit into the memory of one machine) you have to adopt a different approach anyway (like a parallel map/reduce framework). – Eric Fredine Feb 27 '14 at 19:26
  • 2
    @EricFredine I think what you said indicates that you didn't understand my point after all! The point is that the larger n gets, the *less* of a contribution log n gives to a term like n * log n. Moreover, its contribution even for small n is very, very small -- so worrying about it is a mistake for both small *and* large n. – Daniel Wagner Feb 27 '14 at 19:29
  • shakes head... mutters... goes off to look at graph... – Eric Fredine Feb 27 '14 at 19:36

1 Answer


Whether we can implement every algorithm with optimal complexity in a pure language is currently an open question. Nicholas Pippenger has proven that there is a problem that must necessarily incur a log(n) penalty in a pure strict language compared to the optimal algorithm. However, there is a follow-up paper which shows that this problem has an optimal solution in a lazy language. So at the end of the day we really don't know. Though it seems that most people think that there is an inherent log(n) penalty for some problems, even for lazy languages.

svenningsson
  • I seem to recall that someone proved it's possible to implement any mutable data structure using laziness and immutability with at most a constant factor penalty. You wouldn't happen to be familiar with this result? (hope I'm not making it up) – John L Feb 27 '14 at 22:01
  • I've also heard a similar argument but I don't have any reference and I don't think I've ever seen a paper claiming such a result. But intuitively it seems that any imperative algorithm mutating random-access memory could be simulated with a state monad using a binary tree to model the memory (a rough sketch of this is after these comments). That ought to give the log(n) penalty. – svenningsson Feb 27 '14 at 22:05
  • Hmm. I suppose that's close enough for many purposes, thanks. – John L Feb 27 '14 at 22:10
  • 2
    What I took from the comments on my post is that even if there is a log n penalty to be paid sometimes, that I probably shouldn't spend too much time worrying about it. – Eric Fredine Feb 27 '14 at 22:41
  • Indeed. For most practical applications Haskell should be fast enough for you. – svenningsson Feb 28 '14 at 14:08
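
To make the memory-simulation argument from the comments above concrete, here is a rough sketch (made-up names; assumes mtl's `Control.Monad.State`) of mutable memory modelled as a persistent `Data.Map`, so every simulated read or write costs O(log n) rather than O(1):

```haskell
import           Control.Monad.State (State, gets, modify)
import qualified Data.Map.Strict     as M

-- "Mutable memory" simulated as a persistent map from addresses to values
-- (Data.Map is a balanced binary tree).
type Memory v = M.Map Int v

-- Each simulated read or write is a Data.Map operation: O(log n),
-- versus O(1) for genuinely mutable memory, hence the log(n) penalty.
readMem :: Int -> State (Memory v) (Maybe v)
readMem addr = gets (M.lookup addr)

writeMem :: Int -> v -> State (Memory v) ()
writeMem addr val = modify (M.insert addr val)
```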