In Python there is the `itertools.groupby` function.

Its type can be expressed in Haskell like this:

groupby :: (a -> b) -> [a] -> [(b, [a])]

Because it needs the data to be sorted, we can think of its running time as O(n*log(n)).
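A small sketch of that behaviour, using only the standard library (the sample data and the `parity` helper are made up for illustration): `itertools.groupby` merges only adjacent items with equal keys, so getting one group per key requires sorting by that key first.

```python
from itertools import groupby


def parity(n):
    # Hypothetical key function, used only for this illustration.
    return n % 2


data = [1, 2, 3, 4, 5]

# Without sorting, only adjacent items with equal keys end up in one group.
unsorted_groups = [(k, list(g)) for k, g in groupby(data, key=parity)]

# Sorting by the same key first (O(n log n)) yields exactly one group per key.
sorted_groups = [
    (k, list(g)) for k, g in groupby(sorted(data, key=parity), key=parity)
]

print(unsorted_groups)
print(sorted_groups)
```

The first result contains five single-element groups, the second contains one group for each parity.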

I was probably not the only one dissatisfied with this, so I found this library. This implementation of groupby needs two passes over the input sequence. So I think its running time is O(n), but as the docs say, it isn't really lazy: if you don't pass keys to it, it has to make a pass over the sequence to collect all unique keys from the items.

So I thought, quoting Raymond Hettinger:

There must be a better way!

So I wrote this

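from collections import defaultdict, deque


def groupby(sequence, key=lambda x: x):
    buffers = defaultdict(deque)  # key -> queue of items not yet consumed
    kvs = ((key(item), item) for item in sequence)
    seen_keys = set()  # keys whose group has already been yielded

    def subseq(k):
        # Lazily yield the items of group k, pulling from the source on demand.
        while True:
            buffered = buffers[k]
            if buffered:
                yield buffered.popleft()
            else:
                try:
                    next_key, value = next(kvs)
                except StopIteration:
                    return  # source exhausted; must not leak out (PEP 479)
                buffers[next_key].append(value)

    while True:
        try:
            k, value = next(kvs)
        except StopIteration:
            # Emit groups that were only discovered while draining other groups.
            for bk, group in buffers.items():
                if group and bk not in seen_keys:
                    yield (bk, group)
            return  # `raise StopIteration()` here is a RuntimeError since Python 3.7
        else:
            buffers[k].append(value)
        if k not in seen_keys:
            seen_keys.add(k)
            yield k, subseq(k)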

In case you aren't familiar with Python, the idea is very simple:

1. Create a mutable dictionary mapping each key to a queue of elements.
2. Try to take the next element of the sequence and compute its key.
3. If the sequence isn't exhausted, add the element to the queue for its key.
4. If we haven't seen this key before, yield a pair (key, iterable group); the group takes its elements either from the buffer or from the sequence.
5. If we have already seen this key, do nothing more and loop.

If the sequence has ended, all of its elements have already been put into the buffers (and possibly consumed). If any buffers are non-empty, we iterate over them and yield the remaining (key, iterable) pairs.

I've already unit tested it and it works. It's truly lazy (meaning it doesn't take any value from the sequence until the consumer asks for it), and its running time should be O(n).
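One way to check the laziness claim is to feed the function an infinite sequence and record what gets pulled from it. The sketch below repeats the function from above, with both generators ending via `return` (which Python 3.7+ requires); the `traced_naturals` helper, the `pulled` list, and the key `x % 3` are made up for this illustration.

```python
from collections import defaultdict, deque
from itertools import count, islice


def groupby(sequence, key=lambda x: x):
    buffers = defaultdict(deque)
    kvs = ((key(item), item) for item in sequence)
    seen_keys = set()

    def subseq(k):
        while True:
            if buffers[k]:
                yield buffers[k].popleft()
            else:
                try:
                    next_key, value = next(kvs)
                except StopIteration:
                    return
                buffers[next_key].append(value)

    while True:
        try:
            k, value = next(kvs)
        except StopIteration:
            for bk, group in buffers.items():
                if group and bk not in seen_keys:
                    yield bk, group
            return
        buffers[k].append(value)
        if k not in seen_keys:
            seen_keys.add(k)
            yield k, subseq(k)


pulled = []  # records every item actually taken from the source


def traced_naturals():
    # Hypothetical infinite source: 0, 1, 2, ... with tracing.
    for n in count():
        pulled.append(n)
        yield n


groups = groupby(traced_naturals(), key=lambda x: x % 3)
first_key, first_group = next(groups)
first_three = list(islice(first_group, 3))
print(first_key, first_three, pulled)
```

Asking for three items of the first group pulls only the items 0 through 6 from the source; the items belonging to the other two keys wait in their buffers until a consumer asks for those groups.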

I've tried to find a Haskell analog of this function and haven't found any.

Is it possible to write such a thing in Haskell? If so, please show the solution; if not, please explain why.

user1685095
  • http://hackage.haskell.org/package/discrimination-0.2.1/docs/Data-Discrimination.html#v:groupWith – leftaroundabout Nov 25 '16 at 19:04
  • @leftaroundabout Yeah, that's basically the same, but the type is `a->b->[[a]]`. How would I know which equivalence class is which? You see, I searched Hoogle for the type `a->b->[(b, [a])]` – user1685095 Nov 25 '16 at 19:29
  • @leftaroundabout On second thought, I could probably try to read the sources and figure out how to change it so that it returns the names of the equivalence classes. I've skimmed through the sources; judging by the imports, it uses mutable state, right? Do you think this is possible without mutable state? – user1685095 Nov 25 '16 at 19:41
  • Clearly the type `[(b, [a])]` is not the one you want - Haskell linked lists are not Python dictionaries! You simply will not get the performance you seek, as you've seen in the answer below. It does not matter that your Python function consumes and yields a list - it uses mutability internally, and so will your Haskell function have to - you can still produce a pure value at the end if you work entirely in `ST`. – user2407038 Nov 25 '16 at 20:25
  • @user1685095 If you want the type `[(b, [a])]`, you can just map over the result list to convert each `[a]` into `(b, [a])`, using `map (\l -> (key $ head l, l))` – Shersh Nov 25 '16 at 21:09
  • @Shersh Oh, that would work! – user1685095 Nov 25 '16 at 22:10
  • @user2407038 Clearly the type `[(b, [a])]` is the type I want, because I've said so. Of course a linked list is not a dictionary, Haskell or not. It would be more useful if you showed how the authors of the discrimination package have done it, at least conceptually. – user1685095 Nov 25 '16 at 22:13
  • You are comparing apples with oranges here. Your function implements something totally different than `itertools.groupby`. E.g. `[list(v) for k, v in itertools.groupby([1,2,1])]` gives `[[1],[2],[1]]` whereas your function gives `[[1,1],[2]]`. – Frerich Raabe Dec 04 '19 at 11:24
  • I think a simpler definition of your `groupby` function would be: `result = defaultdict(list); for k, v in ((key(v), v) for v in sequence): result[k].append(v); return result.items()`. – Frerich Raabe Dec 04 '19 at 11:30
  • @FrerichRaabe I know. What `itertools.groupby` does is sometimes called `groupUntilChanged`. If you sort the sequence before doing that, the result is the same as what a real `groupby` should produce (the equivalence classes of a set). – user1685095 Dec 04 '19 at 11:31

1 Answer

If I understand this correctly, the type you want is

(a -> k) -> [a] -> [(k, [a])]

That is, given a key function and a list of items, group the items by the key.

In Haskell there is a library function groupBy which does something similar. It groups adjacent items that satisfy an equality predicate into sublists, so to collect all items with the same key the list has to be sorted first. We can use it to do what you want:

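import Data.List (groupBy, sortBy)
import Data.Ord (comparing)

-- The Ord constraint is required because the keys are sorted.
groupByKey :: Ord k => (a -> k) -> [a] -> [(k, [a])]
groupByKey keyF xs = map getResult groups
   where
      -- Pair every element with its key.
      keyPairs = map (\v -> (keyF v, v)) xs
      -- Sort by key, then collect adjacent pairs that share a key.
      groups = groupBy (\v1 v2 -> fst v1 == fst v2)
                  $ sortBy (comparing fst) keyPairs
      -- head is safe here: groupBy never produces an empty sublist.
      getResult grp = (fst $ head grp, map snd grp)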

keyPairs holds a (key, value) pair for each element of the argument. groups first sorts these pairs into key order using sortBy and then groups the results into sublists that share the same key. getResult converts a sublist into a pair containing the key (taken from the head element) and a list of the original values. We are safe to use head because groupBy never gives an empty sublist.

Paul Johnson
  • Well, that's the obvious solution, but its running time is `O(n*log(n))`. Maybe that wasn't clear enough, but I want a solution that is lazy and has `O(n)` running time. – user1685095 Nov 25 '16 at 19:50
  • I don't see how you can get that given the need to sort the elements into key order. Maybe I have misunderstood what you want. I can see how using a table of keys would give you O(n log k). Is that it? – Paul Johnson Nov 25 '16 at 19:54
  • Well, you see how I've already done that in Python? My implementation doesn't specify the order in which keys are emitted, but it could be modified to output pairs in a certain order. The key is the buffering of elements. Also, there is a useful link from @leftaroundabout. The author of the discrimination package has basically already done that, so it's possible in Haskell too. – user1685095 Nov 25 '16 at 19:56
  • Actually the sort function bundled with GHC uses run identification, so I'm pretty certain my version is O(n log k) too. – Paul Johnson Nov 25 '16 at 19:58
  • And `k` is what exactly? And still `O(n*log k)` is not `O(n)` – user1685095 Nov 25 '16 at 19:59
  • It's the number of different key values. Inserting an element into a table of size n is O(log n). If I understand your Python correctly, it builds up a table of lists indexed by key. Given n items with k different keys, this is going to be O(n * log k). – Paul Johnson Nov 25 '16 at 20:03
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/129071/discussion-between-paul-johnson-and-user1685095). – Paul Johnson Nov 25 '16 at 20:17
  • *"I don't see how you can get that given the need to sort the elements into key order."* Actually, Python's `itertools.groupby` doesn't sort the elements at all. It only groups elements that were already adjacent. If two elements have the same key, but are separated by a third element which has a different key, then they won't be in the same group. If the user wants to sort by key before calling `itertools.groupby`, then the user should call `sort` explicitly before calling `groupby`. – Stef Sep 04 '21 at 13:26
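
For comparison with the "table of lists indexed by key" discussed in the comments above, here is a minimal eager sketch in Python. It assumes the keys are hashable, so that a `dict` can serve as the table; each insertion is then expected O(1), which makes the whole grouping expected O(n), at the cost of not being lazy. The name `group_by_key` and the sample words are made up for this illustration.

```python
def group_by_key(items, key):
    # Hypothetical helper: eager grouping with a hash table instead of a sort.
    groups = {}
    for item in items:
        # setdefault creates the list the first time a key is seen.
        groups.setdefault(key(item), []).append(item)
    # dicts keep insertion order (Python 3.7+), so keys appear in first-seen order.
    return list(groups.items())


words = ["apple", "bob", "avocado", "cherry", "banana"]
result = group_by_key(words, key=lambda w: w[0])
print(result)
```

Unlike the sort-based version, this keeps groups in the order their keys first appear rather than in key order.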