In Python there is the itertools.groupby function.
Its type can be expressed in Haskell like this: groupby :: (a -> b) -> [a] -> [(b, [a])]
Because it needs the data to be sorted by key in order to collect all items of a group together, we can think of its running time as O(n*log(n)).
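A small illustration of why the sort is needed: itertools.groupby only merges adjacent items with equal keys, so on unsorted input the same key shows up in several groups. The sample data and the parity key below are made up for the example.

```python
from itertools import groupby

data = [1, 2, 3, 4, 5, 6]          # sample input, chosen for illustration


def parity(x):
    # Hypothetical key function: 1 for odd numbers, 0 for even numbers.
    return x % 2


# Without sorting, every run of equal keys becomes its own group.
unsorted_groups = [(k, list(g)) for k, g in groupby(data, key=parity)]

# Sorting by the same key first (the O(n*log(n)) step) gives one group per key.
sorted_groups = [(k, list(g))
                 for k, g in groupby(sorted(data, key=parity), key=parity)]

print(unsorted_groups)  # six single-item groups, keys alternating 1, 0, 1, 0, ...
print(sorted_groups)    # [(0, [2, 4, 6]), (1, [1, 3, 5])]
```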
I was probably not the only one dissatisfied with this, so I found this library.
This implementation of groupby needs two passes over the input sequence, so I think its running time is O(n), but as it says in the docs it isn't really lazy: if you don't pass the keys to it, it needs to make a pass over the sequence to collect all unique keys from the items.
So I thought, citing Raymond Hettinger:
There must be a better way!
So I wrote this:
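from collections import defaultdict, deque


def groupby(sequence, key=lambda x: x):
    buffers = defaultdict(deque)  # key -> queue of items not yet consumed
    kvs = ((key(item), item) for item in sequence)
    seen_keys = set()  # keys whose group has already been yielded

    def subseq(k):
        # Lazily yield the items of group k, pulling from the shared source
        # whenever the buffer for k is empty.
        while True:
            buffered = buffers[k]
            if buffered:
                yield buffered.popleft()
            else:
                try:
                    next_key, value = next(kvs)
                except StopIteration:
                    # Source exhausted. Under PEP 479 (Python 3.7+) a
                    # StopIteration must not escape a generator, so return.
                    return
                buffers[next_key].append(value)

    while True:
        try:
            k, value = next(kvs)
        except StopIteration:
            # Source exhausted: emit groups that were only ever buffered
            # by some subseq and never yielded from this loop.
            for bk, group in buffers.items():
                if group and bk not in seen_keys:
                    yield bk, group
            return  # "raise StopIteration()" is a RuntimeError in Python 3.7+
        else:
            buffers[k].append(value)
            if k not in seen_keys:
                seen_keys.add(k)
                yield k, subseq(k)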
In case you aren't familiar with Python, the idea is very simple.
Create a mutable dictionary of key -> queue of elements.
Try to take the next element of the sequence and compute its key.
If the sequence isn't empty, add this value to the group queue for its key. If we haven't seen this key before, yield a pair (key, iterable group); the latter takes items either from the buffer or from the sequence. If we have already seen this key, do nothing more and loop.
If the sequence has ended, all of its elements have already been put into buffers (and probably consumed). In case the buffers aren't empty, we iterate over them and yield the remaining (key, iterable) pairs.
I've already unit tested it and it works. It's also truly lazy (meaning it doesn't take any value from the sequence until the consumer asks for it), and its running time should be O(n).
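One way to check the laziness claim is to feed the function a source that records every item pulled from it. The sketch below repeats the function in condensed form under the hypothetical name lazy_groupby so that it runs on its own (with generators ending via return, as Python 3.7+ requires); source, pulled and the sample data are likewise made up for the illustration.

```python
from collections import defaultdict, deque


def lazy_groupby(sequence, key=lambda x: x):
    # Condensed restatement of the groupby above (hypothetical name).
    buffers = defaultdict(deque)
    kvs = ((key(item), item) for item in sequence)
    seen_keys = set()

    def subseq(k):
        while True:
            if buffers[k]:
                yield buffers[k].popleft()
            else:
                try:
                    next_key, value = next(kvs)
                except StopIteration:
                    return
                buffers[next_key].append(value)

    for k, value in kvs:
        buffers[k].append(value)
        if k not in seen_keys:
            seen_keys.add(k)
            yield k, subseq(k)
    for bk, group in buffers.items():
        if group and bk not in seen_keys:
            yield bk, group


pulled = []  # every item actually taken from the source, in order


def source():
    for x in [1, 2, 3, 4, 5, 6]:
        pulled.append(x)
        yield x


groups = lazy_groupby(source(), key=lambda x: x % 2)
pulled_before = list(pulled)             # [] : nothing is read up front

k1, g1 = next(groups)                    # reads only the first item
pulled_after_first_group = list(pulled)  # [1]

first_two = [next(g1), next(g1)]         # [1, 3]; the 2 is buffered on the way
pulled_after_two_items = list(pulled)    # [1, 2, 3]

# A key seen only while draining another group is emitted at the end.
groups2 = lazy_groupby([1, 2, 1])
_, ones = next(groups2)
ones = list(ones)                                # [1, 1]
leftover = [(k, list(g)) for k, g in groups2]    # [(2, [2])]
```

Items are read from the source only when the consumer asks for a group or for an item of a group, and each item is appended to and popped from a deque at most once, which is consistent with the O(n) estimate.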
I've tried to find a Haskell analog of this function and haven't found any.
Is it possible to write such a thing in Haskell? If so, please show the solution; if not, then explain why.