1

Here are some input sequences. Within each one, the values are in ascending (or otherwise grouped) order.

(def input-vals [[[1 :a] [1 :b] [2 :c] [3 :d] [3 :e]]
                 [[1 :f] [2 :g] [2 :h] [2 :i] [3 :j] [3 :k]]
                 [[1 :l] [3 :m]]])

I can partition them each by value.

=> (map (partial partition-by first) input-vals)
   ((([1 :a] [1 :b]) ([2 :c]) ([3 :d] [3 :e])) (([1 :f]) ([2 :g] [2 :h] [2 :i]) ([3 :j] [3 :k])) (([1 :l]) ([3 :m])))

But that gets me 3 sequences of partitions. I want one single sequence of partitioned groups.

What I want to do is return a single lazy sequence of (potentially) lazy sequences that are the respective partitions joined. e.g. I want to produce this:

((([1 :a] [1 :b] [1 :f] [1 :l]) ([2 :c] [2 :g] [2 :h] [2 :i]) ([3 :d] [3 :e] [3 :j] [3 :k] [3 :m])))

Note that not all values appear in all sequences (there is no 2 in the third vector).

This is of course a simplification of my problem. The real data is a set of lazy streams coming from very large files, so nothing can be realised. But I think the solution for the above question is the solution for my problem.

Feel free to edit the title, I wasn't quite sure how to express it.

Joe
  • Are you aware of how much you changed your question? :p – Chiron Jan 21 '14 at 16:47
  • I changed the content but not the application of functions I'm looking for. – Joe Jan 21 '14 at 16:48
  • Thanks for your patience guys. I tried to make the question as simple as possible by using simple values. (Also made a typo in the repl that confused things). What I'm trying to achieve hasn't changed, but chiron's answer used identity, which meant that I had to demonstrate that the value of the projections used for the partition (in this case `first`) had common values but that the values themselves (`[1 :a]`) are mutually unique. – Joe Jan 21 '14 at 16:57
  • If you look at the first version the semantics are the same, it's just the example data is less clear: http://stackoverflow.com/revisions/e223dcba-1fa5-4098-be7f-f0e2b4e3f6a0/view-source – Joe Jan 21 '14 at 17:02
  • The structure of desired output is different - first version is flat in order, current appears to be partitioned. No matter, if you had the first, you could easily produce the second. I think what's been lost now is the emphasis that you are looking for a _lazy_ sequence. It seems like you want to _lazily_ merge the inputs together in sorted order and then possibly partition. Is that right? – A. Webb Jan 21 '14 at 17:28
  • I'm still suffering from paren-blindness. Yes, you're correct, laziness is important. If it's possible to lazily merge in order based on a projection and then partition, that would be another way to solve it. I was thinking of lazily partitioning each sequence then lazily merging, but the outcome is the same. – Joe Jan 21 '14 at 17:31

5 Answers

2

Try this horror:

(defn partition-many-by [f comp-f s]
  (let [;; Drop exhausted inputs, then order the rest by the key of their head.
        sorted-s   (sort-by (comp f first) comp-f (filter seq s))
        first-list (first sorted-s)]
    ;; Stop once every input is exhausted (also safe for nil or false keys).
    (when first-list
      (let [match-val (f (first first-list))
            matches?  (fn [x] (= match-val (f x)))]
        (cons
          ;; Every leading element, across all inputs, that shares the key.
          (mapcat #(take-while matches? %) sorted-s)
          ;; Recurse lazily on whatever is left of each input.
          (lazy-seq
            (partition-many-by f comp-f
                               (map #(drop-while matches? %) sorted-s))))))))

It could possibly be improved to remove the double value check (take-while and drop-while).

Example usage:

(partition-many-by identity < [[1 1 1 1 2 2 3 3 3 3] [1 1 2 2 2 2 3] [3]])

=> ((1 1 1 1 1 1) (2 2 2 2 2 2) (3 3 3 3 3 3))
  • Thank you Karl Jonathan Ward very much indeed. – Joe Jan 21 '14 at 18:00
  • This does not _quite_ work, e.g. `(partition-many-by identity [[0 2 4 6 8 10] [0 3 6 9 12] [0 5 10 15]]) ;=> ((0 0 0) (2) (4) (6) (8) (10) (3) (6) (9) (12) (5) (10) (15))`, where instead it seems the two 6's and two 10's ought to be together. – A. Webb Jan 21 '14 at 18:18
  • Right. But I guess it depends on the exact need - is grouping more important, or segregation? I like your idea of a lazy-merge-by below, but that does require the partition elements to be orderable as well as segregatable. e.g. what about the case: – Karl Jonathan Ward Jan 21 '14 at 18:50
  • Right. Due to the lack of ordering. I like your idea of creating a lazy-merge-by and ordering before partition, though I still like the idea of a 'partition-by'-a-like for multiple streams that does not require the streams to be orderable. I guess a question for the poster is, do you want partitioning, or grouping? Unfortunately I pressed enter by mistake and stackoverflow won't allow me to edit after 5 minutes. So please ignore above comment. – Karl Jonathan Ward Jan 21 '14 at 18:57
  • Yeah, the sample input/desired output is weak in this respect. What was your other case truncated from the first comment? – A. Webb Jan 21 '14 at 22:20
  • Can't remember. It was too late in the evening for me to remember anything :-). I have modified my code above so that there is a comparison of the first chunk of each sequence, which is then used to order the sequences for chunk selection. It does introduce the need for a comparison operator but after consideration I don't see this being useful without some ordering. – Karl Jonathan Ward Jan 22 '14 at 14:32
2

Let's make this interesting and use sequences of infinite length for our input

(def twos (iterate #(+ 2 %) 0))
(def threes (iterate #(+ 3 %) 0))
(def fives (iterate #(+ 5 %) 0))

We'll need to lazily merge them. Let's ask for a comparator so we can apply to other data types as well.

(defn lazy-merge-by
  ([compfn xs ys]
   (lazy-seq
     (cond
       (empty? xs) ys
       (empty? ys) xs
       :else (if (compfn (first xs) (first ys))
               (cons (first xs) (lazy-merge-by compfn (rest xs) ys))
               (cons (first ys) (lazy-merge-by compfn xs (rest ys)))))))
  ([compfn xs ys & more]
   (apply lazy-merge-by compfn (lazy-merge-by compfn xs ys) more)))

Test

(take 15 (lazy-merge-by < twos threes fives))
;=> (0 0 0 2 3 4 5 6 6 8 9 10 10 12 12)

We can (lazily) partition by value if desired

(take 10 (partition-by identity (lazy-merge-by < twos threes fives)))
;=> ((0 0 0) (2) (3) (4) (5) (6 6) (8) (9) (10 10) (12 12))

Now, back to the sample input

(partition-by first (apply lazy-merge-by #(<= (first %) (first %2)) input-vals))
;=> (([1 :a] [1 :b] [1 :f] [1 :l]) ([2 :c] [2 :g] [2 :h] [2 :i]) ([3 :d] [3 :e] [3 :j] [3 :k] [3 :m]))

This is the desired result, less one extraneous set of outer parentheses.
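
For comparison, the same merge-then-partition idea can be sketched with a key function in place of a comparator, taking all the inputs as one collection. This is only a sketch under stated assumptions: the names `lazy-merge-by-key` and `merge-partition-by` are made up for illustration, every input is assumed to be already ordered by its key, and keys are assumed to be orderable with `compare`.

```clojure
;; Sample data from the question.
(def input-vals [[[1 :a] [1 :b] [2 :c] [3 :d] [3 :e]]
                 [[1 :f] [2 :g] [2 :h] [2 :i] [3 :j] [3 :k]]
                 [[1 :l] [3 :m]]])

;; Hypothetical helper: lazily merge any number of key-ordered seqs.
;; Only the head of each input is inspected per step.
(defn lazy-merge-by-key [keyfn seqs]
  (lazy-seq
    (let [live (vec (filter seq seqs))]          ; drop exhausted inputs
      (when (seq live)
        (let [;; Index of the input whose head has the smallest key.
              ;; On a tie the earlier input wins.
              i (reduce (fn [best j]
                          (if (neg? (compare (keyfn (first (live j)))
                                             (keyfn (first (live best)))))
                            j
                            best))
                        0
                        (range 1 (count live)))]
          (cons (first (live i))
                (lazy-merge-by-key keyfn (assoc live i (rest (live i))))))))))

;; Hypothetical helper: merge first, then partition the merged stream.
(defn merge-partition-by [keyfn seqs]
  (partition-by keyfn (lazy-merge-by-key keyfn seqs)))

(merge-partition-by first input-vals)
;=> (([1 :a] [1 :b] [1 :f] [1 :l])
;    ([2 :c] [2 :g] [2 :h] [2 :i])
;    ([3 :d] [3 :e] [3 :j] [3 :k] [3 :m]))
```

Because ties go to the earlier input, elements that share a key come out in input order. The merge looks only at the head of each input, so it should also cope with infinite inputs, for example `(take 6 (lazy-merge-by-key identity [(iterate #(+ 2 %) 0) (iterate #(+ 3 %) 0)]))`.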

A. Webb
1

I'm not sure whether I'm following, but you can flatten the result sequence, something like:

(flatten (partition-by identity (first input-vals)))

clojure.core/flatten
([x])
Takes any nested combination of sequential things (lists, vectors,
etc.) and returns their contents as a single, flat sequence.
(flatten nil) returns an empty sequence.
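
As a small illustration of that docstring, here is what `flatten` does to pair-shaped data like the question's. The `sample` name is only for this example. Since the pairs are themselves vectors, they get flattened along with the partitions, whereas `apply concat` removes just one level of nesting.

```clojure
;; Illustrative data in the same shape as one of the question's inputs.
(def sample [[1 :a] [1 :b] [2 :c]])

;; flatten descends into every sequential thing, including the pairs.
(def flattened (flatten (partition-by first sample)))
;; flattened => (1 :a 1 :b 2 :c)

;; apply concat removes only the partition level and keeps the pairs intact.
(def one-level (apply concat (partition-by first sample)))
;; one-level => ([1 :a] [1 :b] [2 :c])
```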

You can use the realized? function to test whether a lazy sequence has been realized yet; it does not tell you whether an arbitrary collection is lazy.
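
A minimal sketch of how `realized?` behaves. The var names are only for illustration. It reports whether a pending value such as a lazy seq has been computed yet, and it throws for ordinary collections like vectors.

```clojure
;; map returns a lazy seq, so nothing has been computed yet.
(def xs (map inc [1 2 3]))

(def before (realized? xs))  ; false: the head has not been computed
(def head   (first xs))      ; forces the head of the lazy seq
(def after  (realized? xs))  ; true: the head is now cached

;; A vector is not a pending value, so realized? throws on it.
(def vector-check
  (try
    (realized? [1 2 3])
    (catch ClassCastException _ :not-pending)))
```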

Chiron
  • This almost does what I want. I'll clarify my question (I think your answer will still apply). But is it lazy? – Joe Jan 21 '14 at 16:32
  • Flatten would eliminate all the internal structure of the input, leaving just a flat sequence of numbers and keywords. – noisesmith Jan 21 '14 at 17:19
  • @noisesmith The original question has been edited a couple of times. Pretty sure that my answer is out of date now – Chiron Jan 21 '14 at 17:20
  • Fair point, but it should be noted that this is a notorious problem with flatten, and it is the reason that flatten is rarely the right function to use. – noisesmith Jan 21 '14 at 17:34
1
user> (def desired-result '((([1 :a] [1 :b] [1 :f] [1 :l])
                             ([2 :c] [2 :g] [2 :h] [2 :i])
                             ([3 :d] [3 :e] [3 :j] [3 :k] [3 :m]))))
#'user/desired-result

user> (def input-vals [[[1 :a] [1 :b] [2 :c] [3 :d] [3 :e]]
                       [[1 :f] [2 :g] [2 :h] [2 :i] [3 :j] [3 :k]]
                       [[1 :l] [3 :m]]])
#'user/input-vals

user> (= desired-result (vector (vals (group-by first (apply concat input-vals)))))
true

I changed the input-vals slightly to correct for what I assume was a typographical error. If it was not an error, I can update my code to accommodate the less regular structure.

Using the ->> (thread last) macro, we can have the equivalent code in a more readable form:

user> (= desired-result
         (->> input-vals
           (apply concat)
           (group-by first)
           vals
           vector))
true
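
One caveat with the `group-by` approach is that it returns an ordinary map, so the order of `vals` is not something to rely on as the number of distinct keys grows. A hedged sketch that makes the order explicit by pouring the groups into a `sorted-map` follows. The name `group-sorted-by` is made up for illustration, and like `group-by` itself this is eager, so it does not address the laziness requirement.

```clojure
;; Hypothetical helper: group eagerly, then order the groups by key.
(defn group-sorted-by [keyfn seqs]
  (->> (apply concat seqs)     ; one flat seq of all elements
       (group-by keyfn)        ; eager: consumes every input completely
       (into (sorted-map))     ; order the groups by key
       vals))

(group-sorted-by first [[[3 :x] [3 :y]] [[1 :z]]])
;=> ([[1 :z]] [[3 :x] [3 :y]])
```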
noisesmith
  • A note: you cannot expect that all items will be properly grouped and to also achieve laziness. If you think it through, the two desires are contradictory (unless you have some a-priori knowledge of the structure of future inputs, and build your code around that structure). – noisesmith Jan 21 '14 at 17:37
  • Unless I've misunderstood you, `partition-by` does this doesn't it? As I said at the top, the inputs are ordered within their streams. – Joe Jan 21 '14 at 17:38
  • partition-by will only work if the total input is ordered, ordered subsequences will end up with redundant unmerged groups – noisesmith Jan 21 '14 at 19:38
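
To make the last point concrete, here is a small check using the question's sample data. It concatenates the inputs without merging them and then partitions. The var names are only for illustration.

```clojure
(def input-vals [[[1 :a] [1 :b] [2 :c] [3 :d] [3 :e]]
                 [[1 :f] [2 :g] [2 :h] [2 :i] [3 :j] [3 :k]]
                 [[1 :l] [3 :m]]])

;; Each input is ordered, but the concatenation is not, so every key
;; shows up once per input that contains it.
(def unmerged (partition-by first (apply concat input-vals)))
;; unmerged =>
;; (([1 :a] [1 :b]) ([2 :c]) ([3 :d] [3 :e])
;;  ([1 :f]) ([2 :g] [2 :h] [2 :i]) ([3 :j] [3 :k])
;;  ([1 :l]) ([3 :m]))
```

That gives eight groups rather than the three wanted, which is why the inputs need to be merged in key order before partitioning.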
0
(partition-by first (sort-by first (mapcat identity input-vals)))
Hendekagon
  • Thanks, but I don't think `sort-by` is lazy. – Joe Jan 22 '14 at 10:07
  • yeh I guess I don't understand the question enough - if you want to group lazily, how can you be sure you've seen all items of a given group if some of them may be in the rest of the sequence, which you've yet to realize? – Hendekagon Jan 23 '14 at 01:09
  • Because I know that the items in the input file are grouped in sequence. In this case, they are in ascending date sequence and I am partitioning by date. – Joe Jan 23 '14 at 10:59
  • oh in that case wouldn't just (partition-by first (mapcat identity input-vals)) do it ? – Hendekagon Jan 24 '14 at 00:44