Today I tried to implement an R-like melt function. I use it on big data coming from BigQuery. I have no tight constraints on compute time, and this function runs in under 5-10 seconds on millions of rows.

I start with this kind of data:

(def sample 
  '({:list "123,250" :group "a"} {:list "234,260" :group "b"}))

Then I defined a function that splits the list string and maps each item to its position in the list:

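(require '[clojure.string :as str])

(defn split-data-rank [datatab value]
  ;; First pass: turn the comma-separated string into a vector of items.
  (let [splitted (map (fn [x] (assoc x value (str/split (x value) #","))) datatab)]
    ;; Second pass: replace the vector with a map of 1-based position -> item.
    (map (fn [y] (let [index (map inc (range (count (y value))))]
                   (assoc y value (zipmap index (y value)))))
         splitted)))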

Usage:

(split-data-rank sample :list)

This returns the same sequence, but with the :list string replaced by a map from each item's position in the list (starting at 1) to the item itself.

Then I want to melt the "dataframe": each item in a group gets its own row, together with its rank in the group.

So I wrote this function:
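To make that intermediate shape concrete, here is a small sketch of the indexing step applied to a single list string. The names `items` and `ranked` are only for illustration and do not appear in the functions above.

```clojure
(require '[clojure.string :as str])

;; Split one comma-separated list into a vector of strings.
(def items (str/split "123,250" #","))

;; Key each item by its 1-based position in the list.
(def ranked (zipmap (map inc (range (count items))) items))
;; ranked is {1 "123", 2 "250"}
```

Note that the result is a map, so for long lists the iteration order is not guaranteed to follow the rank, although each rank value itself stays correct.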

(defn split-melt [datatab value]
  (let [splitted (split-data-rank datatab value)]
    (map (fn [y] (dissoc y value))
      (apply concat
        (map
          (fn [x]
            (map
              (fn [[k v]]
                (assoc x :item v :Rank k))
              (x value)))
          splitted)))))

Usage:

(split-melt sample :list)

The problem is that it is heavily indented and uses a lot of map calls. I apply dissoc to drop :list (which is useless by then), and I also have to use concat because otherwise I get a sequence of sequences.

Do you think there is a more efficient or shorter way to write this function? I am quite confused by reduce and do not know whether it can be applied here, since in a sense there are two arguments.

Thanks a lot!

Joseph Yourine
  • Nested `map` functions and lots of anonymous functions are a tip-off that this isn't ideal Clojure. I'm not sure where you're going with this, but I suggest checking out [`map-indexed`](https://clojuredocs.org/clojure.core/map-indexed) as a starting point. Are you bound to that sample data structure? Why is it quoted? What do you want to return from it? – jmargolisvt Jan 06 '16 at 00:48
  • [Reduce](https://clojuredocs.org/clojure.core/reduce) takes a function of two arguments, traditionally accumulator and value. There's also a [mapcat](https://clojuredocs.org/clojure.core/mapcat) function, so (apply concat (map... could be elided. Since you're working with maps, [reduce-kv](https://clojuredocs.org/clojure.core/reduce-kv) can help you with your (map (fn [k v]... There are a lot of options for improvement. @dAni has a nice implementation; if you find a better way, please do send a pull request to [Incanter](https://github.com/incanter/incanter). – Ricardo Acuna Jan 06 '16 at 04:15
  • Thanks for the input. Reduce-kv seems really powerful, I will think about it. Nested maps like that are indeed not very nice to write or read. I intend to calculate mean ranks for each distinct item, but I know how to do that with group-by. I'm bound to this kind of quoted data because I take it from Google BigQuery. – Joseph Yourine Jan 06 '16 at 13:34
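Since the question asks whether reduce applies here, the following is a minimal sketch of the same melt written with reduce and reduce-kv, as the comment above suggests. The name `melt-with-reduce` is hypothetical. The outer reduce carries the accumulated rows; the inner reduce-kv walks the split vector, where the key is the item's index.

```clojure
(require '[clojure.string :as str])

;; Hypothetical name, for illustration only.
(defn melt-with-reduce [datatab value]
  (reduce
    (fn [acc row]
      (let [items (str/split (get row value) #",")  ; vector of item strings
            base  (dissoc row value)]               ; row without the list column
        ;; On a vector, reduce-kv passes the index as the key.
        (reduce-kv (fn [rows idx item]
                     (conj rows (assoc base :item item :Rank (inc idx))))
                   acc
                   items)))
    []
    datatab))

(melt-with-reduce '({:list "123,250" :group "a"} {:list "234,260" :group "b"}) :list)
;; => [{:group "a", :item "123", :Rank 1} {:group "a", :item "250", :Rank 2}
;;     {:group "b", :item "234", :Rank 1} {:group "b", :item "260", :Rank 2}]
```

Unlike the map-based versions, this one is eager and builds the whole result vector in memory, which may or may not be what you want on millions of rows.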

1 Answer

If you don't need the split-data-rank function, I would go for:

(defn melt [datatab value]
  (mapcat (fn [x]
            (let [items (str/split (get x value) #",")]
              (map-indexed (fn [idx item]
                             (-> x
                                 (assoc :Rank (inc idx) :item item)
                                 (dissoc value)))
                           items)))
          datatab))
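
For comparison, and not as part of the answer above, the same reshaping could also be sketched with a `for` comprehension, which flattens the nested iteration without an explicit mapcat. The name `melt-for` is hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hypothetical name, for illustration only.
(defn melt-for [datatab value]
  (for [row datatab
        ;; Pair each item with its 0-based index.
        [idx item] (map-indexed vector (str/split (get row value) #","))]
    (-> row
        (dissoc value)
        (assoc :item item :Rank (inc idx)))))

(melt-for '({:list "123,250" :group "a"} {:list "234,260" :group "b"}) :list)
;; => ({:group "a", :item "123", :Rank 1} {:group "a", :item "250", :Rank 2}
;;     {:group "b", :item "234", :Rank 1} {:group "b", :item "260", :Rank 2})
```

Like mapcat with map-indexed, this stays lazy, so rows are only produced as they are consumed.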
DanLebrero
  • Thanks a lot, this is much better. I can indeed drop the first function; I used that trick because I did not know how to access the index. I also learned two useful functions here! – Joseph Yourine Jan 06 '16 at 11:23
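
The follow-up step mentioned in the comments, computing the mean rank of each distinct item with group-by, might look roughly like the sketch below. The `melted` rows are made-up illustration data in the shape that melt returns, and `mean-ranks` is a hypothetical name.

```clojure
;; Made-up rows in the shape produced by melt.
(def melted
  [{:group "a" :item "123" :Rank 1}
   {:group "a" :item "250" :Rank 2}
   {:group "b" :item "250" :Rank 1}
   {:group "b" :item "123" :Rank 2}
   {:group "c" :item "123" :Rank 1}])

;; Hypothetical helper: map each distinct item to the mean of its ranks.
(defn mean-ranks [rows]
  (into {}
        (map (fn [[item item-rows]]
               (let [ranks (map :Rank item-rows)]
                 [item (/ (reduce + ranks) (count ranks))])))
        (group-by :item rows)))

(mean-ranks melted)
;; "123" has ranks 1, 2, 1 and "250" has ranks 2, 1,
;; so the result is {"123" 4/3, "250" 3/2}
```

Integer division in Clojure yields exact ratios; wrap the division in `double` if decimal values are preferred.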