
I have a Frame as follows:

import datatable as dt

x = dt.Frame(k=[1, 1, 2],
             v=[{'a': 1, 'b': 2}, {'a': 3}, {'b': 4}])

which looks like this:

k       v
▪▪▪▪    ▪▪▪▪▪▪▪▪
1       {'a': 1, 'b': 2}
1       {'a': 3}
2       {'b': 4}

What I'm trying to do is 1) group by k, and 2) sum the dictionary values key by key within each group. The desired output:

k       v
▪▪▪▪    ▪▪▪▪▪▪▪▪
1       {'a': 4, 'b': 2}
2       {'b': 4}

Is it possible to achieve this with the latest pydatatable (v0.11)?

R. Zhu
  • It's better to modify the dictionaries rather than the dataframe – deadshot Sep 04 '20 at 19:42
  • @deadshot Would you elaborate on your point? The original data is stored as a `pandas.DataFrame` (the column types are exactly the same) and I can achieve my goal with `DataFrame.groupby`. However, I found it painfully slow due to the data size. That's why I took a look at `pydatatable`. – R. Zhu Sep 04 '20 at 19:55

1 Answer


If you have a large dataset then consider expanding all dictionaries into a frame:

>>> import datatable as dt
>>> DT = dt.cbind(dt.Frame(_key=[1, 1, 2]),
...               dt.Frame([{'a': 1, 'b': 2}, {'a': 3}, {'b': 4}]))
>>> DT
   | _key   a   b
-- + ----  --  --
 0 |    1   1   2
 1 |    1   3  NA
 2 |    2  NA   4

[3 rows x 3 columns]
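
To make the expansion step concrete, here is a rough plain-Python sketch of what turning a list of dictionaries into columns amounts to. It is not how datatable does it internally; the helper name `expand_dicts` is made up for illustration, and `None` stands in for NA.

```python
def expand_dicts(records):
    """Turn a list of dicts into a dict of equal-length columns."""
    # Collect column names in the order they are first seen.
    columns = []
    for rec in records:
        for name in rec:
            if name not in columns:
                columns.append(name)
    # A key missing from a record becomes None (the NA placeholder here).
    return {name: [rec.get(name) for rec in records] for name in columns}


wide = expand_dicts([{'a': 1, 'b': 2}, {'a': 3}, {'b': 4}])
```

Each dictionary key becomes its own column, which is what lets the group-by below work on ordinary numeric columns instead of Python objects.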

After this, grouping is easy:

>>> from datatable import sum, f, by
>>> DT[:, sum(f[:]), by(f._key)]
   | _key   a   b
-- + ----  --  --
 0 |    1   4   2
 1 |    2   0   4

[2 rows x 3 columns]
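
If you need the dictionary-shaped output from the question, the grouped result can be folded back into one dict per key. A minimal sketch, assuming the result has first been exported to a plain dict of column lists (for example with the frame's `to_dict()` method). The helper name `wide_to_dicts` and its `drop_zero` flag are made up for illustration. Note that `sum` reports 0 for a group that had only NA values, so dropping zeros is a heuristic that only works when a genuine total of 0 does not need to be kept.

```python
def wide_to_dicts(columns, key="_key", drop_zero=True):
    """Fold a dict of column lists back into {group key: {name: total}}."""
    value_names = [name for name in columns if name != key]
    result = {}
    for i, group in enumerate(columns[key]):
        row = {}
        for name in value_names:
            total = columns[name][i]
            # Skip missing totals, and zero totals when drop_zero is set.
            if total is None or (drop_zero and total == 0):
                continue
            row[name] = total
        result[group] = row
    return result


# Same values as the grouped frame shown above.
grouped = {"_key": [1, 2], "a": [4, 0], "b": [2, 4]}
per_key = wide_to_dicts(grouped)
```

With the values above, `per_key` maps 1 to `{'a': 4, 'b': 2}` and 2 to `{'b': 4}`, matching the desired output.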
Pasha