Suppose I have a data frame like this:

    a  b   c
1.  2  2   3
2.  5  4   4
3.  1  7   4
4.  1  9   4
5.  2  14  0
6.  9  10  6

I would like to discretize the data in column b and replace each value with the mean of the range (bin) it falls into. The expected result could look like this:

    a  b   c
1.  2  3   3
2.  5  3   4
3.  1  8   4
4.  1  8   4
5.  2  12  0
6.  9  12  6

I came across functions such as discretize from the arules library:

library(arules)
res <- discretize(df$b, method = "frequency", breaks = 3)

which I suppose could solve the problem, but I could not work out how to write the bin means back into df.
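
For reference, a minimal sketch of one way the means could be written back: discretize returns a factor of interval labels, and base R's ave can average a vector within each level of a factor. The data frame is the toy example from above. The fallback branch (quantile breaks passed to cut) is an assumption added only so the snippet also runs where arules is not installed.

```r
# Toy data from the question
df <- data.frame(a = c(2, 5, 1, 1, 2, 9),
                 b = c(2, 4, 7, 9, 14, 10),
                 c = c(3, 4, 4, 4, 0, 6))

if (requireNamespace("arules", quietly = TRUE)) {
  # discretize() returns a factor with one level per interval
  bins <- arules::discretize(df$b, method = "frequency", breaks = 3)
} else {
  # Fallback (assumption): quantile-based breaks using base R only
  bins <- cut(df$b, quantile(df$b, seq(0, 1, length.out = 4)),
              include.lowest = TRUE)
}

# Replace every value of b with the mean of the bin it belongs to
df$b <- ave(df$b, bins, FUN = mean)
df
```

Because ave returns a vector of the same length and order as its input, the result can be assigned straight back to the column.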

Edit

Thanks to the solutions given in the comments, I was able to achieve a satisfactory distribution of the original data across the ranges. I also tested it with df$b <- iris$Petal.Length (@alistaire's solution):

ave(df$b, cut(df$b, quantile(df$b, seq(0, 1, length = 8)), 
          include.lowest = TRUE), FUN = mean)

With the following results:

hist(df$b)$counts
24 20  0  0 22  0 21 21 23  0 19

If someone knows another way of discretizing a column of a data frame, it would be appreciated (especially a discretization that divides the data into ranges with equal instance counts).
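
One possible way to get equal instance counts, sketched with base R only: bin on ranks instead of on values, so that bin sizes differ by at most one. The function names equal_count_bins and equal_count_means are hypothetical, made up for this illustration. A caveat of this approach is that tied values can end up in different bins and therefore receive different means.

```r
# Assign each value to one of k bins of (nearly) equal size, based on its rank.
# ties.method = "first" gives every value a unique rank, so ties are split
# by position instead of being forced into the same bin.
equal_count_bins <- function(x, k) {
  r <- rank(x, ties.method = "first")
  ceiling(r * k / length(x))
}

# Replace each value with the mean of its equal-count bin
equal_count_means <- function(x, k) {
  ave(x, equal_count_bins(x, k), FUN = mean)
}

# Toy column b from the question
b <- c(2, 4, 7, 9, 14, 10)
equal_count_means(b, 3)
# → 3 3 8 8 12 12

# Petal lengths from iris, split into 7 bins
petal_bins <- equal_count_bins(iris$Petal.Length, 7)
table(petal_bins)
```

With 150 values and 7 bins, each bin holds 21 or 22 instances, whereas quantile-based breaks keep tied values together and so can give less even counts.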

  • *"found it impossible to input means back to `df`"* ... what have you tried? (I'm not proficient at `arules`, so ... does `df$b <- res` not work?) – r2evans May 14 '18 at 23:23
  • You could do this with `df$b <- ave(df$b, cut(df$b, 3), FUN = mean)`, but `cut` calculates the breakpoints a little differently. – alistaire May 14 '18 at 23:29
  • Setting the breaks with `quantile` can get you closer: `ave(df$b, cut(df$b, quantile(df$b, seq(0, 1, length = 4)), include.lowest = TRUE), FUN = mean)` – alistaire May 14 '18 at 23:33
  • Thanks @alistaire your way seems to work (I'm checking it out on original data set and as you said cut slices instances differently) – SundayProgrammer May 14 '18 at 23:37
