1

So I have two data.tables.

    size_categories = data.table(category = c("S", "M", "L"), size_min = c(0, 10, 25), 
                           size_max = c(10, 25, Inf), bin = c("blue", "red", "green"))

    products = data.table(object_id = 1:10, size = seq(1, 37, 4))

I want to merge the tables such that each row of the product table is assigned a bin and size category based on its size.

The ham-fisted way I know would be to assign assign a category to each row on products and then merging

products[size >= 0 & size < 10, category := "S"]
products[size >= 10 & size < 25, category := "M"]
products[size >= 25, category := "L"]
merge(products, size_categories)

Of course this is not flexible at all and I would have to rewrite it if size_categories changed.

I am open to using other packages, but would prefer a solution just using data.table.

Thanks!

mt1022
  • 16,834
  • 5
  • 48
  • 71
paxton
  • 71
  • 4

2 Answers2

2

I would do it with non-equi join:

products[size_categories, `:=`(category = i.category, bin = i.bin),
    on = .(size >= size_min, size < size_max)]
# > products
#     object_id size category   bin
#  1:         1    1        S  blue
#  2:         2    5        S  blue
#  3:         3    9        S  blue
#  4:         4   13        M   red
#  5:         5   17        M   red
#  6:         6   21        M   red
#  7:         7   25        L green
#  8:         8   29        L green
#  9:         9   33        L green
# 10:        10   37        L green
mt1022
  • 16,834
  • 5
  • 48
  • 71
  • 1
    There's also `foverlaps`, but I believe that would require copying a column and setting keys. +1 – A5C1D2H2I1M1N2O1R2T1 Dec 13 '20 at 05:06
  • Thanks so much! suppose that size_category had more columns and I wanted to keep most of them. Would I have to list them all out in a similar fashion or is there a faster way? – paxton Dec 13 '20 at 07:58
  • 1
    @paxton In that case, a join is more suitable than in-place modification. I hope the example in my previous answer to a similar problem would be helpful: https://stackoverflow.com/a/65004418/3926543 – mt1022 Dec 13 '20 at 08:23
0

For reference, here's an approach using foverlaps:

foverlaps(setkey(size_categories, size_min, size_max), 
          setkey(products[, size2 := size], size, size2))[, size2 := NULL][]
#     object_id size category size_min size_max   bin
#  1:         1    1        S        0       10  blue
#  2:         2    5        S        0       10  blue
#  3:         3    9        S        0       10  blue
#  4:         4   13        M       10       25   red
#  5:         5   17        M       10       25   red
#  6:         6   21        M       10       25   red
#  7:         7   25        M       10       25   red
#  8:         7   25        L       25      Inf green
#  9:         8   29        L       25      Inf green
# 10:         9   33        L       25      Inf green
# 11:        10   37        L       25      Inf green

It would probably be helpful in cases where your "size_categories" table has more columns that you want included in the final output.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485