How best to select record with the closest matching value in BigQuery?

Question

I have table_a and table_b. In table_a is a list of new SKUs with shipping weight and dimensions (separate fields for length, width, height; see table_a below). In table_b is a list of existing SKUs with the same columns as table_a, plus estimated shipping cost (see table_b below).

For each SKU in table_a, I need to find the SKU in table_b whose numeric value is closest to the same column's value in table_a.

For example, if in table_a I have the following records

item_id	item_wght	item_length	item_height	item_width
2587329866	9	15	14	10
2587829450	7	12	17	11

and in table_b I have the following records

item_id	item_wght	item_length	item_height	item_width	item_ship_est
2587837947	11	16	16	13	7.36
2587854109	8	14	13	11	6.83

I want to select the record in table_b where a particular column's value is the closest numerically to that in table_a.

So, say I want to select the record from table_b where item_wght is the closest in numeric value to item_wght for each record in table_a. Given the data above, the query should return row 2 (item_id 2587854109) of table_b for both records in table_a, because that record's item_wght is closest to the item_wght of both records in table_a.

The end goal is to end up with all the records in table_a, with only the corresponding item_ship_est record in table_b, as follows:

item_id	item_wght	item_length	item_height	item_width	item_ship_est
2587329866	9	15	14	10	6.83
2587829450	7	12	17	11	6.83

I'm not sure how to start on this problem, so I don't really have any useful queries to show what I've already tried. I've searched for similar questions on here, some of which seemed promising, but I was unable to achieve my desired result from any of the existing questions I found.

I'd appreciate any help I can get.

Update: After Mikhail kindly submitted an answer to the above, it made me realize I forgot to mention that table_b contains millions of rows, and table_a contains from hundreds to tens of thousands of rows, depending on the day, so a cross join on the entire data set isn't really feasible here.

I don't know if there if a way to perhaps parse down the number of records from table_b being cross joined to each record of table_a, by segmenting the tables by total dim volume or something. It's a tough one.

score 1 · Answer 1 · answered Jul 18 '22 at 23:55

1

Consider below approach

select any_value(a).*, 
  array_agg(b.item_ship_est order by sqrt(
    pow(a.item_wght - b.item_wght, 2) + 
    pow(a.item_length - b.item_length, 2) + 
    pow(a.item_height - b.item_height, 2) + 
    pow(a.item_width - b.item_width, 2)
  ) limit 1)[offset(0)] as item_ship_est
from table_a a
cross join table_b b
group by format('%t', a)

if applied to sample data in your question - output is

answered Jul 18 '22 at 23:55

Mikhail Berlyant

165,386
8
154
230

Thank you, Mikhail. While this solution does execute and return correct results with a small enough data set, it made me realize I failed to mention that table_b has millions of records, and table_a could have from hundreds to tens of thousands of records, depending on the day. So, a cross join isn't really feasible. I'll update the question. – Phuel Jul 19 '22 at 16:28
to reduce effect of cross joining - you should introduce some partitioning, segmentation! i don't see other ways around – Mikhail Berlyant Jul 19 '22 at 16:36
1

Thanks, Mikhail. I actually have started on the approach of "bucketing" or partitioning the SKUs manually by calculating the "billable weight" -- which is a function of item_wght and the dimensions -- and taking the median of the item_ship_est by billable weight. This reduces the size of table_b to ~3500 rows, which makes it much more manageable for using something like your solution. I'm going to test this out again today. – Phuel Jul 21 '22 at 14:24

Jaytiger · Answer 2 · 2022-07-20T05:24:03.760

below approah may not be also scalable same as Mikhail's one in that it requires to sort your table by item_wght but you can consider this if you don't need an Euclidean distance for closeness.

WITH merged AS (
  -- later you can identify if each row comes from tableA or B by looking at item_ship_est column
  SELECT * FROM tableB
   UNION ALL
  SELECT *, null FROM tableA
),
sorted AS (
  SELECT *,
         LAST_VALUE(IF(item_ship_est IS NULL, NULL, item_wght) IGNORE NULLS) OVER w_before AS b_wght,
         FIRST_VALUE(IF(item_ship_est IS NULL, NULL, item_wght) IGNORE NULLS) OVER w_after AS a_wght,
         LAST_VALUE(item_ship_est IGNORE NULLS) OVER w_before AS b_est,
         FIRST_VALUE(item_ship_est IGNORE NULLS) OVER w_after AS a_est,
    FROM merged
  WINDOW w_before AS (ORDER BY item_wght ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
         w_after AS (ORDER BY item_wght ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
)
SELECT * EXCEPT (b_wght, a_wght, b_est, a_est, item_ship_est), 
       -- to find out the row current item_wght is more close to, pre or next ?
       IF(item_wght - b_wght < a_wght - item_wght, IFNULL(b_est, a_est), IFNULL(a_est, b_est)) AS item_ship_est
  FROM sorted
 WHERE item_ship_est IS NULL
;

How best to select record with the closest matching value in BigQuery?

2 Answers2