1

I am using BigQuery, and trying to assign categorical values to each of my records, based on the number of 'splits' assigned to it.

The table has a cumulative count of records, grouped at the STR level - i.e., if there are 4 SKUs at 2 STR, the SKUs will be labeled 1,2,3,4. Each STR is assigned a SPLIT value, so if the STR has a SPLIT value of 2, I want it to split its SKUs into 2 categories. I want to create another column that would assign SKUs labeled 1-2 as '1', and SKUs labeled 3-4 as '2'. (The actual data is on a much larger scale, but thought this would be easier.)

+-----+------+---------------+--------+
| STR | SKU  | SKU_ROW_COUNT | SPLITS |
+-----+------+---------------+--------+
|   1 | 1230 |             1 |      3 |
|   1 | 1231 |             2 |      3 |
|   1 | 1232 |             3 |      3 |
|   1 | 1233 |             4 |      3 |
|   1 | 1234 |             5 |      3 |
|   1 | 1235 |             6 |      3 |
|   2 | 1310 |             1 |      2 |
|   2 | 1311 |             2 |      2 |
|   2 | 1312 |             3 |      2 |
|   2 | 1313 |             4 |      2 |
|   3 | 2345 |             1 |      1 |
|   3 | 2346 |             2 |      1 |
|   3 | 2347 |             3 |      1 |
+-----+------+---------------+--------+

The SPLITS column is dynamic, ranging from 1 to 3. The number of SKUs in each category should be relatively equal, but that's not a priority as much as just the number of groups that are created. Ideally, the final table with the new column (HOST_NUMBER) would look something like this:

+-----+------+---------------+--------+-------------+
| STR | SKU  | SKU_ROW_COUNT | SPLITS | HOST_NUMBER |
+-----+------+---------------+--------+-------------+
|   1 | 1230 |             1 |      3 |           1 |
|   1 | 1231 |             2 |      3 |           1 |
|   1 | 1232 |             3 |      3 |           2 |
|   1 | 1233 |             4 |      3 |           2 |
|   1 | 1234 |             5 |      3 |           3 |
|   1 | 1235 |             6 |      3 |           3 |
|   2 | 1310 |             1 |      2 |           1 |
|   2 | 1311 |             2 |      2 |           1 |
|   2 | 1312 |             3 |      2 |           2 |
|   2 | 1313 |             4 |      2 |           2 |
|   3 | 2345 |             1 |      1 |           1 |
|   3 | 2346 |             2 |      1 |           1 |
|   3 | 2347 |             3 |      1 |           1 |
+-----+------+---------------+--------+-------------+
GMB
  • 216,147
  • 25
  • 84
  • 135
hstein5
  • 11
  • 2

3 Answers3

0

You can use window functions and arithmetics:

select 
    t.*,
    1 + floor((sku_row_count - 1) * splits / count(*) over(partition by str)) host_number
from mytable t
order by sku

Actually, ntile() seems to do exactly what you want - and you don't even need the sku_row_count column (which basically mimics row_number() anyway):

select 
    t.*,
    ntile(splits) over(partition by str order by sku) host_number
from mytable t
order by sku
GMB
  • 216,147
  • 25
  • 84
  • 135
0

If the ordering of the values in the groups doesn't matter, just use modulo arithmetic:

select t.*, (SKU_ROW_COUNT % SPLITS) as split_group
from t
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

Below is for BigQuery Standard SQL

#standardSQL
SELECT *, 1 + MOD(SKU_ROW_COUNT, SPLITS) AS HOST_NUMBER 
FROM `project.dataset.table`
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230