Suppose I have the following query sent to BQ:

SELECT shipmentID, category, quantity
FROM [myDataset.myTable]

Further, suppose that the query returns data such as:

shipmentID  category  quantity
1           shoes     5
1           hats      3
2           shirts    1
2           hats      2
3           toys      3
2           books     1
3           shirts    1

How can I pivot the results, from within BQ, to produce output as follows:

 shipmentID   shoes  hats  shirts  toys  books
 1            5      3     0       0     0
 2            0      2     1       0     1
 3            0      0     1       3     0

As some additional background, I actually have 2000+ categories that I need to pivot, and there's enough data that I can't do it directly through a Pandas DataFrame in Python (it uses all the memory, then slows to a crawl). I tried a relational database, but ran into a column limit, so I'd like to do it directly in BQ, even if I have to build the query itself through Python. Any suggestions?

** Edit 1 I should mention that pivoting the data itself can be done in chunks and is therefore not the issue. The real trouble comes in trying to do the aggregation afterwards, so that I have only one row for each shipmentID. That's what eats all the RAM.

** Edit 2 After trying out the accepted answer below, I found that trying to use it to create a 2k+ column pivot table was causing "Resources exceeded" errors. My BQ team was able to refactor the query to break it into smaller chunks and allow it to go through. The basic structure of the query is as follows:

SELECT
  SetA.*,
  SetB.*,
  SetC.*
FROM (
  SELECT
    shipmentID,
    SUM(IF (category="Rocks", qty, 0)),
    SUM(IF (category="Paper", qty, 0)),
    SUM(IF (category="Scissors", qty, 0))
  FROM (
    SELECT
      a.shipmentID shipmentID,
      a.quantity quantity,
      a.category category
    FROM
      [myDataset.myTable] a)
  GROUP EACH BY
    shipmentID ) SetA
INNER JOIN EACH (
  SELECT
    shipmentID,
    SUM(IF (category="Jello Molds", quantity, 0)),
    SUM(IF (category="Torque Wrenches", quantity, 0))
  FROM (
    SELECT
      a.shipmentID shipmentID,
      a.quantity quantity,
      a.category category
    FROM
      [myDataset.myTable] a)
  GROUP EACH BY
    shipmentID ) SetB
ON
  SetA.shipmentID = SetB.shipmentID
INNER JOIN EACH (
  SELECT
    shipmentID,
    SUM(IF (category="Deep Thoughts", qty, 0)),
    SUM(IF (category="Rainbows", qty, 0)),
    SUM(IF (category="Ponies", qty, 0))
  FROM (
    SELECT
      a.shipmentID shipmentID,
      a.quantity quantity,
      a.category category
    FROM
      [myDataset.myTable] a)
  GROUP EACH BY
    shipmentID ) SetC
ON
  SetB.shipmentID = SetC.shipmentID

The above pattern can be continued indefinitely by adding INNER JOIN EACH segments one after the other. For my application, BQ was able to handle about 500 columns per chunk.
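
For reference, the chunked query is regular enough to generate from Python. Here's a rough sketch (the chunk size, sample categories, and helper name are illustrative stand-ins, not my actual setup):

# Rough sketch: generate the chunked pivot query from a category list.
# CHUNK_SIZE, the sample categories, and the table name are stand-ins.
# We found ~500 columns per chunk works in practice; 2 here just to
# show the join structure with a short list.
CHUNK_SIZE = 2
categories = ["Rocks", "Paper", "Scissors", "Jello Molds", "Torque Wrenches"]

def sum_column(category):
    # One SUM(IF(...)) column per category; sanitize the alias so it
    # remains a valid identifier.
    alias = "".join(c if c.isalnum() else "_" for c in category).lower()
    return '    SUM(IF (category="%s", quantity, 0)) AS %s' % (category, alias)

chunks = [categories[i:i + CHUNK_SIZE]
          for i in range(0, len(categories), CHUNK_SIZE)]

subqueries = []
for n, chunk in enumerate(chunks):
    columns = ",\n".join(sum_column(c) for c in chunk)
    subqueries.append(
        "(\n  SELECT\n    shipmentID,\n%s\n  FROM [myDataset.myTable]\n"
        "  GROUP EACH BY shipmentID ) Set%d" % (columns, n))

# Start from the first chunk, then chain the rest with INNER JOIN EACH.
select_list = ",\n  ".join("Set%d.*" % n for n in range(len(chunks)))
query = "SELECT\n  %s\nFROM %s" % (select_list, subqueries[0])
for n in range(1, len(chunks)):
    query += ("\nINNER JOIN EACH %s\nON Set%d.shipmentID = Set%d.shipmentID"
              % (subqueries[n], n - 1, n))

print(query)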

PTTHomps
  • Did you have to manually type out all of the categories? I am running into a similar issue and that seems like too much work. – Nivi Jan 10 '18 at 02:48
  • I wonder if you tried using PySpark dataframes for pivoting? – Nivi Jan 10 '18 at 02:51

1 Answer

This is a way to do it:

select shipmentID,
  sum(IF (category='shoes', quantity, 0)) AS shoes,
  sum(IF (category='hats', quantity, 0)) AS hats,
  sum(IF (category='shirts', quantity, 0)) AS shirts,
  sum(IF (category='toys', quantity, 0)) AS toys,
  sum(IF (category='books', quantity, 0)) AS books
from
  (select 1 as shipmentID, 'shoes'  as category, 5 as quantity),
  (select 1 as shipmentID, 'hats'   as category, 3 as quantity),
  (select 2 as shipmentID, 'shirts' as category, 1 as quantity),
  (select 2 as shipmentID, 'hats'   as category, 2 as quantity),
  (select 3 as shipmentID, 'toys'   as category, 3 as quantity),
  (select 2 as shipmentID, 'books'  as category, 1 as quantity),
  (select 3 as shipmentID, 'shirts' as category, 1 as quantity)
group by shipmentID

This returns:

+-----+------------+-------+------+--------+------+-------+
| Row | shipmentID | shoes | hats | shirts | toys | books |
+-----+------------+-------+------+--------+------+-------+
|   1 |          1 |     5 |    3 |      0 |    0 |     0 |
|   2 |          2 |     0 |    2 |      1 |    0 |     1 |
|   3 |          3 |     0 |    0 |      1 |    3 |     0 |
+-----+------------+-------+------+--------+------+-------+

See the manual for other pivot table examples.
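
Given a list of categories, the query is also easy to build programmatically. A quick Python sketch (the category list and table name are placeholders):

# Quick sketch: build the pivot query from a category list.
# The category list and table name are placeholders; real category
# names may need sanitizing before use as column aliases.
categories = ['shoes', 'hats', 'shirts', 'toys', 'books']

columns = ",\n".join(
    "  sum(IF (category='%s', quantity, 0)) AS %s" % (c, c)
    for c in categories)

query = ("SELECT shipmentID,\n%s\nFROM [myDataset.myTable]\n"
         "GROUP BY shipmentID" % columns)
print(query)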

Pentium10
  • this looks good, and it should be relatively easy to build the query programmatically, given a list of categories. I'll give it a shot. Thanks! :) – PTTHomps Jun 08 '15 at 18:08
  • Tested this out, and it works... to an extent. Trying to run it on 2000+ categories unfortunately produces a "Resources exceeded during execution." error, but it works for smaller numbers of categories. – PTTHomps Jun 08 '15 at 19:22
  • @TraxusIV you can reach out to google support, or try posting a new question with a generated query and mentioning a failed job ID, someone from BQ team will be able to see if the limits can be increased or not. – Pentium10 Jun 09 '15 at 06:26
  • Yep, I went ahead and handed it off to our BQ team. They're going to talk to Google about it. To be fair to Google, it's an enormous amount of data and processing. Tried doing the pivot on my laptop with 32GB of ram and it just chokes. – PTTHomps Jun 09 '15 at 14:47
  • If the use is reasonable they can increase the limits. Other than that you can reorganize the tables. – Pentium10 Jun 09 '15 at 15:36
  • Our BQ team was able to modify the query to allow it to process. I'll add it as an edit to the original post: – PTTHomps Jun 11 '15 at 17:56