
When clustering my Wikipedia pageviews tables, I got the error:

Clustering encountered a key that is longer than the maximum allowed limit of 1024 bytes.

Context: https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b

I'm clustering with:

CREATE TABLE `fh-bigquery.wikipedia_v3.pageviews_2017`
PARTITION BY DATE(datehour)
CLUSTER BY wiki, title
...

Felipe Hoffa
  • FYI: from what I understand, a bug was found in the recently introduced enforcement of the 1KB limit, so enforcement is disabled for now until it is fixed – Mikhail Berlyant Aug 21 '18 at 01:32

1 Answer


When clustering tables, BigQuery has a limit of 1KB for the key values.

You can solve this for the example tables by changing your insertion code so it truncates any entry that's too long.

For example, instead of:

INSERT INTO `fh-bigquery.wikipedia_v3.pageviews_2018` (datehour, wiki, title, views)
SELECT datehour, wiki, title, views

truncate the potentially long titles with:

INSERT INTO `fh-bigquery.wikipedia_v3.pageviews_2018` (datehour, wiki, title, views)
SELECT datehour, wiki, SUBSTR(title, 0, 300) title, views

If you continue to see errors, note that SUBSTR() counts characters while the clustering limit is enforced in bytes, so some strings (multi-byte or malformed) can still exceed the limit after truncation. Filter those out with:

WHERE BYTE_LENGTH(title) < 300
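Putting the two fixes together, a minimal sketch (the source table name and the 300-byte cutoff here are placeholders, not from the original answer):

```sql
INSERT INTO `fh-bigquery.wikipedia_v3.pageviews_2018` (datehour, wiki, title, views)
SELECT datehour, wiki, SUBSTR(title, 0, 300) title, views
FROM `your_project.your_dataset.pageviews_staging`  -- placeholder source table
WHERE BYTE_LENGTH(title) < 300  -- drop rows whose title byte length still exceeds the cutoff
```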
Felipe Hoffa
  • gotta say I love seeing BQ evolving and innovating at the speed it does :)! – Willian Fuks Aug 19 '18 at 00:02
  • I suspect it's not that the title is "malformed", but that SUBSTR operates on Unicode runes (characters) and that the clustering limits are in bytes. Thus, titles with double/triple width characters (many Chinese characters, accented characters, emoji) will fall into this case. – elithrar Aug 19 '18 at 00:38
  • In my specific case, it was malformed strings (coming from real life logs) – Felipe Hoffa Aug 19 '18 at 02:49
  • I'm doing `CLUSTER BY wiki, title`, so I had to make sure that `BYTE_LENGTH(wiki+title)` was lower than the limit. – Felipe Hoffa Aug 20 '18 at 19:54
  • Do you know if BigQuery is working on extending this limit @FelipeHoffa? – Eben du Toit Jun 11 '19 at 15:28
  • Just following up on this thread. This limit has been removed for a while now, and the new behavior is documented here: https://cloud.google.com/bigquery/docs/creating-clustered-tables#limitations – Pavan Edara Aug 05 '20 at 16:10
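As the comments point out, SUBSTR counts Unicode characters while the clustering limit is enforced in bytes. BigQuery's CHAR_LENGTH and BYTE_LENGTH make the difference visible; for example, each CJK character below takes 3 bytes in UTF-8:

```sql
-- Character count and byte count diverge for multi-byte text.
SELECT
  CHAR_LENGTH('日本語') AS chars,  -- 3
  BYTE_LENGTH('日本語') AS bytes   -- 9
```

This is why a title that passes a character-based SUBSTR truncation can still blow past a byte-based key limit.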