
I'm using the BigQuery Storage API Java client (from Scala), reading from a clustered table with 4 clustering fields, e.g.

bq mk [...] --table --clustering_fields f1,f2,f3,f4 mytable mytableschema.json

and the schema looks like the following:

[
  {
    "name": "f1",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "f2",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "f3",
    "type": "STRING",
    "mode": "REQUIRED"
  },
  {
    "name": "f4",
    "type": "STRING",
    "mode": "REQUIRED"
  },

 other fields...
]

Now, if I execute a normal query like this:

SELECT * FROM dataset.mytable 
WHERE f1 IN ('a', 'b') AND f2 IS NULL AND f3 = 'x'

I can see that the reported "bytes processed" correctly covers only the data belonging to the clusters selected by (a, null, x) and (b, null, x).
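
(For reference, this is roughly how the query-side bytes processed can be verified with a dry run through the standard BigQuery Java client; a minimal sketch assuming default credentials, not part of my actual pipeline:)

    import com.google.cloud.bigquery.{BigQueryOptions, JobInfo, JobStatistics, QueryJobConfiguration}

    val bigquery = BigQueryOptions.getDefaultInstance.getService

    // A dry run plans the query without executing it and reports the bytes
    // it would process; with clustering this reflects only the pruned clusters.
    val dryRun = QueryJobConfiguration
      .newBuilder(
        "SELECT * FROM dataset.mytable WHERE f1 IN ('a', 'b') AND f2 IS NULL AND f3 = 'x'"
      )
      .setDryRun(true)
      .setUseQueryCache(false)
      .build()

    val stats: JobStatistics.QueryStatistics =
      bigquery.create(JobInfo.of(dryRun)).getStatistics()
    println(s"Bytes processed: ${stats.getTotalBytesProcessed}")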

If, instead, I try to export the data using the Storage API with the exact same row restriction, the cost appears to cover the whole table size. There's no way to know the actual bytes processed; there's only an estimate in the API, which reports the whole table as scanned, and that's also what I see in the actual billing.

The Storage API is used as follows (it's actually wrapped in a ZIO stream, but this is the gist of it):

    import com.google.cloud.bigquery.storage.v1.{
      BigQueryReadClient,
      CreateReadSessionRequest,
      DataFormat,
      ReadSession
    }
    import com.google.cloud.bigquery.storage.v1.ReadSession.TableReadOptions

    val client = BigQueryReadClient.create()

    // Only the row restriction is set; no column selection.
    val options =
      TableReadOptions
        .newBuilder()
        .setRowRestriction("f1 IN ('a', 'b') AND f2 IS NULL AND f3 = 'x'")
        .build()

    // tableName is "projects/{project}/datasets/{dataset}/tables/mytable"
    val readSessionBuilder =
      ReadSession
        .newBuilder()
        .setTable(tableName)
        .setDataFormat(DataFormat.AVRO)
        .setReadOptions(options)

    // ProjectName is "projects/{billing-project}"
    val readSessionRequestBuilder =
      CreateReadSessionRequest
        .newBuilder()
        .setParent(ProjectName)
        .setReadSession(readSessionBuilder)
        .setMaxStreamCount(1)

    val session = client.createReadSession(readSessionRequestBuilder.build())
    // ... read from session.getStreamsList
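
The estimate I mentioned above comes from the read session itself (recent client versions expose it as estimated_total_bytes_scanned):

    // Upfront scan estimate for this session; for me this matches the
    // full table size instead of just the filtered clusters.
    println(s"Estimated bytes scanned: ${session.getEstimatedTotalBytesScanned}")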

Does the Storage API support cost reduction via row restrictions on clustered tables at all? I can't find any information about this anywhere.

Giovanni Caporaletti
  • As a best practice, use partitioned and clustered tables whenever possible. You can reduce the amount of data read by using a WHERE clause to prune [partitions](https://cloud.google.com/bigquery/pricing#storage-api). To know the number of bytes read by BigQuery storage API from the BigQuery table you can create a budget and add a filter to check only storage API cost. – kiran mathew Aug 01 '22 at 13:41
  • As I stated in the question, I do use clustering, and a WHERE condition works in a query while it doesn't in a Storage API export from the Java client. The way I know it's billing more than I expect is from the billing report, as you say – Giovanni Caporaletti Aug 02 '22 at 14:02
  • It seems to be an internal issue. If you have a support plan, I would recommend you to raise a ticket with [Google Cloud Support](https://cloud.google.com/support). If not, create a thread in Google’s Issue Tracker using this [template](https://issuetracker.google.com/issues/new?component=187149&template=1162659). You can also STAR the issue to receive automatic updates and give it traction by referring to this [link](https://developers.google.com/issue-tracker/guides/subscribe#starring_an_issue). – kiran mathew Aug 04 '22 at 12:46

0 Answers