26

We have found BigQuery to work great on data sets larger than 100M rows, where the 'initialization time' doesn't really come into play (or is negligible compared to the rest of the query).

However, on anything under that, performance is quite poor, which makes it (1) ill-suited to working in an interactive BI tool; and (2) inferior to other products, such as Redshift or even Elasticsearch, when the data size is under 100M rows. In fact, an engineer at our organization who was evaluating technologies for running queries on data sets between 1M and 100M rows for an analytics product with about 1,000 users gave the feedback that he could not believe how slow BigQuery was.

Without asking for a defense of the BigQuery product, I was wondering whether there are any plans to improve:

  1. The speed of BigQuery -- especially its initialization time -- on queries over non-massive data sets?
  2. Will BigQuery ever be able to deliver sub-second response times on 'regular' queries (such as a simple aggregation with a GROUP BY, like the one sketched below) on datasets under a certain size?
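
For concreteness, here is roughly the kind of query I mean: a minimal sketch using the google-cloud-bigquery Python client against a hypothetical small table (the project, dataset, and table names are placeholders), timing the end-to-end latency.

```python
# pip install google-cloud-bigquery
import time

from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# A simple aggregation with a GROUP BY over a hypothetical small table (~1M rows).
sql = """
    SELECT country, COUNT(*) AS events, SUM(revenue) AS revenue
    FROM `my-project.analytics.events_small`
    GROUP BY country
"""

start = time.monotonic()
rows = list(client.query(sql).result())  # submit the job and wait for all rows
elapsed = time.monotonic() - start

print(f"{len(rows)} groups in {elapsed:.2f}s")  # consistently ~2-4s for us, regardless of table size
```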
David542
  • Without understanding the ins and outs of how BigQuery is programmed, I wouldn't be surprised if most of that 3sec is based on initiation time. Do you get the same result for a table with 10 rows? Hadoop (before Yarn was implemented) used to spend around 3sec setting up a JVM with each query. – Ed Baker Feb 24 '17 at 01:20
  • It's time spent on metadata/initiation as Ed guessed, but actual execution time is very small. We have work in progress that will address this, but some of the changes are complicated and will take a while. – Elliott Brossard Feb 24 '17 at 08:39
  • @ElliottBrossard thanks, would you be interested/able to share any of the broad details in an answer? One of the downsides of this is that in benchmarks Redshift comes out much faster on smaller data sizes (but is crushed on genuinely large, big-data sizes), which is a big marketing advantage for Amazon when publishing comparisons! – David542 Feb 24 '17 at 19:53
  • It's hard to say anything useful without going into internal details, unfortunately. You can imagine, though, that in its infancy, BigQuery could have central systems for managing jobs, metadata, etc. in a manner that performed very well for all N_0 entities using the service. Once you get to N_1 entities, however, it may be necessary to rearchitect some things to make them have as little latency as possible. – Elliott Brossard Feb 24 '17 at 20:17
  • @ElliottBrossard thanks, any estimate on when that change will be implemented? A couple of months? End of 2017? – David542 Feb 26 '17 at 22:06
  • It's notoriously hard to give accurate timelines in software engineering :) Probably sometime this year, though, and it would show up in our [release notes](https://cloud.google.com/bigquery/docs/release-notes). – Elliott Brossard Feb 26 '17 at 22:31
  • @ElliottBrossard - might be worthwhile taking your first comment and turning that into an answer for others to easily see in future? – Graham Polley Mar 01 '17 at 11:16
  • Sure, I added my response as a self-contained answer below. – Elliott Brossard Mar 01 '17 at 11:54

2 Answers

17

It's time spent on metadata/initiation, but actual execution time is very small. We have work in progress that will address this, but some of the changes are complicated and will take a while.

You can imagine that in its infancy, BigQuery could have central systems for managing jobs, metadata, etc. in a manner that performed very well for all N_0 entities using the service. Once you get to N_1 entities, however, it may be necessary to rearchitect some things to make them have as little latency as possible. For notifications about new features--which is also where we would announce API improvements related to start-up latency--keep an eye on our [release notes](https://cloud.google.com/bigquery/docs/release-notes), which you can also subscribe to as an RSS feed.
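
As a rough illustration of that split, you can compare a query's end-to-end latency with the slot time it actually consumed using the job statistics; a minimal sketch with the google-cloud-bigquery Python client (the table name is a placeholder):

```python
# Rough sketch (placeholder table name): compare end-to-end latency with the
# slot time the query actually consumed. Requires google-cloud-bigquery.
from google.cloud import bigquery

client = bigquery.Client()
job = client.query(
    "SELECT status, COUNT(*) AS n FROM `my-project.analytics.small_table` GROUP BY status"
)
job.result()  # wait for the job to finish
job.reload()  # refresh the job statistics

wall_ms = (job.ended - job.created).total_seconds() * 1000
slot_ms = job.slot_millis or 0  # may be unset for very small queries
print(f"end-to-end: {wall_ms:.0f} ms, slot time: {slot_ms} ms")
# For small inputs the slot time is typically a tiny fraction of the wall-clock
# latency; the remainder is job creation, metadata, and scheduling overhead.
```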

Elliott Brossard
  • does it matter how small the data size is? For example, if we had a data set of 1,000 rows, would it still take 3.5s to run the query? – David542 Mar 02 '17 at 23:12
  • Impossible to say if it would be exactly 3.5s, but yes, 2-4s. – Graham Polley Mar 11 '17 at 08:37
  • @Elliott Brossard -- is there a related Google product you'd recommend for doing aggregation queries on smaller data-sets (10M and below)? I've tested with Cloud Spanner but the performance is quite poor on aggregation queries at that size. – David542 Mar 11 '17 at 21:52
  • @Elliott any update on when the start-up latency improvements will land in BQ? – David542 May 18 '17 at 22:08
  • It would be amazing to have faster query execution for small datasets. Today I have to alternate between BigQuery and MySQL. I would love to do it all in BigQuery. For now, it makes no sense to give users a 3-second experience for a 100k-row dataset. I'm looking forward to the improvement :) – Murta Aug 20 '17 at 02:05
  • @Murta I filed this as a ticket (which in my opinion is the most important single 'feature' in BQ), but it seems there's no ETA on it and my guess is it won't be fixed. You can try posting a ticket as well: https://issuetracker.google.com/issues/67021275 – David542 Nov 09 '17 at 00:34
  • I think Omid (who closed the bug) is asserting that there is no short-term effort to address the latency, which is true. He also wrote, "We do have ongoing plans to improve performance for queries with small inputs." So there is currently no ETA--we can't make any promises about when long-term projects will land--but efforts are ongoing. – Elliott Brossard Nov 09 '17 at 01:02
  • Hi Elliott. Do you have news about that issue? – Murta Apr 12 '18 at 02:30
  • The issue that David filed is the best place for updates. – Elliott Brossard Apr 12 '18 at 05:41
  • Is this response still valid in 2020? I'm having the same latency issue when getting small data from BigQuery. About 3s latency. – Hubertokf Aug 01 '20 at 20:48
  • I am no longer with the BigQuery team (I moved to Snowflake, which does have lower latency over small tables). – Elliott Brossard Aug 03 '20 at 17:50
  • It surprises me how slow it is, to be honest. – Christian Leon Oct 23 '20 at 16:00
  • @ChristianLeon yes, it's ridiculous. I cannot tell you how many projects we had that would have been candidates for BigQuery where the client ultimately said "too slow". BigQuery should state right off the bat: "For querying large data sources; expect > 2s". – David542 Feb 19 '21 at 19:55
4

Exactly 4 years after this question was asked, we have amazing news for BigQuery users! As stated in the BI Engine release note from 2021-02-25:

The BI Engine SQL interface expands BI Engine to integrate with other business intelligence (BI) tools such as Looker, Looqbox, Tableau, Power BI, and custom applications to accelerate data exploration and analysis. This page provides an overview of the BI Engine SQL interface, and the expanded capabilities that it brings to this preview version of BI Engine.

I believe this can solve the query latency issue mentioned in David542's question.

Murta
  • @Murta -- could you please explain how that's relevant at all? I've used BI Engine for the past two years... – David542 Mar 01 '21 at 18:39
  • @David542, BI Engine used to be limited to working with Data Studio, but with last month's upgrade it can also be used over JDBC, ODBC, or REST API connections. – Murta Mar 02 '21 at 02:47
  • Would you mind showing an example of its usage in your answer, i.e., a couple of lines of code calling it? – David542 Mar 11 '21 at 00:03
  • When configured, BI Engine will be used for regular BigQuery API calls. There is no need to change anything on the client side. The only thing to note: make sure you use jobs.query (instead of jobs.insert) to get the best latency, but the rest is exactly the same as calling BigQuery. – Alexey Klishin Aug 24 '21 at 16:54
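
For reference, a minimal sketch of that with the google-cloud-bigquery Python client. This assumes a recent client version that exposes the api_method option for routing requests through jobs.query, an existing BI Engine reservation covering the project, and placeholder project/table names.

```python
# Minimal sketch, assuming a recent google-cloud-bigquery client (api_method option)
# and an existing BI Engine reservation; project/table names are placeholders.
from google.cloud import bigquery
from google.cloud.bigquery.enums import QueryApiMethod

client = bigquery.Client()

# Route the request through jobs.query (instead of jobs.insert) for lower latency.
job = client.query(
    "SELECT user_id, COUNT(*) AS n FROM `my-project.analytics.events` GROUP BY user_id",
    api_method=QueryApiMethod.QUERY,
)

for row in job.result():
    print(row.user_id, row.n)

# No other client-side change is needed: if the queried table is covered by a
# BI Engine reservation, acceleration is applied transparently.
```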