24

This is a follow-up to the question "Why doesn't BigQuery perform as well on small data sets?".

Let's suppose I have a data set of ~1M rows. In the current database we're using (MySQL), aggregation queries run quite slowly, perhaps taking ~10s or so on complex aggregations. On BigQuery, the initialization time required might bring this query down to ~3 seconds, better than MySQL, but still the wrong tool for the job if we need to return queries in 1s or under.

My question then is: what would be a good alternative to BigQuery for running aggregation queries on moderate-sized data sets, such as 1-10M rows? An example query might be:

SELECT studio, territory, count(*)
FROM mytable
GROUP BY studio, territory
ORDER BY count(*) DESC

Possible solutions I've thought of are Elasticsearch (https://github.com/NLPchina/elasticsearch-sql) and Redshift (Postgres is too slow). What would be a good option here that can be queried via SQL?

Note: I'm not looking for why or how BQ should be used; I'm looking for an alternative for data sets under 10M rows where the query can be returned in under ~1s.

David542
  • I'm voting to close this topic as a recommendation for an off-site resource. However, I will note that BigQuery and the Google Cloud Platform are a work in progress. No doubt, Google recognizes that 3 seconds of overhead for a query is a killer (these are the people who monetized search into a gazillion-dollar business). I would expect them to have a faster solution, as I would expect of any other cloud platform. – Gordon Linoff Mar 10 '17 at 23:47
  • @GordonLinoff thanks for the response. Out of curiosity, what would be the first place you'd look to test a solution? – David542 Mar 11 '17 at 00:01
  • Why don't you just start with a PostgreSQL instance? 1-10M rows are not that many. You can experiment with different instance sizes on Heroku or AWS (I think also GCE has experimental support for it now) while fine tuning your tables in terms of data modeling + indexing to achieve the query times you want. When you scale up your data to a level that PostgreSQL is not a viable solution anymore you can easily move to a solution like Redshift that shares the same DNA with PostgreSQL. – cpard Mar 12 '17 at 18:58
  • @cpard -- already tested Postgres, way too slow with all indexed columns (on par with MySQL for the aggregations) – David542 Mar 13 '17 at 05:06
  • 2
    @David542 OLAP systems like Redshift and BigQuery are not built with emphasis on fast query processing; multiple-second or even minute-long queries are common for these systems. With the amount of data you are mentioning, you should be able to achieve it on something like Redshift, but I'm not sure how consistent this latency will be. Maybe you should consider a different architecture, like putting a cache where the results of your analytical queries will be served from, and then scheduling your queries to run periodically to update your cache. – cpard Mar 13 '17 at 14:23
  • 1
    @cpard agreed, in our tests with Redshift at 'small' data sizes it consistently performed worse, and sometimes ad-hoc queries would take over 20s on their first execution, see http://docs.aws.amazon.com/redshift/latest/dg/c-query-performance.html. – David542 Mar 17 '17 at 20:58
  • @David542 first-time queries always take longer because of compilation and distribution of the code to the nodes, so if possible try to run queries at least twice for benchmarking purposes. See also here for some great tips for a more in-depth analysis of query performance: http://docs.aws.amazon.com/redshift/latest/mgmt/analyzing-query-execution.html – cpard Mar 19 '17 at 11:57
  • 1
    @cpard, right, we're running benchmarks x3, so the first run will be longer but the next two have the compiled query. Anyways, this would be a killer for our project, as almost all the queries are ad-hoc, and we can't have the disclaimer, "Don't worry -- your query will take 20s, but run it a second time and it'll be faster!" – David542 Mar 20 '17 at 02:44
  • 1
    @David542 if you don't mind using a query language that is not SQL, then it might be better to work with Elastic Search, with such requirements. Especially if you plan to have multiple concurrent users running queries. Are you aware of the concurrent query limitations that Redshift has? http://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html – cpard Mar 20 '17 at 16:10
  • 1
    @David542 I added an answer with a few alternatives that I've _actually_ used personally. I'm a little surprised at your Redshift experience. What type of nodes and table structures were you using? We often see subsecond queries on our SSD nodes regardless of whether the query has been seen before. – Joe Harris Mar 20 '17 at 17:12
  • Have you tried Redis, MemSQL, VoltDB or any other in-memory / hybrid offerings? What did you end up using? – Davos Jan 10 '19 at 13:10

11 Answers

14

2020 update: Check out BigQuery BI Engine, the built-in accelerator for dashboard queries.


If you need answers in less than a second, you need to think about indexing.

Typical story:

  1. MySQL (or any other database proposed here) is fast, until...
  2. One day some of your aggregation queries start running slow. Minutes, hours, days, etc.
  3. Typical solution for step 2 is indexing and pre-aggregating. If you want answers in less than a second for a certain type of question, you'll need to invest time and optimization cycles to answer just that type of question.
  4. BigQuery's beauty is that you can skip step 3. Bring those minutes/hours/days to seconds, with minimal investment - any query, at any time.

BigQuery is awesome because it gives you 4. But you are asking for 3; MySQL is fine for that, Elasticsearch is fine too, and any indexed database will bring you results in less than a second, as long as you invest time in optimizing your system for a certain type of question. Then, to get answers for any arbitrary question without investing any optimization time, use BigQuery.
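
To make option 3 concrete, here is a minimal sketch of pre-aggregation in MySQL for the question's example query (the summary table name is hypothetical):

CREATE TABLE studio_territory_counts AS
SELECT studio, territory, COUNT(*) AS cnt
FROM mytable
GROUP BY studio, territory;

-- Index the measure so the ORDER BY is cheap too
ALTER TABLE studio_territory_counts ADD INDEX (cnt);

-- Refresh periodically (e.g. from cron); reads are now trivial:
SELECT studio, territory, cnt
FROM studio_territory_counts
ORDER BY cnt DESC;

The trade-off is exactly the one described above: sub-second reads for this one question, at the cost of a refresh job and development time.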

BigQuery: Will answer arbitrary questions in seconds, no preparation needed.

MySQL and alternatives: Will answer a certain type of question in less than a second, but it will take development time to get there.

Felipe Hoffa
  • thanks for this. Out of curiosity, what does Google do when they need to get sub-second responses on aggregated data sets, such as for Google Analytics? I would assume they aren't using BigQuery or an equivalent for that (and probably not MySQL or a traditional OLTP system)? – David542 Mar 22 '17 at 19:29
  • 2
    Has Google Analytics ever presented its charts in less than a second? (That's a hint) – Felipe Hoffa Mar 22 '17 at 22:22
9

Here are a few alternatives to consider for data of this size:

  1. Single Redshift small SSD node
    • No setup. Easily returns answers on this much data in under 1s. 
  2. Greenplum on a small T2 instance
    • Postgres-like. Similar performance to Redshift. You're not paying for storage you won't need. Start with their single-node "sandbox" AMI.
  3. MariaDB Columnstore
    • MySQL-like. Used to be called InfiniDB. Very good performance. Supported by MariaDB (the company).
  4. Apache Drill
    • Drill has a very similar philosophy to BigQuery but can be used anywhere (it's just a jar). Queries will be fast on this size of data.

If low admin / quick start is critical go with Redshift. If money / flexibility is critical start with Drill. If you prefer MySQL start with MariaDB Columnstore.
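
For a sense of what the Drill option looks like in practice, here is a hedged sketch: with drill-embedded running, the question's aggregation can be pointed straight at a file, with no load step (the dfs path is hypothetical):

SELECT studio, territory, COUNT(*) AS cnt
FROM dfs.`/data/mytable.parquet`
GROUP BY studio, territory
ORDER BY cnt DESC;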

Joe Harris
  • thanks for these suggestions. We tried Drill and it worked pretty well, but in benchmarks Impala performed better/faster than Drill. Redshift is also not an option due to its hard concurrency limitations (as noted in one of the question comments) -- http://docs.aws.amazon.com/redshift/latest/dg/cm-c-implementing-workload-management.html. Will check out Greenplum and MariaDB. – David542 Mar 20 '17 at 20:33
  • Impala, hmm. If you're willing to get into that _kind_ of tool then definitely look at Spark - Good SQL support and your data will easily fit in memory. Also look into Clickhouse. http://tech.marksblogg.com/billion-nyc-taxi-clickhouse.html – Joe Harris Mar 20 '17 at 20:45
  • yes, we also tested Spark and Clickhouse. Impala performed better than Spark, and Clickhouse had a few limitations that made it unsuitable for our project (There is no efficient engine that accepts no params without knowing data nature beforehand -- https://clickhouse.yandex/reference_en.html#Table engines). Will let you know how Greenplum or MariaDB works. – David542 Mar 20 '17 at 20:53
  • here is an actual query that we use in our application, which on a dataset of 1000 rows that we're using for initial loading/testing, took 16s on the first query, then about 600ms on all additional queries: – David542 Mar 20 '17 at 22:52
  • `(SELECT true as __pivot, 0 as __vector, NULL as type, COUNT(DISTINCT score) as UNIQUE_COUNT_c1, COUNT(*) as __count FROM cm10 GROUP BY type) UNION ALL (SELECT false as __pivot, 1 as __vector, type, COUNT(DISTINCT score) as UNIQUE_COUNT_c1, COUNT(*) as __count FROM cm10 GROUP BY type) UNION ALL (SELECT true as __pivot, 1 as __vector, type, COUNT(DISTINCT score) as UNIQUE_COUNT_c1, COUNT(*) as __count FROM cm10 GROUP BY type) ` – David542 Mar 20 '17 at 22:52
  • 1
    Hmm, I'm not really grokking what that query is trying to do. I would say though that `COUNT(DISTINCT` is usually a performance killer on an MPP DB. – Joe Harris Mar 21 '17 at 20:30
7

I know SQL Server, so my answer is biased.

  1. 10M rows should easily fit in memory, so any kind of aggregation should be fast, especially if you have a covering index. If it doesn't, the server configuration may need adjustment. Also, SQL Server has so-called in-memory tables, which may be a good fit here.

  2. SQL Server has a feature called indexed views. Your aggregating query is a classic use case for an indexed view. An indexed view is essentially a copy of the data, stored on disk and maintained by the server automatically as the underlying data in the table changes. It slows INSERTs, DELETEs and UPDATEs, but makes SELECTs fast, because the summary is always pre-calculated. See: What You Can (and Can't) Do With Indexed Views. Other DBMSes should have similar features.
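
As a hedged illustration of point 2, a minimal indexed view for the question's aggregation might look like this in T-SQL (view and index names are hypothetical; indexed views require SCHEMABINDING and COUNT_BIG(*)):

CREATE VIEW dbo.StudioTerritoryCounts
WITH SCHEMABINDING
AS
SELECT studio, territory, COUNT_BIG(*) AS cnt
FROM dbo.mytable
GROUP BY studio, territory;
GO

-- Materializes the view; the optimizer can now answer the GROUP BY from it
CREATE UNIQUE CLUSTERED INDEX IX_StudioTerritoryCounts
ON dbo.StudioTerritoryCounts (studio, territory);
GO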

Vladimir Baranov
  • we benchmarked SQL Server on six of our application queries and it looked good on about 1M rows and under. After that, 5 of the 6 queries probably exceeded our available memory and were quite slow. I think SQL Server would be an option at ~1M rows or under, but on more complex queries it quickly exceeds the machine memory (even if we get a larger machine). – David542 Mar 20 '17 at 02:48
  • @David542, 10M rows with 100 bytes per row is 1GB. It is not trivially small, but not too large either. You may need to look at the execution plan and check what the server is doing. If you use Indexed View, you should be able to greatly reduce the amount of data the server needs to read/keep in memory (depending on your data). If original full table has 10M rows, but there are only, say, 10K distinct combinations of `studio, territory`, then indexed view's index will have only 10K rows => your query with indexed view would be very fast. – Vladimir Baranov Mar 20 '17 at 03:32
  • @David542, on the other hand, if the full table has 10M rows and there are 9M distinct combinations of `studio, territory`, then indexed view would not help much. A simple index on `(studio, territory)` would have almost the same effect. – Vladimir Baranov Mar 20 '17 at 03:48
1

If you don't need concurrency (multiple users connecting simultaneously) and your data can fit in a single disk file, then SQLite might be appropriate.

As they say, SQLite does not compete with client/server databases. SQLite competes with fopen().

http://www.sqlite.org/whentouse.html
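
A minimal sketch of the question's workload in SQLite, assuming the table from the question (WAL mode allows concurrent readers alongside a single writer, which softens the concurrency caveat a little):

PRAGMA journal_mode = WAL;  -- readers no longer block on the writer
CREATE INDEX IF NOT EXISTS idx_studio_territory ON mytable(studio, territory);

SELECT studio, territory, COUNT(*) AS cnt
FROM mytable
GROUP BY studio, territory
ORDER BY cnt DESC;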

Jim Brody
  • we'd need concurrency on this one. I think Impala might be the fastest option, but seems a bit overkill for data < 10M rows? – David542 Mar 11 '17 at 18:49
1

I think Microsoft SQL Server Analysis Services is a good option; I have used it myself. It is the database behind the Power BI service, which has a very good free tier option.

If you want a free on-premises solution, you can always use SQL Server Express with the new columnstore technology. I did not use it myself, but I have heard of some very good results.
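
As a hedged sketch of the columnstore suggestion (columnstore is available in Express from SQL Server 2016 SP1 onward; the index name is hypothetical):

-- A nonclustered columnstore index over the grouping columns makes
-- scan-and-aggregate queries like the question's very cheap
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_mytable_columnstore
ON mytable (studio, territory);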

Mim
1

If that is your only query, then this will make it run faster:

INDEX(studio, territory)  -- in either order.

If there are other variations, let's see them, plus SHOW CREATE TABLE.

Another thing to check: how much RAM do you have, and what is the value of innodb_buffer_pool_size? That setting should be about 70% of RAM (if you have more than 4GB of RAM).
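
A minimal sketch of both suggestions, assuming the question's mytable (the index name is hypothetical):

ALTER TABLE mytable ADD INDEX idx_studio_territory (studio, territory);

-- Check how much memory InnoDB can use to cache data and indexes:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';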

Rick James
  • thanks, the above is more a sample query, so we wouldn't necessarily know the index-combinations to use. – David542 Mar 20 '17 at 04:21
  • Need to see the breadth of the problem to provide a complete solution. Sounds like an "EAV" problem -- which is messy. – Rick James Mar 20 '17 at 17:43
1

Do not use COUNT(*).

Use COUNT() on a single column, preferably an indexed one like the PRIMARY KEY.

Usagi Miyamoto
  • 2
    `COUNT(*)` counts rows and gives the optimizer the flexibility to pick which index to use. `COUNT(x)` checks each `x` for being `NOT NULL`, which is usually not desired. – Rick James Mar 20 '17 at 20:20
  • `COUNT(*)` means count all rows that are not full of `NULL` values. Many implementations use full table scans to do this. – Usagi Miyamoto Mar 21 '17 at 09:31
  • 2
    I strongly believe that you are wrong about `COUNT(*)` needing to look at all the columns. And I tried a simple table with all columns being NULLable; COUNT(*) included rows with all nulls. – Rick James Mar 21 '17 at 19:52
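
A quick standard-SQL illustration of the behavior discussed in this thread (table and column names are hypothetical):

CREATE TABLE t (a INT NULL, b INT NULL);
INSERT INTO t VALUES (NULL, NULL), (1, NULL);

SELECT COUNT(*) AS all_rows,    -- 2: counts every row, including the all-NULL one
       COUNT(a) AS non_null_a   -- 1: counts only rows where a IS NOT NULL
FROM t;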
1

My answer: optimize the query and table structure as previously addressed (1 sec or less). Read on below for further reasoning, because we all fall into this trap. Note: the above is not necessarily a big dataset.

A great question. It is such a struggle to decipher what is the problem and what is a solution. Here's a shot coming from the old school. In the old days, we used to say that if you ask a hardware person, an OS person, or a developer what the problem/solution is, you will get three different answers.

My understanding is that this question is asking to solve/compare a SQL performance problem with a cloud infrastructure solution. This question will have many different answers based on background. It is confusing: you have old-school database installations (MySQL, Oracle, MSSQL), Database as a Service (DBaaS), big data cloud solutions, and big data application solutions (Hadoop).

It is so easy to get tangled up in all this technology. Maybe here is a little clarity.

SQL performance problems can be solved at a variety of points of performance (POPs):

  1. SQL optimization and tuning (temp tables, in-memory, OLAP functions, SQL plan, parallelization, analytics) and tools (MySQL Workbench, command line, Toad, etc.)
  2. Structure optimization (tables, indexing, partitioning, pre-aggregation structures)
  3. Database configuration (memory size, cache sizes, parallelization, block size, etc.)
  4. OS (memory, page size, processes)
  5. Hardware and network - mostly irrelevant now
  6. Server provisioning
  7. Cloud provisioning and clustering
  8. Infrastructure and software decisions

Bottom line: I will stop here; we have so many solutions for problems. Try to start with the most basic usage of a technology before incurring the costs of bigger technologies. Hopefully this gives you a skeleton of a path to work through, or the terminology to use when asking the question: how do I get query x to run in time t?

1

You don't talk much about the problem space you're in, but have you considered Python pandas, or R? These are great tools for data analytics/development.

Assuming you have Python and pandas handy (pip install pandas pyodbc), you can get started with something like this:

import pandas as pd
import pyodbc

conn = pyodbc.connect(...)  # You'll need to figure out the settings for your DB here

# This is slow, but only needs to be done once: load everything into memory
data = pd.read_sql_query('SELECT * FROM mytable', conn)

# Now do the query:
data.groupby(['studio', 'territory']).size().sort_values(ascending=False)

I strongly recommend trying out pandas with Jupyter Notebooks.

Matthew
0

BigQuery is meant to perform best at the end of the Big Data pipeline. It is designed to perform well on large data sets, not small ones, and is not meant as a replacement for existing technologies but rather as an excellent complement in certain situations. An example can be read on the Google Cloud Big Data and Machine Learning Blog.

George
0

If you are looking for sub-second OLAP query results, then Druid (http://druid.io/) was built for that purpose. It is a beast to deploy and tune, but once you get it configured properly for your data it is very, very fast. It has streaming support, so you can ingest from Kafka with exactly-once semantics, which is awesome. It scales very well from small to massive amounts of data, although you will pay a cost: it does pre-aggregation, so if you have a lot of dimensions the data size explodes. SQL support has only recently been added and is incomplete. Also, it does not support joins, so you have to structure your data properly to get your answers out.

Ben
  • thanks, we tested Druid and it wasn't useful for our needs. It requires a timestamped field, which our data often doesn't have (or need): "Every row in Druid must have a timestamp. Data is always partitioned by time, and every query has a time filter. Query results can also be broken down by time buckets like minutes, hours, days, and so on." -- http://druid.io/docs/0.9.2/ingestion/schema-design.html – David542 Mar 19 '18 at 00:33
  • Yes that is true. It is possible to work around that by building a long value that you partition on, but if your data is not time series in nature you are probably better off using something else. – Ben Mar 20 '18 at 02:12
  • Another option for small data could be a data grid like Apache Ignite. Keep it all in memory and it should be screaming fast. I haven't used it, but I understand it has SQL support and works with BI tools like Tableau. There are a fair number of similar products that may have similar or superior capabilities. – Ben Mar 20 '18 at 02:26
  • that's interesting, I've never used (or even heard of) Apache Ignite. Do you know of any products that use it or a good way to test it out? – David542 Mar 21 '18 at 01:08