Architecting a high performing "inserting solution"

Question

I am tasked with putting together a solution that can handle a high level of inserts into a database. There will be many AJAX type calls from web pages. It is not only one web site/page, but several different ones.

It will be dealing with tracking people's behavior on a web site, triggered by various javascript events, etc.

It is important for the solution to be able to handle the heavy database inserting load. After it has been inserted, I don't mind migrating the data to an alternative/supplementary data store.

We are initial looking at using the MEAN stack with MongoDB and migrating some data to MySql for reporting purposes. I am also wondering about the use of some sort of queue-ing before insert into db or caching like memcached

I didn't manage to find much help on this elsewhere. I did see this post but it is now close to 5 years old, feels a bit outdated and don't quite ask the same questions.

Your thoughts and comments are most appreciated. Thanks.

score 2 · Answer 1 · edited May 23 '17 at 11:50

2

Why do you need a stack at all? Are you looking for a web-application to do the inserting? Or do you already have an application?

It's doubtful any caching layer will outrun your NoSQL database for inserts, but you should probably confirm that you even need a NoSQL database. MySQL has pretty solid raw insert performance, as long as your load can be handled on a single box. Most NoSQL solutions scale better horizontally. This is probably worth a read. But realistically, if you already have MySQL in-house, and you separate your reporting from your insert instances, you will probably be fine with MySQL.

edited May 23 '17 at 11:50

Community

1
1

answered Apr 27 '15 at 13:30

Rob Conklin

8,806
1
19
23

Thanks @Rob, could you elaborate a bit on "as long as your load can be handled on a single box"? Do you mean as long as you have enough hardware? Also, this is a project 'from scratch' and we are not bound by any particular db. – PostureOfLearning Apr 28 '15 at 09:53
Many NoSQL solutions scale horizontally particularly well, MySQL has a lot more trouble, mostly because of the whole multi-master replication complication. http://scale-out-blog.blogspot.com/2012/04/if-you-must-deploy-multi-master.html gives some good overviews of this problem-space. But, if you have no need to deal with this complication (because a single machine can handle the throughput). To be clear, one (decent sized) machine will very likely be able to handle a huge number of raw inserts (25k per second is pretty normal). – Rob Conklin Apr 28 '15 at 13:13
If you need more than this, then something like Cassandra might be more appropriate. It has a pretty flat scale out with # of instances. – Rob Conklin Apr 28 '15 at 13:13

score 2 · Answer 2 · answered Apr 27 '15 at 14:33

Some initial theory

To understand how you can optimize for the heavy insert workload, I suggest to understand the main overheads involved in inserting data in a database. Once the various overheads are understood, all kings of optimizations will come to you naturally. The bonus is that you will both have more confidence in the solution, you will know more about databases, and you can apply these optimizations to multiple engines (MySQL, PostgreSQl, Oracle, etc.).

I'm first making a non-exhaustive list of insertion overheads and then show simple solutions to avoid such overheads.

1. SQL query overhead: In order to communicate with a database you first need to create a network connection to the server, pass credentials, get the credentials verified, serialize the data and send it over the network, and so on. And once the query is accepted, it needs to be parsed, its grammar validated, data types must be parsed and validated, the objects (tables, indexes, etc.) referenced by the query searched and access permissions are checked, etc. All of these steps (and I'm sure I forgot quite a few things here) represent significant overheads when inserting a single value. The overheads are so large that some databases, e.g. Oracle, have a SQL cache to avoid some of these overheads.

Solution: Reuse database connections, use prepared statements, and insert many values at every SQL query (1000s to 100000s).

2. Ensuring strong ACID guarantees: The ACID properties of a DB come at the cost of logging all logical and physical modification to the database ahead of time and require complex synchronization techniques (fine-grained locking and/or snapshot isolation). The actual time required to deal with the ACID guarantees can be several orders of magnitude higher than the time it takes to actually copy a 200B row in a database page.

Solution: Disable undo/redo logging when you import data in a table. Alternatively, you could also (1) drop the isolation level to trade off weaker ACID guarantees for lower overhead or (2) use asynchronous commit (a feature that allows the DB engine to complete an insert before the redo logs are properly hardened to disk).

3. Updating the physical design / database constraints: Inserting a value in a table usually requires updating multiple indexes, materialized views, and/or executing various triggers. These overheads can again easily dominate over the insertion time.

Solution: You can consider dropping all secondary data structures (indexes, materialized views, triggers) for the duration of the insert/import. Once the bulk of the inserts is done you can re-created them. For example, it is significantly faster to create an index from scratch rather than populate it through individual insertions.

In practice

Now let's see how we can apply these concepts to your particular design. The main issues I see in your case is that the insert requests are sent by many distributed clients so there is little chance for bulk processing of the inserts.

You could consider adding a caching layer in front of whatever database engine you end up having. I dont think memcached is good for implementing such a caching layer -- memcached is typically used to cache query results not new insertions. I have personal experience with VoltDB and I definitely recommend it (I have no connection with the company). VoltDB is an in-memory, scale-out, relational DB optimized for transactional workloads that should give you orders of magnitude higher insert performance than MongoDB or MySQL. It is open source but not all features are free so I'm not sure if you need to pay for a license or not. If you cannot use VoltDB you could look at the memory engine for MySQL or other similar in-memory engines.

Another optimization you can consider is to have a different database for doing the analytics. Most likely, a database with a high data ingest volume is quite bad at executing OLAP-style queries and the other way around. Coming back to my recommendation, VoltDB is no exception and is also suboptimal at executing long analytical queries. The idea would be to create a background process that reads all new data in the frontend DB (i.e. this would be a VoltDB cluster) and moves it in bulk to the backend DB for the analytics (MongoDB or maybe something more efficient). You can then apply all the optimizations above for the bulk data movement, create a rich set of additional index structures to speed up data access, then run your favourite analytical queries and save the result as a new set of tables/materialized for later access. The import/analysis process can be repeated continuously in the background.

VoltDB looks very promising for our use case. I find a lot of pros but not many cons other than what you mentioned with regards to "OLAP-style queries". Any other cons you've experienced or know about? — PostureOfLearning, Apr 28 '15 at 14:02
Some of the VoltDB limitations for analytics: (1) It does not have a capable query optimizer and you will run into serious problems if you try to execute complex SQL queries. (2) VoltDB uses data partitioning to implement concurrency control (a table is partitioned based on a key and a key range is modified only by a single thread; a query executes without interruption until the end). An analytical query would run for many minutes and touch a lot of data. This essentially means that once an analytical query starts executing no other transactions are possible until the query finishes. — Radu, Apr 29 '15 at 11:42
It is worth noting that the VoltDB website claims as a feature the ability to export data fast to a data warehouse: http://voltdb.com/products/features . I ahve not seen any claim on part of the company that VoltDB is good for analytics. They implicitly suggest having a two tier architecture where VoltDB handles the transactional workload and a specialized backend the analysis. — Radu, Apr 29 '15 at 11:47
Forgot to mention: (3) VoltDB does not yet support the full SQL 92 standard. This means some SQL features are simply not supported. — Radu, Apr 29 '15 at 11:59
With regards to real-time analysis they write: "traditional database systems are simply too slow to ingest data, analyze it in real-time, and make decisions". They also go on about "Smart Data". This implies that they are good at it, but I'm not clear on what they mean by 'real-time analysis'... — PostureOfLearning, Apr 29 '15 at 13:07
How easy/hard is it to for instance, add a column to a table? I am referring more to server down time, etc, as opposed to what sql to write. Does VoltDB need to stop? Replication to all the VoltDB servers? — PostureOfLearning, Apr 29 '15 at 13:47
No, it does not need to stop. However, I did not measure the performance impact. From what I remember from the source code, they might need to copy all the data to a new in-memory table, which might take in the order of a few seconds and require additional memory. — Radu, Apr 29 '15 at 14:23

score 0 · Answer 3 · answered Apr 28 '15 at 15:53

Tables are usually designed with the implied assumption that queries will far outnumber DML of all sorts. So the table is optimized for queries with indexes and such. If you have a table where DML (particularly Inserts) will far outnumber queries, then you can go a long way just by eliminating any indexes, including a primary key. Keys and indexes can be added to the table(s) the data will be moved to and subsequently queried from.

Fronting your web application with a NoSQL table to handle the high insert rate then moving the data more or less at your leisure to a standard relational db for further processing is a good idea.

Architecting a high performing "inserting solution"

3 Answers3