
I'm writing a bit of software that needs to flatten data from a hierarchical type of format into tabular format. Instead of doing it all in a programming language every time and serving it up, I want to cache the results for a few seconds, and use SQL to sort and filter. When in use, we're talking 400,000 writes and 1 or 2 reads over the course of those few seconds.

Each table will contain 3 to 15 columns. Each row will contain from 100 bytes to 2,000 bytes of data, although it's possible that in some cases, some rows may get up to 15,000 bytes. I can clip data if necessary to keep things sane.

The main options I'm considering are:

MySQL's Memory engine

A good option, almost specifically written for my use case! But: "MEMORY tables use a fixed-length row-storage format. Variable-length types such as VARCHAR are stored using a fixed length. MEMORY tables cannot contain BLOB or TEXT columns." Unfortunately, I do have text fields with lengths up to maybe 10,000 characters, and even that number isn't a hard limit. I could adjust the VARCHAR length based on the maximum length of each text column as I loop through doing my flattening, but that's not totally elegant. Also, for my occasional 15,000-character row, does that mean I need to allocate 15,000 characters for every row in the table? If there were 100,000 rows, that's about 1.5 GB, not including overhead!
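
For illustration, the workaround I have in mind looks roughly like this; the table and column names are made up, and the declared lengths would come from a first pass over the data:

  /* Hypothetical sketch: size each VARCHAR to the longest value actually
     seen while flattening. MEMORY still stores every VARCHAR at its full
     declared length per row, so tight declarations matter here. */
  CREATE TABLE flattened_cache (
      id          INT NOT NULL,
      status      VARCHAR(32),
      created_at  DATETIME,
      description VARCHAR(9800),   /* longest description seen on this pass */
      PRIMARY KEY (id)
  ) ENGINE=MEMORY;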

InnoDB on RAMDisk

This is meant to run in the cloud. I could easily spin up a server with 16 GB of RAM, configure MySQL to write to tmpfs, and use full-featured MySQL. My concern with this is space. While I'm sure the MEMORY engine was written to keep itself from consuming all available memory and crashing the server, I doubt this solution would know when to stop. How much actual space will my 2,000 bytes of data consume once it's in database format? How can I monitor it?
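
If it matters, I assume I could at least watch the footprint with something like this (the schema name is a placeholder, and information_schema sizes are only estimates):

  /* Rough monitoring of how much space the cached tables are taking up. */
  SELECT table_name,
         engine,
         table_rows,
         data_length + index_length AS approx_bytes
  FROM information_schema.tables
  WHERE table_schema = 'cache_db';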

Bonus Questions

Indexes I will in fact know in advance which columns need to be filtered and sorted on. I could set up the indexes before I do the inserts, but what kind of performance gain could I honestly expect on top of a RAM disk? How much extra overhead do indexes add?
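
Concretely, I mean something like this, reusing the made-up columns from the sketch above:

  /* Hypothetical: the filter and sort columns are known up front,
     so the indexes can be declared before the bulk insert. */
  CREATE INDEX idx_status  ON flattened_cache (status);
  CREATE INDEX idx_created ON flattened_cache (created_at);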

Inserts I'm assuming that inserting multiple rows with one query is faster. But that one query, or series of large queries, is itself held in memory while we're writing to memory, so doing it that way would momentarily double my memory needs. So then we're talking about inserting one, two, or a hundred rows at a time and waiting for each batch to complete before processing more. InnoDB doesn't lock the table, but I worry about sending two queries too close together and confusing MySQL. Is this a valid concern? With the MEMORY engine I'd definitely have to wait for completion, due to table locks.
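
For example, rather than one 400,000-row statement, I could send batches like this (batch size and values picked arbitrarily):

  /* Hypothetical batching: a few hundred rows per INSERT caps the size of any
     single statement held in memory while still cutting per-query overhead. */
  INSERT INTO flattened_cache (id, status, created_at, description) VALUES
      (1,   'open',   '2013-07-25 18:00:00', 'first flattened row'),
      (2,   'closed', '2013-07-25 18:00:01', 'second flattened row'),
      /* ...a few hundred rows per batch... */
      (500, 'open',   '2013-07-25 18:00:09', 'last row of this batch');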

Temporary Are there any benefits to temporary tables other than the fact that they're deleted when the db connection closes?

Kavi Siegel
  • Have you ruled out MyISAM / RAMDisk for any reason? It doesn't sound like you need the transaction semantics InnoDB would offer. – O. Jones Jul 25 '13 at 18:21
  • MyISAM is a brittle storage format. If your server crashes or loses power in the middle of a big operation, the whole table can be trashed. InnoDB has a journal that allows it to recover from almost anything, usually safely, sometimes in read-only mode. – tadman Jul 25 '13 at 18:24
  • What are you doing with the data that you need to put it in the database in the first place? Sometimes an RDBMS is the wrong tool for the job if you're just processing it line by line. You can cache this in a flat file quite easily. – tadman Jul 25 '13 at 18:26
  • @tadman - I'm using SQL to aggregate and filter the data that's already coming from a flat file. It's also actually completely OK if the database gets trashed on a crash, because the original flat-file data is constantly being updated in any part of the file, and the whole thing needs to be re-parsed every time the data is accessed. – Kavi Siegel Jul 25 '13 at 19:04
  • @OllieJones I was afraid of full table locks when writing to MyISAM. However, I'm sure that if I do one large insert, or a few smaller ones and wait for completion, it wouldn't be that bad. Concurrent writes could happen otherwise; it's race-condition bait if not done carefully. I'm sure it would save overhead, though. Any other pros for it? – Kavi Siegel Jul 25 '13 at 19:05
  • It really sounds like what you need is a document store and something that parses those documents. A system like [MongoDB](http://www.mongodb.org/) can execute JavaScript functions against your documents to do any processing required on an ad-hoc basis. – tadman Jul 25 '13 at 19:13
  • @tadman - Fun fact: MongoDB is exactly where the data is coming from :P Sorry I misled you; for simplicity's sake when asking the question, I figured each Mongo document could be considered a flat file of its own, hence "data from a hierarchical type of format into tabular format". I'm actually joining 4 Mongo collections and flattening 4-level-deep subdocuments into a 2D table. MongoDB doesn't do aggregation as easily as MySQL, though. Plus there are already JavaScript libraries with great interfaces for traversing SQL data. – Kavi Siegel Jul 25 '13 at 19:24
  • This question is really hard to answer without having a sense of the kind of processing you're trying to do. There's no magic bullet for this, it's highly specific to your situation. I'd investigate trying to do the parsing within MongoDB because JavaScript can do a lot of heavy lifting if you use it properly. – tadman Jul 25 '13 at 19:30
  • Can you please edit your question to clarify a few things? To what extent do you need concurrent reading and writing? You say you have 15K writes and 1-2 reads. Are the reads using some kind of summary query? What's the consequence if you do a read when only part of the table is written? – O. Jones Jul 25 '13 at 21:20

1 Answer


I suggest you use MyISAM. Create your table with appropriate indexes for your query. Then disable keys, load the table, and enable keys.

I suggest you develop a discipline like this for your system. I've used a similar discipline very effectively.

Keep two copies of the table. Call one table_active and the second one table_loading.

When it's time to load a new copy of your data, use commands like this.

  ALTER TABLE table_loading DISABLE KEYS;
  /* do your insertions here, to table_loading */
  /* consider using LOAD DATA INFILE if it makes sense.  */
  ALTER TABLE table_loading ENABLE KEYS;  /* this will take a while */
  /* at this point, suspend your software that's reading table_active */
  RENAME TABLE table_active TO table_old;
  RENAME TABLE table_loading TO table_active;
  /* now you can resume running your software */
  TRUNCATE TABLE table_old;
  RENAME TABLE table_old TO table_loading;

Alternatively, you can DROP TABLE table_old; and create a new table for table_loading instead of the last rename.
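
A minimal sketch of that variant, assuming the same table names as above:

  /* Alternative ending: drop the old copy and recreate an empty
     loading table with the same structure as the active one. */
  DROP TABLE table_old;
  CREATE TABLE table_loading LIKE table_active;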

This two-table (double-buffered) strategy should work pretty well. It will create some latency because your software that's reading the table will work on an old copy. But you'll avoid reading from an incompletely loaded table.

I suggest MyISAM because you won't run out of RAM and blow up, and you won't have the MEMORY engine's fixed-row-length overhead or InnoDB's transaction overhead. But you might also consider MariaDB and the Aria storage engine, which does a good job of exploiting RAM buffers.

If you do use the MEMORY storage engine, be sure to tweak your max_heap_table_size system variable. If your read queries will use index range scans (sequential index access), be sure to specify BTREE-style indexes. See here: http://dev.mysql.com/doc/refman/5.1/en/memory-storage-engine.html
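
For example (the size, table, and column names below are placeholders, not recommendations):

  /* Hypothetical MEMORY setup: raise the per-table size limit before creating
     the table, and request BTREE so range scans and ORDER BY can use the index
     (MEMORY defaults to HASH). */
  SET SESSION max_heap_table_size = 2147483648;   /* ~2 GB */
  CREATE TABLE your_memory_table (
      id         INT NOT NULL,
      created_at DATETIME,
      PRIMARY KEY (id),
      INDEX idx_created USING BTREE (created_at)
  ) ENGINE=MEMORY;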

O. Jones
  • RENAME TABLE could trigger heavy disk I/O when disk-based temporary tables need to be created, so it's not the best way to go. If you use MEMORY and your queries only use =, be sure to create HASH indexes; they will be faster. But if your queries use <=, =, or >=, be sure to create BTREE indexes. – Raymond Nijland Aug 16 '13 at 16:29
  • If you're using myisam tables for both table_active and table_loading, there's no need to suspend. You can swap the tables out in a single atomic operation: RENAME TABLE table_active to table_old, table_loading to table_active; It just updates pointers, so it should complete almost instantly, and it's an atomic operation, so there's no risk of the table being accessed while the swap is happening. – Nathan Stretch May 16 '14 at 04:45