
I have a couple of servers in an industrial setting (air-gapped network) that do some fairly light telemetry gathering. We generate about 10GB of telemetry history in ~30 days.

All telemetry goes into a set of tables divided into two types: current state and history. The state tables tend to have 16 rows or fewer. History tables can be fairly large, but the total is around 11GB. Telemetry comes in at a little less than 100 samples per second and the history tables are only updated if something changes or 30 seconds has passed. From my back-of-the-envelope checks, the history update is skipped about 9 times out of 10. So most of the time, each sample results in a single REPLACE INTO into one of about six tables.
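
For context, a typical state-table write looks something like this (the table and column names here are invented for illustration; the real tables are similar in shape):

    -- Illustrative only: real table and column names differ.
    -- Each incoming sample overwrites the single row for its unit.
    REPLACE INTO line_status (unit_id, sample_time, temperature, pressure, state_flags)
    VALUES (3, NOW(), 72.4, 101.3, 4);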

This is all running on a stock Ubuntu Server 14.04 (64-bit) install on Supermicro 1U servers with Xeon processors from around 2015. I am not at the factory, so I cannot check the exact model.

Each server has 32GB of ECC RAM.

The disks are in a RAID 1 configuration with 4 drives (the techs in the factory do not act quickly when a drive goes bad, so we want plenty of redundancy). All disks are monitored continuously with smartctl, and when one shows a failure or warning, we replace it. In December, we replaced the disks in one of the servers and just did the same with the other one.

On both servers, MySQL performance is usually good, with single-digit-millisecond response times on updates to the status tables. However, we get extreme outliers. Multiple times per day, and generally more than once an hour, we see a single REPLACE INTO into a 16-row status table take more than 1.5 seconds. This sets off alarms that we have lost telemetry, so it is more than an annoyance.

All tables are InnoDB, with one file per table. Discard is enabled for the file system (ext4). I tried changing the MySQL parameters to disable sync on commit (using periodic sync instead), and that seemed to have no effect. I have a 1GB redo log set up for InnoDB, and the database files themselves are significantly smaller than RAM.
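
For anyone who wants to sanity-check the same settings, they can be read back from the server directly; for example:

    -- Confirm the commit-sync behaviour and redo log sizing currently in effect.
    SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
    SHOW GLOBAL VARIABLES LIKE 'innodb_log_file_size';
    SHOW GLOBAL VARIABLES LIKE 'innodb_log_files_in_group';
    SHOW GLOBAL VARIABLES LIKE 'innodb_file_per_table';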

RAM is mostly (~60%) cached data.

I tried changing the storage engine of the status tables to MyISAM, but the problem continues unchanged.
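
The conversion was just a plain ALTER TABLE along these lines (table name is illustrative):

    -- Switch the storage engine of a status table (illustrative name).
    ALTER TABLE line_status ENGINE=MyISAM;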

I changed the data logger so that each table is handled by a single thread, and each thread batches queued updates into commits. There is very rarely more than one change in a commit, except right after one of these huge delays.
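
In SQL terms, each logger thread effectively issues something like the following per batch (again with an illustrative table; the batching itself lives in the application code):

    -- One batch from a single per-table logger thread (illustrative names).
    START TRANSACTION;
    REPLACE INTO line_status (unit_id, sample_time, temperature, pressure, state_flags)
    VALUES (3, NOW(), 72.5, 101.2, 4);
    -- ...any other updates queued for this table while the previous commit ran...
    COMMIT;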

The fact that MyISAM did not change anything (and I mean there was no discernible change in behavior at all) leads me to suspect the RAID.

The disks are brand new (less than two weeks old) 1TB Crucial MX500 drives. Yes, these are consumer drives, but the write rate is fairly low, and we keep the file system at less than 40% full at all times.

I am at a loss as to what to try next. Is this a RAID problem? Is it a MySQL configuration problem?

I see the delay across all the state tables, even ones with a single row. The rows are a bit wide in some cases (one table has 125 columns), but they are still very, very small.

The state/status tables do have primary keys to ensure uniqueness of the data.
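
Structurally, a state table looks something like this (heavily abbreviated and with invented names; the real tables just have many more columns):

    -- Abbreviated sketch of one state table; the widest real one has ~125 columns.
    CREATE TABLE line_status (
      unit_id      INT NOT NULL,
      sample_time  DATETIME NOT NULL,
      temperature  DOUBLE,
      pressure     DOUBLE,
      state_flags  INT,
      PRIMARY KEY (unit_id)
    ) ENGINE=InnoDB;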

Anyone have any tips on where I can look next? Again, this is an industrial setting on an air-gapped network, so downloading tools one at a time is extremely painful, and all software coming in or out is tightly controlled.

I was not smart enough to install iostat on the servers when they were first set up. However, the original tests with hdparm -tT seemed to show that the underlying disks were fine. No disks show problems in smartctl.

The disk replacements were done one at a time, so the RAID is effectively the old RAID (which was MX200-based). The RAID was not rebuilt from scratch when the disks were replaced.

There are references to a known issue with this version of MySQL (5.5 something) and REPLACE INTO, but nothing I've read suggests it should cause a performance hit that big.
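
For what it's worth, the commonly suggested alternative to REPLACE INTO is INSERT ... ON DUPLICATE KEY UPDATE, which updates the existing row in place rather than deleting and re-inserting it; something like this (illustrative names again):

    -- Same logical write as the REPLACE INTO above, without the delete + insert.
    INSERT INTO line_status (unit_id, sample_time, temperature, pressure, state_flags)
    VALUES (3, NOW(), 72.4, 101.3, 4)
    ON DUPLICATE KEY UPDATE
      sample_time = VALUES(sample_time),
      temperature = VALUES(temperature),
      pressure    = VALUES(pressure),
      state_flags = VALUES(state_flags);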

Any ideas would be appreciated!

Kyle
  • Additional information request. Post on pastebin.com or here. A) complete (not edited) my.cnf or my.ini; text results of B) SHOW GLOBAL STATUS; after a minimum of 24 hours uptime, C) SHOW GLOBAL VARIABLES;, and D) a complete MySQLTuner report if readily available. Optional but very helpful information, if available: htop OR top OR mytop for the most active apps, ulimit -a for a linux/unix list of limits, iostat -xm 5 3 when the system is busy for an idea of IOPS by device, and df -h for a linux/unix free space list by device, for server tuning analysis. – Wilson Hauck Dec 08 '18 at 18:58
  • Unfortunately, this is in a locked-down industrial network off of the Internet. I can try to get permission to copy the data, but it is unlikely to be granted. – Kyle Dec 09 '18 at 23:42
  • Then, why ask for ideas that would be appreciated! Sounds like you are in a difficult situation. – Wilson Hauck Dec 10 '18 at 00:13

1 Answer


Stalling during writes (which is most of your workload) might indicate that the InnoDB redo log is full and the server is waiting for it to be flushed. The default innodb_log_file_size in 5.5 is horribly small. Increasing the size to 512M and the number of log files (innodb_log_files_in_group) to 4 would be a good start; follow the reference below. Watch the timestamp difference on the log files (ib_logfile* at the top level of the datadir) during data load: if they are all within roughly the same minute, they aren't big enough. Also look at the SHOW ENGINE INNODB STATUS output.
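
You can get a rough idea of redo log pressure without extra tools; for example:

    -- Sample this twice, a minute apart, and diff Innodb_os_log_written to see
    -- how much redo is being generated per minute.
    SHOW GLOBAL STATUS LIKE 'Innodb_os_log%';

    -- The LOG section shows how far the checkpoint lags behind the log sequence number.
    SHOW ENGINE INNODB STATUS\G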

ref: changing redo log size (manual). I'd move the old log files out of the way rather than delete them, so you can move them back if required. Backups save jobs.

The innodb_buffer_pool_size should also be set large enough to hold the active working set (70% of available RAM is a good start; then look at SHOW GLOBAL STATUS to see how much is actually used).
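
For example, compare the total and free page counts against the configured pool size:

    -- How much of the buffer pool is actually in use.
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_total';
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_free';
    SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';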

Ensuring the slow query log is enabled with an appropriate threshold will also help detect other slow queries.

ref: slow query log (manual)
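
Both the log and the threshold can be changed at runtime on 5.5 (add the same settings to my.cnf so they survive a restart); for example:

    -- Log anything slower than 100 ms; the file path here is just an example.
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 0.1;
    SET GLOBAL slow_query_log_file = '/var/log/mysql/mysql-slow.log';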

danblack
  • Huh. Managed to miss this. Sorry! I will double check the log file sizes. I am also going to be upgrading these servers in the next month or two to Ubuntu 16.04 and probably all the way to 18.04. – Kyle Dec 10 '18 at 03:57