I have a couple of servers in an industrial setting (air-gapped network) that do some fairly light telemetry gathering. We generate about 10GB of telemetry history in ~30 days.
All telemetry goes into a set of tables divided into two types: current state and history. The state tables tend to have 16 rows or fewer. The history tables can be fairly large, but the total is around 11GB. Telemetry comes in at a little under 100 samples per second, and the history tables are only updated if something changes or 30 seconds have passed. From my back-of-the-envelope checks, the history update is skipped about 9 times out of 10, so most of the time each sample results in a single REPLACE INTO against one of about six state tables.
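For context, the per-sample write path looks roughly like this (table and column names here are made up for illustration, not the real schema):

    -- Typical sample: overwrite the current-state row for one unit.
    REPLACE INTO pump_state (unit_id, sample_ts, pressure, temperature)
    VALUES (3, NOW(), 412.7, 88.2);

    -- Roughly 1 sample in 10 (value changed or 30 s elapsed):
    -- also append a row to the matching history table.
    INSERT INTO pump_history (unit_id, sample_ts, pressure, temperature)
    VALUES (3, NOW(), 412.7, 88.2);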
This is all running on a stock Ubuntu Server 14.04 (64-bit) install on Supermicro 1U servers with Xeon processors from around 2015. I am not at the factory, so I cannot check the exact model.
Each server has 32GB of ECC RAM.
The disks are in a RAID 1 configuration with 4 drives (the techs at the factory do not act quickly when a drive goes bad, so we want plenty of redundancy). All disks are monitored continuously with smartctl, and when one shows a failure or warning, we replace it. In December we replaced the disks in one of the servers, and we just did the same with the other one.
On both servers, MySQL performance is usually good, with single-digit-millisecond response times on updates to the state tables. However, we get extreme outliers: multiple times per day, and generally more than once an hour, we see a single REPLACE INTO into a 16-row state table take > 1.5 seconds. This sets off alarms that we have lost telemetry, so it is more than annoying.
All tables are InnoDB with one file per table. The discard (TRIM) mount option is enabled on the ext4 file system. I tried changing the MySQL parameters to disable sync on commit (using periodic sync instead), and that seemed to have no effect. I have a 1GB InnoDB log set up, and the database files themselves are significantly smaller than RAM.
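A sketch of the kind of my.cnf settings involved (values illustrative, not copied from the production config):

    [mysqld]
    # One .ibd file per table
    innodb_file_per_table          = 1
    # "Periodic sync" instead of sync-on-commit: write the log at commit,
    # fsync roughly once per second (0 would defer the write as well)
    innodb_flush_log_at_trx_commit = 2
    # Two 512MB redo logs = the 1GB InnoDB log mentioned above
    innodb_log_file_size           = 512M
    innodb_log_files_in_group      = 2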
RAM is mostly (~60%) cached data.
I tried changing the storage engine of the state tables to MyISAM, but the problem continues unchanged.
I changed the data logger so that each table is handled by a single thread, and the threads batch queued updates into commits (sketched below). Very rarely is there more than one change in a commit, except after one of these huge delays.
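Each table's thread effectively issues something like this per batch (again a sketch with placeholder names):

    START TRANSACTION;
    REPLACE INTO pump_state (unit_id, sample_ts, pressure, temperature)
    VALUES (3, NOW(), 412.9, 88.1);
    -- Only after one of the big stalls do additional queued REPLACEs
    -- land in the same commit; otherwise it is a single statement.
    COMMIT;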
The fact that switching to MyISAM changed nothing (and I mean there was no discernible change in behavior at all) leads me to suspect the RAID.
The disks are brand-new (less than two weeks old) 1TB Crucial MX500 drives. Yes, these are consumer drives, but the write rate is fairly low, and we keep the file system under 40% full at all times.
I am at a loss what to try next. Is this a RAID problem? Is this a MySQL configuration problem?
I see the delay across all the state tables, even ones with a single row. The rows are a bit wide in some cases (one table has 125 columns), but they are still very, very small.
The state tables do have primary keys to ensure uniqueness of the data.
Anyone have any tips on where I can look next? Again, this is an industrial setting on an air-gapped network, so downloading tools one at a time is extremely painful, and all software coming in or out is tightly controlled.
I was not smart enough to install iostat on the servers when they were first set up. However, initial tests with hdparm -tT seemed to show that the underlying disks were fine, and no disks show problems in smartctl.
The disk replacements were done one at a time, so the RAID is effectively the old RAID (which was MX200-based). The RAID was not rebuilt from scratch when the disks were replaced.
There are references to a known issue with this version of MySQL (5.5.something) and REPLACE INTO, but nothing I've read suggests it should cause a latency hit this large.
Any ideas would be appreciated!