
I have these two servers with the following specs:

OS: RHEL 6.3
Hardware: Dell PowerEdge R610, 12 cores, 64GB RAM
Drives: 6 x Samsung 840 Pro SSD
RAID controller: Intel RS25AB080, 1GB cache
RAID level: 5

When we test disk performance using the dd command on server "A," we get an average of 333MB/sec.

When we test disk performance using the dd command on server "B," we get an average of 40MB/sec.

I am using the following command:

dd if=/dev/zero of=testfile bs=3G count=1 oflag=dsync

I am unable to figure out why we get such terrible performance on server B.

The server is a standby cluster node for a MySQL database cluster. The active MySQL services are currently running on the other server, so this node is essentially idle. The only significant processes running on it are corosync, pacemaker, and drbd.

  • You're going to need to share a lot more information if you want help. What have you tried to solve the problem? How have you validated the disk is set up correctly and the hardware is working? – Tim Dec 07 '16 at 03:23
  • Well, how can I validate it? I came here looking for ideas and things that I should check and keep an eye out for. – ConqueringDev Dec 07 '16 at 03:27
  • Show RAID controller status on both servers via `storcli` command. – Mikhail Khirgiy Dec 07 '16 at 05:38
  • Is the raid card in patrol read or a full consistency check? Is the BBU charged and working? Is the array in write back mode? What is the output of `iostat`? What's the output of `megacli -LDInfo -LALL -aALL`? – Gmck Dec 07 '16 at 06:31
  • Do you have write barriers enabled? Can you post the output of `mount` and `df -h`? – ewwhite Jan 05 '17 at 00:56
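
Pulling the commands suggested in these comments together, a quick first pass on both servers might look like the sketch below; the `megacli`/`storcli` calls assume an LSI-compatible controller with the vendor CLI installed, and `/c0` is a placeholder for the controller number:

```
# RAID controller and array state (LSI-compatible CLI assumed)
megacli -LDInfo -LALL -aALL             # logical drive state, cache policy (write-back vs. write-through)
megacli -AdpBbuCmd -GetBbuStatus -aALL  # BBU charge and health
storcli /c0 show all                    # same overview via storcli, if installed

# Host-side view while the dd test is running
iostat -x 1
mount
df -h
```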

1 Answer


The first thing to check is the SSD firmware version. Samsung has released firmware updates for these drives that improve performance and reliability. Also check the RAID controller firmware.
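
As a rough sketch of how to compare firmware on both boxes (the device name, the megaraid pass-through slot numbers and the `megacli` tool are assumptions that depend on how the Intel controller exposes the drives):

```
# Drive firmware as seen by the OS (device name is an example)
smartctl -i /dev/sda | grep -i firmware

# If the SSDs sit behind the RAID controller, smartctl may need the
# megaraid pass-through; slots 0-5 here are assumptions
for slot in 0 1 2 3 4 5; do
    smartctl -i -d megaraid,$slot /dev/sda | grep -i firmware
done

# Controller firmware, assuming an LSI/MegaRAID-compatible CLI
megacli -AdpAllInfo -aALL | grep -i 'fw package'
```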

Second, check the machine configuration: use `sysctl -a` to confirm that both servers are running with the same kernel settings.
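
For example, dump and diff the settings from both machines (the hostnames are placeholders):

```
# Collect sorted kernel settings from each server, then compare
ssh server-a 'sysctl -a 2>/dev/null | sort' > /tmp/sysctl-a.txt
ssh server-b 'sysctl -a 2>/dev/null | sort' > /tmp/sysctl-b.txt
diff -u /tmp/sysctl-a.txt /tmp/sysctl-b.txt

# Also worth comparing the I/O scheduler per block device
grep . /sys/block/sd*/queue/scheduler
```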

From what you describe, I suspect that the SSDs in one server are "full" while those in the other still have free space. A nearly full filesystem makes the problem worse, but by "full" I mean that the SSD firmware has no "empty" blocks left to use: even if the filesystem reports plenty of free space, every flash block may already have been written to, so each write forces a garbage-collection pass to free up blocks before it can complete. The SSDs in the other server still have free blocks, so they can quickly find one and write whatever they need there.

Usually you "free" SSD blocks with the `discard` mount option or (usually the more recommended approach) the `fstrim` tool. Check this link for more info.
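
For example (the mount point is a placeholder; also note that many hardware RAID controllers do not pass TRIM through to the drives, in which case `fstrim` will fail or do nothing):

```
# One-off TRIM of a mounted filesystem
fstrim -v /var/lib/mysql

# Or schedule it instead of mounting with the discard option
# (weekly at 03:00 is only a suggestion)
cat > /etc/cron.d/fstrim <<'EOF'
0 3 * * 0 root /sbin/fstrim -v /var/lib/mysql
EOF
```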

In the extreme case, you can do an SSD memory cell clearing (secure erase), which wipes the whole SSD. Of course you will lose all the data on it, so only do this after backing everything up. The link above also has more information about this.
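
If you do go that route, the usual tool is `hdparm` with the drive attached directly to a SATA port rather than through the RAID controller; `/dev/sdX` is a placeholder and this wipes the drive completely:

```
# The drive must report "not frozen" for secure erase to work
hdparm -I /dev/sdX | grep -A8 Security

# Set an arbitrary temporary password, then issue the secure erase
hdparm --user-master u --security-set-pass Eins /dev/sdX
hdparm --user-master u --security-erase Eins /dev/sdX
```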

Finally, if you cannot mount with `discard` or run `fstrim` (not all RAID controllers allow them), I would recommend doing the SSD memory cell clearing and then partitioning the SSDs so that at least 10% to 20% of each drive is left unallocated (free space outside any partition). The firmware can then treat those blocks as free and do its garbage collection ahead of time, keeping enough free blocks available to avoid a garbage-collection pass on every write.
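
A minimal sketch of that layout with `parted`, assuming a freshly erased drive and `/dev/sdX` as a placeholder (with a hardware RAID virtual disk you would instead leave part of the virtual drive, or of each physical drive, unconfigured):

```
# Create a GPT label and a single partition covering ~80% of the drive,
# leaving the last ~20% unallocated as over-provisioning space
parted -s /dev/sdX mklabel gpt
parted -s /dev/sdX mkpart primary 1MiB 80%
```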

higuita