On 2015-04-28 7:51 PM, Andrew Findlay wrote:
Did you get to the bottom of this?
Yes.
On Thu, Apr 23, 2015 at 08:29:48PM +1000, Geoff Swan wrote:
On 2015-04-23 5:56 PM, Howard Chu wrote:
In normal (safe) operation, every transaction commit performs 2 fsyncs. Your 140MB/s throughput spec isn't relevant here; your disk's IOPS rate is what matters. You can use NOMETASYNC to do only 1 fsync per commit.
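For context, the flag Howard is referring to is LMDB's MDB_NOMETASYNC environment flag, which syncs the data pages on commit but defers the separate metadata fsync. A minimal C sketch of opening an environment with it (the path and map size are made-up placeholders, and the directory has to exist already):

#include <lmdb.h>
#include <stdio.h>

int main(void)
{
    MDB_env *env;
    int rc;

    if ((rc = mdb_env_create(&env)) != 0) {
        fprintf(stderr, "mdb_env_create: %s\n", mdb_strerror(rc));
        return 1;
    }

    /* illustrative 1GB map size */
    mdb_env_set_mapsize(env, 1UL * 1024 * 1024 * 1024);

    /* MDB_NOMETASYNC: flush data pages on commit but defer the metadata
     * page sync, so each commit costs one fsync instead of two. */
    if ((rc = mdb_env_open(env, "/tmp/testdb", MDB_NOMETASYNC, 0664)) != 0) {
        fprintf(stderr, "mdb_env_open: %s\n", mdb_strerror(rc));
        return 1;
    }

    /* ... normal transactions here ... */

    mdb_env_close(env);
    return 0;
}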
Decent SAS disks spin at 10,000 or 15,000 RPM, so unless there is a non-volatile memory cache in there I would expect at most 15000/60 = 250 fsyncs per second per drive, which at two fsyncs per commit gives 125 transaction commits per second per drive.
These are Enterprise SAS drives with onboard read and write cache systems.
OK. I ran a reduced version of the test script (20 processes each performing 40 read/write operations) in normal (safe) mode of operation on a test server that has 32GB RAM but is otherwise identical to the server with 128GB.
So that is just 800 operations taking 60s?
A quick test using vmstat at 1s intervals gave the following output whilst it was running.
procs ---------------memory-------------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd     free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
20  0      0 32011144 167764 330416    0    0     1    15   40   56  0  0 99  1  0
 0  0      0 31914848 167764 330424    0    0     0  1560 2594 2130  2  1 97  0  0
 0  0      0 31914336 167764 330424    0    0     0  1708  754 1277  0  0 100 0  0
 0  0      0 31914508 167772 330420    0    0     0  2028  779 1300  0  0 99  1  0

The script took about 60s to complete, which is a lot longer than expected. It appears to be almost entirely I/O bound, at a fairly slow rate (1500 blocks per second is 6MB/s).
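The original script isn't reproduced here, but a rough C sketch of an equivalent workload would be 20 forked processes each committing 40 small write transactions (the key/value names, sizes and the /tmp/testdb path are invented for illustration; reads are omitted since the synchronous writes dominate):

#include <lmdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS 20
#define NOPS   40

/* Each child opens its own environment after fork() (an LMDB environment
 * must not be carried across a fork) and commits NOPS small write
 * transactions; in the default safe mode every commit fsyncs. */
static void worker(int id)
{
    MDB_env *env;
    MDB_dbi dbi;
    MDB_txn *txn;
    MDB_val key, data;
    char kbuf[32], vbuf[64];
    int i;

    mdb_env_create(&env);
    mdb_env_set_mapsize(env, 1UL * 1024 * 1024 * 1024);
    if (mdb_env_open(env, "/tmp/testdb", 0, 0664) != 0)
        exit(1);

    for (i = 0; i < NOPS; i++) {
        snprintf(kbuf, sizeof kbuf, "key-%d-%d", id, i);
        snprintf(vbuf, sizeof vbuf, "value-%d-%d", id, i);
        key.mv_size = strlen(kbuf);   key.mv_data = kbuf;
        data.mv_size = strlen(vbuf);  data.mv_data = vbuf;

        mdb_txn_begin(env, NULL, 0, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        mdb_put(txn, dbi, &key, &data, 0);
        mdb_txn_commit(txn);          /* the fsync(s) happen here */
    }
    mdb_env_close(env);
    exit(0);
}

int main(void)
{
    int i;
    for (i = 0; i < NPROCS; i++)
        if (fork() == 0)
            worker(i);
    for (i = 0; i < NPROCS; i++)
        wait(NULL);
    return 0;
}

That is 800 synchronous commits in total, which at the 125 commits/s/drive estimate above lines up with a run time in the tens of seconds.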
As you say, it is I/O bound (wa ~= 100%). Stop worrying about MB/s: the data rate is irrelevant; what matters is synchronous small-block writes, and those are limited by rotation speed.
Are you absolutely certain that the disks are SAS? Does your disk controller believe it? I had big problems with an HP controller once that refused to run SATA drives at anything like their full speed as it waited for each transaction to finish and report back before queuing the next one...
Yes, they are SAS drives and the driver recognises them as such, connected to a C600 controller.
Andrew
Did a lot of testing over the last week or so. It appears to be fundamentally a Linux block layer problem. An fsync operation appears to set the FUA flag on the SCSI command to force it to bypass the write cache. This is a real problem, since it defeats the intelligence built into the SCSI controller for handling the write cache; consequently we see a seek time on each 4K block transaction. It seems to be hard-wired and buried in the block layer. It would be nice to have a mount option to prevent this from happening on certain mounted volumes.
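One way to see that per-transaction cost directly, independent of LMDB, is to time repeated 4K write+fsync cycles on the filesystem in question; a minimal sketch (the default path is an invented placeholder and the numbers are only indicative):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Time N cycles of "write one 4K block, fsync". If every fsync is forced
 * to the platter (FUA / cache flush), the average will be several
 * milliseconds; if the drive's write cache absorbs it, far less. */
int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/data/fsync-test.dat";
    const int n = 200;
    char buf[4096];
    struct timespec t0, t1;
    double secs;
    int fd, i;

    memset(buf, 'x', sizeof buf);
    fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < n; i++) {
        if (pwrite(fd, buf, sizeof buf, 0) != sizeof buf) { perror("pwrite"); return 1; }
        if (fsync(fd) != 0) { perror("fsync"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d write+fsync cycles in %.2fs: %.2f ms each, %.0f/s\n",
           n, secs, secs * 1000 / n, n / secs);
    close(fd);
    return 0;
}

At the ~250 platter-bound fsyncs per second per drive estimated earlier in the thread, this works out to roughly 4ms per cycle.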
However, there was significant improvement in the 3.19.5 kernel, where multi-queue support can be enabled for SCSI operations. It still seems to bypass the write cache on the drive, but the performance is much better.
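Presumably this means enabling the blk-mq path for SCSI, which on kernels of that era is an opt-in boot parameter (this is an assumption; the post doesn't say how it was switched on):

scsi_mod.use_blk_mq=1

added to the kernel command line, on a kernel built with SCSI multi-queue support.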
Another area that helped was VM tuning, though this is fairly sensitive (like a high-Q bandpass filter). Reducing vm.dirty_expire_centisecs from 30s to 15s improved things in this environment, which can build up a lot of dirty pages; making them expire a bit sooner gives less bumpy cache flushing.
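For reference, the corresponding sysctl setting (the value is in centiseconds, so 1500 = 15s, and the 30s figure is the kernel default of 3000; the file name below is just an illustrative choice):

# /etc/sysctl.d/99-writeback.conf
# Expire dirty pages after 15s instead of the default 30s (3000)
vm.dirty_expire_centisecs = 1500

Apply with "sysctl --system", or at runtime with "sysctl -w vm.dirty_expire_centisecs=1500".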