On 2015-04-23 4:07 PM, Ulrich Windl wrote:
Geoff Swan gswan3@bigpond.net.au wrote on 22.04.2015 at 23:18 in message
5538100A.3060301@bigpond.net.au:
[...]
The free stats look fine to me. No swap is being used, nor has any been used yet, on this system.
             total       used       free     shared    buffers     cached
Mem:     132005400   21928192  110077208      10612     363064   19230072
-/+ buffers/cache:    2335056  129670344
Swap:      8388604          0    8388604
The effect I am seeing is that, despite a reasonably fast disc system, the kernel's writeback of dirty pages is painfully slow.
Disk I/O stats are also available via sar. May we see them? Finally, there's blktrace, which lets you follow the timing and position of each individual block being written, but it's not quite easy to run and analyze (unless I missed the easy way).
I suspect scattered writes that bring your I/O rate down.
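For completeness, the usual blktrace capture/analyse cycle is something like this (the device name is just an example):

  blktrace -d /dev/sda -w 30 -o trace      # capture ~30s of block-layer events for sda
  blkparse -i trace -d trace.bin | less    # human-readable per-request timings and sector positions
  btt -i trace.bin                         # summary latency statistics from the same capture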
Regards, Ulrich
sysstat (and sar) is not installed on this system, so I can't give you that output.
Geoff Swan wrote:
[...]
sysstat (and sar) is not installed on this system, so I can't give you that output.
As I said in my previous email - use iostat and/or vmstat to monitor paging activity and system wait time. You can also install atop and get all of the relevant stats on one screen.
I find it indispensable these days.
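If it helps, a minimal set of commands for this (illustrative; iostat comes from the sysstat package and the exact flags vary slightly between versions):

  vmstat 1          # si/so show paging, wa shows time spent waiting on I/O
  iostat -dxk 1     # per-device r/s, w/s, await and %util
  atop 1            # CPU, memory, paging and disk activity on one screen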
On 2015-04-23 4:43 PM, Howard Chu wrote:
[...]
As I said in my previous email - use iostat and/or vmstat to monitor paging activity and system wait time. You can also install atop and get all of the relevant stats on one screen.
I find it indispensable these days.
Thanks, I'll look at building it. I have used nmon for a while to view cpu/vm/mem/diskio stats at once.
The system is very lightly loaded at the moment.
# vmstat -w 2 10
procs ---------------memory-------------- ---swap-- -----io---- -system-- ------cpu-----
 r  b  swpd      free   buff    cache   si   so    bi    bo    in    cs  us sy id wa st
 0  1     0 109758112 391744 19534916    0    0     3    25     3     2   2  0 97  1  0
 0  0     0 109758304 391752 19534920    0    0     0  1160  1852  1360   1  0 97  2  0
 1  1     0 109755448 391752 19534924    0    0     0   622  1899  1244   1  0 97  1  0
 0  1     0 109755536 391752 19534928    0    0     0  2444  1917  1427   1  0 97  2  0
 1  1     0 109756784 391756 19534940    0    0     0   116  4987  2658   3  1 96  1  0
 3  1     0 109755440 391756 19534948    0    0     0   396  6009  3341   3  2 94  2  0
 1  1     0 109756176 391760 19534964    0    0     0   306  6312  3260   3  2 94  1  0
 0  1     0 109757840 391760 19534976    0    0     0   320  6120  3293   3  2 94  1  0
 0  1     0 109755968 391760 19534984    0    0     0   222  4617  2569   2  1 96  1  0
 1  1     0 109754528 391760 19534988    0    0     0   256  2002  1434   1  0 98  1  0
On 2015-04-23 4:43 PM, Howard Chu wrote:
[...]
As I said in my previous email - use iostat and/or vmstat to monitor paging activity and system wait time. You can also install atop and get all of the relevant stats on one screen.
I find it indispensable these days.
BTW, just using nmon whilst running a test script that writes 20K small objects shows a write speed of around 3-5 MB/s to the disc (without dbnosync, so as to see the immediate effects). This is on a SAS disc with a 140MB/s capability, checked using regular tools such as dd (writing 1000 files of 384K each). I'm failing to understand why the write operations from the kernel page cache are so drastically slower. With dbnosync enabled, LMDB writes to the pages fast (as expected), however the pdflush that follows writes at the same slow speed, causing delays for further processes.
Geoff Swan wrote:
[...]
BTW, just using nmon whilst running a test script that writes 20K small objects shows a write speed of around 3-5 MB/s to the disc (without dbnosync, so as to see the immediate effects). This is on a SAS disc with a 140MB/s capability, checked using regular tools such as dd (writing 1000 files of 384K each). I'm failing to understand why the write operations from the kernel page cache are so drastically slower. With dbnosync enabled, LMDB writes to the pages fast (as expected), however the pdflush that follows writes at the same slow speed, causing delays for further processes.
In normal (safe) operation, every transaction commit performs 2 fsyncs. Your 140MB/s throughput spec isn't relevant here, your disk's IOPS rate is what matters. You can use NOMETASYNC to do only 1 fsync per commit.
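For back-mdb this is set via envflags; roughly as follows (see slapd-mdb(5) for the exact syntax):

  # slapd.conf, in the "database mdb" section
  envflags nometasync

  # or the cn=config equivalent on the database entry
  olcDbEnvFlags: nometasync

  # at the LMDB API level this corresponds to the MDB_NOMETASYNC flag to mdb_env_open()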
On 2015-04-23 5:56 PM, Howard Chu wrote:
[...]
In normal (safe) operation, every transaction commit performs 2 fsyncs. Your 140MB/s throughput spec isn't relevant here, your disk's IOPS rate is what matters. You can use NOMETASYNC to do only 1 fsync per commit.
OK. I ran a reduced version of the test script (20 processes each performing 40 read/write operations) with normal (safe) mode of operation on a test server that has 32GB RAM, and everything else identical to the server with 128GB. A quick test using vmstat at 1s intervals gave the following output whilst it was running.
procs ---------------memory-------------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd     free   buff  cache   si   so    bi    bo    in    cs us sy  id wa st
20  0    0 32011144 167764 330416    0    0     1    15    40    56  0  0  99  1  0
 0  0    0 31914848 167764 330424    0    0     0  1560  2594  2130  2  1  97  0  0
 0  0    0 31914336 167764 330424    0    0     0  1708   754  1277  0  0 100  0  0
 0  0    0 31914508 167772 330420    0    0     0  2028   779  1300  0  0  99  1  0
 0  0    0 31913956 167772 330428    0    0     0  1996   936  1354  0  0 100  0  0
 0  0    0 31914272 167772 330428    0    0     0  1588   785  1286  0  0 100  0  0
 0  0    0 31914356 167780 330424    0    0     0  1604   832  1388  0  0  99  1  0
 0  0    0 31914572 167780 330428    0    0     0  1240  1258  1910  0  0 100  0  0
 0  0    0 31914596 167780 330428    0    0     0  1776   905  1626  0  0 100  0  0
 0  0    0 31914812 167780 330428    0    0     0  1340  1117  1731  0  0 100  0  0
 0  0    0 31914796 167788 330420    0    0     0  1504  1154  1883  0  0 100  0  0
 0  0    0 31914936 167788 330428    0    0     0  1476   835  1390  0  0 100  0  0
 0  0    0 31914844 167796 330432    0    0     0  1140  1203  1891  0  0 100  0  0
 0  0    0 31914744 167796 330432    0    0     0  1600  1034  1667  0  0 100  0  0
 0  0    0 31914976 167796 330432    0    0     0  1488  1149  1715  0  0 100  0  0
 2  0    0 31915052 167796 330432    0    0     0  1248  1249  1930  0  0 100  0  0
 0  0    0 31914316 167796 330432    0    0     0  1736  1076  1625  0  0  99  0  0
 0  0    0 31914680 167804 330424    0    0     0  1176  1103  1751  0  0 100  0  0
 0  0    0 31915108 167804 330432    0    0     0  1476  1063  1776  0  0 100  0  0
 0  0    0 31914796 167804 330432    0    0     0  1660   967  1594  0  0 100  0  0
 0  0    0 31914820 167812 330424    0    0     0  1336  1358  2124  0  0 100  0  0
 0  0    0 31914904 167812 330432    0    0     0  1608   857  1464  0  0 100  0  0
 0  0    0 31914972 167828 330420    0    0     0  1316  1061  1741  0  0  99  1  0
 0  0    0 31915140 167828 330428    0    0     0  1320  1103  1751  0  0 100  0  0
 0  0    0 31915756 167828 330436    0    0     0  1604  1049  1544  0  0  99  0  0
 0  0    0 31915708 167828 330436    0    0     0  1244  1358  2052  0  0 100  0  0
 0  0    0 31916008 167828 330436    0    0     0  1620   912  1595  0  0 100  0  0
 0  0    0 31915964 167836 330428    0    0     0  1492  1105  1715  0  0 100  0  0
 0  0    0 31915916 167836 330436    0    0     0  1376  1164  1878  0  0 100  0  0
 0  0    0 31915924 167836 330436    0    0     0  1484   848  1390  0  0 100  0  0
 0  0    0 31916176 167844 330428    0    0     0  1368  1447  2123  0  0  99  0  0
 0  0    0 31916864 167844 330444    0    0     0  1444  1355  1832  0  0  99  1  0
The script took about 60s to complete, which is a lot longer than expected. It appears to be almost entirely I/O bound, at a fairly slow rate (1500 blocks per second is 6MB/s).
I'll build atop or iostat to get more detail.
Did you get to the bottom of this?
On Thu, Apr 23, 2015 at 08:29:48PM +1000, Geoff Swan wrote:
On 2015-04-23 5:56 PM, Howard Chu wrote:
In normal (safe) operation, every transaction commit performs 2 fsyncs. Your 140MB/s throughput spec isn't relevant here, your disk's IOPS rate is what matters. You can use NOMETASYNC to do only 1 fsync per commit.
Decent SAS disks spin at 10,000 or 15,000 RPM so unless there is a non-volatile memory cache in there I would expect at most 15000/60 = 250 fsyncs per second per drive, giving 125 transaction commits per second per drive.
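If you want to measure that directly rather than infer it from the spin speed, something along these lines should do it (the path, size and runtime are just placeholders):

  # 4K random writes with an fsync after every write -- roughly what a
  # commit-per-transaction workload looks like to the drive
  fio --name=fsync-test --filename=/path/on/that/disk/testfile \
      --rw=randwrite --bs=4k --size=256m --fsync=1 --runtime=30 --time_based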
OK. I ran a reduced version of the test script (20 processes each performing 40 read/write operations) with normal (safe) mode of operation on a test server that has 32GB RAM, and everything else identical to the server with 128GB.
So that is just 800 operations taking 60s?
A quick test using vmstat at 1s intervals gave the following output whilst it was running.
procs ---------------memory-------------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd     free   buff  cache   si   so    bi    bo    in    cs us sy  id wa st
20  0    0 32011144 167764 330416    0    0     1    15    40    56  0  0  99  1  0
 0  0    0 31914848 167764 330424    0    0     0  1560  2594  2130  2  1  97  0  0
 0  0    0 31914336 167764 330424    0    0     0  1708   754  1277  0  0 100  0  0
 0  0    0 31914508 167772 330420    0    0     0  2028   779  1300  0  0  99  1  0
The script took about 60s to complete, which is a lot longer than expected. It appears to be almost entirely I/O bound, at a fairly slow rate (1500 blocks per second is 6MB/s).
As you say, it is IO bound (wa ~= 100%). Stop worrying about MB/s: the data rate is irrelevant, what matters is synchronous small-block writes and those are limited by rotation speed.
Are you absolutely certain that the disks are SAS? Does your disk controller believe it? I had big problems with an HP controller once that refused to run SATA drives at anything like their full speed as it waited for each transaction to finish and report back before queuing the next one...
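Something like this should show what the kernel and the drives themselves are reporting (device names are just examples):

  lsscsi                                  # transport and type of each attached disk
  smartctl -i /dev/sda                    # model, interface and rotation rate (smartmontools)
  sdparm --get=WCE /dev/sda               # whether the drive's write cache is enabled
  cat /sys/class/scsi_disk/*/cache_type   # what the kernel thinks the cache setting is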
Andrew
On 2015-04-28 7:51 PM, Andrew Findlay wrote:
Did you get to the bottom of this?
Yes.
[...]
Decent SAS disks spin at 10,000 or 15,000 RPM so unless there is a non-volatile memory cache in there I would expect at most 15000/60 = 250 fsyncs per second per drive, giving 125 transaction commits per second per drive.
These are Enterprise SAS drives with onboard read and write cache systems.
[...]
Are you absolutely certain that the disks are SAS? Does your disk controller believe it? I had big problems with an HP controller once that refused to run SATA drives at anything like their full speed as it waited for each transaction to finish and report back before queuing the next one...
Yes, they are SAS drives and the driver recognises them as such, connected to a C600 controller.
Did a lot of testing over the last week or so. It appears to be fundamentally a Linux block layer problem. An fsync operation appears to set the FUA flag on the SCSI command to force it to bypass the write cache. This is a real problem, since it defeats the intelligence built into the SCSI controller for handling its write cache, and the consequence is that we see a seek time on each 4K block transaction. This behaviour seems to be hard-wired and buried in the block layer. It would be nice to have a mount option to prevent it from happening on certain mounted volumes.
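For anyone following along, the relevant knobs look roughly like this (the mount point is only an example, and disabling barriers is only safe when the cache is non-volatile, e.g. battery or flash backed):

  # what the kernel believes about each drive's volatile write cache;
  # this is what determines whether fsync sends FLUSH/FUA to the device
  cat /sys/class/scsi_disk/*/cache_type

  # on ext4 (at the time of this thread) the flush/FUA requests could be
  # suppressed per filesystem with the barrier mount option
  mount -o remount,barrier=0 /var/lib/ldap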
However, there was a significant improvement with the 3.19.5 kernel, where multi-queue (blk-mq) can be enabled for SCSI operations. It still seems to bypass the write cache on the SCSI drive, however the performance is much better.
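For reference, on those kernels scsi-mq is switched on with a boot parameter (assuming the kernel was built with the option; the sysfs path may vary):

  # append to the kernel command line (e.g. in the grub configuration) and reboot
  scsi_mod.use_blk_mq=1

  # confirm after boot
  cat /sys/module/scsi_mod/parameters/use_blk_mq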
Another area that also improved things was VM tuning, however this is fairly sensitive (like a high-Q bandpass filter). Reducing the vm.dirty_expire_centisecs value from 30s to 15s improved things in this environment, which can build up a large backlog of dirty pages. Making them expire a bit sooner gives less bumpy cache flushing.
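The corresponding sysctl change, for anyone who wants to try the same thing:

  # 15s instead of the default 30s (the value is in centiseconds)
  sysctl -w vm.dirty_expire_centisecs=1500

  # make it persistent across reboots
  echo 'vm.dirty_expire_centisecs = 1500' >> /etc/sysctl.conf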
On 2015-04-23 5:56 PM, Howard Chu wrote:
[...]
In normal (safe) operation, every transaction commit performs 2 fsyncs. Your 140MB/s throughput spec isn't relevant here, your disk's IOPS rate is what matters. You can use NOMETASYNC to do only 1 fsync per commit.
Put atop on the test machine. Nice utility. Ran the test script again and got these results for the memory and disk lines, which appear to confirm the low IOPS. Any clues on improving this? The dd test on the disc was just to confirm that the C600 SAS controller/driver and disc could achieve the throughput.
MEM | tot 31.4G | free 30.4G | cache 326.5M | buff 167.2M | slab 85.2M | shmem 10.0M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M |
SWP | tot 16.0G | free 16.0G | | | | | | vmcom 5.4G | vmlim 31.7G |
DSK | sda | busy 75% | read 0 | write 3400 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 1.28 | avio 2.20 ms |