Over the past couple years I've noticed that slapd's frontend was a bottleneck in our benchmarks on larger systems. But I hadn't quantified the extent of the problem. This week I stumbled across mutrace, which is a very handy tool that measures how much time is spent in mutexes.
Running against back-mdb with the test008 setup, using -l 100 on the tester showed some pretty interesting data. This is a run where slapd was configured with listener-threads 2 on my quadcore laptop:
The #1 contended mutex is in the thread pool, pool->ltp_mutex. #2 and #4 are the socket-table mutexes. (With the default of 1 listener thread they're correspondingly higher in the chart.) #3 is slap_op_mutex which is used to ensure unique timestamps in each Operation structure. #5 is the LMDB writer mutex. #6 is the attr_alloc mutex, which we should probably consider eliminating but nothing after that is very significant. 7,8,9 are connection writer mutexes, and #10 is the pending CSN list mutex.
##### mutrace: 0.2 successfully initialized for process slapd (PID: 10868). ^C mutrace: Showing statistics for process slapd (PID: 10868). mutrace: 717 mutexes used.
Mutex #471 (0x0x10bae48) first referenced by: mutrace.c:1448 pthread_mutex_init() tpool.c:217 ldap_pvt_thread_pool_init()
Mutex #488 (0x0x9100f0) first referenced by: mutrace.c:1448 pthread_mutex_init() daemon.c:2906 slapd_daemon()
Mutex #497 (0x0x9107e0) first referenced by: mutrace.c:1448 pthread_mutex_init() init.c:92 slap_init() main.c:787 main() libc-start.c:258 __libc_start_main()
Mutex #487 (0x0x9100a0) first referenced by: mutrace.c:1448 pthread_mutex_init() daemon.c:1629 slapd_daemon_init()
Mutex #635 (0x0x7ffe2d78c040) first referenced by: mutrace.c:1448 pthread_mutex_init() mdb.c:3795 mdb_env_setup_locks() mdb.c:3906 mdb_env_open() init.c:174 mdb_db_open() backend.c:224 backend_startup_one() backend.c:330 backend_startup() init.c:219 slap_startup() main.c:992 main() libc-start.c:258 __libc_start_main()
Mutex #495 (0x0x910720) first referenced by: mutrace.c:1448 pthread_mutex_init() attr.c:708 attr_init() entry.c:95 entry_init() init.c:117 slap_init() main.c:787 main() libc-start.c:258 __libc_start_main()
Mutex #113 (0x0x7ffe2d59c248) first referenced by: mutrace.c:1448 pthread_mutex_init() connection.c:421 connection_init()
Mutex #98 (0x0x7ffe2d59b6c8) first referenced by: mutrace.c:1448 pthread_mutex_init() connection.c:421 connection_init()
Mutex #105 (0x0x7ffe2d59bc88) first referenced by: mutrace.c:1448 pthread_mutex_init() connection.c:421 connection_init()
Mutex #4 (0x0x116cf20) first referenced by: mutrace.c:1448 pthread_mutex_init() backend.c:624 backend_db_init() bconfig.c:1661 config_generic() config.c:345 config_set_vals() config.c:419 config_add_vals() config.c:783 read_config_file() bconfig.c:4439 read_config() main.c:793 main() libc-start.c:258 __libc_start_main()
mutrace: Showing 10 mutexes in order of (write) contention count:
Mutex # Locked Changed Cont. cont.Time[ms] tot.Time[ms] avg.Time[ms] 471 231744 178161 41572 14511.248 634.339 0.003 488 108544 83353 9455 2449.662 356.085 0.003 497 102537 97750 8429 5156.696 245.420 0.002 487 89990 73426 7462 2477.159 318.022 0.004 635 2401 2263 1562 5838.290 905.338 0.377 495 23079 2596 205 65.932 74.804 0.003 113 44029 7438 165 95.214 1260.856 0.029 98 30427 5647 146 79.943 1100.263 0.036 105 18433 4504 119 66.397 965.679 0.052 4 4800 3540 117 30.774 12.467 0.003 ... ... ... ... ... ... ...
mutrace: Note that rwlocks are shown as two lines: write locks then read locks.
mutrace: 45 condition variables used.
Condvar #43 (0x0x10bae70) first referenced by: mutrace.c:1744 pthread_cond_init() tpool.c:220 ldap_pvt_thread_pool_init()
mutrace: Showing 1 condition variables in order of wait contention count:
Cond # Waits Signals Cont. tot.Time[ms] cont.Time[ms] avg.Time[ms] 43 18389 106667 17090 64222.152 10010.642 0.544 ... ... ... ... ... ... ...
mutrace: Total runtime is 10582.976 ms.
mutrace: Results for SMP with 4 processors.
mutrace: WARNING: 2 internal hash collisions detected. Results might not be as reliable as they could be. mutrace: Try to increase --hash-size=, which is currently at 67137. #####
The total runtime is relatively arbitrary, based on how quickly I typed ^C after the slapd-tester program finished. But notice that the condition variable is contended for through almost the entire runtime. This is also in the threadpool, pool->ltp_cond.
I've now written a patch to allow a configurable number of workqueues in a threadpool. Here's the same slapd-tester workload against slapd with queues=4.
##### mutrace -d --hash-size=67137 ../servers/slapd/slapd -f /tmp/testr/slapd.1.conf -h ldap://:9011 -s0 -d0 mutrace: 0.2 successfully initialized for process slapd (PID: 10191). ^C mutrace: Showing statistics for process slapd (PID: 10191). mutrace: 724 mutexes used.
Mutex #535 (0x0x90f7e0) first referenced by: mutrace.c:1448 pthread_mutex_init() init.c:92 slap_init() [0x65742f706d742f00] ??() ??:0
Mutex #528 (0x0x90f0f0) first referenced by: mutrace.c:1448 pthread_mutex_init() daemon.c:2906 slapd_daemon()
Mutex #527 (0x0x90f0a0) first referenced by: mutrace.c:1448 pthread_mutex_init() daemon.c:1629 slapd_daemon_init()
Mutex #691 (0x0xfd8eb0) first referenced by: mutrace.c:1448 pthread_mutex_init() tpool.c:247 ldap_pvt_thread_pool_init0()
Mutex #692 (0x0xfd8f78) first referenced by: mutrace.c:1448 pthread_mutex_init() tpool.c:247 ldap_pvt_thread_pool_init0()
Mutex #694 (0x0xfd9108) first referenced by: mutrace.c:1448 pthread_mutex_init() tpool.c:247 ldap_pvt_thread_pool_init0()
Mutex #693 (0x0xfd9040) first referenced by: mutrace.c:1448 pthread_mutex_init() tpool.c:247 ldap_pvt_thread_pool_init0()
Mutex #530 (0x0x7fa641167040) first referenced by: mutrace.c:1448 pthread_mutex_init() mdb.c:3795 mdb_env_setup_locks() mdb.c:3906 mdb_env_open() init.c:174 mdb_db_open() backend.c:224 backend_startup_one() backend.c:330 backend_startup() init.c:220 slap_startup() libc-start.c:258 __libc_start_main()
Mutex #533 (0x0x90f720) first referenced by: mutrace.c:1448 pthread_mutex_init() attr.c:708 attr_init() entry.c:95 entry_init() init.c:117 slap_init() [0x65742f706d742f00] ??() ??:0
Mutex #592 (0x0x7fa640f76108) first referenced by: mutrace.c:1448 pthread_mutex_init() connection.c:421 connection_init()
mutrace: Showing 10 mutexes in order of (write) contention count: Mutex # Locked Changed Cont. cont.Time[ms] tot.Time[ms] avg.Time[ms] 535 102537 97558 11075 6417.798 247.221 0.002 528 119100 89722 8905 2224.464 373.760 0.003 527 105147 81593 7472 2074.638 339.454 0.003 691 63260 47195 2924 680.201 190.002 0.003 692 60573 47065 2537 504.543 180.968 0.003 694 55821 42992 2296 548.873 171.009 0.003 693 54335 39828 2234 644.864 188.289 0.003 530 2401 2068 1437 6222.018 865.117 0.360 533 23079 2464 246 87.519 66.836 0.003 592 34036 5594 236 91.546 1306.007 0.038 ... ... ... ... ... ... ...
mutrace: Note that rwlocks are shown as two lines: write locks then read locks.
mutrace: 51 condition variables used.
Condvar #45 (0x0xfd8fa0) first referenced by: mutrace.c:1744 pthread_cond_init() tpool.c:250 ldap_pvt_thread_pool_init0()
Condvar #49 (0x0xfd9130) first referenced by: mutrace.c:1744 pthread_cond_init() tpool.c:250 ldap_pvt_thread_pool_init0()
Condvar #47 (0x0xfd9068) first referenced by: mutrace.c:1744 pthread_cond_init() tpool.c:250 ldap_pvt_thread_pool_init0()
Condvar #43 (0x0xfd8ed8) first referenced by: mutrace.c:1744 pthread_cond_init() tpool.c:250 ldap_pvt_thread_pool_init0()
mutrace: Showing 4 condition variables in order of wait contention count:
Cond # Waits Signals Cont. tot.Time[ms] cont.Time[ms] avg.Time[ms] 45 5580 27496 3692 13306.017 792.845 0.142 49 3798 26011 2333 13349.744 658.738 0.173 47 4364 24985 2204 14182.993 1001.988 0.230 43 3929 29665 2102 12901.160 1186.375 0.302 ... ... ... ... ... ... ...
mutrace: Total runtime is 23920.267 ms.
mutrace: Results for SMP with 4 processors. #####
Now the #1 contended mutex is the slap_op_time mutex. #2 and #3 are the socket table mutexes. #4,5,6,7 are the threadpool workqueue mutexes. The important thing to note is that while the total time all 4 of these mutexes are held adds up to about the same as the single mutex in the previous run, the total contended time is less than 1/6th of the previous run. I.e., splitting 1 mutex into 4 reduced our overhead by a factor of 6.
For the condvar, the total contended time of the 4 cvs is 2.7x less than the previous run.
The code is currently available on my mutex branch on ada.openldap.org. The pool-pause feature hasn't yet been tested. When I've finished that I'll merge to master for RE25.
I believe this patch finally gets the frontend out of LMDB's way. I've also looked at restructuring the writer mutexes, but that may prove to be a deadend, based on the signalling requirements we have.