richton@nbcs.rutgers.edu wrote:
Full_Name: Aaron Richton
Version: 2.3.40
OS: Solaris 9
URL: ftp://ftp.openldap.org/incoming/
Submission from: (NULL) (128.6.31.135)
One hdb backend on one slave died ~21:58 yesterday...
current thread: t@5
  [1] _libc_poll(0xffffffff4f3ff430, 0x0, 0x3e8, 0x0, 0x0, 0x0), at 0xffffffff7f0a741c
  [2] _select(0x3e8, 0xffffffff7f1bc728, 0xffffffff7f1bc728, 0x0, 0xffffffff7f1bc728, 0x0), at 0xffffffff7f05a74c
  [3] select(0x0, 0x0, 0x0, 0x0, 0xffffffff4f3ff5b0, 0x0), at 0xffffffff7e0108e8
=>[4] __os_sleep(dbenv = 0x1005b2610, secs = 1U, usecs = 0), line 84 in "os_sleep.c"
  [5] __memp_sync_int(dbenv = 0x1005b2610, dbmfp = (nil), trickle_max = 0, op = DB_SYNC_CACHE, wrotep = (nil)), line 362 in "mp_sync.c"
  [6] __memp_sync(dbenv = 0x1005b2610, lsnp = (nil)), line 99 in "mp_sync.c"
  [7] __txn_checkpoint(dbenv = 0x1005b2610, kbytes = 100000U, minutes = 10U, flags = 0), line 1389 in "txn.c"
  [8] __txn_checkpoint_pp(dbenv = 0x1005b2610, kbytes = 100000U, minutes = 10U, flags = 0), line 1288 in "txn.c"
  [9] hdb_checkpoint(ctx = 0xffffffff4f3ffc30, arg = 0x1004b4c60), line 165 in "config.c"
  [10] ldap_int_thread_pool_wrapper(xpool = 0x10041e500), line 478 in "tpool.c"
(dbx) where
current thread: t@16
  [1] _libc_poll(0xffffffff46ffe3e0, 0x0, 0x3e8, 0x0, 0x0, 0x0), at 0xffffffff7f0a741c
  [2] _select(0x3e8, 0xffffffff7f1bc728, 0xffffffff7f1bc728, 0x0, 0xffffffff7f1bc728, 0x0), at 0xffffffff7f05a74c
  [3] select(0x0, 0x0, 0x0, 0x0, 0xffffffff46ffe560, 0x0), at 0xffffffff7e0108e8
=>[4] __os_sleep(dbenv = 0x1005b2610, secs = 1U, usecs = 0), line 84 in "os_sleep.c"
  [5] __memp_sync_int(dbenv = 0x1005b2610, dbmfp = (nil), trickle_max = 0, op = DB_SYNC_CACHE, wrotep = (nil)), line 439 in "mp_sync.c"
  [6] __memp_sync(dbenv = 0x1005b2610, lsnp = (nil)), line 99 in "mp_sync.c"
  [7] __txn_checkpoint(dbenv = 0x1005b2610, kbytes = 100000U, minutes = 10U, flags = 0), line 1389 in "txn.c"
  [8] __txn_checkpoint_pp(dbenv = 0x1005b2610, kbytes = 100000U, minutes = 10U, flags = 0), line 1288 in "txn.c"
  [9] hdb_delete(op = 0xffffffff46fff618, rs = 0xffffffff46fff088), line 537 in "delete.c"
  [10] syncrepl_entry(si = 0x1004b4e50, op = 0xffffffff46fff618, entry = (nil), modlist = 0xffffffff46fff320, syncstate = 3, syncUUID = 0xffffffff46fff3c0, syncCookie_req = 0xffffffff46fff360, syncCSN = 0xffffffff46fff390), line 2006 in "syncrepl.c"
  [11] do_syncrep2(op = 0xffffffff46fff618, si = 0x1004b4e50), line 731 in "syncrepl.c"
  [12] do_syncrepl(ctx = 0xffffffff46fffc30, arg = 0x1004b5030), line 1095 in "syncrepl.c"
  [13] ldap_int_thread_pool_wrapper(xpool = 0x10041e500), line 478 in "tpool.c"
I can't get db_stat to join the environment. If there's anything else that can be gleaned from slapd itself, I'd be glad to poke around the core; otherwise, I'm off to rm/slapadd...
"This makes sense and shouldn't happen in 2.3.41" would be fine too, but none of the changes (to my eye) looked locking-related.
Unfortunately no, nothing familiar here. There's nothing in the BDB documentation that says two threads are not allowed to call txn_checkpoint concurrently, but I suppose it may be excessive to make multiple calls in rapid succession.
One thing that I've started doing recently in my configs is to leave the kbyte argument of the checkpoint directive at zero, so that only time-based checkpoints occur. Since those are done in a dedicated task, only one thread at a time can trigger a checkpoint.
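
As a rough illustration of that change: the kbytes = 100000 and minutes = 10 arguments in the traces above suggest the slave is currently running with something like "checkpoint 100000 10" in its hdb database section. That current setting is an assumption read off the backtrace, not something taken from the actual slapd.conf, and the 10-minute interval below is simply carried over from it. A time-only setup would then look roughly like this:

```
# slapd.conf, inside the hdb database section
# checkpoint <kbyte> <min>
# Assumed current setting, inferred from the backtrace:
#   checkpoint 100000 10
# Time-based only: kbyte left at zero, so only the periodic task checkpoints
checkpoint 0 10
```

With kbyte at zero, write operations such as the hdb_delete in thread t@16 should no longer have a size threshold to trip, which leaves the periodic hdb_checkpoint task as the only caller.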