Hi,
I've got a two-node MMR setup using syncrepl running 2.4.36 with back-bdb.
Last night, between 1/3 and 1/2 of the entries in our directory disappeared. Fortunately I have olcLogLevel=sync set, so I think I have a line on what happened...
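(For reference, the log level was set with an LDIF roughly like the following against cn=config; the ldapi:/// + SASL EXTERNAL bind and the loglevel.ldif filename are just illustrative of how we apply it:)

  dn: cn=config
  changetype: modify
  replace: olcLogLevel
  olcLogLevel: sync

  applied with something like: ldapmodify -Y EXTERNAL -H ldapi:/// -f loglevel.ldif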
I see this at 4 AM on the morning before the fun started:
Jul 7 04:19:30 zoot slapd[15011]: bdb(dc=clarku,dc=edu): Lock table is out of available lock entries
Jul 7 04:19:30 zoot slapd[15011]: => bdb_idl_insert_key: c_get failed: Cannot allocate memory (12)
Then many hours later:

Jul 7 15:04:30 zoot slapd[15011]: bdb(dc=clarku,dc=edu): Lock table is out of available lock entries
Jul 7 15:04:30 zoot slapd[15011]: => bdb_idl_delete_key: c_get failed: Cannot allocate memory (12)
Jul 7 15:04:30 zoot slapd[15011]: null_callback : error code 0x50
Jul 7 15:04:30 zoot slapd[15011]: slap_graduate_commit_csn: removing 0x7fa33c248a80 20140707190424.188580Z#000000#001#000000
Jul 7 15:04:30 zoot slapd[15011]: syncrepl_del_nonpresent: rid=001 be_delete cn=parents_ug_students,ou=Other,ou=Groups,dc=clarku,dc=edu (80)
This repeats itself several times:

Jul 7 16:41:09 zoot slapd[15011]: syncprov_matchops: skipping original sid 001
Jul 7 16:41:09 zoot slapd[15011]: bdb(dc=clarku,dc=edu): Lock table is out of available lock entries
Jul 7 16:41:09 zoot slapd[15011]: => bdb_idl_delete_key: c_get failed: Cannot allocate memory (12)
Jul 7 16:41:09 zoot slapd[15011]: null_callback : error code 0x50
Jul 7 16:41:09 zoot slapd[15011]: slap_graduate_commit_csn: removing 0x7fa37e4b4260 20140707204059.442906Z#000000#001#000000
Jul 7 16:41:09 zoot slapd[15011]: syncrepl_del_nonpresent: rid=001 be_delete cn=parents_ug_students,ou=Other,ou=Groups,dc=clarku,dc=edu (80)
Jul 7 16:52:44 zoot slapd[2179]: bdb(dc=clarku,dc=edu): Lock table is out of available lock entries
Jul 7 16:52:44 zoot slapd[2179]: => bdb_idl_delete_key: c_get failed: Cannot allocate memory (12)
Jul 7 16:52:44 zoot slapd[2179]: null_callback : error code 0x50
Jul 7 16:52:44 zoot slapd[2179]: slap_graduate_commit_csn: removing 0x7f74581dae70 20140707205225.029175Z#000000#001#000000
Jul 7 16:52:44 zoot slapd[2179]: syncrepl_del_nonpresent: rid=001 be_delete cn=parents_ug_students,ou=Other,ou=Groups,dc=clarku,dc=edu (80)
Jul 7 16:52:44 zoot slapd[2179]: slap_queue_csn: queing 0x7f7458be8470 20140707205225.029175Z#000000#001#000000
Jul 7 16:52:44 zoot slapd[2179]: slap_graduate_commit_csn: removing 0x7f74581ad2e0 20140707205225.029175Z#000000#001#000000
Then we get (possibly unrelated) a lengthy list of nonpresent_callbacks (tons more than I list here):

Jul 7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 5111f346-871c-1033-8d4b-f5dbd10b8ca0, dn uid=rvenkatachalam,ou=Users,dc=clarku,dc=edu
Jul 7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 1e3d94e4-8ce8-1033-8d6a-f5dbd10b8ca0, dn cn=BIOL399-09-SU14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul 7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 45843aa6-91b8-1033-81a9-ff6ec337b146, dn cn=MSPC3090-01-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul 7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 4dd09f16-921b-1033-8d8e-f5dbd10b8ca0, dn cn=msit3710-01-f14,ou=DL,ou=Groups,dc=clarku,dc=edu
Jul 7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 0b3e8d76-9607-1033-81cf-ff6ec337b146, dn cn=acct5411-01-f14,ou=DL,ou=Groups,dc=clarku,dc=edu
Jul 7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID a698a278-9677-1033-81ee-ff6ec337b146, dn cn=ESL0051-E1-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul 7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID ccf94c1c-979c-1033-8dc9-f5dbd10b8ca0, dn uid=dgunarajasingam,ou=Users,dc=clarku,dc=edu
Followed by another lock table / del_nonpresent:

Jul 7 17:06:24 zoot slapd[2179]: bdb(dc=clarku,dc=edu): Lock table is out of available lock entries
Jul 7 17:06:24 zoot slapd[2179]: => bdb_idl_delete_key: c_del id failed: Cannot allocate memory (12)
Jul 7 17:06:24 zoot slapd[2179]: null_callback : error code 0x50
Jul 7 17:06:24 zoot slapd[2179]: slap_graduate_commit_csn: removing 0x7f73e016cb40 20140707210614.062782Z#000000#001#000000
Jul 7 17:06:24 zoot slapd[2179]: syncrepl_del_nonpresent: rid=001 be_delete cn=parents_ug_students,ou=Other,ou=Groups,dc=clarku,dc=edu (80)
Jul 7 17:06:24 zoot slapd[2179]: slap_queue_csn: queing 0x7f73e0f0eaa0 20140707210614.062782Z#000000#001#000000
Jul 7 17:06:24 zoot slapd[2179]: slap_graduate_commit_csn: removing 0x7f73e0f180c0 20140707210614.062782Z#000000#001#000000
That repeats itself @ 18:07, 18:47, 18:55. Then the train departs for crazy town... notice this is the other node:
Jul 7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent UUID 0e6fea9e-9279-1033-8d97-f5dbd10b8ca0, dn cn=MSPC3070-01-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul 7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent UUID f710e364-958c-1033-8da2-f5dbd10b8ca0, dn cn=MKT5900-01-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul 7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent UUID 815ece60-965e-1033-8dab-f5dbd10b8ca0, dn cn=ESL0067-M1-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul 7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent UUID f9d8a408-96cf-1033-8db0-f5dbd10b8ca0, dn cn=eng2950-01-su14,ou=DL,ou=Groups,dc=clarku,dc=edu
Jul 7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent UUID 480d8cbc-96d3-1033-8dbf-f5dbd10b8ca0, dn cn=esl0011-e1-f14,ou=DL,ou=Groups,dc=clarku,dc=edu
Jul 7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent UUID ccf94c1c-979c-1033-8dc9-f5dbd10b8ca0, dn uid=dgunarajasingam,ou=Users,dc=clarku,dc=edu
Jul 7 18:55:59 animal slapd[10389]: slap_graduate_commit_csn: removing 0x7f59c8eddc00 20140707225554.051198Z#000000#002#000000
Jul 7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=MPA3950-03-SU14,ou=Courses,ou=Groups,dc=clarku,dc=edu (0)
Jul 7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul 7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=MSIT3950-0R-SU14,ou=Courses,ou=Groups,dc=clarku,dc=edu (0)
Jul 7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul 7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete uid=zbamberger,ou=Users,dc=clarku,dc=edu (0)
Jul 7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul 7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete uid=dgunarajasingam,ou=Users,dc=clarku,dc=edu (0)
Jul 7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul 7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete uid=hmunawar,ou=Users,dc=clarku,dc=edu (0)
Jul 7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul 7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=scrn2940-01-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
Jul 7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul 7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=eng2940-02-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
Jul 7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul 7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=id2940-02-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
Jul 7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul 7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=msit3950-0p-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
Jul 7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul 7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=mspc3060-0p-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
That goes on, deleting things, until my pager exploded when some critical role accounts got nuked and our RADIUS servers died. I pulled an LDIF when I realized what was happening, and between 18:55 and 19:35 our LDIF went from 75MB to 45MB - so about 30MB of stuff disappeared.
What happened? Did I get hit by an oddball BDB/Syncrepl behavior? Do I have a bad misconfiguration in my replication setup?
Thanks for your time,
Aaron

----
Manager of Systems Administration
Clark University ITS
On Tue, 8 Jul 2014, Aaron Bennett wrote:
Jul 7 04:19:30 zoot slapd[15011]: bdb(dc=clarku,dc=edu): Lock table is out of available lock entries
Jul 7 04:19:30 zoot slapd[15011]: => bdb_idl_insert_key: c_get failed: Cannot allocate memory (12)
[...]
That goes on, deleting things, until my pager exploded when some critical role accounts got nuked and our RADIUS servers died. I pulled an LDIF when I realized what was happening, and between 18:55 and 19:35 our LDIF went from 75MB to 45MB - so about 30MB of stuff disappeared.
What happened? Did I get hit by an oddball BDB/Syncrepl behavior? Do I have a bad misconfiguration in my replication setup?
Usually what the syslogs say is right ... underconfigured re: number of locks. Check your DB_CONFIG, per http://www.openldap.org/faq/data/cache/893.html and the associated links. I suppose there's a potential argument to be made for "I wish it had died fast instead of continuing into a weird syncrepl state."
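For example, something along these lines in DB_CONFIG (the numbers are purely illustrative -- size them against your actual usage, and as I understand it the lock-table sizes only take effect when the environment is recreated, e.g. after a db_recover):

  set_lk_max_locks   40000
  set_lk_max_objects 40000
  set_lk_max_lockers 4000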
Other possibilities include a BDB bug, or perhaps you really were out of memory (but I'd also expect core portions of slapd and possibly other processes to be complaining in an OOM situation, which you didn't indicate in your report).
#include obligatory lmdb plug here (no need to play around with this stuff). This may help you justify the transition to management, if needed!
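If you do go that route, the back-mdb config is pleasantly small -- roughly this in slapd.conf terms (the cn=config attributes are the olc* equivalents; the rootdn and maxsize below are placeholders, the suffix and directory are taken from your posts):

  database  mdb
  suffix    "dc=clarku,dc=edu"
  rootdn    "cn=admin,dc=clarku,dc=edu"
  directory /var/lib/ldap
  maxsize   10737418240
  index     objectClass eq

No DB_CONFIG and no lock table to size; the usual migration is a slapcat from the bdb database and a slapadd into the new one.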
Aaron Bennett wrote:
Hi,
I’ve got a two-node MMR setup using syncrepl running 2.4.36 with back-bdb.
Last night, between 1/3 and 1/2 of the entries in our directory disappeared. Fortunately I have olcLogLevel=sync set, so I think I have a line on what happened…
I see this at 4 AM on the morning before the fun started:
Jul 7 04:19:30 zoot slapd[15011]: bdb(dc=clarku,dc=edu): Lock table is out of available lock entries
Jul 7 04:19:30 zoot slapd[15011]: => bdb_idl_insert_key: c_get failed: Cannot allocate memory (12)
What happened? Did I get hit by an oddball BDB/Syncrepl behavior? Do I have a bad misconfiguration in my replication setup?
You have a bad misconfig in your BDB, you need to configure a larger lock table.
and the syncrepl code probably needs to be fixed to quit if the backend returns an LDAP_OTHER error code like this. Please submit an ITS for this.
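(For sizing: db_stat -c against the DB directory shows the configured maximums alongside current and peak lock usage, e.g.

  db_stat -c -h /var/lib/ldap | grep -i -E 'maximum|current'

and set set_lk_max_locks / set_lk_max_objects / set_lk_max_lockers in DB_CONFIG accordingly.)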
Thanks for your time,
Aaron
Manager of Systems Administration
Clark University ITS
-----Original Message-----
From: Howard Chu [mailto:hyc@symas.com]
Sent: Tuesday, July 8, 2014 12:08 PM
To: Aaron Bennett; openldap-technical@openldap.org
Subject: Re: syncrepl_del_nonpresent leading to directory implosion
You have a bad misconfig in your BDB, you need to configure a larger lock table.
and the syncrepl code probably needs to be fixed to quit if the backend returns an LDAP_OTHER error code like this. Please submit an ITS for this.
Thank you -- I will submit an ITS once the dust has cleared.
I adjusted our lock table significantly upwards. However, since then, I'm seeing extreme replication delays in just one direction. I'm not sure if there's an issue with BDB or if I did something else wrong. Here's the chain of events:
The two boxes are called animal and zoot.
1. DB_CONFIG looks like this on both boxes:

set_cachesize 0 524288000 1
set_lg_regionmax 262144
set_lg_bsize 2097152
set_flags DB_LOG_AUTOREMOVE
set_lk_max_lockers 9000
set_lk_max_locks 9000
set_lk_max_objects 9000
set_lk_partitions 900
2. To make sure I got a clean database, during a maintenance window I stopped slapd on both machines, deleted /var/lib/ldap/*, and reloaded an LDIF on animal with this command:

sudo -u ldap /usr/local/sbin/slapadd -S 1 -w -l /tmp/animal-now.ldif
3. Then I started slapd and watched until zoot had pulled the replica over. Everything seemed kosher -- replication from animal -> zoot is nearly instant as expected, but zoot -> animal is very slow -- 15-30 seconds, and often longer than 60, for one very trivial change.
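For what it's worth, here's roughly how I've been watching the lag -- comparing contextCSN on the suffix on both boxes after making a change (hostnames as above, our base DN; assumes anonymous read access to the suffix entry, otherwise add a bind):

  ldapsearch -x -H ldap://animal -s base -b "dc=clarku,dc=edu" contextCSN
  ldapsearch -x -H ldap://zoot   -s base -b "dc=clarku,dc=edu" contextCSN

The contextCSN values for each server ID should converge quickly; in the zoot -> animal direction they lag by the 15-60+ seconds described above.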
Any ideas what's wrong? If you'd like I can send syncrepl logs or db_stat output.
Thanks for your time,
Aaron Bennett
----
Manager of Systems Administration
Clark University ITS