syncrepl_del_nonpresent leading to directory implosion - openldap-technical

8 Jul 2014


      Hi,
I've got a two-node MMR setup using syncrepl running 2.4.36 with back-bdb.
Last night, between 1/3 and 1/2 of the entries in our directory disappeared.  Fortunately I have olcLogLeve=sync set, so I think I have a line on what happened...
I see this at 4AM long of the morning before the fun started:
Jul  7 04:19:30 zoot slapd[15011]: bdb(dc=clarku,dc=edu): Lock table is out of available lock entries
Jul  7 04:19:30 zoot slapd[15011]: => bdb_idl_insert_key: c_get failed: Cannot allocate memory (12)
Then many hours later:
Jul  7 15:04:30 zoot slapd[15011]: bdb(dc=clarku,dc=edu): Lock table is out of available lock entries
Jul  7 15:04:30 zoot slapd[15011]: => bdb_idl_delete_key: c_get failed: Cannot allocate memory (12)
Jul  7 15:04:30 zoot slapd[15011]: null_callback : error code 0x50
Jul  7 15:04:30 zoot slapd[15011]: slap_graduate_commit_csn: removing 0x7fa33c248a80 20140707190424.188580Z#000000#001#000000
Jul  7 15:04:30 zoot slapd[15011]: syncrepl_del_nonpresent: rid=001 be_delete cn=parents_ug_students,ou=Other,ou=Groups,dc=clarku,dc=edu (80)
This repeats itself several times:
Jul  7 16:41:09 zoot slapd[15011]: syncprov_matchops: skipping original sid 001
Jul  7 16:41:09 zoot slapd[15011]: bdb(dc=clarku,dc=edu): Lock table is out of available lock entries
Jul  7 16:41:09 zoot slapd[15011]: => bdb_idl_delete_key: c_get failed: Cannot allocate memory (12)
Jul  7 16:41:09 zoot slapd[15011]: null_callback : error code 0x50
Jul  7 16:41:09 zoot slapd[15011]: slap_graduate_commit_csn: removing 0x7fa37e4b4260 20140707204059.442906Z#000000#001#000000
Jul  7 16:41:09 zoot slapd[15011]: syncrepl_del_nonpresent: rid=001 be_delete cn=parents_ug_students,ou=Other,ou=Groups,dc=clarku,dc=edu (80)
Jul  7 16:52:44 zoot slapd[2179]: bdb(dc=clarku,dc=edu): Lock table is out of available lock entries
Jul  7 16:52:44 zoot slapd[2179]: => bdb_idl_delete_key: c_get failed: Cannot allocate memory (12)
Jul  7 16:52:44 zoot slapd[2179]: null_callback : error code 0x50
Jul  7 16:52:44 zoot slapd[2179]: slap_graduate_commit_csn: removing 0x7f74581dae70 20140707205225.029175Z#000000#001#000000
Jul  7 16:52:44 zoot slapd[2179]: syncrepl_del_nonpresent: rid=001 be_delete cn=parents_ug_students,ou=Other,ou=Groups,dc=clarku,dc=edu (80)
Jul  7 16:52:44 zoot slapd[2179]: slap_queue_csn: queing 0x7f7458be8470 20140707205225.029175Z#000000#001#000000
Jul  7 16:52:44 zoot slapd[2179]: slap_graduate_commit_csn: removing 0x7f74581ad2e0 20140707205225.029175Z#000000#001#000000
Then we get (possibly unrelated) a lengthy list of nonpresent_callbacks (tons more then I list here):
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 5111f346-871c-1033-8d4b-f5dbd10b8ca0, dn uid=rvenkatachalam,ou=Users,dc=clarku,dc=edu
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 1e3d94e4-8ce8-1033-8d6a-f5dbd10b8ca0, dn cn=BIOL399-09-SU14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 45843aa6-91b8-1033-81a9-ff6ec337b146, dn cn=MSPC3090-01-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 4dd09f16-921b-1033-8d8e-f5dbd10b8ca0, dn cn=msit3710-01-f14,ou=DL,ou=Groups,dc=clarku,dc=edu
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID 0b3e8d76-9607-1033-81cf-ff6ec337b146, dn cn=acct5411-01-f14,ou=DL,ou=Groups,dc=clarku,dc=edu
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID a698a278-9677-1033-81ee-ff6ec337b146, dn cn=ESL0051-E1-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul  7 17:06:24 zoot slapd[2179]: nonpresent_callback: rid=001 present UUID ccf94c1c-979c-1033-8dc9-f5dbd10b8ca0, dn uid=dgunarajasingam,ou=Users,dc=clarku,dc=edu
Followed by another lock table / del_nonpresent:
Jul  7 17:06:24 zoot slapd[2179]: bdb(dc=clarku,dc=edu): Lock table is out of available lock entries
Jul  7 17:06:24 zoot slapd[2179]: => bdb_idl_delete_key: c_del id failed: Cannot allocate memory (12)
Jul  7 17:06:24 zoot slapd[2179]: null_callback : error code 0x50
Jul  7 17:06:24 zoot slapd[2179]: slap_graduate_commit_csn: removing 0x7f73e016cb40 20140707210614.062782Z#000000#001#000000
Jul  7 17:06:24 zoot slapd[2179]: syncrepl_del_nonpresent: rid=001 be_delete cn=parents_ug_students,ou=Other,ou=Groups,dc=clarku,dc=edu (80)
Jul  7 17:06:24 zoot slapd[2179]: slap_queue_csn: queing 0x7f73e0f0eaa0 20140707210614.062782Z#000000#001#000000
Jul  7 17:06:24 zoot slapd[2179]: slap_graduate_commit_csn: removing 0x7f73e0f180c0 20140707210614.062782Z#000000#001#000000
That repeats itself @ 18:07, 18:47, 18:55.  Then the train departs for crazy town... notice this is the other node:
ul  7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent UUID 0e6fea9e-9279-1033-8d97-f5dbd10b8ca0, dn cn=MSPC3070-01-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul  7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent UUID f710e364-958c-1033-8da2-f5dbd10b8ca0, dn cn=MKT5900-01-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul  7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent UUID 815ece60-965e-1033-8dab-f5dbd10b8ca0, dn cn=ESL0067-M1-F14,ou=Courses,ou=Groups,dc=clarku,dc=edu
Jul  7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent UUID f9d8a408-96cf-1033-8db0-f5dbd10b8ca0, dn cn=eng2950-01-su14,ou=DL,ou=Groups,dc=clarku,dc=edu
Jul  7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent UUID 480d8cbc-96d3-1033-8dbf-f5dbd10b8ca0, dn cn=esl0011-e1-f14,ou=DL,ou=Groups,dc=clarku,dc=edu
Jul  7 18:55:59 animal slapd[10389]: nonpresent_callback: rid=002 nonpresent UUID ccf94c1c-979c-1033-8dc9-f5dbd10b8ca0, dn uid=dgunarajasingam,ou=Users,dc=clarku,dc=edu
Jul  7 18:55:59 animal slapd[10389]: slap_graduate_commit_csn: removing 0x7f59c8eddc00 20140707225554.051198Z#000000#002#000000
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=MPA3950-03-SU14,ou=Courses,ou=Groups,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=MSIT3950-0R-SU14,ou=Courses,ou=Groups,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete uid=zbamberger,ou=Users,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete uid=dgunarajasingam,ou=Users,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete uid=hmunawar,ou=Users,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=scrn2940-01-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=eng2940-02-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=id2940-02-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=msit3950-0p-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
Jul  7 18:55:59 animal slapd[10389]: syncprov_matchops: skipping original sid 002
Jul  7 18:55:59 animal slapd[10389]: syncrepl_del_nonpresent: rid=002 be_delete cn=mspc3060-0p-su14,ou=DL,ou=Groups,dc=clarku,dc=edu (0)
That goes on, deleting things, until my pager exploded when some critical role accounts got nuked and our radius servers died.  I pulled an LDIF when I realized what was happening and between 18:55 and 19:35, our LDIF went from 75MB to 45MB - so about 30MB of stuff disappeared.
What happened?  Did I get hit by an oddball BDB/Syncrepl behavior?  Do I have a bad misconfiguration in my replication setup?
Thanks for your time,
Aaron
----
Manager of Systems Administration
Clark University ITS