I've been testing the password policy module lately. I updated our development LDAP infrastructure on Monday, basically loading and enabling the module and adding a default policy:
dn: cn=default,ou=ppolicy,dc=csupomona,dc=edu
cn: default
objectClass: organizationalRole
objectClass: pwdPolicy
pwdAttribute: userPassword
pwdLockout: TRUE
pwdLockoutDuration: 1800
pwdMaxFailure: 100
pwdFailureCountInterval: 300
And a policy specific for service accounts:
dn: cn=serviceaccount,ou=ppolicy,dc=csupomona,dc=edu
cn: serviceaccount
objectClass: organizationalRole
objectClass: pwdPolicy
pwdAttribute: userPassword
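For reference, the "loading and enabling" step amounted to roughly the following cn=config changes (the module index, module filename, and database DN below are illustrative placeholders rather than our exact values):

dn: cn=module{0},cn=config
changetype: modify
add: olcModuleLoad
olcModuleLoad: ppolicy.la

dn: olcOverlay=ppolicy,olcDatabase={1}mdb,cn=config
changetype: add
objectClass: olcOverlayConfig
objectClass: olcPPolicyConfig
olcOverlay: ppolicy
olcPPolicyDefault: cn=default,ou=ppolicy,dc=csupomona,dc=edu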
After some initial testing and determining that the current implementation is not suitable for account lockouts, I haven't touched it since Monday. Yesterday, I started getting alerts that the systems were low on memory, and upon looking, found extremely bloated slapd processes. Next, I started getting alerts that the accesslog mdb was running out of space.
Upon investigation, I found that the service account that is used to gather statistics was generating a considerable amount of churn every time it authenticated (every 30 seconds). This account was explicitly configured to use the service account password policy:
dn: cn=slapd-monitor,ou=user,ou=service,dc=csupomona,dc=edu
objectClass: person
cn: slapd-monitor
sn: LDAP monitoring account
pwdPolicySubentry: cn=serviceaccount,ou=ppolicy,dc=csupomona,dc=edu
That policy should be pretty much "do nothing". (I realize now that this policy will actually result in pwdFailureTime values being created for failed authentications and accumulating indefinitely until a successful authentication, but in this case there are no failed authentications for this service account, only successful ones.)
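(Purely to illustrate the accumulation, with invented values: a few failed binds against an entry governed by this policy would leave operational attributes on it along the lines of the following, which are only cleared again by a successful bind.)

dn: cn=example-service-account,ou=user,ou=service,dc=csupomona,dc=edu
pwdFailureTime: 20140506064530Z
pwdFailureTime: 20140506064531Z
pwdFailureTime: 20140506064533Z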
There were 1,662,836 entries in the accesslog for this DN since Monday that, as far as I can tell, do nothing but change the CSN:
dn: reqStart=20140502031556.000000Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20140502031556.000000Z
reqEnd: 20140502031558.000000Z
reqType: modify
reqSession: 2
reqAuthzID: cn=ldaproot,dc=csupomona,dc=edu
reqDN: cn=slapd-monitor,ou=user,ou=service,dc=csupomona,dc=edu
reqResult: 0
reqMod: entryCSN:= 20140501211559.348458Z#000000#003#000000
reqMod: modifiersName:= cn=ldaproot,dc=csupomona,dc=edu
reqMod: modifyTimestamp:= 20140501211559Z
reqEntryUUID: c16729b0-ce1e-4f9d-9123-e0b8f115133c
entryUUID: 240ec245-0bac-4e20-b387-19243f720674
creatorsName: cn=accesslog
createTimestamp: 20140501211559Z
entryCSN: 20140501211559.348458Z#000000#003#000000
modifiersName: cn=accesslog
modifyTimestamp: 20140501211559Z
This explains why the accesslog was running out of space: it was full of these. It doesn't explain why the slapd process exploded in memory use, unless the steady-state memory usage of a slapd this busy processing replication is simply higher than that of one that isn't.
But I'm confused as to why loading the password policy module results in a CSN modification for every authentication, even for an account whose policy is configured to do pretty much nothing.
I'm going to go peruse the source code to see if I can determine what's going on, but any expert opinions would be most welcome.
Thanks much.
--On May 2, 2014 at 6:01:02 PM -0700 "Paul B. Henson" henson@acm.org wrote:
I've been testing the password policy module lately. I updated our development LDAP infrastructure on Monday, basically loading and enabling the module and adding a default policy:
This explains why the accesslog was running out of space: it was full of these. It doesn't explain why the slapd process exploded in memory use, unless the steady-state memory usage of a slapd this busy processing replication is simply higher than that of one that isn't.
But I'm confused as to why loading the password policy module results in a CSN modification for every authentication, even for an account whose policy is configured to do pretty much nothing.
I'm going to go peruse the source code to see if I can determine what's going on, but any expert opinions would be most welcome.
I would suggest (if you haven't) enabling sync replication logging (loglevel sync) in addition to whatever other loglevels you have. I've recently found it is possible to send MMR into an endless loop in some cases. I'm still working on a reproduction case, but for one of our clients it eventually happens 100% of the time.
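On a cn=config deployment that's a one-line change along these lines (the non-sync keywords are just placeholders for whatever you already log):

dn: cn=config
changetype: modify
replace: olcLogLevel
olcLogLevel: stats sync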
--Quanah
From: Quanah Gibson-Mount
Sent: Friday, May 02, 2014 7:22 PM
I would suggest (if you haven't) enabling sync replication logging (loglevel sync) in addition to whatever other loglevels you have. I've recently found it is possible to send MMR into an endless loop in some cases. I'm still working on a reproduction case, but for one of our clients it eventually happens 100% of the time.
Ouch. By any chance, does your client use the password policy module? I've never had a problem like this until I started testing with it in my development instance.
I've enabled the additional logging level and reset my testing to the beginning, and will see if it does it again. When it happens to your client in production, how do you resolve it?
Thanks.
On Fri, May 02, 2014 at 07:22:10PM -0700, Quanah Gibson-Mount wrote:
I would suggest (if you haven't) enabling sync replication logging (loglevel sync) in addition to whatever other loglevels you have.
Ok, so I reloaded my dev environment with a fresh snapshot of production data, and started it back up. After a couple hours, all of the servers started producing this entry in the accesslog db:
dn: reqStart=20140506065034.000024Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20140506065034.000024Z
reqEnd: 20140506065034.000028Z
reqType: modify
reqSession: 3
reqAuthzID: cn=ldaproot,dc=csupomona,dc=edu
reqDN: cn=slapd-monitor,ou=user,ou=service,dc=csupomona,dc=edu
reqResult: 0
reqMod: entryCSN:= 20140506065034.590711Z#000000#003#000000
reqMod: modifiersName:= cn=ldaproot,dc=csupomona,dc=edu
reqMod: modifyTimestamp:= 20140506065034Z
reqEntryUUID: c16729b0-ce1e-4f9d-9123-e0b8f115133c
entryUUID: f1ff441b-ed2a-4623-a7b6-55931f2f28de
creatorsName: cn=accesslog
createTimestamp: 20140506065034Z
entryCSN: 20140506065034.590711Z#000000#003#000000
modifiersName: cn=accesslog
modifyTimestamp: 20140506065034Z
The reqStart, reqEnd, and entryUUID attributes differ, but the reqMod, entryCSN, and modifyTimestamp values are identical across all of the entries on all of the servers.
This continued for hours on end, until the servers ran out of memory and the slapd process was killed:
dn: reqStart=20140506102707.000012Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20140506102707.000012Z
reqEnd: 20140506102709.000000Z
reqType: modify
reqSession: 2
reqAuthzID: cn=ldaproot,dc=csupomona,dc=edu
reqDN: cn=slapd-monitor,ou=user,ou=service,dc=csupomona,dc=edu
reqResult: 0
reqMod: entryCSN:= 20140506065034.590711Z#000000#003#000000
reqMod: modifiersName:= cn=ldaproot,dc=csupomona,dc=edu
reqMod: modifyTimestamp:= 20140506065034Z
reqEntryUUID: c16729b0-ce1e-4f9d-9123-e0b8f115133c
entryUUID: 9c332ed5-f39c-40d1-a8c1-1feb936c1d81
creatorsName: cn=accesslog
createTimestamp: 20140506065034Z
entryCSN: 20140506065034.590711Z#000000#003#000000
modifiersName: cn=accesslog
modifyTimestamp: 20140506065034Z
On the server with serverID 3, the source of this CSN, the logs show:
May 5 23:50:34 pip-dev slapd[31584]: conn=8558 op=0 BIND dn="cn=slapd-monitor,ou=user,ou=service,dc=csupomona,dc=edu" method=128
May 5 23:50:34 pip-dev slapd[31584]: slap_queue_csn: queing 0x7feb09ffbbd0 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: slap_queue_csn: queing 0x7feafc002af0 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: slap_graduate_commit_csn: removing 0x7feafc30dd20 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: slap_graduate_commit_csn: removing 0x7feafc321610 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: conn=8558 op=0 RESULT tag=97 err=49 text=
May 5 23:50:34 pip-dev slapd[31584]: syncprov_sendresp: to=002, cookie=rid=003,sid=003,csn=20140506065034.590711Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: syncprov_sendresp: to=001, cookie=rid=003,sid=003,csn=20140506065034.590711Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: syncprov_sendresp: cookie=rid=000,sid=003,csn=20140506065034.590711Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: conn=1006 op=245464 ABANDON msg=245464
May 5 23:50:34 pip-dev slapd[31584]: conn=1006 op=245465 SRCH base="dc=csupomona,dc=edu" scope=2 deref=0 filter="(objectClass=*)"
May 5 23:50:34 pip-dev slapd[31584]: conn=1006 op=245465 SRCH attr=* +
May 5 23:50:34 pip-dev slapd[31584]: srs csn 20140506065033.649766Z#000000#000#000000
May 5 23:50:34 pip-dev slapd[31584]: log csn 20140506044033.088487Z#000000#000#000000
May 5 23:50:34 pip-dev slapd[31584]: cmp -512, too old
May 5 23:50:34 pip-dev slapd[31584]: log csn 20140506044532.528640Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: cmp -512, too old
May 5 23:50:34 pip-dev slapd[31584]: log csn 20140506044532.602250Z#000000#001#000000
[...]
May 5 23:50:34 pip-dev slapd[31584]: log csn 20140506065034.351114Z#000000#001#000000
May 5 23:50:34 pip-dev slapd[31584]: cmp 0, too old
May 5 23:50:34 pip-dev slapd[31584]: log csn 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: syncprov_search_response: cookie=rid=003,sid=003,csn=20140506065033.649766Z#000000#000#000000;20140506065034.351114Z#000000#001#000000;20140506065033.164941Z#000000#002#000000;20140506065034.590711Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: conn=1006 op=245465 SEARCH RESULT tag=101 err=0 nentries=1 text=
May 5 23:50:34 pip-dev slapd[31584]: conn=8558 fd=19 closed (connection lost)
May 5 23:50:34 pip-dev slapd[31584]: do_syncrep2: rid=002 cookie=rid=000,sid=001,csn=20140506065034.590711Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: do_syncrep2: rid=002 CSN too old, ignoring 20140506065034.590711Z#000000#003#000000 (reqStart=20140506065034.000024Z,cn=accesslog)
[...]
May 5 23:50:34 pip-dev slapd[31584]: slap_queue_csn: queing 0x7feb2011477b 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: slap_queue_csn: queing 0x7feb2010d880 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: do_syncrep2: rid=003 cookie=rid=003,sid=002,csn=20140506065034.675448Z#000000#000#000000
May 5 23:50:34 pip-dev slapd[31584]: do_syncrep2: rid=003 CSN too old, ignoring 20140506065034.675448Z#000000#000#000000 (reqStart=20140506065034.000026Z,cn=accesslog)
May 5 23:50:34 pip-dev slapd[31584]: slap_graduate_commit_csn: removing 0x7feb201203b0 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 pip-dev slapd[31584]: slap_graduate_commit_csn: removing 0x7feb20126830 20140506065034.590711Z#000000#003#000000
On another server, the first instance of this CSN shows up:
May 5 23:50:34 fosse-dev slapd[7791]: do_syncrep2: rid=003 cookie=rid=000,sid=003,csn=20140506065034.590711Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: slap_queue_csn: queing 0x7fa744104f7d 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: slap_queue_csn: queing 0x7fa74413d590 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: syncprov_matchops: skipping original sid 003
May 5 23:50:34 fosse-dev slapd[7791]: slap_graduate_commit_csn: removing 0x7fa7441369d0 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: syncprov_sendresp: to=001, cookie=rid=001,csn=20140506065034.590711Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: slap_graduate_commit_csn: removing 0x7fa744136770 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: syncrepl_message_to_op: rid=003 be_modify cn=slapd-monitor,ou=user,ou=service,dc=csupomona,dc=edu (0)
May 5 23:50:34 fosse-dev slapd[7791]: slap_queue_csn: queing 0x7fa744134100 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: syncprov_sendresp: to=002, cookie=rid=001,csn=20140506065034.590711Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: conn=1021 op=2 ABANDON msg=2
May 5 23:50:34 fosse-dev slapd[7791]: conn=1021 op=3 SRCH base="dc=csupomona,dc=edu" scope=2 deref=0 filter="(objectClass=*)"
May 5 23:50:34 fosse-dev slapd[7791]: conn=1021 op=3 SRCH attr=* +
May 5 23:50:34 fosse-dev slapd[7791]: srs csn 20140506065033.649766Z#000000#000#000000
May 5 23:50:34 fosse-dev slapd[7791]: log csn 20140506044032.985641Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: cmp -512, too old
May 5 23:50:34 fosse-dev slapd[7791]: log csn 20140506044033.088487Z#000000#000#000000
May 5 23:50:34 fosse-dev slapd[7791]: cmp -512, too old
May 5 23:50:34 fosse-dev slapd[7791]: log csn 20140506044532.528640Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: cmp -512, too old
[...]
May 5 23:50:34 fosse-dev slapd[7791]: slap_graduate_commit_csn: removing 0x7fa744136770 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: cmp -256, too old
May 5 23:50:34 fosse-dev slapd[7791]: log csn 20140506055533.107369Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: cmp -256, too old
[...]
May 5 23:50:34 fosse-dev slapd[7791]: log csn 20140506062531.905240Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: cmp -3, too old
May 5 23:50:34 fosse-dev slapd[7791]: do_syncrep2: rid=002 cookie=rid=002,sid=002,csn=20140506065034.590711Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: log csn 20140506062532.031640Z#000000#001#000000
May 5 23:50:34 fosse-dev slapd[7791]: cmp -3, too old
[...]
May 5 23:50:34 fosse-dev slapd[7791]: log csn 20140506065034.590711Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: syncprov_search_response: cookie=rid=001,csn=20140506065033.649766Z#000000#000#000000;20140506065034.351114Z#000000#001#000000;20140506065033.164941Z#000000#002#000000;20140506065034.590711Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: conn=1021 op=3 SEARCH RESULT tag=101 err=0 nentries=1 text=
May 5 23:50:34 fosse-dev slapd[7791]: do_syncrep2: rid=001 cookie=rid=001,sid=001,csn=20140506065034.590711Z#000000#003#000000
May 5 23:50:34 fosse-dev slapd[7791]: do_syncrep2: rid=001 CSN too old, ignoring 20140506065034.590711Z#000000#003#000000 (reqStart=20140506065034.000024Z,cn=accesslog)
And then they all just go on forever, queuing the CSN and removing it.
Any thoughts on what's blowing up?
Thanks...
--On May 6, 2014 at 1:39:22 PM -0700 "Paul B. Henson" henson@acm.org wrote:
On Fri, May 02, 2014 at 07:22:10PM -0700, Quanah Gibson-Mount wrote:
I would suggest (if you haven't) enabling sync replication logging (loglevel sync) in addition to whatever other loglevels you have.
Ok, so I reloaded my dev environment with a fresh snapshot of production data, and started it back up. After a couple hours, all of the servers started producing this entry in the accesslog db:
May 5 23:50:34 fosse-dev slapd[7791]: syncprov_search_response: cookie=rid=001,csn=20140506065033.649766Z#000000#000#000000;
Any thoughts on what's blowing up?
Looks like it lost the serverID, which is exactly what I saw. Note the #000#
--Quanah
--On May 5, 2014 at 7:40:31 PM -0700 "Paul B. Henson" henson@acm.org wrote:
From: Quanah Gibson-Mount
Sent: Friday, May 02, 2014 7:22 PM
I would suggest (if you haven't) enabling sync replication logging (loglevel sync) in addition to whatever other loglevels you have. I've recently found it is possible to send MMR into an endless loop in some cases. I'm still working on a reproduction case, but for one of our clients it eventually happens 100% of the time.
Ouch. By any chance, does your client use the password policy module? I've never had a problem like this until I started testing with it in my development instance.
No, we don't use ppolicy. But for us the issue arises in Zimbra-specific functionality that is similar to ppolicy.
I've enabled the additional logging level and reset my testing to the beginning, and will see if it does it again. When it happens to your client in production, how do you resolve it?
You can gdb slapd, and manually fix the serverID in the syncinfo structure, or you can restart all your slapd servers.
--Quanah
--On May 6, 2014 at 9:42:13 PM -0700 Quanah Gibson-Mount quanah@zimbra.com wrote:
--On May 6, 2014 at 1:39:22 PM -0700 "Paul B. Henson" henson@acm.org wrote:
On Fri, May 02, 2014 at 07:22:10PM -0700, Quanah Gibson-Mount wrote:
I would suggest (if you haven't) enabling sync replication logging (loglevel sync) in addition to whatever other loglevels you have.
Ok, so I reloaded my dev environment with a fresh snapshot of production data, and started it back up. After a couple hours, all of the servers started producing this entry in the accesslog db:
May 5 23:50:34 fosse-dev slapd[7791]: syncprov_search_response: cookie=rid=001,csn=20140506065033.649766Z#000000#000#000000;
Any thoughts on what's blowing up?
Looks like it lost the serverID, which is exactly what I saw. Note the #000#
One other thing -- Did any of your servers go into refresh mode prior to this loop starting?
grep REFRESH in the log.
--Quanah
On Wed, May 07, 2014 at 09:42:12AM -0700, Quanah Gibson-Mount wrote:
One other thing -- Did any of your servers go into refresh mode prior to this loop starting?
Yes, it started logging:
May 5 19:00:38 filmore-dev slapd[7665]: do_syncrep2: rid=001 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:01:48 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync, switching to REFRESH
It continued logging the lost sync messages almost continuously until:
May 5 19:06:16 filmore-dev slapd[7665]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
May 5 19:06:19 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:06:26 filmore-dev slapd[7665]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
May 5 19:06:32 filmore-dev slapd[7665]: do_syncrep2: rid=002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 20:10:32 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506031032.000017Z,cn=accesslog), switching to REFRESH
May 5 20:10:32 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 21:30:34 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506043034.000008Z,cn=accesslog), switching to REFRESH
May 5 21:31:26 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 22:10:36 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506051035.000019Z,cn=accesslog), switching to REFRESH
May 5 22:10:36 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 22:45:33 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506054533.000016Z,cn=accesslog), switching to REFRESH
May 5 22:45:33 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 23:10:35 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506061035.000008Z,cn=accesslog), switching to REFRESH
May 5 23:11:27 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 6 02:15:05 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506075534.000068Z,cn=accesslog), switching to REFRESH
May 6 02:28:40 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
The others had similar logs:
May 5 19:00:40 fosse-dev slapd[7791]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
May 5 19:00:40 fosse-dev slapd[7791]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
[...]
May 5 19:10:33 fosse-dev slapd[7791]: do_syncrep2: rid=001 delta-sync lost sync, switching to REFRESH
May 5 19:10:36 fosse-dev slapd[7791]: do_syncrep2: rid=001 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:20:33 fosse-dev slapd[7791]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506022033.000008Z,cn=accesslog), switching to REFRESH
May 5 19:20:33 fosse-dev slapd[7791]: do_syncrep2: rid=001 delta-sync lost sync on (reqStart=20140506022033.000009Z,cn=accesslog), switching to REFRESH
May 5 19:20:33 fosse-dev slapd[7791]: do_syncrep2: rid=002 delta-sync lost sync on (reqStart=20140506022033.000009Z,cn=accesslog), switching to REFRESH
May 5 19:20:33 fosse-dev slapd[7791]: do_syncrep2: rid=001 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:20:33 fosse-dev slapd[7791]: do_syncrep2: rid=002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:21:26 fosse-dev slapd[7791]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 21:30:34 fosse-dev slapd[7791]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506043034.000008Z,cn=accesslog), switching to REFRESH
May 5 19:02:17 shelley-dev slapd[8178]: do_syncrep2: rid=003 delta-sync lost sync, switching to REFRESH
May 5 19:02:17 shelley-dev slapd[8178]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
May 5 19:02:17 shelley-dev slapd[8178]: do_syncrep2: rid=001 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:02:21 shelley-dev slapd[8178]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
[...]
May 5 19:06:19 shelley-dev slapd[8178]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
May 5 19:06:19 shelley-dev slapd[8178]: do_syncrep2: rid=002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:06:19 shelley-dev slapd[8178]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 20:10:32 shelley-dev slapd[8178]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506031032.000017Z,cn=accesslog), switching to REFRESH
May 5 20:10:32 shelley-dev slapd[8178]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 21:30:34 shelley-dev slapd[8178]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506043034.000008Z,cn=accesslog), switching to REFRESH
May 5 21:30:34 shelley-dev slapd[8178]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 22:45:33 shelley-dev slapd[8178]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506054533.000016Z,cn=accesslog), switching to REFRESH
I don't know if it's related, but right before the loop it was replicating pwdFailureTime changes around:
dn: reqStart=20140506064532.000030Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20140506064532.000030Z
reqEnd: 20140506064532.000033Z
reqType: modify
reqSession: 3
reqAuthzID: cn=ldaproot,dc=csupomona,dc=edu
reqDN: cn=slapd-monitor,ou=user,ou=service,dc=csupomona,dc=edu
reqResult: 0
reqMod: pwdFailureTime:+ 20140506064532Z
reqMod: entryCSN:= 20140506064532.612089Z#000000#003#000000
reqMod: modifiersName:= cn=ldaproot,dc=csupomona,dc=edu
reqMod: modifyTimestamp:= 20140506064532Z
reqEntryUUID: c16729b0-ce1e-4f9d-9123-e0b8f115133c
entryUUID: a1e99b9b-64b8-4e79-9451-48cac10ff333
creatorsName: cn=accesslog
createTimestamp: 20140506064532Z
entryCSN: 20140506064532.612089Z#000000#003#000000
modifiersName: cn=accesslog
modifyTimestamp: 20140506064532Z

dn: reqStart=20140506064532.000034Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20140506064532.000034Z
reqEnd: 20140506064532.000035Z
reqType: modify
reqSession: 5048
reqAuthzID: cn=ldaproot,dc=csupomona,dc=edu
reqDN: cn=slapd-monitor,ou=user,ou=service,dc=csupomona,dc=edu
reqResult: 0
reqMod: pwdFailureTime:-
reqMod: entryCSN:= 20140506064532.680931Z#000000#002#000000
reqMod: modifiersName:= cn=ldaproot,dc=csupomona,dc=edu
reqMod: modifyTimestamp:= 20140506064532Z
reqEntryUUID: c16729b0-ce1e-4f9d-9123-e0b8f115133c
entryUUID: 557c4163-5170-4dc1-84c4-0159797d5538
creatorsName: cn=accesslog
createTimestamp: 20140506064532Z
entryCSN: 20140506064532.680931Z#000000#002#000000
modifiersName: cn=accesslog
modifyTimestamp: 20140506064532Z

dn: reqStart=20140506064532.000040Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20140506064532.000040Z
reqEnd: 20140506064532.000043Z
reqType: modify
reqSession: 3
reqAuthzID: cn=ldaproot,dc=csupomona,dc=edu
reqDN: cn=slapd-monitor,ou=user,ou=service,dc=csupomona,dc=edu
reqResult: 0
reqMod: pwdFailureTime:+ 20140506064532Z
reqMod: entryCSN:= 20140506064532.861668Z#000000#003#000000
reqMod: modifiersName:= cn=ldaproot,dc=csupomona,dc=edu
reqMod: modifyTimestamp:= 20140506064532Z
reqEntryUUID: c16729b0-ce1e-4f9d-9123-e0b8f115133c
entryUUID: c4ba3c02-b67a-4522-8767-8da1fa084916
creatorsName: cn=accesslog
createTimestamp: 20140506064532Z
entryCSN: 20140506064532.861668Z#000000#003#000000
modifiersName: cn=accesslog
modifyTimestamp: 20140506064532Z
I've filed an ITS on the issue and will see if I can replicate it in our lab. This looks exactly like what I am seeing as well. Howard may be able to provide some gdb actions he would like to see as well.
--Quanah
On May 7, 2014, at 2:57 PM, "Paul B. Henson" henson@acm.org wrote:
On Wed, May 07, 2014 at 09:42:12AM -0700, Quanah Gibson-Mount wrote:
One other thing -- Did any of your servers go into refresh mode prior to this loop starting?
Yes, it started logging:
May 5 19:00:38 filmore-dev slapd[7665]: do_syncrep2: rid=001 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:01:48 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync, switching to REFRESH
It continued logging the lost sync messages almost continuously until:
May 5 19:06:16 filmore-dev slapd[7665]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
May 5 19:06:19 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:06:26 filmore-dev slapd[7665]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
May 5 19:06:32 filmore-dev slapd[7665]: do_syncrep2: rid=002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 20:10:32 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506031032.000017Z,cn=accesslog), switching to REFRESH
May 5 20:10:32 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 21:30:34 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506043034.000008Z,cn=accesslog), switching to REFRESH
May 5 21:31:26 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 22:10:36 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506051035.000019Z,cn=accesslog), switching to REFRESH
May 5 22:10:36 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 22:45:33 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506054533.000016Z,cn=accesslog), switching to REFRESH
May 5 22:45:33 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 23:10:35 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506061035.000008Z,cn=accesslog), switching to REFRESH
May 5 23:11:27 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 6 02:15:05 filmore-dev slapd[7665]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506075534.000068Z,cn=accesslog), switching to REFRESH
May 6 02:28:40 filmore-dev slapd[7665]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
The others had similar logs:
May 5 19:00:40 fosse-dev slapd[7791]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
May 5 19:00:40 fosse-dev slapd[7791]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
[...]
May 5 19:10:33 fosse-dev slapd[7791]: do_syncrep2: rid=001 delta-sync lost sync, switching to REFRESH
May 5 19:10:36 fosse-dev slapd[7791]: do_syncrep2: rid=001 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:20:33 fosse-dev slapd[7791]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506022033.000008Z,cn=accesslog), switching to REFRESH
May 5 19:20:33 fosse-dev slapd[7791]: do_syncrep2: rid=001 delta-sync lost sync on (reqStart=20140506022033.000009Z,cn=accesslog), switching to REFRESH
May 5 19:20:33 fosse-dev slapd[7791]: do_syncrep2: rid=002 delta-sync lost sync on (reqStart=20140506022033.000009Z,cn=accesslog), switching to REFRESH
May 5 19:20:33 fosse-dev slapd[7791]: do_syncrep2: rid=001 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:20:33 fosse-dev slapd[7791]: do_syncrep2: rid=002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:21:26 fosse-dev slapd[7791]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 21:30:34 fosse-dev slapd[7791]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506043034.000008Z,cn=accesslog), switching to REFRESH
May 5 19:02:17 shelley-dev slapd[8178]: do_syncrep2: rid=003 delta-sync lost sync, switching to REFRESH
May 5 19:02:17 shelley-dev slapd[8178]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
May 5 19:02:17 shelley-dev slapd[8178]: do_syncrep2: rid=001 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:02:21 shelley-dev slapd[8178]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
[...]
May 5 19:06:19 shelley-dev slapd[8178]: do_syncrep2: rid=002 delta-sync lost sync, switching to REFRESH
May 5 19:06:19 shelley-dev slapd[8178]: do_syncrep2: rid=002 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 19:06:19 shelley-dev slapd[8178]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 20:10:32 shelley-dev slapd[8178]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506031032.000017Z,cn=accesslog), switching to REFRESH
May 5 20:10:32 shelley-dev slapd[8178]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 21:30:34 shelley-dev slapd[8178]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506043034.000008Z,cn=accesslog), switching to REFRESH
May 5 21:30:34 shelley-dev slapd[8178]: do_syncrep2: rid=003 LDAP_RES_INTERMEDIATE - REFRESH_DELETE
May 5 22:45:33 shelley-dev slapd[8178]: do_syncrep2: rid=003 delta-sync lost sync on (reqStart=20140506054533.000016Z,cn=accesslog), switching to REFRESH
I don't know if it's related, but right before the loop it was replicating pwdFailureTime changes around:
dn: reqStart=20140506064532.000030Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20140506064532.000030Z
reqEnd: 20140506064532.000033Z
reqType: modify
reqSession: 3
reqAuthzID: cn=ldaproot,dc=csupomona,dc=edu
reqDN: cn=slapd-monitor,ou=user,ou=service,dc=csupomona,dc=edu
reqResult: 0
reqMod: pwdFailureTime:+ 20140506064532Z
reqMod: entryCSN:= 20140506064532.612089Z#000000#003#000000
reqMod: modifiersName:= cn=ldaproot,dc=csupomona,dc=edu
reqMod: modifyTimestamp:= 20140506064532Z
reqEntryUUID: c16729b0-ce1e-4f9d-9123-e0b8f115133c
entryUUID: a1e99b9b-64b8-4e79-9451-48cac10ff333
creatorsName: cn=accesslog
createTimestamp: 20140506064532Z
entryCSN: 20140506064532.612089Z#000000#003#000000
modifiersName: cn=accesslog
modifyTimestamp: 20140506064532Z

dn: reqStart=20140506064532.000034Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20140506064532.000034Z
reqEnd: 20140506064532.000035Z
reqType: modify
reqSession: 5048
reqAuthzID: cn=ldaproot,dc=csupomona,dc=edu
reqDN: cn=slapd-monitor,ou=user,ou=service,dc=csupomona,dc=edu
reqResult: 0
reqMod: pwdFailureTime:-
reqMod: entryCSN:= 20140506064532.680931Z#000000#002#000000
reqMod: modifiersName:= cn=ldaproot,dc=csupomona,dc=edu
reqMod: modifyTimestamp:= 20140506064532Z
reqEntryUUID: c16729b0-ce1e-4f9d-9123-e0b8f115133c
entryUUID: 557c4163-5170-4dc1-84c4-0159797d5538
creatorsName: cn=accesslog
createTimestamp: 20140506064532Z
entryCSN: 20140506064532.680931Z#000000#002#000000
modifiersName: cn=accesslog
modifyTimestamp: 20140506064532Z

dn: reqStart=20140506064532.000040Z,cn=accesslog
objectClass: auditModify
structuralObjectClass: auditModify
reqStart: 20140506064532.000040Z
reqEnd: 20140506064532.000043Z
reqType: modify
reqSession: 3
reqAuthzID: cn=ldaproot,dc=csupomona,dc=edu
reqDN: cn=slapd-monitor,ou=user,ou=service,dc=csupomona,dc=edu
reqResult: 0
reqMod: pwdFailureTime:+ 20140506064532Z
reqMod: entryCSN:= 20140506064532.861668Z#000000#003#000000
reqMod: modifiersName:= cn=ldaproot,dc=csupomona,dc=edu
reqMod: modifyTimestamp:= 20140506064532Z
reqEntryUUID: c16729b0-ce1e-4f9d-9123-e0b8f115133c
entryUUID: c4ba3c02-b67a-4522-8767-8da1fa084916
creatorsName: cn=accesslog
createTimestamp: 20140506064532Z
entryCSN: 20140506064532.861668Z#000000#003#000000
modifiersName: cn=accesslog
modifyTimestamp: 20140506064532Z
From: Quanah Gibson-Mount
Sent: Tuesday, May 06, 2014 9:43 PM
beginning, and will see if it does it again. When it happens to your client in production, how do you resolve it?
You can gdb slapd, and manually fix the serverID in the syncinfo structure, or you can restart all your slapd servers.
How are you detecting when it starts? On my dev system, the first symptom is massive memory use by the slapd process, followed by an alert that the accesslog db is over 80% full. Then slapd processes start getting killed off by the OOM mechanism and my dev environment basically implodes. If it happened in production, odds are I wouldn't catch it in time to keep things from going south. How do you trim out the extremely large number of duplicate entries in the accesslog when you are cleaning up after an occurrence in one of your production environments?
Hopefully the underlying issue will be sorted out soon. I'm just going to tell our security guys they are not going to get their account lockouts as long as the password policy module puts my dev environment into conniptions :).
Off-topic, is there a secret handshake for the developer list or something ;)? The couple of messages I posted after I subscribed still haven't shown up, and a couple of inquiries I made on this list about the moderation on that list were never responded to <sigh>...
Thanks.
From: Quanah Gibson-Mount
Sent: Wednesday, May 07, 2014 5:58 PM
I've filed an ITS on the issue and will see if I can replicate it in our lab. This looks exactly like what I am seeing as well. Howard may be able to provide some gdb actions he would like to see as well.
Cool, I see it is 7849. There's no way with the ITS system to subscribe or something to track bug updates or changes like with other bug trackers?
I don't know that I could reproduce it on demand, but so far it seems every time I reload my dev environment with a snapshot of production data and turn on the password policy module, within a day it starts spinning in that replication loop. So if there's anything I can do to help track it down let me know.
Thanks.
--On May 8, 2014 at 3:10:45 PM -0700 "Paul B. Henson" henson@acm.org wrote:
From: Quanah Gibson-Mount
Sent: Wednesday, May 07, 2014 5:58 PM
I've filed an ITS on the issue and will see if I can replicate it in our lab. This looks exactly like what I am seeing as well. Howard may be able to provide some gdb actions he would like to see as well.
Cool, I see it is 7849. There's no way with the ITS system to subscribe or something to track bug updates or changes like with other bug trackers?
Not yet. I'm working on moving OpenLDAP to using Bugzilla, but there are some other pieces in front of that.
I don't know that I could reproduce it on demand, but so far it seems every time I reload my dev environment with a snapshot of production data and turn on the password policy module, within a day it starts spinning in that replication loop. So if there's anything I can do to help track it down let me know.
Ok. What is triggering the password policy updates in dev?
--Quanah
--On May 8, 2014 at 3:07:23 PM -0700 "Paul B. Henson" henson@acm.org wrote:
From: Quanah Gibson-Mount
Sent: Tuesday, May 06, 2014 9:43 PM
beginning, and will see if it does it again. When it happens to your client in production, how do you resolve it?
You can gdb slapd, and manually fix the serverID in the syncinfo structure, or you can restart all your slapd servers.
How are you detecting when it starts? On my dev system, the first symptom is massive memory use by the slapd process, followed by an alert that the accesslog db is over 80% full. Then slapd processes start getting killed off by the OOM mechanism and my dev environment basically implodes. If it happened in production, odds are I wouldn't catch it in time to keep things from going south. How do you trim out the extremely large number of duplicate entries in the accesslog when you are cleaning up after an occurrence in one of your production environments?
The massive memory consumption would be due to the switch to refresh mode. On the environments I've been using, OOM is disabled (horrible concept), so there's no killing of slapd itself. As for cleaning the accesslog, I stop all servers, move it aside, and restart (it'll create a new accesslog db).
Hopefully the underlying issue will be sorted out soon. I'm just going to tell our security guys they are not going to get their account lockouts as long as the password policy module puts my dev environment into conniptions :).
Yeah, it isn't specific to ppolicy because I don't use it. I'm trying to get this happening in my dev env now.
Off-topic, is there a secret handshake for the developer list or something ;)? The couple of messages I posted after I subscribed still haven't shown up, and a couple of inquiries I made on this list about the moderation on that list were never responded to <sigh>...
Nope... I think I have moderator privs even, but I don't recall where I have to log in to do it. :/ There's a couple of people I can bug about it though.
--Quanah
Paul B. Henson wrote:
From: Quanah Gibson-Mount
Sent: Wednesday, May 07, 2014 5:58 PM
I've filed an ITS on the issue and will see if I can replicate it in our lab. This looks exactly like what I am seeing as well. Howard may be able to provide some gdb actions he would like to see as well.
Cool, I see it is 7849. There's no way with the ITS system to subscribe or something to track bug updates or changes like with other bug trackers?
I don't know that I could reproduce it on demand, but so far it seems every time I reload my dev environment with a snapshot of production data and turn on the password policy module, within a day it starts spinning in that replication loop. So if there's anything I can do to help track it down let me know.
I'm using slapo-ppolicy and don't see this issue. My deployment uses normal syncrepl (not delta-syncrepl) though.
And I'm not using account lockout. However, slapo-ppolicy sets pwdFailureTime in my deployment. All replicas (even read-only consumers) have slapo-accesslog attached to record failed logins etc.
Ciao, Michael.
--On May 8, 2014 at 3:43:06 PM -0700 Quanah Gibson-Mount quanah@zimbra.com wrote:
You can gdb slapd, and manually fix the serverID in the syncinfo structure, or you can restart all your slapd servers.
How are you detecting when it starts? On my dev system, the first symptom is massive memory use by the slapd process, followed by an alert that the accesslog db is over 80% full. Then slapd processes start getting killed off by the OOM mechanism and my dev environment basically implodes. If it happened in production, odds are I wouldn't catch it in time to keep things from going south. How do you trim out the extremely large number of duplicate entries in the accesslog when you are cleaning up after an occurrence in one of your production environments?
The massive memory consumption would be due to the switch to refresh mode. On the environments I've been using, OOM is disabled (horrible concept), so there's no killing of slapd itself. As for cleaning the accesslog, I stop all servers, move it aside, and restart (it'll create a new accesslog db).
Hopefully the underlying issue will be sorted out soon. I'm just going to tell our security guys they are not going to get their account lockouts as long as the password policy module puts my dev environment into conniptions :).
Yeah, it isn't specific to ppolicy because I don't use it. I'm trying to get this happening in my dev env now.
Hi Paul,
I haven't had any luck in reproducing it in my lab. I'd be curious to know if you could share your cn=config setup (minus rootdn passwords), and describe how you are triggering the ppolicy updates in the lab. I need to find a way to reliably trigger the problem so it can be debugged under GDB. If you're up to gdb debugging, then the first step is to gdb slapd, and set a watchpoint on the serverID, so you can see at which point it gets set to '0' instead of the correct sid value.
Thanks, Quanah
From: Quanah Gibson-Mount
Sent: Thursday, May 08, 2014 3:43 PM
Nope... I think I have moderator privs even, but I don't recall where I have to log in to do it. :/ There's a couple of people I can bug about it though.
The list administrative interface is at:
http://www.openldap.org/lists/mm/admin/openldap-devel
But you probably also don't recall the password to get into it ;).
--On May 13, 2014 at 1:51:30 PM -0700 "Paul B. Henson" henson@acm.org wrote:
From: Quanah Gibson-Mount
Sent: Thursday, May 08, 2014 3:43 PM
Nope... I think I have moderator privs even, but I don't recall where I have to log in to do it. :/ There's a couple of people I can bug about it though.
The list administrative interface is at:
http://www.openldap.org/lists/mm/admin/openldap-devel
But you probably also don't recall the password to get into it ;).
Very true. ;)
--Quanah
From: Quanah Gibson-Mount
Sent: Monday, May 12, 2014 7:00 PM
I haven't had any luck in reproducing it in my lab. I'd be curious to know if you could share your cn=config setup (minus rootdn passwords), and describe how you are triggering the ppolicy updates in the lab.
I'll send you my config off list. The first time this happened, I had enabled the password policy module and configured two policies, one as default, and one explicitly configured on about a dozen service accounts. I played with that for about an hour until I realized the authentication failure granularity was insufficient to meet our account lockout needs. Then it just sat basically idle, and maybe 6-8 hours later it started ramping up memory use and blew up. The second time, I reloaded our dev environment with a snapshot of production data, and started it up with the password policy module loaded, but no actual password policies defined. Once again, within a few hours it started spinning. I'll see if I can get some minimal configuration that reliably reproduces this failure; I don't think our ISO would be on board with shipping you a copy of our production data :).
If you're up to gdb debugging, then the first step is to gdb slapd, and set a watchpoint on the serverID, so you can see at which point it gets set to '0' instead of the correct sid value.
My current binaries don't have debugging symbols, but I will build a binary with debugging enabled and give it a try if I get the time. So you mean the slap_serverID variable defined in servers/slapd/ctxcsn.c?
--On May 13, 2014 at 2:08:30 PM -0700 "Paul B. Henson" henson@acm.org wrote:
From: Quanah Gibson-Mount
Sent: Monday, May 12, 2014 7:00 PM
I haven't had any luck in reproducing it in my lab. I'd be curious to know if you could share your cn=config setup (minus rootdn passwords), and describe how you are triggering the ppolicy updates in the lab.
I'll send you my config off list. The first time this happened, I had enabled the password policy module and configured two policies, one as default, and one explicitly configured on about a dozen service accounts. I played with that for about an hour until I realized the authentication failure granularity was insufficient to meet our account lockout needs. Then it just sat basically idle, and maybe 6-8 hours later it started ramping up memory use and blew up. The second time, I reloaded our dev environment with a snapshot of production data, and started it up with the password policy module loaded, but no actual password policies defined. Once again, within a few hours it started spinning. I'll see if I can get some minimal configuration that reliably reproduces this failure; I don't think our ISO would be on board with shipping you a copy of our production data :).
That's interesting... it was totally idle (doing nothing at all?).
If you're up to gdb debugging, then the first step is to gdb slapd, and set a watchpoint on the serverID, so you can see at which point it gets set to '0' instead of the correct sid value.
My current binaries don't have debugging symbols, but I will build a binary with debugging enabled and give it a try if I get the time. So you mean the slap_serverID variable defined in servers/slapd/ctxcsn.c?
No, s_sid in the syncops struct in syncprov.c. That is what was flipping from [1|2|3] -> 0 on my systems.
--Quanah
--On May 13, 2014 at 2:21:03 PM -0700 Quanah Gibson-Mount quanah@zimbra.com wrote:
My current binaries don't have debugging symbols, but I will build a binary with debugging enabled and give it a try if I get the time. So you mean the slap_serverID variable defined in servers/slapd/ctxcsn.c?
No, s_sid in the syncops struct in syncprov.c. That is what was flipping from [1|2|3] -> 0 on my systems.
For example, here's what I saw on a problematic master:
(gdb) print *so
$1 = {s_next = 0x0, s_base = {bv_len = 12, bv_val = 0xa857d70 "cn=accesslog"}, s_eid = 1, s_op = 0xc13a300, s_rid = 0, s_sid = 0,
  s_filterstr = {bv_len = 46, bv_val = 0x42ca1f8 ""}, s_flags = 34, s_inuse = 2, s_res = 0x0, s_restail = 0x0,
  s_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = '\000' <repeats 39 times>,
We can see here that s_sid = 0 instead of 2 (the serverID of this master).
The other issue here is that it has also lost the rid (s_rid = 0). We use rids of 100 and up in my configurations. So you can set watchpoints on both of those and see which triggers first.
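Roughly, and only as a sketch: attach, break somewhere a syncops pointer (here called so, as in the dump above) is in scope -- syncprov_sendresp is one plausible spot -- and once it fires, set location watchpoints on the two fields:

(gdb) attach <slapd pid>
(gdb) break syncprov_sendresp
(gdb) continue
(gdb) watch -l so->s_sid
(gdb) watch -l so->s_rid
(gdb) continue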
--Quanah
From: Quanah Gibson-Mount
Sent: Tuesday, May 13, 2014 2:21 PM
That's interesting... it was totally idle (doing nothing at all?).
Yes, the only activity after slapadd'ing the data and starting the server was the automated access to the monitoring backend by a service account.
No, s_sid in the syncops struct in syncprov.c. That is what was flipping from [1|2|3] -> 0 on my systems.
Ah, ok.
--On May 13, 2014 at 5:46:38 PM -0700 "Paul B. Henson" henson@acm.org wrote:
From: Quanah Gibson-Mount
Sent: Tuesday, May 13, 2014 2:21 PM
That's interesting... it was totally idle (doing nothing at all?).
Yes, the only activity after slapadd'ing the data and starting the server was the automated access to the monitoring backend by a service account.
Ok. Hm, I was wondering if it was related to the accesslog purge, but your purge only happens once a day for things > 1 week old, so that wouldn't have kicked in over such a short period.
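(By "your purge" I mean the accesslog overlay's logpurge setting; as an illustration only, a keep-a-week, check-once-a-day schedule in cn=config looks something like the following, with the overlay and database DN being placeholders rather than your actual config:)

dn: olcOverlay={0}accesslog,olcDatabase={1}mdb,cn=config
changetype: modify
replace: olcAccessLogPurge
olcAccessLogPurge: 07+00:00 01+00:00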
--Quanah