OpenLDAP syncrepl woes


Jeffrey Crawford

15 Nov 2011

9 p.m.

I'm trying to stabilize our OpenLDAP server farm before going live and am finding that, despite the contextCSN matching between providers and replicas, the actual content of the servers is getting out of sync. This is most prominent when we test our population routine, which requires removing all accounts before starting. Right now it's only about 22,000 entries (it will get much larger).

During the mass delete we got the following sprinkled throughout the logs on all machines:
====
Nov 15 15:47:16 idm-prod-ldap-2 slapd[33070]: bdb(dc=domain,dc=name): previous transaction deadlock return not resolved
Nov 15 15:47:16 idm-prod-ldap-2 slapd[33070]: => bdb_idl_delete_key: cursor failed: Invalid argument (22)

and the various replicas would still have accounts left over, but the leftovers wouldn't match from one replica to the next.

Granted, the above issues might be explained by the machines not yet having enough RAM. However, it does present us with a problem: when we notice a discrepancy, how do we re-sync the data from the provider server at run time? I have tried slapd -c rid=2,csn=20111114000000.000000Z, but that doesn't seem to do any good. (I've tried several different csn values, such as csn=0 and csn=20111114000000.000000Z#000000#000#000000, to no avail.)
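For reference when experimenting with csn= values: a full OpenLDAP CSN like the last one above has four '#'-separated fields (timestamp, change counter, server ID, modification number), with the last three written in hex. The following is only a minimal Python sketch under that assumption; the names CSN and parse_csn are made up for illustration, and it handles the full four-field form only, not the shortened values also mentioned above.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: datetime  # UTC time of the change
    count: int           # change counter (hex in the string form)
    sid: int             # server ID of the originating server (hex)
    mod: int             # modification number (hex)


def parse_csn(text: str) -> CSN:
    """Split 'YYYYmmddHHMMSS.uuuuuuZ#cccccc#sss#mmmmmm' into its fields."""
    stamp, count, sid, mod = text.strip().split("#")
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S.%fZ").replace(tzinfo=timezone.utc)
    return CSN(when, int(count, 16), int(sid, 16), int(mod, 16))


# The first value comes from the message above; the second is made up.
provider = parse_csn("20111114000000.000000Z#000000#000#000000")
replica = parse_csn("20111113235959.000000Z#000000#000#000000")

# Tuples compare field by field, so an older timestamp sorts first.
print(replica < provider)
```

Comparing parsed values this way only says which CSN is older; it says nothing about whether the entries themselves match.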

I guess my question is twofold: how do I really verify that replication is working properly and is in sync, and how do I force a replica to just take the current content from a provider without question? (I don't really want to remove the database and have it re-sync; I'd rather have it go through, check the content, and update as needed.)
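One way to check content rather than only contextCSN is to export each server's DNs together with their entryCSN values and diff the two sets. The following is a minimal Python sketch, assuming the exports have already been loaded into plain dicts mapping DN to entryCSN (how they are produced, for example by searching each server for the entryCSN attribute, is left out). diff_replica is a hypothetical helper, not an OpenLDAP tool, the sample data is made up, and lower-casing DNs is a simplification of real DN normalization.

```python
from typing import Dict, List, NamedTuple


class ReplicaDiff(NamedTuple):
    missing: List[str]  # on the provider but not on the replica
    extra: List[str]    # on the replica but not on the provider (e.g. missed deletes)
    stale: List[str]    # on both, but the entryCSN values differ


def diff_replica(provider: Dict[str, str], replica: Dict[str, str]) -> ReplicaDiff:
    """Compare two {DN: entryCSN} maps, treating DNs case-insensitively."""
    p = {dn.lower(): csn for dn, csn in provider.items()}
    r = {dn.lower(): csn for dn, csn in replica.items()}
    missing = sorted(p.keys() - r.keys())
    extra = sorted(r.keys() - p.keys())
    stale = sorted(dn for dn in p.keys() & r.keys() if p[dn] != r[dn])
    return ReplicaDiff(missing, extra, stale)


# Made-up sample data for illustration only.
provider = {
    "uid=user1,ou=people,dc=domain,dc=name": "20111115154716.000001Z#000000#000#000000",
}
replica = {
    "uid=user1,ou=people,dc=domain,dc=name": "20111115154716.000001Z#000000#000#000000",
    "uid=user2,ou=people,dc=domain,dc=name": "20111114120000.000000Z#000000#000#000000",
}

result = diff_replica(provider, replica)
print(result.extra)
```

An entry left behind after a failed delete shows up in extra, which is the symptom described above even when the contextCSN values agree.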

Thanks,
Jeffrey Crawford


Howard Chu

16 Nov

12:09 a.m.

Jeffrey Crawford wrote:

> During the mass delete we got the following sprinkled throughout the logs on all machines:
> ====
> Nov 15 15:47:16 idm-prod-ldap-2 slapd[33070]: bdb(dc=domain,dc=name): previous transaction deadlock return not resolved

Wow. I've never seen this error message before. What version of OpenLDAP and BerkeleyDB are you using?

> Nov 15 15:47:16 idm-prod-ldap-2 slapd[33070]: => bdb_idl_delete_key: cursor failed: Invalid argument (22)
>
> and the various replicas would still have accounts left over but they wouldn't match each other.

There are known bugs in syncrepl delete handling. ITS#7052 is probably relevant here. The fix will be in 2.4.27.

--
Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

Jeffrey Crawford

7:40 a.m.

On Wed, Nov 16, 2011 at 12:09 AM, Howard Chu <hyc@symas.com> wrote:

> Wow. I've never seen this error message before. What version of OpenLDAP and BerkeleyDB are you using?

FreeBSD 8.2 with OpenLDAP 2.4.26. However, as I mentioned before, I think we are squeezing RAM right now. Part of this deployment was to discover how much RAM we needed on the virtual machine, and it was started pretty low.

> There are known bugs in syncrepl delete handling. ITS#7052 is probably relevant here. The fix will be in 2.4.27.

Any idea when it will be released?

Jeffrey Crawford

10:23 a.m.

On Wed, Nov 16, 2011 at 7:40 AM, Jeffrey Crawford <jeffreyc@ucsc.edu> wrote:

> FreeBSD 8.2 with OpenLDAP 2.4.26. However, as I mentioned before, I think we are squeezing RAM right now.

Oh, and we are using BDB 4.6 right now (forgot to answer that).

Howard Chu

1:27 p.m.

Jeffrey Crawford wrote:

> Oh and we are using bdb 4.6 right now (forgot to answer that)

Running out of memory would cause an obvious error message ("no memory"), so that's not likely to be the problem here. It might be worth upgrading to at least BDB 4.8, but again, never having seen BDB spit out that error before, that's just a guess.

Jeffrey Crawford

3:49 p.m.

On Wed, Nov 16, 2011 at 1:27 PM, Howard Chu <hyc@symas.com> wrote:

> Running out of memory would cause an obvious error message ("no memory") so that's not likely to be the problem here. Might be worth upgrading to at least BDB 4.8, but again, never having seen BDB spit out that error before, that's just a guess.

Well, now I'm starting to get worried. I noticed lots of swapping during that time and just assumed that the queue was getting backed up, which was causing the deadlocks. I'll have the VM given more memory, see what behavior we get, and report back here in parallel to whatever you decide to do.

Jeffrey

Quanah Gibson-Mount

21 Nov

10:05 a.m.

--On Wednesday, November 16, 2011 3:49 PM -0800 Jeffrey Crawford <jeffreyc@ucsc.edu> wrote:

> Oh and we are using bdb 4.6 right now (forgot to answer that)

With all the patches? Oracle lists 4:

4.6.21 Requires log file format upgrade. change log - patches (4)

Quanah Gibson-Mount
Sr. Member of Technical Staff
Zimbra, Inc
A Division of VMware, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration

Jeffrey Crawford

17 Nov

5:14 p.m.

On Wed, Nov 16, 2011 at 1:27 PM, Howard Chu <hyc@symas.com> wrote:

> Running out of memory would cause an obvious error message ("no memory") so that's not likely to be the problem here. Might be worth upgrading to at least BDB 4.8, but again, never having seen BDB spit out that error before, that's just a guess.

Not sure if this is significant, but I've been noticing that this error only shows up on deletes. However, it also shows up on deletes on the machine I'm running ldapdelete against, so perhaps this is more of a software issue. I'll go ahead and run this with more RAM, and I'll check with the sysadmin whether they can compile it against BDB 4.8 to see if that changes anything. But I don't think ITS#7052 applies here, because the machine I'm doing this against does not use syncrepl; it's the provider to the others.

This is a machine on a VM. Are there any known issues with that?

Jeffrey

Howard Chu

5:50 p.m.

Jeffrey Crawford wrote:

> Not sure if this is significant but I'm been noticing that this error only shows up on deletes. However it also shows up on deletes on the machine I'm running the ldapdelete against. So perhaps this is more of a software issue.
>
> This is a machine on a VM. Are there any known issues with that?

Way back in the dawn of time, there were some VMware implementations that didn't support mutexes correctly. I don't think that's been an issue for many years. There ought to be other error messages in your log, immediately preceding the one you quoted. Post those too.

Jeffrey Crawford

8:29 p.m.

On Thu, Nov 17, 2011 at 5:50 PM, Howard Chu <hyc@symas.com> wrote:

> There ought to be other error messages in your log, immediately preceding the one you quoted. Post those too.

There really isn't much there, but here is an example with what little surrounds it (I've modified only the usernames):

Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10706 DEL dn="uid=user1,ou=people,dc=ucsc,dc=edu"
Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10706 RESULT tag=107 err=0 text=
Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10707 DEL dn="uid=user2,ou=people,dc=ucsc,dc=edu"
Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10707 RESULT tag=107 err=0 text=
Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10708 DEL dn="uid=user3,ou=people,dc=ucsc,dc=edu"
Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10708 RESULT tag=107 err=0 text=
Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10709 DEL dn="uid=user4,ou=people,dc=ucsc,dc=edu"
Nov 17 21:11:55 localhost slapd[1912]: bdb(dc=ucsc,dc=edu): previous transaction deadlock return not resolved
Nov 17 21:11:55 localhost slapd[1912]: => bdb_idl_delete_key: cursor failed: Invalid argument (22)
Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10709 RESULT tag=107 err=80 text=entry index delete failed
Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10710 DEL dn="uid=user5,ou=people,dc=ucsc,dc=edu"
Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10710 RESULT tag=107 err=0 text=
Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10711 DEL dn="uid=user6,ou=people,dc=ucsc,dc=edu"
Nov 17 21:11:56 localhost slapd[1912]: conn=1478 op=10711 RESULT tag=107 err=0 text=
Nov 17 21:11:56 localhost slapd[1912]: conn=1478 op=10712 DEL dn="uid=user7,ou=people,dc=ucsc,dc=edu"
Nov 17 21:11:56 localhost slapd[1912]: conn=1478 op=10712 RESULT tag=107 err=0 text=
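With thousands of deletes in a log like this, picking out the failed ones by eye is tedious. The following is a minimal Python sketch that pairs each DEL line with its RESULT line by conn and op and reports those with a non-zero err. It assumes the log line layout shown in the excerpt; failed_deletes is a made-up helper name, and the sample lines are copied from the excerpt.

```python
import re
from typing import Dict, List, Tuple

DEL_RE = re.compile(r'conn=(\d+) op=(\d+) DEL dn="([^"]*)"')
RESULT_RE = re.compile(r'conn=(\d+) op=(\d+) RESULT tag=107 err=(\d+) text=(.*)')


def failed_deletes(lines: List[str]) -> List[Tuple[str, int, str]]:
    """Return (dn, err, text) for every DEL whose RESULT has a non-zero err."""
    pending: Dict[Tuple[str, str], str] = {}  # (conn, op) -> DN awaiting its RESULT
    failures = []
    for line in lines:
        m = DEL_RE.search(line)
        if m:
            pending[(m.group(1), m.group(2))] = m.group(3)
            continue
        m = RESULT_RE.search(line)
        if m:
            dn = pending.pop((m.group(1), m.group(2)), None)
            err = int(m.group(3))
            if dn is not None and err != 0:
                failures.append((dn, err, m.group(4).strip()))
    return failures


log = [
    'Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10708 DEL dn="uid=user3,ou=people,dc=ucsc,dc=edu"',
    'Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10708 RESULT tag=107 err=0 text=',
    'Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10709 DEL dn="uid=user4,ou=people,dc=ucsc,dc=edu"',
    'Nov 17 21:11:55 localhost slapd[1912]: bdb(dc=ucsc,dc=edu): previous transaction deadlock return not resolved',
    'Nov 17 21:11:55 localhost slapd[1912]: => bdb_idl_delete_key: cursor failed: Invalid argument (22)',
    'Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10709 RESULT tag=107 err=80 text=entry index delete failed',
]
print(failed_deletes(log))
```

The bdb(...) and bdb_idl_delete_key lines carry no conn or op, so the sketch ignores them and relies on the RESULT line alone.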

Howard Chu

9:21 p.m.

Jeffrey Crawford wrote:

> Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10709 DEL dn="uid=user4,ou=people,dc=ucsc,dc=edu"
> Nov 17 21:11:55 localhost slapd[1912]: bdb(dc=ucsc,dc=edu): previous transaction deadlock return not resolved
> Nov 17 21:11:55 localhost slapd[1912]: => bdb_idl_delete_key: cursor failed: Invalid argument (22)
> Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10709 RESULT tag=107 err=80 text=entry index delete failed
> Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10710 DEL dn="uid=user5,ou=people,dc=ucsc,dc=edu"
> Nov 17 21:11:55 localhost slapd[1912]: conn=1478 op=10710 RESULT tag=107 err=0 text=

Strange. The log shows an error occurring while deleting an index. The error message indicates that there was already a deadlock before, but there's no message from the original deadlock, and the indexing code logs *every* error that occurs. Seems more likely a BDB bug.

Also, your client is broken: it looks like it completely ignored the failure result from the ldapdelete operation and went right on to issue another request.
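The point about checking each result can be sketched in Python as follows. This is an illustration only: delete_entry is a hypothetical callable standing in for whatever delete call the real client library provides, and fake_delete merely mimics the err=80 failure seen in the log.

```python
from typing import Callable, Iterable, List, Tuple

LDAP_SUCCESS = 0


def delete_all(dns: Iterable[str],
               delete_entry: Callable[[str], Tuple[int, str]],
               stop_on_error: bool = True) -> List[Tuple[str, int, str]]:
    """Delete each DN and check the result code of every operation.

    delete_entry is a hypothetical callable returning (result_code, text);
    a real client would wrap its LDAP library's delete call here.
    """
    failures = []
    for dn in dns:
        code, text = delete_entry(dn)
        if code != LDAP_SUCCESS:
            failures.append((dn, code, text))
            if stop_on_error:
                break  # issue no further requests after a failure
    return failures


# Stand-in for a server that fails on one entry, mirroring err=80 in the log.
def fake_delete(dn: str) -> Tuple[int, str]:
    if dn.startswith("uid=user4,"):
        return 80, "entry index delete failed"
    return 0, ""


dns = [f"uid=user{i},ou=people,dc=ucsc,dc=edu" for i in range(1, 8)]
print(delete_all(dns, fake_delete))
```

Passing stop_on_error=False keeps going but still returns every failure, so a mass delete cannot quietly leave entries behind.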


Howard Chu

16 Nov

1:22 p.m.

Jeffrey Crawford wrote:

> On Wed, Nov 16, 2011 at 12:09 AM, Howard Chu <hyc@symas.com> wrote:
>> There are known bugs in syncrepl delete handling. ITS#7052 is probably relevant here. The fix will be in 2.4.27.
>
> Any idea when it will be released?

The release branch has been ready to go for a few days. Probably sometime next week.



openldap-technical@openldap.org
