openldap-2.3.41 db-4.2.52.NC-PLUS_5_PATCHES Solaris 10 x86
Layout:
ldapmaster <-syncrepl-> ldapslave01/02/03/04 <-syncrepl-> data-clusters.
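For reference, each slave consumes from the master via a syncrepl stanza roughly like the sketch below (the hostname, searchbase and bind credentials are placeholders rather than our real values; only the rid differs between hosts):

syncrepl rid=329
         provider=ldap://ldapmaster.example.com
         type=refreshAndPersist
         searchbase="dc=example,dc=com"
         scope=sub
         bindmethod=simple
         binddn="cn=replicator,dc=example,dc=com"
         credentials=secret
         retry="60 +"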
It has come to light that we have some sync inconsistencies. At the moment there is a customer domain that shows correct entries on ldapmaster, ldapslave02 and ldapslave04 (and all servers syncing from them), but has incorrect, or rather missing, entries on ldapslave01 and ldapslave03. There are no differences between these hosts (they are in fact HDD clones) and the config files are pushed from git, with only the RID changed.
Here are the logs on ldapslave03 for one of the broken entries (in this case, ou=DNS). I have loglevel=sync on all servers:
slaplog.20100407.gz:Mar 3 12:27:12 ldapslave03.unix slapd[27355]: [ID 561622 local4.debug] syncrepl_del_nonpresent: rid 329 be_delete DNSHostName=@,DNSZoneName=example.com,ou=dns,$DC (66)
slaplog.20100407.gz:Mar 3 12:27:12 ldapslave03.unix slapd[27355]: [ID 561622 local4.debug] syncrepl_del_nonpresent: rid 329 be_delete DNSZoneName=example.com,ou=dns,$DC (66)
What is "syncrepl_del_nonpresent"? Is it something I should be worried about? If I count the number of entries with said error:
ldapslave02: 42
ldapslave03: 7240
That makes me wonder whether it is a global problem for us that is simply exaggerated on some servers.
I notice that the provisioning log for that customer's domain has about 24 "+dns" and "-dns" entries in a row. Not entirely sure why the customer was changing their DNS back and forth so much, but perhaps it is related.
Could a "delete/create/delete" sequence of the same DN, sent to the master but not yet pushed out to all slaves, trigger this situation? Surely all replication happens in strict time sequence, though.
Is there anything I can do presently?
Any advice is most appreciated.
Lund
Jorgen Lundman wrote:
openldap-2.3.41 db-4.2.52.NC-PLUS_5_PATCHES Solaris 10 x86
Layout:
ldapmaster <-syncrepl-> ldapslave01/02/03/04 <-syncrepl-> data-clusters.
I deleted the entire openldap-data directory on ldapslave03, and let it syncrepl the entire database from ldapmaster. It took about 48 hours. After that, I turned port 389 on again as usual.
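Concretely, that amounted to something like this (the data directory path is a placeholder for our actual layout):

# on ldapslave03, with slapd stopped and port 389 blocked
rm -rf /usr/local/var/openldap-data/*
# start slapd again; the dbconfig directives in slapd.conf recreate DB_CONFIG,
# and syncrepl refetches the whole database from ldapmaster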
It took about 18 hours for the first messages:
/var/log/slaplog.20100413.gz:Apr 13 15:07:31 ldapslave03.unix slapd[27475]: [ID 561622 local4.debug] syncrepl_del_nonpresent: rid 329 be_delete DNSHostName=www,DNSZoneName=$customer.com,ou=dns,dc=$DC (66)
/var/log/slaplog.20100413.gz:Apr 13 15:07:32 ldapslave03.unix slapd[27475]: [ID 561622 local4.debug] syncrepl_del_nonpresent: rid 329 be_delete DNSRecord=A1,DNSHostName=www,DNSZoneName=$customer.com,ou=dns,dc=$DC (0)
So it would appear that we do have an on-going problem.
Perhaps it is time to go to a newer version of OpenLDAP. The 2.3.41 version was recommended to us about 3 years ago on this list. What is the recommended/most-stable version at the moment? Should we also upgrade BerkeleyDB?
Thanks,
Lund
--On Wednesday, April 14, 2010 8:47 AM +0900 Jorgen Lundman lundman@lundman.net wrote:
Perhaps it is time to go to a newer version of OpenLDAP. The 2.3.41 version was recommended to us about 3 years ago on this list. What is the recommended/most-stable version at the moment? Should we also upgrade BerkeleyDB?
The latest, most stable release is always found at:
http://www.openldap.org/software/download/
OpenLDAP 2.4 requires BDB 4.4 or later. The only reliable replication mechanism in OpenLDAP 2.3 is using delta-syncrepl.
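Roughly, delta-syncrepl means adding an accesslog database on the provider and pointing the consumers at it. A minimal sketch only -- the suffixes, directory, rid and retry settings below are placeholders, not a drop-in config:

# provider: accesslog database recording successful writes
database bdb
suffix "cn=accesslog"
directory /var/openldap-accesslog
rootdn "cn=accesslog"
index reqStart eq
overlay syncprov
syncprov-nopresent TRUE
syncprov-reloadhint TRUE

# provider: main database logs its writes into the accesslog
database bdb
suffix "dc=example,dc=com"
overlay syncprov
overlay accesslog
logdb "cn=accesslog"
logops writes
logsuccess TRUE
logpurge 07+00:00 01+00:00

# consumer: pull deltas from the accesslog instead of full present-phase syncs
syncrepl rid=1
         provider=ldap://ldapmaster.example.com
         type=refreshAndPersist
         searchbase="dc=example,dc=com"
         logbase="cn=accesslog"
         logfilter="(&(objectClass=auditWriteObject)(reqResult=0))"
         syncdata=accesslog
         retry="60 +"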
You can also read http://www.openldap.org/software/release/changes.html and search for "syncrepl" and "syncprov" to see all the fixes to it since OpenLDAP 2.3.
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
On Wed, 14 Apr 2010, Jorgen Lundman wrote:
Perhaps it is time to go to a newer version of OpenLDAP. The 2.3.41 version was recommended to us about 3 years ago on this list. What is the recommended/most-stable version at the moment? Should we also upgrade BerkeleyDB?
The official recommendation is at http://www.openldap.org/software/download/
Unofficially, I'm very happy with 2.4.21, BDB 4.8.26, at the moment. We've been on those revs for a bit over a month now. Both the behavior and the stability have been excellent.
Jorgen Lundman wrote:
Jorgen Lundman wrote:
openldap-2.3.41 db-4.2.52.NC-PLUS_5_PATCHES Solaris 10 x86
/var/log/slaplog.20100413.gz:Apr 13 15:07:31 ldapslave03.unix slapd[27475]: [ID 561622 local4.debug] syncrepl_del_nonpresent: rid 329 be_delete DNSHostName=www,DNSZoneName=$customer.com,ou=dns,dc=$DC (66)
Just information sharing, no current issues...
I spun up a couple of Xen instances, one as master and one as slave, for my own test case.
I added an A-type DNS record to a zone ("roger.test-this.com" in this case), then deleted said record, watching the sync log output on the slave.
Any number of these operations, add->del->add->del, appeared to work correctly.
However, when I did:
1) add "roger.test-this.com"
2) stop slapd on slave.
3) delete "roger.test-this.com"
4) start slapd on slave.
I get the following entries: (I only start slapd, no other operations)
May 11 14:35:47 ldapslave00.unix slapd[28256]: [ID 542995 local4.debug] slapd shutdown: waiting for 1 threads to terminate
May 11 14:35:48 ldapslave00.unix slapd[28256]: [ID 486161 local4.debug] slapd stopped.
May 11 14:36:15 ldapslave00.unix slapd[28264]: [ID 702911 local4.debug] @(#) $OpenLDAP: slapd 2.3.41 (May 7 2010 11:24:55) $
May 11 14:36:15 ldapslave00.unix slapd[28265]: [ID 100111 local4.debug] slapd starting
May 11 14:36:16 ldapslave00.unix slapd[28265]: [ID 804257 local4.debug] do_syncrep2: rid 279 LDAP_RES_INTERMEDIATE - SYNC_ID_SET
May 11 14:36:16 ldapslave00.unix slapd[28265]: [ID 561622 local4.debug] syncrepl_del_nonpresent: rid 279 be_delete DNSHostName=roger,DNSZoneName=test-this.com,ou=dns,$DC (66)
So I have a test-case to work with.
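Roughly, the same sequence as commands (the bind DN, password and LDIF file are placeholders from my test setup; stop and start slapd on the slave however your platform does it):

# 1) add the record on the master
ldapadd -x -H ldap://ldapmaster -D "cn=admin,dc=example,dc=com" -w secret -f roger.ldif

# 2) stop slapd on the slave (ldapslave00)

# 3) delete the record on the master while the slave is down
ldapdelete -x -H ldap://ldapmaster -D "cn=admin,dc=example,dc=com" -w secret \
    "DNSHostName=roger,DNSZoneName=test-this.com,ou=dns,dc=example,dc=com"

# 4) start slapd on the slave again and watch its sync log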
As a side note, I do notice that if I then further executed:
add "www.test-this.com"
.. the slave ignores it, then ..
del "www.test-this.com"
the slave syncs this fine and we are back in business and consistent again.
I started working on an upgrade path; I picked:
db-4.8.30
openldap-2.4.21
I first upgraded the LDAP master. I tried various techniques, since I can't use db_upgrade directly. But the new 2.4.21 is happy to sync against the old 2.3.41, so I think I will simply run it in parallel until I am sure it has caught up. (Do you guys have a clever way to test and know when LDAP syncrepl has caught up?)
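(So far the best check I have is to compare the suffix's contextCSN on provider and consumer and wait for the values to match; the base DN below is a placeholder for ours:)

ldapsearch -x -H ldap://ldapmaster -s base -b "dc=example,dc=com" contextCSN
ldapsearch -x -H ldap://ldapslave03 -s base -b "dc=example,dc=com" contextCSN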
With LDAP master running 2.4.21 and LDAP slave running 2.3.41, the test-case still works. I.e., it still breaks.
I then also upgraded LDAP slave to 2.4.21 and I can confirm the problem no longer happens. So it has indeed already been fixed. And I have an upgrade path to follow...
--On May 12, 2010 11:52:02 AM +0900 Jorgen Lundman lundman@lundman.net wrote:
I first upgraded the LDAP master. I tried various techniques, since I can't use db_upgrade directly. But the new 2.4.21 is happy to sync against the old 2.3.41, so I think I will simply run it in parallel until I am sure it has caught up. (Do you guys have a clever way to test and know when LDAP syncrepl has caught up?)
You should use the supported tools to migrate your database -- slapcat the master, then slapadd that to the replica... Same with rebuilding the master. That is the only officially supported upgrade path from 2.3 -> 2.4.
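Roughly (the slapd.conf paths below are placeholders; use your own):

# on the 2.3 master, with slapd stopped or writes quiesced
slapcat -f /usr/local/etc/openldap/slapd.conf -l master.ldif

# on the 2.4 host, into an empty database directory
slapadd -f /usr/local/etc/openldap/slapd.conf -l master.ldif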
--Quanah
You should use the supported tools to migrate your database -- slapcat the master, then slapadd that to the replica... Same with rebuilding the master. That is the only officially supported upgrade path from 2.3 -> 2.4.
I did try that too. 'slapcat' took 6 minutes, and 'slapadd' to an empty directory wanted 15 hours. If I move the ldif file to the x4500, where I have 16GB of /tmp, I can 'slapadd' it all in 13 minutes, then rsync the data back to the master's disk (3 minutes).
I just thought rsyncing the binary db files was 'uglier' than having 2.4 syncrepl against 2.3 until it was up to date.
Lund
--On May 13, 2010 9:09:52 AM +0900 Jorgen Lundman lundman@lundman.net wrote:
You should use the supported tools to migrate your database -- slapcat the master, then slapadd that to the replica... Same with rebuilding the master. That is the only officially supported upgrade path from 2.3 -> 2.4.
I did try that too. 'slapcat' took 6 minutes, and 'slapadd' to an empty directory wanted 15 hours. If I move the ldif file to the x4500, where I have 16GB of /tmp, I can 'slapadd' it all in 13 minutes, then rsync the data back to the master's disk (3 minutes).
Sounds to me like you haven't properly tuned your database then. I suggest you read up on properly tuning a DB_CONFIG file (and possibly using a shared memory key as part of that), and I'm guessing you didn't use the -q flag with slapadd?
--Quanah
Sounds to me like you haven't properly tuned your database then. I suggest you read up on properly tuning a DB_CONFIG file (and possibly using a shared memory key as part of that), and I'm guessing you didn't use the -q flag with slapadd?
That is entirely possible; I'm entirely self-taught on LDAP.
I have the DB_CONFIG lines in slapd.conf so that they are automatically put into the DB_CONFIG file. I just assumed that slapadd would do that too. (Just tested, and it does.)
I did not use -q.
# time slapadd -q -f slapd-import.conf -l ../full-dump.txt
bdb_monitor_db_open: monitoring disabled; configure monitor database to enable
*## 10.41% eta 49m44s elapsed 05m46s spd 1.5 M/s
That could be done in a maintenance window of 2 hours at the very least.
Is the official OpenLDAP policy then to do that, even though having 2.4 syncrepl from 2.3 would mean less than 5 minutes of down-time? Clearly that part is attractive to me at 3am..
For amusement's sake, the /tmp import on the x4540:
-## 10.08% eta 10m24s elapsed 01m09s spd 2.0 M/s
And the DB_CONFIG settings:
dbconfig set_lk_detect DB_LOCK_DEFAULT
dbconfig set_lg_max 52428800
dbconfig set_cachesize 0 67108864 1
dbconfig set_flags db_log_autoremove
dbconfig set_lk_max_objects 1500
dbconfig set_lk_max_locks 1500
dbconfig set_lk_max_lockers 1500
--On May 13, 2010 12:18:00 PM +0900 Jorgen Lundman lundman@lundman.net wrote:
And the DB_CONFIG settings:
dbconfig set_lk_detect DB_LOCK_DEFAULT
dbconfig set_lg_max 52428800
dbconfig set_cachesize 0 67108864 1
dbconfig set_flags db_log_autoremove
dbconfig set_lk_max_objects 1500
dbconfig set_lk_max_locks 1500
dbconfig set_lk_max_lockers 1500
You don't provide relevant data on the size of your database to make any comment one way or the other on whether your "set_cachesize" value is correct.
I would suggest you provide the output of du -c -h *.bdb in your OpenLDAP 2.3 database directory.
--Quanah
dbconfig set_lk_detect DB_LOCK_DEFAULT
dbconfig set_lg_max 52428800
dbconfig set_cachesize 0 67108864 1
dbconfig set_flags db_log_autoremove
dbconfig set_lk_max_objects 1500
dbconfig set_lk_max_locks 1500
dbconfig set_lk_max_lockers 1500
You don't provide relevant data on the size of your database to make any comment one way or the other on whether your "set_cachesize" value is correct.
I would suggest you provide the output of du -c -h *.bdb in your OpenLDAP 2.3 database directory.
Coming right up! I appreciate your time in doing some sanity checks for me.
2.8M  DNSData.bdb
3.9M  DNSHostName.bdb
2.4M  DNSIPAddr.bdb
1.4M  DNSType.bdb
1.5M  accountStatus.bdb
258K  amavisSpamLover.bdb
642K  deliveryMode.bdb
1.2G  dn2id.bdb
62M   entryCSN.bdb
84M   entryUUID.bdb
13M   gecos.bdb
898K  gidNumber.bdb
2.5G  id2entry.bdb
9.0M  mail.bdb
7.6M  mailAlternateAddress.bdb
5.6M  o.bdb
3.9M  objectClass.bdb
1.0M  radiusGroupName.bdb
14M   uid.bdb
12M   uidNumber.bdb
(no -c in Solaris; the whole dir is:)
9.0G  .
If it is at all interesting, on the live master (not the test server, which has identical data) the running cache statistics look like:
# /usr/local/BerkeleyDB.4.2/bin/db_stat -m
80MB 2KB 912B  Total cache size.
1        Number of caches.
80MB 8KB Pool individual cache size.
0        Requested pages mapped into the process' address space.
163M     Requested pages found in the cache (98%).
3324522  Requested pages not found in the cache.
25800    Pages created in the cache.
3324522  Pages read into the cache.
1893069  Pages written from the cache to the backing file.
3333040  Clean pages forced from the cache.
832      Dirty pages forced from the cache.
0        Dirty pages written by trickle-sync thread.
16450    Current total page count.
16303    Current clean page count.
147      Current dirty page count.
8191     Number of hash buckets used for page location.
169M     Total number of times hash chains searched for a page.
14       The longest hash chain searched for a page.
383M     Total number of hash buckets examined for page location.
479M     The number of hash bucket locks granted without waiting.
4563     The number of hash bucket locks granted after waiting.
62       The maximum number of times any hash bucket lock was waited for.
12M      The number of region locks granted without waiting.
11491    The number of region locks granted after waiting.
3350398  The number of page allocations.
6704823  The number of hash buckets examined during allocations
4161     The max number of hash buckets examined for an allocation
3333872  The number of pages examined during allocations
2064     The max number of pages examined for an allocation
Also, the master LDAP slapd.conf has these cache-related lines:
lastmod on
checkpoint 128 15
cachesize 5000
Now, live slapd consumes about 800MB of memory when it is running, which is probably acceptable as the server is dedicated to LDAP master. (Has 2GB RAM).
--On May 13, 2010 2:23:33 PM +0900 Jorgen Lundman lundman@lundman.net wrote:
dbconfig set_lk_detect DB_LOCK_DEFAULT
dbconfig set_lg_max 52428800
dbconfig set_cachesize 0 67108864 1
dbconfig set_flags db_log_autoremove
dbconfig set_lk_max_objects 1500
dbconfig set_lk_max_locks 1500
dbconfig set_lk_max_lockers 1500
You don't provide relevant data on the size of your database to make any comment one way or the other on whether your "set_cachesize" value is correct.
I would suggest you provide the output of du -c -h *.bdb in your OpenLDAP 2.3 database directory.
Coming right up! I appreciate your time in doing some sanity checks for me.
2.8M  DNSData.bdb
3.9M  DNSHostName.bdb
2.4M  DNSIPAddr.bdb
1.4M  DNSType.bdb
1.5M  accountStatus.bdb
258K  amavisSpamLover.bdb
642K  deliveryMode.bdb
1.2G  dn2id.bdb
62M   entryCSN.bdb
84M   entryUUID.bdb
13M   gecos.bdb
898K  gidNumber.bdb
2.5G  id2entry.bdb
9.0M  mail.bdb
7.6M  mailAlternateAddress.bdb
5.6M  o.bdb
3.9M  objectClass.bdb
1.0M  radiusGroupName.bdb
14M   uid.bdb
12M   uidNumber.bdb
(no -c in Solaris; the whole dir is:)
9.0G  .
Now, live slapd consumes about 800MB of memory when it is running, which is probably acceptable as the server is dedicated to LDAP master. (Has 2GB RAM).
Your server is not powerful enough to run your database optimally. Your database is approximately 4GB in size (dn2id.bdb + id2entry.bdb + index databases). For an optimal slapadd run, your BDB cache (set_cachesize in DB_CONFIG) must be able to hold this entire dataset in memory, so you have only half the memory needed for an optimal slapadd load. This is why your slapadd takes 15+ hours to run. For the database itself to run optimally under slapd, the cache needs to be able to hold the sum of dn2id.bdb and id2entry.bdb, or in your case 1.2GB + 2.5GB = 3.7GB. So again, you do not have enough memory. To make matters worse, your DB_CONFIG only allocates 64MB to slapd, which, quite frankly, is an insanely small number for a database the size you are operating.
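As a sketch only (the exact numbers depend on your final database size and on the box actually having the RAM), a DB_CONFIG sized for this dataset would look more like:

# ~4GB BDB cache: set_cachesize <gbytes> <bytes> <ncache>
set_cachesize 4 0 2
# optionally back the cache with SysV shared memory (the key value is arbitrary)
set_shm_key 42
set_lg_max 52428800
set_lk_max_objects 1500
set_lk_max_locks 1500
set_lk_max_lockers 1500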
One other tuning parameter that will help slapadd, which I failed to mention, is the tool-threads setting. It should be set to the number of real cores the box has. However, this setting will not help in your resource-constrained environment.
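For example, on a 4-core box (the core count here is just an illustration) you would add to slapd.conf:

# one indexing thread per real core for slapadd/slapindex
tool-threads 4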
The first thing you need to do is get new hardware or increase the available RAM in the system you have. Since the running slapd is going to take the BDB cachesize + thread overhead + slapd cache sizes in memory, you probably need closer to 8GB of RAM to run your LDAP service.
--Quanah