Buchan Milne [mailto:bgmilne@obsidian.co.za]
Well, I just ran db_archive and caused widespread chaos because most (all?) of the replicas stopped responding to queries. (I have yet to perform a post-mortem)
You ran db_archive for the first time, on *all* replicas at the same time????
Yes. We administer 80+ servers that are more or less identically configured, and typically perform small admin tasks in a "for" loop, which is what I did in this case.
I know that there's a bug in bdb 4.2 that causes logs to be held open even though they're no longer required. Upgrading bdb is not on the cards right now so I need to work around that problem by stopping and starting openldap.
This may or may not be the cause of your problems. Additionally, this is affected by your database configuration and checkpointing settings.
Copied below...
So the question I have just at the moment is, when I run db_archive, should openldap be running or not running?
It should be safe, depending on your configuration, to run db_archive. However, due to the tasks it does (just deleting all the unused log files would have a similar effect), it can be quite IO intensive, and you may incur IO starvation when doing it, impacting performance of any other application using files on the same block devices (e.g. slapd).
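One way to soften the IO burst described above is to list the unused logs first and then delete them with a pause between files, instead of removing them all in one go. The following is only a sketch under assumptions: it relies on `db_archive -a -h <home>` printing absolute paths of log files that are no longer in use, and the helper names `unused_logs` and `remove_slowly` and the one-second pause are made up for illustration, not taken from the thread.

```python
import os
import subprocess
import time


def unused_logs(db_home):
    """Ask db_archive which log files are no longer needed (absolute paths)."""
    result = subprocess.run(
        ["db_archive", "-a", "-h", db_home],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def remove_slowly(paths, pause_seconds=1.0, sleep=time.sleep):
    """Delete files one at a time, pausing between deletions.

    The pause is an arbitrary illustrative value; tune it to the disk.
    """
    removed = []
    for path in paths:
        os.remove(path)
        removed.append(path)
        sleep(pause_seconds)
    return removed
```

It would be called as `remove_slowly(unused_logs("/var/local/openldap-2.3.32/openldap-data"))`, using the database directory from the slapd.conf in this thread. Whether this actually avoids starving slapd depends on the disk and on how many logs have piled up.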
As an additional piece of background, we are currently running all the replication out of an intermediate server (it's a transitional setup). As far as I can tell, it all hit the fan when db_archive ran on that intermediate server. Obviously I should have left that one out of the list but I didn't think of it at the time.
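The lesson about leaving the intermediate server out of the list can be sketched as a small planning step before any "for" loop: split the host list into plain replicas and hosts that other replicas sync from, and handle the second group separately. This is a hypothetical helper, not something used in the thread; `rollout_order` and the replica host names are invented, and only `server.example.com` comes from the syncrepl entries quoted here.

```python
def rollout_order(hosts, providers):
    """Split hosts into (replicas, upstream) while preserving order.

    `providers` is the set of hosts that other replicas sync from;
    those are kept out of the bulk list so they can be handled by hand.
    """
    seen = set()
    replicas, upstream = [], []
    for host in hosts:
        if host in seen:
            continue  # ignore accidental duplicates in the list
        seen.add(host)
        if host in providers:
            upstream.append(host)
        else:
            replicas.append(host)
    return replicas, upstream


# Hypothetical host list; only server.example.com appears in the thread.
hosts = ["replica1.example.com", "server.example.com", "replica2.example.com"]
replicas, upstream = rollout_order(hosts, {"server.example.com"})
print(replicas)  # ['replica1.example.com', 'replica2.example.com']
print(upstream)  # ['server.example.com']
```

Running the task on one replica, checking that it still answers queries, and only then continuing would also have limited the damage to a single host.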
These are all the non-comment entries in DB_CONFIG:
set_cachesize 0 268435456 1
set_data_dir db
set_lg_regionmax 262144
set_lg_bsize 2097152
set_lg_dir logs
set_tmp_dir /tmp
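For readers who do not have the byte values memorised, here is a small sketch that turns the DB_CONFIG numbers above into readable sizes. It assumes the usual Berkeley DB meaning of `set_cachesize <gbytes> <bytes> <ncache>`; the function names are illustrative only.

```python
def parse_db_config(text):
    """Map each DB_CONFIG directive name to its list of arguments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, *args = line.split()
        settings[name] = args
    return settings


def cache_bytes(args):
    """Total cache size in bytes for set_cachesize <gbytes> <bytes> <ncache>."""
    gbytes, nbytes, _ncache = (int(a) for a in args)
    return gbytes * 1024**3 + nbytes


DB_CONFIG = """
set_cachesize 0 268435456 1
set_data_dir db
set_lg_regionmax 262144
set_lg_bsize 2097152
set_lg_dir logs
set_tmp_dir /tmp
"""

settings = parse_db_config(DB_CONFIG)
print(cache_bytes(settings["set_cachesize"]) // 2**20, "MiB cache")       # 256 MiB cache
print(int(settings["set_lg_bsize"][0]) // 2**20, "MiB log buffer")        # 2 MiB log buffer
print(int(settings["set_lg_regionmax"][0]) // 2**10, "KiB log region")    # 256 KiB log region
```

So this configuration asks for a 256 MiB cache in a single segment, a 2 MiB in-memory log buffer, and a 256 KiB log region.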
Database definition entries in slapd.conf:
database bdb
suffix "dc=example,dc=com"
rootdn "cn=root,dc=example,dc=com"
rootpw secret
directory /var/local/openldap-2.3.32/openldap-data
mode 0600
sizelimit 12000
Replication entries on most servers:
syncrepl rid=123
        provider=ldap://server.example.com
        type=refreshAndPersist
        searchbase="dc=example,dc=com"
        scope=sub
        schemachecking=off
        bindmethod=simple
        binddn="cn=root,dc=example,dc=com"
        credentials=secret
        retry=5,5,30,5,60,5,300,+
And replication entries on the intermediate host:
syncrepl rid=123
        provider=ldap://master.example.com
        type=refreshAndPersist
        interval=00:00:01:00
        searchbase="dc=example,dc=com"
        scope=sub
        schemachecking=off
        bindmethod=simple
        binddn="cn=root,dc=example,dc=com"
        credentials=secret
        retry=5,5,30,5,60,5,300,+

overlay syncprov
syncprov-checkpoint 10 5
syncprov-sessionlog 100
--On Thursday, March 01, 2007 10:48 AM +1300 Lesley Walker <lesley.walker@opus.co.nz> wrote:
I see no checkpoint directive for this database. You definitely want that.
--Quanah
--
Quanah Gibson-Mount
Principal Software Developer
ITS/Shared Application Services
Stanford University
GnuPG Public Key: http://www.stanford.edu/~quanah/pgp.html
Quanah Gibson-Mount wrote:
I see no checkpoint directive for this database. You definitely want that.
Okay. I've done some reading and now know how to specify it, but I'm open to suggestions as to how to choose the numbers.
We have just over 9000 records (increasing by a handful every day). Updates are probably no more than 100 per day but we need to make sure they replicate quickly. "people" records are about 5-6kb.
Would something like this be appropriate?

checkpoint 10 5
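As a rough aid for choosing the numbers: per slapd-bdb(5), `checkpoint <kbyte> <min>` triggers a checkpoint once either that many kilobytes have been written to the log or that many minutes have passed since the last one. The sketch below parses the proposed directive and compares it with the figures quoted above (about 100 updates a day, records of 5-6 KB). It is back-of-envelope only: it counts entry data and ignores index updates and log overhead, so the real log volume will be higher. The function names are illustrative.

```python
def parse_checkpoint(directive):
    """Return (kbyte, minutes) from a 'checkpoint <kbyte> <min>' line."""
    keyword, kbyte, minutes = directive.split()
    if keyword != "checkpoint":
        raise ValueError("not a checkpoint directive: " + directive)
    return int(kbyte), int(minutes)


def daily_record_kb(updates_per_day, record_kb):
    """Rough daily volume of entry data written, in KB."""
    return updates_per_day * record_kb


kbyte, minutes = parse_checkpoint("checkpoint 10 5")
volume = daily_record_kb(100, 6)  # 100 updates/day at roughly 6 KB each

print(kbyte, "KB or", minutes, "minutes, whichever comes first")
print("approx. entry data written per day:", volume, "KB")  # 600 KB
```

By this count a 10 KB threshold is crossed after only a couple of 6 KB updates, so on a directory this quiet the size limit, rather than the 5-minute timer, would usually be what triggers a checkpoint.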
And on the BDB version question:
The question is, what version of BDB is slapd linked to? For example, if slapd is linked against BDB 4.4, and you ran the BDB 4.2 db_archive, that's bad.
No, it was linked against 4.2.
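The version check being discussed can be sketched as a comparison of major.minor numbers taken from two version banners, for example the one printed by `db_archive -V` and whatever identifies the library slapd is linked against. How the second string is obtained is left open here, and the sample banners below are illustrative rather than copied from any of the servers in this thread.

```python
import re


def bdb_major_minor(text):
    """Extract (major, minor) from a string containing 'Berkeley DB X.Y'."""
    match = re.search(r"Berkeley DB (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no Berkeley DB version found")
    return int(match.group(1)), int(match.group(2))


def tools_match_library(tool_banner, library_banner):
    """True when the command-line tools and the library share major.minor."""
    return bdb_major_minor(tool_banner) == bdb_major_minor(library_banner)


# Illustrative banners only.
tool = "Sleepycat Software: Berkeley DB 4.2.52: (December 3, 2003)"
mismatched = "Berkeley DB 4.4.20"

print(tools_match_library(tool, tool))        # True
print(tools_match_library(tool, mismatched))  # False
```

A mismatch is the case described above as "that's bad": the tools should come from the same BDB release that slapd uses.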
Lesley W
openldap-software@openldap.org