Hi folks,
In 25 years of using UNIX I've seen my share of baffling problems but this is right up there with the best of them.
We run slapd (2.1.22 - they are under Configuration Management freezes for Operations) on 2 Solaris 9 systems, master & replicant, and things had been running just fine. After experiencing some NFS issues with our NetApp, we made the decision to copy the file systems that the LDAP servers (also our NIS servers) depend on to local disk on those two systems.
A few days ago my officemate noticed the replicant "slapd" wasn't accepting connections. I looked and it showed the ldap port 389 as being in BOUND state, not LISTEN.
The master was still LISTENing but queries to it would return results and then hang. "lsof" of that server showed that these connections ended up being stuck in CLOSE_WAIT state.
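For reference, this is roughly how we've been checking the socket states; the netstat flags below are the Solaris ones, and port 389 / the slapd process are just our setup:

```shell
# State of the LDAP listener: a healthy slapd shows "*.389 ... LISTEN",
# the wedged replica showed BOUND instead (Solaris netstat flags).
netstat -an -f inet | egrep '\.389[^0-9]' || echo "no socket on port 389"

# lsof maps the half-closed client connections back to slapd:
lsof -nP -iTCP:389 | grep CLOSE_WAIT || echo "no CLOSE_WAIT sockets"
```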
At some point both machines were rebooted, and now each one has the same problem - "slapd" isn't coming up all the way. It gets as far (running in debug -1 mode) as this:
slapd startup: initiated.
bdb_db_open: dc=my,dc=top,dc=level,dc=domain
bdb_db_open: dbenv_open(/opt/openldap/var/openldap-data)
If I run it under truss, the last thing I see before it appears to hang is it opening one of the .bdb files:
1333: open64("/opt/openldap/var/openldap-data/dn2id.bdb", O_RDWR) = 11
It then does a write() and a few brk()'s and then hangs in an LWP.
If I ^\ it, it dumps core and the traceback is
ldap1:1:195 [/usr/local/src/networking/OpenLDAP/openldap-2.1.22] # adb /opt/openldap/libexec/slapd core
core file = core -- program ``/opt/openldap/libexec/slapd'' on platform SUNW,Ultra-60
SIGQUIT: Quit
$C
ffbff430 libc.so.1`___lwp_cond_wait+8(fea38cd0, fea38cb8, 0, 9c1ac, b97ac, 13e6b0)
ffbff490 libdb-4.1.so`__db_pthread_mutex_lock+0xac(13c838, fea38cb8, 1, fea38cd0, ff2cb824, 0)
ffbff4f0 libdb-4.1.so`__lock_get_internal+0xbf4(1397e8, 0, 0, 0, 1, 1397e8)
[...]
ffbff7e8 bdb_last_id+0x138(0, 0, 87100, 103178, 4, 0)
ffbff890 bdb_db_open+0x384(0, 87000, 58000, 9e510, 41, 9e534)
ffbff908 backend_startup+0x25c(9f000, 1, 800, 0, 0, f57c8)
ffbff978 main+0x6a4(9, ffbffa54, 1, 76388, 88, 9d000)
ffbff9f0 _start+0x5c(0, 0, 0, 0, 0, 0)
It never gets as far as seeing 'slapd starting' so the listeners never put the socket(s) into LISTEN state. It looks to me like the "slapd" startup is stuck down in the BerkeleyDB (4.1.25) code waiting on some kind of a pthreads mutex lock. I've stared at the relevant routine (mutex/mut_pthread.c) but have no clue what the holdup is. (Though I note ominously that it contains several "#ifdef HAVE_MUTEX_SOLARIS_LWP" clauses.)
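In case it helps anyone else digging into this, the stuck LWP can also be looked at without killing the process; pstack and truss -p are stock Solaris, and 1333 is the slapd pid from the truss run above:

```shell
# pstack prints each LWP's user-level stack without sending a signal,
# so there's no need to ^\ the process and pick through a core file:
pstack 1333 | grep 'lwp_cond_wait' || echo "no LWP in lwp_cond_wait"

# truss attached to the live process shows which syscall it's parked in;
# expect lwp_cond_wait()/lwp_mutex_lock() if it's the same mutex hang:
truss -p 1333 2>/dev/null || true
```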
I have no idea what changed to cause this or why it's waiting on that lock, much less what to do to fix it.
Any ideas??? Should I try rebuilding BerkeleyDB 4.1.25 with a different mutex that doesn't use Solaris LWP's?
Thanks in advance,
- Greg
Greg Earle wrote:
> Hi folks,
Hi Greg
> In 25 years of using UNIX I've seen my share of baffling problems but this is right up there with the best of them.
> We run slapd (2.1.22 - they are under Configuration Management freezes for Operations) on 2 Solaris 9 systems, master & replicant, and things had been running just fine. After experiencing some NFS issues with our NetApp, we made the decision to copy the file systems that the LDAP servers (also our NIS servers) depend on to local disk on those two systems.
The BerkeleyDB docs specifically state that BDB only works with local filesystems...
> A few days ago my officemate noticed the replicant "slapd" wasn't accepting connections. I looked and it showed the ldap port 389 as being in BOUND state, not LISTEN.
> The master was still LISTENing but queries to it would return results and then hang. "lsof" of that server showed that these connections ended up being stuck in CLOSE_WAIT state.
> At some point both machines were rebooted, and now each one has the same problem - "slapd" isn't coming up all the way. It gets as far (running in debug -1 mode) as this:
You didn't allow slapd to shut down cleanly and clean up its BerkeleyDB environment. BerkeleyDB locks are persisted in the filesystem, so you need to run db_recover to clear them before starting slapd again.
Your OpenLDAP release is ancient, and your BerkeleyDB release is ancient. There are many, many known bugs in both. In the current OpenLDAP releases we detect unclean shutdowns and recover automatically, among other things.
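A rough sketch of the recovery sequence (the environment directory is the one from your dbenv_open output; where db_recover lives depends on how your BDB 4.1.25 was installed):

```shell
# Stop slapd (and slurpd) first so nothing is attached to the environment,
# and back the directory up - recovery rewrites the __db.* region files.
DB_HOME=/opt/openldap/var/openldap-data    # from the dbenv_open() output

# -h names the environment home, -v is verbose; be sure to use the
# db_recover built against BDB 4.1.25, not some other copy on the box.
command -v db_recover >/dev/null && db_recover -v -h "$DB_HOME" \
    || echo "db_recover not on PATH"

# Only then restart slapd.
```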
--On Saturday, May 19, 2007 12:55 PM -0700 Greg Earle earle@isolar.dyndns.org wrote:
> Hi folks,
> Any ideas??? Should I try rebuilding BerkeleyDB 4.1.25 with a different mutex that doesn't use Solaris LWP's?
Since you are running such an ancient version of OpenLDAP, it doesn't perform automatic DB recovery after an unclean shutdown (which is what happened here). So you need to run the BDB 4.1.25 db_recover *before* starting slapd. And you should seriously look at upgrading to a modern, supported release (like 2.3.25 with BDB 4.2.52 + patches).
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration