Hi folk,
First off, let me say that per our last conversation about this, I have not yet rebuilt cyrus-sasl/openldap against a different Kerberos distribution. (I was going to build against 1.5; right now I'm at 1.2.8. We tend to steer clear of Heimdal.) Anyway, on April 28th, at 12:05 AM, all three of our slave servers' slapds died, all for apparently different reasons:
First slave:
#0 0xfed592f0 in __sigprocmask () from /usr/lib/libthread.so.1
#1 0xfed4e59c in _resetsig () from /usr/lib/libthread.so.1
#2 0xfed4dd3c in _sigon () from /usr/lib/libthread.so.1
#3 0xfed50d98 in _thrp_kill () from /usr/lib/libthread.so.1
#4 0xfedcbdc4 in raise () from /usr/lib/libc.so.1
#5 0xfedb5a5c in abort () from /usr/lib/libc.so.1
#6 0xfedb5d00 in _assert () from /usr/lib/libc.so.1
#7 0xff3188b8 in ber_sockbuf_ctrl (sb=0x18d53b38, opt=1, arg=0xd7c00e6c) at sockbuf.c:89
#8 0x0003071c in connection_get (s=111) at connection.c:312
#9 0x00032ef8 in connection_read (s=111) at connection.c:1307
#10 0x0002f84c in slapd_daemon_task (ptr=0x14d218) at daemon.c:2353
#11 0xfed5b124 in _thread_start () from /usr/lib/libthread.so.1
#12 0xfed5b124 in _thread_start () from /usr/lib/libthread.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Second slave:
#0 0xfee12d38 in fseek () from /usr/lib/libc.so.1
#1 0xfe343680 in krb5_ktfileint_internal_read_entry () from /local/kerberos/lib/libkrb5.so.3
#2 0xfe343ec8 in krb5_ktfileint_read_entry () from /local/kerberos/lib/libkrb5.so.3
#3 0xfe342660 in krb5_ktfile_get_entry () from /local/kerberos/lib/libkrb5.so.3
#4 0xfe35bc44 in krb5_rd_req_decrypt_tkt_part () from /local/kerberos/lib/libkrb5.so.3
#5 0xfe35bdcc in krb5_rd_req_decoded_opt () from /local/kerberos/lib/libkrb5.so.3
#6 0xfe35c594 in krb5_rd_req_decoded () from /local/kerberos/lib/libkrb5.so.3
#7 0xfe35bb10 in krb5_rd_req () from /local/kerberos/lib/libkrb5.so.3
#8 0xfecd81ec in krb5_gss_accept_sec_context () from /local/kerberos/lib/libgssapi_krb5.so.2
#9 0xfece12c4 in gss_accept_sec_context () from /local/kerberos/lib/libgssapi_krb5.so.2
#10 0xfed02410 in gssapi_server_mech_step () from /local/lib/sasl2/libgssapiv2.so.2
#11 0xff1d95c0 in sasl_server_step () from /local/lib/libsasl2.so.2
#12 0xff1d92b4 in sasl_server_start () from /local/lib/libsasl2.so.2
#13 0x00074998 in slap_sasl_bind (op=0x516fe80, rs=0xd4c01af0) at sasl.c:1393
#14 0x0004c4d4 in fe_op_bind (op=0x516fe80, rs=0xd4c01af0) at bind.c:276
#15 0x0004bddc in do_bind (op=0x516fe80, rs=0xd4c01af0) at bind.c:200
#16 0x00032afc in connection_operation (ctx=0x170948, arg_v=0x516fe80) at connection.c:1132
#17 0xff33cbb4 in ldap_int_thread_pool_wrapper (xpool=0x181b08) at tpool.c:478
#18 0xfed5b124 in _thread_start () from /usr/lib/libthread.so.1
#19 0xfed5b124 in _thread_start () from /usr/lib/libthread.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Third slave:
#0 0xfedc1fe0 in _malloc_unlocked () from /usr/lib/libc.so.1
#1 0xfedc1db0 in malloc () from /usr/lib/libc.so.1
#2 0xff039c48 in default_malloc_ex () from /local/lib/libcrypto.so.0.9.7
#3 0xff03a638 in CRYPTO_malloc () from /local/lib/libcrypto.so.0.9.7
#4 0xff0a52bc in EVP_DigestInit_ex () from /local/lib/libcrypto.so.0.9.7
#5 0xff04cb90 in HMAC_Init_ex () from /local/lib/libcrypto.so.0.9.7
#6 0xff18c1dc in tls1_mac () from /local/lib/libssl.so.0.9.7
#7 0xff1877a8 in ssl3_read_bytes () from /local/lib/libssl.so.0.9.7
#8 0xff185608 in ssl3_read () from /local/lib/libssl.so.0.9.7
#9 0xff18d368 in SSL_read () from /local/lib/libssl.so.0.9.7
#10 0xff35e7f4 in sb_tls_read (sbiod=0xff8b7b0, buf=0x175d0d80, len=4) at tls.c:589
#11 0xff319520 in sb_debug_read (sbiod=0xd02fe00, buf=0x175d0d80, len=4) at sockbuf.c:795
#12 0xff3431e8 in sb_sasl_read (sbiod=0xfc0cd08, buf=0xccb14b3, len=8) at cyrus.c:270
#13 0xff319520 in sb_debug_read (sbiod=0xc4f0e50, buf=0xccb14b3, len=8) at sockbuf.c:795
#14 0xff31866c in ber_int_sb_read (sb=0xe03bc80, buf=0xccb14b3, len=8) at sockbuf.c:409
#15 0xff3161e8 in ber_get_next (sb=0xe03bc80, len=0xd7c00ef4, ber=0xccb14a8) at io.c:504
#16 0x000330a8 in connection_read (s=71) at connection.c:1527
#17 0x0002f84c in slapd_daemon_task (ptr=0x14d218) at daemon.c:2353
#18 0xfed5b124 in _thread_start () from /usr/lib/libthread.so.1
#19 0xfed5b124 in _thread_start () from /usr/lib/libthread.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Now, the second one is most likely the "recompile it against a different Kerberos" issue. However, the last seems to be directly SSL related and the first ... I have no idea about the first. We have openldap built against openssl 0.9.7i.
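The triage above (second trace points at Kerberos, third at SSL) amounts to finding the first frame in each backtrace that is not in a system library. Here is a minimal Python sketch of that idea. The function name `first_interesting_frame` and the choice of which libraries count as "system" are assumptions made for illustration, and the regular expression only targets gdb frames shaped like the ones in this message.

```python
import re

# One gdb frame: "#N 0xADDR in FUNC (ARGS) from LIB" or "... at FILE:LINE".
FRAME_RE = re.compile(
    r"#(\d+)\s+0x[0-9a-f]+ in (\S+) \(.*?\)\s+(?:from (\S+)|at (\S+))"
)

# Assumption: frames in these libraries show how the process died, not why.
SYSTEM_LIBS = ("/usr/lib/libc.so", "/usr/lib/libthread.so")


def first_interesting_frame(backtrace):
    """Return (frame number, function, library or source location) for the
    first frame outside SYSTEM_LIBS, or None if every frame is a system one."""
    for match in FRAME_RE.finditer(backtrace):
        number, func, lib, src = match.groups()
        origin = lib or src
        if not origin.startswith(SYSTEM_LIBS):
            return int(number), func, origin
    return None


if __name__ == "__main__":
    # Top two frames of the second slave's backtrace, as pasted above.
    sample = (
        "#0 0xfee12d38 in fseek () from /usr/lib/libc.so.1 "
        "#1 0xfe343680 in krb5_ktfileint_internal_read_entry () "
        "from /local/kerberos/lib/libkrb5.so.3"
    )
    print(first_interesting_frame(sample))
```

Treat the result as a hint rather than a diagnosis: a crash inside malloc(), as in the third trace, can be the late symptom of heap corruption that happened somewhere else entirely.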
So you might be asking yourself, what occurs at 12:05 AM in our world? Nothing in particular as far as I can tell from the logs, which is mighty odd. Also, this never happens with the master server (but then again, it's not getting direct queries). The last request in the logs seems to have been a fairly simple lookup (I X'd out the uid):
Apr 28 00:00:05 uni01ds.unity.ncsu.edu slapd[1058]: [ID 469902 local4.debug] conn=2189346 op=115 SRCH base="ou=accounts,dc=ncsu,dc=edu" scope=2 deref=0 filter="(&(|(objectClass=posixAccount)(objectClass=inetOrgPerson)(objectClass=shadowAccount))(|(uidNumber=XXXXX)))"
Apr 28 00:00:05 uni01ds.unity.ncsu.edu slapd[1058]: [ID 744844 local4.debug] conn=2189346 op=115 SRCH attr=uid cn cn homeDirectory loginShell uidNumber
Looks like they all actually crashed right after midnight and were discovered and fixed 5 minutes later.
Any suggestions? =( I know many of you are running this under Solaris. Anyone had any particular problems doing so?
Daniel
--On April 30, 2007 9:58:07 AM -0400 Daniel Henninger <daniel@ncsu.edu> wrote:
> Hi folk,
> First off, let me say that per our last conversation about this, I have not yet rebuilt cyrus-sasl/openldap against a different Kerberos distribution. (I was going to build against 1.5; right now I'm at 1.2.8. We tend to steer clear of Heimdal.) Anyway, on April 28th, at 12:05 AM, all three of our slave servers' slapds died, all for apparently different reasons:
Why do you "steer clear" of Heimdal for linking the server libraries against? In any case, MIT Krb5 1.2 is known to not be thread safe.
> Now, the second one is most likely the "recompile it against a different Kerberos" issue. However, the last seems to be directly SSL related and the first ... I have no idea about the first. We have openldap built against openssl 0.9.7i.
Never had an issue with OpenSSL like you are seeing.
> Any suggestions? =( I know many of you are running this under Solaris. Anyone had any particular problems doing so?
Never saw any.
--Quanah
-- Quanah Gibson-Mount Principal Software Engineer Zimbra, Inc -------------------- Zimbra :: the leader in open source messaging and collaboration
> --On April 30, 2007 9:58:07 AM -0400 Daniel Henninger <daniel@ncsu.edu> wrote:
>> Hi folk,
>> First off, let me say that per our last conversation about this, I have not yet rebuilt cyrus-sasl/openldap against a different Kerberos distribution. (I was going to build against 1.5; right now I'm at 1.2.8. We tend to steer clear of Heimdal.) Anyway, on April 28th, at 12:05 AM, all three of our slave servers' slapds died, all for apparently different reasons:
> Why do you "steer clear" of Heimdal for linking the server libraries against? In any case, MIT Krb5 1.2 is known to not be thread safe.
History. In the past, when I tried to use Heimdal with something else, it caused a wealth of problems. That may not be the case now, but I don't really see the point in using multiple implementations of Kerberos if I can avoid it, so I have never gone back to reevaluate. =)
So that's what the problem is with 1.2? Not thread safe? Ok. That's good to know!
>> Now, the second one is most likely the "recompile it against a different Kerberos" issue. However, the last seems to be directly SSL related and the first ... I have no idea about the first. We have openldap built against openssl 0.9.7i.
> Never had an issue with OpenSSL like you are seeing.
Hrm. What version are you built against?
>> Any suggestions? =( I know many of you are running this under Solaris. Anyone had any particular problems doing so?
> Never saw any.
Hrm. I can't help but think I may have something misconfigured at the kernel level (/etc/system), but I don't see anything in particular that I may have done wrong there. I'll at least try to rule out the Kerberos issues soonish. It's just weird that all three croaked at the same time. =)
Thanks!
Daniel
--On May 2, 2007 4:26:19 PM -0400 Daniel Henninger <daniel@ncsu.edu> wrote:
>> --On April 30, 2007 9:58:07 AM -0400 Daniel Henninger <daniel@ncsu.edu> wrote:
>>> Hi folk,
>>> First off, let me say that per our last conversation about this, I have not yet rebuilt cyrus-sasl/openldap against a different Kerberos distribution. (I was going to build against 1.5; right now I'm at 1.2.8. We tend to steer clear of Heimdal.) Anyway, on April 28th, at 12:05 AM, all three of our slave servers' slapds died, all for apparently different reasons:
>> Why do you "steer clear" of Heimdal for linking the server libraries against? In any case, MIT Krb5 1.2 is known to not be thread safe.
> History. In the past, when I tried to use Heimdal with something else, it caused a wealth of problems. That may not be the case now, but I don't really see the point in using multiple implementations of Kerberos if I can avoid it, so I have never gone back to reevaluate. =)
I used Heimdal on my servers because MIT at the time was just completely unstable. Since then, I have continued to use it because MIT's implementation was significantly slower. Since the libraries are all it is used for, it isn't really a pain to deal with.
> So that's what the problem is with 1.2? Not thread safe? Ok. That's good to know!
Yep, and in later versions, disable the replay cache if you want to get any type of performance at all out of MIT.
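The message does not say how the replay cache gets disabled. One commonly used mechanism with MIT krb5 is the `KRB5RCACHETYPE` environment variable set to `none` in the server's environment. The Python sketch below assumes that mechanism and that the installed MIT release honors it, which is worth verifying against that release's documentation. The helper names and the slapd command line passed to them are hypothetical.

```python
import os
import subprocess


def slapd_environment(base_env=None):
    """Return a copy of the environment with the MIT krb5 replay cache
    type set to "none". The input mapping is left unmodified."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the linked MIT krb5 release reads KRB5RCACHETYPE and
    # supports the "none" replay cache type.
    env["KRB5RCACHETYPE"] = "none"
    return env


def start_slapd(argv):
    """Start slapd with the adjusted environment.

    argv is the full command line, e.g. a site-specific slapd path plus
    options (hypothetical; nothing in the thread gives the real one).
    """
    return subprocess.Popen(argv, env=slapd_environment())
```

The same effect can be had by exporting the variable in whatever init script starts slapd. Keep in mind that turning the replay cache off trades replay detection for speed, so it is a deliberate security trade-off rather than a free performance win.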
--Quanah
openldap-software@openldap.org