On Thu, Feb 16, 2017 at 03:53:40PM -0800, Quanah Gibson-Mount wrote:
Yeah, so this is the operation that actually failed... It'd be interesting to know if it succeeded in the primary DB, but failed when writing to the accesslog DB (I.e., the master and its consumers are now out of sync for that entry), or if the entire write op failed (master and consumers are in sync for the entry)
I had another crash today :(. The logs leading up to the crash were:
Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145931 MOD dn="uid=mlmungia,ou=user,dc=cpp,dc=edu" Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145931 MOD attr=eduPersonAffiliation eduPersonPrimaryAffi liation cppEduPersonAffiliation Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145931 RESULT tag=103 err=0 text= Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145932 MOD dn="uid=mlmungia,ou=user,dc=cpp,dc=edu" Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145932 MOD attr=cppEduPersonExpiration Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145932 RESULT tag=103 err=0 text= Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145933 MOD dn="uid=mlmungia,ou=user,dc=cpp,dc=edu" Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145933 MOD attr=cppEduPersonStatusFlag Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145933 RESULT tag=103 err=0 text=
Interesting, this time there was no failed operation on the client side. All three of these opereration succeeded, then the server died and the load balancer cut over to the secondary. The client got some errors while it was trying to initiate the next operation it wanted to do, but it retried a couple of times and successfully performed it on the secondary once the load balancer cut over.
When I brought the primary back online, all three of these operations were reflected in its local database and had been replicated to the other servers.
The backtrace was the same:
Program terminated with signal SIGSEGV, Segmentation fault. #0 0x0000000000485c61 in modify_add_values (e=e@entry=0x7f417d5b8480, mod=mod@entry=0x7f4170000980, permissive=0, text=text@entry=0x7f417d5b89a0, textbuf=textbuf@entry=0x7f417d5b84d0 "modify/delete: eduPersonAffiliation: no such attribute", textlen=textlen@entry=256) at /var/lib/portage/tmp/portage/net-nds/openldap-2.4.44-r1/work/openldap-2.4.44/servers/slapd/mods.c:61 61 if ( !BER_BVISNULL( &mod->sm_values[mod->sm_numvals] )) {
It seems like this has something to do with duplicated operations? The first time this operation was applied, the eduPersonAffiliation attribute did exist, with a single value that was removed. Then for some reason it's like it tried to remove it again and went boom? But there was no operation that *actually* failed. Everything that the client tried to do was done.
BTW, I opened the ITS for tracking, it's #8609. Thanks...