--On Wednesday, February 15, 2017 6:36 PM -0800 "Paul B. Henson" henson@acm.org wrote:
On Wed, Feb 15, 2017 at 12:22:29PM -0800, Quanah Gibson-Mount wrote:
I would suggest filing an ITS with the full backtrace info, so I can track it.
Ok, will do.
It could be useful to have the entry data from the accesslog as well for the failed replication op, as we can see the failed entry DN in the output of your backtrace.
That would be in the accesslog on the server that crashed? Hmm, the server that crashed is the master, and all updates were going to it. Am I confused, or did the update that caused the crash come in via syncrepl though, and hence originate from a different server? So the accesslog entry you want would be from that server, not the server that crashed? But given no other servers should have been receiving updates, how would an update have been received via replication? Or is this another issue like the memberOf problem where updates are being improperly replicated?
It appears to be crashing while writing the change to the accesslog database. It's odd that the value for the attribute is NULL. Do we know for sure what the client doing the write op to the server is sending?
Hmm, looking at the logs that correspond with one of the crashes:
This operation appears to succeed? Then there's this:
Feb 14 04:00:13 fosse slapd[12524]: conn=37859 op=806 MOD dn="uid=vntruong,ou=user,dc=csupomona,dc=edu" Feb 14 04:00:13 fosse slapd[12524]: conn=37859 op=806 MOD attr=csupomonaEduPersonExpiration
Yeah, so this is the operation that actually failed... It'd be interesting to know if it succeeded in the primary DB, but failed when writing to the accesslog DB (I.e., the master and its consumers are now out of sync for that entry), or if the entire write op failed (master and consumers are in sync for the entry)
when I restarted the server. I guess I am confused; the entryCSN has serverID 0, the ID of this server, so this isn't a replicated op, it's an op from this server. So why does the backtrace show the change coming in via syncrepl? It seems like it's getting applied twice. The change is deleting the attribute, so the second time it's getting applied you would get a no such attribute error...
Hm, so I guess my question would be is it doing the op like this:
dn: ... changetype: modify replace: csupomonaEduPersonExpiration csupomonaEduPersonExpiration:
Or is it doing it like this:
dn: ... changetype: modify delete: csupomonaEduPersonExpiration
Because the NULL value seems to imply the former.
--Quanah
--
Quanah Gibson-Mount Product Architect Symas Corporation Packaged, certified, and supported LDAP solutions powered by OpenLDAP: http://www.symas.com
On Thu, Feb 16, 2017 at 03:53:40PM -0800, Quanah Gibson-Mount wrote:
It appears to be crashing while writing the change to the accesslog database. It's odd that the value for the attribute is NULL. Do we know for sure what the client doing the write op to the server is sending?
The code is in perl, and looks like this:
$entry->replace(eduPersonAffiliation => @eduPersonAffiliation);
In this case, the array @eduPersonAffiliation is empty, effectively deleting the attribute. I'm not 100% sure what this generates on the wire, I'd have to debug it. I can say it's the same code that's been running for a decade or so with no issues.
Yeah, so this is the operation that actually failed... It'd be interesting to know if it succeeded in the primary DB, but failed when writing to the accesslog DB
I already reran that operation to fix the expiration, but the next time it crashes I'll take a look at the primary and secondaries first to see if they're out of sync.
Hm, so I guess my question would be is it doing the op like this:
Hmm, the expiration code when removing an expiration looks like:
$entry->delete('cppEduPersonExpiration');
So it should be a delete on the wire for removing the attribute. You think it crashed on the expiration operating, even though the backtrace shows the segfault having the eduPersonAffiliation accesslog reqStart id?
dn: ... changetype: modify replace: csupomonaEduPersonExpiration csupomonaEduPersonExpiration:
Or is it doing it like this:
dn: ... changetype: modify delete: csupomonaEduPersonExpiration
Because the NULL value seems to imply the former.
Years ago, I had code that did that and then I got a version of Net::LDAP that sent null attributes across the wire. Even back then OpenLDAP was upset by such behavior. I quickly updated my code to check if (in your case) @eduPersonAffiliation had zero entries and issue a proper $entry->delete('eduPersonAffiliation') call in that case. I've not had any issues since. Perhaps, your Net::LDAP module version has changed and it is sending what is being logged across the wire instead of the delete you are expecting.
On 2/17/17, 17:44, "openldap-technical on behalf of Paul B. Henson" <openldap-technical-bounces@openldap.org on behalf of henson@acm.org> wrote:
The code is in perl, and looks like this:
$entry->replace(eduPersonAffiliation => @eduPersonAffiliation);
On Tue, Feb 21, 2017 at 01:04:33PM +0000, Frank Swasey wrote:
case. I've not had any issues since. Perhaps, your Net::LDAP module version has changed and it is sending what is being logged across the wire instead of the delete you are expecting.
Hmm, I don't believe we've updated Net::LDAP recently, the only change since the crashes started was the openldap update. It probably would be cleaner to do an explicit delete rather than a replace with an empty array, but it's always done the right thing in the past so it's never come up. I'll add that to my todo list :), thanks...
On Thu, Feb 16, 2017 at 03:53:40PM -0800, Quanah Gibson-Mount wrote:
Yeah, so this is the operation that actually failed... It'd be interesting to know if it succeeded in the primary DB, but failed when writing to the accesslog DB (I.e., the master and its consumers are now out of sync for that entry), or if the entire write op failed (master and consumers are in sync for the entry)
I had another crash today :(. The logs leading up to the crash were:
Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145931 MOD dn="uid=mlmungia,ou=user,dc=cpp,dc=edu" Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145931 MOD attr=eduPersonAffiliation eduPersonPrimaryAffi liation cppEduPersonAffiliation Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145931 RESULT tag=103 err=0 text= Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145932 MOD dn="uid=mlmungia,ou=user,dc=cpp,dc=edu" Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145932 MOD attr=cppEduPersonExpiration Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145932 RESULT tag=103 err=0 text= Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145933 MOD dn="uid=mlmungia,ou=user,dc=cpp,dc=edu" Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145933 MOD attr=cppEduPersonStatusFlag Mar 4 03:53:52 themis slapd[27421]: conn=1668323 op=145933 RESULT tag=103 err=0 text=
Interesting, this time there was no failed operation on the client side. All three of these opereration succeeded, then the server died and the load balancer cut over to the secondary. The client got some errors while it was trying to initiate the next operation it wanted to do, but it retried a couple of times and successfully performed it on the secondary once the load balancer cut over.
When I brought the primary back online, all three of these operations were reflected in its local database and had been replicated to the other servers.
The backtrace was the same:
Program terminated with signal SIGSEGV, Segmentation fault. #0 0x0000000000485c61 in modify_add_values (e=e@entry=0x7f417d5b8480, mod=mod@entry=0x7f4170000980, permissive=0, text=text@entry=0x7f417d5b89a0, textbuf=textbuf@entry=0x7f417d5b84d0 "modify/delete: eduPersonAffiliation: no such attribute", textlen=textlen@entry=256) at /var/lib/portage/tmp/portage/net-nds/openldap-2.4.44-r1/work/openldap-2.4.44/servers/slapd/mods.c:61 61 if ( !BER_BVISNULL( &mod->sm_values[mod->sm_numvals] )) {
It seems like this has something to do with duplicated operations? The first time this operation was applied, the eduPersonAffiliation attribute did exist, with a single value that was removed. Then for some reason it's like it tried to remove it again and went boom? But there was no operation that *actually* failed. Everything that the client tried to do was done.
BTW, I opened the ITS for tracking, it's #8609. Thanks...
openldap-technical@openldap.org