rein@basefarm.no wrote:
We have seen deadlock/hang situations on our master server today, which look like a deadlock caused by the si_csn_rwlock lock being held while syncprov_checkpoint() is running. The patch at the end should (I hope) fix this.
Hm, it looks as if I was a bit too quick here; it just hung again :-(. And this time without anyone competing for this lock, only the lock on the glue suffix entry. I wonder if the upgrade from db 4.6.18 to 4.6.21.1 that I did yesterday is the real problem. I'll try to downgrade and see if that helps.
And it didn't :-( The problem seems to be that something read-locks our glue suffix entry and then forgets about the lock. This quickly causes the entire server to deadlock, because the write lock required to update the contextCSN in the suffix entry locks out all the readers.
So far the problem seems to be triggered by someone attempting to modify an entry in a subordinate syncrepl consumer backend, which results in a referral to the backend master. But I haven't had very much time to look into this problem yet, so I'm still on very thin ice here. I'll return with a new ITS when I have found out more.
We are currently running with a workaround that simply grants the write locks on the glue suffix entry without actually taking them. As the glue entry is the only entry in that database it should be pretty safe, and a potential corruption of the database is not any big problem.
Please put this case on hold. Sorry!
It currently looks as if this patch addresses the symptom and not the real problem, although I'm not sure what could happen if a checkpoint is triggered while the suffix entry is locked by another thread. I do believe it should be considered, if not a bug, then an enhancement, as holding locks for as short a time as possible is always a good thing. You'll have to choose whether to close this ITS or use the patch.
Rein