Hi all,
For largely historical reasons we run slapd servers on most clients (this will probably change in the future - I'm just giving this information as background). We're seeing problems when some of these machines are busy, particularly, it seems, with memory-intensive activity, although that's hard to substantiate as I generally only see the machines after they've broken. It's annoying as I can't reproduce these problems.
We see quite a few problems with slapd getting into a state where it's deferring operations, for whatever reason - I think I understand these - these are when slapd basically says sorry, I'm too busy doing X, so I'll defer Y until I have time. Is this accurate?
The second case I'm also seeing is bdb complaining about locks being no longer valid, e.g.
slapd[3780]: bdb(dc=inf,dc=ed,dc=ac,dc=uk): DB_LOCK->lock_put: Lock is no longer valid
slapd seems to keep going for the time being until getting into a state where it defers all binding operations and goes into some kind of spin where it sits at 99% cpu and has to be killed with a -9.
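Since the machines tend to be looked at only after slapd has already wedged, one option is to scan syslog for the Berkeley DB message as it appears. A minimal Python sketch, assuming log lines shaped like the single example above; the regular expression and the function name `parse_bdb_error` are illustrative and not part of OpenLDAP:

```python
import re

# Pattern modelled on the one example line quoted above. search() is used
# so that a syslog timestamp/hostname prefix in front does not matter.
BDB_ERROR = re.compile(
    r"slapd\[(?P<pid>\d+)\]: bdb\((?P<suffix>[^)]*)\): (?P<message>.*)"
)


def parse_bdb_error(line):
    """Return (pid, suffix, message) for a bdb error line, else None."""
    match = BDB_ERROR.search(line)
    if match is None:
        return None
    return int(match.group("pid")), match.group("suffix"), match.group("message")


sample = ("slapd[3780]: bdb(dc=inf,dc=ed,dc=ac,dc=uk): "
          "DB_LOCK->lock_put: Lock is no longer valid")
parsed = parse_bdb_error(sample)
print(parsed)
```

Run against each new syslog line, something like this could flag a host at the first "Lock is no longer valid" rather than once slapd is already spinning.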
I suppose I have a couple of questions about the "Lock is no longer valid" error....
- What causes it?
- Is it something I can prevent by configuration changes (for instance, would increasing the numbers of locks, lockers and objects help?)
We're running OpenLDAP 2.3.35 with the ITS#4924 and ITS#4925 patches, with a bdb backend on Berkeley DB 4.2.52 with all 6 recommended patches.
The only DBCONFIG settings we currently have are:
dbconfig set_cachesize 0 67108864 1
dbconfig set_lg_regionmax 262144
dbconfig set_lg_bsize 2097152
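For readers less familiar with these directives, the values are byte counts. A short Python sketch of the arithmetic, relying on Berkeley DB's `set_cachesize <gbytes> <bytes> <ncache>` argument order; the helper name `cachesize_bytes` is made up for this illustration:

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
KIB = 1024


def cachesize_bytes(gbytes, nbytes):
    """Total cache size requested by set_cachesize <gbytes> <bytes> <ncache>."""
    return gbytes * GIB + nbytes


cache = cachesize_bytes(0, 67108864)   # set_cachesize 0 67108864 1
lg_regionmax = 262144                  # set_lg_regionmax 262144
lg_bsize = 2097152                     # set_lg_bsize 2097152

print(cache // MIB, "MiB cache, kept as a single cache (ncache = 1)")
print(lg_regionmax // KIB, "KiB log region")
print(lg_bsize // MIB, "MiB in-memory log buffer")
```

In other words, the settings ask for a 64 MiB cache, a 256 KiB log region and a 2 MiB log buffer.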
Thanks in advance
Toby Blake
School of Informatics
University of Edinburgh
<quote who="Toby Blake">
Hi all,
Hi Toby.
For largely historical reasons we run slapd servers on most clients (this will probably change in the future - I'm just giving this information as background).
Why?
We're seeing problems when some of these machines are busy, particularly, it seems, with memory intensive activity, although it's hard to substantiate as I generally only see the machines after they've broken. It's annoying as I can't reproduce these problems.
It's going to be hard to pinpoint then ;-) How much memory/CPU etc. do these clients have and what other services do they provide?
We see quite a few problems with slapd getting into a state where it's deferring operations, for whatever reason - I think I understand these - these are when slapd basically says sorry, I'm too busy doing X, so I'll defer Y until I have time. Is this accurate?
Yes. What kind of clients are searching/binding to them? Local?
The second case I'm also seeing is bdb complaining about locks being no longer valid, e.g.
slapd[3780]: bdb(dc=inf,dc=ed,dc=ac,dc=uk): DB_LOCK->lock_put: Lock is no longer valid
slapd seems to keep going for the time being until getting into a state where it defers all binding operations and goes into some kind of spin where it sits at 99% cpu and has to be killed with a -9.
Is everything local? Nothing mounted locally, like NFS, for the directory data?
I suppose I have a couple of questions about the "Lock is no longer valid" error....
- What causes it?
- Is it something I can prevent by configuration changes (for instance, would increasing the numbers of locks, lockers and objects help?)
One for the dev team. I do know this is an error message from Berkeley DB by grepping the source.
We're running openldap 2.3.35 with ITS#4924 and ITS#4925 patches with a bdb backend running 4.2.52 with all 6 recommended patches.
I hope you mean 5, as there are only 5 listed on the Oracle site.
The only DBCONFIG settings we currently have are:
dbconfig set_cachesize 0 67108864 1
dbconfig set_lg_regionmax 262144
dbconfig set_lg_bsize 2097152
I take it dbconfig is a keyword you've added for this example, as it's not valid.
Thanks in advance
Toby Blake
School of Informatics
University of Edinburgh
--On Wednesday, July 04, 2007 8:40 PM +0100 Gavin Henry ghenry@suretecsystems.com wrote:
We're running openldap 2.3.35 with ITS#4924 and ITS#4925 patches with a bdb backend running 4.2.52 with all 6 recommended patches.
I hope you mean 5, as there are only 5 listed on the Oracle site.
There are 6 recommended patches to BDB 4.2.52, 5 of which come from the Oracle site.
--Quanah
--
Quanah Gibson-Mount
Principal Software Engineer
Zimbra, Inc
--------------------
Zimbra :: the leader in open source messaging and collaboration
Hi there,
Firstly, many thanks for the replies...
Hi Toby.
For largely historical reasons we run slapd servers on most clients (this will probably change in the future - I'm just giving this information as background).
Why?
Why will this change or why did we do it in the first place? I wasn't party to these decisions at the time, so I can't really comment on the reasons for them. I could speculate wildly, but I'd prefer not to.
We're seeing problems when some of these machines are busy, particularly, it seems, with memory intensive activity, although it's hard to substantiate as I generally only see the machines after they've broken. It's annoying as I can't reproduce these problems.
It's going to be hard to pinpoint then ;-) How much memory/CPU etc. do these clients have and what other services do they provide?
They're typically desktop or lab machines for academics, students, etc. Hardware-wise they're Dell desktop boxes of a few years old - a 2.4GHz processor with 512MB of memory is typical. Something I should have mentioned is that they're running Fedora Core 5, with a few running FC6.
As for what services they provide, general desktop services, but also could be running long-running or intensive jobs. Many of the machines are also in a condor pool and this does seem to cause more problems.
Do you know if slapd gets unhappy if other processes use up lots of memory? This is my current line of investigation - I'll try to make it unhappy by using increasing amounts of memory.
I suppose what I'm trying to determine is - is it the client activity that's causing problems (i.e. a misbehaving client or similar) or is it slapd itself getting unhappy for other reasons (possibly due to resources being used by other programs)? Or a combination of both?
We see quite a few problems with slapd getting into a state where it's deferring operations, for whatever reason - I think I understand these - these are when slapd basically says sorry, I'm too busy doing X, so I'll defer Y until I have time. Is this accurate?
Yes. What kind of clients are searching/binding to them? Local?
All local. As for what kind of clients - typical Linux desktop activity I suppose. Hard to be specific about this really, as it will change from host to host.
The second case I'm also seeing is bdb complaining about locks being no longer valid, e.g.
slapd[3780]: bdb(dc=inf,dc=ed,dc=ac,dc=uk): DB_LOCK->lock_put: Lock is no longer valid
slapd seems to keep going for the time being until getting into a state where it defers all binding operations and goes into some kind of spin where it sits at 99% cpu and has to be killed with a -9.
Is everything local? Nothing mounted locally, like NFS, for the directory data?
Machines will have both NFS and AFS for home directory data.
I suppose I have a couple of questions about the "Lock is no longer valid" error....
- What causes it?
- Is it something I can prevent by configuration changes (for instance, would increasing the numbers of locks, lockers and objects help?)
One for the dev team. I do know this is an error message from Berkeley DB by grepping the source.
Yes, I saw it in the source, but don't know it well enough to be sure of what's causing it.
We're running openldap 2.3.35 with ITS#4924 and ITS#4925 patches with a bdb backend running 4.2.52 with all 6 recommended patches.
I hope you mean 5, as there are only 5 listed on the Oracle site.
As Quanah said, there are 6.
The only DBCONFIG settings we currently have are:
dbconfig set_cachesize 0 67108864 1
dbconfig set_lg_regionmax 262144
dbconfig set_lg_bsize 2097152
I take it dbconfig is a keyword you've added for this example, as it's not valid.
Sorry, I should have been more specific - this is in slapd.conf - look in the man page for slapd-bdb - this is just a way of getting directives into DB_CONFIG.
Cheers Toby
<quote who="Toby Blake">
Hi there,
Firstly, many thanks for the replies...
np.
Hi Toby.
For largely historical reasons we run slapd servers on most clients (this will probably change in the future - I'm just giving this information as background).
Why?
Why will this change or why did we do it in the first place? I wasn't party to these decisions at the time, so I can't really comment on the reasons for them. I could speculate wildly, but I'd prefer not to.
Understood.
We're seeing problems when some of these machines are busy, particularly, it seems, with memory intensive activity, although it's hard to substantiate as I generally only see the machines after they've broken. It's annoying as I can't reproduce these problems.
It's going to be hard to pinpoint then ;-) How much memory/CPU etc. do these clients have and what other services do they provide?
They're typically desktop or lab machines for academics, students, etc. Hardware-wise they're Dell desktop boxes of a few years old - a 2.4GHz processor with 512MB of memory is typical. Something I should have mentioned is that they're running Fedora Core 5, with a few running FC6.
OK.
As for what services they provide, general desktop services, but also could be running long-running or intensive jobs. Many of the machines are also in a condor pool and this does seem to cause more problems.
Do you know if slapd gets unhappy if other processes use up lots of memory? This is my current line of investigation - I'll try to make it unhappy by using increasing amounts of memory.
Yes.
I suppose what I'm trying to determine is - is it the client activity that's causing problems (i.e. a misbehaving client or similar) or is it slapd itself getting unhappy for other reasons (possibly due to resources being used by other programs)? Or a combination of both?
Probably both. If a client keeps sending lots of bind/search requests at once, slapd will queue/defer them.
We see quite a few problems with slapd getting into a state where it's deferring operations, for whatever reason - I think I understand these - these are when slapd basically says sorry, I'm too busy doing X, so I'll defer Y until I have time. Is this accurate?
Yes. What kind of clients are searching/binding to them? Local?
All local. As for what kind of clients - typical Linux desktop activity I suppose. Hard to be specific about this really, as it will change from host to host.
OK.
Is this happening on all desktops then?
The second case I'm also seeing is bdb complaining about locks being no longer valid, e.g.
slapd[3780]: bdb(dc=inf,dc=ed,dc=ac,dc=uk): DB_LOCK->lock_put: Lock is no longer valid
slapd seems to keep going for the time being until getting into a state where it defers all binding operations and goes into some kind of spin where it sits at 99% cpu and has to be killed with a -9.
Is everything local? Nothing mounted locally, like NFS, for the directory data?
Machines will have both NFS and AFS for home directory data.
Not the data directory then, ok.
I suppose I have a couple of questions about the "Lock is no longer valid" error....
- What causes it?
- Is it something I can prevent by configuration changes (for instance, would increasing the numbers of locks, lockers and objects help?)
One for the dev team. I do know this is an error message from Berkeley DB by grepping the source.
Yes, I saw it in the source, but don't know it well enough to be sure of what's causing it.
Likewise.
We're running openldap 2.3.35 with ITS#4924 and ITS#4925 patches with a bdb backend running 4.2.52 with all 6 recommended patches.
I hope you mean 5, as there are only 5 listed on the Oracle site.
As Quanah said, there are 6.
The only DBCONFIG settings we currently have are:
dbconfig set_cachesize 0 67108864 1
dbconfig set_lg_regionmax 262144
dbconfig set_lg_bsize 2097152
I take it dbconfig is a keyword you've added for this example, as it's not valid.
Sorry, I should have been more specific - this is in slapd.conf - look in the man page for slapd-bdb - this is just a way of getting directives into DB_CONFIG.
Yeah, my mistake. I forgot about that way.
Cheers Toby
Hi again Gavin,
<most stuff snipped>
As for what services they provide, general desktop services, but also could be running long-running or intensive jobs. Many of the machines are also in a condor pool and this does seem to cause more problems.
Do you know if slapd gets unhappy if other processes use up lots of memory? This is my current line of investigation - I'll try to make it unhappy by using increasing amounts of memory.
Yes.
I suppose what I'm trying to determine is - is it the client activity that's causing problems (i.e. a misbehaving client or similar) or is it slapd itself getting unhappy for other reasons (possibly due to resources being used by other programs)? Or a combination of both?
Probably both. If a client keeps sending lots of bind/search requests at once, slapd will queue/defer them.
Excellent, this does look to be the case. I've just run a bit of a test by eating up all memory and swap and seeing if I could upset slapd - it seemed OK for a while, then a full search of the directory triggered off a "Lock is no longer valid" and it's now distinctly unhappy. So, a client that not only eats memory, but also uses up other resources, to the detriment of slapd, can only produce problems.
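A test along these lines can be sketched in Python. This is a guess at the general approach, not the script actually used; `eat_memory`, the chunk size and the cap are all hypothetical, and the cap is deliberately tiny here so the example is safe to run:

```python
import time


def eat_memory(chunk_mib, max_mib, pause_s=0.0):
    """Allocate memory in chunk_mib steps up to max_mib, keeping it all alive."""
    hog = []
    allocated = 0
    while allocated + chunk_mib <= max_mib:
        # bytearray(n) creates n zero bytes; holding a reference in the
        # list stops the memory from being released.
        hog.append(bytearray(chunk_mib * 1024 * 1024))
        allocated += chunk_mib
        time.sleep(pause_s)
    return hog


held = eat_memory(chunk_mib=1, max_mib=8)
print(len(held), "chunks,", sum(len(c) for c in held) // (1024 * 1024), "MiB held")
```

For a real reproduction attempt the cap would be raised towards the machine's RAM plus swap, with a full directory search run against slapd while the memory is held.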
I suppose the way forward is to migrate away from running local slapd everywhere, perhaps to a proxy-caching type of solution, but this is going to require some proper planning and investigation.
Again, thanks for your help.
Cheers Toby
<quote who="Toby Blake">
Hi again Gavin,
<most stuff snipped>
As for what services they provide, general desktop services, but also could be running long-running or intensive jobs. Many of the machines are also in a condor pool and this does seem to cause more problems.
Do you know if slapd gets unhappy if other processes use up lots of memory? This is my current line of investigation - I'll try to make it unhappy by using increasing amounts of memory.
Yes.
I suppose what I'm trying to determine is - is it the client activity that's causing problems (i.e. a misbehaving client or similar) or is it slapd itself getting unhappy for other reasons (possibly due to resources being used by other programs)? Or a combination of both?
Probably both. If a client keeps sending lots of bind/search requests at once, slapd will queue/defer them.
Excellent, this does look to be the case. I've just run a bit of a test by eating up all memory and swap and seeing if I could upset slapd - it seemed OK for a while, then a full search of the directory triggered off a "Lock is no longer valid" and it's now distinctly unhappy. So, a client that not only eats memory, but also uses up other resources, to the detriment of slapd, can only produce problems.
Agreed.
I suppose the way forward is to migrate away from running local slapd everywhere, perhaps to a proxy-caching type of solution, but this is going to require some proper planning and investigation.
Yes. Feel free to run your plans via the list.
Again, thanks for your help.
np.
Cheers Toby
On Thursday, 5 July 2007, Toby Blake wrote:
Hi again Gavin,
<most stuff snipped>
As for what services they provide, general desktop services, but also could be running long-running or intensive jobs. Many of the machines are also in a condor pool and this does seem to cause more problems.
Do you know if slapd gets unhappy if other processes use up lots of memory? This is my current line of investigation - I'll try to make it unhappy by using increasing amounts of memory.
Yes.
I suppose what I'm trying to determine is - is it the client activity that's causing problems (i.e. a misbehaving client or similar) or is it slapd itself getting unhappy for other reasons (possibly due to resources being used by other programs)? Or a combination of both?
Probably both. If a client keeps sending lots of bind/search requests at once, slapd will queue/defer them.
Excellent, this does look to be the case. I've just run a bit of a test by eating up all memory and swap and seeing if I could upset slapd - it seemed OK for a while, then a full search of the directory triggered off a "Lock is no longer valid" and it's now distinctly unhappy. So, a client that not only eats memory, but also uses up other resources, to the detriment of slapd, can only produce problems.
Yes, it is probably advisable not to use only a local daemon to provide a service required by other processes that will take away resources from the daemon they rely on ...
I suppose the way forward is to migrate away from running local slapd everywhere, perhaps to a proxy-caching type of solution, but this is going to require some proper planning and investigation.
Are you using nscd? Yes, I know the point of the local slapd is similar to that of nscd, but if you don't use nscd, each binary that does any user/group lookups will have a connection to your LDAP server. Using nscd should not require any planning/investigation, just testing.
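The reasoning behind the nscd suggestion is that one shared cache answers repeated lookups, so the directory server sees far fewer requests. A toy Python illustration of that idea follows. It is not how nscd is implemented, and `TTLCache`, `backend_lookup`, the home directory path and the 60-second lifetime are all invented for the example:

```python
import time


class TTLCache:
    """Cache lookup results for ttl seconds, counting backend calls."""

    def __init__(self, backend, ttl, clock=time.monotonic):
        self.backend = backend
        self.ttl = ttl
        self.clock = clock
        self.entries = {}       # key -> (expiry time, value)
        self.backend_calls = 0

    def get(self, key):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]       # still fresh: no backend request
        self.backend_calls += 1
        value = self.backend(key)
        self.entries[key] = (now + self.ttl, value)
        return value


def backend_lookup(user):
    # Stand-in for a passwd lookup answered by the LDAP server.
    return {"uid": user, "home": "/home/" + user}


cache = TTLCache(backend_lookup, ttl=60)
for _ in range(100):
    cache.get("toby")
print(cache.backend_calls)
```

A hundred lookups of the same user reach the stand-in backend only once while the cached entry is fresh, which is the effect Buchan is describing for processes sharing nscd.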
If nscd is unsuitable, you may also consider nss_db and nss_updatedb (this may require planning/investigation).
Oh, and your custom replication method looks a lot like syncrepl ...
Regards, Buchan
<quote who="Toby Blake">
Hi there,
Firstly, many thanks for the replies...
Hi Toby.
For largely historical reasons we run slapd servers on most clients (this will probably change in the future - I'm just giving this information as background).
Why?
Why will this change or why did we do it in the first place? I wasn't party to these decisions at the time, so I can't really comment on the reasons for them. I could speculate wildly, but I'd prefer not to.
We're seeing problems when some of these machines are busy, particularly, it seems, with memory intensive activity, although it's hard to substantiate as I generally only see the machines after they've broken. It's annoying as I can't reproduce these problems.
It's going to be hard to pinpoint then ;-) How much memory/CPU etc. do these clients have and what other services do they provide?
They're typically desktop or lab machines for academics, students, etc. Hardware-wise they're Dell desktop boxes of a few years old - a 2.4GHz processor with 512MB of memory is typical. Something I should have mentioned is that they're running Fedora Core 5, with a few running FC6.
Any chance of moving to one central Directory server on proper hardware, with proper resources?
btw, are the desktops replicas of a central master?
I hope you mean 5, as there are only 5 listed on the Oracle site.
As Quanah said, there are 6.
I still only see 5 at:
http://www.oracle.com/technology/products/berkeley-db/db/update/4.2.52/patch...
Gavin.
<snip>
They're typically desktop or lab machines for academics, students, etc. Hardware-wise they're Dell desktop boxes of a few years old - a 2.4GHz processor with 512MB of memory is typical. Something I should have mentioned is that they're running Fedora Core 5, with a few running FC6.
Any chance of moving to one central Directory server on proper hardware, with proper resources?
btw, are the desktops replicas of a central master?
Yes, the model is essentially one of a central write-master with all clients being replicas (using our own replication technology). We do want to think about moving away from this, however, possibly using local proxy-caching with a handful of servers replicating from the master with syncrepl. It's all still to be investigated properly, but there's certainly an incentive for that.
I hope you mean 5, as there are only 5 listed on the Oracle site.
As Quanah said, there are 6.
I still only see 5 at:
http://www.oracle.com/technology/products/berkeley-db/db/update/4.2.52/patch...
Sorry, I meant to include the link in the last message about this. We get these patches from Quanah's stanford page...
http://www.stanford.edu/services/directory/openldap/configuration/bdb-build-...
Cheers Toby
<quote who="Toby Blake">
<snip>
They're typically desktop or lab machines for academics, students, etc. Hardware-wise they're Dell desktop boxes of a few years old - a 2.4GHz processor with 512MB of memory is typical. Something I should have mentioned is that they're running Fedora Core 5, with a few running FC6.
Any chance of moving to one central Directory server on proper hardware, with proper resources?
btw, are the desktops replicas of a central master?
Yes, the model is essentially one of a central write-master with all clients being replicas (using our own replication technology). We do want to think about moving away from this, however, possibly using local proxy-caching with a handful of servers replicating from the master with syncrepl. It's all still to be investigated properly, but there's certainly an incentive for that.
Ok, well you should have really mentioned "using our own replication technology", and since we have no way of knowing what this is or why you are using it, we can't possibly help diagnose if this is having an effect on slapd, other than what we have already discussed.
Why not use an open/rfc'd proven replication technology?
I hope you mean 5, as there are only 5 listed on the Oracle site.
As Quanah said, there are 6.
I still only see 5 at:
http://www.oracle.com/technology/products/berkeley-db/db/update/4.2.52/patch...
Sorry, I meant to include the link in the last message about this. We get these patches from Quanah's stanford page...
http://www.stanford.edu/services/directory/openldap/configuration/bdb-build-...
I know that page. So Quanah, where are you getting these insider patches from! ;-)
Cheers Toby
Yes, the model is essentially one of a central write-master with all clients being replicas (using our own replication technology). We do want to think about moving away from this, however, possibly using local proxy-caching with a handful of servers replicating from the master with syncrepl. It's all still to be investigated properly, but there's certainly an incentive for that.
Ok, well you should have really mentioned "using our own replication technology", and since we have no way of knowing what this is or why you are using it, we can't possibly help diagnose if this is having an effect on slapd, other than what we have already discussed.
Why not use an open/rfc'd proven replication technology?
Again, decisions from the past. Yes, apologies, I should have mentioned our own replication, although I don't believe this to be much of a factor here - in essence the replication is a pull-based form of synchronisation which knows the last modification date of the local directory and then queries the master for entries modified after that date. It uses standard LDAP operations to make the changes to the local directory.
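As described, each client works out the newest modification time it holds and asks the master only for entries changed after that. A minimal Python sketch of that pull loop, using in-memory dicts in place of real directories. Everything here is an assumption for illustration: the thread does not show the actual tool, the `dc=example` DNs are made up, and expressing the query with `modifyTimestamp` is only a guess at how it might be done:

```python
from datetime import datetime, timezone


def changed_since_filter(last_sync):
    """LDAP filter matching entries modified at or after last_sync (UTC).

    LDAP filters offer >= but no strict >, hence "at or after".
    """
    stamp = last_sync.astimezone(timezone.utc).strftime("%Y%m%d%H%M%SZ")
    return "(modifyTimestamp>={})".format(stamp)


def pull_changes(master, local):
    """Copy entries from master that are newer than anything held locally."""
    last_sync = max((e["modified"] for e in local.values()),
                    default=datetime.min.replace(tzinfo=timezone.utc))
    changed = {dn: e for dn, e in master.items() if e["modified"] > last_sync}
    local.update(changed)
    return sorted(changed)


t1 = datetime(2007, 7, 4, 12, 0, 0, tzinfo=timezone.utc)
t2 = datetime(2007, 7, 5, 9, 30, 0, tzinfo=timezone.utc)
master = {
    "uid=a,dc=example": {"modified": t1, "cn": "A"},
    "uid=b,dc=example": {"modified": t2, "cn": "B"},
}
local = {"uid=a,dc=example": {"modified": t1, "cn": "A"}}

print(changed_since_filter(t1))
print(pull_changes(master, local))
```

In the sketch the first pull copies only the entry modified after the local copy's newest timestamp, and a second pull straight afterwards finds nothing to do.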
Cheers Toby
openldap-software@openldap.org