One thing I just noticed, while testing replication with 3 servers on my laptop - during a refresh, the provider gets blocked waiting to write to the consumers after writing about 4000 entries. I.e., the consumers aren't processing fast enough to keep up with the search running on the provider.
(That's actually not too surprising since reads are usually faster than writes anyway.)
The consumer code has lots of problems as it is, just adding this note to the pile.
I'm considering adding an option to the consumer to write its entries with dbnosync during the refresh phase. The rationale being, there's nothing to lose anyway if the refresh is interrupted. I.e., the consumer can't update its contextCSN until the very end of the refresh, so any partial refresh that gets interrupted is wasted effort - the consumer will always have to start over from the beginning on its next refresh attempt. As such, there's no point in safely/synchronously writing any of the received entries - they're useless until the final contextCSN update.
The implementation approach would be to define a new control e.g. "fast write" for the consumer to pass to the underlying backend on any write op. We would also have to e.g. add an MDB_TXN_NOSYNC flag to mdb_txn_begin() (BDB already has the equivalent flag).
This would only be used for writes that are part of a refresh phase. In persist mode the provider and consumers' write speeds should be more closely matched so it wouldn't be necessary or useful.
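To illustrate the intended write pattern, here is a minimal sketch against plain LMDB, using the existing environment-wide MDB_NOSYNC flag as a stand-in for the proposed per-transaction MDB_TXN_NOSYNC (which doesn't exist yet); the keys/values are placeholders, not slapd's actual storage format, and the final mdb_env_sync() plays the role of the durable contextCSN write at the end of refresh:

#include <stdio.h>
#include <string.h>
#include <lmdb.h>

/* Sketch: commit refresh entries without fsync, then sync once at the
 * end - nothing is durable (or needed) until the contextCSN lands. */
int refresh_write(const char *path)
{
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;
    int rc, i;

    if ((rc = mdb_env_create(&env)) != 0)
        return rc;
    mdb_env_set_mapsize(env, 1UL << 30);        /* 1 GB map */
    /* MDB_NOSYNC: no fsync on commit; a crash loses recent txns but
     * leaves the DB structurally intact. */
    if ((rc = mdb_env_open(env, path, MDB_NOSYNC, 0664)) != 0)
        goto done;

    for (i = 0; i < 4000; i++) {                /* entries from refresh */
        char kbuf[32], vbuf[32];
        MDB_val key, data;

        snprintf(kbuf, sizeof(kbuf), "entry-%d", i);
        snprintf(vbuf, sizeof(vbuf), "data-%d", i);
        key.mv_size = strlen(kbuf);  key.mv_data = kbuf;
        data.mv_size = strlen(vbuf); data.mv_data = vbuf;

        if ((rc = mdb_txn_begin(env, NULL, 0, &txn)) != 0)
            goto done;
        if ((rc = mdb_dbi_open(txn, NULL, 0, &dbi)) != 0 ||
            (rc = mdb_put(txn, dbi, &key, &data, 0)) != 0) {
            mdb_txn_abort(txn);
            goto done;
        }
        if ((rc = mdb_txn_commit(txn)) != 0)    /* cheap: no fsync */
            goto done;
    }
    /* End of refresh: flush everything before the contextCSN update
     * is considered valid. */
    rc = mdb_env_sync(env, 1);
done:
    mdb_env_close(env);
    return rc;
}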
Comments?
--On January 29, 2015 at 3:12:17 AM +0000 Howard Chu hyc@symas.com wrote:
This would only be used for writes that are part of a refresh phase. In persist mode the provider and consumers' write speeds should be more closely matched so it wouldn't be necessary or useful.
I've had a few cases on extremely busy systems with multiple replicas/mmr nodes where they literally never catch up. Only way I've been able to resolve those cases is to stop them, slapcat the master, slapadd, and restart. Hopefully this change would alleviate that scenario.
--Quanah
Quanah Gibson-Mount wrote:
--On January 29, 2015 at 3:12:17 AM +0000 Howard Chu hyc@symas.com wrote:
This would only be used for writes that are part of a refresh phase. In persist mode the provider and consumers' write speeds should be more closely matched so it wouldn't be necessary or useful.
I've had a few cases on extremely busy systems with multiple replicas/mmr nodes where they literally never catch up. Only way I've been able to resolve those cases is to stop them, slapcat the master, slapadd, and restart. Hopefully this change would alleviate that scenario.
Yes, I'm seeing the same thing. And yes, that's my hope as well. Not sure if it's enough; like I said there are other performance issues in the consumer code.
On 29/01/15 04:12, Howard Chu wrote:
One thing I just noticed, while testing replication with 3 servers on my laptop - during a refresh, the provider gets blocked waiting to write to the consumers after writing about 4000 entries. I.e., the consumers aren't processing fast enough to keep up with the search running on the provider.
(That's actually not too surprising since reads are usually faster than writes anyway.)
The consumer code has lots of problems as it is, just adding this note to the pile.
I'm considering adding an option to the consumer to write its entries with dbnosync during the refresh phase. The rationale being, there's nothing to lose anyway if the refresh is interrupted. I.e., the consumer can't update its contextCSN until the very end of the refresh, so any partial refresh that gets interrupted is wasted effort - the consumer will always have to start over from the beginning on its next refresh attempt. As such, there's no point in safely/synchronously writing any of the received entries - they're useless until the final contextCSN update.
The implementation approach would be to define a new control e.g. "fast write" for the consumer to pass to the underlying backend on any write op. We would also have to e.g. add an MDB_TXN_NOSYNC flag to mdb_txn_begin() (BDB already has the equivalent flag).
This would only be used for writes that are part of a refresh phase. In persist mode the provider and consumers' write speeds should be more closely matched so it wouldn't be necessary or useful.
Comments?
The proposal sounds sane.
Speaking of which, we had a discussion about some other features that would be nice to have: when a consumer reconnects to a provider, it has no idea how many entries it will receive. It would be valuable to pass an extra piece of information in the exchanged cookie: the number of updated entries. That could provide a hint for users or admins who would like to know how long the update will take on a consumer (assuming we log such information). Also, batching the updates in the backend, i.e. grouping the updates before syncing them, could be interesting to have, still associated with some logs, again allowing the admin/user to follow the update's progress.
Something like:
syncrepl : 1240 entries to update
syncrepl : 200/1240 entries updated
syncrepl : 400/1240 entries updated
...
syncrepl : server up to date.
On 29. jan. 2015 04:12, Howard Chu wrote:
I'm considering adding an option to the consumer to write its entries with dbnosync during the refresh phase. The rationale being, there's nothing to lose anyway if the refresh is interrupted. I.e., the consumer can't update its contextCSN until the very end of the refresh, so any partial refresh that gets interrupted is wasted effort - the consumer will always have to start over from the beginning on its next refresh attempt.
dbnosync loses consistency after a system crash, and it loses the knowledge that the DB may be inconsistent. At least with back-mdb. The safe thing to do after such a crash is to throw away the DB and fetch the entire thing from the provider. Which I gather would need to happen automatically with such an option.
Hallvard Breien Furuseth wrote:
On 29. jan. 2015 04:12, Howard Chu wrote:
I'm considering adding an option to the consumer to write its entries with dbnosync during the refresh phase. The rationale being, there's nothing to lose anyway if the refresh is interrupted. I.e., the consumer can't update its contextCSN until the very end of the refresh, so any partial refresh that gets interrupted is wasted effort - the consumer will always have to start over from the beginning on its next refresh attempt.
dbnosync loses consistency after a system crash, and it loses the knowledge that the DB may be inconsistent. At least with back-mdb. The safe thing to do after such a crash is to throw away the DB and fetch the entire thing from the provider. Which I gather would need to happen automatically with such an option.
From my purely operational standpoint:
The consumer does not have a valid contextCSN before being fully synced. This must be ensured. Everything else can be handled separately. In a serious deployment the monitoring will have the red light on for this replica, and decent health-checks in load-balancers will disable using this replica.
=> don't over-engineer too many things to happen automagically, especially if you're not 100% sure that this auto-magic is rock-solid on every supported OS platform and in every exotic operational situation.
Ciao, Michael.
Hallvard Breien Furuseth wrote:
On 29. jan. 2015 04:12, Howard Chu wrote:
I'm considering adding an option to the consumer to write its entries with dbnosync during the refresh phase. The rationale being, there's nothing to lose anyway if the refresh is interrupted. I.e., the consumer can't update its contextCSN until the very end of the refresh, so any partial refresh that gets interrupted is wasted effort - the consumer will always have to start over from the beginning on its next refresh attempt.
dbnosync loses consistency after a system crash, and it loses the knowledge that the DB may be inconsistent. At least with back-mdb. The safe thing to do after such a crash is to throw away the DB and fetch the entire thing from the provider. Which I gather would need to happen automatically with such an option.
Another option here is simply to perform batching. Now that we have the TXN api exposed in the backend interface, we could just batch up e.g. 500 entries per txn, much like slapadd -q already does. Ultimately we ought to be able to get syncrepl refresh to occur at nearly the same speed as slapadd -q.
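Here's a rough sketch of that batching pattern against plain LMDB, assuming 500 entries per txn; batched_refresh() and its arguments are made-up names, not slapd's real write path. Note the leftover partial batch is committed once the input runs out, which is also what would happen at the end of refresh:

#include <lmdb.h>

#define BATCH_SIZE 500

/* Sketch: group writes so the commit/fsync cost is paid once per
 * batch instead of once per entry. */
int batched_refresh(MDB_env *env, MDB_dbi dbi,
                    MDB_val *keys, MDB_val *vals, int nentries)
{
    MDB_txn *txn = NULL;
    int rc = 0, in_batch = 0, i;

    for (i = 0; i < nentries; i++) {
        if (txn == NULL &&
            (rc = mdb_txn_begin(env, NULL, 0, &txn)) != 0)
            return rc;
        if ((rc = mdb_put(txn, dbi, &keys[i], &vals[i], 0)) != 0) {
            mdb_txn_abort(txn);
            return rc;
        }
        if (++in_batch == BATCH_SIZE) {         /* batch full: commit */
            if ((rc = mdb_txn_commit(txn)) != 0)
                return rc;
            txn = NULL;
            in_batch = 0;
        }
    }
    /* Refresh complete: flush whatever is left in the final batch. */
    if (txn != NULL)
        rc = mdb_txn_commit(txn);
    return rc;
}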
On 03/02/15 05:11, Howard Chu wrote:
Hallvard Breien Furuseth wrote:
On 29. jan. 2015 04:12, Howard Chu wrote:
I'm considering adding an option to the consumer to write its entries with dbnosync during the refresh phase. The rationale being, there's nothing to lose anyway if the refresh is interrupted. I.e., the consumer can't update its contextCSN until the very end of the refresh, so any partial refresh that gets interrupted is wasted effort - the consumer will always have to start over from the beginning on its next refresh attempt.
dbnosync loses consistency after a system crash, and it loses the knowledge that the DB may be inconsistent. At least with back-mdb. The safe thing to do after such a crash is to throw away the DB and fetch the entire thing from the provider. Which I gather would need to happen automatically with such an option.
Another option here is simply to perform batching. Now that we have the TXN api exposed in the backend interface, we could just batch up e.g. 500 entries per txn, much like slapadd -q already does. Ultimately we ought to be able to get syncrepl refresh to occur at nearly the same speed as slapadd -q.
Batching is ok, except that you never know how many entries you're going to have, thus you will have to actually write the data after a period of time, even if you don't have the 500 entries.
This is where it would be cool to extend the cookie to carry the expected number of updates you are going to receive (which will obviously be 1 in a normal running R&P replication, but > 1 most of the time when reconnecting). In this case, you can anticipate the batching operation without having to take care of the time issue.
My 2 cts.
Emmanuel Lécharny wrote:
On 03/02/15 05:11, Howard Chu wrote:
Another option here is simply to perform batching. Now that we have the TXN api exposed in the backend interface, we could just batch up e.g. 500 entries per txn, much like slapadd -q already does. Ultimately we ought to be able to get syncrepl refresh to occur at nearly the same speed as slapadd -q.
Batching is ok, except that you never know how many entries you're going to have, thus you will have to actually write the data after a period of time, even if you don't have the 500 entries.
This isn't a problem - we know exactly when refresh completes, so we can finish the batch regardless of how many entries are left over.
Testing this out with the experimental ITS#8040 patch - with lazy commit, the 2.8M entries (2.5GB of data) take ~10 minutes for the refresh to pull them across. With batching 500 entries/txn + lazy commit it takes ~7 minutes, a decent improvement. It's still 2x slower than slapadd -q though, which loads the data in 3-1/2 minutes.
Howard Chu wrote:
Emmanuel Lécharny wrote:
On 03/02/15 05:11, Howard Chu wrote:
Another option here is simply to perform batching. Now that we have the TXN api exposed in the backend interface, we could just batch up e.g. 500 entries per txn, much like slapadd -q already does. Ultimately we ought to be able to get syncrepl refresh to occur at nearly the same speed as slapadd -q.
Batching is ok, except that you never know how many entries you're going to have, thus you will have to actually write the data after a period of time, even if you don't have the 500 entries.
This isn't a problem - we know exactly when refresh completes, so we can finish the batch regardless of how many entries are left over.
Testing this out with the experimental ITS#8040 patch - with lazy commit, the 2.8M entries (2.5GB of data) take ~10 minutes for the refresh to pull them across. With batching 500 entries/txn + lazy commit it takes ~7 minutes, a decent improvement. It's still 2x slower than slapadd -q though, which loads the data in 3-1/2 minutes.
In case anyone else wants to try this out, patch attached.
On 03/02/15 09:41, Howard Chu wrote:
Emmanuel Lécharny wrote:
On 03/02/15 05:11, Howard Chu wrote:
Another option here is simply to perform batching. Now that we have the TXN api exposed in the backend interface, we could just batch up e.g. 500 entries per txn, much like slapadd -q already does. Ultimately we ought to be able to get syncrepl refresh to occur at nearly the same speed as slapadd -q.
Batching is ok, except that you never know how many entries you're going to have, thus you will have to actually write the data after a period of time, even if you don't have the 500 entries.
This isn't a problem - we know exactly when refresh completes, so we can finish the batch regardless of how many entries are left over.
True for Refresh. I was thinking more specifically of updates when we are connected.
The idea of pushing the expected number of updates within the cookie is for information purposes: having this number traced in the logs/monitoring could help in cases where the refresh phase takes a long time: users will not stop the server thinking it has stalled.
Testing this out with the experimental ITS#8040 patch - with lazy commit, the 2.8M entries (2.5GB of data) take ~10 minutes for the refresh to pull them across. With batching 500 entries/txn + lazy commit it takes ~7 minutes, a decent improvement. It's still 2x slower than slapadd -q though, which loads the data in 3-1/2 minutes.
Not bad at all. What makes it 2x slower, btw?
Restarting this thread...
we have had some interesting discussion today that I wanted to share.
Hypothesis: one server has been down for a long time, and its contextCSN is older than that of the other servers, forcing a refresh with more than the content of the AccessLog.
Quanah said that on some heavily loaded servers, the only way for the consumer to catch up is to slapcat/slapadd/restart the consumer. I wonder if that could not be a way to deal with servers that are too far behind the running servers, but as a mechanism included in the refresh phase (i.e., the restarted server would detect that it has to grab the full set of entries and load them, as if a human being were doing a slapcat/slapadd/restart).
More specifically, is there a way to know how many entries we will have to update, and is there a way to know when it would be faster to be brutal (the Quanah way) rather than letting the refresh mechanism do its job?
Another point: as soon as the server is restarted, it can receive incoming requests, to which it will send back outdated responses until the refresh is completed (and I'm not talking about updates that could also be applied on an outdated base, with the consequences that follow if some parents are missing). In many cases that would be a real problem, typically if the LDAP servers are part of a shared pool of servers with a load-balancing mechanism to spread the load. Wouldn't it be more realistic to simply consider the server as not available until the refresh phase is completed?
Thanks !
--On Monday, May 11, 2015 8:15 PM +0200 Emmanuel Lécharny elecharny@gmail.com wrote:
Quanah said that on some heavily loaded servers, the only way for the consumer to catch up is to slapcat/slapadd/restart the consumer. I wonder if that could not be a way to deal with servers that are too far behind the running servers, but as a mechanism included in the refresh phase (i.e., the restarted server would detect that it has to grab the full set of entries and load them, as if a human being were doing a slapcat/slapadd/restart).
A specific example we had in the past was quarterly updates for students @ Stanford, which could push out tens of thousands of updates to the single-node master. Generally, of the 6 slaves, 2-3 would remain current, and the other 3 would fall hours or days behind. Since serving significantly out-of-date data was not an option, we'd generally have to resort to reloading the ones that got stuck behind to get them sync'd up in a timely fashion.
Another point: as soon as the server is restarted, it can receive incoming requests, to which it will send back outdated responses until the refresh is completed (and I'm not talking about updates that could also be applied on an outdated base, with the consequences that follow if some parents are missing). In many cases that would be a real problem, typically if the LDAP servers are part of a shared pool of servers with a load-balancing mechanism to spread the load. Wouldn't it be more realistic to simply consider the server as not available until the refresh phase is completed?
There's already an option for this, new for OpenLDAP 2.5 IIRC, that makes it return LDAP_BUSY or some such until it is "caught up". However, if you enable that option, it always returns this response, which is problematic, because a server may routinely flip between "caught up" and not "caught up". I.e., it is not unusual for a system to be a second or so behind other masters. Here's real world data from a client I just ran:
[zimbra@zm-mmr01 ~]$ ./libexec/zmreplchk
Master: ldap://zm-mmr01.client.net:389  ServerID: 1   Code: 6  Status: 0y 0M 0w 0d 0h 0m 1s behind
  CSNs: 20150504222317.897445Z#000000#001#000000
        20150511174531.424005Z#000000#002#000000
        20150501181032.360324Z#000000#00a#000000
        20150511174535.964334Z#000000#00b#000000
Master: ldap://zm-mmr00.client.net:389  ServerID: 2   Code: 0  Status: In Sync
  CSNs: 20150504222317.897445Z#000000#001#000000
        20150511174531.424005Z#000000#002#000000
        20150501181032.360324Z#000000#00a#000000
        20150511174535.964334Z#000000#00b#000000
Master: ldap://nvl-mmr10.client.net:389  ServerID: 10  Code: 6  Status: 0y 0M 0w 0d 0h 0m 1s behind
  CSNs: 20150504222317.897445Z#000000#001#000000
        20150511174531.424005Z#000000#002#000000
        20150501181032.360324Z#000000#00a#000000
        20150511174536.315403Z#000000#00b#000000
Master: ldap://nvl-mmr11.client.net:389  ServerID: 11  Code: 6  Status: 0y 0M 0w 0d 0h 0m 1s behind
  CSNs: 20150504222317.897445Z#000000#001#000000
        20150511174531.424005Z#000000#002#000000
        20150501181032.360324Z#000000#00a#000000
        20150511174536.315403Z#000000#00b#000000
--Quanah
--
Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra :: the leader in open source messaging and collaboration
Emmanuel Lécharny wrote:
Restarting this thread...
we have had some interesting discussion today that I wanted to share.
Hypothesis: one server has been down for a long time, and its contextCSN is older than that of the other servers, forcing a refresh with more than the content of the AccessLog.
Quanah said that on some heavily loaded servers, the only way for the consumer to catch up is to slapcat/slapadd/restart the consumer. I wonder if that could not be a way to deal with servers that are too far behind the running servers, but as a mechanism included in the refresh phase (i.e., the restarted server would detect that it has to grab the full set of entries and load them, as if a human being were doing a slapcat/slapadd/restart).
More specifically, is there a way to know how many entries we will have to update, and is there a way to know when it would be faster to be brutal (the Quanah way) rather than letting the refresh mechanism do its job?
Not a worthwhile direction to pursue. Doing the equivalent of a full slapcat/slapadd across the network will use even more bandwidth than the current syncrepl. None of this addresses the underlying causes of why the consumer is slow, so the original problem will remain.
There are two main problems:
1) the AVL tree used for the presentlist is still extremely inefficient in both CPU and memory use.
2) the consumer does twice as much work for a single modification as the provider. I.e., the consumer does a write op to the backend for the modification, and then a second write op to update its contextCSN. The provider only does the original modification, and caches the contextCSN update.
If we fix both of these issues, consumer speed should be much faster. Nothing else is worth investigating until these two areas are reworked.
For (1) I've been considering a stripped down memory-only version of LMDB. There are plenty of existing memory-only Btree implementations out there already though, if anyone has a favorite it would probably save us some time to use an existing library. The Linux kernel has one (lib/btree.c) but it's under GPL so we can't use it directly.
Another point: as soon as the server is restarted, it can receive incoming requests, to which it will send back outdated responses until the refresh is completed (and I'm not talking about updates that could also be applied on an outdated base, with the consequences that follow if some parents are missing). In many cases that would be a real problem, typically if the LDAP servers are part of a shared pool of servers with a load-balancing mechanism to spread the load. Wouldn't it be more realistic to simply consider the server as not available until the refresh phase is completed?
This was ITS#7616. We tried it and it caused a lot of problems. It has been reverted.
On 11/05/15 22:17, Howard Chu wrote:
Emmanuel Lécharny wrote:
Restarting this thread...
we have had some interesting discussion today that I wanted to share.
Hypothesis: one server has been down for a long time, and its contextCSN is older than that of the other servers, forcing a refresh with more than the content of the AccessLog.
Quanah said that on some heavily loaded servers, the only way for the consumer to catch up is to slapcat/slapadd/restart the consumer. I wonder if that could not be a way to deal with servers that are too far behind the running servers, but as a mechanism included in the refresh phase (i.e., the restarted server would detect that it has to grab the full set of entries and load them, as if a human being were doing a slapcat/slapadd/restart).
More specifically, is there a way to know how many entries we will have to update, and is there a way to know when it would be faster to be brutal (the Quanah way) rather than letting the refresh mechanism do its job?
Not a worthwhile direction to pursue. Doing the equivalent of a full slapcat/slapadd across the network will use even more bandwidth than the current syncrepl. None of this addresses the underlying causes of why the consumer is slow, so the original problem will remain.
IMHO, network congestion is not a real problem. Assuming you are running a 1Gb ethernet network, the time it takes to transmit 1 million 1KB entries is only about 10 seconds. It will be barely noticeable compared to the time it will take to load those 1M entries into your consumer. Even with a 100Gb ethernet network, this is not a big part of the problem.
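To put numbers on that figure: 10^6 entries x 1 KB = 1 GB, i.e. 8 Gb of payload; at 1 Gb/s that is roughly 8 seconds on the wire, so on the order of 10 seconds once protocol overhead is counted.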
There are two main problems:
1) the AVL tree used for the presentlist is still extremely inefficient in both CPU and memory use.
2) the consumer does twice as much work for a single modification as the provider. I.e., the consumer does a write op to the backend for the modification, and then a second write op to update its contextCSN.
Updating the contextCSN is an extra operation on the consumer, but as you have to update potentially tens of indexes when updating an entry (on both the consumer and the producer), it's not really twice as much work. It's an additional operation, but it would not double the cost relative to the producer.
The question would be: how do we update the contextCSN only periodically, to mitigate this extra cost? It seems you proposed to batch the updates for this reason. By using batches of 500 updates, this extra cost will be almost unnoticeable, and one would expect the work on the consumer to be the same as on the producer side, right?
The provider only does the original modification, and caches the contextCSN update.
If we fix both of these issues, consumer speed should be much faster. Nothing else is worth investigating until these two areas are reworked.
Agreed in most cases. Although for the use case where a large number of updates have occurred while a consumer is offline, another strategy might work. That this other strategy is to stop the consumer, slapcat the producer, slapadd the result and restart the server, all from the command line instead of having it implemented in the server code, was what I was suggesting - but this is another story, for a corner case that is not frequent. Plus we don't know at which point this would be the correct strategy (i.e., for how many updates should we consider it a better strategy than the current implementation?).
For (1) I've been considering a stripped down memory-only version of LMDB. There are plenty of existing memory-only Btree implementations out there already though, if anyone has a favorite it would probably save us some time to use an existing library. The Linux kernel has one (lib/btree.c) but it's under GPL so we can't use it directly.
Q: do you need to keep the presentlist in a BTree at all?
Another point: as soon as the server is restarted, it can receive incoming requests, to which it will send back outdated responses until the refresh is completed (and I'm not talking about updates that could also be applied on an outdated base, with the consequences that follow if some parents are missing). In many cases that would be a real problem, typically if the LDAP servers are part of a shared pool of servers with a load-balancing mechanism to spread the load. Wouldn't it be more realistic to simply consider the server as not available until the refresh phase is completed?
This was ITS#7616. We tried it and it caused a lot of problems. It has been reverted.
The two options were to either send a referral (not ideal, as we have no control whatsoever over the client API) or return LDAP_BUSY. A third option would be possible: chaining the request to the server from which the replication updates are coming. Doing so would guarantee that the client gets an updated version of the data, as the producer is up to date. There is still an issue though if both servers are replicating each other (pretty much the problem with referrals). OTOH, if the other server is also in refresh mode, it should be possible to return LDAP_BUSY if it is capable of detecting that the request comes from another server, not from a client. Maybe it's far-fetched...
Emmanuel Lécharny wrote:
On 11/05/15 22:17, Howard Chu wrote:
There are two main problems:
1) the AVL tree used for the presentlist is still extremely inefficient in both CPU and memory use.
2) the consumer does twice as much work for a single modification as the provider. I.e., the consumer does a write op to the backend for the modification, and then a second write op to update its contextCSN.
Updating the contextCSN is an extra operation on the consumer, but as you have to update potentially tens of indexes when updating an entry (on both the consumer and the producer), it's not really twice as much work. It's an additional operation, but it would not double the cost relative to the producer.
You're forgetting one very important thing - each operation is a single transaction in the backend, and transactions are synchronous by default. The main cost is not the indexing, it's the txn fsync, and yes, it is twice the cost when you're doing two txns instead of just one.
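For illustration, a minimal sketch of that difference against plain LMDB with default (synchronous) commits; the contextCSN key/value and the single shared dbi are placeholder assumptions here, not slapd's actual layout:

#include <lmdb.h>

/* Today's consumer pattern: two synchronous txns per modification,
 * hence two fsyncs. */
int two_txns(MDB_env *env, MDB_dbi dbi, MDB_val *ekey, MDB_val *eval,
             MDB_val *ckey, MDB_val *cval)
{
    MDB_txn *txn;
    int rc;

    if ((rc = mdb_txn_begin(env, NULL, 0, &txn)) != 0) return rc;
    if ((rc = mdb_put(txn, dbi, ekey, eval, 0)) != 0) {
        mdb_txn_abort(txn); return rc;
    }
    if ((rc = mdb_txn_commit(txn)) != 0) return rc;   /* fsync #1 */

    if ((rc = mdb_txn_begin(env, NULL, 0, &txn)) != 0) return rc;
    if ((rc = mdb_put(txn, dbi, ckey, cval, 0)) != 0) {
        mdb_txn_abort(txn); return rc;
    }
    return mdb_txn_commit(txn);                       /* fsync #2 */
}

/* Folding the contextCSN write into the same txn halves the fsyncs. */
int one_txn(MDB_env *env, MDB_dbi dbi, MDB_val *ekey, MDB_val *eval,
            MDB_val *ckey, MDB_val *cval)
{
    MDB_txn *txn;
    int rc;

    if ((rc = mdb_txn_begin(env, NULL, 0, &txn)) != 0) return rc;
    if ((rc = mdb_put(txn, dbi, ekey, eval, 0)) != 0 ||
        (rc = mdb_put(txn, dbi, ckey, cval, 0)) != 0) {
        mdb_txn_abort(txn); return rc;
    }
    return mdb_txn_commit(txn);                       /* single fsync */
}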
The question would be: how do we update the contextCSN only periodically, to mitigate this extra cost? It seems you proposed to batch the updates for this reason. By using batches of 500 updates, this extra cost will be almost unnoticeable, and one would expect the work on the consumer to be the same as on the producer side, right?
Yes.
The provider only does the original modification, and caches the contextCSN update.
If we fix both of these issues, consumer speed should be much faster. Nothing else is worth investigating until these two areas are reworked.
Agreed in most cases. Although for the use case where a large number of updates have occurred while a consumer is offline, another strategy might work. That this other strategy is to stop the consumer, slapcat the producer, slapadd the result and restart the server, all from the command line instead of having it implemented in the server code, was what I was suggesting - but this is another story, for a corner case that is not frequent. Plus we don't know at which point this would be the correct strategy (i.e., for how many updates should we consider it a better strategy than the current implementation?).
If the consumer has been offline for a long time, then this discussion is moot. No clients will be looking at it, so the risk of serving out-of-date information to clients is zero. In that case, it doesn't matter what strategy you use, they'll all work.
For (1) I've been considering a stripped down memory-only version of LMDB. There are plenty of existing memory-only Btree implementations out there already though, if anyone has a favorite it would probably save us some time to use an existing library. The Linux kernel has one (lib/btree.c) but it's under GPL so we can't use it directly.
Q: do you need to keep the presentlist in a BTree at all?
Good question. We process it by doing a single search over the target range, and removing presentlist entries for each entry returned by the search. Since the search order is random, we want fast search access to the presentlist.
We could alternatively do a dynamic array and walk the presentlist in order, doing (entryUUID=x) searches on each element. The overhead of doing X individual searches is worse than doing one global search though.
Another point: as soon as the server is restarted, it can receive incoming requests, to which it will send back outdated responses until the refresh is completed (and I'm not talking about updates that could also be applied on an outdated base, with the consequences that follow if some parents are missing). In many cases that would be a real problem, typically if the LDAP servers are part of a shared pool of servers with a load-balancing mechanism to spread the load. Wouldn't it be more realistic to simply consider the server as not available until the refresh phase is completed?
This was ITS#7616. We tried it and it caused a lot of problems. It has been reverted.
The two options were to either send a referral (not ideal, as we have no control whatsoever over the client API) or return LDAP_BUSY. A third option would be possible: chaining the request to the server from which the replication updates are coming. Doing so would guarantee that the client gets an updated version of the data, as the producer is up to date. There is still an issue though if both servers are replicating each other (pretty much the problem with referrals). OTOH, if the other server is also in refresh mode, it should be possible to return LDAP_BUSY if it is capable of detecting that the request comes from another server, not from a client. Maybe it's far-fetched...
In practice, two MMR servers pointed at each other would never make progress.
On 12/05/15 14:34, Howard Chu wrote:
Emmanuel Lécharny wrote:
On 11/05/15 22:17, Howard Chu wrote:
There are two main problems:
1) the AVL tree used for the presentlist is still extremely inefficient in both CPU and memory use.
2) the consumer does twice as much work for a single modification as the provider. I.e., the consumer does a write op to the backend for the modification, and then a second write op to update its contextCSN.
Updating the contextCSN is an extra operation on the consumer, but as you have to update potentially tens of indexes when updating an entry (on both the consumer and the producer), it's not really twice as much work. It's an additional operation, but it would not double the cost relative to the producer.
You're forgetting one very important thing - each operation is a single transaction in the backend, and transactions are synchronous by default. The main cost is not the indexing, it's the txn fsync, and yes, it is twice the cost when you're doing two txns instead of just one.
Good point.
The question would be: how do we update the contextCSN only periodically, to mitigate this extra cost? It seems you proposed to batch the updates for this reason. By using batches of 500 updates, this extra cost will be almost unnoticeable, and one would expect the work on the consumer to be the same as on the producer side, right?
Yes.
The provider only does the original modification, and caches the contextCSN update.
If we fix both of these issues, consumer speed should be much faster. Nothing else is worth investigating until these two areas are reworked.
Agreed in most cases. Although for the use case where a large number of updates have occurred while a consumer is offline, another strategy might work. That this other strategy is to stop the consumer, slapcat the producer, slapadd the result and restart the server, all from the command line instead of having it implemented in the server code, was what I was suggesting - but this is another story, for a corner case that is not frequent. Plus we don't know at which point this would be the correct strategy (i.e., for how many updates should we consider it a better strategy than the current implementation?).
If the consumer has been offline for a long time, then this discussion is moot. No clients will be looking at it, so the risk of serving out-of-date information to clients is zero. In that case, it doesn't matter what strategy you use, they'll all work.
Another good point. It would require a user that sends a huge number of updates while no client has any activity, plus a disconnected consumer - all three conditions at the same time - to face my scenario. Quite rare. I had in mind the user who updates his database with millions of updates once in a while (say, once a year), during the night, and who finds that the consumer is not up and running in the morning.
Not sure it's worth the effort, then, to find a way to mitigate such a corner case.
For (1) I've been considering a stripped down memory-only version of LMDB. There are plenty of existing memory-only Btree implementations out there already though, if anyone has a favorite it would probably save us some time to use an existing library. The Linux kernel has one (lib/btree.c) but it's under GPL so we can't use it directly.
Q: do you need to keep the presentlist in a BTree at all?
Good question. We process it by doing a single search over the target range, and removing presentlist entries for each entry returned by the search. Since the search order is random, we want fast search access to the presentlist.
We could alternatively do a dynamic array and walk the presentlist in order, doing (entryUUID=x) searches on each element. The overhead of doing X individual searches is worse than doing one global search though.
If the goal is to find all the entries that are not present in the DB, wouldn't it be faster to simply quicksort the entryUUIDs we have received? Both algorithms (AVL insertion and quicksort) are in O(n log n) - if you except the possibility that quicksort degenerates to O(n^2), of course - but quicksort is faster than an AVL tree when it comes to ordering a set of values.
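A sketch of that sorted-array idea, assuming raw 16-byte entryUUIDs and made-up names: sort once after the "present" UUIDs arrive, then each entry returned by the refresh search is matched with an O(log n) binary search, and whatever remains unmarked afterwards is a candidate for local deletion:

#include <stdlib.h>
#include <string.h>

#define UUID_LEN 16

typedef struct {
    unsigned char uuid[UUID_LEN];
    int present;        /* set when the refresh search returns it */
} plist_ent;

static int plist_cmp(const void *a, const void *b)
{
    return memcmp(((const plist_ent *)a)->uuid,
                  ((const plist_ent *)b)->uuid, UUID_LEN);
}

/* Sort once, after all "present" UUIDs have been collected. */
void plist_sort(plist_ent *list, size_t n)
{
    qsort(list, n, sizeof(plist_ent), plist_cmp);
}

/* O(log n) lookup for each entry returned by the search; returns 1
 * and marks the slot if the UUID is in the list. */
int plist_mark(plist_ent *list, size_t n, const unsigned char *uuid)
{
    plist_ent key, *hit;

    memcpy(key.uuid, uuid, UUID_LEN);
    hit = bsearch(&key, list, n, sizeof(plist_ent), plist_cmp);
    if (hit == NULL)
        return 0;
    hit->present = 1;
    return 1;
}

/* After the search completes, every entry whose 'present' flag is
 * still 0 was deleted on the provider. */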