Hi,
While doing a few slapadd test runs comparing the RE23 and RE24 versions I ran into a strange issue. I ran tests with different LDIFs (100k, 500k and 1000k entries), and especially with the 500k and 1000k LDIFs, slapadd from 2.3.43 was significantly faster than the 2.4.12 version.
2.3.43 loaded the 500k database in 13m54s, while it took 33m49s with 2.4.12. The 1000k test case took 29m41s on 2.3 (still faster than the 500k on 2.4). I didn't finish the 2.4 run with the 1000k database; I stopped it after about an hour.
I used exactly the same configuration on exactly the same hardware/OS for the tests (an HP ProLiant DL580 G3 with four 3.33GHz Xeons, 8GB of RAM, SLES10-SP2, ext3 filesystem). The BerkeleyDB version was 4.5.20 with the following DB_CONFIG:
set_cachesize 0 4294967295 1
set_lg_regionmax 262144
set_lg_bsize 2097152
cachesize in slapd.conf was large enough to hold the entire database and tool-threads was set to 4.
I did some profiling (with valgrind's callgrind tool) to find out where all the time is spent, and it revealed that 2.4 spends a significantly larger amount of system time in the pwrite() function than 2.3. Most of that seemed to come from bdb_tool_trickle_task(), which calls libdb's memp_trickle() function. Just to verify this I ran a test build with the trickle_task() disabled, and slapadd's performance was back to a normal level, comparable to the 2.3.43 release.
AFAIK the trickle_task() was introduced in 2.4 to increase slapadd throughput, but it has exactly the opposite effect on my test system. Has anybody else had similar experiences? Or do you see anything that's obviously wrong with my test cases?
On Friday 21 November 2008 14:21:06, Ralf Haferkamp wrote:
...
Btw, I just re-ran the test with an unmodified 2.4.12 compiled against db-4.7.25. With that combination slapadd is still a bit slower than the OpenLDAP 2.3/db-4.5.20 combination, but the difference is not quite as huge: 16m8s compared to 13m54s for the 500k entries.
Ralf Haferkamp wrote:
On Friday 21 November 2008 14:21:06, Ralf Haferkamp wrote:
...
AFAIK the trickle_task() was introduced in 2.4 to increase slapadd throughput, but it has exactly the opposite effect on my test system. Has anybody else had similar experiences? Or do you see anything that's obviously wrong with my test cases?
Hmm. As I recall, the trickle task improved performance when the DB was much larger than the cachesize. The problem being, when the cache gets filled, without the trickle task your adds get stuck waiting for cache pagein/out to occur (synchronously), while with the trickle task the add threads can run unimpeded. Perhaps we need a config option to control this.
Ralf Haferkamp wrote:
On Friday 21 November 2008 14:21:06, Ralf Haferkamp wrote:
I did some profiling (with valgrind's callgrind tool) to find out where all the time is spent, and it revealed that 2.4 spends a significantly larger amount of system time in the pwrite() function than 2.3. Most of that seemed to come from bdb_tool_trickle_task(), which calls libdb's memp_trickle() function. Just to verify this I ran a test build with the trickle_task() disabled, and slapadd's performance was back to a normal level, comparable to the 2.3.43 release.
Something else came to mind. The trickle_task obviously cannot increase the overall volume of I/O, so there's no good reason for it to make things more than twice as slow. Except, that if you're getting reads mixed with writes you will lose a lot in disk seek time.
Since you said that your BDB cache size is large enough to contain the entire DB in each case, try this: preload the LDIF into the FS cache before running slapadd. (dd if=<ldif> of=/dev/null bs=1024k)
That will eliminate any seek overhead during the run.
On Fri, 21 Nov 2008, Howard Chu wrote: ...
Something else came to mind. The trickle_task obviously cannot increase the overall volume of I/O, so there's no good reason for it to make things more than twice as slow.
Really? Why can't a DB page be modified, written out by memp_trickle(), modified again, and then written out by txn_checkpoint()? Without the memp_trickle(), it would have only been written out once, no? Also, a page dirtied in a transaction may be written out before the transaction is resolved. If the transaction is aborted, then the page is undirtied and wouldn't even be written as part of the next checkpoint.
So I think the rule is actually the *opposite*: calling memp_trickle() cannot *reduce* the overall volume of I/O, as anything that's dirtied and committed between the last checkpoint and the next one must be written at least once by the time the next checkpoint occurs.
Since you said that your BDB cache size is large enough to contain the entire DB in each case, try this: preload the LDIF into the FS cache before running slapadd. (dd if=<ldif> of=/dev/null bs=1024k)
If the entire thing fits in cache, the only way memp_trickle() could help would be by spreading out the writes required by checkpoints. That is, it helps if it writes a page *and* that page is not dirtied before it is forcibly written out (by checkpoint) or dropped from the cache (by space pressure).
Philip Guenther
Philip Guenther wrote:
On Fri, 21 Nov 2008, Howard Chu wrote: ...
Something else came to mind. The trickle_task obviously cannot increase the overall volume of I/O, so there's no good reason for it to make things more than twice as slow.
Really? Why can't a DB page be modified, written out by memp_trickle(), modified again, and then written out by txn_checkpoint()?
Because we're talking about slapadd here, and it doesn't work that way.
On Saturday 22 November 2008 02:59:28, Howard Chu wrote:
...
Something else came to mind. The trickle_task obviously cannot increase the overall volume of I/O, so there's no good reason for it to make things more than twice as slow. Except, that if you're getting reads mixed with writes you will lose a lot in disk seek time.
Since you said that your BDB cache size is large enough to contain the entire DB in each case, try this: preload the LDIF into the FS cache before running slapadd. (dd if=<ldif> of=/dev/null bs=1024k)
That will eliminate any seek overhead during the run.
That didn't have any effect. But as I ran the tests multiple times, I guess the LDIF had been in the FS cache already anyway.
But as OpenLDAP 2.4 linked against db-4.7.25 is almost as fast as OpenLDAP 2.3 (as I wrote in my second post on Friday), I think this is a problem in db-4.5.20 and wonder if we should just disable the trickle task when linking against a 4.5.x libdb.
When linking against db-4.7.25 slapadd is still a little faster when the trickle-task is disabled: 15m19s (without the trickle-task) vs. 16m12s for the 500k Entries LDIF.
But if the task really helps to speed up slapadd in very large environments, that difference might be acceptable.
Ralf Haferkamp writes:
When linking against db-4.7.25 slapadd is still a little faster when the trickle-task is disabled: 15m19s (without the trickle-task) vs. 16m12s for the 500k Entries LDIF.
Due to this mutex lock in bdb_tool_entry_put(), perhaps?
if (( slapMode & SLAP_TOOL_QUICK ) && (( e->e_id & 0xfff ) == 0xfff )) {
	ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );
	ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond );
	ldap_pvt_thread_mutex_unlock( &bdb_tool_trickle_mutex );
}
Weird code. That's not the normal way to use a cond. Why wait for a mutex before signalling, when the code doesn't change any variable which is tested around the cond_wait?
Hallvard B Furuseth wrote:
Ralf Haferkamp writes:
When linking against db-4.7.25 slapadd is still a little faster when the trickle-task is disabled: 15m19s (without the trickle-task) vs. 16m12s for the 500k Entries LDIF.
Due to this mutex lock in bdb_tool_entry_put(), perhaps?
if (( slapMode & SLAP_TOOL_QUICK ) && (( e->e_id & 0xfff ) == 0xfff )) {
	ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );
	ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond );
	ldap_pvt_thread_mutex_unlock( &bdb_tool_trickle_mutex );
}
Weird code. That's not the normal way to use a cond. Why wait for a mutex before signalling, when the code doesn't change any variable which is tested around the cond_wait?
That was to prevent the signal from being lost. Though I suppose it doesn't matter, since the signal would get sent again soon enough.
Howard Chu writes:
That was to prevent the signal from being lost. Though I suppose it doesn't matter, since the signal would get sent again soon enough.
If it's lost, that'll usually be because memp_trickle() is busily flushing pages. It won't return until it has nothing to do, in which case immediately calling it again won't be very useful.
A few other thoughts:
Could memp_trickle() have a bug so it loses "page is frequently used" info or whatever Berkeley DB is using? Then it could cause such pages to need to be re-read when they'd otherwise not need to be.
The memp_trickle() call limits the dirty page percentage to 70%. Might be interesting to see how a run without the trickle task compares to one with trickle + 43% extra DB cache. (1/0.7 = 1.43).
In the first message, Ralf Haferkamp wrote:
I did some profiling (with valgrind's callgrind tool) to find out where all the time is spent, and it revealed that 2.4 spends a significantly larger amount of system time in the pwrite() function than 2.3. Most of that seemed to come from bdb_tool_trickle_task(), which calls libdb's memp_trickle() function.
The trickle task takes over write work that is otherwise done from other functions. If it spends as much or more time in pwrite() from trickle_task() than it does in pwrite() when there is no trickle task, that'd be interesting...
Hallvard B Furuseth wrote:
Howard Chu writes:
That was to prevent the signal from being lost. Though I suppose it doesn't matter, since the signal would get sent again soon enough.
If it's lost, that'll usually be because memp_trickle() is busily flushing pages. It won't return until it has nothing to do, in which case immediately calling it again won't be very useful.
Right. Perhaps we should remove that mutex lock/unlock.
A few other thoughts:
Could memp_trickle() have a bug so it loses "page is frequently used" info or whatever Berkeley DB is using? Then it could cause such pages to need to be re-read when they'd otherwise not need to be.
I guess that's possible; I haven't really looked at BDB's cache replacement policy. I've always assumed it was simple LRU, though frequency-based would be more useful.
The memp_trickle() call limits the dirty page percentage to 70%. Might be interesting to see how a run without the trickle task compares to one with trickle + 43% extra DB cache. (1/0.7 = 1.43).
Good idea.
I wrote:
Howard Chu writes:
That was to prevent the signal from being lost. Though I suppose it doesn't matter, since the signal would get sent again soon enough.
If it's lost, that'll usually be because memp_trickle() is busily flushing pages. It won't return until it has nothing to do, in which case immediately calling it again won't be very useful.
...unless the doc is inaccurate. __memp_trickle() first computes the number of "buffers" to clear, and then calls __memp_sync_int() to clear them. It does not loop if more dirty pages have arrived during the __memp_sync_int() call. OTOH __memp_sync_int() may be magical enough to do so, I haven't tried to figure out how it works.
Did I send this? Trying again - I wrote:
...unless the doc is inaccurate.
which in any case would be no reason to make bdb_tool_entry_put() wait for env->memp_trickle(). If you keep the mutex lock, set a flag while locked and move memp_trickle() out of the lock.
diff -u -2 -r1.130 servers/slapd/back-bdb/tools.c
--- servers/slapd/back-bdb/tools.c	13 Oct 2008 06:35:31 -0000	1.130
+++ servers/slapd/back-bdb/tools.c	25 Nov 2008 14:05:57 -0000
@@ -78,4 +78,5 @@
 static ldap_pvt_thread_mutex_t bdb_tool_trickle_mutex;
 static ldap_pvt_thread_cond_t bdb_tool_trickle_cond;
+static int bdb_tool_trickle_flag;
 
 static void * bdb_tool_index_task( void *ctx, void *ptr );
@@ -139,4 +140,5 @@
 	slapd_shutdown = 1;
 	ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );
+	bdb_tool_trickle_flag = 1;
 	ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond );
 	ldap_pvt_thread_mutex_unlock( &bdb_tool_trickle_mutex );
@@ -524,4 +526,5 @@
 	if (( slapMode & SLAP_TOOL_QUICK ) && (( e->e_id & 0xfff ) == 0xfff )) {
 		ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );
+		bdb_tool_trickle_flag = 1;
 		ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond );
 		ldap_pvt_thread_mutex_unlock( &bdb_tool_trickle_mutex );
@@ -1105,13 +1108,16 @@
 	int wrote;
 
-	ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );
 	while ( 1 ) {
-		ldap_pvt_thread_cond_wait( &bdb_tool_trickle_cond,
-			&bdb_tool_trickle_mutex );
+		ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );
+		while ( !bdb_tool_trickle_flag )
+			ldap_pvt_thread_cond_wait( &bdb_tool_trickle_cond,
+				&bdb_tool_trickle_mutex );
+		bdb_tool_trickle_flag = 0;
+		ldap_pvt_thread_mutex_unlock( &bdb_tool_trickle_mutex );
+
 		if ( slapd_shutdown )
 			break;
 		env->memp_trickle( env, 30, &wrote );
 	}
-	ldap_pvt_thread_mutex_unlock( &bdb_tool_trickle_mutex );
 
 	return NULL;
On Tuesday 25 November 2008 15:07:21, Hallvard B Furuseth wrote:
Did I send this? Trying again - I wrote:
...unless the doc is inaccurate.
which in any case would be no reason to make bdb_tool_entry_put() wait for env->memp_trickle(). If you keep the mutex lock, set a flag while locked and move memp_trickle() out of the lock.
I gave your attached patch a short try, and it is a bit slower (by 10 to 20 seconds in my tests) than the version with the lock/unlock calls removed.
The question is whether it is really necessary to catch every signal. AFAIK the signal is sent every 4096th entry anyway, so if one is lost there will soon be another one that triggers memp_trickle(). Howard, what's your opinion on this?
Ralf Haferkamp wrote:
On Tuesday 25 November 2008 15:07:21, Hallvard B Furuseth wrote:
...
I gave your attached patch a short try, and it is a bit slower (by 10 to 20 seconds in my tests) than the version with the lock/unlock calls removed.
The question is whether it is really necessary to catch every signal. AFAIK the signal is sent every 4096th entry anyway, so if one is lost there will soon be another one that triggers memp_trickle().
Agreed.
Howard, what's your opinion on this?
I think we can simply omit the lock/unlock here.
On Monday 24 November 2008 23:47:29, Hallvard B Furuseth wrote:
...
In the first message, Ralf Haferkamp wrote:
I did some profiling (with valgrind's callgrind tool) to find out where all the time is spent, and it revealed that 2.4 spends a significantly larger amount of system time in the pwrite() function than 2.3. Most of that seemed to come from bdb_tool_trickle_task(), which calls libdb's memp_trickle() function.
The trickle task takes over write work that is otherwise done from other functions. If it spends as much or more time in pwrite() from trickle_task() than it does in pwrite() when there is no trickle task, that'd be interesting...
As I wrote, I compared OpenLDAP 2.3 with 2.4 (both linked against libdb-4.5), and libdb-4.5 seems to be buggy with regard to the memp_trickle() function. I'd need to re-run the profiling with libdb-4.7, though I am not sure it is really worth it, given the improvements from removing the lock/unlock calls in bdb_tool_entry_put().
On Monday 24 November 2008 23:12:36, Hallvard B Furuseth wrote:
Ralf Haferkamp writes:
When linking against db-4.7.25 slapadd is still a little faster when the trickle-task is disabled: 15m19s (without the trickle-task) vs. 16m12s for the 500k Entries LDIF.
Due to this mutex lock in bdb_tool_entry_put(), perhaps?
if (( slapMode & SLAP_TOOL_QUICK ) && (( e->e_id & 0xfff ) == 0xfff )) {
	ldap_pvt_thread_mutex_lock( &bdb_tool_trickle_mutex );
	ldap_pvt_thread_cond_signal( &bdb_tool_trickle_cond );
	ldap_pvt_thread_mutex_unlock( &bdb_tool_trickle_mutex );
}
Weird code. That's not the normal way to use a cond. Why wait for a mutex before signalling, when the code doesn't change any variable which is tested around the cond_wait?
Removing the _lock call brings the slapadd time down to 15m29s. So I guess that would be worth fixing. I'll commit it if nobody objects.
On Tuesday 25 November 2008 14:51:54, Ralf Haferkamp wrote:
...
I re-ran the test multiple times, and the results show that slapadd with the _lock/_unlock calls removed gives pretty much the same speed as slapadd with the trickle task disabled.