Full_Name: Maxime Besson Version: 2.4.48 OS: Amazon Linux URL: ftp://ftp.openldap.org/incoming/ Submission from: (NULL) (92.184.104.114)
During a cloud provider migration, I have migrated an OpenLDAP instance using MDB + Syncrepl (in mirror mode, with session log) from Ubuntu to Amazon Linux.
When starting a server with no data, syncrepl kicks in and starts replicating the other server's database (12G, 7M entries).
In Ubuntu this phase takes about 25 minutes (with dbnosync). In Amazon Linux, with the exact same cn=config and daemon options, it took 10 days on first try, at 100% CPU consumption the whole time.
I observed that the start of the import is pretty fast (data.mdb grows by 10M/s) but import speed slows down rapidly to a crawl.
After some investigation, I discovered that disabling the sessionlog brings the import times to normal.
Digging further, I found that most of the CPU time is spent in syncprov_add_slog, and particularly this line:
/* Keep the list in csn order. */ ... for ( sep = &sl->sl_head; *sep; sep = &(*sep)->se_next ) {
> if ( ber_bvcmp( &se->se_csn, &(*sep)->se_csn ) < 0 ) {
se->se_next = *sep; *sep = se; break; } }
The sessionlog appears to be growing endlessly:
(gdb) print sl->sl_num $1 = 681896 (gdb) print sl->sl_size $3 = 100
Digging around the source, it seems that the session log is supposed to be cleaned up at the end of syncprov_add_slog, if it's not playing:
if (!sl->sl_playing) { while ( sl->sl_num > sl->sl_size ) { ...
However:
(gdb) print sl->sl_playing $4 = 543388517
This looked like an uninitialized value to me, and indeed, sp_cf_gen doesn't seem to initialize this field:
if ( !sl ) { sl = ch_malloc( sizeof( sessionlog )); sl->sl_mincsn = NULL; sl->sl_sids = NULL; sl->sl_num = 0; sl->sl_numcsns = 0; sl->sl_head = sl->sl_tail = NULL; ldap_pvt_thread_mutex_init( &sl->sl_mutex ); si->si_logs = sl; }
I tried the following patch:
--- openldap-2.4.48.orig/servers/slapd/overlays/syncprov.c 2019-07-23 16:46:22.000000000 +0200 +++ openldap-2.4.48/servers/slapd/overlays/syncprov.c 2020-01-08 11:33:16.770110282 +0100 @@ -3082,6 +3082,7 @@ sl = ch_malloc( sizeof( sessionlog )); sl->sl_mincsn = NULL; sl->sl_sids = NULL; + sl->sl_playing = 0; sl->sl_num = 0; sl->sl_numcsns = 0; sl->sl_head = sl->sl_tail = NULL;
And it got rid of the issue. My guess is that most of the time, sl_playing happens to be 0, but on Amazon Linux, for some reason (patched glibc?), the sessionlog lands in the wrong chunk of uninitilized memory
Additional info:
# rpm -q glibc glibc-2.26-32.amzn2.0.2.x86_64
# uname -a Linux 4.14.152-127.182.amzn2.x86_64 #1 SMP Thu Nov 14 17:32:43 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
# /usr/local/openldap/libexec/slapd -V @(#) $OpenLDAP: slapd 2.4.48 (Jan 8 2020 11:39:38) $
(LTB build)