Hello,
OpenLDAP 2.4.26 + BerkeleyDB 5.2.28, also seen on OpenLDAP 2.4.22 + BerkeleyDB 4.6.21
I am trying to investigate a response time issue. I am doing a search with ldapsearch that returns an 8.5 MB binary entry. This takes around 800 ms measured on the server itself, from the moment the request is received until the response has been delivered. Looking at the network traffic, I see a large delay, around 600 ms, between slapd receiving the search request and the actual start of the transfer.
So what is going on during this time?
I am on Solaris 10, so I used DTrace to look at slapd, and it turns out that it calls memmove(<ADDR>, <ADDR> + 1, ~8.5 MB). That is, it shifts the result down by 1 byte, and it does this 4 times. Each memmove takes around 150 ms, which accounts almost completely for the delay before the network transfer starts. I have written a small C program that does the same memmove and verified that, on my setup, shifting 8.5 MB by 1 byte does indeed take around 150 ms.
The stacktrace up until each memmove is
0 42451 memmove:entry *** Moving from 213650401 to 213650402 ***
              libc.so.1`memmove
              slapd`0x82461c9
              slapd`ber_printf+0x4ff
              slapd`slap_send_search_entry+0x1a94
              slapd`bdb_search+0x1ce0
              slapd`fe_op_search+0x574
              slapd`do_search+0xc82
              slapd`0x80927a9
              slapd`0x8092d21
              slapd`0x8210034
The first line shows the source and destination addresses for the memmove. The addresses themselves are not interesting, but their difference (1) is.
Why is it shifting the buffer by 1 byte? If this could be optimized somehow, it could lead to a huge improvement in response time.
Thanks, Rene
Rene B. Elgaard writes:
> Why is it [ber_printf] shifting the buffer by 1 byte?
A BER-encoded sequence/set consists of {tag, length, data}. The length of the length depends on the size of the data. ber_printf() does not know in advance how big the data will be, so it initially reserves 5 bytes for the length.
When the sequence is complete, liblber fills in the actual length and moves the data up to close the gap. This happens recursively, since there are sequences/sets inside sequences/sets. Also liblber grows the buffer dynamically, so there can be some reallocs moving the data too.
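A sketch (assumed code, not liblber's actual implementation) of definite-form BER lengths and the gap-closing move. Note that an ~8.5 MB length fits in three octets, so the minimal length field takes 4 octets and closing the 5-byte reservation shifts the data by exactly 1 byte, matching the observed memmove:

```c
/* Sketch: minimal definite-form BER lengths, and the gap-closing move
 * needed when a 5-byte length reservation turns out to be too large. */
#include <string.h>

/* Write a minimal definite-form BER length at out[]; return octets used. */
size_t ber_len_put(unsigned char *out, unsigned long len)
{
    if (len < 0x80) {                /* short form: one octet */
        out[0] = (unsigned char)len;
        return 1;
    }
    /* long form: 0x80 | count, then big-endian length octets */
    size_t count = 0;
    unsigned long l = len;
    while (l) { count++; l >>= 8; }
    out[0] = 0x80 | (unsigned char)count;
    for (size_t i = 0; i < count; i++)
        out[1 + i] = (unsigned char)(len >> (8 * (count - 1 - i)));
    return 1 + count;
}

/* buf holds a 5-byte reservation followed by datalen bytes of data.
 * Fill in the real length and slide the data down to close the gap --
 * the memmove Rene saw.  Returns the final encoded size. */
size_t close_gap(unsigned char *buf, size_t datalen)
{
    unsigned char lenbuf[5];
    size_t lenlen = ber_len_put(lenbuf, datalen);
    memcpy(buf, lenbuf, lenlen);
    memmove(buf + lenlen, buf + 5, datalen);  /* shift by 5 - lenlen bytes */
    return lenlen + datalen;
}
```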
> If this could be optimized somehow, it could lead to a huge improvement in response time.
For now I suggest we always use 5 bytes when the length exceeds some threshold. This is valid in BER and in LDAP's somewhat restricted BER, but not in DER. Possibly some LDAP implementations, or even some part of OpenLDAP, expect DER anyway, so we might have to clean up further or revert such a change.
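A sketch of what that would mean (assumed code, not a patch): always emit a fixed 5-octet long-form length. Any BER decoder must accept the non-minimal long form; DER requires the minimal form, so a strict DER consumer would reject it. Since this exactly fills the 5-byte reservation, no gap-closing memmove would be needed:

```c
#include <stddef.h>

/* Sketch: a fixed 5-octet length, 0x84 followed by 4 big-endian octets.
 * Valid BER (non-minimal long form), invalid DER (DER demands the minimal
 * encoding).  Exactly fills the 5-byte reservation, so no memmove. */
size_t ber_len_put5(unsigned char *out, unsigned long len)
{
    out[0] = 0x84;                   /* long form, 4 length octets follow */
    out[1] = (unsigned char)(len >> 24);
    out[2] = (unsigned char)(len >> 16);
    out[3] = (unsigned char)(len >> 8);
    out[4] = (unsigned char)(len);
    return 5;
}
```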
Maybe the realloc strategy can also be tweaked.
Next would be two-pass generation of sequences/sets. First use 5-byte lengths and remember their positions, then shorten them. That'll eliminate all but one memmove. This needs some new data structures to track the length positions, though.
Maybe it'd be easier to let the caller do more of the work. Call liblber twice, first just so it can find the lengths and then again with the same data so it can fill them in.
To get rid of the final memmove, we'd have to feed a data structure describing the entire BER element to liblber, so it could get it right on the first attempt. Not anytime soon; that'd require rewriting both liblber and everything that uses it.
I'm not currently volunteering to do any of this, sorry.
On 3/12/12 6:04 PM, Hallvard Breien Furuseth wrote:
> Rene B. Elgaard writes:
>> Why is it [ber_printf] shifting the buffer by 1 byte?
> A BER-encoded sequence/set consists of {tag, length, data}. The length of the length depends on the size of the data. ber_printf() does not know in advance how big the data will be, so it initially reserves 5 bytes for the length.
> When the sequence is complete, liblber fills in the actual length and moves the data up to close the gap. This happens recursively, since there are sequences/sets inside sequences/sets. Also liblber grows the buffer dynamically, so there can be some reallocs moving the data too.
>> If this could be optimized somehow, it could lead to a huge improvement in response time.
> For now I suggest we always use 5 bytes when the length exceeds some threshold. This is valid in BER and in LDAP's somewhat restricted BER, but not in DER. Possibly some LDAP implementations, or even some part of OpenLDAP, expect DER anyway, so we might have to clean up further or revert such a change.
> Maybe the realloc strategy can also be tweaked.
> Next would be two-pass generation of sequences/sets. First use 5-byte lengths and remember their positions, then shorten them. That'll eliminate all but one memmove. This needs some new data structures to track the length positions, though.
> Maybe it'd be easier to let the caller do more of the work. Call liblber twice, first just so it can find the lengths and then again with the same data so it can fill them in.
This is what we do in Apache Directory. It works well for big data, but costs a bit more time for smaller entries. The idea is to allocate the exact amount of memory needed to store the full result.
OTOH, creating such a buffer is an atrocious solution when you have to transfer megabytes of data, as you will just use a hell of a lot of memory on the server before the socket can swallow all of the data.
At some point, you'd like to transfer the entry as a set of buffers, with huge data being read directly from disk instead of passing through memory.
Not an obvious optimization...
On Mon, 12 Mar 2012 18:30:11 +0100, Emmanuel Lécharny wrote:
> On 3/12/12 6:04 PM, Hallvard Breien Furuseth wrote:
>> For now I suggest we always use 5 bytes when the length exceeds some threshold. This is valid in BER and in LDAP's somewhat restricted BER, but not in DER. (...)
Whoops, that's a bit more work than it sounds like. OpenLDAP uses the LBER_USE_DER flag to request minimal lengths etc. We shouldn't break DER for liblber users who actually want DER. So we'd need a new flag, LBER_USE_LDAPBER, and then walk through the ~90 uses of der_alloc() and LBER_USE_DER to see what we mean where.
>> Next would be two-pass generation of sequences/sets. First use 5-byte lengths and remember their positions, then shorten them. That'll eliminate all but one memmove. This needs some new data structures to track the length positions, though.
Some ber_printf formats like 'V' take a sequence/set of strings, but just output them one at a time. liblber could easily do two passes over these. Maybe keep the lengths at the end of the ber buffer, then use them in the 2nd pass.
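A sketch of the two-pass idea for such a flat case (assumed code, not a liblber patch): pass one computes every element length, pass two writes tags, minimal lengths, and data straight into an exactly-sized buffer, so there is no 5-byte reservation and no memmove:

```c
/* Sketch: two-pass encoding of a SEQUENCE OF OCTET STRING. */
#include <stdlib.h>
#include <string.h>

static size_t len_octets(size_t len)     /* size of a minimal BER length */
{
    size_t n = 1;
    if (len >= 0x80)
        for (; len; len >>= 8) n++;
    return n;
}

static size_t put_len(unsigned char *out, size_t len)
{
    size_t n = len_octets(len), i;
    if (n == 1) { out[0] = (unsigned char)len; return 1; }
    out[0] = 0x80 | (unsigned char)(n - 1);
    for (i = 1; i < n; i++)
        out[i] = (unsigned char)(len >> (8 * (n - 1 - i)));
    return n;
}

/* Encode values[] as a SEQUENCE OF OCTET STRING; returns a malloc'd
 * buffer of exactly the right size, with its length in *outlen. */
unsigned char *encode_seq(const char **values, size_t nvals, size_t *outlen)
{
    size_t content = 0, i, off = 0;
    for (i = 0; i < nvals; i++) {        /* pass 1: compute all lengths */
        size_t l = strlen(values[i]);
        content += 1 + len_octets(l) + l;   /* tag + length + data */
    }
    unsigned char *buf = malloc(1 + len_octets(content) + content);
    if (buf == NULL)
        return NULL;
    buf[off++] = 0x30;                   /* SEQUENCE tag */
    off += put_len(buf + off, content);
    for (i = 0; i < nvals; i++) {        /* pass 2: write the elements */
        size_t l = strlen(values[i]);
        buf[off++] = 0x04;               /* OCTET STRING tag */
        off += put_len(buf + off, l);
        memcpy(buf + off, values[i], l);
        off += l;
    }
    *outlen = off;
    return buf;
}
```

Pass one is cheap (it only walks the value lengths), which is the trade-off Emmanuel mentions: a little extra work for small entries in exchange for no large moves on big ones.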
>> Maybe it'd be easier to let the caller do more of the work. Call liblber twice, first just so it can find the lengths and then again with the same data so it can fill them in.
> This is what we do in Apache Directory. It works well for big data, but costs a bit more time for smaller entries. The idea is to allocate the exact amount of memory needed to store the full result.
Sounds nice. We've been dreaming of rewriting the libraries for years. And all their callers, that's the trouble. But sooner or later we'll have to do it.
slapd does have entry_partsize() and entry_encode() for this, for internally stored entries. We're just not using them for messages.
> OTOH, creating such a buffer is an atrocious solution when you have to transfer megabytes of data, as you will just use a hell of a lot of memory on the server before the socket can swallow all of the data.
> At some point, you'd like to transfer the entry as a set of buffers, with huge data being read directly from disk instead of passing through memory.
> Not an obvious optimization...
It wouldn't have helped that much before back-mdb either, since most of slapd operates on full entries. With back-mdb we don't actually have to "read" from disk the parts of an entry we do not use, which finally makes such optimizations non-hairtearing.
openldap-technical@openldap.org