Howard,
(Cc:-ed openldap-devel@OpenLDAP.org in opposite to our off-line conversation).
Howard Chu wrote:
I'm wondering if it's worth the effort to rewrite the client's LDIF parser as I did for slapadd -q.
As I said I cannot test on the machine where I did the original tests. But I tried to test OpenLDAP's LDIF parser with -n.
$ time ldapadd -f test.ldif -n [..] real 1m30.402s user 1m29.090s sys 0m0.376s
Now I have a small Python script which uses the module 'ldif' from python-ldap for reading in the LDIF file. I've implemented module 'ldif' in pure Python but off course the string module in the underlying Python standard lib is implemented in C. And the Python runtime environment does all the ugly memory management. :-)
$ time python count_members.py < test.ldif [..] real 0m19.145s user 0m18.349s sys 0m0.568s
I re-ran the tests twice, so test.ldif should have been in the filesystem cache.
How does that sound to you? I'm not sure what different actions ldapadd -n does in comparison to my simple script. But at least count_members.py also reads the complete entries into a tuple containing the DN as string and the entry as so-called dictionary.
Ciao, Michael.
Michael Ströder wrote:
How does that sound to you? I'm not sure what different actions ldapadd -n does in comparison to my simple script. But at least count_members.py also reads the complete entries into a tuple containing the DN as string and the entry as so-called dictionary.
Sounds about as expected, the client side really hasn't gotten much attention as far as performance goes.
I just noticed something else, which I'm hesitant to call a bug, since RFC2849's grammar doesn't actually forbid it.
ldapmodify will accept multiple modifications without a separator in between them. E.g.:
dn: dc=example,dc=com changetype: modify add: cn cn: foo sn: bar description: xyzzy
All of the examples show that the attrval-spec's in the mod-spec show matching AttributeDescriptions, but nothing in the grammar says they must match. The above LDIF will create a valid request to add the 3 different attribute values to the target entry.
The examples imply a particular intent, but there's nothing in the grammar that dictates this intent. Obviously the AttributeDescription in the mod-spec is redundant in all cases except when deleting all of the values of an attribute. (Did I mention that I've always thought the mod-spec definition was garbage? The format I use for the logschema has none of these problems or inefficiences...)
At 02:05 PM 11/23/2006, Howard Chu wrote:
ldapmodify will accept multiple modifications without a separator in between them. E.g.:
dn: dc=example,dc=com changetype: modify add: cn cn: foo sn: bar description: xyzzy
While the above may adhere to the ABNF, it's nonsense. The changetype is modify, so we're representing a modifyRequest. The modify request contains a single add operation of cn. Hence, any provided values must be of cn. The second and third values are not of cn. That's an error.
While some parsers might be liberal in handling this error, implicitly inserting additional add operations, that behavior is something authors of LDIF files should not rely on. There certainly are parsers that, quite properly, error on such LDIF input.
RFC 2849 could do with some clarification. The intent was to create a one-to-one mapping between LDIF change records and LDAP update requests.
Kurt
Kurt D. Zeilenga wrote:
At 02:05 PM 11/23/2006, Howard Chu wrote:
ldapmodify will accept multiple modifications without a separator in between them. E.g.:
dn: dc=example,dc=com changetype: modify add: cn cn: foo sn: bar description: xyzzy
While the above may adhere to the ABNF, it's nonsense. The changetype is modify, so we're representing a modifyRequest. The modify request contains a single add operation of cn. Hence, any provided values must be of cn. The second and third values are not of cn. That's an error.
While some parsers might be liberal in handling this error, implicitly inserting additional add operations, that behavior is something authors of LDIF files should not rely on. There certainly are parsers that, quite properly, error on such LDIF input.
RFC 2849 could do with some clarification. The intent was to create a one-to-one mapping between LDIF change records and LDAP update requests.
That's what I figured, but that's clearly not what it does. The code also collapses some things together, which definitely doesn't create a one-to-one mapping. E.g.,
dn: dc=example,dc=com changetype: modify add: cn cn: foo - add: cn cn: bar -
This should represent two separate update requests, but it gets collapsed into a single one:
add: cn cn: foo cn: bar
That's not a big deal for Adds, but there are other cases that are more troublesome. E.g.
dn: dc=example,dc=com changetype: modify delete: cn cn: foo - delete: cn -
In the old code this will yield an incorrect result on the server; only the "cn: foo" value will get deleted instead of the entire cn attribute. This case is handled correctly in the code I just committed, but in most other respects the new code behaves the same as the old.
At 09:19 AM 11/24/2006, Howard Chu wrote:
Kurt D. Zeilenga wrote:
At 02:05 PM 11/23/2006, Howard Chu wrote:
ldapmodify will accept multiple modifications without a separator in between them. E.g.:
dn: dc=example,dc=com changetype: modify add: cn cn: foo sn: bar description: xyzzy
While the above may adhere to the ABNF, it's nonsense. The changetype is modify, so we're representing a modifyRequest. The modify request contains a single add operation of cn. Hence, any provided values must be of cn. The second and third values are not of cn. That's an error. While some parsers might be liberal in handling this error, implicitly inserting additional add operations, that behavior is something authors of LDIF files should not rely on. There certainly are parsers that, quite properly, error on such LDIF input. RFC 2849 could do with some clarification. The intent was to create a one-to-one mapping between LDIF change records and LDAP update requests.
That's what I figured, but that's clearly not what it does. The code also collapses some things together, which definitely doesn't create a one-to-one mapping.
In some cases, the parser can be said to liberal in what it accepts. In some cases, the parser is simply broken.
Kurt
Back on the server side of things...
Using ldapadd to load my test database (380836 entries, 533MB LDIF, 1.2GB id2entry size) takes well over an hour using synchronous writes.
With DB_TXN_WRITE_NOSYNC and a 512 MB BDB cache, plus periodic checkpoints, it takes 39:21.55 minutes. With the patch in add.c:1.247, and the duplicate check in slap_mods_check replaced by a quicksort, it takes 20:23.24 minutes.
With a 1GB BDB cache it takes 14:06.48 minutes. With the above and the new ldapadd client code, it takes 12:49.74 minutes.
The question arises now about whether to keep the attribute values in sorted order, or to just sort a temporary copy at some strategic points and discard the results. Keeping in sorted order breaks a number of tests in the test suite, since results are compared to LDIF files with values in the original order. But it offers the possibility of speeding up Modifies, value_find, and a number of other functions. It's definitely got a lot of potential upside, at the cost of breaking current expectations about attribute value ordering.
Any thoughts?
Howard Chu wrote:
Back on the server side of things...
Using ldapadd to load my test database (380836 entries, 533MB LDIF, 1.2GB id2entry size) takes well over an hour using synchronous writes.
With DB_TXN_WRITE_NOSYNC and a 512 MB BDB cache, plus periodic checkpoints, it takes 39:21.55 minutes. With the patch in add.c:1.247, and the duplicate check in slap_mods_check replaced by a quicksort, it takes 20:23.24 minutes.
With a 1GB BDB cache it takes 14:06.48 minutes. With the above and the new ldapadd client code, it takes 12:49.74 minutes.
With synchronous writes and the original code in HEAD, 1GB BDB cache, it took 1:33:08.74 to ldapadd the database. With the patched add.c and the quicksort it took 1:14:52.11. As a point of reference, it took only 2:42.64 for slapadd -q.
Unfortunately the bulk of the ldapadd time is really determined by BerkeleyDB. With DB_TXN_NOSYNC and transaction logs written to a tmpfs, the original code in HEAD completes the ldapadd in 18:54.56. With the add.c patch that goes down to 13:35.82. With the quicksort in slap_mods_check that drops down further to 7:19.52.
Howard Chu wrote:
With synchronous writes and the original code in HEAD, 1GB BDB cache, it took 1:33:08.74 to ldapadd the database. With the patched add.c and the quicksort it took 1:14:52.11. As a point of reference, it took only 2:42.64 for slapadd -q.
Unfortunately the bulk of the ldapadd time is really determined by BerkeleyDB. With DB_TXN_NOSYNC and transaction logs written to a tmpfs, the original code in HEAD completes the ldapadd in 18:54.56. With the add.c patch that goes down to 13:35.82. With the quicksort in slap_mods_check that drops down further to 7:19.52.
One further refinement - omitting checkpoint calls from the main operation, and leaving them to occur only in the checkpoint thread, drops the time down to 6:16.55. I'd say this is the desired behavior when you only configure a checkpoint time and not a checkpoint size, but if you configure a size then you really need to call after each commit.
I'm not sure how realistic any of these configurations are, but I suppose with a fast enough disk subsystem synchronous writes wouldn't take such a heavy toll.
Howard Chu wrote:
Howard Chu wrote:
With synchronous writes and the original code in HEAD, 1GB BDB cache, it took 1:33:08.74 to ldapadd the database. With the patched add.c and the quicksort it took 1:14:52.11. As a point of reference, it took only 2:42.64 for slapadd -q.
Unfortunately the bulk of the ldapadd time is really determined by BerkeleyDB. With DB_TXN_NOSYNC and transaction logs written to a tmpfs, the original code in HEAD completes the ldapadd in 18:54.56. With the add.c patch that goes down to 13:35.82. With the quicksort in slap_mods_check that drops down further to 7:19.52.
One further refinement - omitting checkpoint calls from the main operation, and leaving them to occur only in the checkpoint thread, drops the time down to 6:16.55. I'd say this is the desired behavior when you only configure a checkpoint time and not a checkpoint size, but if you configure a size then you really need to call after each commit.
With the silly quadratic stuff out of the way, there's only small incremental gains to be made. The oc_bvfind change got things down to 5:55. Optimizing entry_dup will probably get it down to 5:20 or so. At this point, having come down from 1:33:09 to 6:00 or so, within about a 2x factor of slapadd -q is probably good enough. The remaining lber overhead isn't going to go away very easily...