contextCSN of subordinate syncrepl DBs
by Rein Tollevik
I've been trying to figure out why syncrepl used on a backend that is
subordinate to a glue database with the syncprov overlay should save the
contextCSN in the suffix of the glue database rather than the suffix of
the backend where syncrepl is used. But all I come up with are reasons
why this should not be the case. So, unless anyone can enlighten me as
to what I'm missing, I suggest that this be changed.
The problem with the current design is that it makes it impossible to
reliably replicate more than one subordinate db from the same remote
server, as there are race conditions where one of the subordinate
backends could save an updated contextCSN value that is picked up by the
other before it has finished its synchronization.  An example of a
configuration where replicating more than one subordinate db from the
same server might be necessary is the central master described in my
previous posting:
http://www.openldap.org/lists/openldap-devel/200806/msg00041.html
My idea as to how this race condition could be verified was to add
enough entries to one of the backends (while the consumer was stopped)
to make it possible to restart the consumer after the first backend had
saved the updated contextCSN but before the second had finished its
synchronization.  But I was able to produce it simply by adding or
deleting an entry in one of the backends before starting the consumer.
Far too often, the backend without any changes was able to pick up and
save the updated contextCSN from the producer before syncrepl on the
second backend fetched its initial value.  I.e., it started with an
updated contextCSN and didn't receive the changes that had taken place
on the producer.  If syncrepl stored the values in the suffix of its own
database, the backends wouldn't interfere with each other like this.
There is a similar problem in syncprov, as it must use the lowest
contextCSN value (with a given SID) saved by the syncrepl backends
configured within the subtree where syncprov is used.  But to do that it
also needs to distinguish the contextCSN values of each syncrepl
backend, which it can't do when they all save them in the glue suffix.
This also implies that syncprov must ignore contextCSN updates from
syncrepl until all syncrepl backends have saved a value, and that
syncprov on the provider must send newCookie sync info messages when it
updates its contextCSN value and the changed entry isn't being
replicated to a consumer, i.e., as outlined in the message referred to
above.
Neither of these changes should interfere with ordinary multi-master
configurations where syncrepl and syncprov are both used on the same
(glue) database.
I'll volunteer to implement and test the necessary changes if this is
the right solution. But to know whether my analysis is correct or not I
need feedback. So, comments please?
--
Rein Tollevik
Basefarm AS
contextCSN interaction between syncrepl and syncprov
by Rein Tollevik
The remaining errors and race condition that test058 demonstrates cannot
be solved unless syncrepl is changed to always store the contextCSN in
the suffix of the database where it is configured, not the suffix of its
glue database as it does today.
Assuming serverID 0 is reserved for the single-master case, syncrepl
and syncprov can then only be configured within the same database
context if syncprov is a pure forwarding server, i.e., it will not
update any CSN value and syncrepl has no need to fetch any values from
it.
In the multi-master case it is only the contextCSN whose SID matches
the current serverID that syncprov maintains; the others are all
received by syncrepl.  So, the only time syncrepl should need an updated
CSN from syncprov is when it is about to present it to its peer, i.e.,
when it initiates a refresh phase.  In fact, a race condition that would
render the state of the database undetermined could occur if syncrepl
fetched an updated CSN from syncprov during the initial refresh phase.
So, it should be sufficient to read the contextCSN values from the
database before a new refresh phase is initiated, independent of whether
syncprov is in use or not.
Syncrepl will receive updates to the contextCSN value with its own SID
from its peers, at least with ITS#5972 and ITS#5973 in place.  I.e., the
normal ignoring of updates tagged with a too-old contextCSN value will
continue to work.  It should also be safe to ignore all updates tagged
with a contextCSN or entryCSN value whose SID is the current server's
non-zero serverID, provided a complete refresh cycle is known to have
taken place, i.e., when a contextCSN value with the current non-zero
serverID was read from the database before the refresh phase started, or
after the persistent phase has been entered.
The state of the database will be undetermined unless an initial
refresh (i.e., starting from an empty database or CSN set) has been run
to completion.  I cannot see how this can be avoided, and as far as I
know that is the case today too.  It might be worth mentioning in the
docs, though (unless it already is).
Syncprov must continue to monitor the contextCSN updates from syncrepl.
When it receives updates destined for the suffix of the database where
it itself is configured, it must replace any CSN value whose SID matches
its own non-zero serverID with the value it manages itself (which should
be greater than or equal to the value syncrepl tried to store, unless
something is seriously wrong).  Updates to "foreign" contextCSN values
(i.e., those with a SID not matching the current non-zero serverID)
should be imported into the set of contextCSN values syncprov itself
maintains.  Syncprov could also short-circuit the contextCSN update and
delay it to its own checkpoint.  I'm not sure what effect the checkpoint
feature has today when syncrepl constantly updates the contextCSN.
Syncprov must, when syncrepl updates the contextCSN in the suffix of a
subordinate DB, update its own knowledge of the "foreign" CSNs to be the
*lowest* CSN with any given SID stored in all the subordinate DBs (where
syncrepl is configured).  And no update must take place unless a
contextCSN value has been stored in *all* the syncrepl-enabled
subordinate DBs.  Any values matching the current non-zero serverID
should be updated in this case too, but a new value should probably not
be inserted.
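To make that merge rule concrete, here is a minimal sketch of how it
could be expressed.  This is not existing syncprov code; the names and
types are made up for illustration, and it assumes contextCSN values can
be compared in their canonical string form:

/* For one SID: the glue-level value is the lowest CSN stored by the
 * syncrepl-enabled subordinate DBs, and nothing is advertised until
 * every one of them has stored a value for that SID. */
#include <string.h>

#define CSN_LEN 64

/* csns[i][0] == '\0' means subordinate DB i has no value yet for this
 * SID.  Returns 1 and fills 'out' with the minimum when all DBs have a
 * value, 0 otherwise (no update must take place). */
int merge_foreign_csn(char csns[][CSN_LEN], int ndbs, char out[CSN_LEN])
{
    int i, have = 0;

    for (i = 0; i < ndbs; i++) {
        if (csns[i][0] == '\0')
            return 0;   /* wait until all subordinate DBs have caught up */
        if (!have || strcmp(csns[i], out) < 0) {
            strncpy(out, csns[i], CSN_LEN - 1);
            out[CSN_LEN - 1] = '\0';
            have = 1;
        }
    }
    return 1;
}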
These changes should (unless I'm completely lost, that is) create a
cleaner interface between syncrepl and syncprov without harming current
multi-master configurations, and make asymmetric multi-master
configurations like the one in test058 work.  Comments please?
Rein
back-mdb - futures...
by Howard Chu
Just some thoughts on what I'd like to see in a new memory-based backend...
One of the complaints about back-bdb/hdb is the complexity in the tuning;
there are a number of different components that need to be balanced against
each other and the proper balance point varies depending on data size and
workload. One of the directions we were investigating a couple years back was
mechanisms for self-tuning of the caches. (This was essentially the thrust of
Jong-Hyuk Choi's work with zoned allocs for the back-bdb entry cache; it would
allow large chunks of the entry cache to be discarded on demand when system
memory pressure increased.) Unfortunately Jong hasn't been active on the
project in a while and it doesn't appear that anyone else was tracking that
work. Self-tuning is still a goal but it seems to me to be attacking the wrong
problem.
One of the things that annoys me with the current BerkeleyDB based design is
that we have 3 levels of cache operating at all times - filesystem, BDB, and
slapd. This means at least 2 memory copy operations to get any piece of data
from disk into working memory, and you have to play games with the OS to
minimize the waste in the FS cache. (E.g. on Linux, tweak the swappiness setting.)
Back in the 80s I spent a lot of time working on the Apollo DOMAIN OS, which
was based on the M68K platform. One of their (many) claims to fame was the
notion of a single-level store: the processor architecture supported a full 32
bit address space but it was uncommon for systems to have more than 24 bits
worth of that populated, and nobody had anywhere near 1GB of disk space on
their entire network. As such, every byte of available disk space could be
directly mapped to a virtual memory address, and all disk I/O was done thru
mmaps and demand paging. As a result, memory management was completely unified
and memory usage was extremely efficient.
These days you could still take that sort of approach, though on a 32 bit
machine a DB limit of 1-2GB may not be so useful any more. However, with the
ubiquity of 64 bit machines, the idea becomes quite attractive again.
The basic idea is to construct a database that is always mmap'd to a fixed
virtual address, and which returns its mmap'd data pages directly to the
caller (instead of copying them to a newly allocated buffer). Given a fixed
address, it becomes feasible to make the on-disk record format identical to
the in-memory format. Today we have to convert from a BER-like encoding into
our in-memory format, and while that conversion is fast it still takes up a
measurable amount of time. (Which is one reason our slapd entry cache is still
so much faster than just using BDB's cache.) So instead of storing offsets
into a flattened data record, we store actual pointers (since they all simply
reside in the mmap'd space).
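A minimal sketch of that idea follows.  It is purely illustrative, with
made-up names and an arbitrary mapping address, not the eventual backend
API:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

/* Example address only; it must be chosen so it is unused in every
 * process that maps the DB, since MAP_FIXED replaces whatever was
 * mapped there. */
#define DB_FIXED_ADDR ((void *)0x100000000000UL)

static void *db_map;
static size_t db_size;

/* Map the whole DB file at a fixed virtual address so that pointers
 * stored inside the file remain valid across processes. */
int db_open(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    db_size = st.st_size;
    db_map = mmap(DB_FIXED_ADDR, db_size, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);
    return db_map == MAP_FAILED ? -1 : 0;
}

/* Return a record by offset: no allocation, no copy, just a pointer
 * into the mapped file; the OS pages it in on demand. */
void *db_fetch(size_t offset)
{
    return (char *)db_map + offset;
}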
Using this directly mmap'd approach immediately eliminates the 3 layers of
caching and brings it down to 1. As another benefit, the DB would require
*zero* cache configuration/tuning - it would be entirely under the control of
the OS memory manager, and its resident set size would grow or shrink
dynamically without any outside intervention.
It's not clear to me that we can modify BDB to operate in this manner. It
currently supports mmap access for read-only DBs, but it doesn't map to fixed
addresses and still does alloc/copy before returning data to the caller.
Also, while BDB development continues, the new development is mainly occurring
in areas that don't matter to us (e.g. BDB replication) and the areas we care
about (B-tree performance) haven't really changed much in quite a while. I've
mentioned B-link trees a few times before on this list; they have much lower
lock contention than plain B-trees and thus can support even greater
concurrency. I've also mentioned them to the BDB team a few times and as yet
they have no plans to implement them. (Here's a good reference:
http://www.springerlink.com/content/eurxct8ewt0h3rxm/ )
As such, it seems likely that we would have to write our own DB engine to
pursue this path. (Clearly such an engine must still provide full ACID
transaction support, so this is a non-trivial undertaking.) Whether and when
we embark on this is unclear; this is somewhat of an "ideal" design and as
always, "good enough" is the enemy of "perfect" ...
This isn't a backend we can simply add to the current slapd source base, so
it's probably an OpenLDAP 3.x target: In order to have a completely canonical
record on disk, we also need pointers to AttributeDescriptions to be recorded
in each entry and those AttributeDescription pointers must also be persistent.
Which means that our current AttributeDescription cache must be modified to
also allocate its records from a fixed mmap'd region. (And we'll have to
include a schema-generation stamp, so that if schema elements are deleted we
can force new AD pointers to be looked up when necessary.) (Of course, given
the self-contained nature of the AD cache, we can probably modify its behavior
in this way without impacting any other slapd code...)
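Roughly what I have in mind, as a hypothetical sketch only (none of
these names exist in slapd, and the real AD cache is more involved):

#include <stdint.h>

/* Header of the fixed mmap'd region that AttributeDescription records
 * are allocated from.  The generation stamp is bumped whenever schema
 * elements are deleted. */
typedef struct ad_arena {
    uint64_t schema_generation;
    /* bump-allocator state and AD records would follow */
} ad_arena;

/* An AD pointer as stored inside an on-disk entry, tagged with the
 * schema generation it was resolved under. */
typedef struct stored_ad {
    uint64_t generation;
    void    *ad;        /* points into the arena */
} stored_ad;

/* Use the stored pointer directly while the schema generation matches;
 * otherwise force a fresh lookup by name.  'lookup_by_name' is a
 * stand-in for whatever the real AD cache would provide. */
void *ad_resolve(ad_arena *arena, stored_ad *ref, const char *name,
                 void *(*lookup_by_name)(const char *))
{
    if (ref->generation != arena->schema_generation) {
        ref->ad = lookup_by_name(name);
        ref->generation = arena->schema_generation;
    }
    return ref->ad;
}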
There's also a potential risk to leaving all memory management up to the OS -
the native memory manager on some OS's (e.g. Windows) is abysmal, and the
CLOCK-based cache replacement code we now use in the entry cache is more
efficient than the LRU schemes that some older OS versions use. So we may get
into this and decide we still need to play games with mlock() etc. to control
the cache management. That would be an unfortunate complication, but it would
still allow us to do simpler tuning than we currently need. Still,
establishing a 1:1 correspondence between virtual memory addresses and disk
addresses is a big win for performance, scalability, and reduced complexity
(== greater reliability)...
(And yes, by the way, we have planning for LDAPCon2009 this September in the
works; I imagine the Call For Papers will go out in a week or two. So now's a
good time to pull up whatever other ideas you've had in the back of your mind
for a while...)
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
cn=config: sharing, conditionals
by Howard Chu
So, while we can fully replicate cn=config for the case where all syncrepl
participants are masters/peers, things are still a bit sticky if we only want
the replicas to remain in slave mode.
Instead of going thru complicated mapping/virtual directory redirections, it
seems to me that all we need is to tag certain config objects with the
serverIDs to which they apply. As such, I'm considering adding an
olcServerMatch attribute to the olcDatabaseConfig and olcOverlayConfig
objectclasses. This would take a regexp to match against the current
server ID; if the pattern matches, the config entry is processed,
otherwise it is ignored. This attribute would be absent/empty by
default, making the entry always enabled.
Likewise it may be useful to add a boolean olcDisabled attribute to these
classes, to allow databases and overlays to be switched on and off without
needing to delete them. Again, it would default to absent/FALSE...
We'd also want these controls for syncrepl stanzas. (Too bad the patch to turn
syncrepl into an overlay was never committed....)
For example, we may have a cluster of servers in MMR with a pool of other
servers operating as slaves. We'd want the syncprov overlay active on all of
the masters, the syncrepl consumer active on all of the servers, and the chain
overlay active on all of the slaves. Setting olcServerMatch on the syncprov
and chain overlays would allow things to behave as desired, without needing to
create a parallel config tree just for the slaves.
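To illustrate, here is some purely hypothetical LDIF -- neither
olcServerMatch nor olcDisabled exists yet, and the DNs and patterns are
made up for the scenario above (serverIDs 1-4 as masters, 5 and up as
slaves):

# syncprov: only processed on the masters
dn: olcOverlay={0}syncprov,olcDatabase={1}hdb,cn=config
objectClass: olcOverlayConfig
objectClass: olcSyncProvConfig
olcOverlay: {0}syncprov
olcServerMatch: ^[1-4]$

# chain: only processed on the slaves, and easy to switch off entirely
dn: olcOverlay={0}chain,olcDatabase={-1}frontend,cn=config
objectClass: olcOverlayConfig
objectClass: olcChainConfig
olcOverlay: {0}chain
olcServerMatch: ^([5-9]|[1-9][0-9]+)$
olcDisabled: FALSE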
Comments?
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Re: cn=config: sharing, conditionals
by Raphaël Ouazana-Sustowski
Hi,
On Fri, May 29, 2009 at 16:11, Howard Chu wrote:
> Rein Tollevik wrote:
>> Howard Chu wrote:
>>> Howard Chu wrote:
>>>> Since there is currently no support at all, I think it's important to get
>>>> something usable first, and worry about those other cases later.
>>>
>>> The other alternative, which is much simpler to implement, is just to
>>> add a suffixmassage/rewrite keyword to the syncrepl config, allowing it
>>> to pull from a particular remote base and map it to the local base.
>>> Then it's up to the sysadmin to create a complete cn=config hierarchy
>>> somewhere else on the master server and let the slaves pick it up. That
>>> would address all of the issues of differentiation, at the cost of a
>>> little bit of redundancy on the master.
>>
>> I'm not too fond of the proposed olcServerMatch; it appears to me to
>> create a cluttered config database that requires you to match these
>> attribute values to see the currently active configuration. Should it
>> be added though, I would prefer it to be defined as range(s) of
>> serverIDs rather than a pattern to match. Regexp matching of integer
>> ranges is always awkward.
>
> Yes, I agree, it would make things cluttered and the complexity could
> easily get out of hand. In retrospect I'm not so fond of the idea.
Wouldn't it be possible to simply extend the attribute name to make it
match a server?
E.g.:
olcAccess;server1: ...
olcAccess;server2: ...
That would IMHO be clearer than any regexp-based attribute.
Regards,
Raphaël Ouazana.
Invalid DN after schema change
by Michael Ströder
Hi!
If one plays around with the schema and an attribute type used in the RDN
of an entry is no longer present, then that entry is no longer readable,
because whenever a request is sent to slapd, invalidDNSyntax is returned.
This leads to the situation that a client can't even explicitly delete
the offending entry anymore.
I'd vote for relaxing the schema-based DN checking a bit in the case of
search, rename (only the old DN), modify and delete requests, so that
after a schema change the data can be corrected with normal client tools
without server down-time.
Any thoughts on this?
Ciao, Michael.
NUMA-aware tcmalloc
by Howard Chu
For those of you running multi-socket Opteron servers (and eventually,
multi-socket Nehalem servers), AMD published a whitepaper last week on their
work adapting Google's tcmalloc to be NUMA-aware. The whitepaper includes
links to their source code / diffs. It appears to be quite a performance boost
in their (very artificial) benchmark. I'll be trying it out soon myself.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
current RE24 experience
by Aaron Richton
I've had a spurt of bad luck with 2.4.16 (it appears Quanah and a few
others may share that opinion). The seg faults inspired me to run under
libumem, which has some interesting features that give you "moderate"
debug ability in exchange for a moderate performance hit -- small enough
that I can run it hot safely, unlike full-featured memory debuggers.
At this point a RE24 checkout from late Saturday has been good for me in
production, with some moderate libumem checks enabled. Is everybody else
starting to see RE24 shape up? Bottom line...I think I'm now +1 for
encouraging a 2.4.17 train, for what it's worth...