Ferenc Wagner wrote:
Hi,
First, please let me tell you the story of my adventure yesterday. I'll summarize my questions at the end.
I've set up a simple master-slave replicated system some time ago (stock Debian wheezy OpenLDAP, version 2.4.31-1+nmu2):
dn: olcDatabase={0}config,cn=config
olcSyncrepl: {0}rid=1 provider=ldap://elm.niif.hu [...]

dn: olcDatabase={1}mdb,cn=config
olcSyncrepl: {0}rid=2 provider=ldap://elm.niif.hu [...]
The slave opened two connections to the master, and everything worked fine. Then I enabled TLS and put in a CNAME record, so that the master became accessible as ldaps://ldap-master.niif.hu. I decided to also switch the replication traffic over to the SSL channel, so I used ldapmodify to change the above attributes to contain provider=ldaps://ldap-master.niif.hu. This pretty much broke the system, because the master server suddenly started to replicate from itself and thus became read-only.
Finding no other option, I stopped the "master" slapd and edited back the providers to their previous values (above) in the olcDatabase={0}config.ldif and olcDatabase={1}mdb.ldif files under the cn=config directory of my server configuration. I know these files should not be edited, but I found no other way.
This move made the master recognize itself again in the provider URI, so it did not start replicating and became writeable. My edits, however, did not propagate to the slave. This was expected, probably because I did not change the internal attributes (entryCSN?). Also, slapcat started to report CRC warnings in some LDIF files while dumping the databases, which was understandable for the edited ones, but not so much for cn=config.ldif (if I remember correctly).
I tried to fix these by making some dummy changes to the database entries with ldapmodify. For both, I added an extra olcAccess attribute, then deleted it. These operations made the slave switch its syncrepl connections back from ldaps to the ldap port, but also instantly broke the slave server, which stopped returning results and instead logged lots of
slapd[27944]: => mdb_idl_fetch_key: cursor failed: Invalid argument (22)
lines. Having no better idea, I restarted the slave server, which fortunately returned it to normal working condition.
So, my questions:
- How does the "self-recognition" (by which the master does not start replicating from itself) work, and why did it fail when I changed the provider URI to ldaps?
As noted here http://www.openldap.org/doc/admin24/replication.html#N-Way%20Multi-Master
Did using a CNAME (instead of some FQDN of the server) confuse it? Could this be fixed by adding an appropriate subjectAltName to the server TLS certificate? Or by adding some olcServerID attributes?
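For reference, the N-Way Multi-Master section linked above gives each server an ID together with a URL, and slapd works out which ID is its own by comparing those URLs with its listener URLs (the -h argument). A minimal sketch of such a configuration follows. The second host name is hypothetical, and nothing in this thread establishes whether olcServerID would have prevented the self-replication in a plain master-slave setup.

```ldif
# Sketch only: server IDs with URLs, in the style of the admin guide's multi-master example.
# ldap://slave.example.org is a hypothetical name; the slave's real host name is not given.
dn: cn=config
changetype: modify
replace: olcServerID
olcServerID: 1 ldap://elm.niif.hu
olcServerID: 2 ldap://slave.example.org
```

The slapd-config(5) manual page asks for each server's fully qualified hostname in these URLs, so a CNAME such as ldap-master.niif.hu might not be matched. That is an inference, not something confirmed here.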
- How could I have handled the read-only situation, instead of editing forbidden LDIF files? Would setting olcMirrorMode have been possible (without olcServerIDs around)?
At the moment, manual editing was probably your only course of action. In OpenLDAP 2.5 the slapmodify tool should be used to make changes while slapd is shut down.
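As an untested sketch of what that could look like with slapmodify, the LDIF below reverts both provider URIs. The file name revert.ldif is hypothetical, and "[...]" stands for the remaining syncrepl parameters, which are elided here just as in the original message and would have to be written out in full.

```ldif
# revert.ldif (hypothetical file name): point both consumers back at the original provider.
# "[...]" is the elided remainder of each olcSyncrepl value, not literal syntax.
dn: olcDatabase={0}config,cn=config
changetype: modify
replace: olcSyncrepl
olcSyncrepl: {0}rid=1 provider=ldap://elm.niif.hu [...]

dn: olcDatabase={1}mdb,cn=config
changetype: modify
replace: olcSyncrepl
olcSyncrepl: {0}rid=2 provider=ldap://elm.niif.hu [...]
```

With slapd stopped, this might be applied with something like "slapmodify -F /etc/ldap/slapd.d -n 0 -l revert.ldif". The -F path is assumed to be the Debian default configuration directory, and -n 0 selects the config database.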
- Is my setup in a reliable and consistent state now, or should I expect sudden future failures? I mean, were the "cursor failed" errors fixed for good by the slave server restart?
Don't know. You're using 2.4.31, current is 2.4.39, possibly you saw a bug that has been fixed. Doesn't sound familiar though.
Please also feel free to educate me on any other points, as needed. :)