https://bugs.openldap.org/show_bug.cgi?id=9847
Issue ID: 9847 Summary: kqueue problem with OpenBSD Product: OpenLDAP Version: 2.6.2 Hardware: All OS: Other Status: UNCONFIRMED Keywords: needs_review Severity: normal Priority: --- Component: slapd Assignee: bugs@openldap.org Reporter: stu@spacehopper.org Target Milestone: ---
The kqueue support added to slapd for 2.5 isn't working correctly on OpenBSD. If slapd runs as a daemon, errors like these are seen:
slap_client_connect: URI=ldaps://XXX Warning, ldap_start_tls failed (1) mdb_opinfo_get: err Invalid argument(22)
However if slapd is run in the foreground with -d, things are ok.
They are also OK if kqueue is disabled (by neutering the autoconf check).
I've tested 2.5.4, 2.5.9 and 2.6.2.
Here are some log excerpts:
17:23:44.692: slapd starting 17:23:44.692: daemon: added 3r listener=0x0 17:23:44.692: daemon: added 6r listener=0x100263797200 17:23:44.692: daemon: added 7r listener=0x100263785400 17:23:44.692: daemon: kqueue: listen=6 active_threads=0 tvp=zero 17:23:44.692: daemon: kqueue: listen=7 active_threads=0 tvp=zero 17:23:44.692: daemon: activity on 1 descriptor 17:23:44.692: daemon: activity on: 17:23:44.692: 17:23:44.692: daemon: kqueue: listen=6 active_threads=0 tvp=zero 17:23:44.692: daemon: kqueue: listen=7 active_threads=0 tvp=zero 17:23:44.692: >>> dnNormalize: <cn=Consumer 101> 17:23:44.692: <<< dnNormalize: <cn=consumer 101> 17:23:44.692: =>do_syncrepl rid=101 17:23:44.763: slap_client_connect: URI=ldaps://XXXXXXXXXXXXXXX Warning, ldap_start_tls failed (1) 17:23:44.781: => mdb_entry_get: ndn: "dc=XXXXXXX,dc=XXX" 17:23:44.782: => mdb_entry_get: oc: "(null)", at: "contextCSN" 17:23:44.782: mdb_opinfo_get: err Invalid argument(22) 17:23:44.782: =>do_syncrep2 rid=101 17:23:44.782: daemon: added 9r listener=0x0 17:23:44.783: daemon: activity on 1 descriptor 17:23:44.783: daemon: activity on: 17:23:44.783: 17:23:44.783: daemon: kqueue: listen=6 active_threads=0 tvp=NULL 17:23:44.783: daemon: kqueue: listen=7 active_threads=0 tvp=NULL 17:23:44.800: daemon: activity on 1 descriptor 17:23:44.800: daemon: activity on: 17:23:44.800: 9r 17:23:44.800: 17:23:44.800: daemon: read active on 9 17:23:44.800: daemon: kqueue: listen=6 active_threads=0 tvp=NULL 17:23:44.800: daemon: kqueue: listen=7 active_threads=0 tvp=NULL 17:23:44.800: connection_get(9) 17:23:44.800: connection_get(9): got connid=0 17:23:44.801: =>do_syncrepl rid=101 17:23:44.801: =>do_syncrep2 rid=101
Not sure if it's any help but looking at kdump(1) output after a run under ktrace(1) I didn't spot the immediate problem resulting in "ldap_start_tls failed (1)" however the "mdb_opinfo_get: err Invalid argument(22)" occurs after attempting to call fcntl with F_SETLK on the fd 5 which is the kqueue fd:
65304 slapd RET write 97/0x61 65304 slapd CALL poll(0x89d39cb9004,1,INFTIM) 65304 slapd STRU struct kevent [2] { ident=9, filter=EVFILT_READ, flags=0x1005<EV_ADD|EV_ENABLE>, fflags=0<>, data=0, udata=0x231a1bea } { ident=9, filter=EVFILT_EXCEPT, flags=0x1005<EV_ADD|EV_ENABLE>, fflags=0x4<>, data=0, udata=0x231a1bea } 65304 slapd STRU struct kevent { ident=9, filter=EVFILT_READ, flags=0x1005<EV_ADD|EV_ENABLE>, fflags=0<>, data=36, udata=0x231a1bea } 65304 slapd STRU struct pollfd { fd=9, events=0x3<POLLIN|POLLPRI>, revents=0x1<POLLIN> } 65304 slapd RET poll 1 65304 slapd CALL read(9,0x89ccfb103e0,0x5) 65304 slapd GIO fd 9 read 5 bytes "^W^C^C\0^_" 65304 slapd RET read 5 65304 slapd CALL read(9,0x89ccfb349c5,0x1f) 65304 slapd GIO fd 9 read 31 bytes
"\M^W\M-r\M-f\M^C\M-2\M^V\M-E^O\M-hdgq\M-"\M-o\M-&\M^[*\M-9\M^E\M-Sd\M^A2\M-QE\M-cb |\M^VI" 65304 slapd RET read 31/0x1f 65304 slapd CALL mmap(0,0x2000,0x3<PROT_READ|PROT_WRITE>,0x1002<MAP_PRIVATE|MAP_ANON>,-1,0) 65304 slapd RET mmap 9468564512768/0x89c926ca000 65304 slapd CALL fcntl(5,F_SETLK,0x8ef465d0) 65304 slapd RET fcntl -1 errno 22 Invalid argument 65304 slapd CALL getpid() 65304 slapd RET getpid 65304/0xff18 65304 slapd CALL sendsyslog(0x89c8ef44040,59,0<>) 65304 slapd GIO fd -1 wrote 59 bytes "<167>slapd[65304]: mdb_opinfo_get: err Invalid argument(22)"
Any suggestions or requests for further information/tests would be welcome. Thanks!
https://bugs.openldap.org/show_bug.cgi?id=9847
--- Comment #1 from Howard Chu hyc@openldap.org --- If the behavior is different between daemon mode and foreground mode, it sounds like you need to investigate why these two modes are different on OpenBSD. I.e., you have a OS/platform issue and there's unlikely to be an OpenLDAP software issue to be fixed here.
kqueue support is known to work on FreeBSD and MacOSX so again, you need to find out what OpenBSD is doing differently.
https://bugs.openldap.org/show_bug.cgi?id=9847
--- Comment #2 from Howard Chu hyc@openldap.org ---
Not sure if it's any help but looking at kdump(1) output after a run under ktrace(1) I didn't spot the immediate problem resulting in "ldap_start_tls failed (1)" however the "mdb_opinfo_get: err Invalid argument(22)" occurs after attempting to call fcntl with F_SETLK on the fd 5 which is the kqueue fd:
You might try attaching to slapd with a debugger and setting a breakpoint on fcntl to see who's trying to set this lock.
https://bugs.openldap.org/show_bug.cgi?id=9847
--- Comment #3 from stu@spacehopper.org --- Tracked it down.
Parent opens kqueue as fd 5. Post fork, child closes fd 5 because
/* a kqueue fd obtained before a fork can't be used in child process. * close it and reacquire it.
It doesn't work quite like that in OpenBSD. From https://man.openbsd.org/kqueue
kqueue() creates a new kernel event queue and returns a descriptor. The queue is not inherited by a child created with fork(2). Similarly, kqueues ^^^^^^^^^^^^^ cannot be passed across UNIX-domain sockets.
So in the slapd child fd 5 is lock.db and that is what gets closed by SLAP_SOCK_INIT2.
me_lfd remains set to 5 and that is why we get an fcntl later; it is from mdb_reader_pid.
If I remove close() from the macro I no longer hit the problem.
https://bugs.openldap.org/show_bug.cgi?id=9847
--- Comment #4 from stu@spacehopper.org --- Looks like this is a problem on FreeBSD too: https://cgit.freebsd.org/ports/tree/net/openldap25-server/files/patch-server...
https://bugs.openldap.org/show_bug.cgi?id=9847
--- Comment #5 from Howard Chu hyc@openldap.org --- (In reply to stu from comment #3)
Tracked it down.
Parent opens kqueue as fd 5. Post fork, child closes fd 5 because
/* a kqueue fd obtained before a fork can't be used in child process.
- close it and reacquire it.
It doesn't work quite like that in OpenBSD. From https://man.openbsd.org/kqueue
kqueue() creates a new kernel event queue and returns a descriptor. The queue is not inherited by a child created with fork(2). Similarly, kqueues ^^^^^^^^^^^^^ cannot be passed across UNIX-domain sockets.
So in the slapd child fd 5 is lock.db and that is what gets closed by SLAP_SOCK_INIT2.
me_lfd remains set to 5 and that is why we get an fcntl later; it is from mdb_reader_pid.
If I remove close() from the macro I no longer hit the problem.
Interesting. The Apple manpage has the same text, but the behavior is different. The *descriptor* is inherited, it just doesn't have any validity as a queue. Since the manpage doesn't say the *descriptor* is not inherited, this sounds to me like an OS bug either in their kernel or their manpage.
https://bugs.openldap.org/show_bug.cgi?id=9847
--- Comment #6 from Howard Chu hyc@openldap.org ---
this sounds to me like an OS bug either in their kernel or their manpage.
"Their" meaning OpenBSD.
Since you submitted that patch to FreeBSD: did you verify that it is actually misbehaving on FreeBSD first? We've seen no bug reports about it and this support has been in the tree for a couple years already.
https://bugs.openldap.org/show_bug.cgi?id=9847
Howard Chu hyc@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|UNCONFIRMED |RESOLVED Resolution|--- |TEST
--- Comment #7 from Howard Chu hyc@openldap.org --- Never mind, I misread the history of that freeBSD patch.
Thanks for the report. Fixed in master.
https://bugs.openldap.org/show_bug.cgi?id=9847
--- Comment #8 from stu@spacehopper.org --- On 2022/05/14 16:02, openldap-its@openldap.org wrote:
https://bugs.openldap.org/show_bug.cgi?id=9847
--- Comment #6 from Howard Chu hyc@openldap.org ---
this sounds to me like an OS bug either in their kernel or their manpage.
"Their" meaning OpenBSD.
Since you submitted that patch to FreeBSD: did you verify that it is actually misbehaving on FreeBSD first? We've seen no bug reports about it and this support has been in the tree for a couple years already.
-- You are receiving this mail because: You reported the issue.
I didn't submit that to FreeBSD; they have had it since they added OpenLDAP 2.5.x back in Sept 2021.
https://cgit.freebsd.org/ports/commit/net/openldap25-server?id=53fec214c68a5...
That port was committed by Xin Li who has comments forwarded in ITS#6300 where kqueue was added so they presumably already knew about this issue. If the majority of FreeBSD users are using the port (which is quite likely) they would have been insulated from this and not needed to report.
FreeBSD has more precise text in fork(2),
The child process has its own copy of the parent's descriptors, except for descriptors returned by kqueue(2), which are not in- herited from the parent process.
Unsurprisingly DragonFlyBSD has the same patch in their port as FreeBSD.
NetBSD does not have a patch like this in pkgsrc but I found a fairly recent list post suggests the child does not (at least as on 2021/09) inherit the kqueue fd on fork - so it's unclear how things are working in there, if indeed things *are* working there. https://mail-index.netbsd.org/netbsd-users/2021/09/30/msg027833.html
It seems the slapd kqueue code was originally donated by Apple so it would make sense that it would primarily target their implementation. (Other things e.g. that it's broken with device nodes does suggest that Apple's kqueue is a bit different to others).
https://bugs.openldap.org/show_bug.cgi?id=9847
--- Comment #9 from Howard Chu hyc@openldap.org ---
NetBSD does not have a patch like this in pkgsrc but I found a fairly
recent list post suggests the child does not (at least as on 2021/09) inherit the kqueue fd on fork - so it's unclear how things are working in there, if indeed things *are* working there. https://mail-index.netbsd.org/netbsd-users/2021/09/30/msg027833.html
The post makes some good points. The child process is supposed to be identical to the parent process after a fork, but this behavior means the child's descriptor table is quite different from the parent's. Whoever designed this kqueue interface really dropped the ball.
https://bugs.openldap.org/show_bug.cgi?id=9847
--- Comment #10 from Howard Chu hyc@openldap.org --- For future reference: ran this test program on MacOSX
### #include <unistd.h> # include <sys/types.h> # include <sys/event.h> # include <sys/time.h>
int main() { int pfd[2]; int fd = kqueue(); pipe(pfd); fork(); sleep(3600); } ###
and ran lsof on parent and child processes. Confirmed that it behaves the same as the other OSs, descriptor is not inherited.
https://bugs.openldap.org/show_bug.cgi?id=9847
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |delphij@freebsd.org
--- Comment #11 from Quanah Gibson-Mount quanah@openldap.org --- (In reply to stu from comment #4)
Looks like this is a problem on FreeBSD too: https://cgit.freebsd.org/ports/tree/net/openldap25-server/files/patch- servers_slapd_daemon.c
I'm somewhat puzzled as to why Xin Li never reported this to the OpenLDAP project. I had a 2 year long thread with them about kqueue support in OpenLDAP, and while one issue was reported and fixed with the code, there was no follow up on it since January of 2019 when I last asked Xin Li to confirm the code in OpenLDAP master worked correctly.
https://bugs.openldap.org/show_bug.cgi?id=9847
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Target Milestone|--- |2.5.13 Keywords|needs_review | Version|2.6.2 |2.5.4
https://bugs.openldap.org/show_bug.cgi?id=9847
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Resolution|TEST |FIXED
--- Comment #12 from Quanah Gibson-Mount quanah@openldap.org --- head:
Commits: • ced6652e by Howard Chu at 2022-05-14T17:16:02+01:00 ITS#9847 slapd: fix kqueue for FreeBSD/MacOSX/OpenBSD
• 26e37213 by Howard Chu at 2022-05-14T17:40:37+01:00 ITS#9847 fix kqueue detection on MacOSX
RE26:
• 1ec2d8ae by Howard Chu at 2022-05-16T14:43:56+00:00 ITS#9847 slapd: fix kqueue for FreeBSD/MacOSX/OpenBSD
• b37bcadf by Howard Chu at 2022-05-16T15:07:16+00:00 ITS#9847 fix kqueue detection on MacOSX
RE25:
• e2e9373a by Howard Chu at 2022-05-16T15:07:33+00:00 ITS#9847 slapd: fix kqueue for FreeBSD/MacOSX/OpenBSD
• 59ae963b by Howard Chu at 2022-05-16T15:07:42+00:00 ITS#9847 fix kqueue detection on MacOSX
https://bugs.openldap.org/show_bug.cgi?id=9847
--- Comment #13 from Xin Li delphij@freebsd.org --- (In reply to Quanah Gibson-Mount from comment #11)
(In reply to stu from comment #4)
Looks like this is a problem on FreeBSD too: https://cgit.freebsd.org/ports/tree/net/openldap25-server/files/patch- servers_slapd_daemon.c
I'm somewhat puzzled as to why Xin Li never reported this to the OpenLDAP project. I had a 2 year long thread with them about kqueue support in OpenLDAP, and while one issue was reported and fixed with the code, there was no follow up on it since January of 2019 when I last asked Xin Li to confirm the code in OpenLDAP master worked correctly.
Ahhh, that was my bad. I have maintained a side branch locally with some modifications at some point, but never submitted it back (I must've thought I did upstreamed them, then forgot about these and expected the local patch can be removed one day, sorry... will be more careful in the future).
https://bugs.openldap.org/show_bug.cgi?id=9847
--- Comment #14 from Quanah Gibson-Mount quanah@openldap.org --- (In reply to Xin Li from comment #13)
Ahhh, that was my bad. I have maintained a side branch locally with some modifications at some point, but never submitted it back (I must've thought I did upstreamed them, then forgot about these and expected the local patch can be removed one day, sorry... will be more careful in the future).
Thanks for the update! We have moved to using gitlab now which makes contributing a lot simpler. Feel free to sign up for an account for doing future contributions at https://git.openldap.org
https://bugs.openldap.org/show_bug.cgi?id=9847
Quanah Gibson-Mount quanah@openldap.org changed:
What |Removed |Added ---------------------------------------------------------------------------- Status|RESOLVED |VERIFIED