Which characters can one safely use in filenames on at least the platforms OpenLDAP is ported to: Unix, Windows (Cygwin), IBM z/OS and Mac OS X, according to ANNOUNCEMENT? In particular Windows, since that's the most common non-Unix platform.
I only know Unix filenames. Googling around I've found a plethora of conflicting info.
It's for a naming scheme for back-ldif filenames that will let back-ldif work as a general backend, and hopefully won't conflict with its current names for the config database. ITS#5408.
Currently back-ldif takes the RDN and escapes the directory separator as "\hex", which doesn't work on Windows where \ is a directory separator.
Some other notes:
- Don't need really general filenames. OpenLDAP does in any case assume Unix/Windows/URL-style pathnames: root to the left, leaf to the right, a single directory separator character.
- Hopefully the characters "=-{}" can be used, since database config uses those characters. E.g. olcDatabase={-1}frontend,cn=config. ",+" would be nice too, as separators in and between RDNs.
- Need some escape character. Howard suggested % as in URL-escaping. At first that sounded nice, but on second thought such a filename is not a valid file:// URL component for that file - the '%' must be URL-escaped again to access the file as a URL. Not sure if that's a good argument either way.
Full URL-escaping also escapes {} which is unfortunate (see point 1).
- For real paranoia, might escape uppercase characters in case-sensitive attribute types too. I won't bother unless someone disagrees, both case-sensitive DNs and the use of back-ldif as a general database are fairly rare.
Hallvard B Furuseth wrote:
Which characters can one safely use in filenames on at least the platforms OpenLDAP is ported to: Unix, Windows (Cygwin), IBM z/OS and Mac OS X, according to ANNOUNCEMENT? In particular Windows, since that's the most common non-Unix platform.
MacOSX supports both BSD UFS and Mac HFS(+). In UFS a forward slash is reserved as a path separator, while in HFS the colon serves that purpose. MacOSX swaps the two whenever they appear in the wrong filesystem.
E.g., a file named "foo:bar" stored on UFS will be named "foo/bar" if it's copied to HFS. Currently the MacOSX FileManager will always display paths with "/" as the separator, so it's simplest to treat MacOSX the same as Unix/POSIX.
http://developer.apple.com/qa/qa2006/qa1392.html (Note that we only use BSD APIs...)
HFS also shares the Windows characteristic of being case-preserving but otherwise case-insensitive. I think we can ignore this and just tell people to use UFS if they want to use back-ldif.
IBM z/OS is supported using their UNIX System Services environment. In that environment, regular Unix file conventions apply.
http://publib.boulder.ibm.com/infocenter/zoslnctr/v1r7/index.jsp?topic=/com....
Building OpenLDAP as a Cygwin app is not recommended and I see no reason to make any special effort to support it.
For native Windows, the forbidden characters are listed here http://msdn2.microsoft.com/en-us/library/aa365247.aspx
< > : " / \ | ? *
Windows has a lot of problems, period. I don't think it's worth spending much effort on this platform.
E.g. http://blogs.msdn.com/brian_dewey/archive/2004/01/19/60263.aspx you can create legal file names/paths that the Windows desktop cannot manipulate... We're only using Win32 APIs (and AFAIK the POSIX subsystem was deprecated anyway) so it shouldn't be an issue for us.
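As a rough illustration of the forbidden-character list above, a check for those nine characters could look like the C sketch below. The helper name is_windows_reserved is made up for this example and is not an OpenLDAP function. Note that none of the characters "=-{}" or ",+" from the original question appear in the list.

```c
#include <string.h>

/* The nine characters listed above as forbidden in native Windows
 * file names. */
static const char windows_reserved[] = "<>:\"/\\|?*";

/* Hypothetical helper: returns 1 if c is one of the nine listed
 * characters, 0 otherwise. */
static int is_windows_reserved(char c)
{
    /* strchr() would also match the terminating NUL of the table,
     * so NUL is excluded explicitly; this sketch only covers the
     * nine listed characters. */
    return c != '\0' && strchr(windows_reserved, c) != NULL;
}
```

With this, is_windows_reserved(':') returns 1 while is_windows_reserved('=') and is_windows_reserved('{') return 0.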
I only know Unix filenames. Googling around I've found a plethora of conflicting info.
It's for a naming scheme for back-ldif filenames that will let back-ldif work as a general backend, and hopefully won't conflict with its current names for the config database. ITS#5408.
Currently back-ldif takes the RDN and escapes the directory separator as "\hex", which doesn't work on Windows where \ is a directory separator.
Some other notes:
- Don't need really general filenames. OpenLDAP does in any case assume Unix/Windows/URL-style pathnames: root to the left, leaf to the right, a single directory separator character.
Every supported platform works that way.
Hopefully the characters "=-{}" can be used, since database config uses those characters. E.g. olcDatabase={-1}frontend,cn=config. ",+" would be nice too, as separators in and between RDNs.
Need some escape character. Howard suggested % as in URL-escaping. At first that sounded nice, but on second thought such a filename is not a valid file:// URL component for that file - the '%' must be URL-escaped again to access the file as a URL. Not sure if that's a good argument either way.
OK. It was just an idea, we can use something else instead.
Full URL-escaping also escapes {} which is unfortunate (see point 1).
- For real paranoia, might escape uppercase characters in case-sensitive attribute types too. I won't bother unless someone disagrees, both case-sensitive DNs and the use of back-ldif as a general database are fairly rare.
Ignore that. back-ldif should never be used as a general database.
Heh. Thanks for the info. Have you built OpenLDAP on everything? I expected I'd need advice from half a dozen different people:-)
Howard Chu writes:
MacOSX supports both BSD UFS and Mac HFS(+). In UFS a forward slash is reserved as a path separator, while in HFS the colon serves that purpose. MacOSX swaps the two whenever they appear in the wrong filesystem.
(...) Currently the MacOSX FileManager will always display paths with "/" as the separator, so it's simplest to treat MacOSX the same as Unix/POSIX.
Might as well hex-escape ':' on all systems, since it's troublesome at least for the users on both MacOSX and Windows.
(And '/', 8-bit and control chars, and a special hack for windows according to your latest message on the ITS.)
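To make the proposal concrete, here is a minimal C sketch of hex-escaping ':', '/', control characters and 8-bit characters in a filename component. Several things in it are assumptions, not decisions from this thread: the escape character ('%' here, though the thread has not settled on one), the two-digit uppercase hex format, an ASCII execution character set, and the names needs_escape and escape_filename. The Windows-specific hack mentioned above is not covered.

```c
#include <stddef.h>

/* Assumption: '%' as escape character; the thread has not picked one. */
#define ESC_CHAR '%'

/* Returns 1 if c should be hex-escaped. Assumes ASCII. */
static int needs_escape(unsigned char c)
{
    return c < 0x20 || c >= 0x7f   /* control chars, DEL and 8-bit */
        || c == '/' || c == ':'    /* separators on Unix and HFS */
        || c == ESC_CHAR;          /* keep the encoding reversible */
}

/* Hypothetical helper: writes the escaped form of src into dst,
 * which has room for dstlen bytes including the terminating NUL.
 * Returns 0 on success, -1 if dst is too small. */
static int escape_filename(const char *src, char *dst, size_t dstlen)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;

        if (needs_escape(c)) {
            if (n + 3 >= dstlen)
                return -1;
            dst[n++] = ESC_CHAR;
            dst[n++] = hex[c >> 4];
            dst[n++] = hex[c & 0x0f];
        } else {
            if (n + 1 >= dstlen)
                return -1;
            dst[n++] = (char)c;
        }
    }
    if (n >= dstlen)
        return -1;
    dst[n] = '\0';
    return 0;
}
```

Under these assumptions "cn=a:b/c" becomes "cn=a%3Ab%2Fc", while a name such as "olcDatabase={-1}frontend" passes through unchanged.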
(...) Building OpenLDAP as a Cygwin app is not recommended and I see no reason to make any special effort to support it.
Ah, OK. I was just severely out of date.
Hallvard B Furuseth wrote:
Heh. Thanks for the info. Have you built OpenLDAP on everything?
That's what we (Symas) do. ;) You forget that outside of OpenLDAP I have code running on 98% of the computers in the world. (And a few off the world too...)
I expected I'd need advice from half a dozen different people:-)
Howard Chu writes:
MacOSX supports both BSD UFS and Mac HFS(+). In UFS a forward slash is reserved as a path separator, while in HFS the colon serves that purpose. MacOSX swaps the two whenever they appear in the wrong filesystem.
(...) Currently the MacOSX FileManager will always display paths with "/" as the separator, so it's simplest to treat MacOSX the same as Unix/POSIX.
Might as well hex-escape ':' on all systems, since it's troublesome at least for the users on both MacOSX and Windows.
Fair enough.
(And '/', 8-bit and control chars, and a special hack for windows according to your latest message on the ITS.)
Honestly, I wouldn't make any special effort for control characters. All the filesystems we touch accept them. (Makes life interesting when you're trying to enter a filename at the command line, but what the heck, you can always glob it.) 8-bit I'm not so sure about, some filesystems expect UTF-8, while others are 8-bit clean to begin with. If we can expect that no one is going to use an octetString as a naming attribute, I think we'll be fine since everything else will be in UTF-8 anyway.
Howard Chu writes:
Hallvard B Furuseth wrote:
Might as well hex-escape (...) (And '/', 8-bit and control chars, and a special hack for windows according to your latest message on the ITS.)
Honestly, I wouldn't make any special effort for control characters. All the filesystems we touch accept them.
Well - I think it'd be nice to be able to visit back-ldif filenames without trouble. Still, it's just back-ldif, so as long as it works, it works:-)
8-bit I'm not so sure about, some filesystems expect UTF-8, while others are 8-bit clean to begin with. If we can expect that no one is going to use an octetString as a naming attribute, I think we'll be fine since everything else will be in UTF-8 anyway.
I've got three or so conflicting opinions on that myself. On a UTF-8 filesystem, that'll certainly be nice for people using non-Latin characters. Assuming they decide to use back-ldif in the first place, of course. I suppose it'll happen with private schema include files in cn=config, at least. And maybe module names.
So, unless someone else has strong opinions, I guess they'll stay untouched. And then there's EBCDIC, where uppercase A-Z chars are apparently 8-bit. Though an ('A' == 65) test could catch that.
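For what it's worth, the ('A' == 65) idea could be sketched in C roughly as follows. The function name is invented for this example, and checking three characters instead of one is an extra precaution of the sketch. On an EBCDIC system 'A' is 0xC1, so the test fails there.

```c
/* Hypothetical helper: returns 1 if the execution character set
 * encodes these characters the way ASCII does, 0 otherwise
 * (e.g. on EBCDIC, where 'A' is 0xC1). */
static int charset_looks_ascii(void)
{
    return 'A' == 65 && 'a' == 97 && '0' == 48;
}
```

Code that classifies a byte as "8-bit" could consult this first, so that EBCDIC letters are not hex-escaped by mistake.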
<rant> I didn't believe "everything will soon be UTF-8" a decade ago and I still don't.
8-bit filenames and data worked just fine for a while until various applications discovered "internationalization". That can ruin an otherwise 8-bit clean system - the OS handles it fine, but not the apps. The Windows troubles in the URL you posted seem just the same.
For example, Emacs in its default mode on my system seems unable to visit 8-bit filenames from its own Dired buffer. Presumably something in it disagrees with itself about the encoding of filenames. I'm sure there is something I could configure - until I run into the next equally clever program. So personally I just stick to 7-bit filenames. </rant>
Sigh, this took longer than expected to finish, and then testing ran into problems which I _think_ is a current ITS and not mine, but... maybe I should just apply the filename-specific part of my ldif patches, after a bit of cleanup.
Or maybe it's better to wait till next release after all, since this gets rather late before the current release.
I forgot, how do I _test_ if the program is running on a box where \ is a directory separator? defined(_WIN32) apparently means the Windows API, not Unix emulations. Or not all of them anyway. Googling around, the best I found was to collect various symbols in an #ifdef - the more the merrier.
Probably belongs in portable.hin, but I don't want to put something I can't even test in such an official place. So left to myself I'm inclined to put something like this in back-ldif in the hope of catching current and future ports, and leave cleanup to someone who uses windows:
#if defined(_WIN32) || defined(_MSC_VER) || /* Native windows. Others? */ \
    defined(__MINGW32__) || defined(__CYGWIN__) /* Emulations */ || \
    defined( __MSDOS__) || defined( __DJGPP__) || defined(__GO32__) /*:-)*/
# define LDAP_WINDOWS_FILENAMES 1
#endif
Another point, one strangeness I'm wondering about:
back-ldif can be compiled to translate {} in DNs to other characters in filenames, e.g. "cn={3}foo" => "cn=<3>foo" or => "cn=!3!foo".
If I remember correctly the point was that '{' and '}' might be special characters in the filesystem. However it doesn't translate all {}'s, it translates the first '{' it encounters, then the next '}', then '{', etc. So cn=foo{{bar}} becomes e.g. cn=foo<{bar>}.
Is there a reason for this? I'm inclined to instead translate every '{' and every '}'. It's not that { and } in DNs from slapd can be configured to be the same character; IX_DNL and IX_DNR are defined unconditionally.
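To show the difference between the two behaviours, here is a small C sketch. It is not the back-ldif code: the macro names IX_DNL and IX_DNR are taken from the message above with their values assumed to be '{' and '}', while EX_FSL, EX_FSR and both function names are hypothetical, with '<' and '>' picked from the example.

```c
/* Names from the message above; the values are assumed here. */
#define IX_DNL '{'
#define IX_DNR '}'
/* Hypothetical filesystem-side replacements for this sketch. */
#define EX_FSL '<'
#define EX_FSR '>'

/* Proposed behaviour: translate every '{' and every '}' in place. */
static void translate_all_braces(char *s)
{
    for (; *s != '\0'; s++) {
        if (*s == IX_DNL)
            *s = EX_FSL;
        else if (*s == IX_DNR)
            *s = EX_FSR;
    }
}

/* Behaviour as described above: translate the first '{', then the
 * next '}', then the next '{', and so on; others are left alone. */
static void translate_alternating(char *s)
{
    char want = IX_DNL;

    for (; *s != '\0'; s++) {
        if (*s != want)
            continue;
        if (want == IX_DNL) {
            *s = EX_FSL;
            want = IX_DNR;
        } else {
            *s = EX_FSR;
            want = IX_DNL;
        }
    }
}
```

For "cn=foo{{bar}}" the first function gives "cn=foo<<bar>>" and the second gives "cn=foo<{bar>}", matching the example above. Both give "cn=<3>foo" for "cn={3}foo".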
Hallvard B Furuseth wrote:
Sigh, this took longer than expected to finish, and then testing ran into problems which I _think_ is a current ITS and not mine, but... maybe I should just apply the filename-specific part of my ldif patches, after a bit of cleanup.
A current back-ldif ITS? Or something else?
Or maybe it's better to wait till next release after all, since this gets rather late before the current release.
I guess just commit to HEAD in stages. We can skip this for 2.4.9 if necessary.
I forgot, how do I _test_ if the program is running on a box where \ is a directory separator? defined(_WIN32) apparently means the Windows API, not Unix emulations. Or not all of them anyway. Googling around, the best I found was to collect various symbols in an #ifdef - the more the merrier.
ldap_config.h just tests for _WIN32 and defines LDAP_DIRSEP. There is no reason to do anything beyond that. Cygwin does its own machinations already and we don't know or care what they are.
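A test of that kind could look roughly like the C sketch below. It is only an illustration: EXAMPLE_DIRSEP and example_is_dirsep are hypothetical names, and the actual definition of LDAP_DIRSEP in ldap_config.h may well differ in form.

```c
/* Hypothetical stand-in for a _WIN32-only directory separator test. */
#ifdef _WIN32
#define EXAMPLE_DIRSEP "\\"
#else
#define EXAMPLE_DIRSEP "/"
#endif

/* Hypothetical helper: returns 1 if c separates path components.
 * The Win32 API accepts '/' as well as '\\'. */
static int example_is_dirsep(char c)
{
#ifdef _WIN32
    return c == '\\' || c == '/';
#else
    return c == '/';
#endif
}
```

A filename escaper would then only need to ask example_is_dirsep(c) rather than enumerate compilers and emulation layers.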
Probably belongs in portable.hin, but I don't want to put something I can't even test in such an official place. So left to myself I'm inclined to put something like this in back-ldif in the hope of catching current and future ports, and leave cleanup to someone who uses windows:
#if defined(_WIN32) || defined(_MSC_VER) || /* Native windows. Others? */ \
    defined(__MINGW32__) || defined(__CYGWIN__) /* Emulations */ || \
    defined( __MSDOS__) || defined( __DJGPP__) || defined(__GO32__) /*:-)*/
# define LDAP_WINDOWS_FILENAMES 1
#endif
Ugh. No.
I'm sure slapd can't be compiled on MSDOS or DJGPP any more anyway. libldap probably still works for them, but that's beyond the scope of back-ldif.
Another point, one strangeness I'm wondering about:
back-ldif can be compiled to translate {} in DNs to other characters in filenames, e.g. "cn={3}foo" => "cn=<3>foo" or => "cn=!3!foo".
If I remember correctly the point was that '{' and '}' might be special characters in the filesystem. However it doesn't translate all {}'s, it translates the first '{' it encounters, then the next '}', then '{', etc. So cn=foo{{bar}} becomes e.g. cn=foo<{bar>}.
Is there a reason for this? I'm inclined to instead translate every '{' and every '}'. It's not that { and } in DNs from slapd can be configured to be the same character; IX_DNL and IX_DNR are defined unconditionally.
Seems like it should translate all, yes.
Howard Chu writes:
Hallvard B Furuseth wrote:
Sigh, this took longer than expected to finish, and then testing ran into problems which I _think_ is a current ITS and not mine, but... maybe I should just apply the filename-specific part of my ldif patches, after a bit of cleanup.
A current back-ldif ITS? Or something else?
HEAD with my current back-ldif version, "./run -b ldif all" crashed in some syncrepl stuff. Couldn't yet reproduce it.
Or maybe it's better to wait till next release after all, since this gets rather late before the current release.
I guess just commit to HEAD in stages. We can skip this for 2.4.9 if necessary.
OK
I forgot, how do I _test_ if the program is running on a box where \ is a directory separator? defined(_WIN32) apparently means the Windows API, not Unix emulations. Or not all of them anyway. Googling around, the best I found was to collect various symbols in an #ifdef - the more the merrier.
ldap_config.h just tests for _WIN32 and defines LDAP_DIRSEP. There is no reason to do anything beyond that. Cygwin does its own machinations already and we don't know or care what they are.
So MingW and Cygwin translate filenames with "forbidden characters" like \ to something which works on Windows?
#if defined(_WIN32) || defined(_MSC_VER) || (...)
Ugh. No.
Ugh indeed. OK:-)
I wrote:
ldap_config.h just tests for _WIN32 and defines LDAP_DIRSEP. There is no reason to do anything beyond that. Cygwin does its own machinations already and we don't know or care what they are.
So MingW and Cygwin translate filenames with "forbidden characters" like \ to something which works on Windows?
Sorry, getting tired... I presume this falls under your previous reply, "Building OpenLDAP as a Cygwin app is not recommended and I see no reason to make any special effort to support it."
It'd be easier if I knew Windows so I knew something about what I was talking about...
Hallvard B Furuseth wrote:
I wrote:
ldap_config.h just tests for _WIN32 and defines LDAP_DIRSEP. There is no reason to do anything beyond that. Cygwin does its own machinations already and we don't know or care what they are.
So MingW and Cygwin translate filenames with "forbidden characters" like \ to something which works on Windows?
Sorry, getting tired... I presume this falls under your previous reply, "Building OpenLDAP as a Cygwin app is not recommended and I see no reason to make any special effort to support it."
It'd be easier if I knew Windows so I knew something about what I was talking about...
MinGW is only a programming toolkit that exposes the Win32 API to the GNU toolchain. As such, any code you develop with MinGW is a plain old native Win32 app; there are no special "MinGW" APIs.
Cygwin is a DLL that emulates Unix system calls, plus an environment built around that DLL. The emulation layer it provides has a pretty high resource cost, and Cygwin apps cannot easily be manipulated using regular Windows tools. Performance-wise I've found developing in Cygwin to be at least 3x slower than with native Windows, and as such I avoid it as much as possible.
There's also MSYS, which is a programming environment (forked from an old version of Cygwin) aimed at enabling MinGW development using standard Unix shells and tools. This is what I use for working on Win32 OpenLDAP.