Fwd: LMDB and text encoding

Timur Kristóf

2 Feb 2015

2:52 a.m.

...

I just had a look at how BDB handled this. As you can see they used a TO_TSTRING macro to convert incoming pathnames from UTF8 to UTF16.

https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274...

https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274...

(And a FROM_TSTRING for the reverse, as well.)

(Mea culpa, I accidentally hit "reply" instead of "reply all". Sorry. Now reposting to the mailing list.)

Since we only need to do this on Windows, we could use MultiByteToWideChar with CP_UTF8. (That's what TO_TSTRING does, too.) I do not think we would ever need to do any such conversion on UNIX. https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%...

I'm not sure if we can just copy-paste BDB's code. Probably not, that would lead to licensing issues, wouldn't it?
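
For illustration, the MultiByteToWideChar approach mentioned above might look roughly like the sketch below. It is written from scratch rather than taken from BDB. The names `native_char` and `path_to_native` are hypothetical and are not part of LMDB; the sketch assumes callers always pass UTF-8, and on non-Windows builds it simply copies the bytes unchanged.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
typedef wchar_t native_char;  /* hypothetical name: UTF-16 units on Windows */
#else
typedef char native_char;     /* hypothetical name: plain bytes elsewhere */
#endif

/*
 * Hypothetical helper, not an LMDB function.
 * Returns a malloc'd copy of utf8_path in the form the OS file APIs expect
 * (UTF-16 on Windows, unchanged bytes elsewhere). The caller frees the
 * result. Returns NULL on invalid input or allocation failure.
 */
native_char *path_to_native(const char *utf8_path)
{
    assert(utf8_path != NULL);
#ifdef _WIN32
    {
        wchar_t *wide;
        /* First call: ask how many wchar_t units are needed. Passing -1 as
         * the input length makes the count include the terminating NUL. */
        int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                    utf8_path, -1, NULL, 0);
        if (n <= 0)
            return NULL;
        wide = malloc((size_t)n * sizeof *wide);
        if (wide == NULL)
            return NULL;
        /* Second call: perform the actual conversion. */
        if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8_path, -1, wide, n) != n) {
            free(wide);
            return NULL;
        }
        return wide;
    }
#else
    {
        /* No conversion on POSIX: the path is passed through byte for byte. */
        size_t len = strlen(utf8_path) + 1;
        char *copy = malloc(len);
        if (copy != NULL)
            memcpy(copy, utf8_path, len);
        return copy;
    }
#endif
}
```

On Windows the result would then be handed to a wide-character API such as CreateFileW; elsewhere it can go straight to open().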


Howard Chu

2 Feb

2:56 a.m.

Timur Kristóf wrote:

...

...
I just had a look at how BDB handled this. As you can see they used a TO_TSTRING macro to convert incoming pathnames from UTF8 to UTF16.

https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274...

https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274...

(And a FROM_TSTRING for the reverse, as well.)

(Mea culpa, I accidentally hit "reply" instead of "reply all". Sorry. Now reposting to the mailing list.)

Since we only need to do this on Windows, we could use MultiByteToWideChar with CP_UTF8. (That's what TO_TSTRING does, too.) I do not think we would ever need to do any such conversion on UNIX.

Correct, these macros only exist in the Windows-specific source files of BDB. None of this is needed for POSIX.

...

https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%...

I'm not sure if we can just copy-paste BDB's code. Probably not, that would lead to licensing issues, wouldn't it?

I wasn't suggesting a copy/paste, just using it as an example of how the problem could be approached.

--
  -- Howard Chu
  CTO, Symas Corp. http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP http://www.openldap.org/project/

Hallvard Breien Furuseth

4:37 a.m.

I suggest we wait to deal with DB names until we also have a way to deal with filenames. And this time test that it works in practice :-) Hopefully users and programmers will only need one method of handling non-ASCII LMDB names on Windows, not two.

It'd be nice if 'mdb_stat filename -s dbname' would Just Work, as would reading DB names and filenames from a config file. Yet OS-aware and OS-specific config files can look rather different. Maybe LMDB must handle DB names more flexibly than filenames, or maybe we'll end up recommending that "portable" DB names must be UTF-8. And add a "flag convert UTF8<->WCHAR if this is Windows".

-- Hallvard

Howard Chu

5:24 a.m.

Hallvard Breien Furuseth wrote:

...

I suggest we wait to deal with DB names until we also have a way to deal with filenames. And this time test that it works in practice :-) Hopefully users and programmers will only need one method of handling non-ASCII LMDB names on Windows, not two.

It'd be nice if 'mdb_stat filename -s dbname' would Just Work, as would reading DB names and filenames from a config file. Yet OS-aware and OS-specific config files can look rather different. Maybe LMDB must handle DB names more flexibly than filenames, or maybe we'll end up recommending that "portable" DB names must be UTF-8. And add a "flag convert UTF8<->WCHAR if this is Windows".

DB names are purely internal to LMDB, so they bear no relation to OS filenames and none of this discussion matters to them.

--
  -- Howard Chu
  CTO, Symas Corp. http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP http://www.openldap.org/project/

Timur Kristóf

6:20 a.m.

...

DB names are purely internal to LMDB, so they bear no relation to OS filenames and none of this discussion matters to them.

If we let the users treat db names as an MDB_val (essentially, an arbitrary byte array), then all bets are off: we can't even make the assumption that a db name is meaningful text in any encoding. We can make it possible to type such a thing in the console if we represent it as a string of hexadecimal numbers. For example, mdb_dump could do something like to_hex_string in this code snippet: http://pastebin.com/jqnGSS6C (note: you need -std=c11 to compile the snippet).
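
Since the pastebin link may not stay available, here is a minimal sketch of the idea: encode an arbitrary byte string as hexadecimal so it can be typed on a command line, and decode it again. This is not the contents of the linked snippet. The function name `to_hex_string` is borrowed from the message; the signatures and the companion `from_hex_string` are assumptions made for illustration, and neither is part of LMDB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Encode len bytes as lowercase hexadecimal.
 * Returns a malloc'd, NUL-terminated string of 2*len characters that the
 * caller must free, or NULL on allocation failure or size overflow.
 */
char *to_hex_string(const void *data, size_t len)
{
    static const char digits[] = "0123456789abcdef";
    const unsigned char *bytes = data;
    char *out;
    size_t i;

    assert(data != NULL || len == 0);
    if (len > (SIZE_MAX - 1) / 2)
        return NULL;
    out = malloc(len * 2 + 1);
    if (out == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        out[2 * i]     = digits[bytes[i] >> 4];    /* high nibble */
        out[2 * i + 1] = digits[bytes[i] & 0x0f];  /* low nibble */
    }
    out[2 * len] = '\0';
    return out;
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode a hex string into out, which has room for cap bytes.
 * Returns the number of bytes written, or -1 if the input has odd length,
 * contains a non-hex character, or does not fit in cap bytes.
 */
long from_hex_string(const char *hex, unsigned char *out, size_t cap)
{
    size_t n = strlen(hex);
    size_t i;

    if (n % 2 != 0 || n / 2 > cap)
        return -1;
    for (i = 0; i < n / 2; i++) {
        int hi = hex_value(hex[2 * i]);
        int lo = hex_value(hex[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[i] = (unsigned char)(hi * 16 + lo);
    }
    return (long)(n / 2);
}
```

With this, a name consisting of the bytes 00 ff 41 0a would be shown as the string 00ff410a, and the same string could be accepted back as input.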

Hallvard Breien Furuseth

6:48 a.m.

On 02. feb. 2015 14:24, Howard Chu wrote:

...

Hallvard Breien Furuseth wrote:

...
I suggest we wait to deal with DB names until we also have a way to deal with filenames. And this time test that it works in practice :-) Hopefully users and programmers will only need one method of handling non-ASCII LMDB names on Windows, not two.

It'd be nice if 'mdb_stat filename -s dbname' would Just Work, as would reading DB names and filenames from a config file. Yet OS-aware and OS-specific config files can look rather different. Maybe LMDB must handle DB names more flexibly than filenames, or maybe we'll end up recommending that "portable" DB names must be UTF-8. And add a "flag convert UTF8<->WCHAR if this is Windows".

DB names are purely internal to LMDB, so they bear no relation to OS filenames and none of this discussion matters to them.

They're exposed to the programmer and the program's users. Either may want them on command-line arguments, in config files, etc. It will be inconvenient if LMDB requires different string handling for non-ASCII filenames and non-ASCII DB names in such cases. The programmer may choose to use different string handling but let's try to avoid forcing him to do so.

-- Hallvard

Timur Kristóf

7:03 a.m.

...

...
DB names are purely internal to LMDB, so they bear no relation to OS filenames and none of this discussion matters to them.

They're exposed to the programmer and the program's users. Either may want them on command-line arguments, in config files, etc. It will be inconvenient if LMDB requires different string handling for non-ASCII filenames and non-ASCII DB names in such cases. The programmer may choose to use different string handling but let's try to avoid forcing him to do so.

A path is always a Unicode string, while a DB name can be an arbitrary binary blob. So I don't think that we can treat them the same way.

Hallvard Breien Furuseth

7:15 a.m.

On 02. feb. 2015 16:03, Timur Kristóf wrote:

...

A path is always a Unicode string, while a DB name can be an arbitrary binary blob. So I don't think that we can treat them the same way.

Not the point. A program which uses LMDB can choose to treat its own DB names in its own LMDB environments as the same kind of strings as filenames (WCHAR, UTF-8 char, or whatever). Unless we make that impossible.

As for what LMDB will accept and what it must handle, that's up to us. DB names are not binary blobs yet, after all.

-- Hallvard

Timur Kristóf

7:25 a.m.

...

...
A path is always a Unicode string, while a DB name can be an arbitrary binary blob. So I don't think that we can treat them the same way.

Not the point. A program which uses LMDB can choose to treat its own DB names in its own LMDB environments as the same kind of strings as filenames (WCHAR, UTF-8 char, or whatever). Unless we make that impossible.

As for what LMDB will accept and what it must handle, that's up to us. DB names are not binary blobs yet, after all.

Okay. What do you suggest?

Hallvard Breien Furuseth

8 a.m.

On 02. feb. 2015 16:25, Timur Kristóf wrote:

...

Okay. What do you suggest?

I suggest we wait to deal with DB names until we also have a way to deal with filenames. And this time test that it works in practice.

And then I also suggest trying to make this mess simple to deal with for programmers and/or users. I guess I should have separated that from the rest more clearly.

-- Hallvard

Timur Kristóf

8:11 a.m.

...

I suggest we wait to deal with DB names until we also have a way to deal with filenames. And this time test that it works in practice.

And then I also suggest trying to make this mess simple to deal with for programmers and/or users. I guess I should have separated that from the rest more clearly.

I can write a patch which does the UTF-8 to UTF-16 conversion on Windows for file paths, but I would hate to restrict db names to UTF-8 text only (or for that matter, any text only). However, not supporting non-UTF-8 db names in mdb_dump and mdb_load sounds like a reasonable compromise to me.
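
To make the compromise concrete: a tool that only supports UTF-8 db names would need some way to tell whether a given name is well-formed UTF-8. A minimal sketch of such a check follows. The function `is_valid_utf8` is hypothetical and not part of LMDB or its tools; it follows the RFC 3629 rules (no overlong encodings, no surrogates, nothing above U+10FFFF) and treats an embedded zero byte as valid.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the len bytes at s form well-formed
 * UTF-8 according to RFC 3629, and 0 otherwise.
 */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */
        size_t need;        /* number of continuation bytes expected */
        size_t k;

        if (c < 0x80) {                 /* single-byte (ASCII) */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) { /* 2-byte sequence */
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) { /* 3-byte sequence */
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) { /* 4-byte sequence */
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;                   /* stray continuation or invalid lead */
        }

        if (len - i - 1 < need)
            return 0;                   /* sequence truncated at end of input */

        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;               /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;                   /* overlong encoding */
        if (cp > 0x10FFFF)
            return 0;                   /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* UTF-16 surrogate */

        i += need + 1;
    }
    return 1;
}
```

A tool could fall back to refusing the name, or to some escaped form, whenever this check returns 0.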

Hallvard Breien Furuseth

8:28 a.m.

On 02. feb. 2015 17:11, Timur Kristóf wrote:

...

...
I suggest we wait to deal with DB names until we also have a way to deal with filenames. And this time test that it works in practice.

And then I also suggest trying to make this mess simple to deal with for programmers and/or users. I guess I should have separated that from the rest more clearly.

I can write a patch which does the UTF-8 to UTF-16 conversion on Windows for file paths, but I would hate to restrict db names to UTF-8 text only (or for that matter, any text only). However, not supporting non-UTF-8 db names in mdb_dump and mdb_load sounds like a reasonable compromise to me.

I suggest we wait to deal with DB names until we also have a way to deal with filenames.

-- Hallvard

Florian Weimer

15 Feb

7:52 a.m.

* Timur Kristóf:

...

A path is always a Unicode string, while a DB name can be an arbitrary binary blob.

On many POSIX platforms, a path is a blob which does not contain '\000'. These systems do not enforce Unicode encoding at all.

Timur Kristóf

11:36 a.m.

...

...
A path is always a Unicode string, while a DB name can be an arbitrary binary blob.

On many POSIX platforms, a path is a blob which does not contain '\000'. These systems do not enforce Unicode encoding at all.

My mistake. I was unaware. On those platforms, how do you type a path name into a terminal?

Florian Weimer

12:21 p.m.

* Timur Kristóf:

...

...
...
A path is always a Unicode string, while a DB name can be an arbitrary binary blob.

On many POSIX platforms, a path is a blob which does not contain '\000'. These systems do not enforce Unicode encoding at all.

My mistake. I was unaware. On those platforms, how do you type a path name into a terminal?

There are some files which are not directly nameable. Many programs support special sequences such as “Ctrl+V 3 7 7” to enter arbitrary bytes, but that's not universal. Depending on the actual implementation of the terminal, cut-and-paste of funny file names can work, too.

Older programs have trouble accessing such files even if the user chooses them in a file selection dialog, but current versions are supposed to have been fixed (including OpenJDK, which took a ridiculously long time). Beyond that, it's not much different from dealing with file names in an unfamiliar script.

Timur Kristóf

12:33 p.m.

...

...
...
...
A path is always a Unicode string, while a DB name can be an arbitrary binary blob.

On many POSIX platforms, a path is a blob which does not contain '\000'. These systems do not enforce Unicode encoding at all.

My mistake. I was unaware. On those platforms, how do you type a path name into a terminal?

There are some files which are not directly nameable. Many programs support special sequences such as “Ctrl+V 3 7 7” to enter arbitrary bytes, but that's not universal. Depending on the actual implementation of the terminal, cut-and-paste of funny file names can work, too.

Older programs have trouble accessing such files even if the user chooses them in a file selection dialog, but current versions are supposed to have been fixed (including OpenJDK, which took a ridiculously long time). Beyond that, it's not much different from dealing with file names in an unfamiliar script.

Interesting. So ultimately, there are always going to be things that you cannot type into your terminal directly.

Timur Kristóf

7 Jun

4:46 a.m.

Hi Everyone,

I've just come across this old thread and am wondering, is this still an issue? Does LMDB have a way to use non-ASCII path names with mdb_env_open in a cross-platform way?

If not, would you guys accept patches to LMDB in this regard?

Thanks, Timur

openldap-devel@openldap.org