LMDB and text encoding

List overview All Threads
Download

newer

older

Fwd: LMDB and text encoding

Re: Question regarding MDB_NOLOCK

Timur Kristóf

27 Jan 2015 27 Jan '15

1:39 p.m.

Hi Everyone,

I've been talking to Howard about this and he suggested to post it to this mailing list. There are two things that I recently noticed about how LMDB works with various encodings and I think it's worth to discuss.

1. Database names

mdb_dbi_open treats its name parameter as a C string. This means UTF-8 on unixes and ANSI on Windows, which is problematic for cross-platform applications.

My suggestion is to create a variant of this function that also accepts a length parameter (or just use MDB_val) so that instead of treating it as a C string, it would treat it like a series of bytes, allowing the user to use the encoding of their choice.

2. Path names

Functions like mdb_env_open, mdb_env_get_path, mdb_env_copy and the likes accept a char* for path names. This is fine on most unixes where char* is an UTF-8 string, but unfortunately, these functions call the ANSI variants of the Windows API functions, making it impossible to use Unicode path names with them.

I think we should switch to the widechar APIs instead, but that would also mean changing the LMDB API to accept a wchar_t* parameter on Windows instead of char*.

What do you guys think about all this?

Best regards, Timur Kristóf

Show replies by date

Timur Kristóf

29 Jan 29 Jan

1:29 a.m.

...

mdb_dbi_open treats its name parameter as a C string. This means UTF-8 on unixes and ANSI on Windows, which is problematic for cross-platform applications. [...]

Here is a patch that addresses this concern. If you like it, I'll move on to the other issue.

Timur Kristóf

5:42 a.m.

Here is a fixed version of the patch.

On Thu, Jan 29, 2015 at 10:29 AM, Timur Kristóf timur.kristof@gmail.com wrote:

...

...
mdb_dbi_open treats its name parameter as a C string. This means UTF-8 on unixes and ANSI on Windows, which is problematic for cross-platform applications. [...]

Here is a patch that addresses this concern. If you like it, I'll move on to the other issue.

Timur Kristóf

1 Feb 1 Feb

1:59 p.m.

Hi,

I forgot to add an ENOMEM check. I added it now. I think this patch is ready for Howard and Hallvard to review. :)

Timur

On Thu, Jan 29, 2015 at 2:42 PM, Timur Kristóf timur.kristof@gmail.com wrote:

...

Here is a fixed version of the patch.

On Thu, Jan 29, 2015 at 10:29 AM, Timur Kristóf timur.kristof@gmail.com wrote:

...
...
mdb_dbi_open treats its name parameter as a C string. This means UTF-8 on unixes and ANSI on Windows, which is problematic for cross-platform applications. [...]

Here is a patch that addresses this concern. If you like it, I'll move on to the other issue.

Howard Chu

3:40 p.m.

Timur Kristóf wrote:

...

Hi,

I forgot to add an ENOMEM check. I added it now. I think this patch is ready for Howard and Hallvard to review. :)

It looks OK to me. No one raises any concerns I'll commit it in a few hours.

...

Timur

On Thu, Jan 29, 2015 at 2:42 PM, Timur Kristóf timur.kristof@gmail.com wrote:

...
Here is a fixed version of the patch.

On Thu, Jan 29, 2015 at 10:29 AM, Timur Kristóf timur.kristof@gmail.com wrote:

...
...
mdb_dbi_open treats its name parameter as a C string. This means UTF-8 on unixes and ANSI on Windows, which is problematic for cross-platform applications. [...]

Here is a patch that addresses this concern. If you like it, I'll move on to the other issue.

-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Hallvard Breien Furuseth

5 p.m.

On 02/02/15 00:40, Howard Chu wrote:

...

It looks OK to me. No one raises any concerns I'll commit it in a few hours.

Some sudden last thoughts:

mdb_dump.c also has a check (memchr(key.mv_data, '\0', key.mv_size) to exclude non-databases, which is no longer valid.

Database names with \0 in them can no longer be spelled as strings, everything which gets DB names from the database must use binary blobs. Including mdb_load and mdb_dump; I notice mdb_load uses strdup() for the "database=" name. Come to think of it, I have no idea if the dump format supports DB names with \0 in them.

-- Hallvard

Hallvard Breien Furuseth

5:09 p.m.

On 02/02/15 02:00, Hallvard Breien Furuseth wrote:

...

Come to think of it, I have no idea if the dump format supports DB names with \0 in them.

...and there will now be database names which cannot be spelled on the command line, like for <mdb_stat/mdb_dump -s subdb>. I don't think that was quite the point.

Howard Chu

6:37 p.m.

Hallvard Breien Furuseth wrote:

...

On 02/02/15 00:40, Howard Chu wrote:

...
It looks OK to me. No one raises any concerns I'll commit it in a few hours.

Some sudden last thoughts:

mdb_dump.c also has a check (memchr(key.mv_data, '\0', key.mv_size) to exclude non-databases, which is no longer valid.

Good point. As Timur's patch comment notes, we probably need an API call "is valid DB" now.

...

Database names with \0 in them can no longer be spelled as strings, everything which gets DB names from the database must use binary blobs. Including mdb_load and mdb_dump; I notice mdb_load uses strdup() for the "database=" name. Come to think of it, I have no idea if the dump format supports DB names with \0 in them.

No, it doesn't. It's the BDB format, and BDB only accepted C strings.

-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

Hallvard Breien Furuseth

29 Jan 29 Jan

2:15 a.m.

My take:

On 27. jan. 2015 22:39, Timur Kristóf wrote:

...

Database names

MDB_val here sounds nice...

...

Path names

Functions like mdb_env_open, mdb_env_get_path, mdb_env_copy and the likes accept a char* for path names. This is fine on most unixes where char* is an UTF-8 string, but unfortunately, these functions call the ANSI variants of the Windows API functions, making it impossible to use Unicode path names with them.

I think we should switch to the widechar APIs instead, but that would also mean changing the LMDB API to accept a wchar_t* parameter on Windows instead of char*.

My initial reaction is that an API with different prototypes for Windows and Unix sounds bad. Unix must have char* since it does not even support wchar_t* filenames, and it should be simple to write a portable OS-unaware LDMB program.

Though I notice Windows #defines CreateFile() & co as CreateFileA or CreateFileW depending on whether or not UNICODE is #defined (and some other macros), without even mentioning this in the CreateFile() doc. I suppose ldmb.h could do something similar - but with doc:-)

What's the norm for Windows libraries? Google found TCHAR stuff which becomes WCHAR or char depending on defined(UNICODE), and apparently strong religions about whether if it's a good or bad idea for portable libraries to do the same. #define MDB_TEXT(str) as str or L##str depending on UNICODE, etc.

-- Hallvard

Hallvard Breien Furuseth

2:19 a.m.

I wrote:

...

Though I notice Windows #defines CreateFile() & co as CreateFileA or CreateFileW depending on whether or not UNICODE is #defined (and some other macros), without even mentioning this in the CreateFile() doc. I suppose ldmb.h could do something similar - but with doc:-)

Whoops, is does document it: Filename parameter is "LPCTSTR".

-- Hallvard

Timur Kristóf

5:35 a.m.

I've had a brief chat with Hallvard on IRC. We came up with several possible solutions, although each of them has its drawbacks. Writing cross-platform code that supports unicode is always a messy business. I vote for option 4, but would like to hear everyone's opinions before starting to work on any of them.

1) Separate widechar functions

Make functions such as mdb_env_open_w that would call the widechar APIs. The drawback of this approach is that it would require a lot of duplicate code, which is hard to maintain. It would also pollute the lmdb header file.

2) New flag

Introduce a new flag (such as MDB_USE_WCHAR) that would tell mdb_dbi_open to cast the path parameter to wchar_t* under the hood and call the widechar variant of the windows api.

Advantage: only the string concatenation code would need to be duplicated Drawback: it is really-really ugly

3) Require UTF-16 on Windows

Since Microsoft discourages the use of their ANSI apis, we could say that we require UTF-16 on windows. We can make a type such as mdb_uchar_t that we would typedef to char on unix and wchar_t on windows and then we could change the function signatures to use this type.

Drawback: users that want to write cross-platform code would need to ifdef their calls to mdb_env_open

4) Require UTF-8 on Windows

Let's say we require the path parameter to be encoded in UTF-8, even on windows. Then under the hood we can convert it to UTF-16 and call the widechar APIs. This doesn't lead to loss of performance because windows itself converts to UTF-16 anyway if you use their ANSI functions. This is the least ugly and perhaps the easiest-to-implement solution we found. It is easy to make UTF-8 (most libraries can produce it, or the user could use u8"..." from C++11, etc.)

Advantage: this is the easiest to implement; code that worked before (with ASCII paths) will work without modification, and we don't need to duplicate any code.

Howard Chu

1 Feb 1 Feb

6:58 p.m.

Timur Kristóf wrote:

...

Hi Everyone,

I've been talking to Howard about this and he suggested to post it to this mailing list. There are two things that I recently noticed about how LMDB works with various encodings and I think it's worth to discuss.

...

Path names

Functions like mdb_env_open, mdb_env_get_path, mdb_env_copy and the likes accept a char* for path names. This is fine on most unixes where char* is an UTF-8 string, but unfortunately, these functions call the ANSI variants of the Windows API functions, making it impossible to use Unicode path names with them.

I think we should switch to the widechar APIs instead, but that would also mean changing the LMDB API to accept a wchar_t* parameter on Windows instead of char*.

What do you guys think about all this?

I just had a look at how BDB handled this. As you can see they used a TO_TSTRING macro to convert incoming pathnames from UTF8 to UTF16.

https://gitorious.org/berkeleydb/berkeleydb/source/347d239a1e44ed4f773ae9274...

(And a FROM_TSTRING for the reverse, as well.)

-- -- Howard Chu CTO, Symas Corp. http://www.symas.com Director, Highland Sun http://highlandsun.com/hyc/ Chief Architect, OpenLDAP http://www.openldap.org/project/

3805

Age (days ago)

3811

Last active (days ago)

openldap-devel@openldap.org

11 comments

3 participants

tags (0)

participants (3)

Hallvard Breien Furuseth
Howard Chu
Timur Kristóf