Hi,
I've recently started using LMDB in a new project targeted to MIPS (and later also ARM) architectures. While developing my storage code I found that test cases which execute perfectly on x86 were failing with assertions inside LMDB on mips and mipsel devices.
Investigating further, I've found that the "mtest" test program included with LMDB also fails. Specifically, the failures I'm seeing are any of the following:
mdb.c:2635: Assertion 'pglast <= env->me_pglast' failed in mdb_freelist_save()
or
mdb.c:5100: Assertion 'IS_BRANCH(mc->mc_pg[mc->mc_top])' failed in mdb_cursor_sibling()
or
mdb.c:5176: Assertion 'IS_LEAF(mp)' failed in mdb_cursor_next()
or
mdb.c:1713: Assertion 'rc == 0' failed in mdb_page_dirty()
etc.
The failures are intermittent in that there's about a 50% chance mtest will complete successfully. The line numbers are against LMDB 0.9.11 source.
Target devices and toolchain combinations I have tried (all fail):
- mips-sf-linux-musl cross compilers from musl.codu.org, musl libc 0.9.15. Atheros AR9344 cpu, OpenWRT trunk (Linux 3.10.28).
- mipsel-sf-linux-musl cross compilers from musl.codu.org, musl libc 0.9.15, Broadcom BCM4706, Tomato firmware by "shibby" (Linux 2.6.22.x based on ASUS SDK).
In all cases the code fails both with and without optimization, and is compiled as a static executable.
A friend who also develops on MIPS devices said he ran into similar problems when trying LMDB. He suggested that I check for problems with unaligned memory accesses; indeed, compiling LMDB with -Wcast-align produces many warnings about such accesses.
My knowledge of LMDB and MIPS internals is not up to debugging this but it seems like unaligned accesses may(?) be the underlying cause.
What is the status of LMDB on MIPS? Has anyone tried it?
Any help would be much appreciated.
Martin
Martin Lucina wrote:
My knowledge of LMDB and MIPS internals is not up to debugging this but it seems like unaligned accesses may(?) be the underlying cause.
That seems pretty unlikely, considering that the code works fine on SPARC which also has quite stringent alignment requirements. I presume you're compiling a 32 bit binary, not 64 bit. Perhaps shorts are bigger than 16 bits on your platform?
hyc@symas.com said:
That seems pretty unlikely, considering that the code works fine on SPARC which also has quite stringent alignment requirements. I presume you're compiling a 32 bit binary, not 64 bit. Perhaps shorts are bigger than 16 bits on your platform?
32-bit, yes. Otherwise nothing out of the ordinary:
(from the target)
pagesize 4096
size_t 4
int 4
short 2
void* 4
The only thing which is somewhat unusual is that I use musl libc (http://www.musl-libc.org/). However I also use this on my x86_64 dev boxes and everything works fine.
I'm currently building an eglibc toolchain for MIPS just to rule that possibility out.
Martin
martin@lucina.net said:
The only thing which is somewhat unusual is that I use musl libc (http://www.musl-libc.org/). However I also use this on my x86_64 dev boxes and everything works fine.
I'm currently building an eglibc toolchain for MIPS just to rule that possibility out.
Ruled out.
Built a toolchain for mips-unknown-linux-gnu, using GCC 4.8.1 (crosstool-NG 1.19.0) and configured with eglibc 2.17. mtest still fails with the same symptoms.
Martin
Martin Lucina wrote:
Ruled out.
Built a toolchain for mips-unknown-linux-gnu, using GCC 4.8.1 (crosstool-NG 1.19.0) and configured with eglibc 2.17. mtest still fails with the same symptoms.
Martin
I wouldn't expect the libc to have affected this. I note that compiling with -Wcast-align produces no warnings on my x86-64 build, so it appears to be compiler-specific and not library-specific.
hyc@symas.com said:
I wouldn't expect the libc to have affected this. I note that compiling with -Wcast-align produces no warnings on my x86-64 build, so it appears to be compiler-specific and not library specific.
I don't have any other MIPS boxes to try on. Maybe someone on this list has an old SGI lying around with the IRIX compilers on it? Might be worth trying.
Martin
hyc@symas.com said:
I wouldn't expect the libc to have affected this. I note that compiling with -Wcast-align produces no warnings on my x86-64 build, so it appears to be compiler-specific and not library specific.
The docs for -Wcast-align say:
Warn whenever a pointer is cast such that the required alignment of the target is increased. For example, warn if a char * is cast to an int * on machines where integers can only be accessed at two- or four-byte boundaries.
So the warning is arch-specific and thus will not trigger on x86(-64) which does not require aligned accesses for integers.
In a previous email you mentioned LMDB works fine on SPARC. That is not what I see here, with 0.9.11 freshly cloned from gitorious:
$ uname -a
SunOS erzika 5.10 Generic_141414-10 sun4u sparc SUNW,Sun-Blade-100
$ mkdir testdb
$ dbx ./mtest
(...)
(dbx) run
Running: mtest
(process id 20724)
Reading libc_psr.so.1
t@1 (l@1) signal SEGV (no mapping at the fault address) in mdb_txn_renew0 at 0x12ea4
0x00012ea4: mdb_txn_renew0+0x01e4: ld      [%g2 + 76], %g2
(dbx) bt
current thread: t@1
=>[1] mdb_txn_renew0(0x2fa88, 0x0, 0xffffffff, 0x0, 0x2f9f8, 0x2fa88), at 0x12ea4
  [2] mdb_txn_begin(0xc, 0x0, 0x0, 0xffbff9bc, 0x2fa88, 0x2f9f8), at 0x1425c
  [3] main(0x2f638, 0xffbff9bc, 0xffbffa34, 0x2f55c, 0x2f9f0, 0xee), at 0x1ded8
I note that the CSW GCC 4.6.3 I'm using on SPARC also produces warnings when building mdb with -Wcast-align.
Martin
martin@lucina.net said:
current thread: t@1
=>[1] mdb_txn_renew0(0x2fa88, 0x0, 0xffffffff, 0x0, 0x2f9f8, 0x2fa88), at 0x12ea4
  [2] mdb_txn_begin(0xc, 0x0, 0x0, 0xffbff9bc, 0x2fa88, 0x2f9f8), at 0x1425c
  [3] main(0x2f638, 0xffbff9bc, 0xffbffa34, 0x2f55c, 0x2f9f0, 0xee), at 0x1ded8
I note that the CSW GCC 4.6.3 I'm using on SPARC also produces warnings when building mdb with -Wcast-align.
Rebuilt with the Sun compilers to get proper debug info, the faulting instruction is the same as the one in GCC:
signal SEGV (no mapping at the fault address) in mdb_env_pick_meta at line 3349 in file "mdb.c"
 3349       return (env->me_metas[0]->mm_txnid < env->me_metas[1]->mm_txnid);
Martin
On Tue, 18 Feb 2014, Martin Lucina wrote:
martin@lucina.net said:
Rebuilt with the Sun compilers to get proper debug info, the faulting instruction is the same as the one in GCC:
signal SEGV (no mapping at the fault address) in mdb_env_pick_meta at line 3349 in file "mdb.c"
 3349       return (env->me_metas[0]->mm_txnid < env->me_metas[1]->mm_txnid);
Martin
I don't have MIPS specsheets memorized/in front of me, but what's the pagesize?
http://www.openldap.org/lists/openldap-devel/201310/msg00005.html (and ITS#7713)
richton@nbcs.rutgers.edu said:
I don't have MIPS specsheets memorized/in front of me, but what's the pagesize?
http://www.openldap.org/lists/openldap-devel/201310/msg00005.html (and ITS#7713)
4k on the MIPS boards I'm using. The SPARC I tried on is 8k, but LMDB 0.9.11 from git uses sysconf(_SC_PAGE_SIZE) to get the page size at run time so that shouldn't be a problem.
The SPARC and MIPS problems may or may not be related - I just tried SPARC since Howard mentioned it worked. Do you know a known-good version on SPARC? If so I can work from that and see if I can find the commits that introduced the problem.
Martin Lucina wrote:
4k on the MIPS boards I'm using. The SPARC I tried on is 8k, but LMDB 0.9.11 from git uses sysconf(_SC_PAGE_SIZE) to get the page size at run time so that shouldn't be a problem.
The SPARC and MIPS problems may or may not be related - I just tried SPARC since Howard mentioned it worked. Do you know a known-good version on SPARC? If so I can work from that and see if I can find the commits that introduced the problem.
I just did a fresh build of 32 bit SPARC Solaris 10 with gcc 4.4.0 and mtest works fine. I get a number of warnings if I use -Wcast-align but in this case they're irrelevant.
hyc@symas.com said:
I just did a fresh build of 32 bit SPARC Solaris 10 with gcc 4.4.0 and mtest works fine. I get a number of warnings if I use -Wcast-align but in this case they're irrelevant.
Sorry, the directory I was testing in on SPARC was on NFS :-(
If I run mtest in /tmp it works fine.
That still doesn't explain the MIPS issues, any suggestions on how to proceed there? I can give someone access to a MIPS host if that would help.
Martin
Martin Lucina wrote:
That still doesn't explain the MIPS issues, any suggestions on how to proceed there? I can give someone access to a MIPS host if that would help.
Copying back to the list:
Martin Lucina wrote:
hyc@symas.com said:
It appears that this system also lacks a coherent FS cache, like some BSDs. I changed mtest.c to use MDB_WRITEMAP and it now runs fine.
The unmodified mtest.c also worked when single-stepping thru gdb, which apparently gives time for the cache to sort itself out between mdb function calls.
Interesting. What you're saying is that without MDB_WRITEMAP pages are written out separately and it is up to the FS cache to ensure that reading back via the memory map is consistent, correct?
That's the general idea. As the LMDB design paper states, LMDB requires the OS to use a unified buffer cache - so that mmap pages and FS cache pages are the same.
I'll try and dig through the OpenWRT kernel configuration, they must have changed something that triggers this behaviour.
Frankly it seems unlikely that they could have changed something so fundamental to the VM subsystem of the kernel. It's also possible that we're seeing *CPU* cache inconsistencies, and that adding a few MIPS-specific memory barrier instructions here and there may fix things up.
Unfortunately I need (or will be very unhappy without) nested transactions so I'm going to try and get it working without MDB_WRITEMAP if possible.
hyc@symas.com said:
Frankly it seems unlikely that they could have changed something so fundamental to the VM subsystem of the kernel. It's also possible that we're seeing *CPU* cache inconsistencies, and that adding a few MIPS-specific memory barrier instructions here and there may fix things up.
I did some more investigating:
1) Tried adding calls to sync_file_range() (Linux-specific syscall) and in desperation even sync(2) to mdb_txn_commit() just after mdb_page_flush() et al. No change.
2) Compiled the below test program on various platforms. This tries (rather unscientifically) to test how "long" it takes for a mmap to become consistent after writing to the underlying file through a different fd opened with O_DSYNC (which is what mdb does).
The results are interesting:
x86_64 core i5m (2 cores, 4 threads): gcc -O2: consistently less than 1k iterations
x86_64 core i5m (2 cores, 4 threads): gcc -O2 -DNOBARRIER: consistently around 10k iterations
x86_64 dual 4-core xeon, gcc -O2: around 2k iterations
x86_64 dual 4-core xeon, gcc -O2 -DNOBARRIER: 10-15k iterations
MIPS target, musl gcc -O2 -mips32r2: varies, mostly 1; in each 10 runs at least one run completes in the high 100k's of iterations
MIPS target, musl gcc -O2 -mips32r2 -DNOBARRIER: about the same as previous, but when not 1 the result is subjectively higher (around 1m iterations)
single CPU SPARCv9 Solaris 10, Sun cc -fast -mt: always[*] 1
single CPU SPARCv9 Solaris 10, CSW gcc -O2, with or without -DNOBARRIER: always[*] 1
ia64 dual Itanium 2, Linux gcc -O2: around 2k iterations
ia64 dual Itanium 2, Linux gcc -O2 -DNOBARRIER: anywhere between 3-8k iterations
[*] very rarely several million iterations
Does this help in any way? It certainly seems to suggest that the MIPS target's fs cache is (eventually) consistent.
Any pointers on how to proceed or what else to try/who else to ask will be much appreciated.
Martin
----test program----
#include <fcntl.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <assert.h>
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

pthread_barrier_t b;

static void *thread (void *arg)
{
    int fd;

    pthread_barrier_wait (&b);
    fd = open ("/tmp/testfile", O_RDWR | O_CREAT | O_DSYNC, 0600);
    unsigned long v = 1;
    assert (write (fd, &v, sizeof v) == sizeof v);
    close (fd);
    return NULL;
}

int main (int argc, char *argv[])
{
    int fd;
    pthread_barrier_init (&b, NULL, 2);

    unlink ("/tmp/testfile");
    fd = open ("/tmp/testfile", O_RDWR | O_CREAT, 0600);
    unsigned long v = 0;
    assert (write (fd, &v, sizeof v) == sizeof v);
    volatile unsigned long *p = mmap (NULL, getpagesize (), PROT_READ,
                                      MAP_SHARED, fd, 0);
    assert (p != MAP_FAILED);

    int i = 0;
    pthread_t thread_id = 0;
    pthread_create (&thread_id, NULL, thread, NULL);

    while (*p != 1) {
        if (!i)
            pthread_barrier_wait (&b);
        i++;
#if defined (__GNUC__) && !defined (NOBARRIER)
        __sync_synchronize ();
#endif
    }
    printf ("%d\n", i);

    munmap ((void *)p, getpagesize ());
    close (fd);
    return 0;
}
Hi!
I think a problem with your test program is that you don't wait for the write() thread to finish before you try to read the mmap(). See how locking on a producer-consumer (or reader-writer) relationship is usually implemented (If you don't have it ready, I could send you the algorithms).
Regards, Ulrich
Martin Lucina martin@lucina.net wrote on 10.03.2014 at 22:10 in message
20140310211032.GA22062@nodbug.moloch.sk:
Ulrich.Windl@rz.uni-regensburg.de said:
Hi!
I think a problem with your test program is that you don't wait for the write() thread to finish before you try to read the mmap(). See how locking on a producer-consumer (or reader-writer) relationship is usually implemented (If you don't have it ready, I could send you the algorithms).
That shouldn't matter. The write thread opens the file descriptor with O_DSYNC, and all the test program is trying to verify is that the mmap eventually becomes consistent. You can ignore the pthread_barrier stuff; it just tries to eliminate thread creation time from the equation.
Now that I think about the output, there is a fairly obvious explanation for the numbers - 1 means the writer thread got scheduled first. However I'm still not sure why the # of iterations on MIPS is so high - scheduling resolution on the box maybe?
Martin
Martin Lucina wrote:
Ulrich.Windl@rz.uni-regensburg.de said:
Hi!
I think a problem with your test program is that you don't wait for the write() thread to finish before you try to read the mmap(). See how locking on a producer-consumer (or reader-writer) relationship is usually implemented (If you don't have it ready, I could send you the algorithms).
That shouldn't matter.
More to the point, you don't wait for the write() thread to *start* - there's no guarantee that it will actually start running as soon as the barrier is released. A valid test has to know that the write() thread actually got scheduled and ran.
The write thread opens the file descriptor with
O_DSYNC, and all the test program is trying to verify is that the mmap eventually becomes consistent. You can ignore the pthread_barrier stuff, that just tries to eliminate thread creation time from the equation.
Now that I think about the output, there is a fairly obvious explanation for the numbers - 1 means the writer thread got scheduled first. However I'm still not sure why the # of iterations on MIPS is so high - scheduling resolution on the box maybe?
Martin
openldap-technical@openldap.org