http://symas.com/mdb/inmem/scaling.html
i just looked at this and i have a sneaking suspicion that you may be running into the same problem i encountered when i accidentally ended up with 10 LMDBs opened 20 times over, by forking 30 processes *after* opening the 10 LMDBs... (and also forgetting to close them in the parent).
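to illustrate, a minimal sketch only - the paths, counts and the missing error handling are made up, not your setup - of the anti-pattern i mean: opening the environments and *then* forking, which the lmdb caveats warn against (iirc: "use an MDB_env* in the process which opened it, without fork()ing").

    /* sketch of the anti-pattern: envs opened BEFORE fork(), so every
     * child inherits every env's file descriptor and reader-table mmap,
     * and the parent never closes its copies either.
     * paths, counts and lack of error handling are illustrative only. */
    #include <lmdb.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define NUM_DBS   10
    #define NUM_PROCS 30

    int main(void)
    {
        MDB_env *envs[NUM_DBS];
        char path[64];
        int i;

        for (i = 0; i < NUM_DBS; i++) {
            snprintf(path, sizeof(path), "./db-%d", i);      /* hypothetical paths */
            mdb_env_create(&envs[i]);
            mdb_env_open(envs[i], path, MDB_NOSUBDIR, 0664); /* opened BEFORE fork: wrong */
        }

        for (i = 0; i < NUM_PROCS; i++) {
            if (fork() == 0) {
                /* child: would go on to use the inherited envs directly,
                 * exactly what the caveats warn against; each child should
                 * instead open its own environments here. */
                _exit(0);
            }
        }
        while (wait(NULL) > 0)
            ;   /* parent: should also mdb_env_close() each env, but doesn't */
        return 0;
    }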
what i found was that when i reduced the number of LMDBs to 3 or below, the loadavg on the multi-core system was absolutely fine [ok, it was around 3 to 4 but that's ok]
adding one more LMDB (remember, that's 31 extra file handles open on a shm-mmap'd file) increased the loadavg to 6 or 7. by the time the count was up to 10, the loadavg had jumped completely unreasonably to over 30. i could log in over ssh - that was not a problem. editing an existing file was ok (opening it), but trying to create new files left applications (such as vim) blocked so badly that i often could not even press ctrl-z to background the task, and had to kill the entire ssh session.
in each test run the amount of work being done was actually relatively small.
basically i suspect a severe bug in the linux kernel: these extreme circumstances (32 or 16 processes all accessing the same mmap'd file, for example) have simply never been encountered before, so the bug is... unknown.
can i make the suggestion that - whilst i am aware it is generally not recommended in production environments to run more processes than there are cores - you try running 128, 256 and even 512 processes all hitting that 64-core system, and monitor its I/O usage (iostat) and loadavg whilst doing so?
the hypothesis to test is that the performance, which ought to degrade roughly linearly with the ratio of processes to cores, instead drops like a lead balloon.
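in case it is useful, here is a rough sketch of the kind of harness i mean - it assumes a single pre-populated LMDB at ./testdb and a key named "key0" (both made up, substitute whatever the real benchmark uses), with NPROCS bumped to 128 / 256 / 512 while watching loadavg and iostat from another terminal:

    /* rough sketch of the oversubscription test: fork NPROCS readers
     * against one pre-populated LMDB and let them spin on read-only
     * transactions.  return codes are ignored for brevity. */
    #include <lmdb.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define NPROCS     128        /* try 128, 256, 512 on the 64-core box */
    #define ITERATIONS 1000000

    static void reader(void)
    {
        MDB_env *env;
        MDB_txn *txn;
        MDB_dbi dbi;
        MDB_val key, data;
        int i;

        /* each child opens the environment itself, *after* fork() */
        mdb_env_create(&env);
        /* default maxreaders is 126, which 256/512 readers would overflow;
         * iirc the value is baked in when the lock file is first created,
         * so the env may need to be created with the larger value too */
        mdb_env_set_maxreaders(env, NPROCS + 16);
        mdb_env_open(env, "./testdb", MDB_RDONLY, 0664);

        /* open the default DB handle once, in its own read txn */
        mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
        mdb_dbi_open(txn, NULL, 0, &dbi);
        mdb_txn_commit(txn);

        key.mv_data = "key0";
        key.mv_size = strlen("key0");

        for (i = 0; i < ITERATIONS; i++) {
            mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
            mdb_get(txn, dbi, &key, &data);
            mdb_txn_abort(txn);    /* read-only txn: abort is the cheap way out */
        }
        mdb_env_close(env);
    }

    int main(void)
    {
        int i;
        for (i = 0; i < NPROCS; i++) {
            if (fork() == 0) {
                reader();
                _exit(0);
            }
        }
        while (wait(NULL) > 0)
            ;                      /* wait for all readers to finish */
        return 0;
    }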
l.