As always... just as you hit send on an email to an open mailing list..
It's the bandwidth, isn't it..
I'm so used to everything being 1000Mbit that I didn't spot the 100Mbit limit being hit.
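For anyone who hits the same wall later, the back-of-envelope maths is roughly as below - note the per-result size is just an assumed figure, so measure real entry sizes to refine it:

# Rough sanity check: wire bandwidth needed for the observed search rate.
# bytes_per_result is an ASSUMPTION - measure your own entry sizes.
searches_per_sec = 20000      # where the throughput plateaued
bytes_per_result = 600        # assumed average encoded entry + protocol overhead

mbit_per_sec = searches_per_sec * bytes_per_result * 8 / 1e6
print("~%.0f Mbit/s" % mbit_per_sec)   # ~96 Mbit/s - right up against a 100Mbit link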
Will continue investigations with that additional bit of info..! :)
Thanks!
On Mon, Sep 4, 2017 at 9:59 AM, Tim tim@yetanother.net wrote:
Cheers guys,
Reassuring that I'm roughly on the right track - but that leads me into other questions relating to what I'm currently experiencing while trying to load test the platform.
I'm currently using LocustIO, with a swarm of ~70 instances spread across ~25 hosts, to try to scale up the test traffic.
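For context, the Locust side is nothing clever - just a custom client wrapping ldap3 searches, roughly the shape sketched below (written against the 0.x-era Locust API, which has changed in later releases; the host, base DN and filter are placeholders rather than the real deployment):

# Rough shape of a custom Locust LDAP client (0.x-era API; the events
# interface differs in newer Locust releases). Host, base DN and filter
# are placeholders.
import time
from locust import Locust, TaskSet, task, events
from ldap3 import Server, Connection, SUBTREE

class LdapTasks(TaskSet):
    def on_start(self):
        self.conn = Connection(Server('ldap://ldap.example.net'), auto_bind=True)

    @task
    def search(self):
        start = time.time()
        try:
            self.conn.search('dc=example,dc=net', '(uid=someuser)',
                             search_scope=SUBTREE, attributes=['cn'])
        except Exception as e:
            events.request_failure.fire(request_type='ldap', name='search',
                                        response_time=int((time.time() - start) * 1000),
                                        exception=e)
        else:
            events.request_success.fire(request_type='ldap', name='search',
                                        response_time=int((time.time() - start) * 1000),
                                        response_length=0)

class LdapLocust(Locust):
    task_set = LdapTasks
    min_wait = 0
    max_wait = 0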
The problem I'm seeing (and hence the reason why I was questioning my initial test approach), is that the traffic seems to be artificially capping out and I can't for the life of me find the bottleneck.
I'm recording/graphing all of cn=monitor, all resources covered by vmstat and bandwidth - nothing appears to be topping out.
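(For reference, the cn=monitor scraping is just an ordinary subtree search with operational attributes requested explicitly - something along the lines of the sketch below; the hostname is a placeholder and you need a bind identity that's allowed to read cn=Monitor.)

# Sketch of a cn=monitor poll with ldap3 (hostname is a placeholder).
# back-monitor exposes its counters as operational attributes, hence the '+'.
from ldap3 import Server, Connection, SUBTREE

conn = Connection(Server('ldap://ldap.example.net'), auto_bind=True)  # bind as needed
conn.search('cn=Monitor', '(objectClass=*)',
            search_scope=SUBTREE, attributes=['+'])
for entry in conn.response:
    print(entry['dn'], entry['attributes'])
conn.unbind()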
If I perform searches in isolation, it quickly ramps up to 20k/s and then just tabletops, while all system resources seem reasonably happy.
This happens no matter what distribution of clients I deploy (e.g. 5000 clients over 70 hosts or 100 clients over 10 hosts) - so I'm fairly confident that the test environment is more than capable of generating further traffic.
https://s3.eu-west-2.amazonaws.com/uninspired/mystery_bottleneck.png
(.. this was thrown together in a very rough and ready fashion - it's quite possible that my units are off on some of the y-axes!)
I've performed some minor optimisations to try and resolve it (the number of available file handles was my initial hope for an easy fix..) but so far, nothing's helped - I still see this capping of throughput before the key system resources even get slightly hot.
I had hoped that it was going to be as simple as increasing a concurrency variable within the config - but the one that does exist seems not to be valid for anything outside of legacy Solaris deployments?
If anyone has any suggestions as to where I could investigate for a potential bottleneck (either on the system or within my OpenLDAP configuration), it would be very much appreciated.
Thanks in advance
On Mon, Sep 4, 2017 at 7:47 AM, Michael Ströder michael@stroeder.com wrote:
Tim wrote:
I've, so far, been making use of home grown python-ldap3 scripts to simulate the various kinds of interactions using many parallel synchronous requests - but as I scale this up, I'm increasingly aware that it is a very different ask to simulate simple synchronous interactions compared to a fully optimised multithreaded client with dedicated async/sync channels and associated strategies.
Most clients will just send those synchronous requests. So IMHO this is the right test pattern and you should simply make your test client multi-threaded.
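A bare-bones version of that pattern would be something like the sketch below (host, base DN and filter are placeholders; error handling omitted):

# Minimal multi-threaded synchronous search client (placeholders throughout).
import threading
from ldap3 import Server, Connection, SUBTREE

def worker(n_requests):
    conn = Connection(Server('ldap://ldap.example.net'), auto_bind=True)
    for _ in range(n_requests):
        conn.search('dc=example,dc=net', '(uid=someuser)',
                    search_scope=SUBTREE, attributes=['cn'])
    conn.unbind()

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Bear in mind that with a pure-Python client the threads contend for the GIL, so running several such processes per machine usually scales better than one big thread pool.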
I'm currently working with a dataset of in the region of 2,500,000 objects and looking to test throughput up to somewhere in the region of 15k/s searches alongside 1k/s modification/addition events - which is beyond what the current basic scripts are able to achieve.
Note that the ldap3 module for Python is written in pure Python - including the ASN.1 encoding/decoding. In contrast to that, the old Python 2.x https://python-ldap.org module is a C wrapper around the OpenLDAP libs, and therefore you might get better client performance with it. Nevertheless you should spread your test clients over several machines to really achieve the needed performance.
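For comparison, the equivalent synchronous loop with python-ldap looks roughly like this (host and base DN are again placeholders):

# Same search loop via python-ldap, a C wrapper around the OpenLDAP client libs.
import ldap

conn = ldap.initialize('ldap://ldap.example.net')
conn.simple_bind_s()                       # anonymous bind; pass DN/password if needed
for _ in range(10000):
    conn.search_s('dc=example,dc=net', ldap.SCOPE_SUBTREE,
                  '(uid=someuser)', ['cn'])
conn.unbind_s()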
Ciao, Michael.
-- Tim tim@yetanother.net