Over the weekend I read a few comments about possible fixes to ejabberd problems. They call for some clarification, to avoid misunderstanding and frustration.
Of the two problems mentioned (OTP compliance and binary support), one is already addressed in ejabberd 3.0 alpha (binaries). What caught my attention, though, is that in the end they might not be problems at all. It always depends on what you want to achieve. Some optimisations might be relevant in a few cases, but in the general situation I still think that the optimisations we have made to ejabberd for our customers, on a case-by-case basis, have a much larger impact. Some of the suggested improvements would even stand in the way of scaling for some use cases.
Let me take a specific example. OTP compliance is a controversial topic. In Erlang, OTP stands for Open Telecom Platform and is essentially a set of good practices and patterns for building Erlang applications. OTP is a convenient way to get started, but it is no silver bullet for the design of every Erlang application. Sometimes its patterns fit your problem and sometimes you have to do things differently.
One of the typical OTP patterns is supervision. You are supposed to link Erlang workers (the processes doing the actual work) to a supervisor that can trigger actions when a worker terminates.
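To make the pattern concrete, here is a minimal sketch of a supervisor that starts short-lived workers on demand. The module name worker_sup_sketch and the function names are hypothetical and chosen only for illustration; this is not code from ejabberd.

```erlang
%% Hypothetical module, for illustration only (not ejabberd code).
-module(worker_sup_sketch).
-behaviour(supervisor).

-export([start_link/0, start_worker/1]).
-export([init/1, worker_start_link/1]).

%% Start the supervisor and register it under the module name.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Ask the supervisor to start one worker that runs the fun Job.
%% With simple_one_for_one, [Job] is appended to the start arguments.
start_worker(Job) ->
    supervisor:start_child(?MODULE, [Job]).

%% Supervisor callback: every child uses the same child specification.
init([]) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 10,
                 period => 60},
    Worker = #{id => worker,
               start => {?MODULE, worker_start_link, []},
               restart => temporary,   % a terminated worker is not restarted
               type => worker},
    {ok, {SupFlags, [Worker]}}.

%% The worker is linked to the supervisor, which learns of its termination.
worker_start_link(Job) ->
    {ok, proc_lib:spawn_link(fun() -> Job() end)}.
```

Every worker start goes through a call to the single supervisor process, which is where the bookkeeping happens.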
However, for extremely large systems, with lots of worker creation and destruction, the supervisor comes with a performance penalty that you cannot afford. You thus have to disable the transient supervisors. In that case, the OTP approach limits your performance and becomes a bottleneck.
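As a rough illustration of what working without a supervisor can look like, the sketch below spawns a worker directly and only monitors it, so no supervisor process sits in the creation path. The module and function names are hypothetical, and this is an assumption about the general technique, not a description of how ejabberd does it.

```erlang
%% Hypothetical module, for illustration only (not ejabberd code).
-module(unsupervised_sketch).
-export([start_worker/1, await/1]).

%% Spawn the worker directly and monitor it.
%% Returns {Pid, MonitorRef}; no supervisor is involved.
start_worker(Job) ->
    spawn_monitor(fun() -> Job() end).

%% Wait for the worker to terminate and return its exit reason,
%% or the atom timeout if nothing arrives within five seconds.
await({Pid, Ref}) ->
    receive
        {'DOWN', Ref, process, Pid, Reason} -> Reason
    after 5000 ->
        timeout
    end.
```

The trade-off is that restart and cleanup logic, which the supervisor would otherwise provide, becomes your own responsibility.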
My view is that to get the best performance, you have to know and observe how the system behaves in real-world situations, for the specific use case. XMPP is a large protocol, especially with the dozens of XMPP Extension Protocols (XEPs) that have been added over time. If you want to scale, you need to know Erlang inside out, and you need to know the XMPP protocol itself just as well. Some requirements or suggested approaches in the protocol do not scale when implemented straight from the specification, and you have to consider the full solution design, from the behaviour of the client to the cluster architecture and code optimisations.
Micro-optimisations at the code level are doomed to be very limited. However, optimising the full stack, from client (both desktop and mobile) to server, including specific code-level optimisations, is a sure win. From experience, we can squeeze 2 to 3 times more concurrent users onto a single node. Your mileage may vary, but it clearly demonstrates that multiple levels of knowledge are involved in designing a scalable XMPP messaging solution.
I understand this is frustrating to hear from a customer's perspective: customers often expect turnkey solutions that scale linearly. However, after working with us for a while, they understand our view and why scaling cannot rely on a one-size-fits-all approach.
At Process One, we have developed a set of ejabberd modules and optimisations, along with a range of expertise that extends all the way down to the client. This allows us to scale to unprecedented levels. OTP offers nice patterns, but it is no substitute for this experience of working on the largest XMPP deployments in the world.
Edit: I have received good feedback from readers saying they understand our point of view. Thank you!
Just to make it clear, my point is really to think and act globally, at the highest architectural level. We make many improvements to ejabberd at the code level every day, and this is good for code maintenance. What makes a deployment successful, though, is a team with different backgrounds working on the whole picture.