ElixirConf 2023 - Razvan Draghici - Managing a massive amount of distributed Elixir nodes

Discover the challenges and solutions of managing a massive amount of distributed Elixir nodes, including supervision trees, node connections, and performance optimization.

Key takeaways
  • The talk, given at ElixirConf 2023, covers managing a massive number of distributed Elixir nodes.
  • The speaker’s example showed a supervision tree with a net_sup supervisor, a Partisan peer service, and a Broadway consumer for measuring node connections.
  • Starting distributed Erlang with 100 nodes resulted in no disconnected nodes, but at larger node counts some nodes disconnected and their messages were not delivered.
  • In a 300-node test the speaker found 2% message loss, while a 500-node test stayed within the acceptable range, with net_kernel ticks increasing.
  • Using PubSub gives better performance and less noise in a large cluster.
  • The speaker built their own benchmark module using data from the AWS metadata API.
  • They also created a simple adapter using Ethereum’s ECSV library.
  • The speaker encountered port driver issues, which are controlled by the Erlang runtime.
  • The speaker used cookies to group nodes together, noting that cookies are not a security mechanism.
  • Pub/Sub tests were run separately on each node, with a dedicated listener for each node.
  • The delay during Pub/Sub was mostly unnoticeable.
  • Error handling was implemented, but the speaker saw errors in their test code that caused nodes to disconnect when connections were made.
  • Error rate increased with node count in one of the tests.
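
One way to see why the 100, 300, and 500 node runs behave so differently: by default, distributed Erlang connects every node to every other node, so the number of links grows roughly with the square of the cluster size. The sketch below only does that arithmetic. It is not the speaker's benchmark module, and the function names are made up for illustration.

```python
def mesh_connections(nodes: int) -> int:
    """Total node-to-node links in a fully connected cluster of `nodes` nodes."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    # Every pair of nodes shares one link: n * (n - 1) / 2.
    return nodes * (nodes - 1) // 2


def links_per_node(nodes: int) -> int:
    """Links each single node has to maintain in a full mesh."""
    return max(nodes - 1, 0)


if __name__ == "__main__":
    # Cluster sizes mentioned in the takeaways above.
    for n in (100, 300, 500):
        print(f"{n} nodes: {links_per_node(n)} links per node, "
              f"{mesh_connections(n)} links in total")
```

Going from 100 to 500 nodes multiplies the node count by 5 but the total link count by about 25, and each link carries its own heartbeat ticks.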
