Tuning ZIO for high performance

How to make ZIO applications as fast as possible in production

Let's start with a disclaimer. What is discussed in this article is not the ultimate truth: how to make your application faster highly depends on what your application is actually doing. Depending on your use case, the overhead of ZIO might be completely negligible, or on the other hand, it might be quite significant. The only way to know is to analyze and measure the performance of your application through load testing with a profiler. However, I hope to provide some interesting pointers and things to consider when taking your ZIO app to production.

Runtime flags

Let's start with an easy one. ZIO has a few flags that can be easily turned on or off when you run your application. I won't detail all of them, but I want to draw your attention to two of them that can have some impact in production.

The first one is called FiberRoots and is enabled by default. With this flag on, whenever a root fiber is created (usually using the forkDaemon operator), this fiber will be added to a list tracking all root fibers. This mechanism serves two purposes:

  • You can perform fiber dumps. Similarly to thread dumps, you can send kill -s INFO to your application process, and it will print the full list of root fibers, their state, and what they are currently doing (which function they are executing).

  • When you stop your application, ZIO will try to interrupt all root fibers. If you don't track them, it will only interrupt the "children" fibers created from the main fiber, but you won't have clean termination of other root fibers created through forkDaemon.

However, this kind of tracking has a cost if you are creating a lot of root fibers. In the most extreme case, where we just fork root fibers that do nothing, disabling the FiberRoots flag improves performance by 2.5 times (see the benchmark result in https://github.com/zio/zio/pull/8745). Be aware that even if you rarely call forkDaemon in your own code, a library you are using might call it. For example, zio-grpc calls forkDaemon on every gRPC request. Some ZIO operators also use it internally.

Personally, while I occasionally use fiber dumps in my local environment, I don't really need them in production, so I disabled that flag to avoid doing that tracking unnecessarily. Disabling it can be done by simply adding this to the bootstrap function of your ZIO app:

val bootstrap = Runtime.disableFlags(RuntimeFlag.FiberRoots)
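For context, here is a minimal sketch of where that line might live in a complete application. The object name and the body of run are made up for illustration; only the bootstrap expression comes from the text above.

```scala
import zio._

object FiberRootsOffApp extends ZIOAppDefault {

  // Stop tracking root fibers: forkDaemon gets cheaper, at the price of
  // fiber dumps and of root fibers being interrupted on shutdown.
  override val bootstrap: ZLayer[ZIOAppArgs, Any, Any] =
    Runtime.disableFlags(RuntimeFlag.FiberRoots)

  // Placeholder workload (hypothetical): fork one root fiber, then log.
  val run =
    ZIO.logInfo("background work").forkDaemon *> ZIO.logInfo("started")
}
```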

There is another flag that I find quite interesting, and that one is disabled by default: RuntimeMetrics. Enabling that flag will trigger the collection of metrics that are then exposed as ZIO metrics and can be sent to the backend of your choice (Prometheus, Datadog, etc). The metrics it collects are the following:

  • zio_fiber_failure_causes will collect the common causes of fiber failures

  • zio_fiber_fork_locations will tell you from where in the code you fork fibers the most

  • zio_fiber_started, zio_fiber_successes, and zio_fiber_failures will count how many fibers are started and how many succeed or fail

  • zio_fiber_lifetimes will measure how long your fibers are running

Having those metrics can be quite useful when monitoring a production system, so you might want to consider it despite its runtime overhead. That overhead should be pretty small, but again, test and measure before you apply it.

Enabling it can be done by adding the following in your bootstrap function:

val bootstrap = Runtime.enableRuntimeMetrics
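If you want both flag changes at once, the two layers can be combined. This is a sketch under the assumption that you use ZIOAppDefault; the object name and run body are placeholders, and the connector that ships the metrics to your backend is not shown.

```scala
import zio._

object FlagsApp extends ZIOAppDefault {

  // Layers combine with ++, so both settings apply to the app's runtime.
  override val bootstrap: ZLayer[ZIOAppArgs, Any, Any] =
    Runtime.disableFlags(RuntimeFlag.FiberRoots) ++ Runtime.enableRuntimeMetrics

  // Placeholder workload (hypothetical).
  val run = ZIO.logInfo("started")
}
```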

Parallelism

Let's now talk about a common pitfall when using ZIO: unbounded parallelism. One of ZIO's common operators is ZIO.foreach, which allows you to run an effect for each item in a collection. There's also ZIO.foreachDiscard, which is faster if you don't need to collect results (e.g., if your effect returns Unit). These operators have parallel counterparts, ZIO.foreachPar and ZIO.foreachParDiscard, for when you want to run things in parallel.

However, in some situations, foreachPar can be slower than foreach. How is that possible?

If you run foreachPar on a collection of 1,000 elements, ZIO is going to create 1,000 fibers that are all going to compete for your CPU time. If what each fiber does is CPU-bound, there is no advantage in creating more fibers than your number of cores. On the contrary, you are going to suffer from the overhead of the machinery involved with creating and joining all those fibers. In cases where your fibers are doing some I/O, it might make sense to create a lot of them at once, but you are most probably going to hit some other limits (e.g., DB connections, network, etc.), so making it bounded can be safer.

To control the number of fibers created when using foreachPar, you can use the withParallelism operator on ZIO.

ZIO
  .foreachPar(1 to 1000)(_ => doSomething)
  .withParallelism(16)

In that example, 16 fibers will be created and will concurrently process the 1,000 elements of the collection. Assuming doSomething is CPU-bound and the machine has 16 cores, this will perform much better than having 1,000 fibers doing the same thing.

The withParallelism operator adjusts the parallelism level for any code executed within its scope. This includes all child fibers forked from it, as fibers inherit this setting from their parent. This means you can use withParallelism in your main function to set a sensible default parallelism across your application (usually matching the number of CPU cores), and you can adjust it locally as needed.
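Here is a small sketch of that pattern: a default set once at the top of the application, and a local override for one block. The doSomething function is a hypothetical stand-in for a CPU-bound task, and the numbers are arbitrary.

```scala
import zio._

object ParallelismExample extends ZIOAppDefault {

  // Hypothetical stand-in for a CPU-bound task.
  def doSomething(n: Int): UIO[Long] =
    ZIO.succeed(n.toLong * n)

  // java.lang.Runtime is spelled out because zio._ also brings in zio.Runtime.
  val cores: Int = java.lang.Runtime.getRuntime.availableProcessors()

  val program: UIO[Long] =
    for {
      // Uses whatever parallelism is in scope (set in `run` below).
      squares <- ZIO.foreachPar((1 to 1000).toList)(doSomething)
      // Local override: at most 4 fibers for this block only.
      _ <- ZIO.foreachParDiscard((1 to 100).toList)(doSomething).withParallelism(4)
    } yield squares.sum

  // Application-wide default, inherited by the fibers forked inside `program`.
  val run =
    program.withParallelism(cores).flatMap(sum => Console.printLine(s"sum = $sum"))
}
```

Whether the number of cores is the right default depends on your workload, so treat it as a starting point to measure against.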

Fun fact: while writing this, I checked the implementation of foreachPar and realized that when using withParallelism with a number n that was higher than the size of the collection, we were still creating n fibers. This is unnecessary because we only need to create as many fibers as there are items in the collection. I opened a PR to optimize this behavior, and it should be available in the next minor release of ZIO.

Executors

When creating fibers and giving them work to do, these tasks eventually end up running on physical threads. In ZIO, the logic for creating those threads and assigning tasks to them in a way that maximizes CPU efficiency is the job of the executor. Let's explore the different options we have here.

As of ZIO 2.1.0, the default executor is called ZScheduler and it is essentially a port of the Tokio scheduler in Rust. This article explains the algorithm in detail, but to summarize it very roughly:

  • We create a number of threads equal to the number of cores we have available.

  • Each thread has a local queue of tasks to run, and there is also a global queue that is shared between all threads.

  • When a fiber forks an effect, we create a new fiber and enqueue it into the local queue of the current thread. If the current thread isn't a ZIO thread (for example, if we call unsafe.run from an external thread), we enqueue it into the global queue. If the local queue is full (its size is bounded), we take half of its tasks and move them to the global queue.

  • We then check if there is a thread sleeping, and wake it up if we find one.

  • When a thread wakes up or finishes its current task, it looks for its next task in this order:

    • pick from their local queue first

    • if it is empty, pick from the global queue

    • if it is empty, look at other threads' local queues to see if there is anything to steal. When stealing, a thread takes half of the other thread's local queue.
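The rules above can be modeled with a toy, single-threaded sketch. This is not ZScheduler's actual code: the class names, the FIFO order of the queues, and the rounding when taking "half" are my own assumptions, and real concerns such as concurrency and waking up sleeping threads are left out.

```scala
import scala.collection.mutable

// One worker thread with its bounded local queue.
final class Worker(val id: Int, val capacity: Int) {
  val local: mutable.ArrayDeque[String] = mutable.ArrayDeque.empty
}

final class ToyScheduler(workerCount: Int, localCapacity: Int) {
  val workers: Vector[Worker] =
    Vector.tabulate(workerCount)(i => new Worker(i, localCapacity))
  val global: mutable.ArrayDeque[String] = mutable.ArrayDeque.empty

  // `from` is the worker that forks the task, or None for an external thread.
  def submit(task: String, from: Option[Worker]): Unit =
    from match {
      case None =>
        global.append(task)
      case Some(w) =>
        if (w.local.size >= w.capacity) {
          // Local queue is full: move half of its tasks to the global queue.
          val moved = w.local.size / 2
          (1 to moved).foreach(_ => global.append(w.local.removeHead()))
        }
        w.local.append(task)
    }

  // Lookup order: local queue, then global queue, then stealing.
  def nextTask(w: Worker): Option[String] =
    w.local.removeHeadOption()
      .orElse(global.removeHeadOption())
      .orElse(steal(w))

  // Take half (rounded up) of the first non-empty queue of another worker.
  private def steal(thief: Worker): Option[String] =
    workers.find(v => v.id != thief.id && v.local.nonEmpty).flatMap { victim =>
      val n = (victim.local.size + 1) / 2
      (1 to n).foreach(_ => thief.local.append(victim.local.removeHead()))
      thief.local.removeHeadOption()
    }
}
```

Stealing half of a queue instead of a single task means the thief has work for a while and does not need to come back immediately, which keeps contention between threads low.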

This process is proven and has been working efficiently since ZIO 2 was released. It has even been recently improved in ZIO 2.1. Why would we change it? Let's find out.

Blocking

In ZIO 2.0.x, the default scheduler had a feature called auto-blocking enabled by default. In 2.1.x, it has been disabled and requires explicit opt-in:

val bootstrap = Runtime.enableAutoBlockingExecutor

This feature tries to detect whether a fiber is running a blocking task. When it does, it remembers which part of the code the task came from so that it can automatically shift it to the blocking threadpool afterward. The goal is that if users forget to wrap their blocking code in ZIO.blocking or ZIO.attemptBlocking, the executor will do it for them.

Running blocking code on the default executor is bad because it uses a limited number of threads, and you can easily end up with all your threads blocked and no work being done at all.

The issue with this behavior is that the detection of whether a fiber is running blocking code is an imperfect heuristic that can produce false positives. To make it short, a task is considered "blocking" if its execution takes longer than a certain threshold. But that doesn't actually mean that it's blocking; it could just be CPU-bound, or it could be slow for another reason (I've witnessed my code being shifted to the blocking threadpool because class loading was taking some time at startup while the JVM was cold). Another issue is that tracking the blocking locations and checking them every time we run a task introduces overhead.

To get the best performance possible, it is better not to use this mode and to explicitly shift blocking code to the blocking threadpool. If you have doubts, use a profiler to look at thread states and find potential blocking. If you are not confident about your own code or the libraries you use and you are ready to take the performance penalty for better safety, then it is a perfectly fine option to enable auto-blocking.
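As a reminder of what explicit shifting looks like, here is a minimal sketch using the two operators mentioned earlier. The object name, readFile, and legacyCall are hypothetical examples, not something from a real codebase.

```scala
import zio._
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object BlockingExample {

  // Wraps a blocking side effect and runs it on the blocking threadpool.
  def readFile(path: String): Task[String] =
    ZIO.attemptBlocking(
      new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)
    )

  // Hypothetical effect that blocks internally (e.g. from a library).
  val legacyCall: Task[Unit] = ZIO.attempt(Thread.sleep(100))

  // Shifts an existing effect to the blocking threadpool.
  val safeLegacyCall: Task[Unit] = ZIO.blocking(legacyCall)
}
```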

Executor override

There is a behavior in ZIO that can be quite surprising and that not many people know about: when you shift a task to the blocking threadpool, the fiber does not automatically return to the default executor after running that code. It stays in the blocking threadpool until the fiber ends or it executes enough run loops to yield back to the default executor.

Depending on your use case, that behavior might be good or bad: if your fibers immediately end after doing the blocking call (for example, making a DB call and returning it to the client), it is more efficient to avoid the thread shifting. However, if your fibers are long-lived (for example, an entity behavior in Shardcake), you're going to have a lot of code running on the blocking threadpool, which might result in a large number of threads and a loss of performance. I believe Cats Effect made a different choice and automatically shifts back by default.

There is a simple way to fix this, though, which is setting an executor explicitly. When you override the default executor, work will always shift back to that given executor after the blocking code is executed. But what if you want to keep the default executor? Well, you can override the default executor with itself!

val bootstrap =
  Runtime.setExecutor(Executor.makeDefault(autoBlocking = false))

That simple line will guarantee that you always shift back to the default executor after running blocking code.
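If you want to check this in your own application, one rough way is to print the thread name before, during, and after a blocking section. This is only a sketch: the object and value names are made up, and thread names depend on your ZIO version, so I am not showing an expected output.

```scala
import zio._

object ShiftBackExample extends ZIOAppDefault {

  // Override the default executor with itself so that fibers shift back
  // after running blocking code.
  override val bootstrap: ZLayer[ZIOAppArgs, Any, Any] =
    Runtime.setExecutor(Executor.makeDefault(autoBlocking = false))

  private val threadName: UIO[String] =
    ZIO.succeed(Thread.currentThread().getName)

  // Thread names observed before, inside, and after a blocking section.
  val threads: UIO[(String, String, String)] =
    for {
      before <- threadName
      during <- ZIO.blocking(threadName)
      after  <- threadName
    } yield (before, during, after)

  val run =
    threads.flatMap { case (before, during, after) =>
      Console.printLine(s"before=$before during=$during after=$after")
    }
}
```

Running it once with the bootstrap override and once without should make the difference visible in the "after" value.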

Alternative executors

Another reason to change the default executor could be to try brand new, alternative executors. First, there is a new executor in ZIO 2.1.x that uses Loom (and is therefore only available on JDK 21+) and assigns virtual threads to each fiber's work. The efficiency of running those virtual threads on actual threads is then left to Loom.

To use it, simply add this to your bootstrap function:

val bootstrap = Runtime.enableLoomBasedExecutor

This is an interesting alternative that you should definitely test. However, note that in the ZIO benchmark that tests creating and joining a lot of fibers, the default Tokio-based scheduler was quite a lot faster than the Loom-based one.

Another recent possibility is Kyo's scheduler, which has been recently extracted as a standalone dependency in kyo 0.9.3. It is very new, so don't expect something battle-tested yet; however, its creator Flavio Brasil is an expert on the matter, and his scheduler is based on years of experience working with various effect systems on the JVM. I am definitely planning to give it a try at some point in the future.

A final note on Datadog

I'll finish with a last tip for people using Datadog, which provides an agent that automatically profiles your code without any significant overhead. There are 2 flags related to ZIO that I changed in production:

  • -Ddd.integration.throwables.enabled=false disables exception profiling. Why would we need that? From version 2.0.x, ZIO uses exceptions for control flow internally. It means that we're going to have a very high number of exceptions which can overwhelm the profiler. In ZIO 2.1.x, these usages have decreased but were not totally removed so it is still relevant.

  • -Ddd.integration.zio.experimental.enabled=true won't improve performance but it will allow the Datadog tracer to be able to pass its context between ZIO fibers. This should give you much better traces as it will be able to attach DB calls to API calls, for example. I've been using it in production for a while and can confirm there is no overhead. Despite its experimental status (and I believe its absence from the official documentation), it has been working well so I recommend it.
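How you pass these flags depends on how you launch the JVM. As one possible sketch, assuming the application is started through sbt with forking enabled, they could go into build.sbt. The agent jar path is a placeholder, and other packaging tools will need their own equivalent.

```scala
// build.sbt (sketch; assumes the app is launched with `sbt run`)

// JVM options only apply when sbt forks a separate JVM for the app.
run / fork := true

run / javaOptions ++= Seq(
  // Placeholder path: point this at your Datadog agent jar.
  "-javaagent:/path/to/dd-java-agent.jar",
  "-Ddd.integration.throwables.enabled=false",
  "-Ddd.integration.zio.experimental.enabled=true"
)
```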

That's it for today! I hope that was helpful.

Got additional tips for making ZIO applications faster in production? Let me know about them in the comments!