Avoiding a GenServer Pitfall
If you’re already intimately familiar with GenServers, skip to the next section.
Firstly, what is a GenServer? #
GenServer, short for “generic server”, is a type of process within BEAM (Erlang, Elixir, etc.) applications that can hold state and respond to requests, much like a web server. For instance, you could create a GenServer that handles requests to store, list, and retrieve blog posts, though those requests would be made within the BEAM VM rather than from a web browser. Within BEAM applications, the two primary purposes of GenServers are to execute code asynchronously and to act as a store of shared, mutable state.
Executing Code Asynchronously #
One way to use a GenServer is as a worker. Let’s say that when a new user signs up for your service, you want to send them a welcome email, but you don’t want to wait until that email is sent to return a successful account creation HTTP response to the user. You can send that email asynchronously like so [1]:
def create_user(user_details) do
  user_details
  |> User.changeset()
  |> Repo.insert()
  |> case do
    {:ok, user} ->
      # Fire and forget: the email is sent by the MyApp.Mailer process.
      GenServer.cast(MyApp.Mailer, {:send_welcome, user})
      {:ok, user}

    error ->
      error
  end
end
defmodule MyApp.Mailer do
  use GenServer

  # For example purposes, we assume the server does not have any meaningful
  # internal state or on-start setup to do.
  def init(_), do: {:ok, nil}

  # handle_cast/2 receives only the message and the state; there is no
  # caller to reply to, so there is no `from` argument.
  def handle_cast({:send_welcome, user}, state) do
    user
    |> customize_welcome_email()
    |> @mail_api.send()

    {:noreply, state}
  end
end
Since GenServer.cast is asynchronous (sends a message and returns immediately, a behavior often termed “fire and forget”), the process invoking create_user will not wait for MyApp.Mailer.handle_cast to complete. Thus, control will return from create_user likely before the mailer ever tries to send the email.
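To see the "fire and forget" behavior in isolation, here is a minimal sketch. The module name CastDemo.SlowWorker and the 200 ms sleep are assumptions made up for illustration; the sleep stands in for slow work such as sending an email.

```elixir
defmodule CastDemo.SlowWorker do
  use GenServer

  def start_link(notify_pid), do: GenServer.start_link(__MODULE__, notify_pid)

  @impl true
  def init(notify_pid), do: {:ok, notify_pid}

  @impl true
  def handle_cast(:slow_job, notify_pid) do
    # Hypothetical stand-in for slow work (e.g. calling a mail API).
    Process.sleep(200)
    send(notify_pid, :job_done)
    {:noreply, notify_pid}
  end
end

{:ok, worker} = CastDemo.SlowWorker.start_link(self())

started_at = System.monotonic_time(:millisecond)
:ok = GenServer.cast(worker, :slow_job)
cast_elapsed_ms = System.monotonic_time(:millisecond) - started_at

# The worker only reports back after its simulated work has finished.
job_finished =
  receive do
    :job_done -> true
  after
    1_000 -> false
  end

total_elapsed_ms = System.monotonic_time(:millisecond) - started_at

IO.puts("cast returned after #{cast_elapsed_ms} ms, job finished after #{total_elapsed_ms} ms")
```

The cast itself returns almost immediately, while the job only completes roughly 200 ms later in the worker process.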
Shared, Mutable State #
BEAM languages do not, by default, have mutable data structures [2]. This means that if you have some list x and you wish to add an element to it, you end up with a new list y, and x is unchanged. The BEAM also enforces process memory isolation. That is to say, if x exists in Process A, the only way that Process B can read x is to receive a message from Process A containing x. Notably, message passing copies the data being sent: Process B does not receive a reference to x, which is isolated to Process A, but a copy of x.
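A small sketch of both points, using plain processes rather than a GenServer. The message shapes ({:list, ...} and {:child_result, ...}) are made up for this example.

```elixir
x = [1, 2, 3]

# "Adding" an element builds a new list; x itself is untouched.
y = [0 | x]

parent = self()

# The spawned process can only see x by receiving it in a message.
child =
  spawn(fn ->
    receive do
      {:list, list} ->
        # Building a new list here cannot affect the sender's x.
        send(parent, {:child_result, [:changed | list]})
    end
  end)

send(child, {:list, x})

child_result =
  receive do
    {:child_result, result} -> result
  after
    1_000 -> :no_reply
  end

IO.inspect(x, label: "x in parent")
IO.inspect(y, label: "y in parent")
IO.inspect(child_result, label: "list built in child")
```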
GenServers allow developers to work within these limitations by:
- Functioning as a stable “location” (name or Process ID (PID)) where the data “changes” over time.
- Functioning as a shared “location” where multiple processes can manipulate the same data.
One would naturally believe that a mutable data store shared between any number of processes would lead to countless race conditions. However, many race conditions are mitigated by the fact that GenServers respond to requests sequentially.
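As a rough illustration of sequential handling, the sketch below has 100 concurrent callers increment a counter held by a GenServer. StateDemo.Counter is a hypothetical module written for this example only.

```elixir
defmodule StateDemo.Counter do
  use GenServer

  def start_link(initial), do: GenServer.start_link(__MODULE__, initial)

  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_call(:increment, _from, count) do
    # The read-modify-write happens inside a single callback, so no other
    # request can interleave with it.
    {:reply, count + 1, count + 1}
  end

  def handle_call(:value, _from, count), do: {:reply, count, count}
end

{:ok, counter} = StateDemo.Counter.start_link(0)

# 100 concurrent callers, each incrementing once.
1..100
|> Enum.map(fn _ -> Task.async(fn -> GenServer.call(counter, :increment) end) end)
|> Enum.map(&Task.await/1)

final_count = GenServer.call(counter, :value)
IO.inspect(final_count, label: "final count")
```

No increment is lost, because each :increment request is fully handled before the next one is taken from the mailbox.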
For instance, let’s imagine that we are creating a seat reservation system for a movie theater. When a user requests to reserve a seat, we want to:
- Verify that the requested seat is still available.
- Process the user’s payment.
- Update the seat in our system to no longer be available.
Consider this GenServer:
defmodule MyApp.SeatReservations do
  @moduledoc """
  A GenServer that keeps track of which seats are reserved.
  """
  use GenServer

  def init(_) do
    {:ok, %{reserved_seats: []}}
  end

  def handle_call({:seat_available?, seat}, _from, state) do
    # A seat is available when it is NOT in the list of reserved seats.
    {:reply, seat not in state.reserved_seats, state}
  end

  def handle_call({:mark_reserved, seat}, _from, state) do
    updated_state = Map.update!(state, :reserved_seats, &[seat | &1])
    {:reply, :ok, updated_state}
  end
end
Used like so:
def reserve_seat(seat, user) do
  if GenServer.call(MyApp.SeatReservations, {:seat_available?, seat}) do
    process_user_payment(user)
    GenServer.call(MyApp.SeatReservations, {:mark_reserved, seat})
    :ok
  else
    {:error, :taken}
  end
end
Using this implementation, a race condition can occur if two users request the same seat at roughly the same time. Because the seat is not yet marked as reserved for the first user (payment processing takes time), the second user to request the seat would still see it as available, and both users would ultimately be able to reserve the same seat.
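The race can be reproduced with a self-contained sketch. The RaceDemo modules, the :reserved_seats inspection call, and the 100 ms sleep standing in for payment processing are all assumptions added for illustration.

```elixir
defmodule RaceDemo.SeatReservations do
  use GenServer

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(_), do: {:ok, %{reserved_seats: []}}

  @impl true
  def handle_call({:seat_available?, seat}, _from, state) do
    {:reply, seat not in state.reserved_seats, state}
  end

  def handle_call({:mark_reserved, seat}, _from, state) do
    {:reply, :ok, Map.update!(state, :reserved_seats, &[seat | &1])}
  end

  # Extra call, added only so the demo can inspect the state.
  def handle_call(:reserved_seats, _from, state), do: {:reply, state.reserved_seats, state}
end

defmodule RaceDemo do
  alias RaceDemo.SeatReservations

  # Check-then-act split across two separate calls, as in the version above.
  def reserve_seat(seat, _user) do
    if GenServer.call(SeatReservations, {:seat_available?, seat}) do
      # Hypothetical stand-in for payment processing.
      Process.sleep(100)
      GenServer.call(SeatReservations, {:mark_reserved, seat})
      :ok
    else
      {:error, :taken}
    end
  end
end

{:ok, _pid} = RaceDemo.SeatReservations.start_link()

# Two users request the same seat at roughly the same time.
results =
  [:alice, :bob]
  |> Enum.map(fn user -> Task.async(fn -> RaceDemo.reserve_seat("A1", user) end) end)
  |> Enum.map(&Task.await/1)

reserved = GenServer.call(RaceDemo.SeatReservations, :reserved_seats)

IO.inspect(results, label: "results")
IO.inspect(reserved, label: "reserved seats")
```

Both callers pass the availability check before either has marked the seat, so the seat ends up reserved twice.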
To solve this issue, we can move our “critical section” into the GenServer like so:
defmodule MyApp.ReservationServer do
  @moduledoc """
  A GenServer that handles reserving seats.
  """
  use GenServer

  def init(_) do
    {:ok, %{reserved_seats: []}}
  end

  def handle_call({:reserve, seat, user}, _from, state) do
    if seat in state.reserved_seats do
      {:reply, {:error, :taken}, state}
    else
      process_user_payment(user)
      updated_state = Map.update!(state, :reserved_seats, &[seat | &1])
      {:reply, :ok, updated_state}
    end
  end
end
And updated usage:
def reserve_seat(seat, user) do
  GenServer.call(MyApp.ReservationServer, {:reserve, seat, user})
end
Because a GenServer only processes one request at a time, this new GenServer will fully process one request before beginning the next. This means that if two users request the same seat at about the same time, reserved_seats in the GenServer’s state has already been updated by the time the second user’s request begins to be handled, because the first user’s request must have finished. Thus the second user will receive an error that the seat is already reserved (i.e. the correct behavior of the system).
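Here is a runnable sketch of that behavior with two concurrent callers. FixedDemo.ReservationServer is a hypothetical copy of the server above, with a 100 ms sleep assumed in place of process_user_payment/1.

```elixir
defmodule FixedDemo.ReservationServer do
  use GenServer

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(_), do: {:ok, %{reserved_seats: []}}

  @impl true
  def handle_call({:reserve, seat, _user}, _from, state) do
    if seat in state.reserved_seats do
      {:reply, {:error, :taken}, state}
    else
      # Hypothetical stand-in for payment processing.
      Process.sleep(100)
      {:reply, :ok, Map.update!(state, :reserved_seats, &[seat | &1])}
    end
  end
end

{:ok, _pid} = FixedDemo.ReservationServer.start_link()

# Two users request the same seat at roughly the same time.
results =
  [:alice, :bob]
  |> Enum.map(fn user ->
    Task.async(fn -> GenServer.call(FixedDemo.ReservationServer, {:reserve, "A1", user}) end)
  end)
  |> Enum.map(&Task.await/1)

successes = Enum.count(results, &(&1 == :ok))
rejections = Enum.count(results, &(&1 == {:error, :taken}))

IO.inspect(results, label: "results")
```

Which of the two users wins depends on whose message reaches the server first, but exactly one of them gets the seat.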
The Pit #
The default behavior of GenServer.call is to time out after five seconds (though the timeout can be set to any number of milliseconds or :infinity), causing the calling process to exit if the GenServer fails to respond before the timeout. Unfortunately, getting the timing right in GenServers is difficult. Among the reasons why a GenServer may not respond to a call before timeout are:
- The callback for your request took too long. For instance, the callback may be making a call to an external API and their server is running particularly slowly today.
- The GenServer is bogged down handling other requests. Since requests to GenServers are handled in first-in, first-out (FIFO) order, it is possible that the GenServer never even started handling your request before it timed out.
- A combination of the above: the GenServer started handling your request before timeout, but was delayed due to processing other messages first and was unable to reply before timeout despite the callback’s runtime being less than timeout milliseconds.
One can easily imagine a situation where, using the example code above, the payment processing step takes more than five seconds some of the time.
There is no way to totally avoid this class of issue. Even a GenServer whose callbacks execute in microseconds can become bogged down by a sudden surge of requests.
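A minimal sketch of what a timeout looks like from the caller's side, using scaled-down, made-up numbers (a 300 ms callback and a 100 ms call timeout). TimeoutDemo.SlowServer is a hypothetical module. When the call times out, the caller exits with a {:timeout, ...} reason, which the demo catches so the script can continue.

```elixir
defmodule TimeoutDemo.SlowServer do
  use GenServer

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, nil)

  @impl true
  def init(_), do: {:ok, nil}

  @impl true
  def handle_call(:slow, _from, state) do
    # Hypothetical stand-in for a slow external API call.
    Process.sleep(300)
    {:reply, :done, state}
  end
end

{:ok, server} = TimeoutDemo.SlowServer.start_link()

# The callback needs 300 ms, but the caller is only willing to wait 100 ms.
first_outcome =
  try do
    GenServer.call(server, :slow, 100)
  catch
    :exit, {:timeout, _call_details} -> :call_timed_out
  end

# The server does not know the caller gave up; it still finishes the first
# callback before it gets to this second request.
second_outcome = GenServer.call(server, :slow, 1_000)

IO.inspect(first_outcome, label: "first call")
IO.inspect(second_outcome, label: "second call")
```

Note that the second call also pays for the remainder of the first callback, which is the queueing effect described in the list above.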
In the last month, I’ve experienced two different production issues related to GenServer timing:
- A GenServer callback that never terminated because a library function called in my callback evidently doesn’t terminate sometimes. The GenServer was effectively dead, but never restarted because it wasn’t actually dead.
- A GenServer callback that sometimes takes a while (it’s calling an external API) and on those occasions, subsequent calls to the GenServer timed out. The system took a while to recover.
A Partial Solution #
While this problem cannot be completely resolved, there is one way to constrain the runtime of GenServer callbacks and thus reduce the risk of timeouts: Tasks. Ideally, GenServers would have a built-in way to limit the allowable time for a GenServer callback’s execution. Since there isn’t, we can make one ourselves:
def handle_call(:foo, _from, state) do
  task =
    Task.async(fn ->
      slow_function(state)
    end)

  case Task.yield(task, 4_000) || Task.shutdown(task) do
    {:ok, result} -> {:reply, {:ok, result}, state}
    {:exit, reason} -> {:reply, {:error, reason}, state}
    nil -> {:reply, {:error, :timeout}, state}
  end
end
Let’s break this down.
task =
  Task.async(fn ->
    slow_function(state)
  end)
The first thing this callback does is begin a Task running slow_function. Here, slow_function is a placeholder for the work that needs to be done in this GenServer’s callback; everything else in this callback is about limiting the runtime (we’ll reduce the boilerplate in a moment).
case Task.yield(task, 4_000) || Task.shutdown(task) do
  {:ok, result} -> {:reply, {:ok, result}, state}
  {:exit, reason} -> {:reply, {:error, reason}, state}
  nil -> {:reply, {:error, :timeout}, state}
end
The expression Task.yield(task, 4_000) || Task.shutdown(task) gives the task running slow_function (i.e. the callback’s “real” work) 4,000ms to complete. The value of 4,000ms is somewhat arbitrary; but the central idea is that:
- It should probably be less than the default GenServer.call timeout of 5,000ms unless every caller is using a longer, custom timeout.
- It should probably leave some headroom to account for the fact that the GenServer.call timeout may have started counting down while a previous callback was being processed, and so this callback has less than the full 5,000ms to respond.
You can tweak this timeout based on your system’s needs.
Task.yield and Task.shutdown conveniently have the same return types:
- {:ok, reply}: The Task completed and returned a value.
- {:exit, reason}: The Task died unexpectedly.
- nil: No reply was received.
Here we’re using || to essentially say “If the Task hit its time limit, shut it down”. Based on which result we receive here, we produce a corresponding reply for our GenServer.
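The three outcomes can be observed with a short sketch. One deliberate difference from the callback above: the demo starts its tasks with Task.Supervisor.async_nolink instead of Task.async. Task.async links the task to its caller, so a crashing task would normally take the caller down with it unless the caller traps exits; the unlinked variant lets the {:exit, reason} case be shown safely. The await_or_kill function and the timings are made up for this example.

```elixir
{:ok, supervisor} = Task.Supervisor.start_link()

# The same yield-or-shutdown expression, wrapped so each outcome can be tried.
await_or_kill = fn task, limit_ms ->
  Task.yield(task, limit_ms) || Task.shutdown(task)
end

# 1. The task finishes within the limit.
fast_task = Task.Supervisor.async_nolink(supervisor, fn -> :finished end)
fast_result = await_or_kill.(fast_task, 500)

# 2. The task exceeds the limit and is shut down.
slow_task = Task.Supervisor.async_nolink(supervisor, fn -> Process.sleep(5_000) end)
slow_result = await_or_kill.(slow_task, 100)
slow_task_alive = Process.alive?(slow_task.pid)

# 3. The task exits abnormally before producing a reply.
crashing_task = Task.Supervisor.async_nolink(supervisor, fn -> exit(:boom) end)
crash_result = await_or_kill.(crashing_task, 500)

IO.inspect(fast_result, label: "fast")
IO.inspect(slow_result, label: "slow")
IO.inspect(crash_result, label: "crashing")
```

If you keep Task.async inside a GenServer, it is worth deciding explicitly what should happen when the task crashes, since the link means the GenServer itself will usually crash too.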
As promised, one can reduce the boilerplate by defining a helper function:
def time_limited(fun, time_limit) do
  task = Task.async(fun)
  Task.yield(task, time_limit) || Task.shutdown(task)
end
Used like so:
def handle_call(:foo, _from, state) do
  case time_limited(fn -> slow_function(state) end, 4_000) do
    {:ok, result} -> {:reply, {:ok, result}, state}
    {:exit, reason} -> {:reply, {:error, reason}, state}
    nil -> {:reply, {:error, :timeout}, state}
  end
end
To cut down boilerplate even further, you can tailor time_limited to GenServers by creating the {:reply, ..., ...} tuples inside that function, or convert time_limited to a macro to remove the need to wrap the callback logic in a lambda.
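One possible shape for the GenServer-tailored variant is sketched below. The names TimeLimit.reply_within/3 and TimeLimit.DemoServer are hypothetical, and the 200 ms limit is scaled down so the demo runs quickly. This version always replies with the unchanged state, which is an assumption that will not fit every callback.

```elixir
defmodule TimeLimit do
  @doc """
  Runs `fun` in a Task for at most `time_limit` milliseconds and builds the
  `{:reply, ..., state}` tuple for a handle_call callback.
  """
  def reply_within(fun, time_limit, state) do
    task = Task.async(fun)

    case Task.yield(task, time_limit) || Task.shutdown(task) do
      {:ok, result} -> {:reply, {:ok, result}, state}
      {:exit, reason} -> {:reply, {:error, reason}, state}
      nil -> {:reply, {:error, :timeout}, state}
    end
  end
end

defmodule TimeLimit.DemoServer do
  use GenServer

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, nil)

  @impl true
  def init(_), do: {:ok, nil}

  @impl true
  def handle_call({:work, duration_ms}, _from, state) do
    # The sleep is a stand-in for the callback's real work.
    work = fn ->
      Process.sleep(duration_ms)
      :done
    end

    TimeLimit.reply_within(work, 200, state)
  end
end

{:ok, server} = TimeLimit.DemoServer.start_link()

quick_reply = GenServer.call(server, {:work, 10})
slow_reply = GenServer.call(server, {:work, 2_000})

IO.inspect(quick_reply, label: "quick")
IO.inspect(slow_reply, label: "slow")
```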
One caveat to note here is that spawning a Task is going to consume additional resources — most notably memory. As noted above, when one process sends a message to another, the data in that message is copied between the processes. This also applies to spawning new processes: the data you are spawning the new process with is copied into that process. Consider a typical GenServer interaction with the above pattern:
- Process P sends a message to GenServer G
- G spawns Task T.
Let’s conjecture that P is sending data needed to fulfill the GenServer’s callback. Therefore:
- The original data exists in P
- The data is copied from P to G with GenServer.call
- The data is copied from G to T with Task.async
Therefore, there are three total copies of this data. If the data in question is small, such as an atom or integer, this is not of much concern. However, if the data is larger (e.g., a struct with a large number of fields or nested structures) creating two additional copies of that data may consume more memory than you are comfortable allocating.
It is, as a general note, a best practice to send as little data in messages as possible for this reason (e.g., prefer sending an ID that can be used to fetch a struct rather than the struct itself).
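To get a rough feel for the difference, the sketch below compares the size of a large term with the size of its ID. It relies on :erts_debug.flat_size/1, an undocumented debugging function that reports a term's size in machine words, so treat the numbers as indicative only. The map shape and the {:would_fetch_record, id} result are made up for this example.

```elixir
# A hypothetical record with a large nested field.
big_record = %{id: 42, name: "Ada", history: Enum.to_list(1..10_000)}
id_only = big_record.id

# Approximate heap size of each term, in machine words.
big_words = :erts_debug.flat_size(big_record)
id_words = :erts_debug.flat_size(id_only)

# Only what the closure captures is copied into the Task. Capturing id_only
# rather than big_record keeps that copy tiny.
task = Task.async(fn -> {:would_fetch_record, id_only} end)
result = Task.await(task)

IO.puts("whole record: #{big_words} words, id only: #{id_words} words")
```

The same reasoning applies to the closure passed to Task.async inside a callback: pulling out just the fields the task needs before building the closure avoids copying the entire GenServer state.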
The Caveat #
This solution mitigates timeout Reason 1 (callback took too long), but does not fully mitigate Reasons 2 or 3 (callback was not started before timeout due to prior messages, callback was started before timeout but was delayed by prior messages). That is to say, this approach does not solve the problem of a GenServer’s message queue being backed up. Even if each callback is limited to 4,000ms of runtime, two messages that require the maximum 4,000ms of runtime sitting in the queue ahead of your message mean that at least 8,000ms have elapsed before the GenServer even begins processing your message, and the default timeout of 5,000ms has long expired.
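The arithmetic can be reproduced with scaled-down numbers: a 400 ms callback limit standing in for 4,000ms and a 500 ms call timeout standing in for 5,000ms. QueueDemo.Server and the :callback_limit_hit reason are hypothetical names for this sketch, and every request is assumed to hit the callback limit.

```elixir
defmodule QueueDemo.Server do
  use GenServer

  @callback_limit_ms 400

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, nil)

  @impl true
  def init(_), do: {:ok, nil}

  @impl true
  def handle_call(:work, _from, state) do
    # Work that always exceeds the limit, so every callback takes the full 400 ms.
    task = Task.async(fn -> Process.sleep(10_000) end)

    case Task.yield(task, @callback_limit_ms) || Task.shutdown(task) do
      {:ok, result} -> {:reply, {:ok, result}, state}
      {:exit, reason} -> {:reply, {:error, reason}, state}
      nil -> {:reply, {:error, :callback_limit_hit}, state}
    end
  end
end

{:ok, server} = QueueDemo.Server.start_link()

call_timeout_ms = 500

# Three callers arrive at roughly the same time.
outcomes =
  1..3
  |> Enum.map(fn _ ->
    Task.async(fn ->
      try do
        GenServer.call(server, :work, call_timeout_ms)
      catch
        :exit, {:timeout, _call_details} -> :call_timed_out
      end
    end)
  end)
  |> Enum.map(&Task.await/1)

IO.inspect(outcomes, label: "outcomes")
```

The first request handled gets the server's own {:error, :callback_limit_hit} reply after about 400 ms. The requests queued behind it cannot be answered until roughly 800 ms and 1,200 ms, so their callers give up at 500 ms even though no single callback exceeded its limit.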
In Sum #
You can partially mitigate the problem of long-running GenServer callbacks causing timeouts within your system by wrapping the functionality of the callback in a time-limited Task. To this end, I believe that you should generally default to time-limiting GenServer callbacks unless you have a good reason not to [3]. It’s a bit of boilerplate and additional memory overhead, but when used thoughtfully it’ll save you from production headaches in the long run.
[1] Generally speaking, you should wrap GenServer.cast and GenServer.call invocations in another function, e.g.

def send_welcome_email(user) do
  GenServer.cast(MyApp.Mailer, {:send_welcome, user})
end

The rationale for this indirection is that the implementations for these types of operations are often subject to change. For instance, sending the welcome email might later be performed by a Task instead. If that were the case, and our application is sending emails via this send_welcome_email/1 function, we wouldn’t need to update every function that sends an email to use a Task instead of a GenServer.cast; we’d only need to update send_welcome_email/1.
[2] You can make a mutable data structure using native implemented functions (“NIFs”), but this is not commonly done.
[3] There are some use cases for GenServer to run time-unbounded callbacks, such as async workers. It’s a somewhat niche use case, however, since if the GenServer crashes for any reason (or the node goes down), the GenServer’s work queue is lost. A better option would be to opt for a persistence-backed worker such as Oban.