How to use external services safely and reliably in Elixir applications

Welcome to the Ropig engineering blog!

Ropig is the next generation of alert management software, and the latest SaaS product from the team that created MeetEdgar. Ropig receives alerts from your infrastructure and routes the right alert to the right person, in the right channel.

Ropig is built on Elixir and React and deployed to Google Cloud Platform, so you can rest assured that Ropig will always be up and ready to alert your team of any problems in your infrastructure.

This blog will be an ongoing discussion of the technical architecture and components of Ropig and also some of the challenges we’ve faced while developing Ropig. For this first post, we will discuss our approach for interacting with external services and APIs.

Like any modern software system, Ropig utilizes a variety of external services and APIs. These include Google Cloud Datastore, Google Pub/Sub, Sendgrid, and Twilio, among others. One thing that all external services have in common is the propensity to fail. Users of these services must accept this fact of life and be prepared to deal with the potential for failure at any time.

To this end, we on the engineering team have released the open source external_service package to aid in the safe and reliable consumption of external services for Elixir applications.

The Ropig external_service package is, in essence, the combination of two techniques: retrying individual requests that have failed and preventing cascading failures with circuit breakers. The remainder of this blog post will describe these techniques in detail and provide usage examples to illustrate how to use the external_service package effectively in your own Elixir applications.

The ExternalService module

The ExternalService module is the main interface to the functionality provided by the external_service library. The basic approach is to wrap every use of an external service in an Elixir function and pass that function to the ExternalService.call function (together with some options that we’ll discuss later). ExternalService can then manage calls to the external API, applying retry logic to individual calls while protecting your application from outages or other problems with the external API by using “circuit breakers” (described in more detail below).

In addition to the call function, ExternalService also provides a call_async function, which uses an Elixir Task to call the external service asynchronously, and a call_async_stream function, which makes multiple calls to the external service in parallel.

Both of these functions apply the same retry and “circuit breaker” mechanisms as the regular call function. Let’s look at these mechanisms in further detail.
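For readers new to Elixir's Task module, the two concurrency patterns mentioned above look roughly like this in plain Elixir. This is a minimal sketch using only the standard library; TaskSketch and its fetch function are hypothetical stand-ins for a real service call, and the code does not show how external_service itself is implemented.

```elixir
defmodule TaskSketch do
  # Hypothetical stand-in for a slow call to an external service.
  def fetch(id) do
    Process.sleep(10)
    {:ok, id * 2}
  end

  # One asynchronous call: start a Task, then await its result.
  def one_async(id) do
    task = Task.async(fn -> fetch(id) end)
    Task.await(task, 5_000)
  end

  # Many calls in parallel: Task.async_stream runs the function for each
  # element concurrently and yields {:ok, result} tuples in input order.
  def many_parallel(ids) do
    ids
    |> Task.async_stream(&fetch/1, max_concurrency: 4, timeout: 5_000)
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```

Calling TaskSketch.many_parallel([1, 2, 3]) returns the three results in input order even though the calls run concurrently.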

Retrying failed requests

Many failures that occur when accessing external services are transient in nature. For example, network congestion could cause a request to time out, or the service could be under heavy load and unable to handle any more requests at that moment. The best strategy for dealing with such transient failures is to simply retry the failed request, perhaps after a brief backoff period. The external_service package uses the retry library from Safwan Kamarrudin to automate retry logic. This retry library provides flexible configuration to control various aspects of retry logic, such as whether to use linear or exponential backoff, the maximum delay between retries, and the total amount of time to spend retrying before giving up.
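To make the backoff vocabulary concrete, here is a minimal sketch of how linear and exponential delay schedules, a maximum delay, and a total time budget can be expressed as Elixir streams. BackoffSketch and its function names are hypothetical and written only for illustration; they are not the retry library's API.

```elixir
defmodule BackoffSketch do
  @moduledoc "Hypothetical helper that computes retry delays in milliseconds."

  # Exponential backoff: each delay doubles, starting from `initial_ms`,
  # and never exceeds `cap_ms`.
  def exponential(initial_ms, cap_ms) do
    initial_ms
    |> Stream.iterate(&(&1 * 2))
    |> Stream.map(&min(&1, cap_ms))
  end

  # Linear backoff: each delay grows by a fixed `step_ms`, capped at `cap_ms`.
  def linear(initial_ms, step_ms, cap_ms) do
    initial_ms
    |> Stream.iterate(&(&1 + step_ms))
    |> Stream.map(&min(&1, cap_ms))
  end

  # Keeps only as many delays as fit within a total budget of `budget_ms`.
  def within_budget(delays, budget_ms) do
    Stream.transform(delays, 0, fn delay, spent ->
      if spent + delay <= budget_ms do
        {[delay], spent + delay}
      else
        {:halt, spent}
      end
    end)
  end
end
```

For example, BackoffSketch.exponential(100, 1000) |> Enum.take(6) yields [100, 200, 400, 800, 1000, 1000]: the delay doubles until it reaches the cap.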

Let’s see how we can apply this to calling an external service using an example extracted from the Ropig code base – publishing a message to Google Pub/Sub (using Kane, an Elixir library for working with Pub/Sub):

We wrap our call to the external service in a function and pass this function to ExternalService.call with retry options as the second argument. Importantly, we use a special return value from our anonymous function to trigger a retry. A retry is triggered by one of three mechanisms:

  1. the function returns :retry
  2. the function returns a tuple of the form {:retry, reason}
  3. the function raises an Elixir RuntimeError

In our example code, we examine the result of calling Kane.Message.publish and if it is an error response with an error code that matches one of our predetermined @retry_errors, we then trigger a retry.

Not all failed requests should be retried, of course. Some failures are due to bugs in the calling code; such calls can never succeed and therefore should not be retried. In our case, we consulted the documentation for Google Pub/Sub to determine which error codes should result in a retry. You will have to decide on a strategy to determine what error conditions are retriable for your service.
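The three retry triggers, and the rule that any other result is final, can be illustrated with a small self-contained sketch. RetrySketch is a hypothetical module written for this post's explanation, not the external_service implementation; the doubling delay and the {:error, {:retries_exhausted, reason}} return shape are assumptions made for the example.

```elixir
defmodule RetrySketch do
  @moduledoc "Hypothetical illustration of retry triggers with exponential backoff."

  # Runs `fun`, making at most `max_attempts` attempts in total.
  # `delay_ms` doubles after every failed attempt.
  def run(fun, max_attempts \\ 3, delay_ms \\ 10) when max_attempts > 0 do
    case attempt(fun) do
      {:retry, reason} when max_attempts == 1 ->
        {:error, {:retries_exhausted, reason}}

      {:retry, _reason} ->
        Process.sleep(delay_ms)
        run(fun, max_attempts - 1, delay_ms * 2)

      {:done, result} ->
        # Not a retriable outcome (success or permanent error): return as-is.
        result
    end
  end

  # Normalizes the three triggers into {:retry, reason}.
  defp attempt(fun) do
    case fun.() do
      :retry -> {:retry, :unspecified}
      {:retry, reason} -> {:retry, reason}
      result -> {:done, result}
    end
  rescue
    # Only RuntimeError triggers a retry; other exceptions propagate.
    error in RuntimeError -> {:retry, error.message}
  end
end
```

With this sketch, a function that returns {:error, :bad_request} is called exactly once and its result is passed through unchanged, while a function that always returns :retry is attempted max_attempts times before the caller receives an error.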

Circuit breakers for preventing catastrophic cascade

The Circuit Breaker pattern was first described in Michael Nygard’s landmark book Release It! and was later popularized by Martin Fowler in a post on his bliki. To quote Nygard, “Circuit breakers are a way to automatically degrade functionality when the system is under stress.”

The Circuit Breaker pattern is modeled after electrical circuit breakers, which are designed to protect electrical circuits from damage caused by excess current. Like these electrical circuit breakers, software “circuit breakers” are designed to protect the system at large from damage caused by faults in a component of the system. By protecting calls to external services, the circuit breaker can monitor for failures. If the failures reach some given threshold, the circuit breaker “trips” and further calls to the service will immediately return an error without even making a call to the external API itself. After a configurable period of time, the circuit breaker is automatically reset and calls to the external service will once again be attempted with the same monitoring as before.

In contrast to retry logic, which is applied to each individual call to a service, the circuit breaker for a given service is global to the entire system. If a circuit trips, then it trips for all users of the associated service. This is a key feature of the Circuit Breaker pattern, and is what allows it to prevent cascading failures.

The external_service package uses the Erlang fuse library from Jesper Louis Andersen for implementing circuit breakers. To extend the electrical analogies, circuit breakers in the fuse library are called “fuses.” These fuses can be used to protect against cascading failure by first asking the fuse if it is OK to proceed using the :fuse.ask function. If this function returns :ok then it is OK to proceed to calling on the external service. If, on the other hand, it returns :blown, then the fuse has been tripped and it is not safe to call the external service. In this scenario, your code must have a fallback option to compensate for the fact that the external service is unavailable, which might mean returning cached data or indicating to the user that the functionality is not currently available.

What causes a fuse to trip?

When using a fuse, your application code must tell the fuse about any failures that occur. If you’ve asked the fuse if it is OK to proceed but then receive an error from the external service, your code should call the :fuse.melt function, which “melts the fuse a little bit”. Once the fuse has been “melted” enough times, the fuse is tripped and future calls to :fuse.ask for that fuse will return :blown.
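The ask and melt protocol can be pictured with a toy circuit breaker built on a named Agent. FuseSketch is a hypothetical module for illustration only and is not how the fuse library is implemented; in particular, this sketch counts every failure since the last reset and does not model a time window for counting failures.

```elixir
defmodule FuseSketch do
  @moduledoc "Hypothetical Agent-based circuit breaker, for illustration only."

  # `max_melts` failures are tolerated; one more blows the fuse.
  # A blown fuse resets `reset_ms` milliseconds after it blew.
  # Registering under `name` makes the fuse shared by every caller.
  def start_link(name, max_melts, reset_ms) do
    Agent.start_link(
      fn -> %{melts: 0, max: max_melts, reset_ms: reset_ms, blown_at: nil} end,
      name: name
    )
  end

  # Returns :ok if calls may proceed, :blown otherwise.
  def ask(name) do
    Agent.get_and_update(name, fn state ->
      cond do
        state.blown_at == nil ->
          {:ok, state}

        now() - state.blown_at >= state.reset_ms ->
          # The reset period has elapsed: clear the fuse and allow calls again.
          {:ok, %{state | melts: 0, blown_at: nil}}

        true ->
          {:blown, state}
      end
    end)
  end

  # Records one failure and blows the fuse once the tolerance is exceeded.
  def melt(name) do
    Agent.update(name, fn state ->
      melts = state.melts + 1

      blown_at =
        if melts > state.max and state.blown_at == nil, do: now(), else: state.blown_at

      %{state | melts: melts, blown_at: blown_at}
    end)
  end

  defp now, do: System.monotonic_time(:millisecond)
end
```

Because the Agent is registered under a name, every process that asks the same fuse sees the same state, which is what lets a single tripped breaker protect all users of a service.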

ExternalService wraps the functionality provided by fuse in a convenient interface and automates the handling of the fuse so that you don’t need to explicitly call :fuse.ask or :fuse.melt in your code. Instead, you simply call ExternalService.call with the name of the fuse as the first argument, together with the function in which you’ve wrapped your call to the external API. ExternalService.call will first ask the given fuse before making the call and will return {:error, :fuse_blown} if the fuse is blown. It will also automatically call :fuse.melt any time the call to the given function results in a retry. This eventually results in a blown fuse if there are enough failed requests to the service being protected by the fuse.
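The sequence described above (ask the fuse, make the call, melt on every retry) can be sketched as follows. GuardedCallSketch is a hypothetical module, not the external_service source: the fuse is passed in as two plain functions, ask and melt, so the example stays self-contained, and backoff delays are omitted for brevity.

```elixir
defmodule GuardedCallSketch do
  @moduledoc "Hypothetical wrapper showing the ask/call/melt sequence."

  # `ask`  - zero-arity function returning :ok or :blown
  # `melt` - zero-arity function that records one failure on the fuse
  # `fun`  - the wrapped call to the external API
  def call(ask, melt, fun, attempts_left \\ 3) do
    case ask.() do
      :blown ->
        # Fail fast without touching the external service.
        {:error, :fuse_blown}

      :ok ->
        case fun.() do
          :retry -> retry(ask, melt, fun, attempts_left, :unspecified)
          {:retry, reason} -> retry(ask, melt, fun, attempts_left, reason)
          result -> result
        end
    end
  end

  # Every retry melts the fuse, so repeated failures eventually blow it.
  defp retry(ask, melt, fun, attempts_left, reason) do
    melt.()

    if attempts_left > 1 do
      call(ask, melt, fun, attempts_left - 1)
    else
      {:error, {:retries_exhausted, reason}}
    end
  end
end
```

Note that the fuse is consulted again before each retry, so a call that starts while the fuse is intact can still end with {:error, :fuse_blown} if its own failures, or those of other callers, trip the fuse along the way.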

The only requirement for using a fuse for a particular service is that it must be initialized before using the service. This is done with the ExternalService.start/2 function, which takes the fuse name and options as arguments. The fuse name is an atom which must uniquely identify the external service to which the fuse applies. This function should be called in your application startup code, which is typically the Application.start function for your application. For example:
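A minimal sketch of such startup code might look like the following. It assumes that ExternalService.start/2 accepts fuse_strategy and fuse_refresh options as described in the package documentation; the MyApp.Application module, the :pub_sub fuse name, and the option values are all hypothetical and chosen only for illustration.

```elixir
defmodule MyApp.Application do
  # Hypothetical application module; names and values are assumptions.
  use Application

  def start(_type, _args) do
    # Assumed options: tolerate 5 failures within a 10-second window,
    # then reset a blown fuse after 5 seconds.
    ExternalService.start(:pub_sub,
      fuse_strategy: {:standard, 5, 10_000},
      fuse_refresh: 5_000
    )

    # No child processes in this sketch; a real application lists them here.
    children = []
    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```

Initializing the fuse before the supervision tree starts means that no worker can call the service before its fuse exists.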

See the API docs for further details about the available fuse options.

Future enhancements

While the external_service package already provides very useful functionality, there are a few forthcoming enhancements that will make it even more broadly applicable.

The first is to provide more comprehensive logging and statistics. The fuse library already provides some logging and statistics for circuit breaker events, but having these integrated into the external_service package would make them more accessible to Elixir developers. The retry library does not currently provide any logging or statistics whatsoever, so it would be of great benefit to incorporate retry logging and statistics into external_service.

The second enhancement is to incorporate rate limiting. Many APIs and services enforce rate limits by rejecting requests once a limit has been exceeded within a particular time period. By building rate limiting into external_service, we would allow developers to control the rate at which requests to a service are made and either block requests until such time as they would succeed, or return an alternative response when limits have been exceeded. Like fuses, rate limits would apply to all users of a service within the system. Therefore, rate limiting could be implemented by building on top of the existing circuit breaker for a service.

Finally, we would like to provide the ability for operators to control circuits and fuse settings. Currently, circuit breakers are tripped in response to repeated failures and are reset after a configurable period of time. But because the state of circuit breakers is important to operations personnel, there must be some way to allow operations to directly trip and reset circuit breakers.

Final Thoughts

In this blog post, we’ve described our approach to using external services in Ropig and, in doing so, shown how the external_service package can be used to apply the same techniques to your own Elixir applications. If you found this information useful, please consider subscribing to our blog for more articles about Elixir in the real world. And if you have ideas about how to make external_service even better, you can contribute by going to the GitHub project and creating an issue or submitting a pull request.