App Availability and Resiliency

A resilient app is resistant to events like hardware failures or outages. Redundancy, supported by monitoring and scaling, is one of the ways to make your app more resilient, and we provide some built-in features to make this easier.

But first, it’s important to understand what your app needs to take care of.

What your app needs to do

Since Fly Proxy does load balancing by default, you need to consider how your app handles:

  • session storage, so that users stay logged in when their session moves to a different server
  • persistent file storage and replication, as discussed in the following paragraphs

A note about Fly Proxy: Fly Proxy does a lot of work in the background, applying handlers, managing traffic and load balancing, and managing connections to services. Fly Proxy also automatically starts and stops Machines based on traffic to your app. The proxy only manages Machines in process groups that have services defined.

For databases, you’ll need to follow the specific recommendations for the database you’re using, including how to set up clusters for replication and failover.

If you’re deploying Fly Postgres, our unmanaged database app, then you need to manage backups, recovery, replication, and scaling yourself. For information about high availability with Fly Postgres, refer to High Availability & Global Replication.

When you use Fly Volumes with your app, your app needs to implement replication and backups. Volumes are block devices that live on a single disk array and get mounted directly by your Machine. There’s no magic. And there’s no built-in replication between volumes. If your app provides a clustering mechanism with data replication, like most databases do, we recommend you take advantage of that and run multiple instances with attached volumes. When possible, we place your app’s volumes in different hardware zones within a region to mitigate hardware failures.

This brings us to disaster recovery and being prepared. You should have backups of your data and understand how to recover that data if needed. Fly.io takes daily snapshots of volumes and stores them for 5 days. You can restore a volume’s data from one of these daily snapshots. If your use case requires more frequent backups, then you’ll need to set up tooling and processes to back up your volumes on the required schedule.

Fly.io features for app resiliency

The Fly Platform has features to help make your app more resilient in case of outages or hardware failures.

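The features for redundancy and availability are:

  • Redundancy by default on first deploy
  • Standby Machines for process groups without services
  • Automatically start and stop Machines
  • Health check-based routing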

These features require or depend on your app configuration in the fly.toml file, including process groups. Every Fly App has at least one process group. An app with no extra process groups defined in fly.toml still uses the app process group by default. Learn more about app configuration.
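As a rough illustration of process groups, the following fly.toml fragment sketches an app with two groups. The app name, region, commands, and port are hypothetical placeholders, not values from this guide. The app group has an HTTP service attached, so Fly Proxy can manage it; the worker group has no services.

```toml
# Hypothetical app name and region, for illustration only.
app = "my-example-app"
primary_region = "ord"

[processes]
  # Each key is a process group; each value is the command it runs.
  # Both commands are placeholders.
  app = "bin/server"
  worker = "bin/worker"

[http_service]
  # Only the "app" group gets a service, so only its Machines
  # are reachable through Fly Proxy.
  internal_port = 8080
  force_https = true
  processes = ["app"]
```

If the processes section is left out entirely, the app still has the single default app process group.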

You might also be interested in how we do load balancing.

Redundancy by default on first deploy

Technically, the title of this section should be “Redundancy by default on first deploy or after scaling down to zero Machines”. When you deploy your app for the first time (or after scaling down to zero Machines), you get some default redundancy configurations. These settings help you to deploy a recommended setup without having to think about it. On the other hand, if the defaults aren’t what you’re looking for, then you can still configure your app and processes the way you want. Learn more about scaling Machine CPU and RAM and scaling the number of Machines.

Here’s what Fly Launch (the fly launch or fly deploy command) does, based on your app configuration:

  • creates and starts two Machines for process groups with services, and sets automatic start and stop to true and minimum machines running to zero
  • creates and starts one Machine and creates one stopped standby Machine for process groups without services
  • creates and starts only one Machine if the process group has volumes mounted, because there’s no built-in replication between volumes
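For reference, a process group counts as having a volume mounted when fly.toml contains a mounts section along these lines. This is a minimal sketch; the volume name and mount path are hypothetical.

```toml
[mounts]
  # "data" is a placeholder volume name and "/data" a placeholder path.
  source = "data"
  destination = "/data"
  # Limits the mount to the "app" process group.
  processes = ["app"]
```

Because each volume attaches to a single Machine, running a second Machine in this group means creating a second volume and handling replication in your app.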

Two Machines for process groups with services

The most basic way to improve resiliency is to create more than one Machine per process group, and we do this by default for process groups with services. A process group with services has mappings to the external internet or to a private (Flycast) network, so its connections and operation can be managed by Fly Proxy.

If a process group has services defined, two Machines are automatically created and started when:

  • you deploy an app for the first time using fly launch or fly deploy,
  • you redeploy an app using fly deploy after scaling down to zero, or
  • you add a new process group with services in the fly.toml file and then run fly deploy

Standby Machines for process groups without services

Standby Machines provide low-effort redundancy for process groups with no services defined. A process group with no services has no mappings to the external internet or to a private (Flycast) network, so its connections and operation can’t be managed by Fly Proxy.

A standby Machine is a stopped Machine that’s essentially paired to and watching a running Machine. The standby will only be started if the paired Machine becomes unavailable.

If a process group doesn’t have services defined, then a standby Machine is automatically created, along with a running Machine, when:

  • you deploy an app for the first time using fly launch or fly deploy,
  • you redeploy an app using fly deploy after scaling down to zero, or
  • you add a new process group without services in fly.toml and then run fly deploy

The standby Machine doesn’t consume resources or start up until it’s needed.

Remove a standby Machine

When an app or process group has one Machine and one standby Machine, the standby Machine is destroyed first if you scale down to one Machine. The standby Machine isn’t subsequently recreated when you scale up or back down using fly scale count.

If you add services to the process group in fly.toml, then the standby designation is removed from the standby Machine on the next deploy.

Create a standby Machine

You can recreate the standby Machine if you scale down to 0 and then run fly deploy, which will create one Machine and one standby Machine again.

At a lower level, you can also use the fly machines update command with the --standby-for option to create a standby configuration using two existing Machines.

Turn off redundancy on deploy

What if you don’t want to keep Machines running all the time for an app with a small or variable workload? The automatic start and stop feature is enabled by default for Machines and makes it possible to run at least two Machines without wasting resources.

But if you still don’t want fly deploy to create two Machines, then you can use the --ha option to turn off this feature:

fly deploy --ha=false

Automatically start and stop Machines

Automatic start and stop works for process groups with services defined. This feature enables you to have multiple Machines that only run when needed.

Fly Proxy can start and stop existing Machines based on incoming requests, so that your app can accommodate bursts in demand without keeping extra Machines running constantly. And if your app needs to have one or more Machines always running in your primary region, then you can set a minimum number of Machines to keep running.

New V2 apps created with the fly launch command default to automatically starting and automatically stopping Fly Machines, with minimum machines running set to zero.

Some existing V2 apps, and any V2 apps that don’t have these settings in fly.toml, default to automatically starting but not automatically stopping Fly Machines.
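The settings described above might look like the following in fly.toml. Treat this as a sketch: the port and process group name are assumptions, and depending on your flyctl version, auto_stop_machines may expect a string value rather than a boolean, so check the configuration reference for your version.

```toml
[http_service]
  internal_port = 8080        # placeholder port
  processes = ["app"]         # assumed process group name

  # Let Fly Proxy stop idle Machines and start them again on demand.
  auto_stop_machines = true
  auto_start_machines = true

  # Zero means every Machine can be stopped when there is no traffic.
  # Raise this to 1 or more to keep Machines running in the primary region.
  min_machines_running = 0
```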

Get all the details about automatically stopping and starting Machines.

Health check-based routing

Fly Proxy routes network connections away from Machines that are failing health checks.

Health check-based routing works for process groups with services defined and is based on the configuration of services.tcp_checks or services.http_checks in your app’s fly.toml file.

If the configured health checks are failing for a Machine, then the proxy doesn’t route network connections to that Machine. The proxy routes connections to a healthy Machine instead. If there aren’t any healthy Machines, then the connection will block, waiting for a Machine to become healthy.
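A service-level check configuration of the kind the proxy uses for routing could be sketched as follows. The ports, intervals, and the /healthz path are illustrative assumptions; your app needs to serve whatever path you configure.

```toml
[[services]]
  internal_port = 8080        # placeholder port
  protocol = "tcp"
  processes = ["app"]         # assumed process group name

  [[services.ports]]
    port = 443
    handlers = ["tls", "http"]

  # TCP check: passes if a connection to internal_port succeeds.
  [[services.tcp_checks]]
    interval = "15s"
    timeout = "2s"
    grace_period = "5s"

  # HTTP check: passes if the request to the path succeeds.
  [[services.http_checks]]
    interval = "10s"
    timeout = "2s"
    grace_period = "5s"
    method = "get"
    path = "/healthz"         # hypothetical health endpoint
    protocol = "http"
```

Because these checks are attached to a specific service, a Machine that fails them stops receiving connections for that service until it passes again.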

Health check-based routing doesn’t work for the top-level health checks defined in the checks section of fly.toml, because the top-level checks don’t apply to specific services.
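For contrast, a top-level check is declared in its own checks section rather than under a service. The sketch below uses a hypothetical check name, port, and path; a check like this reports health status but isn't used by the proxy for routing.

```toml
[checks]
  # "app_alive" is a placeholder name for this check.
  [checks.app_alive]
    type = "http"
    port = 8080               # placeholder port
    method = "get"
    path = "/healthz"         # hypothetical health endpoint
    interval = "15s"
    timeout = "2s"
    grace_period = "5s"
```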