The Problem with Deploy Fridays
We had this Rust API gateway handling over 100k requests per minute, and every time we deployed, we'd take the whole thing down for a few minutes. Doesn't sound like much, but at that volume even 30 seconds of downtime means tens of thousands of failed requests, angry users, and support tickets flooding in.
The old deployment process was pretty brutal - stop the service, swap in the new code, start it back up, and pray nothing broke. It worked fine when we were small, but as traffic grew, those few minutes of downtime became a real problem.
What Zero-Downtime Actually Means
The idea is pretty straightforward - instead of stopping everything, you gracefully shut down the old service while spinning up the new one. The tricky part is making sure existing connections can finish what they're doing while not accepting any new ones.
Here's what a proper graceful shutdown looks like:
- Tell the load balancer you're going away (start failing health checks)
- Stop taking new connections
- Let existing requests finish
- Clean up properly
In Rust, this actually works out pretty nicely with Tokio's async stuff:
```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use tokio::signal;
use tokio::sync::Notify;
use actix_web::{web, App, HttpServer, HttpResponse, middleware::Logger};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    env_logger::init();

    // The Notify wakes the main task up; the AtomicBool is the flag the
    // request handlers check to see whether we're shutting down
    let shutdown_notify = Arc::new(Notify::new());
    let shutdown_flag = Arc::new(AtomicBool::new(false));

    // Listen for Ctrl+C (wire up SIGTERM the same way for real deployments)
    let notify_for_signal = shutdown_notify.clone();
    let flag_for_signal = shutdown_flag.clone();
    tokio::spawn(async move {
        signal::ctrl_c().await.expect("Failed to listen for Ctrl+C");
        println!("Got shutdown signal, starting graceful shutdown...");
        flag_for_signal.store(true, Ordering::SeqCst);
        notify_for_signal.notify_waiters();
    });

    let flag_for_app = shutdown_flag.clone();
    let server = HttpServer::new(move || {
        App::new()
            .wrap(Logger::default())
            .app_data(web::Data::new(flag_for_app.clone()))
            .route("/health", web::get().to(health_check))
            .route("/api/{path:.*}", web::to(proxy_handler))
    })
    .bind("0.0.0.0:8080")?
    .workers(num_cpus::get())
    // We handle shutdown ourselves, so stop actix from also catching signals
    .disable_signals()
    .run();

    // Handle graceful shutdown
    let server_handle = server.handle();

    tokio::select! {
        _ = server => {},
        _ = shutdown_notify.notified() => {
            println!("Shutting down gracefully...");
            server_handle.stop(true).await;
        }
    }

    Ok(())
}
```
Health Checks Are Critical
The health check endpoint is probably the most important part of this whole thing. It's how your load balancer knows whether to send traffic to your instance or not:
```rust
async fn health_check(shutdown_flag: web::Data<Arc<AtomicBool>>) -> HttpResponse {
    // If we're shutting down, tell the load balancer to stop sending traffic
    if shutdown_flag.load(Ordering::SeqCst) {
        return HttpResponse::ServiceUnavailable().json(serde_json::json!({
            "status": "shutting_down",
            "message": "Service is gracefully shutting down"
        }));
    }

    // You could add more checks here:
    // - Database connection
    // - External API dependencies
    // - Memory/CPU usage

    HttpResponse::Ok().json(serde_json::json!({
        "status": "healthy",
        "timestamp": chrono::Utc::now().to_rfc3339(),
        "version": env!("CARGO_PKG_VERSION")
    }))
}
```
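If you want the extra checks mentioned in those comments, here's a rough sketch of what a deeper readiness probe could look like. The `reqwest::Client` in app data and the backend URL are assumptions for illustration - swap in whatever your gateway actually depends on:

```rust
use std::time::Duration;

// Hypothetical deeper check - assumes a reqwest::Client was registered in
// app data and that the URL below is the upstream this gateway proxies to
async fn readiness_check(
    shutdown_flag: web::Data<Arc<AtomicBool>>,
    client: web::Data<reqwest::Client>,
) -> HttpResponse {
    if shutdown_flag.load(Ordering::SeqCst) {
        return HttpResponse::ServiceUnavailable()
            .json(serde_json::json!({ "status": "shutting_down" }));
    }

    // Ping the upstream backend (placeholder URL)
    let upstream_ok = client
        .get("http://backend.internal:9000/health")
        .timeout(Duration::from_secs(2))
        .send()
        .await
        .map(|resp| resp.status().is_success())
        .unwrap_or(false);

    if upstream_ok {
        HttpResponse::Ok().json(serde_json::json!({ "status": "healthy" }))
    } else {
        HttpResponse::ServiceUnavailable().json(serde_json::json!({ "status": "degraded" }))
    }
}
```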
How the Deployment Actually Works:
- Start new service instance on a different port
- Wait for it to report healthy
- Add new instance to load balancer rotation
- Send shutdown signal to old instance
- Old instance health check starts returning 503
- Load balancer stops sending new traffic to old instance
- Old instance finishes handling existing requests
- Old instance shuts down cleanly
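To make that list concrete, here's a rough sketch of the single-host version of the dance. The release path, the PORT variable, port 8081, and the systemd unit name are all placeholders, and step 3 depends entirely on which load balancer you run:

```rust
use std::process::Command;
use std::time::Duration;

// Hypothetical single-host deploy - every path and name here is a placeholder
async fn deploy_new_binary() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Start the new build on an alternate port
    Command::new("/opt/gateway/releases/new/api-gateway")
        .env("PORT", "8081")
        .spawn()?;

    // 2. Wait for it to report healthy
    let client = reqwest::Client::new();
    let mut healthy = false;
    for _ in 0..30 {
        if let Ok(resp) = client.get("http://127.0.0.1:8081/health").send().await {
            if resp.status().is_success() {
                healthy = true;
                break;
            }
        }
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
    if !healthy {
        return Err("new instance never became healthy".into());
    }

    // 3. Point the load balancer at :8081 - whatever API or config reload
    //    your load balancer uses goes here

    // 4. Ask the old instance to shut down gracefully, e.g. via SIGTERM
    //    (placeholder unit name - adapt to however you run the service)
    Command::new("systemctl")
        .args(["kill", "--signal=SIGTERM", "api-gateway-old.service"])
        .status()?;

    Ok(())
}
```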
The Stuff That Made Me Pull My Hair Out
Rust's ownership model: At first I was fighting with the borrow checker trying to share the shutdown signal everywhere. Turns out `Arc` is exactly what you need - clone it cheaply and hand a copy to every task and handler that cares about shutdown.
```rust
// This doesn't work - a plain mutable bool can't be shared across tasks
let mut shutdown_flag = false;

// This works - Arc gives you shared ownership
let shutdown_flag = Arc::new(AtomicBool::new(false));
let shutdown_clone = shutdown_flag.clone();
```
Long-running requests: The tricky part is when someone's uploading a big file or streaming data. You can't just kill their connection, but you also can't wait forever.
```rust
use tokio::time::{timeout, Duration};
use actix_web::HttpRequest;
use actix_web::web::Bytes;
use futures::StreamExt;

async fn proxy_handler(
    req: HttpRequest,
    body: web::Payload,
    shutdown_flag: web::Data<Arc<AtomicBool>>,
) -> Result<HttpResponse, actix_web::Error> {
    // Don't process new requests if we're shutting down
    if shutdown_flag.load(Ordering::SeqCst) {
        return Ok(HttpResponse::ServiceUnavailable()
            .json("Service is shutting down"));
    }

    // For long requests, check the shutdown flag periodically
    let body_bytes = timeout(
        Duration::from_secs(30), // Give up after 30 seconds
        collect_body_with_shutdown_check(body, shutdown_flag.clone()),
    )
    .await
    .map_err(|_| actix_web::error::ErrorRequestTimeout("Request timeout"))?
    .map_err(|e| actix_web::error::ErrorBadRequest(e))?;

    // Forward to the actual backend (proxy_to_backend is defined elsewhere)
    proxy_to_backend(req, body_bytes).await
}

async fn collect_body_with_shutdown_check(
    mut body: web::Payload,
    shutdown_flag: web::Data<Arc<AtomicBool>>,
) -> Result<Bytes, Box<dyn std::error::Error>> {
    let mut bytes = web::BytesMut::new();

    while let Some(chunk) = body.next().await {
        // Bail out if shutdown was requested
        if shutdown_flag.load(Ordering::SeqCst) {
            return Err("Shutdown requested during request processing".into());
        }

        let chunk = chunk?;
        bytes.extend_from_slice(&chunk);

        // Don't let people upload huge files
        if bytes.len() > 10_000_000 { // 10MB limit
            return Err("Request body too large".into());
        }
    }

    Ok(bytes.freeze())
}
```
Load balancer timing: This was probably the hardest part to get right. You need to coordinate the timing between your health checks failing and actually stopping the service. Every load balancer is different, so you need to tune this.
```rust
use actix_web::dev::ServerHandle;

// These numbers took some trial and error to get right
const HEALTH_CHECK_GRACE_PERIOD: Duration = Duration::from_secs(15);
const CONNECTION_DRAIN_TIMEOUT: Duration = Duration::from_secs(30);

async fn graceful_shutdown(server_handle: ServerHandle, shutdown_flag: Arc<AtomicBool>) {
    // Start failing health checks
    shutdown_flag.store(true, Ordering::SeqCst);
    println!("Health checks now failing...");

    // Wait for the load balancer to notice and stop sending traffic
    tokio::time::sleep(HEALTH_CHECK_GRACE_PERIOD).await;
    println!("Grace period over, stopping server...");

    // Stop accepting new connections and give in-flight requests time to finish
    println!("Waiting for connections to finish...");
    let drained = tokio::time::timeout(CONNECTION_DRAIN_TIMEOUT, server_handle.stop(true)).await;

    // Force close anything that's still hanging around
    if drained.is_err() {
        server_handle.stop(false).await;
    }

    println!("Shutdown complete");
}
```
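One thing these snippets don't show is where `graceful_shutdown` actually gets called. A natural spot is main's `select!` branch, instead of calling `stop(true)` directly there - something like this, reusing the names from the first snippet:

```rust
tokio::select! {
    _ = server => {},
    _ = shutdown_notify.notified() => {
        graceful_shutdown(server_handle.clone(), shutdown_flag.clone()).await;
    }
}
```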
Getting Fancy with Rolling Deployments
Once I had basic zero-downtime working, I built a system that could deploy to multiple instances gradually instead of all at once:
```rust
// Simplified version of the deployment coordinator
use std::time::Duration;
use serde_json::Value;
use reqwest::Client;

struct DeploymentManager {
    client: Client,
    load_balancer_api: String,
    instances: Vec<String>,
}

impl DeploymentManager {
    async fn rolling_deploy(&self, new_version: &str) -> Result<(), Box<dyn std::error::Error>> {
        for instance in &self.instances {
            println!("Deploying to instance: {}", instance);

            // Deploy new version
            self.deploy_instance(instance, new_version).await?;

            // Wait for it to be healthy
            self.wait_for_healthy(instance).await?;

            // Switch traffic over
            self.shift_traffic_to_instance(instance, 100).await?;

            // Kill the old version
            self.shutdown_old_version(instance).await?;

            println!("Instance {} done", instance);

            // Brief pause between instances
            tokio::time::sleep(Duration::from_secs(10)).await;
        }

        Ok(())
    }

    async fn wait_for_healthy(&self, instance: &str) -> Result<(), Box<dyn std::error::Error>> {
        let health_url = format!("http://{}:8080/health", instance);

        for attempt in 1..=30 {
            match self.client.get(&health_url).send().await {
                Ok(response) if response.status().is_success() => {
                    let health: Value = response.json().await?;
                    if health["status"] == "healthy" {
                        println!("Instance {} is healthy", instance);
                        return Ok(());
                    }
                }
                _ => {}
            }

            if attempt < 30 {
                println!("Attempt {}: Instance {} not ready yet...", attempt, instance);
                tokio::time::sleep(Duration::from_secs(2)).await;
            }
        }

        Err(format!("Instance {} never became healthy", instance).into())
    }

    // deploy_instance, shift_traffic_to_instance and shutdown_old_version are
    // specific to our setup, so they're left out here
}
```
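The coordinator also needs a way to back out when an instance never goes healthy - that's where the automatic rollbacks mentioned below come from. Here's a rough sketch of that path; `deploy_or_rollback` and `previous_version` are hypothetical, and the helpers are the same ones elided above:

```rust
impl DeploymentManager {
    // Hypothetical rollback path: if the new build never reports healthy,
    // put the previous version back and run it through the same health gate
    async fn deploy_or_rollback(
        &self,
        instance: &str,
        new_version: &str,
        previous_version: &str,
    ) -> Result<(), Box<dyn std::error::Error>> {
        self.deploy_instance(instance, new_version).await?;

        if self.wait_for_healthy(instance).await.is_err() {
            println!("Instance {} failed health checks, rolling back...", instance);
            self.deploy_instance(instance, previous_version).await?;
            self.wait_for_healthy(instance).await?;
            return Err(format!("Rolled back {} to {}", instance, previous_version).into());
        }

        self.shift_traffic_to_instance(instance, 100).await?;
        Ok(())
    }
}
```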
Keeping Track of Everything
Zero-downtime deployments only work if you can actually verify they're working. I added metrics to track everything:
```rust
use prometheus::{Encoder, TextEncoder, Counter, Histogram, register_counter, register_histogram};

lazy_static::lazy_static! {
    static ref DEPLOYMENT_COUNTER: Counter = register_counter!(
        "deployments_total",
        "Total number of deployments"
    ).unwrap();

    static ref SHUTDOWN_DURATION: Histogram = register_histogram!(
        "shutdown_duration_seconds",
        "How long graceful shutdown takes"
    ).unwrap();

    static ref ACTIVE_CONNECTIONS: prometheus::IntGauge =
        prometheus::register_int_gauge!(
            "active_connections",
            "Current active connections"
        ).unwrap();
}

async fn metrics_handler() -> HttpResponse {
    let encoder = TextEncoder::new();
    let metric_families = prometheus::gather();

    let mut buffer = Vec::new();
    encoder.encode(&metric_families, &mut buffer).unwrap();

    HttpResponse::Ok()
        .content_type("text/plain; version=0.0.4")
        .body(buffer)
}
```
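The ACTIVE_CONNECTIONS gauge only earns its keep if something updates it. Strictly speaking it ends up tracking in-flight requests rather than TCP connections, but a small `wrap_fn` middleware gets you most of the way - a sketch, assuming the statics above are in scope:

```rust
use actix_web::dev::Service as _;

// Inside the HttpServer::new factory closure:
App::new()
    .wrap_fn(|req, srv| {
        // Gauge goes up while a request is in flight, back down when it's done
        ACTIVE_CONNECTIONS.inc();
        let fut = srv.call(req);
        async move {
            let res = fut.await;
            ACTIVE_CONNECTIONS.dec();
            res
        }
    })
    .route("/metrics", web::get().to(metrics_handler))
```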
How Well It Actually Works
After getting this all set up, the results were pretty solid:
- Zero failed requests during deployments (used to be ~2,000)
- 15-second deployment window per instance (down from 5+ minutes of total downtime)
- 99.99% uptime over 8 months in production
- Automatic rollbacks when new deployments fail health checks
What I Learned
Graceful shutdown is worth the effort: It's not just about deployments - proper shutdown handling also helps when servers crash or you need to do maintenance.
Your app needs to work with your infrastructure: The best application code in the world won't help if your load balancer configuration is wrong or your deployment process is broken.
Monitor everything: You can't tell if zero-downtime deployments are actually working without good metrics on connections, response times, and error rates.
Test the failure cases: The real test isn't when everything goes perfectly - it's when deployments fail, networks partition, or other weird stuff happens.
This system has been running in production for 8 months now, handling 50+ deployments per week. No more dreading Friday afternoon releases, and our users never see downtime from deployments anymore.