The Problem with Deploy Fridays
We had this Rust API gateway handling over 100k requests per minute, and every time we deployed, we'd take the whole thing down for a few minutes. Doesn't sound like much, but at that volume even 30 seconds of downtime means tens of thousands of failed requests, angry users, and support tickets flooding in.
The old deployment process was pretty brutal - stop the service, swap in the new code, start it back up, and pray nothing broke. It worked fine when we were small, but as traffic grew, those few minutes of downtime became a real problem.
What Zero-Downtime Actually Means
The idea is pretty straightforward - instead of stopping everything, you gracefully shut down the old service while spinning up the new one. The tricky part is making sure existing connections can finish what they're doing while not accepting any new ones.
Here's what a proper graceful shutdown looks like:
- Tell the load balancer you're going away (start failing health checks)
- Stop taking new connections
- Let existing requests finish
- Clean up properly
In Rust, this actually works out pretty nicely with Tokio's async stuff:
```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use tokio::signal;
use tokio::sync::Notify;
use actix_web::{web, App, HttpServer, HttpResponse, middleware::Logger};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    env_logger::init();

    // The Notify wakes the main task up; the AtomicBool is the flag the
    // request handlers check to see whether we're shutting down
    let shutdown_notify = Arc::new(Notify::new());
    let shutdown_flag = Arc::new(AtomicBool::new(false));

    // Listen for Ctrl+C (wire up SIGTERM the same way for real deployments)
    let notify_for_signal = shutdown_notify.clone();
    let flag_for_signal = shutdown_flag.clone();
    tokio::spawn(async move {
        signal::ctrl_c().await.expect("Failed to listen for Ctrl+C");
        println!("Got shutdown signal, starting graceful shutdown...");
        flag_for_signal.store(true, Ordering::SeqCst);
        notify_for_signal.notify_waiters();
    });

    let flag_for_app = shutdown_flag.clone();
    let server = HttpServer::new(move || {
        App::new()
            .wrap(Logger::default())
            .app_data(web::Data::new(flag_for_app.clone()))
            .route("/health", web::get().to(health_check))
            .route("/api/{path:.*}", web::to(proxy_handler))
    })
    .bind("0.0.0.0:8080")?
    .workers(num_cpus::get())
    // We handle shutdown ourselves, so stop actix from also catching signals
    .disable_signals()
    .run();

    // Handle graceful shutdown
    let server_handle = server.handle();

    tokio::select! {
        _ = server => {},
        _ = shutdown_notify.notified() => {
            println!("Shutting down gracefully...");
            server_handle.stop(true).await;
        }
    }

    Ok(())
}
```
Health Checks Are Critical
The health check endpoint is probably the most important part of this whole thing. It's how your load balancer knows whether to send traffic to your instance or not:
```rust
async fn health_check(shutdown_flag: web::Data<Arc<AtomicBool>>) -> HttpResponse {
    // If we're shutting down, tell the load balancer to stop sending traffic
    if shutdown_flag.load(Ordering::SeqCst) {
        return HttpResponse::ServiceUnavailable().json(serde_json::json!({
            "status": "shutting_down",
            "message": "Service is gracefully shutting down"
        }));
    }

    // You could add more checks here:
    // - Database connection
    // - External API dependencies
    // - Memory/CPU usage

    HttpResponse::Ok().json(serde_json::json!({
        "status": "healthy",
        "timestamp": chrono::Utc::now().to_rfc3339(),
        "version": env!("CARGO_PKG_VERSION")
    }))
}
```
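If you want the extra checks mentioned in those comments, here's a rough sketch of what a deeper readiness probe could look like. The `reqwest::Client` in app data and the backend URL are assumptions for illustration - swap in whatever your gateway actually depends on:

```rust
use std::time::Duration;

// Hypothetical deeper check - assumes a reqwest::Client was registered in
// app data and that the URL below is the upstream this gateway proxies to
async fn readiness_check(
    shutdown_flag: web::Data<Arc<AtomicBool>>,
    client: web::Data<reqwest::Client>,
) -> HttpResponse {
    if shutdown_flag.load(Ordering::SeqCst) {
        return HttpResponse::ServiceUnavailable()
            .json(serde_json::json!({ "status": "shutting_down" }));
    }

    // Ping the upstream backend (placeholder URL)
    let upstream_ok = client
        .get("http://backend.internal:9000/health")
        .timeout(Duration::from_secs(2))
        .send()
        .await
        .map(|resp| resp.status().is_success())
        .unwrap_or(false);

    if upstream_ok {
        HttpResponse::Ok().json(serde_json::json!({ "status": "healthy" }))
    } else {
        HttpResponse::ServiceUnavailable().json(serde_json::json!({ "status": "degraded" }))
    }
}
```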
How the Deployment Actually Works:
- Start new service instance on a different port
- Wait for it to report healthy
- Add new instance to load balancer rotation
- Send shutdown signal to old instance
- Old instance health check starts returning 503
- Load balancer stops sending new traffic to old instance
- Old instance finishes handling existing requests
- Old instance shuts down cleanly
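To make that list concrete, here's a rough sketch of the single-host version of the dance. The release path, the PORT variable, port 8081, and the systemd unit name are all placeholders, and step 3 depends entirely on which load balancer you run:

```rust
use std::process::Command;
use std::time::Duration;

// Hypothetical single-host deploy - every path and name here is a placeholder
async fn deploy_new_binary() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Start the new build on an alternate port
    Command::new("/opt/gateway/releases/new/api-gateway")
        .env("PORT", "8081")
        .spawn()?;

    // 2. Wait for it to report healthy
    let client = reqwest::Client::new();
    let mut healthy = false;
    for _ in 0..30 {
        if let Ok(resp) = client.get("http://127.0.0.1:8081/health").send().await {
            if resp.status().is_success() {
                healthy = true;
                break;
            }
        }
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
    if !healthy {
        return Err("new instance never became healthy".into());
    }

    // 3. Point the load balancer at :8081 - whatever API or config reload
    //    your load balancer uses goes here

    // 4. Ask the old instance to shut down gracefully, e.g. via SIGTERM
    //    (placeholder unit name - adapt to however you run the service)
    Command::new("systemctl")
        .args(["kill", "--signal=SIGTERM", "api-gateway-old.service"])
        .status()?;

    Ok(())
}
```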
The Stuff That Made Me Pull My Hair Out
Rust's ownership model: At first I was fighting with the borrow checker trying to share the shutdown signal everywhere. Turns out `Arc` is exactly what you need - clone it cheaply and hand a copy to every task and handler that cares about shutdown.
```rust
// This doesn't work - a plain mutable bool can't be shared across tasks
let mut shutdown_flag = false;

// This works - Arc gives you shared ownership
let shutdown_flag = Arc::new(AtomicBool::new(false));
let shutdown_clone = shutdown_flag.clone();
```
Long-running requests: The tricky part is when someone's uploading a big file or streaming data. You can't just kill their connection, but you also can't wait forever.
```rust
use tokio::time::{timeout, Duration};
use actix_web::HttpRequest;
use actix_web::web::Bytes;
use futures::StreamExt;

async fn proxy_handler(
    req: HttpRequest,
    body: web::Payload,
    shutdown_flag: web::Data<Arc<AtomicBool>>,
) -> Result<HttpResponse, actix_web::Error> {
    // Don't process new requests if we're shutting down
    if shutdown_flag.load(Ordering::SeqCst) {
        return Ok(HttpResponse::ServiceUnavailable()
            .json("Service is shutting down"));
    }

    // For long requests, check the shutdown flag periodically
    let body_bytes = timeout(
        Duration::from_secs(30), // Give up after 30 seconds
        collect_body_with_shutdown_check(body, shutdown_flag.clone()),
    )
    .await
    .map_err(|_| actix_web::error::ErrorRequestTimeout("Request timeout"))?
    .map_err(|e| actix_web::error::ErrorBadRequest(e))?;

    // Forward to the actual backend (proxy_to_backend is defined elsewhere)
    proxy_to_backend(req, body_bytes).await
}

async fn collect_body_with_shutdown_check(
    mut body: web::Payload,
    shutdown_flag: web::Data<Arc<AtomicBool>>,
) -> Result<Bytes, Box<dyn std::error::Error>> {
    let mut bytes = web::BytesMut::new();

    while let Some(chunk) = body.next().await {
        // Bail out if shutdown was requested
        if shutdown_flag.load(Ordering::SeqCst) {
            return Err("Shutdown requested during request processing".into());
        }

        let chunk = chunk?;
        bytes.extend_from_slice(&chunk);

        // Don't let people upload huge files
        if bytes.len() > 10_000_000 { // 10MB limit
            return Err("Request body too large".into());
        }
    }

    Ok(bytes.freeze())
}
```
Load balancer timing: This was probably the hardest part to get right. You need to coordinate the timing between your health checks failing and actually stopping the service. Every load balancer is different, so you need to tune this.
```rust
use actix_web::dev::ServerHandle;

// These numbers took some trial and error to get right
const HEALTH_CHECK_GRACE_PERIOD: Duration = Duration::from_secs(15);
const CONNECTION_DRAIN_TIMEOUT: Duration = Duration::from_secs(30);

async fn graceful_shutdown(server_handle: ServerHandle, shutdown_flag: Arc<AtomicBool>) {
    // Start failing health checks
    shutdown_flag.store(true, Ordering::SeqCst);
    println!("Health checks now failing...");

    // Wait for the load balancer to notice and stop sending traffic
    tokio::time::sleep(HEALTH_CHECK_GRACE_PERIOD).await;
    println!("Grace period over, stopping server...");

    // Stop accepting new connections and give in-flight requests time to finish
    println!("Waiting for connections to finish...");
    let drained = tokio::time::timeout(CONNECTION_DRAIN_TIMEOUT, server_handle.stop(true)).await;

    // Force close anything that's still hanging around
    if drained.is_err() {
        server_handle.stop(false).await;
    }

    println!("Shutdown complete");
}
```
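One thing these snippets don't show is where `graceful_shutdown` actually gets called. A natural spot is main's `select!` branch, instead of calling `stop(true)` directly there - something like this, reusing the names from the first snippet:

```rust
tokio::select! {
    _ = server => {},
    _ = shutdown_notify.notified() => {
        graceful_shutdown(server_handle.clone(), shutdown_flag.clone()).await;
    }
}
```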
Getting Fancy with Rolling Deployments
Once I had basic zero-downtime working, I built a system that could deploy to multiple instances gradually instead of all at once:
```rust
// Simplified version of the deployment coordinator
use std::time::Duration;
use serde_json::Value;
use reqwest::Client;

struct DeploymentManager {
    client: Client,
    load_balancer_api: String,
    instances: Vec<String>,
}

impl DeploymentManager {
    async fn rolling_deploy(&self, new_version: &str) -> Result<(), Box<dyn std::error::Error>> {
        for instance in &self.instances {
            println!("Deploying to instance: {}", instance);

            // Deploy new version
            self.deploy_instance(instance, new_version).await?;

            // Wait for it to be healthy
            self.wait_for_healthy(instance).await?;

            // Switch traffic over
            self.shift_traffic_to_instance(instance, 100).await?;

            // Kill the old version
            self.shutdown_old_version(instance).await?;

            println!("Instance {} done", instance);

            // Brief pause between instances
            tokio::time::sleep(Duration::from_secs(10)).await;
        }

        Ok(())
    }

    async fn wait_for_healthy(&self, instance: &str) -> Result<(), Box<dyn std::error::Error>> {
        let health_url = format!("http://{}:8080/health", instance);

        for attempt in 1..=30 {
            match self.client.get(&health_url).send().await {
                Ok(response) if response.status().is_success() => {
                    let health: Value = response.json().await?;
                    if health["status"] == "healthy" {
                        println!("Instance {} is healthy", instance);
                        return Ok(());
                    }
                }
                _ => {}
            }

            if attempt < 30 {
                println!("Attempt {}: Instance {} not ready yet...", attempt, instance);
                tokio::time::sleep(Duration::from_secs(2)).await;
            }
        }

        Err(format!("Instance {} never became healthy", instance).into())
    }

    // deploy_instance, shift_traffic_to_instance and shutdown_old_version are
    // specific to our setup, so they're left out here
}
```
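The coordinator also needs a way to back out when an instance never goes healthy - that's where the automatic rollbacks mentioned below come from. Here's a rough sketch of that path; `deploy_or_rollback` and `previous_version` are hypothetical, and the helpers are the same ones elided above:

```rust
impl DeploymentManager {
    // Hypothetical rollback path: if the new build never reports healthy,
    // put the previous version back and run it through the same health gate
    async fn deploy_or_rollback(
        &self,
        instance: &str,
        new_version: &str,
        previous_version: &str,
    ) -> Result<(), Box<dyn std::error::Error>> {
        self.deploy_instance(instance, new_version).await?;

        if self.wait_for_healthy(instance).await.is_err() {
            println!("Instance {} failed health checks, rolling back...", instance);
            self.deploy_instance(instance, previous_version).await?;
            self.wait_for_healthy(instance).await?;
            return Err(format!("Rolled back {} to {}", instance, previous_version).into());
        }

        self.shift_traffic_to_instance(instance, 100).await?;
        Ok(())
    }
}
```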
Keeping Track of Everything
Zero-downtime deployments only work if you can actually verify they're working. I added metrics to track everything:
```rust
use prometheus::{Encoder, TextEncoder, Counter, Histogram, register_counter, register_histogram};

lazy_static::lazy_static! {
    static ref DEPLOYMENT_COUNTER: Counter = register_counter!(
        "deployments_total",
        "Total number of deployments"
    ).unwrap();

    static ref SHUTDOWN_DURATION: Histogram = register_histogram!(
        "shutdown_duration_seconds",
        "How long graceful shutdown takes"
    ).unwrap();

    static ref ACTIVE_CONNECTIONS: prometheus::IntGauge =
        prometheus::register_int_gauge!(
            "active_connections",
            "Current active connections"
        ).unwrap();
}

async fn metrics_handler() -> HttpResponse {
    let encoder = TextEncoder::new();
    let metric_families = prometheus::gather();

    let mut buffer = Vec::new();
    encoder.encode(&metric_families, &mut buffer).unwrap();

    HttpResponse::Ok()
        .content_type("text/plain; version=0.0.4")
        .body(buffer)
}
```
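The ACTIVE_CONNECTIONS gauge only earns its keep if something updates it. Strictly speaking it ends up tracking in-flight requests rather than TCP connections, but a small `wrap_fn` middleware gets you most of the way - a sketch, assuming the statics above are in scope:

```rust
use actix_web::dev::Service as _;

// Inside the HttpServer::new factory closure:
App::new()
    .wrap_fn(|req, srv| {
        // Gauge goes up while a request is in flight, back down when it's done
        ACTIVE_CONNECTIONS.inc();
        let fut = srv.call(req);
        async move {
            let res = fut.await;
            ACTIVE_CONNECTIONS.dec();
            res
        }
    })
    .route("/metrics", web::get().to(metrics_handler))
```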
How Well It Actually Works
After getting this all set up, the results were pretty solid:
- Zero failed requests during deployments (used to be ~2,000)
- 15-second deployment window per instance (down from 5+ minutes of total downtime)
- 99.99% uptime over 8 months in production
- Automatic rollbacks when new deployments fail health checks
What I Learned
Graceful shutdown is worth the effort: It's not just about deployments - proper shutdown handling also helps when servers crash or you need to do maintenance.
Your app needs to work with your infrastructure: The best application code in the world won't help if your load balancer configuration is wrong or your deployment process is broken.
Monitor everything: You can't tell if zero-downtime deployments are actually working without good metrics on connections, response times, and error rates.
Test the failure cases: The real test isn't when everything goes perfectly - it's when deployments fail, networks partition, or other weird stuff happens.
This system has been running in production for 8 months now, handling 50+ deployments per week. No more dreading Friday afternoon releases, and our users never see downtime from deployments anymore.