Graceful Shutdown: Lecture Notes


1. What is Graceful Shutdown?

The problem: A server restart is triggered mid-deployment. At that exact moment, your backend may be processing active requests, including payment transactions, database writes, or file uploads. What happens to those?

The answer: Graceful shutdown, the practice of stopping a backend application in an orderly way rather than abruptly killing it.

Analogy: When a restaurant closes, staff don’t turn off the lights and push customers out. They stop seating new customers, let existing ones finish their meals, then clean up and close. The same principle applies to backend servers.

Why it matters:

  • Prevents incomplete transactions (double charges, lost orders)
  • Avoids data corruption from interrupted DB writes
  • Prevents uncommitted transactions from holding locks that block or deadlock other work
  • Ensures a good user experience during deployments

2. Process Lifecycle

Every application runs as a process inside an operating system. Like living things, processes have a lifecycle:

Born     → process starts
Lives    → process executes
Dies     → process terminates

When the OS needs to stop a process, it doesn’t just “pull the plug.” It follows an established communication protocol using signals.


3. Unix Signals

Signals are one of the OS mechanisms for inter-process communication (IPC). Your application registers signal handlers: code that the OS invokes when a given signal arrives, so the application can respond appropriately.

3.1 SIGTERM: The Polite Request

Signal: SIGTERM (Signal + Terminate)
Origin: Deployment systems, process managers (Kubernetes, PM2, systemd)
Meaning: "Please finish up and shut down cleanly"
  • Sent programmatically by deployment tools, orchestration platforms
  • Application can catch and handle it
  • Gives the application time to complete in-flight work
  • The standard way production deployments stop services

Response: finish existing requests → clean up resources → exit


3.2 SIGINT: The Developer Interrupt

Signal: SIGINT (Signal + Interrupt)
Origin: Developer pressing Ctrl+C in the terminal
Meaning: "User-initiated shutdown"
  • Requires a human key press; primarily used in development
  • Application can catch and handle it
  • Should be handled identically to SIGTERM

Whether it’s a developer pressing Ctrl+C or Kubernetes sending SIGTERM, the intention is the same: shut down cleanly. Handle both signals with the same graceful shutdown logic.


3.3 SIGKILL: The Nuclear Option

Signal: SIGKILL (Signal + Kill)
Origin: OS, or manual kill -9
Meaning: "Stop immediately, no exceptions"
  • Application cannot catch it
  • Application cannot ignore it
  • Process is terminated instantly: no cleanup, no finishing in-flight work
  • Equivalent to pulling the power plug

If the process has not exited within the timeout window after SIGTERM, the supervisor that sent it (Kubernetes, systemd, PM2) escalates to SIGKILL. This is why handling SIGTERM gracefully is critical: it’s your last chance to clean up.
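
The difference between the signals can be observed directly. The sketch below assumes CPython on a POSIX system, running in the main thread; can_install_handler is a hypothetical helper written for these notes. It tries to install a do-nothing handler and reports whether the OS accepted it.

```python
import signal


def can_install_handler(sig):
    """Return True if the OS lets this process install a handler for sig."""
    try:
        previous = signal.signal(sig, lambda signum, frame: None)
    except (OSError, ValueError):
        return False
    if previous is not None:
        signal.signal(sig, previous)  # put the earlier handler back
    return True


print("SIGTERM catchable:", can_install_handler(signal.SIGTERM))
print("SIGKILL catchable:", can_install_handler(signal.SIGKILL))
```

On Linux the first check is expected to succeed and the second to fail, because the kernel refuses handlers for SIGKILL.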


4. Graceful Shutdown in Two Steps

Step 1: Connection Draining

Stop accepting new work, finish existing work.

SIGTERM received
       ↓
Stop accepting new connections/requests
       ↓
Allow in-flight requests to complete
       ↓
Proceed to resource cleanup

The restaurant analogy:

  1. Close the front door: stop seating new customers
  2. Announce last call: let existing customers finish
  3. Once everyone leaves: clean up and close

By application type:

Application             Connection Draining Behavior
HTTP server             Stop accepting new HTTP requests; let active requests complete
Database                Finish active queries/transactions; stop new ones
WebSocket server        Notify connected clients of closure; close sockets
Background job worker   Finish current task; stop picking up new tasks
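
Whatever the application type, draining comes down to two pieces of state: a "closed" flag and a count of in-flight work. The following is a rough Python sketch of that idea, assuming a thread-based server; RequestGate and its method names are invented for illustration and are not part of any library.

```python
import threading


class RequestGate:
    """Tracks in-flight work and refuses new work once draining starts."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_enter(self):
        """Call before handling a request; False means turn it away."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def leave(self):
        """Call when a request finishes, successfully or not."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def start_draining(self):
        """Close the front door: no new requests from now on."""
        with self._cond:
            self._draining = True

    def wait_idle(self, timeout=None):
        """Block until in-flight work reaches zero; False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)
```

A request handler would call try_enter() first and reject the request (for HTTP, typically with a 503) when it returns False, then call leave() in a finally block.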

The Timeout Problem

You can’t wait forever. Most systems set a graceful shutdown timeout (commonly 30 seconds):

SIGTERM
  ↓
Start draining connections
  ↓
Wait up to 30 seconds for in-flight work to finish
  ↓
If still running after timeout → force stop (SIGKILL behavior)

Choosing the right timeout:

  • Too short → legitimate requests get interrupted
  • Too long → deployments and rollbacks are slow, and stuck processes linger

Base it on your typical request duration. For standard CRUD APIs, 30 seconds is usually more than enough. For streaming, WebSocket, or long-running jobs, consider longer windows.
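
One detail that is easy to get wrong is that the timeout is a single budget for the whole drain, not a fresh allowance per worker. A small Python sketch of a shared deadline follows, assuming the in-flight work runs in threads; drain_with_deadline is a hypothetical helper, and the 30-second default simply mirrors the figure above.

```python
import threading
import time


def drain_with_deadline(workers, timeout_s=30.0):
    """Join worker threads against one shared deadline.

    Returns the workers that were still running when the budget ran out.
    """
    deadline = time.monotonic() + timeout_s
    for worker in workers:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # budget spent; do not wait on the rest
        worker.join(remaining)
    return [w for w in workers if w.is_alive()]
```

If the returned list is non-empty, a reasonable reaction is to log which work was abandoned and exit anyway, since the supervisor will force the stop regardless.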


Step 2: Resource Cleanup

Release everything the application acquired during its lifetime.

Resources to clean up:

Resource                  Why it matters
Database connections      Must commit or roll back open transactions; close TCP connections in pool
Network connections       OS limits open connections per process; unreleased connections → memory/resource leak
File handles              OS limits open files per process; unreleased handles → file descriptor leak
Redis/cache connections   Used for background job processing; stop workers cleanly before closing
Temporary files           Must be deleted; orphaned temp files accumulate over deployments
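
For the database row, here is a small sketch using Python's built-in sqlite3 module as a stand-in for whatever database the application really uses. The helper name close_database is illustrative, and rolling back (rather than committing) unfinished work is an assumption made for this example; the right choice depends on the application.

```python
import sqlite3


def close_database(conn):
    """Settle any open transaction, then close the connection."""
    try:
        if conn.in_transaction:
            # Work that was never committed is rolled back explicitly
            # instead of being left dangling.
            conn.rollback()
    finally:
        conn.close()
```

With a pooled client library the same idea usually applies per connection, followed by closing the pool itself.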

Critical rule: Clean up resources in the reverse order of acquisition.

Startup order:
  1. Connect to Redis
  2. Connect to database
  3. Start HTTP server

Shutdown order:
  1. Stop HTTP server (stop accepting new requests)
  2. Close database connections
  3. Close Redis connections

Why reverse order? If a component depends on another (e.g., the HTTP server sends jobs to Redis), you must stop the consumer before the dependency. Reverse order prevents cleaning up a resource that something else still depends on.
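
In Python, contextlib.ExitStack gives this last-in, first-out behavior without extra bookkeeping. The sketch below uses a hypothetical connect() stand-in that only records events, so the ordering is visible without real Redis or database clients.

```python
from contextlib import ExitStack

log = []  # records the order of events for the demo


def connect(name):
    """Hypothetical stand-in for opening a real connection."""
    log.append(f"open {name}")
    return lambda: log.append(f"close {name}")


with ExitStack() as stack:
    # Startup order from the notes: Redis, database, HTTP server.
    for name in ("redis", "database", "http server"):
        stack.callback(connect(name))
    # ... the application would run here ...

# Leaving the with-block runs the callbacks last-in, first-out.
print(log)
```

The log ends with the HTTP server closed first, then the database, then Redis, matching the shutdown order listed above.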


5. The Full Graceful Shutdown Flow

SIGTERM or SIGINT received
          ↓
Registered signal handler triggers shutdown function
          ↓
┌───────────────────────────────────────────────┐
│  Connection Draining                          │
│  • Stop HTTP server (no new connections)      │
│  • Wait for in-flight requests to finish      │
│  • Timeout: max 30 seconds                    │
└─────────────────────┬─────────────────────────┘
                      ↓
┌───────────────────────────────────────────────┐
│  Resource Cleanup (reverse acquisition order) │
│  • Close database connections                 │
│    (commit/rollback open transactions)        │
│  • Stop background job workers                │
│  • Close Redis/cache connections              │
│  • Release file handles                       │
│  • Clean up temporary files                   │
└─────────────────────┬─────────────────────────┘
                      ↓
              Log: "Server exited cleanly"
                      ↓
                   Process exits

6. Implementation (Language-Agnostic Pattern)

Most HTTP frameworks provide a server.Shutdown() or equivalent method that handles connection draining internally. The general pattern:

1. Register SIGTERM and SIGINT handlers
2. When signal received, call handler:
   a. Call server.Shutdown(ctx): stops new requests, waits for existing ones
   b. Close database connections (after HTTP server finishes)
   c. Close Redis/cache connections
   d. Close any other acquired resources
   e. Log success message
3. Process exits cleanly
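
The pattern above can be sketched with Python's standard http.server module. This is one possible arrangement under stated assumptions, not a reference implementation: DrainingHTTPServer, HelloHandler, and run_until_signal are names invented for these notes, and cleanup_steps is assumed to be a list of (name, close_function) pairs supplied by the caller in the order they should run.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class DrainingHTTPServer(ThreadingHTTPServer):
    # ThreadingHTTPServer defaults to daemon threads. With non-daemon
    # threads, server_close() joins them, so in-flight requests finish.
    daemon_threads = False


class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def run_until_signal(server, cleanup_steps, log=print):
    """Serve until SIGTERM or SIGINT, then drain and clean up."""
    stop = threading.Event()

    def handle_signal(signum, frame):
        stop.set()  # only set a flag; the real work happens below

    # Step 1: register the same handler for both signals.
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handle_signal)

    # serve_forever() runs in a helper thread because shutdown() must be
    # called from a different thread than the one running the loop.
    serving = threading.Thread(target=server.serve_forever)
    serving.start()

    while not stop.wait(0.2):  # short waits keep the main thread responsive
        pass

    server.shutdown()       # step 2a: stop the accept loop
    serving.join()
    server.server_close()   # close the socket, wait for request threads
    for name, close in cleanup_steps:  # steps 2b-2d, in the order given
        close()
        log(f"closed {name}")
    log("Server exited cleanly")       # step 2e
```

A caller might create the server with DrainingHTTPServer(("127.0.0.1", 8080), HelloHandler) and pass cleanup steps for its own database and cache clients. The sketch leaves out an overall deadline for brevity; a production version would bound the wait as discussed in the timeout section.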

Libraries that handle this:

  • Node.js: server.close() on the http.Server that Express app.listen() returns, Fastify fastify.close()
  • Go: http.Server.Shutdown(ctx)
  • Python: signal module + server.shutdown()
  • Most frameworks have this built in; start from the boilerplate in their docs

7. Zero Downtime Deployments

Graceful shutdown is one piece of a zero-downtime deployment strategy.

Blue-green deployment flow:

1. New server (new code) starts up and passes health checks
2. Load balancer shifts traffic to new server
3. Old server receives SIGTERM → begins graceful shutdown
4. Old server drains connections → cleans up → exits

Graceful shutdown ensures steps 3–4 don’t cut off or corrupt in-flight requests on the old server during the transition.


Quick Revision Checklist

  • Graceful shutdown = stop accepting new work, finish existing work, clean up resources, then exit
  • Every application runs as an OS process; shutdown is managed via Unix signals
  • SIGTERM = polite request to stop (sent by Kubernetes, PM2, systemd); can be caught and handled
  • SIGINT = Ctrl+C from developer; can be caught and handled; treat identically to SIGTERM
  • SIGKILL = instant termination; cannot be caught or ignored; no cleanup possible
  • If the process has not exited within the timeout after SIGTERM → the process manager or orchestrator escalates to SIGKILL
  • Connection draining: stop new connections first, let in-flight requests finish
  • Graceful shutdown timeout: typically 30 seconds; tune based on your request duration
  • Resource cleanup: database connections, network connections, file handles, Redis, temp files
  • Clean up in reverse acquisition order: prevents cleaning up a dependency before its consumers have stopped
  • Database: commit or rollback open transactions before closing; never leave transactions dangling
  • Most frameworks provide a shutdown() method; the boilerplate is copy-paste, and understanding why matters more than memorizing how
  • Graceful shutdown + zero-downtime deployment = no requests lost during production deploys