Error Handling & Building Fault Tolerant Systems: Lecture Notes
Core mindset: Errors are not exceptions; they are a normal, expected part of running a backend. The question is not whether errors will happen, but how prepared you are when they do.
1. Types of Errors
1.1 Logic Errors
The most dangerous type. The app doesn't crash: it runs without raising any error but produces wrong results.
Why dangerous:
- Silent by nature: no alerts, no crashes
- Can go undetected for weeks or months
- Can cause financial damage (e.g., discount applied twice → negative shipping cost)
Common causes:
- Misunderstood requirements (ambiguous meetings → wrong implementation)
- Incorrect algorithm (miscalculation in a discount or pricing workflow)
- Unhandled edge cases in business logic
What to watch for: Sudden unexpected business metric changes (e.g., revenue drop, order anomalies).
1.2 Database Errors
Can bring the entire system down. Most backend apps are wholly dependent on their database.
| Error Type | Description | Common Cause |
|---|---|---|
| Connection errors | App cannot talk to the DB | Network issues, DB server overloaded, connection pool exhausted |
| Constraint violations | Operation breaks DB rules | Inserting duplicate email (unique violation), inserting with non-existent foreign key |
| Query errors | Malformed SQL | Typo in table/column name, query too complex → timeout |
| Deadlocks | Circular wait between transactions | Multiple operations waiting on each other's locks |
Connection pooling: Backends maintain a pool of open TCP connections to the DB to avoid TCP handshake overhead on every request. When every connection in the pool is in use, new requests cannot obtain one and fail with connection errors.
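A minimal sketch of pool exhaustion, assuming a toy `ConnectionPool` class (hypothetical, not any real driver's API). It uses in-memory SQLite connections as stand-ins for TCP connections; real database drivers usually ship their own pool.

```python
import queue
import sqlite3


class PoolExhaustedError(Exception):
    """Raised when no connection becomes free within the timeout."""


class ConnectionPool:
    """Toy fixed-size pool (hypothetical, for illustration only)."""

    def __init__(self, size: int, database: str = ":memory:") -> None:
        self._free: queue.Queue = queue.Queue(maxsize=size)
        for _ in range(size):
            # Connections are opened once, up front, and then reused.
            self._free.put(sqlite3.connect(database, check_same_thread=False))

    def acquire(self, timeout: float = 0.1) -> sqlite3.Connection:
        """Borrow a connection, waiting at most `timeout` seconds for one."""
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise PoolExhaustedError("no free connection in the pool") from None

    def release(self, conn: sqlite3.Connection) -> None:
        """Hand the connection back so another request can use it."""
        self._free.put(conn)
```

A request that forgets to call `release` permanently shrinks the pool, which is one common way a pool ends up exhausted.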
1.3 External Service Errors
Every third-party dependency is a point of failure you don't control.
Examples of external dependencies: Payment processors, email providers (Resend, Mailgun), cloud storage (S3), auth providers (Auth0, Clerk), AI APIs (OpenAI).
Failure modes:
| Failure | Description | Strategy |
|---|---|---|
| Network issues | Timeouts, DNS failures, routing problems | Retry with exponential backoff |
| Rate limiting | Provider rejects with 429 Too Many Requests | Exponential backoff; audit your call frequency |
| Authentication errors | Bad credentials, expired tokens | Verify credentials; implement token refresh |
| Service outage | Provider goes down entirely | Fallback mechanisms (cache, secondary node, degraded mode) |
Exponential backoff pattern:
Request fails → wait 1 min → retry
Still fails → wait 2 min → retry
Still fails → wait 4 min → retry
Still fails → wait 8 min → retry
...up to max retries
Most downtime resolves within seconds or minutes, so the task usually succeeds within 1–2 retries.
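The pattern above can be sketched as a small helper. This is an illustration under assumptions, not a prescribed implementation: the name `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument exists only so the waiting can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`; after each failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: let the caller decide what to do next
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, 8x ... the base delay
    raise AssertionError("unreachable")
```

With `base_delay=60` (seconds) this reproduces the 1, 2, 4, 8 minute schedule. A real system would likely catch only errors it knows to be transient rather than every `Exception`.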
1.4 Input Validation Errors
User-caused errors. These are the easiest to handle: you know the rules, so enforce them at the entry point.
| Validation Type | Example |
|---|---|
| Format | Email must match user@domain.tld pattern; date must be YYYY-MM-DD |
| Range | Price must be > 0; string length between 5–500 chars; array must have 1–100 items |
| Required fields | Name field must be present to create a book |
Return 400 Bad Request with clear field-level messages. These errors should never reach service or repository layers.
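A sketch of entry-point validation for the book example, assuming a hypothetical payload with `name`, `price`, and an optional `contact_email` field. The function names and the response shape are illustrative only.

```python
import re
from typing import Any

# Deliberately loose format check: something@something.tld
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_book(payload: dict[str, Any]) -> dict[str, str]:
    """Return a field -> message map; an empty dict means the payload is valid."""
    errors: dict[str, str] = {}
    name = payload.get("name")
    if not isinstance(name, str) or not name.strip():
        errors["name"] = "name is required"  # required-field rule
    price = payload.get("price")
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price <= 0:
        errors["price"] = "price must be a number greater than 0"  # range rule
    email = payload.get("contact_email")
    if email is not None and not (isinstance(email, str) and EMAIL_RE.match(email)):
        errors["contact_email"] = "contact_email must look like user@domain.tld"  # format rule
    return errors


def handle_create_book(payload: dict[str, Any]) -> tuple[int, dict[str, Any]]:
    """Handler: reject bad input with 400 before any service code runs."""
    errors = validate_book(payload)
    if errors:
        return 400, {"errors": errors}
    return 201, {"book": payload}  # a real handler would call the service layer here
```

Because the handler returns early, invalid payloads never reach the service or repository layers.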
1.5 Configuration Errors
Occur when moving between environments (dev → staging → production).
Example: OPENAI_API_KEY added to .env during development, forgotten in production environment variables.
Two outcomes:
| Setup | Outcome |
|---|---|
| Validate env vars at startup | App fails to start, so the previous deployment keeps running (preferred) |
| No startup validation | App starts, then fails at runtime when the missing key is accessed → users get 500 |
Best practice: Validate all required environment variables before the server starts. Fail fast with a clear message. Never let a misconfiguration reach a live user.
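A sketch of startup validation. The list of required variables is an assumption for illustration (`DATABASE_URL` is hypothetical; `OPENAI_API_KEY` comes from the example above).

```python
import os
from typing import Mapping

# Hypothetical list: every variable the app cannot run without.
REQUIRED_ENV_VARS = ("DATABASE_URL", "OPENAI_API_KEY")


class ConfigError(Exception):
    """Raised at startup when required configuration is missing."""


def validate_env(env: Mapping[str, str] = os.environ) -> None:
    """Fail fast, naming every missing variable in one clear message."""
    missing = [name for name in REQUIRED_ENV_VARS if not env.get(name)]
    if missing:
        raise ConfigError(
            "missing required environment variables: " + ", ".join(missing)
        )
```

Calling `validate_env()` as the first step of the entry point, before the server binds to a port, makes a bad deployment fail at boot instead of in front of a user.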
2. Prevention: Proactive Error Detection
The best error handling starts before errors happen.
2.1 Health Checks
Basic health endpoint:
GET /health → 200 OK (server is running)
            → 500 (something is wrong)
This only tells you the server is alive. More comprehensive checks are needed:
Database health checks:
- Test connectivity to the DB
- Run a representative query and measure response time
- Alert if queries that used to take 500ms now take 5s
External service health checks:
- Payment processors: run a test transaction
- Email providers: send a test email to an internal address
- Auth services: generate and validate a test token
Core functionality checks (at startup):
- All required environment variables are present
- Essential caches are populated
- Internal data structures are consistent
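The database health check described above could look roughly like this. It is a sketch that uses SQLite from the standard library as a stand-in for the real database; the `slow_after_seconds` threshold and the response shape are assumptions.

```python
import sqlite3
import time
from typing import Any


def check_database(conn: sqlite3.Connection, slow_after_seconds: float = 0.5) -> dict[str, Any]:
    """Run a trivial query and report connectivity plus latency."""
    started = time.perf_counter()
    try:
        conn.execute("SELECT 1").fetchone()
    except sqlite3.Error:
        return {"status": "down"}
    elapsed = time.perf_counter() - started
    # Slow but working is reported separately so it can trigger an alert.
    status = "ok" if elapsed <= slow_after_seconds else "degraded"
    return {"status": status, "latency_ms": round(elapsed * 1000, 2)}


def health(conn: sqlite3.Connection) -> tuple[int, dict[str, Any]]:
    """GET /health: 200 when the database check passes, 500 otherwise."""
    db = check_database(conn)
    return (200 if db["status"] == "ok" else 500), {"database": db}
```

A representative application query in place of `SELECT 1` would give a more meaningful latency signal.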
2.2 Monitoring & Observability (Overview)
Deep dive in a future video; only the high-level principles are covered here.
What to monitor:
- HTTP error rates (4xx, 5xx)
- Database errors and query latency
- External service failure rates
- Business metrics (sudden drop in successful transactions is often the first sign of a technical problem)
Key insight: Don't only track error rates. Monitor performance degradation too: a system slowing down is often the warning sign before it breaks.
Logging best practices:
- Use structured JSON logs (parseable by tools like Grafana/Loki)
- Include correlation IDs and request IDs for tracing
- Never log sensitive data (see Security section below)
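A sketch of structured JSON logging with a correlation ID, using only the standard `logging` module. The field names in the JSON object and the helper names are assumptions; the correlation ID is passed through `logging`'s `extra` argument.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set via logger.<level>(..., extra={"correlation_id": ...})
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)


def build_logger(name: str = "app") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def new_correlation_id() -> str:
    """One ID per incoming request, attached to every log line for that request."""
    return uuid.uuid4().hex
```

Usage would be along the lines of `build_logger().warning("payment failed", extra={"correlation_id": new_correlation_id()})`, so every line for one request can be found with a single search.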
3. Error Response Strategies
3.1 Recoverable Errors
Errors that can be retried without human intervention.
Examples: Network timeouts, temporary DB connection failures, external API rate limits.
Strategy: Retry with exponential backoff. Be careful not to add more load to an already stressed system.
3.2 Non-Recoverable Errors
Errors that require giving up on the current operation.
Examples: Persistent external service outage, corrupted data, exhausted all retries.
Strategy: Containment + Graceful Degradation
- Switch to cached data
- Disable non-essential features
- Provide alternative functionality / fallback
- Contain the scope of damage: prevent one failure from cascading
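Switching to cached data can be sketched as a thin wrapper around the flaky call. `CachedFallback` is a hypothetical helper; it keeps only the last good value in memory, which is the simplest possible cache.

```python
from typing import Any, Callable, Optional


class CachedFallback:
    """Wrap a flaky call and serve the last good value when it fails."""

    def __init__(self, fetch: Callable[[], Any]) -> None:
        self._fetch = fetch
        self._last_good: Optional[Any] = None
        self._has_value = False

    def get(self) -> tuple[Any, bool]:
        """Return (value, is_stale). is_stale is True when served from cache."""
        try:
            value = self._fetch()
        except Exception:
            if not self._has_value:
                raise  # nothing cached yet: the failure cannot be hidden
            return self._last_good, True
        self._last_good, self._has_value = value, True
        return value, False
```

The `is_stale` flag lets the caller degrade visibly, for example by showing slightly old data with a notice instead of an error page.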
4. Global Error Handler: The Final Safety Net
The single most important error handling mechanism in a backend app.
Architecture
Request
  ↓
Routing → Handler (validation, deserialization)
  ↓
Service (business logic)
  ↓
Repository (DB queries)
  ↓
  ↑ errors bubble up ↑
  ↓
Global Error Handler Middleware
  ↓
Response
Every error, regardless of where it's thrown (repository, service, handler), bubbles up to the global error handler. The middleware reads the error type and returns the appropriate HTTP response.
Error Type → HTTP Response Mapping
| Error | HTTP Code | Message |
|---|---|---|
| Validation failure (bad field format, missing required field) | 400 Bad Request | Field-level error details |
| Unique constraint violation (email already exists) | 400 Bad Request | "This book already exists" |
| No rows returned (resource not found) | 404 Not Found | "Book with ID 123 does not exist" |
| Foreign key violation (author ID doesn't exist) | 404 Not Found | "Author with this ID does not exist" |
| Unknown / unclassified error | 500 Internal Server Error | "Something went wrong" (generic, no internal details) |
Custom Error Classes
Create typed error classes (DatabaseError, ValidationError, NotFoundError, etc.) so the global handler can identify error type and respond correctly.
Repository throws: DatabaseUniqueConstraintError
Service passes it up unchanged
Handler passes it up unchanged
Global middleware: catches DatabaseUniqueConstraintError → return 400 + "already exists"
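The flow above can be sketched without any web framework. The class names follow the ones mentioned in these notes; the `AppError` base class, the `status_code` and `public_message` attributes, and the shape of `global_error_handler` are assumptions made for illustration.

```python
from typing import Any, Callable

Response = tuple[int, dict[str, Any]]


class AppError(Exception):
    """Base class: carries the HTTP status and a message that is safe to show users."""
    status_code = 500
    public_message = "Something went wrong"


class ValidationError(AppError):
    status_code = 400

    def __init__(self, field_errors: dict[str, str]) -> None:
        super().__init__("validation failed")
        self.field_errors = field_errors


class DatabaseUniqueConstraintError(AppError):
    status_code = 400
    public_message = "This book already exists"


class NotFoundError(AppError):
    status_code = 404

    def __init__(self, public_message: str) -> None:
        super().__init__(public_message)
        self.public_message = public_message


def global_error_handler(handler: Callable[[], Response]) -> Response:
    """Run the handler and translate whatever bubbles up into an HTTP response."""
    try:
        return handler()
    except ValidationError as exc:
        return exc.status_code, {"errors": exc.field_errors}
    except AppError as exc:
        return exc.status_code, {"error": exc.public_message}
    except Exception:
        # Unknown error: log the details internally, expose nothing to the client.
        return 500, {"error": "Something went wrong"}
```

In a real framework the same logic would live in its error middleware hook; the repository, service, and handler code simply raise and never build error responses themselves.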
Two Key Advantages
1. Robustness: All error types are handled in one place. You can't forget a condition, because any unrecognized error falls back to the 500 default.
2. Reduced redundancy: Error handling logic isn't duplicated across every repository method. Define it once in the middleware, not in every SQL call.
5. Error Propagation & Boundaries
Propagation
Errors should bubble up with enough context. Wrap low-level errors with business context as they travel up layers:
Repository: "UNIQUE constraint violated on column 'email'" (DB detail)
  ↓ wrap with context
Service: EmailAlreadyExistsError("User with this email already registered")
  ↓
Handler: passes up
  ↓
Global handler: return 400 + user-friendly message
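In Python, this kind of wrapping maps naturally onto exception chaining with `raise ... from`. The sketch below uses SQLite and a hypothetical `users` table; it also assumes the unique email is the only constraint on that table, so any `IntegrityError` from the insert means a duplicate.

```python
import sqlite3


class EmailAlreadyExistsError(Exception):
    """Business-level error, safe to map to a user-facing message."""


def insert_user(conn: sqlite3.Connection, email: str) -> None:
    """Repository layer: raises the raw driver error on a constraint violation."""
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))


def register_user(conn: sqlite3.Connection, email: str) -> None:
    """Service layer: wraps the low-level error with business context."""
    try:
        insert_user(conn, email)
    except sqlite3.IntegrityError as exc:
        # 'from exc' keeps the original DB error attached for internal logs.
        raise EmailAlreadyExistsError("User with this email already registered") from exc
```

The original database error stays available on `__cause__` for logging, while only the business-level message travels toward the client.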
Service Boundaries
In microservice architectures, prevent errors in one service from cascading to others:
- Use separate processes for separate services
- Implement timeouts at service boundaries
- Use message queues (RabbitMQ, SQS) for async communication, so a failing consumer doesn't block the producer
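A timeout at a service boundary can be sketched with `concurrent.futures` from the standard library. All names here are hypothetical, and `slow_inventory_lookup` merely simulates a slow downstream service. One caveat of this approach: the worker thread keeps running after the timeout, so where a client library accepts its own timeout parameter, that is usually the better tool.

```python
import concurrent.futures
import time
from typing import Callable, TypeVar

T = TypeVar("T")
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


class UpstreamTimeoutError(Exception):
    """The other service did not answer within the allowed time."""


def call_with_timeout(call: Callable[[], T], timeout_seconds: float) -> T:
    """Run `call` in a worker thread and stop waiting after `timeout_seconds`."""
    future = _executor.submit(call)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only has an effect if the call has not started yet
        raise UpstreamTimeoutError(f"no response within {timeout_seconds}s") from None


def slow_inventory_lookup() -> str:
    """Hypothetical stand-in for a slow downstream service."""
    time.sleep(0.3)
    return "42 in stock"
```

The caller gets a fast, typed failure it can handle (retry, fall back, degrade) instead of hanging along with the slow service.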
6. Security in Error Handling
6.1 Don't Leak Internal Details
Never send raw database errors to users. DB errors often contain table names, constraint names, and column names: information an attacker can use for SQL injection or targeted attacks.
Bad: "ERROR: duplicate key value violates unique constraint 'users_email_idx'"
Good: "An account with this email already exists"
Bad: (any unhandled exception message in a 500 response)
Good: "Something went wrong. Please try again later."
The 500 default handler message should always be generic. By the time you reach the default, you don't know what the error is, so its message may well contain internal detail.
6.2 Authentication Error Messages
Authentication endpoints are the most targeted by attackers. Specific error messages enable enumeration attacks:
Bad:
"User with this email does not exist" → attacker knows to try a different email
"Incorrect password" → attacker knows the email is valid and can focus on brute-forcing the password
Good:
"Invalid email or password" → same message regardless of which field is wrong
Reference: OWASP Authentication Cheat Sheet. Follow established best practices for auth-related error handling.
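A sketch of a login check that returns the same response for both failure cases. The in-memory `users` mapping, the PBKDF2 parameters, and the 401 status code are assumptions for illustration, not recommendations from these notes.

```python
import hashlib
import hmac

GENERIC_AUTH_ERROR = "Invalid email or password"


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a password hash with PBKDF2 (parameters are illustrative)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


def login(
    users: dict[str, tuple[bytes, bytes]],  # email -> (salt, password hash)
    email: str,
    password: str,
) -> tuple[int, dict[str, str]]:
    """Return an identical response for unknown email and wrong password."""
    record = users.get(email)
    if record is None:
        return 401, {"error": GENERIC_AUTH_ERROR}
    salt, stored_hash = record
    # compare_digest avoids leaking how many leading bytes matched.
    if not hmac.compare_digest(hash_password(password, salt), stored_hash):
        return 401, {"error": GENERIC_AUTH_ERROR}
    return 200, {"message": "logged in"}
```

Both failure branches produce byte-for-byte the same status and body, so the response alone does not reveal whether the email exists.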
6.3 Log Security
Logs are often stored in external services and can be leaked in breaches.
Never log:
- User passwords (even hashed)
- Credit card numbers
- API keys or secrets
- Full email addresses (use user ID for correlation instead)
When logging auth errors:
Bad: log { email: "user@example.com", error: "password mismatch" }
Good: log { userId: "abc123", correlationId: "req-xyz", error: "auth_failed" }
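One way to enforce the "never log" list is to pass log fields through a redaction step before they are written. The key names in `SENSITIVE_KEYS` are hypothetical and would need to match the field names a given codebase actually uses.

```python
from typing import Any

# Hypothetical field names; extend to match your own payloads.
SENSITIVE_KEYS = {"password", "email", "card_number", "api_key"}


def redact(fields: dict[str, Any]) -> dict[str, Any]:
    """Return a copy with sensitive values masked, recursing into nested dicts."""
    clean: dict[str, Any] = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean
```

Key-based redaction only catches fields with known names; secrets embedded in free-text messages still need care at the call site.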
7. Summary: Fault Tolerant Mindset
| Principle | Practice |
|---|---|
| Expect errors | Design for failure, not against it |
| Fail fast | Validate config at startup; don't let misconfiguration reach users |
| Detect early | Health checks, monitoring, performance tracking |
| Handle centrally | Global error handler for all layers |
| Contain damage | Graceful degradation, service boundaries, fallbacks |
| Don't expose internals | Generic messages for unknown errors; typed messages for known ones |
| Secure auth errors | Same message regardless of which auth field is wrong |
| Protect logs | Never log PII, passwords, or secrets |
Quick Revision Checklist
- Logic errors: silent, produce wrong results, can persist for months → monitor business metrics
- Database errors: connection (pool exhaustion), constraint violation (unique/FK), query (malformed SQL), deadlocks
- External service errors: network issues, rate limiting (→ exponential backoff), auth errors, outages (→ fallback)
- Input validation: format, range, required fields → 400 Bad Request at the entry point
- Configuration errors: validate all required env vars at startup; fail fast before serving users
- Best error handling starts before errors happen: proactive detection
- Health checks: HTTP endpoint + DB query test + external service test + config/cache checks at startup
- Global error handler: catches all bubbled-up errors → maps error type → returns correct HTTP response
- Error type mapping: unique constraint → 400, no rows → 404, FK violation → 404, unknown → 500 (generic)
- Custom error classes enable typed matching in the global handler
- Never send raw DB/internal errors to users: strip details, send user-friendly messages
- Auth errors: always return the same message regardless of which field is wrong (prevent enumeration)
- Logs: never log passwords, emails, credit cards, API keys; log user ID + correlation ID instead
- Recoverable errors → retry with exponential backoff; non-recoverable → contain + degrade gracefully
- Service boundaries: timeouts + message queues prevent one failing service from cascading