Logging, Monitoring & Observability: Lecture Notes

Key framing: This is a spectrum, not a checkbox. No system is "100% observable." These are practices implemented to varying degrees; the goal is to progressively improve your ability to understand and debug your system in production.


1. Why This Matters

Modern backends run in distributed environments: multiple services, regions, databases, external integrations. Without structured observability:

  • You only find out something is broken when users complain
  • You know that something is wrong but not what or where
  • Debugging production issues becomes guesswork

With proper logging, monitoring, and observability, you know what happened, where it happened, and why it happened, often before users notice.


2. Three Core Concepts

2.1 Logging

What: Recording important events throughout the application lifecycle with metadata.

Think of it as: A diary your backend keeps, recording what happened, when, and with what context.

Examples of events to log:

  • User logs in
  • Database query executes (or fails)
  • External API call is made
  • Business operation completes (to-do created, order placed)
  • Error occurs (with stack trace, user ID, request ID)

2.2 Monitoring

What: Continuously tracking the health and performance of your system using near-real-time data (≈10–15 second lag).

Examples of what to monitor:

  • Server CPU and memory usage
  • Requests per second
  • Error rate (% of requests returning 4xx/5xx)
  • Open database connection count
  • Response time percentiles (p50, p95, p99)
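
The percentiles in the last bullet can be computed from raw latency samples. The following is a minimal sketch using the nearest-rank method; the function name `Percentile` and the sample values are illustrative assumptions, and real monitoring systems usually approximate percentiles from histograms rather than sorting every sample.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Percentile returns the p-th percentile (0 < p <= 100) of latencies in
// milliseconds using the nearest-rank method. It returns 0 for an empty slice.
func Percentile(latenciesMs []float64, p float64) float64 {
	if len(latenciesMs) == 0 {
		return 0
	}
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]float64(nil), latenciesMs...)
	sort.Float64s(sorted)

	// Nearest rank: the smallest sample such that at least p% of samples are <= it.
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical response times in milliseconds.
	latencies := []float64{12, 340, 25, 18, 95, 30, 22, 410, 15, 28}
	for _, p := range []float64{50, 95, 99} {
		fmt.Printf("p%.0f = %.0fms\n", p, Percentile(latencies, p))
	}
}
```

Percentiles are preferred over averages here because a few very slow requests barely move the mean but show up clearly at p95 and p99.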

Monitoring tells you that something is wrong. It doesn't tell you why.


2.3 Observability

What: The ability to determine the internal state of your system by examining its external outputs. Built on three pillars:

Pillar  | What it provides
--------|-------------------------------------------------------------
Logs    | What happened: events with context
Metrics | How the system is performing: quantified numbers over time
Traces  | How a request flowed through all components

Monitoring vs Observability:

          | Monitoring                 | Observability
----------|----------------------------|-------------------------------------
Tells you | That there's a problem     | What and where the problem is
Data type | Aggregated metrics, alerts | Logs + metrics + distributed traces
Origin    | Infrastructure-level       | Code-level + infrastructure-level

3. The Three Pillars in Detail

3.1 Logs

Already covered above. The key addition: logs should link to traces via a shared traceId / requestId so you can jump from a log entry to the full request trace.
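
One way to get that shared ID onto every log entry is a request-scoped logger. The following is a minimal sketch, assuming Go 1.21+ and the standard `log/slog` package; `newRequestLogger` and `demo` are hypothetical helper names, and the IDs are hard-coded where a real service would take them from the incoming request or the tracing library.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// newRequestLogger returns a logger that stamps every entry with the given IDs.
func newRequestLogger(base *slog.Logger, traceID, requestID string) *slog.Logger {
	return base.With("traceId", traceID, "requestId", requestID)
}

// demo logs two events for one request into a buffer and returns the
// decoded JSON records so the shared IDs can be inspected.
func demo(traceID, requestID string) []map[string]any {
	var buf bytes.Buffer
	base := slog.New(slog.NewJSONHandler(&buf, nil))

	reqLog := newRequestLogger(base, traceID, requestID)
	reqLog.Info("Creating to-do", "title", "buy milk")
	reqLog.Error("Failed to create to-do", "error", "duplicate key")

	var records []map[string]any
	dec := json.NewDecoder(&buf)
	for {
		var rec map[string]any
		if err := dec.Decode(&rec); err != nil {
			break // io.EOF once the buffer is drained
		}
		records = append(records, rec)
	}
	return records
}

func main() {
	for _, rec := range demo("trace_789", "req_xyz456") {
		fmt.Println(rec["level"], rec["msg"], rec["traceId"], rec["requestId"])
	}
}
```

Because the IDs are attached once with `With`, individual log statements cannot forget them.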

3.2 Metrics

Concrete numbers about your system, either current or historical:

Metric           | Example
-----------------|---------------------------------------
Request count    | 1,200 requests in the last 5 minutes
Error rate       | 4.2% of requests returned 5xx
Response time    | p95 = 340ms
Business metrics | 47 to-dos created in the last hour
Infrastructure   | Memory: 3 MB, GC pause: 12ms

Metrics drive alerts: "if error rate > 80% for 5 minutes, send a Slack notification."
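
The alert rule quoted above can be sketched as a small check. This is an illustration only, assuming a hypothetical `Bucket` type that holds request counts for one minute; in practice the rule would usually live in the alerting system (for example as a Prometheus or New Relic alert condition) rather than in application code.

```go
package main

import "fmt"

// Bucket holds request counts for one minute. Hypothetical type for illustration.
type Bucket struct {
	Total  int
	Errors int
}

// ErrorRate returns the percentage of failed requests in the bucket (0 when idle).
func (b Bucket) ErrorRate() float64 {
	if b.Total == 0 {
		return 0
	}
	return 100 * float64(b.Errors) / float64(b.Total)
}

// ShouldAlert reports whether the error rate exceeded thresholdPct in each of
// the last n buckets. With fewer than n buckets there is not enough data.
func ShouldAlert(buckets []Bucket, thresholdPct float64, n int) bool {
	if n <= 0 || len(buckets) < n {
		return false
	}
	for _, b := range buckets[len(buckets)-n:] {
		if b.ErrorRate() <= thresholdPct {
			return false
		}
	}
	return true
}

func main() {
	// Five one-minute buckets, all with an error rate above 80%.
	recent := []Bucket{{100, 90}, {120, 110}, {80, 70}, {100, 85}, {90, 80}}
	if ShouldAlert(recent, 80, 5) {
		fmt.Println("ALERT: error rate above 80% for 5 minutes")
	}
}
```

Requiring the threshold to hold for the whole window, rather than a single bucket, keeps one brief spike from paging anyone.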

3.3 Traces

A trace = the full journey of a single request through all system components.

Incoming request
      ↓
Middleware (auth check)      ← span 1: 2ms
      ↓
Validation layer             ← span 2: 1ms
      ↓
Service method               ← span 3: 45ms
      ↓
Repository (DB query)        ← span 4: 38ms
      ↓
Response

Each step is called a span. Together they form a trace. A trace lets you see:

  • Where time was spent
  • Where an error first appeared
  • Which downstream service caused a failure
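
To make the span idea concrete, here is a toy sketch that times each step of a request and reports where most of the time went. The `Span` and `Trace` types are hypothetical and deliberately flat; real tracing libraries such as OpenTelemetry nest spans (the repository span would sit inside the service span) and propagate them across services.

```go
package main

import (
	"fmt"
	"time"
)

// Span records how long one step of a request took.
type Span struct {
	Name     string
	Duration time.Duration
}

// Trace collects the spans of a single request.
type Trace struct {
	ID    string
	Spans []Span
}

// Record runs step, measures its duration, and appends the resulting span.
func (t *Trace) Record(name string, step func()) {
	start := time.Now()
	step()
	t.Spans = append(t.Spans, Span{Name: name, Duration: time.Since(start)})
}

// Slowest returns the span with the longest duration, or false if there are none.
func (t *Trace) Slowest() (Span, bool) {
	if len(t.Spans) == 0 {
		return Span{}, false
	}
	slowest := t.Spans[0]
	for _, s := range t.Spans[1:] {
		if s.Duration > slowest.Duration {
			slowest = s
		}
	}
	return slowest, true
}

func main() {
	trace := &Trace{ID: "trace_789"}
	// Sleeps stand in for real work in each layer.
	trace.Record("auth middleware", func() { time.Sleep(2 * time.Millisecond) })
	trace.Record("validation", func() { time.Sleep(1 * time.Millisecond) })
	trace.Record("repository (DB query)", func() { time.Sleep(38 * time.Millisecond) })

	for _, s := range trace.Spans {
		fmt.Printf("%-24s %v\n", s.Name, s.Duration)
	}
	if s, ok := trace.Slowest(); ok {
		fmt.Println("slowest:", s.Name)
	}
}
```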

4. Typical Debugging Workflow

Alert fires: "Error rate > 80%"
      ↓
Open monitoring dashboard (Grafana / New Relic)
      ↓
Look at metrics: confirm spike in 5xx errors
      ↓
Find related logs: filter by time window and error level
      ↓
Click on a specific log entry: see full context (user ID, route, error message)
      ↓
Jump to trace: see exactly which component failed and at what point
      ↓
Fix the issue

This workflow is only possible if all three pillars are implemented.


5. Log Levels

Each log statement should have an assigned level. Levels control what gets written to logs in each environment.

Level | When to use                                                                    | Environment
------|--------------------------------------------------------------------------------|------------------
debug | Detailed troubleshooting info: step-by-step flow, variable values              | Dev only
info  | General successful operations, business events ("to-do created")               | Dev + Production
warn  | Unexpected but non-critical: failed auth attempt, deprecated API usage         | Dev + Production
error | Something went wrong: DB query failed, external API error, unhandled exception | Dev + Production
fatal | Critical unrecoverable error: app shuts down / restarts                        | Dev + Production

In practice:

  • Set LOG_LEVEL=debug in development → see everything
  • Set LOG_LEVEL=info in production → avoid noise; keep only meaningful events
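
Reading LOG_LEVEL at startup might look like the sketch below. It assumes Go 1.21+ and the standard `log/slog` package; `parseLevel` is a hypothetical helper, and falling back to info for unknown values is a design choice, not a requirement. Note that `slog` has no built-in fatal level, so that case is left out here.

```go
package main

import (
	"log/slog"
	"os"
	"strings"
)

// parseLevel maps a LOG_LEVEL value to a slog level, defaulting to info
// when the variable is unset or unrecognized.
func parseLevel(s string) slog.Level {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "debug":
		return slog.LevelDebug
	case "warn", "warning":
		return slog.LevelWarn
	case "error":
		return slog.LevelError
	default:
		return slog.LevelInfo
	}
}

func main() {
	level := parseLevel(os.Getenv("LOG_LEVEL"))
	logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: level}))

	// Dropped unless LOG_LEVEL=debug.
	logger.Debug("cache lookup", "key", "todo:42")
	// Written at debug and info levels.
	logger.Info("to-do created", "todoId", 42)
}
```

Running the program with `LOG_LEVEL=debug` shows both entries; with `LOG_LEVEL=info` (or unset) only the second appears.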

6. Structured vs Unstructured Logging

Unstructured (Development)

Human-readable, colored terminal output:

[INFO]  2024-01-15 10:23:41  Connected to database
[INFO]  2024-01-15 10:23:41  Server started on port 8080
[ERROR] 2024-01-15 10:24:12  Failed to create to-do: duplicate key

Easy for humans to read. Hard for machines to parse.

Structured (Production)

JSON format, where every field is explicit and parseable:

{
  "level": "error",
  "timestamp": "2024-01-15T10:24:12Z",
  "message": "Failed to create to-do",
  "userId": "user_abc123",
  "requestId": "req_xyz456",
  "traceId": "trace_789",
  "error": "duplicate key violates unique constraint",
  "route": "POST /todos",
  "latencyMs": 45
}

Why JSON in production:

  • Log management tools (ELK stack, Loki, New Relic) can parse and index every field
  • You can filter logs by userId, traceId, level, route, etc.
  • You can build dashboards and alerts from specific fields
  • Impossible to do this reliably with plain text
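
The filtering point can be illustrated with a few lines of code. This is a rough sketch of what a log tool does internally, not how ELK or Loki are implemented; `Filter` and the `sample` log lines are made up for the example.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// Filter returns the log lines whose JSON string field key equals want.
// Lines that are not valid JSON objects are skipped.
func Filter(logs, key, want string) []string {
	var matches []string
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var rec map[string]any
		if err := json.Unmarshal(sc.Bytes(), &rec); err != nil {
			continue // plain-text line: there is no field to match on
		}
		if v, ok := rec[key].(string); ok && v == want {
			matches = append(matches, sc.Text())
		}
	}
	return matches
}

// sample mixes structured entries with one unstructured line.
const sample = `{"level":"info","message":"Server started","port":8080}
{"level":"error","message":"Failed to create to-do","userId":"user_abc123","traceId":"trace_789"}
[INFO] plain text line that cannot be filtered by field
{"level":"info","message":"To-do created","userId":"user_abc123","traceId":"trace_790"}`

func main() {
	for _, line := range Filter(sample, "userId", "user_abc123") {
		fmt.Println(line)
	}
}
```

The unstructured line is silently dropped: without named fields there is nothing reliable to query.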

Rule: Human-readable in development → JSON in production.
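
A minimal sketch of applying this rule, assuming Go 1.21+ and `log/slog`. The environment names ("development", anything else) and the helpers `newHandler` and `render` are assumptions for illustration; the standard text handler is also not colored, so colored dev output would need a third-party handler.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log/slog"
)

// newHandler picks a human-readable handler for development and JSON otherwise.
func newHandler(env string, w io.Writer) slog.Handler {
	if env == "development" {
		return slog.NewTextHandler(w, nil)
	}
	return slog.NewJSONHandler(w, nil)
}

// render logs one sample event with the handler chosen for env
// and returns the raw output.
func render(env string) string {
	var buf bytes.Buffer
	slog.New(newHandler(env, &buf)).Info("Server started", "port", 8080)
	return buf.String()
}

func main() {
	fmt.Print(render("development")) // key=value text
	fmt.Print(render("production"))  // one JSON object per line
}
```

Application code calls the same `logger.Info(...)` in both environments; only the handler chosen at startup differs.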


7. Instrumentation and OpenTelemetry

Instrumentation: The practice of adding measurement code to your application (recording spans, adding attributes to traces, emitting metrics). This is the code-level work that makes observability possible.

OpenTelemetry (OTel): An open standard and ecosystem for instrumentation. Provides:

  • SDKs for all major languages (Node.js, Go, Python, Java, Rust, etc.)
  • Standardized APIs for emitting traces, metrics, and logs
  • Vendor-neutral: works with Grafana, New Relic, Datadog, Jaeger, etc.

How a trace is created in code (simplified; this fragment uses the New Relic Go agent v3 rather than the OpenTelemetry SDK, but the flow is the same):

import "github.com/newrelic/go-agent/v3/newrelic"

// Middleware: create a transaction/trace when a request arrives.
// app is the *newrelic.Application created at startup.
txn := app.StartTransaction("POST /todos")
defer txn.End()
txn.AddAttribute("userId", userID)
txn.AddAttribute("env", "production")
ctx = newrelic.NewContext(ctx, txn) // carry the transaction in the request context

// Service layer: continue the same trace.
txn := newrelic.FromContext(ctx)
defer txn.StartSegment("createTodo").End() // ends this span when the function returns; the name is illustrative
txn.AddAttribute("title", todo.Title)

// Log the event (logger is assumed to take key-value pairs, e.g. *slog.Logger).
logger.Info("Creating to-do", "title", todo.Title, "userId", userID)

// On error: add to trace + log.
txn.NoticeError(err)
logger.Error("Failed to create to-do", "error", err, "userId", userID)

The key: the same requestId / traceId flows through every log and span so you can correlate them.


8. What to Log (Practical Guide)

Event type                     | Log level | What to include
-------------------------------|-----------|----------------------------------------
Request received               | info      | route, method, userId, requestId, IP
Business operation start       | info      | operation name, key parameters
Business operation success     | debug     | result ID, duration
Business event (to-do created) | info      | resource ID, user ID, key fields
Validation failure             | warn      | field names, validation errors
Auth failure                   | warn      | userId (not password), route, IP
External API call              | info      | service name, endpoint, status
DB query failure               | error     | query type, error message (not raw SQL)
Unhandled exception            | error     | stack trace, userId, requestId
App startup                    | info      | environment, port, config summary
App shutdown                   | info      | reason, any cleanup status

Never log: passwords, credit card numbers, full email addresses, API keys, raw JWT tokens (see error handling notes).
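
One way to enforce this in code is to make sensitive values redact themselves. The sketch below assumes Go 1.21+ and `log/slog`; `Secret` and `MaskEmail` are hypothetical names, and the masking format is just one possible choice.

```go
package main

import (
	"log/slog"
	"os"
	"strings"
	"unicode/utf8"
)

// Secret wraps a sensitive string. slog's built-in handlers call LogValue
// instead of printing the raw value, so the secret never reaches the log.
type Secret string

// LogValue implements slog.LogValuer.
func (Secret) LogValue() slog.Value { return slog.StringValue("[REDACTED]") }

// MaskEmail keeps only the first character and the domain,
// e.g. "alice@example.com" becomes "a***@example.com".
func MaskEmail(email string) string {
	at := strings.LastIndex(email, "@")
	if at < 1 {
		return "[invalid email]"
	}
	_, size := utf8.DecodeRuneInString(email) // width of the first character
	return email[:size] + "***" + email[at:]
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	logger.Warn("Auth failure",
		"email", MaskEmail("alice@example.com"),
		"apiKey", Secret("sk_live_123"), // logged as "[REDACTED]"
		"route", "POST /login",
	)
}
```

Putting the redaction in the type means a careless `logger.Info("...", "apiKey", key)` still cannot leak the value.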


9. Tools

Open Source Stack (Grafana Stack)

Tool                    | Role
------------------------|--------------------------------------------------------------------
Prometheus              | Metrics collection and storage
Grafana                 | Dashboard and visualization (connects to Prometheus, Loki, Jaeger)
Loki                    | Log aggregation ("like Prometheus, but for logs"; pairs with Grafana)
Jaeger                  | Distributed trace collection and visualization
OpenTelemetry Collector | Receives traces/metrics from apps, forwards to backends

Pros: Free, open source, highly configurable, widely adopted in industry.
Cons: Requires setup, configuration, and ongoing maintenance.

Managed / Proprietary Solutions

Tool        | Notes
------------|----------------------------------------------------------------------
New Relic   | All-in-one: logs, metrics, traces, dashboards, alerts
Datadog     | Similar to New Relic; strong APM (application performance monitoring)
Elastic APM | Add-on to ELK stack for traces and APM

Pros: Simpler setup, integrated dashboards, managed infrastructure.
Cons: Cost at scale; vendor lock-in.

Rule of thumb: Start with a managed solution (New Relic, Datadog) if you don't have dedicated DevOps bandwidth. Migrate to open source when you have the team.


10. Shared Responsibility

Logging, monitoring, and observability require both developers and DevOps / infrastructure teams:

Responsibility                                      | Who
----------------------------------------------------|---------------------
Add log statements with proper levels and metadata  | Developer
Instrument traces (create spans, add attributes)    | Developer
Emit custom business metrics                        | Developer
Set up Prometheus / Grafana / Loki infrastructure   | DevOps
Configure log aggregation and ingestion pipelines   | DevOps
Set up alert rules and on-call routing              | DevOps + Developer
Define which metrics and thresholds matter          | Developer + Product

Quick Revision Checklist

  • Logging = recording events; Monitoring = real-time health data; Observability = full internal state understanding
  • Three pillars of observability: Logs (what happened), Metrics (how the system performs), Traces (how a request flowed)
  • Monitoring tells you that something is wrong; observability tells you why
  • Log levels: debug (dev only) → info → warn → error → fatal (app shuts down)
  • LOG_LEVEL=debug in development, LOG_LEVEL=info in production
  • Unstructured (human-readable) logs in dev; JSON (structured) logs in production
  • JSON logs enable filtering, indexing, and dashboard building by tools
  • A trace = full journey of one request through all components; each step = a span
  • Instrumentation = adding measurement code; OpenTelemetry = vendor-neutral standard for this
  • Pass traceId / requestId through all logs so you can correlate events from the same request
  • Open source stack: Prometheus (metrics) + Loki (logs) + Grafana (dashboards) + Jaeger (traces)
  • Managed options: New Relic, Datadog (simpler, but cost more at scale)
  • Never log passwords, secrets, full emails, or credit card numbers
  • Debugging workflow: alert → metrics → logs → trace → fix
  • This is a spectrum: implement incrementally; something is always better than nothing