Observability
Observability is not just "more logging." In real services, it is the design that connects traces, error events, structured logs, and request context into one coherent operating model. Without that, slow requests, hidden N+1 queries, downstream timeouts, and background-task failures end up scattered across unrelated tools.
Quick takeaway: a practical FastAPI stack often uses OpenTelemetry as the trace backbone, Sentry as the error and performance product, and `structlog` for structured request-scoped logs. The important part is not the brand list; it is initializing them once at startup, separating sampling from PII policy, and avoiding high-cardinality spam.
A Practical Stack
| Role | Good default | Why |
|---|---|---|
| distributed traces and spans | OpenTelemetry | vendor-neutral foundation with strong FastAPI and SQLAlchemy instrumentation |
| error and performance monitoring | Sentry | useful operational UI for errors plus tracing context |
| structured logs and request context | structlog | clean contextvars story for request and trace correlation |
The Big Picture
Instrument OpenTelemetry Once During Bootstrap
```python
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor


def configure_observability(app: FastAPI, engine: object) -> None:
    # Exclude noisy endpoints so health checks do not flood the trace backend.
    FastAPIInstrumentor.instrument_app(
        app,
        excluded_urls="health,metrics",
    )
    SQLAlchemyInstrumentor().instrument(
        engine=engine,
    )
```

FastAPI and SQLAlchemy instrumentation should usually be attached once during bootstrap. Doing it inside routes or dependencies can cause duplicate instrumentation and confusing runtime behavior.
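The bootstrap-once rule can also be defended mechanically. Below is a library-free sketch of an idempotence guard; the `_configured` flag and `setup_instrumentation` wrapper are illustrative names, not part of any SDK, and any real setup calls would go inside the guarded function:

```python
# Hypothetical guard: makes accidental re-initialization a no-op.
_configured = False


def setup_instrumentation(setup_fn) -> bool:
    """Run setup_fn exactly once per process; return True only if it ran."""
    global _configured
    if _configured:
        return False  # a second call (e.g. from a route) does nothing
    setup_fn()
    _configured = True
    return True
```

Even if a dependency or route accidentally calls setup again, the instrumentation is attached a single time.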
Design Sentry Sampling Deliberately
```python
import sentry_sdk
from sentry_sdk.types import SamplingContext


def traces_sampler(context: SamplingContext) -> float:
    # Respect the upstream decision so distributed traces stay intact.
    if context.get("parent_sampled") is not None:
        return float(context["parent_sampled"])
    transaction_context = context.get("transaction_context", {})
    name = str(transaction_context.get("name", ""))
    if name.startswith("GET /health"):
        return 0.0  # never trace health checks
    if name.startswith("POST /checkout"):
        return 0.5  # sample business-critical paths more heavily
    return 0.1  # conservative default for everything else


sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
    traces_sampler=traces_sampler,
    sample_rate=1.0,  # error-event rate, independent of trace sampling
)
```

- Error retention and trace sampling usually deserve different rates.
- Sentry's docs explicitly recommend deliberate use of `traces_sample_rate` or `traces_sampler`.
- Inherited parent sampling decisions should usually be preserved so distributed traces stay intact.
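Because the sampling context behaves like a plain mapping, the routing logic of a sampler can be unit-tested without initializing any SDK. A sketch using the same rules as above, with plain dicts standing in for Sentry's context:

```python
def route_sample_rate(context: dict) -> float:
    """Pure-function version of the sampler above, testable with plain dicts."""
    # Inherit the upstream decision when one exists.
    if context.get("parent_sampled") is not None:
        return float(context["parent_sampled"])
    name = str(context.get("transaction_context", {}).get("name", ""))
    if name.startswith("GET /health"):
        return 0.0
    if name.startswith("POST /checkout"):
        return 0.5
    return 0.1
```

Keeping the decision logic in a pure function makes sampling policy reviewable and regression-testable instead of buried in SDK wiring.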
Bind Request Context into Logs with structlog
```python
import structlog
from structlog.contextvars import bind_contextvars, clear_contextvars

log = structlog.get_logger()


async def logging_middleware(request, call_next):
    # Reset any state left over from a previous request on this task.
    clear_contextvars()
    bind_contextvars(
        request_id=request.headers.get("x-request-id", "generated-id"),
        path=request.url.path,
    )
    response = await call_next(request)
    # Every log line emitted in this scope now carries request_id and path.
    log.info("request.complete", status_code=response.status_code)
    return response
```

Good Patterns
- bind request IDs and trace IDs in one request scope
- exclude health checks and other noisy low-value paths from tracing
- keep span names at route or business-action granularity
- instrument meaningful boundaries such as DB, outbound HTTP, or queue publishing
- decide PII and secret redaction before shipping events broadly
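The first two patterns hinge on one mechanism: request-scoped correlation context. Here is a library-free sketch of that idea using only stdlib `contextvars`; the `bind` and `log_event` helpers are illustrative, a stand-in for what structlog's contextvars integration does:

```python
import contextvars

# One context variable holds all correlation fields for the current request.
_request_ctx: contextvars.ContextVar[dict] = contextvars.ContextVar(
    "request_ctx", default={}
)


def bind(**fields) -> None:
    """Merge correlation fields (request_id, trace_id, ...) into the scope."""
    _request_ctx.set({**_request_ctx.get(), **fields})


def log_event(event: str, **fields) -> dict:
    """Build a log record that automatically carries the bound fields."""
    return {**_request_ctx.get(), "event": event, **fields}
```

Because `ContextVar` values are isolated per asyncio task, concurrent requests cannot leak correlation fields into each other's log lines.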
Patterns to Avoid
- tracing every request at 100% without considering volume
- adding high-cardinality values such as `user_id`, `email`, or `order_id` everywhere
- reinitializing logger or Sentry scope repeatedly inside route handlers
- logging whole request bodies or raw secrets
- creating spans inside tight per-row loops
- double auto-instrumenting the same path with overlapping SDKs
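Redaction is cheap to enforce at the edge before payloads reach logs or error events. A sketch of a hypothetical denylist filter; the key names are assumptions and should be tuned to your actual payloads:

```python
SENSITIVE_KEYS = {"password", "authorization", "cookie", "token", "card_number"}


def redact(payload: dict) -> dict:
    """Return a copy with sensitive values masked, recursing into nested dicts."""
    cleaned = {}
    for key, value in payload.items():
        if key.lower() in SENSITIVE_KEYS:
            cleaned[key] = "[REDACTED]"
        elif isinstance(value, dict):
            cleaned[key] = redact(value)
        else:
            cleaned[key] = value
    return cleaned
```

Running a filter like this in one place, before anything is shipped, is far safer than trusting every call site to remember what not to log.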
Operational Checklist
Initialize once
Instrumentation and SDK setup should live in app bootstrap or lifespan setup, not inside request handlers.
Sample by signal type
Error, trace, and profiling signals should usually be sampled at different rates rather than sharing one number.
Constrain cardinality
Metric tags and span attributes should stay searchable and bounded.
Set PII policy first
Broad instrumentation without redaction rules creates operational and compliance risk quickly.
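Bounded cardinality can be enforced rather than hoped for. A sketch of a hypothetical attribute filter that keeps only allowlisted keys and buckets status codes into classes; the keys mimic OpenTelemetry semantic-convention names, but the helper itself is an assumption, not SDK API:

```python
ALLOWED_ATTRS = {"http.method", "http.route", "http.status_class"}


def constrain_attrs(attrs: dict) -> dict:
    """Drop non-allowlisted span attributes; bucket status codes into classes."""
    out = {k: v for k, v in attrs.items() if k in ALLOWED_ATTRS}
    status = attrs.get("http.status_code")
    if isinstance(status, int):
        out["http.status_class"] = f"{status // 100}xx"  # 503 -> "5xx"
    return out
```

Applying such a filter where spans and metrics are emitted keeps dimensions searchable no matter what individual call sites try to attach.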
Scenario Table
| symptom | inspect first | likely root cause | safe mitigation | what not to do |
|---|---|---|---|---|
| p95 jumps right after a release but traces are patchy | inspect whether instrumentation was duplicated at bootstrap and whether sampling changed | SDKs were reinitialized inside routes or sampling was accidentally reduced | remove duplicate initialization and temporarily raise sampling only for key endpoints | switch the whole service to permanent 100% tracing under pressure |
| 500 alerts fire but logs, errors, and traces cannot be correlated | inspect whether request ID and trace ID are bound in the same request scope | correlation propagation is missing or middleware/contextvars binding broke | restore request-scoped binding and normalize shared correlation fields on hot paths | dump raw request bodies and secrets into logs as a shortcut |
| observability cost spikes and search gets slower | inspect metric tags and span attributes for high-cardinality values | fields such as user_id, email, or order_id were added too broadly | remove high-cardinality fields and keep only bounded searchable dimensions | ignore the cardinality issue and only shorten retention until the signal becomes useless |
Code Review Lens
- Check whether instrumentation and SDK setup happen once during bootstrap or lifespan.
- Check whether request IDs, trace IDs, and error events share one request context.
- Check whether sampling and redaction policy are explicit instead of hidden inside tool defaults.
- Check whether spans, tags, and log fields stay within bounded cardinality.
Common Anti-Patterns
- reinitializing logger scope, tracer setup, or Sentry SDK inside routes or dependencies
- having traces without request IDs or logs that never correlate with traces
- storing full request or response bodies in operational logs or error context
- adding more metric labels or span attributes every time something becomes hard to debug
Likely Discussion Questions
- During an incident, which signal would you inspect first and how would you correlate the rest?
- Why is bootstrap-time instrumentation safer than route-level initialization?
- Which fields belong in traces and logs, and which should be redacted or omitted?
- When tracing cost rises, what should be constrained before you just reduce retention?
Strong Answer Frame
- Start by checking whether logs, traces, and errors share the same request context.
- Diagnose the system through four separate concerns: initialization point, sampling, cardinality, and redaction.
- Under pressure, increase signal in a bounded way and then restore conservative defaults once the cause is known.
- Close by framing observability as an operational contract, not a pile of vendor SDKs.