
Observability

Observability is not just "more logging." In real services, it is the design that connects traces, error events, structured logs, and request context into one coherent operating model. Without that, slow requests, hidden N+1 queries, downstream timeouts, and background-task failures end up scattered across unrelated tools.

Quick takeaway: a practical FastAPI stack often uses OpenTelemetry as the trace backbone, Sentry as the error and performance product, and `structlog` for structured request-scoped logs. The important part is not the brand list; it is initializing them once at startup, separating sampling from PII policy, and avoiding high-cardinality spam.

A Practical Stack

| Role | Good default | Why |
| --- | --- | --- |
| distributed traces and spans | OpenTelemetry | vendor-neutral foundation with strong FastAPI and SQLAlchemy instrumentation |
| error and performance monitoring | Sentry | useful operational UI for errors plus tracing context |
| structured logs and request context | structlog | clean contextvars story for request and trace correlation |

The Big Picture

Good observability connects logs, traces, and errors through one request context instead of leaving them isolated.

Instrument OpenTelemetry Once During Bootstrap

```py
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from sqlalchemy.engine import Engine


def configure_observability(app: FastAPI, engine: Engine) -> None:
    # Wrap the app so every request gets a span, skipping noisy
    # infrastructure endpoints.
    FastAPIInstrumentor.instrument_app(
        app,
        excluded_urls="health,metrics",
    )
    # Emit a span for every SQL statement executed through this engine.
    SQLAlchemyInstrumentor().instrument(engine=engine)
```

FastAPI and SQLAlchemy instrumentation should usually be attached once during bootstrap. Doing it inside routes or dependencies can cause duplicate instrumentation and confusing runtime behavior.
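One defensive sketch is to make the bootstrap entry point idempotent, so an accidental second call cannot double-instrument. The guard flag and function name here are illustrative, not part of any SDK:

```python
_instrumented = False


def instrument_once(app: object, engine: object) -> bool:
    """Run instrumentation setup at most once per process.

    Returns True only on the call that actually did the work, so a
    stray second call becomes a visible no-op instead of a duplicate.
    """
    global _instrumented
    if _instrumented:
        return False
    # The real setup from configure_observability would go here, e.g.
    # FastAPIInstrumentor.instrument_app(app, excluded_urls="health,metrics")
    # SQLAlchemyInstrumentor().instrument(engine=engine)
    _instrumented = True
    return True
```

Calling this from a lifespan hook keeps the guarantee even if a refactor later wires it into more than one place.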

Design Sentry Sampling Deliberately

```py
import sentry_sdk
from sentry_sdk.types import SamplingContext


def traces_sampler(context: SamplingContext) -> float:
    # Preserve an upstream sampling decision so distributed traces stay intact.
    if context.get("parent_sampled") is not None:
        return float(context["parent_sampled"])

    transaction_context = context.get("transaction_context", {})
    name = str(transaction_context.get("name", ""))
    if name.startswith("GET /health"):
        return 0.0  # never trace health checks
    if name.startswith("POST /checkout"):
        return 0.5  # sample business-critical paths more heavily
    return 0.1  # conservative default for everything else


sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
    traces_sampler=traces_sampler,
    sample_rate=1.0,  # error sampling, kept separate from trace sampling
)
```

  • Error retention and trace sampling usually deserve different rates.
  • Sentry's docs explicitly recommend deliberate use of `traces_sample_rate` or `traces_sampler`.
  • Inherited parent sampling decisions usually should be preserved so distributed traces stay intact.
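Because the sampler is a plain function of its context dict, the policy above can be checked in an ordinary unit test without initializing the SDK. This sketch re-declares the same logic standalone to keep it import-free:

```python
def sampler(context: dict) -> float:
    # Same policy as traces_sampler above, expressed without SDK imports.
    if context.get("parent_sampled") is not None:
        return float(context["parent_sampled"])
    name = str(context.get("transaction_context", {}).get("name", ""))
    if name.startswith("GET /health"):
        return 0.0
    if name.startswith("POST /checkout"):
        return 0.5
    return 0.1


assert sampler({"parent_sampled": True}) == 1.0
assert sampler({"transaction_context": {"name": "GET /health"}}) == 0.0
assert sampler({"transaction_context": {"name": "GET /orders"}}) == 0.1
```

Testing the policy this way makes sampling changes reviewable like any other business rule.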

Bind Request Context into Logs with structlog

```py
import uuid

import structlog
from structlog.contextvars import bind_contextvars, clear_contextvars

log = structlog.get_logger()


async def logging_middleware(request, call_next):
    # Clear leftovers from the previous request handled on this worker,
    # then bind fields that every log line in this request will carry.
    clear_contextvars()
    bind_contextvars(
        request_id=request.headers.get("x-request-id") or str(uuid.uuid4()),
        path=request.url.path,
    )
    response = await call_next(request)
    log.info("request.complete", status_code=response.status_code)
    return response
```

For the bound fields to appear in output, structlog must be configured with the `structlog.contextvars.merge_contextvars` processor.
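The mechanism structlog builds on is the standard library's `contextvars`. A minimal stdlib-only sketch (variable and function names illustrative) shows why bound fields follow the request rather than the function call chain:

```python
import contextvars

# One context variable per correlation field; "-" marks unbound requests.
request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "request_id", default="-"
)


def render_line(event: str) -> str:
    # Any code running in this request's context sees the same bound
    # request_id, without threading it through function arguments.
    return f"request_id={request_id_var.get()} event={event}"


request_id_var.set("req-123")
```

Because asyncio copies the context per task, concurrent requests on one worker do not leak each other's bindings.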

Good Patterns

  • bind request IDs and trace IDs in one request scope
  • exclude health checks and other noisy low-value paths from tracing
  • keep span names at route or business-action granularity
  • instrument meaningful boundaries such as DB, outbound HTTP, or queue publishing
  • decide PII and secret redaction before shipping events broadly
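For the first bullet, correlating logs with OpenTelemetry traces usually means rendering the trace ID in its canonical W3C trace-context form: a 128-bit integer as 32 lowercase hex characters. A minimal helper (assuming you already have the integer ID from the current span context):

```python
def format_trace_id(trace_id: int) -> str:
    # W3C trace-context encodes the 128-bit trace ID as 32 lowercase hex
    # characters; this is the form log backends expect for correlation.
    return format(trace_id, "032x")
```

Binding this string alongside the request ID makes every log line joinable with its trace.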

Patterns to Avoid

  • tracing every request at 100% without considering volume
  • adding high-cardinality values such as user_id, email, or order_id everywhere
  • reinitializing logger or Sentry scope repeatedly inside route handlers
  • logging whole request bodies or raw secrets
  • creating spans inside tight per-row loops
  • double auto-instrumenting the same path with overlapping SDKs
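The per-row-loop point can be made concrete with a counting stand-in for a tracer (purely illustrative, not an OTel API):

```python
class CountingTracer:
    """Stand-in for a tracer that only counts spans created."""

    def __init__(self) -> None:
        self.spans = 0

    def start_span(self, name: str) -> None:
        self.spans += 1


rows = list(range(10_000))
tracer = CountingTracer()

# One span around the whole batch keeps the trace readable and cheap.
tracer.start_span("db.batch_insert")
for row in rows:
    pass  # per-row work happens here, with no per-row span

# The anti-pattern would be calling tracer.start_span(...) inside the
# loop, producing 10,000 spans for a single request.
```

If per-row visibility is genuinely needed, a span attribute like `row_count` on the batch span is usually enough.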

Operational Checklist

Initialize once

Instrumentation and SDK setup should live in app bootstrap or lifespan setup, not inside request handlers.

Sample by signal type

Error, trace, and profiling signals usually should not all use the same rate.

Constrain cardinality

Metric tags and span attributes should stay searchable and bounded.
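One common bounding trick is collapsing raw values into a small enumerable set before using them as tags, for example status codes into classes (helper name illustrative):

```python
def status_class(status_code: int) -> str:
    # Collapses hundreds of possible status codes into a handful of tag
    # values (2xx, 3xx, 4xx, 5xx), keeping metric cardinality bounded.
    return f"{status_code // 100}xx"
```

The same idea applies to paths (route templates instead of raw URLs) and durations (buckets instead of raw milliseconds).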

Set PII policy first

Broad instrumentation without redaction rules creates operational and compliance risk quickly.
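A redaction rule can be written as a plain function over an event dict, in the style of a structlog processor, so the same helper can run before logs or error events leave the process. The key list here is an assumption; real services should derive it from their own data classification:

```python
SENSITIVE_KEYS = {"password", "authorization", "email", "set-cookie"}


def redact(event_dict: dict) -> dict:
    # Mask known-sensitive fields at the source instead of relying on
    # downstream tools to scrub them after ingestion.
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in event_dict.items()
    }
```

Running redaction in-process means a misconfigured backend never receives the raw values in the first place.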

Scenario Table

| Symptom | Inspect first | Likely root cause | Safe mitigation | What not to do |
| --- | --- | --- | --- | --- |
| p95 jumps right after a release but traces are patchy | whether instrumentation was duplicated at bootstrap and whether sampling changed | SDKs were reinitialized inside routes, or sampling was accidentally reduced | remove duplicate initialization and temporarily raise sampling only for key endpoints | switch the whole service to permanent 100% tracing under pressure |
| 500 alerts fire but logs, errors, and traces cannot be correlated | whether request ID and trace ID are bound in the same request scope | correlation propagation is missing or middleware/contextvars binding broke | restore request-scoped binding and normalize shared correlation fields on hot paths | dump raw request bodies and secrets into logs as a shortcut |
| observability cost spikes and search gets slower | metric tags and span attributes for high-cardinality values | fields such as user_id, email, or order_id were added too broadly | remove high-cardinality fields and keep only bounded, searchable dimensions | ignore the cardinality issue and only shorten retention until the signal becomes useless |

Code Review Lens

  • Check whether instrumentation and SDK setup happen once during bootstrap or lifespan.
  • Check whether request IDs, trace IDs, and error events share one request context.
  • Check whether sampling and redaction policy are explicit instead of hidden inside tool defaults.
  • Check whether spans, tags, and log fields stay within bounded cardinality.

Common Anti-Patterns

  • reinitializing logger scope, tracer setup, or Sentry SDK inside routes or dependencies
  • having traces without request IDs or logs that never correlate with traces
  • storing full request or response bodies in operational logs or error context
  • adding more metric labels or span attributes every time something becomes hard to debug

Likely Discussion Questions

  • During an incident, which signal would you inspect first and how would you correlate the rest?
  • Why is bootstrap-time instrumentation safer than route-level initialization?
  • Which fields belong in traces and logs, and which should be redacted or omitted?
  • When tracing cost rises, what should be constrained before you just reduce retention?

Strong Answer Frame

  • Start by checking whether logs, traces, and errors share the same request context.
  • Diagnose the system through four separate concerns: initialization point, sampling, cardinality, and redaction.
  • Under pressure, increase signal in a bounded way and then restore conservative defaults once the cause is known.
  • Close by framing observability as an operational contract, not a pile of vendor SDKs.

Built with VitePress for a Python 3.14 handbook.