CPython Internals Advanced

This chapter treats CPython internals as an execution model, not as trivia. Once you connect source -> AST -> code object -> frame -> eval loop -> object/memory layers, performance discussions become concrete instead of vague.

Quick takeaway: build intuition in three layers: execution (frame/code/bytecode), object model (`PyObject`, type, refcount), and memory model (refcount plus cyclic GC plus allocators). Then verify with `dis`, `ast`, `gc`, `tracemalloc`, and `sys.monitoring` labs.

Execution Pipeline

Python source moves through multiple internal stages before object-level operations execute.

1) Keep code objects and frames distinct

  • code object: static execution plan (constants, names, bytecode)
  • frame object: runtime context (locals, stack, current instruction)
  • one function object can produce many frames across calls
```py
import inspect


def sample(x: int, y: int) -> int:
    frame = inspect.currentframe()
    assert frame is not None
    print("frame locals keys:", list(frame.f_locals))
    return x + y


print(sample(2, 3))
```
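
The split above can be demonstrated directly. A minimal sketch (the recursive `depth` helper is illustrative, not part of the labs): every call creates a fresh frame, while all calls share a single code object.

```py
import inspect

frame_ids: list[int] = []
code_ids: list[int] = []


def depth(n: int) -> int:
    # each call gets its own frame object...
    frame = inspect.currentframe()
    assert frame is not None
    frame_ids.append(id(frame))
    # ...but every frame points at the same static code object
    code_ids.append(id(frame.f_code))
    if n == 0:
        return 0
    return 1 + depth(n - 1)


depth(2)
print("distinct frames:", len(set(frame_ids)))        # 3 frames alive during recursion
print("distinct code objects:", len(set(code_ids)))   # 1
```

Because the recursive frames are all alive at the same time, their ids are guaranteed distinct, which makes the one-code-many-frames relationship visible.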

2) Bytecode specialization helps stable hot paths

  • in Python 3.11+, adaptive specialization optimizes common opcode paths
  • highly polymorphic paths can reduce specialization benefit
  • `dis.dis()` is the fastest way to inspect execution shape
```py
import dis


def add_loop(n: int) -> int:
    total = 0
    for i in range(n):
        total += i
    return total


dis.dis(add_loop)
```
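
To see specialization rather than just the generic opcodes, warm the function first and then pass `adaptive=True` to `dis.dis` (available since Python 3.11). A hedged sketch, with `add_ints` as an invented example function:

```py
import dis
import sys


def add_ints(a: int, b: int) -> int:
    return a + b


# warm the function so the adaptive interpreter can specialize its opcodes
for _ in range(2000):
    add_ints(1, 2)

if sys.version_info >= (3, 11):
    # adaptive=True shows quickened instruction forms on a stable int-only
    # path (e.g. a specialized BINARY_OP variant instead of the generic one)
    dis.dis(add_ints, adaptive=True)
else:
    dis.dis(add_ints)
```

Comparing the output with and without the warm-up loop shows how a monomorphic hot path earns specialized instructions, which a highly polymorphic call site would not.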

3) Memory model: refcount plus cyclic GC

CPython combines immediate deallocation from reference counting with periodic cycle collection.
  • many objects are deallocated immediately when refcount hits zero
  • cyclic references survive refcount and need GC collection
  • `__del__` can complicate finalization order and timing
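
A minimal sketch of both mechanisms, using a hypothetical `Node` class: `sys.getrefcount` reports the current count (plus one for its own argument), and `gc.collect()` reclaims a cycle that reference counting alone cannot free.

```py
import gc
import sys


class Node:
    def __init__(self) -> None:
        self.ref: "Node | None" = None


# reference counting: the count is visible directly
obj = Node()
print("refcount:", sys.getrefcount(obj))  # includes getrefcount's own argument

# a two-node cycle: each object keeps the other's refcount above zero
a, b = Node(), Node()
a.ref, b.ref = b, a
del a, b  # refcounts never reach zero, so immediate deallocation cannot happen

collected = gc.collect()  # the cyclic collector finds the unreachable cycle
print("unreachable objects collected:", collected)
```

This is why "refcount" and "GC" must be kept distinct when reasoning about memory: the first is immediate and deterministic, the second is periodic and exists precisely for cases like this.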

4) Use observability tools intentionally

| Tool | Purpose | Typical use |
| --- | --- | --- |
| `dis` | inspect bytecode | compare execution shape |
| `ast` | inspect syntax trees | code analysis/generation |
| `gc` | inspect collector state | cycle and tuning investigations |
| `tracemalloc` | trace allocations | leak and growth investigation |
| `sys.monitoring` | low-overhead runtime events | event-level execution labs |
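
For example, a minimal `tracemalloc` session (the list comprehension is just a stand-in for real allocation pressure):

```py
import tracemalloc

tracemalloc.start()

# allocate something measurable
data = [str(i) * 10 for i in range(10_000)]

snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# top allocation sites, grouped by source line
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```

The snapshot remains usable after `stop()`, so tracing overhead is only paid during the window you care about.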

5) Read GIL, free-threaded builds, and subinterpreters on one axis

  • default CPython: GIL constrains parallel bytecode execution
  • free-threaded build variants: different parallelism tradeoffs
  • subinterpreters: isolation-oriented parallelism tradeoff

The key question is not "which is always faster?" but "which sharing and isolation costs fit this workload?"
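
One way to check which variant you are running, sketched under CPython 3.13+ assumptions: the `Py_GIL_DISABLED` build config var marks free-threaded builds, and `sys._is_gil_enabled()` may not exist on default builds, so probe for it with `getattr`.

```py
import sys
import sysconfig

# Py_GIL_DISABLED is set in free-threaded builds of CPython 3.13+
free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
print("free-threaded build:", free_threaded_build)

# builds that can disable the GIL expose sys._is_gil_enabled()
gil_check = getattr(sys, "_is_gil_enabled", None)
if gil_check is not None:
    print("GIL currently enabled:", gil_check())
else:
    print("GIL is always enabled on this build")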

Suggested Lab Routine

  1. compare two functions with `dis`
  2. inspect top allocation lines with `tracemalloc`
  3. create a reference cycle and inspect `gc.collect()`
  4. collect a small event sample with `sys.monitoring` when available

This repository includes `examples/cpython_runtime_labs.py` for these steps.
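
Step 4 can be sketched with the `sys.monitoring` API (Python 3.12+). The `lab` and `work` names are illustrative, and the sketch assumes `PROFILER_ID` is not already claimed by another tool:

```py
import sys

calls: list[str] = []


def lab() -> None:
    def work(x: int) -> int:
        return x * 2

    mon = sys.monitoring
    tool = mon.PROFILER_ID
    mon.use_tool_id(tool, "lab")  # claim a tool id before registering events

    def on_py_start(code, instruction_offset):
        # record the name of every Python function that starts executing
        calls.append(code.co_name)

    mon.register_callback(tool, mon.events.PY_START, on_py_start)
    mon.set_events(tool, mon.events.PY_START)
    try:
        work(21)
    finally:
        # always unregister: monitoring stays active until events are cleared
        mon.set_events(tool, mon.events.NO_EVENTS)
        mon.free_tool_id(tool)


if sys.version_info >= (3, 12):
    lab()
    print("observed calls:", calls)
```

Unlike `sys.settrace`, events are only dispatched for the event kinds you opt into, which is what keeps the overhead low enough for execution labs.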

Common Mistakes

  • treating refcount and GC as the same mechanism
  • generalizing full-system conclusions from tiny micro-benchmarks
  • assuming one bytecode detail explains all latency
  • ignoring framework I/O and DB costs while over-focusing on interpreter internals

Good Companion Chapters

Official References

Built with VitePress for a Python 3.14 handbook.