# CPython Internals Advanced
This chapter treats CPython internals as an execution model, not as trivia. Once you connect source -> AST -> code object -> frame -> eval loop -> object/memory layers, performance discussions become concrete instead of vague.
Quick takeaway: build intuition in three layers: execution (frame/code/bytecode), object model (`PyObject`, type, refcount), and memory model (refcount plus cyclic GC plus allocators). Then verify with `dis`, `ast`, `gc`, `tracemalloc`, and `sys.monitoring` labs.
## Execution Pipeline
1) Keep code objects and frames distinct
- code object: static execution plan (constants, names, bytecode)
- frame object: runtime context (locals, stack, current instruction)
- one function object can produce many frames across calls
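The static side of this split is visible on any function through its `__code__` attribute. A minimal sketch (the function `plan_demo` is just an illustrative stand-in):

```py
def plan_demo(x, y):
    msg = "sum"
    return x + y

code = plan_demo.__code__           # the static execution plan, shared by every call
print(code.co_name)                 # function name: 'plan_demo'
print(code.co_varnames)             # parameters, then locals: ('x', 'y', 'msg')
print("sum" in code.co_consts)      # constants are baked in at compile time: True
```

No matter how many frames a call stack creates, they all point back at this one code object.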
```py
import inspect

def sample(x: int, y: int) -> int:
    # the frame object carries the runtime context for this specific call
    frame = inspect.currentframe()
    assert frame is not None
    print("frame locals keys:", list(frame.f_locals))
    return x + y
```

2) Bytecode specialization helps stable hot paths
- in Python 3.11+, adaptive specialization optimizes common opcode paths
- highly polymorphic paths can reduce specialization benefit
- `dis.dis()` is the fastest way to inspect execution shape
```py
import dis

def add_loop(n: int) -> int:
    total = 0
    for i in range(n):
        total += i
    return total

# print the bytecode to see the loop's execution shape
dis.dis(add_loop)
```

3) Memory model: refcount plus cyclic GC
- many objects are deallocated immediately when refcount hits zero
- cyclic references survive refcount and need GC collection
- `__del__` can complicate finalization order and timing
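The refcount/GC distinction is easy to demonstrate; a minimal sketch with a self-referencing object:

```py
import gc

class Node:
    def __init__(self):
        self.ref = None

a = Node()
del a  # refcount hits zero: freed immediately, no collector involved

b = Node()
b.ref = b  # self-reference: refcount can never reach zero on its own
del b      # the object survives until the cyclic collector runs

print("unreachable objects found:", gc.collect())  # at least 1
```

`gc.collect()` returns the number of unreachable objects it found, which here includes the cycle that reference counting alone could not reclaim.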
4) Use observability tools intentionally
| Tool | Purpose | Typical use |
|---|---|---|
| `dis` | inspect bytecode | compare execution shape |
| `ast` | inspect syntax trees | code analysis/generation |
| `gc` | inspect collector state | cycle and tuning investigations |
| `tracemalloc` | trace allocations | leak and growth investigation |
| `sys.monitoring` | low-overhead runtime events | event-level execution labs |
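As a minimal sketch of the `tracemalloc` row, using nothing beyond the standard library:

```py
import tracemalloc

tracemalloc.start()
data = [bytes(1_000) for _ in range(100)]  # roughly 100 KB of allocations
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# top allocation sites, grouped by source line
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)
```

In a real investigation you would compare two snapshots (`snapshot.compare_to`) across a suspected leak window rather than reading a single snapshot in isolation.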
5) Read GIL, free-threaded builds, and subinterpreters on one axis
- default CPython: GIL constrains parallel bytecode execution
- free-threaded build variants: different parallelism tradeoffs
- subinterpreters: isolation-oriented parallelism tradeoff
The key question is not "which is always faster?" but "which sharing and isolation costs fit this workload?"
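One quick, version-dependent way to see which build and mode you are on (the attributes below only exist on newer CPythons, hence the guards):

```py
import sys
import sysconfig

# Py_GIL_DISABLED is 1 on free-threaded builds (3.13+), 0 or None otherwise
print("free-threaded build:", bool(sysconfig.get_config_var("Py_GIL_DISABLED")))

# on 3.13+, the GIL can still be active at runtime even on free-threaded builds
if hasattr(sys, "_is_gil_enabled"):
    print("GIL currently enabled:", sys._is_gil_enabled())
```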
## Suggested Lab Routine
- compare two functions with `dis`
- inspect top allocation lines with `tracemalloc`
- create a reference cycle and inspect `gc.collect()`
- collect a small event sample with `sys.monitoring` when available
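The last step can be sketched like this on Python 3.12+ (guarded so it is a no-op on older versions; the tool name "lab" is arbitrary):

```py
import sys

if hasattr(sys, "monitoring"):  # sys.monitoring exists on 3.12+
    TOOL = sys.monitoring.PROFILER_ID
    sys.monitoring.use_tool_id(TOOL, "lab")
    starts = []

    def on_py_start(code, instruction_offset):
        starts.append(code.co_name)  # record each Python function entry

    sys.monitoring.register_callback(
        TOOL, sys.monitoring.events.PY_START, on_py_start
    )
    sys.monitoring.set_events(TOOL, sys.monitoring.events.PY_START)

    def work():
        return sum(range(10))

    work()

    sys.monitoring.set_events(TOOL, sys.monitoring.events.NO_EVENTS)
    sys.monitoring.free_tool_id(TOOL)
    print("PY_START events seen:", starts)
```

Unlike `sys.settrace`, events are opt-in per tool and per event type, which is what keeps the overhead low enough for sampling in realistic workloads.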
This repository includes `examples/cpython_runtime_labs.py` for these steps.
## Common Mistakes
- treating refcount and GC as the same mechanism
- generalizing full-system conclusions from tiny micro-benchmarks
- assuming one bytecode detail explains all latency
- ignoring framework I/O and DB costs while over-focusing on interpreter internals