2026-01-30

software performance model

(compute, capacity, movement, latency)

compute

definition

peak rate at which executable operations are retired, assuming ideal data availability and resolved dependencies.

elements

instruction throughput

  • scalar issue rate
  • vector issue rate
  • instruction mix sensitivity

execution width

  • simd lane count
  • superscalar width

pipeline structure

  • depth
  • hazard resolution mechanisms

specialized units

  • floating point units
  • tensor or matrix units
  • cryptographic accelerators

frequency behavior

  • nominal clock rate
  • turbo and thermal throttling

failure signatures

  • low ipc despite available work
  • serialized dependency chains
  • unused vector capacity

mitigation space

  • vectorization
  • instruction scheduling
  • kernel fusion (sketched below)
  • precision reduction
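
a minimal sketch of two of these mitigations in c: kernel fusion plus a vectorization-friendly loop body. the function names and the scale-then-add workload are invented for illustration.

    #include <stddef.h>

    /* unfused: two passes; for large n, out is written, evicted,
       and re-read between the two kernels */
    void scale_then_add(float *out, const float *a, const float *b,
                        float s, size_t n) {
        for (size_t i = 0; i < n; i++) out[i] = a[i] * s;
        for (size_t i = 0; i < n; i++) out[i] += b[i];
    }

    /* fused: one pass, same semantics; both forms have branch-free,
       dependency-free bodies a compiler can auto-vectorize, but the
       fused loop halves the passes over memory and doubles the
       useful work per loaded byte */
    void scale_add_fused(float *out, const float *a, const float *b,
                         float s, size_t n) {
        for (size_t i = 0; i < n; i++) out[i] = a[i] * s + b[i];
    }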

capacity

definition

maximum live state that can be retained without eviction, spill, or recomputation. capacity is tiered. each tier introduces a discrete eviction boundary.

elements

registers

  • architectural registers
  • physical register file

caches

  • effective l1, l2, l3 capacity
  • associativity constraints

main memory

  • total ram
  • numa-local vs remote memory

persistent working set

  • page cache residency
  • mmap footprint

failure signatures

  • cache thrashing
  • tlb misses
  • page faults
  • numa amplification

mitigation space

  • working set reduction
  • blocking and tiling (sketched below)
  • data structure compaction
  • explicit locality management
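
a minimal sketch of blocking in c, assuming a square row-major transpose; the tile size of 32 is an illustrative guess and would be tuned so two tiles fit in l1.

    #include <stddef.h>

    #define TILE 32  /* illustrative, not measured */

    /* a naive transpose walks dst in stride-n steps, so each line is
       evicted before its next use; tiling keeps one TILE x TILE block
       of src and dst live at a time, inside the l1 eviction boundary */
    void transpose_tiled(double *dst, const double *src, size_t n) {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t j = jj; j < jj + TILE && j < n; j++)
                        dst[j * n + i] = src[i * n + j];
    }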

movement

definition

maximum rate at which state can cross a boundary between capacity domains. movement is layered and directional. each boundary has an independent ceiling.

elements

intra-core transfer

  • load and store bandwidth between registers and l1

cache hierarchy transfer

  • cache line bandwidth
  • writeback throughput

memory subsystem

  • dram bandwidth
  • memory controller saturation

interconnect

  • numa fabric
  • pcie links
  • accelerator interconnects

storage io

  • sequential throughput
  • dma transfer rates

failure signatures

  • sustained bandwidth saturation
  • queue buildup
  • false sharing
  • write amplification

mitigation space

  • streaming access patterns
  • structure of arrays (sketched below)
  • prefetching
  • batching
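
a minimal sketch of the structure-of-arrays rewrite in c; the particle fields are invented for illustration.

    #include <stddef.h>

    /* aos: updating x alone still drags y, z, and mass across every
       boundary, because they share cache lines with x */
    struct particle { float x, y, z, mass; };

    void advance_aos(struct particle *p, float dx, size_t n) {
        for (size_t i = 0; i < n; i++) p[i].x += dx;
    }

    /* soa: the same update streams one dense array, so every byte
       crossing the dram boundary is useful payload */
    struct particles { float *x, *y, *z, *mass; };

    void advance_soa(struct particles *p, float dx, size_t n) {
        for (size_t i = 0; i < n; i++) p->x[i] += dx;
    }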

latency

definition

minimum time required to resolve a dependency chain. latency bounds critical paths and tail behavior independently of steady-state throughput.

elements

memory access latency

  • cache miss penalties
  • dram access latency

control flow latency

  • branch misprediction penalties

synchronization latency

  • lock acquisition delay
  • atomic contention

io latency

  • syscall overhead
  • device round-trip time

remote access latency

  • numa hop delay
  • network rtt

failure signatures

  • pointer chasing stalls
  • fine-grained synchronization collapse
  • long serial chains
  • high tail latency

mitigation space

  • latency hiding via concurrency (sketched below)
  • speculation
  • asynchronous io
  • data layout transformation
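
a minimal sketch of latency hiding via concurrency in c: interleaving k independent pointer chases so several misses are in flight at once. the node layout and the multi-list workload are invented for illustration.

    #include <stddef.h>

    struct node { struct node *next; long val; };

    /* one chase: each load depends on the previous, so runtime is
       chain length times memory latency */
    long sum_chain(struct node *head) {
        long s = 0;
        for (struct node *n = head; n; n = n->next) s += n->val;
        return s;
    }

    /* k chases interleaved: up to k independent misses outstanding
       per round, same total, shorter critical path per element;
       note this consumes the caller's heads array */
    long sum_chains(struct node **heads, int k) {
        long s = 0;
        int live = 1;
        while (live) {
            live = 0;
            for (int i = 0; i < k; i++) {
                if (heads[i]) {
                    s += heads[i]->val;
                    heads[i] = heads[i]->next;
                    live = 1;
                }
            }
        }
        return s;
    }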

workload classes by dominant constraint

this section classifies common low-level workloads by the tuple element whose invariant binds first. the intent is diagnostic compression rather than exhaustive taxonomy.

compute-dominant workloads

typical cases

  • dense linear algebra
  • cryptographic primitives
  • compression and decompression kernels

characteristic structure

  • high arithmetic intensity (sketched below)
  • small or cache-resident working sets
  • regular access patterns
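
a minimal sketch of why dense linear algebra lands here, in c; the flop and byte counts in the comment are the usual back-of-envelope estimates, and the kernel assumes c is zero-initialized.

    #include <stddef.h>

    /* dense matmul does ~2*n^3 flops over ~3*n^2 doubles (24*n^2
       bytes ideally moved once), so arithmetic intensity is ~n/12
       flops per byte and grows with n, which is what lets the kernel
       stay compute-bound; the ikj order keeps inner accesses unit
       stride */
    void matmul(double *c, const double *a, const double *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < n; j++)
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
    }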

movement-dominant workloads

typical cases

  • memory copy and set
  • streaming reductions
  • column or row scans

characteristic structure

  • low reuse
  • predictable linear access
  • throughput is bounded by achievable bytes per second (sketched below)
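
a minimal sketch of such a kernel in c, with a back-of-envelope floor; the 40 GB/s figure is an assumed dram bandwidth, not a measurement.

    #include <stddef.h>

    /* streaming sum: one load per element and no reuse, so the kernel
       must move 8n bytes; at an assumed 40 GB/s the floor is roughly
       8n / 40e9 seconds no matter how fast the alus are */
    double stream_sum(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }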

capacity-dominant workloads

typical cases

  • graph traversal
  • large hash tables
  • in-memory indices

characteristic structure

  • working set exceeds a cache or memory tier
  • reuse exists but is defeated by layout or scale
  • eviction rate dominates execution time

latency-dominant workloads

typical cases

  • pointer-heavy data structures
  • lock-dominated critical sections
  • fine-grained rpc paths

characteristic structure

  • long dependency chains
  • irregular access
  • limited exploitable parallelism

mixed-constraint workloads

typical cases

  • stencil computations
  • image processing pipelines
  • database execution engines

characteristic structure

  • no single invariant dominates globally
  • optimization shifts pressure between tuple elements
  • performance tuning is path dependent

performance objective

performance optimization is constraint management under fixed semantics.

the objective is to delay violation of the dominant invariant governing execution. time, throughput, and latency are observable projections of which constraint binds first.

framing

given:

  • fixed program semantics
  • a workload distribution
  • bounded resources

optimization seeks to:

  • identify the binding axis in the performance tuple
  • reshape execution so pressure shifts to a cheaper or more scalable axis
  • or expand only the binding resource

axis-relative optimality

optimality is defined relative to a binding invariant.

an implementation may be optimal along one axis while pathological along another. claims of optimality are only meaningful when qualified by the dominant constraint.

resource expansion rule

hardware scaling is effective if and only if:

  • the binding constraint corresponds to a physical resource limit
  • the workload projects strongly onto that axis
  • algorithmic structure already aligns with that constraint

otherwise hardware scaling amplifies inefficiency already present in the mapping.

guiding principles

identify the binding constraint

determine which axis saturates or dominates delay. optimization without this identification is unguided.

improve the most pressured facet

reduce demand on the binding axis before addressing others. improvements to non-binding axes do not change first-order behavior.

trade across axes deliberately

effective optimizations exchange pressure across axes:

  • compute for locality
  • memory for recomputation (sketched below)
  • latency for concurrency
  • bandwidth for capacity

all such trades must preserve semantics and invariants.
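
a minimal sketch of the memory-for-recomputation trade in c; the hann-window workload is invented for illustration.

    #include <math.h>
    #include <stddef.h>

    /* stored: a precomputed table costs capacity and movement on
       every pass over x */
    void apply_window_table(float *x, const float *table, size_t n) {
        for (size_t i = 0; i < n; i++) x[i] *= table[i];
    }

    /* recomputed: same semantics with zero extra state (assumes
       n > 1); a good trade when the compute axis is slack and the
       table would pressure a cache tier the rest of the kernel needs */
    void apply_window_recompute(float *x, size_t n) {
        for (size_t i = 0; i < n; i++)
            x[i] *= 0.5f - 0.5f * cosf(6.28318530718f * (float)i
                                       / (float)(n - 1));
    }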

prefer structural over parametric changes

structural changes:

  • algorithm selection
  • data layout
  • access pattern reshaping

parametric changes:

  • unrolling
  • prefetch hints
  • compiler flags

structural changes alter which invariant binds. parametric changes only approach an existing bound.
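
a minimal sketch of the distinction in c, assuming a row-major matrix with n divisible by 4; the summation kernel is invented for illustration.

    #include <stddef.h>

    /* parametric: unrolling the column-order walk approaches the
       existing bound, but every access is still a stride-n miss */
    double sum_colmajor_unrolled(const double *m, size_t n) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            for (size_t i = 0; i + 4 <= n; i += 4)
                s += m[i * n + j] + m[(i + 1) * n + j]
                   + m[(i + 2) * n + j] + m[(i + 3) * n + j];
        return s;
    }

    /* structural: walking in row order changes which invariant binds,
       turning a movement-bound scan into a compute-bound one */
    double sum_rowmajor(const double *m, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                s += m[i * n + j];
        return s;
    }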