2026-01-30

software performance model

(compute, capacity, movement, latency)

compute

definition

peak rate at which executable operations are retired, assuming ideal data availability and resolved dependencies.

elements

instruction throughput

  • scalar issue rate
  • vector issue rate
  • instruction mix sensitivity

execution width

  • simd lane count
  • superscalar width

pipeline structure

  • depth
  • hazard resolution mechanisms

specialized units

  • floating point units
  • tensor or matrix units
  • cryptographic accelerators

frequency behavior

  • nominal clock rate
  • turbo and thermal throttling

failure signatures

  • low ipc despite available work
  • serialized dependency chains
  • unused vector capacity

mitigation space

  • vectorization
  • instruction scheduling
  • kernel fusion (sketched below)
  • precision reduction
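
a minimal sketch of two of these mitigations in c: kernel fusion plus a vectorization-friendly loop body. the function names and the scale-then-add workload are invented for illustration.

    #include <stddef.h>

    /* unfused: two passes; for large n, out is written, evicted,
       and re-read between the two kernels */
    void scale_then_add(float *out, const float *a, const float *b,
                        float s, size_t n) {
        for (size_t i = 0; i < n; i++) out[i] = a[i] * s;
        for (size_t i = 0; i < n; i++) out[i] += b[i];
    }

    /* fused: one pass, same semantics; both forms have branch-free,
       dependency-free bodies a compiler can auto-vectorize, but the
       fused loop halves the passes over memory and doubles the
       useful work per loaded byte */
    void scale_add_fused(float *out, const float *a, const float *b,
                         float s, size_t n) {
        for (size_t i = 0; i < n; i++) out[i] = a[i] * s + b[i];
    }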

capacity

definition

maximum live state that can be retained without eviction, spill, or recomputation. capacity is tiered. each tier introduces a discrete eviction boundary.

elements

registers

  • architectural registers
  • physical register file

caches

  • effective l1, l2, l3 capacity
  • associativity constraints

main memory

  • total ram
  • numa-local vs remote memory

persistent working set

  • page cache residency
  • mmap footprint

failure signatures

  • cache thrashing
  • tlb misses
  • page faults
  • numa amplification

mitigation space

  • working set reduction
  • blocking and tiling (sketched below)
  • data structure compaction
  • explicit locality management
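
a minimal sketch of blocking in c, assuming a square row-major transpose; the tile size of 32 is an illustrative guess and would be tuned so two tiles fit in l1.

    #include <stddef.h>

    #define TILE 32  /* illustrative, not measured */

    /* a naive transpose walks dst in stride-n steps, so each line is
       evicted before its next use; tiling keeps one TILE x TILE block
       of src and dst live at a time, inside the l1 eviction boundary */
    void transpose_tiled(double *dst, const double *src, size_t n) {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t j = jj; j < jj + TILE && j < n; j++)
                        dst[j * n + i] = src[i * n + j];
    }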

movement

definition

maximum rate at which state can cross a boundary between capacity domains. movement is layered and directional. each boundary has an independent ceiling.

elements

intra-core transfer

  • load and store bandwidth between registers and l1

cache hierarchy transfer

  • cache line bandwidth
  • writeback throughput

memory subsystem

  • dram bandwidth
  • memory controller saturation

interconnect

  • numa fabric
  • pcie links
  • accelerator interconnects

storage io

  • sequential throughput
  • dma transfer rates

failure signatures

  • sustained bandwidth saturation
  • queue buildup
  • false sharing
  • write amplification

mitigation space

  • streaming access patterns
  • structure of arrays (sketched below)
  • prefetching
  • batching
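
a minimal sketch of the structure-of-arrays rewrite in c; the particle fields are invented for illustration.

    #include <stddef.h>

    /* aos: updating x alone still drags y, z, and mass across every
       boundary, because they share cache lines with x */
    struct particle { float x, y, z, mass; };

    void advance_aos(struct particle *p, float dx, size_t n) {
        for (size_t i = 0; i < n; i++) p[i].x += dx;
    }

    /* soa: the same update streams one dense array, so every byte
       crossing the dram boundary is useful payload */
    struct particles { float *x, *y, *z, *mass; };

    void advance_soa(struct particles *p, float dx, size_t n) {
        for (size_t i = 0; i < n; i++) p->x[i] += dx;
    }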

latency

definition

minimum time required to resolve a dependency chain. latency bounds critical paths and tail behavior independently of steady-state throughput.

elements

memory access latency

  • cache miss penalties
  • dram access latency

control flow latency

  • branch misprediction penalties

synchronization latency

  • lock acquisition delay
  • atomic contention

io latency

  • syscall overhead
  • device round-trip time

remote access latency

  • numa hop delay
  • network rtt

failure signatures

  • pointer chasing stalls
  • fine-grained synchronization collapse
  • long serial chains
  • high tail latency

mitigation space

  • latency hiding via concurrency (sketched below)
  • speculation
  • asynchronous io
  • data layout transformation
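
a minimal sketch of latency hiding via concurrency in c: interleaving k independent pointer chases so several misses are in flight at once. the node layout and the multi-list workload are invented for illustration.

    #include <stddef.h>

    struct node { struct node *next; long val; };

    /* one chase: each load depends on the previous, so runtime is
       chain length times memory latency */
    long sum_chain(struct node *head) {
        long s = 0;
        for (struct node *n = head; n; n = n->next) s += n->val;
        return s;
    }

    /* k chases interleaved: up to k independent misses outstanding
       per round, same total, shorter critical path per element;
       note this consumes the caller's heads array */
    long sum_chains(struct node **heads, int k) {
        long s = 0;
        int live = 1;
        while (live) {
            live = 0;
            for (int i = 0; i < k; i++) {
                if (heads[i]) {
                    s += heads[i]->val;
                    heads[i] = heads[i]->next;
                    live = 1;
                }
            }
        }
        return s;
    }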

workload classes by dominant constraint

this section classifies common low-level workloads by the tuple element whose invariant binds first. the intent is diagnostic compression rather than exhaustive taxonomy.

compute-dominant workloads

typical cases

  • dense linear algebra
  • cryptographic primitives
  • compression and decompression kernels

characteristic structure

  • high arithmetic intensity (sketched below)
  • small or cache-resident working sets
  • regular access patterns
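
a minimal sketch of why dense linear algebra lands here, in c; the flop and byte counts in the comment are the usual back-of-envelope estimates, and the kernel assumes c is zero-initialized.

    #include <stddef.h>

    /* dense matmul does ~2*n^3 flops over ~3*n^2 doubles (24*n^2
       bytes ideally moved once), so arithmetic intensity is ~n/12
       flops per byte and grows with n, which is what lets the kernel
       stay compute-bound; the ikj order keeps inner accesses unit
       stride */
    void matmul(double *c, const double *a, const double *b, size_t n) {
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < n; j++)
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
    }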

movement-dominant workloads

typical cases

  • memory copy and set
  • streaming reductions
  • column or row scans

characteristic structure

  • low reuse
  • predictable linear access
  • throughput is bounded by achievable bytes per second (sketched below)
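
a minimal sketch of such a kernel in c, with a back-of-envelope floor; the 40 GB/s figure is an assumed dram bandwidth, not a measurement.

    #include <stddef.h>

    /* streaming sum: one load per element and no reuse, so the kernel
       must move 8n bytes; at an assumed 40 GB/s the floor is roughly
       8n / 40e9 seconds no matter how fast the alus are */
    double stream_sum(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += a[i];
        return s;
    }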

capacity-dominant workloads

typical cases

  • graph traversal
  • large hash tables
  • in-memory indices

characteristic structure

  • working set exceeds a cache or memory tier
  • reuse exists but is defeated by layout or scale
  • eviction rate dominates execution time

latency-dominant workloads

typical cases

  • pointer-heavy data structures
  • lock-dominated critical sections
  • fine-grained rpc paths

characteristic structure

  • long dependency chains
  • irregular access
  • limited exploitable parallelism

mixed-constraint workloads

typical cases

  • stencil computations
  • image processing pipelines
  • database execution engines

characteristic structure

  • no single invariant dominates globally
  • optimization shifts pressure between tuple elements
  • performance tuning is path dependent

performance objective

performance optimization is constraint management under fixed semantics.

the objective is to delay violation of the dominant invariant governing execution. time, throughput, and latency are observable projections of which constraint binds first.

framing

given:

  • fixed program semantics
  • a workload distribution
  • bounded resources

optimization seeks to:

  • identify the binding axis in the performance tuple
  • reshape execution so pressure shifts to a cheaper or more scalable axis
  • or expand only the binding resource

axis-relative optimality

optimality is defined relative to a binding invariant.

an implementation may be optimal along one axis while pathological along another. claims of optimality are only meaningful when qualified by the dominant constraint.

resource expansion rule

hardware scaling is effective if and only if:

  • the binding constraint corresponds to a physical resource limit
  • the workload projects strongly onto that axis
  • algorithmic structure already aligns with that constraint

otherwise hardware scaling amplifies inefficiency already present in the mapping.

guiding principles

identify the binding constraint

determine which axis saturates or dominates delay. optimization without this identification is unguided.

improve the most pressured facet

reduce demand on the binding axis before addressing others. improvements to non-binding axes do not change first-order behavior.

trade across axes deliberately

effective optimizations exchange pressure across axes:

  • compute for locality
  • memory for recomputation (sketched below)
  • latency for concurrency
  • bandwidth for capacity

all such trades must preserve semantics and invariants.
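
a minimal sketch of the memory-for-recomputation trade in c; the hann-window workload is invented for illustration.

    #include <math.h>
    #include <stddef.h>

    /* stored: a precomputed table costs capacity and movement on
       every pass over x */
    void apply_window_table(float *x, const float *table, size_t n) {
        for (size_t i = 0; i < n; i++) x[i] *= table[i];
    }

    /* recomputed: same semantics with zero extra state (assumes
       n > 1); a good trade when the compute axis is slack and the
       table would pressure a cache tier the rest of the kernel needs */
    void apply_window_recompute(float *x, size_t n) {
        for (size_t i = 0; i < n; i++)
            x[i] *= 0.5f - 0.5f * cosf(6.28318530718f * (float)i
                                       / (float)(n - 1));
    }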

prefer structural over parametric changes

structural changes:

  • algorithm selection
  • data layout
  • access pattern reshaping

parametric changes:

  • unrolling
  • prefetch hints
  • compiler flags

structural changes alter which invariant binds. parametric changes only approach an existing bound.
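
a minimal sketch of the distinction in c, assuming a row-major matrix with n divisible by 4; the summation kernel is invented for illustration.

    #include <stddef.h>

    /* parametric: unrolling the column-order walk approaches the
       existing bound, but every access is still a stride-n miss */
    double sum_colmajor_unrolled(const double *m, size_t n) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            for (size_t i = 0; i + 4 <= n; i += 4)
                s += m[i * n + j] + m[(i + 1) * n + j]
                   + m[(i + 2) * n + j] + m[(i + 3) * n + j];
        return s;
    }

    /* structural: walking in row order changes which invariant binds,
       turning a movement-bound scan into a compute-bound one */
    double sum_rowmajor(const double *m, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                s += m[i * n + j];
        return s;
    }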