# software performance model `(compute, capacity, movement, latency)`

# compute

## definition

rate at which executable operations are retired, assuming ideal data availability and resolved dependencies.

## elements

instruction throughput

- scalar issue rate
- vector issue rate
- instruction mix sensitivity

execution width

- simd lane count
- superscalar width

pipeline structure

- depth
- hazard resolution mechanisms

specialized units

- floating point units
- tensor or matrix units
- cryptographic accelerators

frequency behavior

- nominal clock rate
- turbo and thermal throttling

## failure signatures

- low ipc despite available work
- serialized dependency chains
- unused vector capacity

## mitigation space

- vectorization
- instruction scheduling
- kernel fusion
- precision reduction

# capacity

## definition

maximum live state that can be retained without eviction, spill, or recomputation. capacity is tiered; each tier introduces a discrete eviction boundary.

## elements

registers

- architectural registers
- physical register file

caches

- effective l1, l2, l3 capacity
- associativity constraints

main memory

- total ram
- numa-local vs remote memory

persistent working set

- page cache residency
- mmap footprint

## failure signatures

- cache thrashing
- tlb misses
- page faults
- numa amplification

## mitigation space

- working set reduction
- blocking and tiling
- data structure compaction
- explicit locality management

# movement

## definition

maximum rate at which state can cross a boundary between capacity domains. movement is layered and directional; each boundary has an independent ceiling.
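the boundary-ceiling framing can be probed directly. the sketch below (python, illustrative only) times a streaming copy whose working set is sized to exceed typical last-level caches, so the observed rate approximates one specific boundary's ceiling rather than a cache-resident rate. the function name and the 64 MiB default are assumptions for illustration, not part of the model.

```python
# a minimal sketch: estimate one movement ceiling by timing a large
# streaming copy. the 64 MiB working set is an invented default chosen
# to exceed typical last-level caches, so the measured rate approximates
# a dram-side boundary rather than a cache boundary.
import time

def copy_bandwidth_bytes_per_s(n_bytes: int = 64 * 1024 * 1024) -> float:
    src = bytearray(n_bytes)
    t0 = time.perf_counter()
    dst = bytes(src)          # one pass reading src, one pass writing dst
    t1 = time.perf_counter()
    # count bytes read plus bytes written crossing the boundary
    return (2 * len(dst)) / (t1 - t0)

if __name__ == "__main__":
    print(f"{copy_bandwidth_bytes_per_s() / 1e9:.2f} GB/s")
```

the measured number is machine- and interpreter-dependent; the point is only that each boundary admits this kind of direct ceiling measurement.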
## elements

intra-core transfer

- register to cache fill rate

cache hierarchy transfer

- cache line bandwidth
- writeback throughput

memory subsystem

- dram bandwidth
- memory controller saturation

interconnect

- numa fabric
- pcie links
- accelerator interconnects

storage io

- sequential throughput
- dma transfer rates

## failure signatures

- sustained bandwidth saturation
- queue buildup
- false sharing
- write amplification

## mitigation space

- streaming access patterns
- structure of arrays
- prefetching
- batching

# latency

## definition

minimum time required to resolve a dependency chain. latency bounds critical paths and tail behavior independently of steady-state throughput.

## elements

memory access latency

- cache miss penalties
- dram access latency

control flow latency

- branch misprediction penalties

synchronization latency

- lock acquisition delay
- atomic contention

io latency

- syscall overhead
- device round-trip time

remote access latency

- numa hop delay
- network rtt

## failure signatures

- pointer chasing stalls
- fine-grained synchronization collapse
- long serial chains
- high tail latency

## mitigation space

- latency hiding via concurrency
- speculation
- asynchronous io
- data layout transformation

# workload classes by dominant constraint

this section classifies common low-level workloads by the tuple element whose invariant is violated first. the intent is diagnostic compression rather than exhaustive taxonomy.
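a toy version of this diagnostic compression can be written down directly, assuming four coarse measurements are available. the measurement names, the check order, and every threshold below are invented placeholders, not derived from any hardware; a real classifier would calibrate them per machine and per tier.

```python
# toy diagnostic: map a few coarse measurements onto the tuple element
# most likely binding. thresholds and check order are invented
# placeholders for illustration, not calibrated values.
def dominant_constraint(ipc: float,
                        bandwidth_utilization: float,
                        working_set_ratio: float,
                        stall_fraction: float) -> str:
    """working_set_ratio = working set / capacity of the relevant tier;
    stall_fraction = cycles stalled on unresolved dependencies / total."""
    if bandwidth_utilization > 0.8:
        return "movement"
    if working_set_ratio > 1.0:
        return "capacity"
    if stall_fraction > 0.5:
        return "latency"
    if ipc > 1.0:
        return "compute"
    return "mixed"
```

the "mixed" fallthrough mirrors the last class below: when no single signal dominates, no single invariant does either.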
## compute-dominant workloads

typical cases

- dense linear algebra
- cryptographic primitives
- compression and decompression kernels

characteristic structure

- high arithmetic intensity
- small or cache-resident working sets
- regular access patterns

## movement-dominant workloads

typical cases

- memory copy and set
- streaming reductions
- column or row scans

characteristic structure

- low reuse
- predictable linear access
- throughput scales with bytes per second

## capacity-dominant workloads

typical cases

- graph traversal
- large hash tables
- in-memory indices

characteristic structure

- working set exceeds a cache or memory tier
- reuse exists but is defeated by layout or scale
- eviction rate dominates execution time

## latency-dominant workloads

typical cases

- pointer-heavy data structures
- lock-dominated critical sections
- fine-grained rpc paths

characteristic structure

- long dependency chains
- irregular access
- limited exploitable parallelism

## mixed-constraint workloads

typical cases

- stencil computations
- image processing pipelines
- database execution engines

characteristic structure

- no single invariant dominates globally
- optimization shifts pressure between tuple elements
- performance tuning is path dependent

# performance objective

performance optimization is constraint management under fixed semantics. the objective is to delay violation of the dominant invariant governing execution. time, throughput, and latency are observable projections of which constraint binds first.

## framing

given:

- fixed program semantics
- a workload distribution
- bounded resources

optimization seeks to:

- identify the binding axis in the performance tuple
- reshape execution so pressure shifts to a cheaper or more scalable axis
- or expand only the binding resource

## axis-relative optimality

optimality is defined relative to a binding invariant. an implementation may be optimal along one axis while pathological along another.
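a roofline-style comparison makes this concrete for the compute and movement axes: given a kernel's flop and byte counts and the machine's peak rates, the binding axis falls out directly. the peak numbers in the usage comments are invented for illustration.

```python
# roofline-style check: which of compute or movement binds first for a
# kernel with the given arithmetic intensity (flops per byte moved).
# peak_flops and peak_bytes_per_s are machine parameters; the values in
# the usage comments below are invented for illustration.
def binding_axis(flops: float, bytes_moved: float,
                 peak_flops: float, peak_bytes_per_s: float) -> str:
    compute_time = flops / peak_flops
    movement_time = bytes_moved / peak_bytes_per_s
    return "compute" if compute_time >= movement_time else "movement"

# a high-intensity kernel vs a streaming kernel on a hypothetical
# machine with a 10 flop/byte balance point:
# binding_axis(1e12, 1e9, 1e12, 1e11) -> "compute"  (1000 flop/byte)
# binding_axis(1e9, 1e9, 1e12, 1e11)  -> "movement" (1 flop/byte)
```

the same kernel can be compute-optimal yet movement-pathological, or vice versa, which is why a claim of optimality needs this qualification.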
claims of optimality are only meaningful when qualified by the dominant constraint.

## resource expansion rule

hardware scaling is effective if and only if:

- the binding constraint corresponds to a physical resource limit
- the workload projects strongly onto that axis
- algorithmic structure already aligns with that constraint

otherwise, hardware scaling amplifies inefficiency already present in the mapping.

## guiding principles

### identify the binding constraint

determine which axis saturates or dominates delay. optimization without this identification is unguided.

### improve the most pressured facet

reduce demand on the binding axis before addressing others. improvements to non-binding axes do not change first-order behavior.

### trade across axes deliberately

effective optimizations exchange pressure across axes:

- compute for locality
- memory for recomputation
- latency for concurrency
- bandwidth for capacity

all such trades must preserve semantics and invariants.

### prefer structural over parametric changes

structural changes:

- algorithm selection
- data layout
- access pattern reshaping

parametric changes:

- unrolling
- prefetch hints
- compiler flags

structural changes alter which invariant binds. parametric changes only approach an existing bound.
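a minimal sketch of this distinction, using membership queries as a stand-in workload (the function names and the 4-way unroll factor are illustrative choices, not prescriptions):

```python
# contrast (illustrative): a parametric change tweaks the same scan,
# while a structural change replaces it with a different data structure.
def member_scan(xs, x):
    # baseline: linear scan, O(n) per query
    for v in xs:
        if v == x:
            return True
    return False

def member_scan_unrolled(xs, x):
    # parametric change: manual 4-way unroll of the same scan. still
    # O(n) per query; it only approaches the scan's existing bound.
    n = len(xs)
    i = 0
    while i + 4 <= n:
        if xs[i] == x or xs[i+1] == x or xs[i+2] == x or xs[i+3] == x:
            return True
        i += 4
    return any(v == x for v in xs[i:])

def make_member_hashed(xs):
    # structural change: a hash set makes each query O(1) expected,
    # trading capacity (the extra table) for latency. which invariant
    # binds has changed, not just how closely the old bound is met.
    s = set(xs)
    return lambda x: x in s
```

all three are semantically equivalent on these queries; only the structural variant moves the query cost off the length-n dependency chain.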