# software performance model `(compute, capacity, movement, latency)`

# compute

## definition

rate at which executable operations are retired, assuming ideal data availability and resolved dependencies.

## elements

instruction throughput

- scalar issue rate
- vector issue rate
- instruction mix sensitivity

execution width

- simd lane count
- superscalar width

pipeline structure

- depth
- hazard resolution mechanisms

specialized units

- floating point units
- tensor or matrix units
- cryptographic accelerators

frequency behavior

- nominal clock rate
- turbo and thermal throttling

## failure signatures

- low ipc despite available work
- serialized dependency chains
- unused vector capacity

## mitigation space

- vectorization
- instruction scheduling
- kernel fusion
- precision reduction

# capacity

## definition

maximum live state that can be retained without eviction, spill, or recomputation. capacity is tiered; each tier introduces a discrete eviction boundary.

## elements

registers

- architectural registers
- physical register file

caches

- effective l1, l2, l3 capacity
- associativity constraints

main memory

- total ram
- numa-local vs remote memory

persistent working set

- page cache residency
- mmap footprint

## failure signatures

- cache thrashing
- tlb misses
- page faults
- numa amplification

## mitigation space

- working set reduction
- blocking and tiling
- data structure compaction
- explicit locality management

# movement

## definition

maximum rate at which state can cross a boundary between capacity domains. movement is layered and directional; each boundary has an independent ceiling.
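the boundary-ceiling framing can be probed directly. the sketch below (python, illustrative only) times a streaming copy whose working set is sized to exceed typical last-level caches, so the observed rate approximates one specific boundary's ceiling rather than a cache-resident rate. the function name and the 64 MiB default are assumptions for illustration, not part of the model.

```python
# a minimal sketch: estimate one movement ceiling by timing a large
# streaming copy. the 64 MiB working set is an invented default chosen
# to exceed typical last-level caches, so the measured rate approximates
# a dram-side boundary rather than a cache boundary.
import time

def copy_bandwidth_bytes_per_s(n_bytes: int = 64 * 1024 * 1024) -> float:
    src = bytearray(n_bytes)
    t0 = time.perf_counter()
    dst = bytes(src)          # one pass reading src, one pass writing dst
    t1 = time.perf_counter()
    # count bytes read plus bytes written crossing the boundary
    return (2 * len(dst)) / (t1 - t0)

if __name__ == "__main__":
    print(f"{copy_bandwidth_bytes_per_s() / 1e9:.2f} GB/s")
```

the measured number is machine- and interpreter-dependent; the point is only that each boundary admits this kind of direct ceiling measurement.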
## elements

intra-core transfer

- register to cache fill rate

cache hierarchy transfer

- cache line bandwidth
- writeback throughput

memory subsystem

- dram bandwidth
- memory controller saturation

interconnect

- numa fabric
- pcie links
- accelerator interconnects

storage io

- sequential throughput
- dma transfer rates

## failure signatures

- sustained bandwidth saturation
- queue buildup
- false sharing
- write amplification

## mitigation space

- streaming access patterns
- structure of arrays
- prefetching
- batching

# latency

## definition

minimum time required to resolve a dependency chain. latency bounds critical paths and tail behavior independently of steady-state throughput.

## elements

memory access latency

- cache miss penalties
- dram access latency

control flow latency

- branch misprediction penalties

synchronization latency

- lock acquisition delay
- atomic contention

io latency

- syscall overhead
- device round-trip time

remote access latency

- numa hop delay
- network rtt

## failure signatures

- pointer chasing stalls
- fine-grained synchronization collapse
- long serial chains
- high tail latency

## mitigation space

- latency hiding via concurrency
- speculation
- asynchronous io
- data layout transformation

# workload classes by dominant constraint

this section classifies common low-level workloads by the tuple element whose invariant is violated first. the intent is diagnostic compression rather than exhaustive taxonomy.
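a toy version of this diagnostic compression can be written down directly, assuming four coarse measurements are available. the measurement names, the check order, and every threshold below are invented placeholders, not derived from any hardware; a real classifier would calibrate them per machine and per tier.

```python
# toy diagnostic: map a few coarse measurements onto the tuple element
# most likely binding. thresholds and check order are invented
# placeholders for illustration, not calibrated values.
def dominant_constraint(ipc: float,
                        bandwidth_utilization: float,
                        working_set_ratio: float,
                        stall_fraction: float) -> str:
    """working_set_ratio = working set / capacity of the relevant tier;
    stall_fraction = cycles stalled on unresolved dependencies / total."""
    if bandwidth_utilization > 0.8:
        return "movement"
    if working_set_ratio > 1.0:
        return "capacity"
    if stall_fraction > 0.5:
        return "latency"
    if ipc > 1.0:
        return "compute"
    return "mixed"
```

the "mixed" fallthrough mirrors the last class below: when no single signal dominates, no single invariant does either.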
## compute-dominant workloads

typical cases

- dense linear algebra
- cryptographic primitives
- compression and decompression kernels

characteristic structure

- high arithmetic intensity
- small or cache-resident working sets
- regular access patterns

## movement-dominant workloads

typical cases

- memory copy and set
- streaming reductions
- column or row scans

characteristic structure

- low reuse
- predictable linear access
- throughput scales with bytes per second

## capacity-dominant workloads

typical cases

- graph traversal
- large hash tables
- in-memory indices

characteristic structure

- working set exceeds a cache or memory tier
- reuse exists but is defeated by layout or scale
- eviction rate dominates execution time

## latency-dominant workloads

typical cases

- pointer-heavy data structures
- lock-dominated critical sections
- fine-grained rpc paths

characteristic structure

- long dependency chains
- irregular access
- limited exploitable parallelism

## mixed-constraint workloads

typical cases

- stencil computations
- image processing pipelines
- database execution engines

characteristic structure

- no single invariant dominates globally
- optimization shifts pressure between tuple elements
- performance tuning is path dependent

# performance objective

performance optimization is constraint management under fixed semantics. the objective is to delay violation of the dominant invariant governing execution. time, throughput, and latency are observable projections of which constraint binds first.

## framing

given:

- fixed program semantics
- a workload distribution
- bounded resources

optimization seeks to:

- identify the binding axis in the performance tuple
- reshape execution so pressure shifts to a cheaper or more scalable axis
- or expand only the binding resource

## axis-relative optimality

optimality is defined relative to a binding invariant. an implementation may be optimal along one axis while pathological along another.
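a roofline-style comparison makes this concrete for the compute and movement axes: given a kernel's flop and byte counts and the machine's peak rates, the binding axis falls out directly. the peak numbers in the usage comments are invented for illustration.

```python
# roofline-style check: which of compute or movement binds first for a
# kernel with the given arithmetic intensity (flops per byte moved).
# peak_flops and peak_bytes_per_s are machine parameters; the values in
# the usage comments below are invented for illustration.
def binding_axis(flops: float, bytes_moved: float,
                 peak_flops: float, peak_bytes_per_s: float) -> str:
    compute_time = flops / peak_flops
    movement_time = bytes_moved / peak_bytes_per_s
    return "compute" if compute_time >= movement_time else "movement"

# a high-intensity kernel vs a streaming kernel on a hypothetical
# machine with a 10 flop/byte balance point:
# binding_axis(1e12, 1e9, 1e12, 1e11) -> "compute"  (1000 flop/byte)
# binding_axis(1e9, 1e9, 1e12, 1e11)  -> "movement" (1 flop/byte)
```

the same kernel can be compute-optimal yet movement-pathological, or vice versa, which is why a claim of optimality needs this qualification.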
claims of optimality are only meaningful when qualified by the dominant constraint.

## resource expansion rule

hardware scaling is effective if and only if:

- the binding constraint corresponds to a physical resource limit
- the workload projects strongly onto that axis
- algorithmic structure already aligns with that constraint

otherwise, hardware scaling amplifies inefficiency already present in the mapping.

## guiding principles

### identify the binding constraint

determine which axis saturates or dominates delay. optimization without this identification is unguided.

### improve the most pressured facet

reduce demand on the binding axis before addressing others. improvements to non-binding axes do not change first-order behavior.

### trade across axes deliberately

effective optimizations exchange pressure across axes:

- compute for locality
- memory for recomputation
- latency for concurrency
- bandwidth for capacity

all such trades must preserve semantics and invariants.

### prefer structural over parametric changes

structural changes:

- algorithm selection
- data layout
- access pattern reshaping

parametric changes:

- unrolling
- prefetch hints
- compiler flags

structural changes alter which invariant binds. parametric changes only approach an existing bound.
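a minimal sketch of this distinction, using membership queries as a stand-in workload (the function names and the 4-way unroll factor are illustrative choices, not prescriptions):

```python
# contrast (illustrative): a parametric change tweaks the same scan,
# while a structural change replaces it with a different data structure.
def member_scan(xs, x):
    # baseline: linear scan, O(n) per query
    for v in xs:
        if v == x:
            return True
    return False

def member_scan_unrolled(xs, x):
    # parametric change: manual 4-way unroll of the same scan. still
    # O(n) per query; it only approaches the scan's existing bound.
    n = len(xs)
    i = 0
    while i + 4 <= n:
        if xs[i] == x or xs[i+1] == x or xs[i+2] == x or xs[i+3] == x:
            return True
        i += 4
    return any(v == x for v in xs[i:])

def make_member_hashed(xs):
    # structural change: a hash set makes each query O(1) expected,
    # trading capacity (the extra table) for latency. which invariant
    # binds has changed, not just how closely the old bound is met.
    s = set(xs)
    return lambda x: x in s
```

all three are semantically equivalent on these queries; only the structural variant moves the query cost off the length-n dependency chain.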