foundational physical and mathematical assumptions
sound is a mechanical event: kinetic energy carried by longitudinal waves through a medium via compression and rarefaction. for humans, the medium is usually air. humans perceive sound as changes in air pressure, detected through the two ears, which also enables partial spatial localization. audible frequencies range from approximately 20 hz to 20000 hz. these vibrations are regular to varying degrees and can be characterized by frequency.
a single-channel signal of amplitude samples over time encodes the net displacement caused by all contributing vibrations. it represents a directionless waveform and can be directly applied to a membrane or transducer to reproduce sound.
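a minimal sketch of this idea, assuming a sample rate of 44100 hz; the names sine_wave and mono are illustrative, not from the text. two contributing vibrations are generated separately and summed per sample, yielding the single directionless waveform described above:

```python
import math

SAMPLE_RATE = 44100  # assumed sample rate in hz

def sine_wave(freq, duration, amplitude=1.0):
    """generate one contributing vibration as a list of amplitude samples."""
    n = int(SAMPLE_RATE * duration)
    return [amplitude * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)
            for i in range(n)]

# two contributing vibrations
a = sine_wave(440.0, 0.01, 0.5)
b = sine_wave(660.0, 0.01, 0.3)

# the single-channel signal is the net displacement: a per-sample sum
mono = [x + y for x, y in zip(a, b)]
```

the resulting array could be written to any transducer as-is; nothing in it records which vibration contributed what.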
in this representation, the frequency information of all contributing vibrations is mixed into one signal and is practically impossible to fully separate again for individual editing; only post-processing of the combined signal is possible.
imagine an abstract roll, similar to a piano roll, where width maps to frequency and time unfolds along its circumference. at each frequency, amplitude is encoded as the depth of grooves over time, forming continuous envelopes around the cylinder. this represents a decomposition of a time-domain signal into a structured frequency-domain form.
this is the representation suited to reasoning about additive synthesis. taking into account the interactions that occur when this information is converted back into a signal, frequency and volume allow the parameterized recreation of any possible sound.
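a minimal additive-synthesis sketch of the roll metaphor, assuming a sample rate of 44100 hz; the function name additive and the envelope shapes are illustrative. each partial is a (frequency, envelope) pair, where the envelope maps time to amplitude — the groove depth at that frequency:

```python
import math

SAMPLE_RATE = 44100  # assumed sample rate in hz

def additive(partials, duration):
    """render a sound from (frequency, envelope) pairs. each envelope
    maps time in seconds to an amplitude, i.e. the groove depth at
    that frequency position on the roll."""
    n = int(SAMPLE_RATE * duration)
    out = [0.0] * n
    for freq, env in partials:
        step = 2 * math.pi * freq / SAMPLE_RATE
        for i in range(n):
            out[i] += env(i / SAMPLE_RATE) * math.sin(step * i)
    return out

# a decaying fundamental plus a quieter, faster-decaying overtone
sound = additive([
    (220.0, lambda t: 0.6 * math.exp(-3.0 * t)),
    (440.0, lambda t: 0.3 * math.exp(-6.0 * t)),
], duration=0.05)
```

any sound reachable by this parameterization is a choice of partial frequencies plus one envelope per partial.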
sound always begins with the excitation of matter. for example, air forced through a narrow opening can oscillate at high frequency. traditional instruments enable human-driven excitation, shaped by parameters like striking position or intensity.
in a digital system, sound is typically produced as arrays of amplitude samples over time - one array per output channel (e.g. loudspeaker).
sample values are ideally real-valued (continuous in amplitude), while time in a sampled system is inherently discrete. it is therefore natural to represent time as integer sample indices.
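a small sketch of this split, assuming a sample rate of 48000 hz; the helper names are illustrative. time is quantized to integer indices while amplitude stays a real-valued float:

```python
SAMPLE_RATE = 48000  # assumed sample rate in hz

def time_to_index(t):
    """map continuous time in seconds to the nearest discrete sample index."""
    return round(t * SAMPLE_RATE)

def index_to_time(n):
    """map a discrete sample index back to its time in seconds."""
    return n / SAMPLE_RATE

# one second of sound occupies exactly SAMPLE_RATE discrete indices,
# while each sample value itself remains a real-valued float
one_second = time_to_index(1.0)
midpoint = index_to_time(24000)
```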
to enable systematic computation and reasoning, sound is generated by composing and summing the results of discrete events and event groups into the output arrays.
channels, samples, event, start, end, prepare, generate, channel-delay
channel-delay is added here because it is a fundamental effect. consider, for example, the distance between the ears.
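a sketch of channel-delay as a whole-sample offset, assuming a sample rate of 44100 hz, a speed of sound of 343 m/s, and an ear distance of roughly 0.21 m; the helper names are illustrative. the extra path length to the farther ear becomes a per-channel delay in samples:

```python
SAMPLE_RATE = 44100       # assumed sample rate in hz
SPEED_OF_SOUND = 343.0    # m/s in air
EAR_DISTANCE = 0.21       # m, approximate interaural distance

def delay_in_samples(extra_path_m):
    """convert an extra travel distance into a whole-sample channel delay."""
    return round(extra_path_m / SPEED_OF_SOUND * SAMPLE_RATE)

def apply_channel_delay(samples, delay):
    """prepend silence so the channel starts `delay` samples later."""
    return [0.0] * delay + list(samples)

itd = delay_in_samples(EAR_DISTANCE)  # maximal interaural delay in samples
left = [1.0, 0.5, 0.25]
right = apply_channel_delay(left, itd)
```

with these assumed constants the maximal interaural delay comes out to 27 samples, about 0.6 ms.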
nested groups of such events allow for the computation and definition of sounds and complete song structures with arbitrary complexity.
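a minimal sketch of such nesting; the Event and Group classes and the mix_into helper are illustrative, not from the text. an event sums its samples into a channel's output array at its start offset, and a group shifts all of its children (events or further groups) by its own start:

```python
def mix_into(out, samples, start):
    """sum an event's samples into one channel's output array at `start`."""
    for i, s in enumerate(samples):
        out[start + i] += s

class Event:
    def __init__(self, start, samples):
        self.start, self.samples = start, samples
    def render(self, out, offset=0):
        mix_into(out, self.samples, offset + self.start)

class Group:
    """a group shifts its children by its own start; groups may nest,
    so a whole song is just the outermost group."""
    def __init__(self, start, children):
        self.start, self.children = start, children
    def render(self, out, offset=0):
        for child in self.children:
            child.render(out, offset + self.start)

out = [0.0] * 8
song = Group(1, [Event(0, [1.0, 1.0]), Group(2, [Event(1, [0.5])])])
song.render(out)
# the first event lands at indices 1..2; the nested event at 1 + 2 + 1 = 4
```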