2022-11-25

sound analysis and resynthesis

subtopics

frequency resolution enhancement of short duration fast fourier transforms using long duration transforms

sines, transients and noise model

extracts parameters for the sinusoidal parts, transient parts and residual noise of a sound. the parts can be resynthesised from the parameters to recreate a similar sound, optionally with transformations

example analysis process

windowed fft with window-function related hop-size/overlap over the whole input. for example, a blackman window with an overlap of 66.1% of the frame size. with 50% overlap, frames are spaced like 0..100 50..150 100..200 and so on. the first frame starts hop-size samples before the input and is zero-padded. results are hop-size samples apart
for sines: get frequency, magnitude and phase from a number of peak magnitude fft bins
for transients: like for sines but from smaller fft and with more emphasis on exact phase reconstruction because phase matters more in short loud segments. transients detected by change between fft results
remove previously used bins from original fft bins to get residual. average neighboring frequency bins to reduce number of bands for performance
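
the framing and peak steps above can be sketched as follows. this is a minimal sketch and not a definitive implementation: it uses a naive dft in place of an fft so that it needs no libraries, the names frames, peaks and dft are hypothetical, and dividing by the window sum to get sine amplitudes is an assumption added here.

```python
import cmath
import math

def blackman(n):
    # classic blackman window coefficients
    return [0.42 - 0.5 * math.cos(2 * math.pi * i / (n - 1))
            + 0.08 * math.cos(4 * math.pi * i / (n - 1)) for i in range(n)]

def dft(frame):
    # naive dft for real input, returns bins 0..n/2 (stand-in for an fft)
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def frames(samples, frame_size, hop_size):
    # the first frame starts hop_size samples before the input, zero-padded
    padded = [0.0] * hop_size + list(samples) + [0.0] * frame_size
    for start in range(0, len(samples) + hop_size, hop_size):
        yield padded[start:start + frame_size]

def peaks(frame, sample_rate, count):
    # returns (frequency, amplitude, phase) for the largest local maxima
    n = len(frame)
    window = blackman(n)
    bins = dft([s * w for s, w in zip(frame, window)])
    mags = [abs(b) for b in bins]
    candidates = [k for k in range(1, len(bins) - 1)
                  if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]
    candidates.sort(key=lambda k: mags[k], reverse=True)
    gain = sum(window)  # assumption: normalise by the window sum
    return [(k * sample_rate / n, 2 * mags[k] / gain, cmath.phase(bins[k]))
            for k in candidates[:count]]
```

calling peaks on each result of frames gives one parameter set per hop. frequencies are only as exact as the bin spacing, which is where zero padding or peak interpolation would come in.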

some variations

ffts of different durations, zero padded to be of same length, then combined weighted by change between frames
zero padding before fft to select more exact peaks easily
splitting the input signal into frequency bands using a filter bank made of bandpass filters for example, and then using fft sizes related to the desired frequency resolution for the bands
do not average fft bands for residual noise but use fft result directly
impose the original envelope on resynthesised sounds
transient handling
- transient detection on time/magnitude data with center of mass calculation of frames or linear prediction and threshold on predicted continuation
- detecting transients in the time/magnitude data and using smaller fft sizes to increase time resolution for the detected transient duration
- analysis using dct and synthesis using inverse dct/fft
- copy from the original sound
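
one possible reading of the center of mass variant above, as a rough sketch: for each frame, compute where along the frame the energy is concentrated, and flag frames whose energy sits mostly near the end, which suggests an onset inside the frame. the function names, the 0.7 threshold and the use of squared samples as energy are assumptions made for illustration.

```python
def energy_center(frame):
    # position of the energy center of mass within the frame, 0..1
    energy = [s * s for s in frame]
    total = sum(energy)
    if total == 0.0:
        return 0.5  # silent frame: treat as centered
    weighted = sum(i * e for i, e in enumerate(energy))
    return weighted / (total * (len(frame) - 1))

def detect_transients(samples, frame_size, hop_size, threshold=0.7):
    # returns start offsets of frames whose energy is concentrated late
    starts = []
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frame = samples[start:start + frame_size]
        if energy_center(frame) > threshold:
            starts.append(start)
    return starts
```

a steady signal gives a center near 0.5 for every frame, so only frames where the level rises sharply are reported.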

example synthesis process

sine oscillators to recreate frequency/amplitude/phase of extracted peaks, parameters interpolated for example as catmull-rom splines. similar handling for transients
noise resynthesised by band-pass filtering white noise
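
a minimal sketch of one sine oscillator whose per-frame frequency and amplitude are interpolated to per-sample values with uniform catmull-rom splines. the function names are hypothetical, end points are repeated to have four control points everywhere, and the extracted phase is only used as the start phase here, which is a simplification.

```python
import math

def catmull_rom(p0, p1, p2, p3, t):
    # uniform catmull-rom spline between p1 and p2, t in 0..1
    return 0.5 * ((2 * p1) + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t * t
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t * t * t)

def interpolate(values, hop_size):
    # one value per frame in, hop_size values per frame interval out
    padded = [values[0]] + list(values) + [values[-1]]
    out = []
    for i in range(len(values) - 1):
        p0, p1, p2, p3 = padded[i:i + 4]
        for s in range(hop_size):
            out.append(catmull_rom(p0, p1, p2, p3, s / hop_size))
    return out

def sine_partial(frequencies, amplitudes, start_phase, hop_size, sample_rate):
    # accumulate phase from the interpolated frequency of each sample
    phase = start_phase
    out = []
    for f, a in zip(interpolate(frequencies, hop_size),
                    interpolate(amplitudes, hop_size)):
        out.append(a * math.sin(phase))
        phase += 2 * math.pi * f / sample_rate
    return out
```

summing the output of one sine_partial call per tracked peak gives the sinusoidal part of the resynthesised sound.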

downsides

loss of information is inherent in reducing the fft results to a limited set of parameters and in the parameter interpolation of resynthesis

possible improvements

when humans analyse and try to reproduce sounds, possibly with deliberate transformations, an fft-like process seems to take place but additionally the desired output and input are matched by comparison and adjustments until a satisfactory result is created

discrete fourier transform

the fast fourier transform, fft, is an algorithm that computes the discrete fourier transform of a sequence
returns frequency/magnitude/phase data for a portion of time/magnitude data
there are variations for calculating fft on real or complex values and returning either. complex values include phase information, which is important for the reconstruction of a signal
input: complex or real sample values
output
- complex or real values
- for real input: output_length = (1 + (input_length / 2))
- raw magnitudes are proportional to input_length and are normalised by scaling with (1 / input_length)
- the first result value is called dc and is, after this scaling, the average of the input values
- output values are also called bins like for a histogram
- each bin contains a value for a measured frequency range. the bins are evenly spaced, have no overlap and cover the whole spectrum
- max_frequency = sample_rate / 2
- bins_without_dc_count = input_length / 2
- bin_frequency_width = max_frequency / bins_without_dc_count
- example of how to calculate fft bin width
  - max_frequency = 44100 / 2 = 22050
  - bins_without_dc_count = (input_length / 2) = 1024 / 2 = 512
  - bin_frequency_width = 22050 / 512 = 43.07 hz
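- amplitude = complex_magnitude(fft_bin_value) / input_length. for real input, where only half of the spectrum is returned, the amplitude of a sine is twice this value for all bins except dc and the bin at max_frequency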
- phase = complex_angle(fft_bin_value)
- frequency = bin_index * (max_frequency / bins_without_dc_count)
- without complex number functions: magnitude = sqrt(real * real + imaginary * imaginary); angle = atan2(imaginary, real)
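
the formulas in the list above can be tried out with a small sketch. it uses a naive dft instead of an fft so that it runs without libraries. the function names are hypothetical, and doubling the amplitude for bins other than dc and max_frequency (one-sided spectrum of real input) is an assumption added here.

```python
import cmath
import math

def dft_real(samples):
    # naive dft of real input, returns 1 + input_length / 2 complex bins
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def bin_frequency_width(sample_rate, input_length):
    max_frequency = sample_rate / 2
    bins_without_dc_count = input_length / 2
    return max_frequency / bins_without_dc_count

def describe_bins(samples, sample_rate):
    # returns (frequency, amplitude, phase) per bin
    n = len(samples)
    width = bin_frequency_width(sample_rate, n)
    result = []
    for k, value in enumerate(dft_real(samples)):
        # dc and the bin at max_frequency have no mirrored counterpart
        scale = 1 if k == 0 or 2 * k == n else 2
        amplitude = scale * math.sqrt(value.real ** 2 + value.imag ** 2) / n
        phase = math.atan2(value.imag, value.real)
        result.append((k * width, amplitude, phase))
    return result

print(round(bin_frequency_width(44100, 1024), 2))  # 43.07
```

for a test signal made of a constant offset plus one cosine that falls exactly on a bin, the dc bin holds the offset and the matching bin holds the cosine amplitude with a phase of zero.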
short input leads to a low frequency resolution and long input leads to high frequency resolution. short input has better time resolution and long input has worse time resolution
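the inverse fft, ifft, takes complex numbers like the fft returns and recreates a signal with those frequencies and phases and the same length as the sampled block. applied to unmodified fft output it reproduces the block up to rounding error. resynthesis from reduced or modified data, for example from the magnitudes of a few peak bins, tends to be an approximation or average of the original signal because when exactly the frequencies occurred in the analysed input signal portion is not known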
windowing is applying a window function to input samples to reduce the magnitude to zero towards the edges of the input frame to remove edge discontinuities that would otherwise be included in the calculation as the signal abruptly falling off to zero
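
a small sketch of the effect of windowing, assuming a blackman window and a naive dft as a stand-in for an fft. a sine that does not complete a whole number of cycles in the frame is analysed with and without the window, and the magnitudes in bins far away from the sine are summed as a rough measure of the edge discontinuity. all names are hypothetical.

```python
import cmath
import math

def blackman_window(n):
    return [0.42 - 0.5 * math.cos(2 * math.pi * i / (n - 1))
            + 0.08 * math.cos(4 * math.pi * i / (n - 1)) for i in range(n)]

def apply_window(frame):
    # scale the frame towards zero at both edges
    return [s * w for s, w in zip(frame, blackman_window(len(frame)))]

def magnitudes(frame):
    # naive dft magnitudes for bins 0..n/2, scaled by 1 / n
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) / n
            for k in range(n // 2 + 1)]

# 4.5 cycles in 64 samples: the frame edges do not line up
frame = [math.sin(2 * math.pi * 4.5 * t / 64) for t in range(64)]
leak_plain = sum(magnitudes(frame)[12:])
leak_windowed = sum(magnitudes(apply_window(frame))[12:])
print(leak_plain, leak_windowed)
```

the windowed sum is expected to be much smaller. the cost is a wider peak around the sine frequency and a lower overall magnitude.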
overlap
- the fft may be calculated with input data that overlaps to improve time resolution, especially when a windowing function reduces parts of the input on both sides. for example, using input frames of samples 0..100 50..150 100..200 and so on. hop size is the distance in samples between the starts of consecutive frames, and the overlap is the frame size minus the hop size. a sliding fft has hop size one
- the improvement in time resolution is limited because each fft still gives results for a block of data similar to an average, and because the frame length stays the same each frame contains an overlapped part at the beginning and new samples at the end. the differences of frequencies of two overlapping frames are not the definite frequencies in the overlap
- overlapping results can be resynthesised by fading out the volume of a previous block as the next block is faded in (overlap-add). each overlapped block is hop-size samples apart
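
a minimal overlap-add sketch. it assumes a periodic hann window with a hop size of half the frame size, a combination where the overlapping window halves add up to one, so the fade out and fade in cancel. the blocks are left unmodified here. in practice each block would be the ifft of a processed spectrum. the function names are hypothetical.

```python
import math

def hann_periodic(n):
    # periodic hann: w[i] + w[i + n / 2] == 1
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def split_windowed(samples, frame_size, hop_size):
    # overlapping frames, each faded in and out by the window
    window = hann_periodic(frame_size)
    return [[s * w for s, w in zip(samples[start:start + frame_size], window)]
            for start in range(0, len(samples) - frame_size + 1, hop_size)]

def overlap_add(blocks, hop_size):
    # place each block hop_size samples after the previous one and sum
    frame_size = len(blocks[0])
    out = [0.0] * (hop_size * (len(blocks) - 1) + frame_size)
    for index, block in enumerate(blocks):
        start = index * hop_size
        for i, value in enumerate(block):
            out[start + i] += value
    return out

samples = [math.sin(0.1 * t) for t in range(64)]
rebuilt = overlap_add(split_windowed(samples, 16, 8), 8)
```

apart from the first and last hop_size samples, which are covered by only one frame, rebuilt matches samples.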
zero padding is adding zeros to the input signal, usually before or after the input. this increases the number of fft result frequency bins in a way that corresponds to interpolation between fft bins. it is used, for example, when this interpolation is desired or to reach the input lengths required by fft implementations
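
a small sketch of the interpolation effect, using a naive dft as a stand-in for an fft and hypothetical function names. a 16 sample frame is padded with zeros at the end to 64 samples. every fourth bin of the padded result equals a bin of the unpadded result, and the bins in between are the interpolated values.

```python
import cmath
import math

def dft_real(samples):
    # naive dft of real input, bins 0..n/2
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def zero_pad(samples, length):
    # append zeros after the input up to the given length
    return list(samples) + [0.0] * (length - len(samples))

frame = [math.sin(2 * math.pi * 3.3 * t / 16) for t in range(16)]
plain = dft_real(frame)                 # 9 bins
padded = dft_real(zero_pad(frame, 64))  # 33 bins
print(len(plain), len(padded))
```

the padding adds no new information about the signal. it only samples the same underlying spectrum more densely, which makes peak positions easier to read off.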
fft can be seen as a filter bank
fft could be used on fft results

general

transients: short duration sounds, typically up to 50ms, that approach being purely impulsive. examples are onsets of sounds like percussive attacks or bow sounds from violins. they are not necessarily the loudest element in a sound. impulses produce spectra that look similar to sines
autocorrelation and cross-correlation calculate if two sets of samples tend to fall onto a line. can be used to find specific sounds among noise or test for white noise
envelope detection: for example, time-magnitude data smoothed by low pass filter. or all negative values multiplied by minus one to become positive then using this as the envelope
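
rough sketches of the correlation and envelope ideas above. the correlation is the pearson coefficient, which is one common way to measure whether two sets of samples tend to fall onto a line. the envelope follower rectifies the signal and smooths it with a one-pole low pass filter. the function names and the smoothing value are assumptions for illustration.

```python
import math

def correlation(a, b):
    # pearson coefficient: 1 or -1 if the value pairs fall onto a line
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    if spread_a == 0.0 or spread_b == 0.0:
        return 0.0  # a constant signal has no defined correlation
    return covariance / math.sqrt(spread_a * spread_b)

def envelope(samples, smoothing=0.99):
    # negative values made positive, then smoothed by a one-pole low pass
    level = 0.0
    out = []
    for s in samples:
        level = smoothing * level + (1.0 - smoothing) * abs(s)
        out.append(level)
    return out
```

correlating a signal with a delayed copy of itself for a range of delays gives an autocorrelation. a smoothing value closer to one gives a slower, smoother envelope.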

noise

noise represents a sum of uncorrelated events. it can be analysed with statistics on frequency band envelopes

examples of properties that could be calculated for analysis:

arithmetic mean: the average value
variance: variance is the expectation of the squared deviation of a random variable from its mean. informally, it measures how far a set of (random) numbers are spread out from their average value
kurtosis: measures how sharply peaked a probability distribution is, relative to its width. the kurtosis is normalized to zero for a gaussian distribution
skewness: measures the asymmetry of the tails of a probability distribution
correlations between bands - how similar are the envelopes between different frequency bands
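
the first four properties can be computed from the envelope values of one frequency band as in this sketch. it uses the population formulas (division by the number of values) and the kurtosis normalised to zero for a gaussian distribution. the function name is hypothetical.

```python
def band_statistics(values):
    # returns (mean, variance, skewness, kurtosis) of a list of values
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    variance = sum(d * d for d in deviations) / n
    if variance == 0.0:
        return mean, 0.0, 0.0, 0.0  # constant input: no shape to measure
    skewness = (sum(d ** 3 for d in deviations) / n) / variance ** 1.5
    kurtosis = (sum(d ** 4 for d in deviations) / n) / variance ** 2 - 3.0
    return mean, variance, skewness, kurtosis

print(band_statistics([1, 2, 3, 4, 5]))
```

evenly spread values like in the example are symmetric, so the skewness is zero, and they are flatter than a gaussian distribution, so the kurtosis is negative.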