This project provides two binaries: a command-line application and a graphical application. The command-line application accepts options that control the analysis and the synthesis (for example, -s 1024 makes it use a window size of 1024 samples). The graphical application lets the user modify the parameters interactively and also displays representations of the signals involved.
The main difference is that the graphical application loads the entire signal (and some derived data) into memory in order to process it, so it is limited to signals that fit in the available main memory, whereas the command-line application loads only one frame at a time, which allows it to process signals of arbitrary size. This document refers mostly to the graphical application, but the command-line one is analogous.
Typically, processing runs faster than real time, depending on the processing power available.
The process consists of two parameterized phases: the analysis and the synthesis. Starting from the samples of the sound signal, the analysis (controlled by the analysis parameters) produces an intermediate representation: a list of partial trajectories that vary in amplitude and frequency. The synthesis (controlled by the synthesis parameters) then produces another sampled sound signal, which is more or less similar to the original depending on the parameters used.
The analysis can be broken into two parts: the STFT (Short-Time Fourier Transform) analysis and the sinusoidal analysis. The first computes the spectrum of the signal as it varies in time and frequency. The second extracts the most intense discrete frequencies present in the signal.
In this stage, sound analysis is done. That is, starting with a series of samples, we obtain a representation of the most important frequencies. It is parameterized both in the STFT phase (to obtain the spectrum) and in the sinusoidal analysis phase (to detect and link the maxima).
First, the sound signal is loaded into main memory: once the file is selected, it is read with the help of the sndfile library, and the samples are converted to 32-bit floating point.
Once the signal is in memory, an algorithm goes through it, extracting windows of the specified size, separated by the specified offset. For example, windows of 2048 samples, each starting 1024 samples after the previous one, give an overlap of 50%. This is how the STFT is done.
Each window is multiplied, sample by sample, by the selected window function (rectangular, Hamming, among others), which yields (depending on the window function) a signal with low amplitude near its edges.
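The windowing arithmetic above can be sketched as follows. The function names are hypothetical, and the sketch assumes that a trailing partial window is simply ignored; the document does not say how the program treats it.

```cpp
#include <cstddef>

// Number of complete windows that fit in a signal.
// Assumption: a trailing partial window is dropped (the real program may
// pad it instead).
std::size_t frame_count(std::size_t total_samples, std::size_t window_size,
                        std::size_t advance) {
    if (window_size == 0 || advance == 0 || total_samples < window_size) return 0;
    return 1 + (total_samples - window_size) / advance;
}

// Fraction of each window that is shared with the next one.
double overlap_fraction(std::size_t window_size, std::size_t advance) {
    if (window_size == 0 || advance >= window_size) return 0.0;
    return 1.0 - static_cast<double>(advance) / static_cast<double>(window_size);
}
```

With a window size of 2048 and an advance of 1024, overlap_fraction returns 0.5, which is the 50% of the example.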
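As an illustration of this step, here is a minimal sketch that applies a Hamming window in place. It uses the common textbook definition with coefficients 0.54 and 0.46; the function name is hypothetical and the exact coefficients used by the program are an assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Multiply a frame in place by a Hamming window:
// w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1)).
void apply_hamming(std::vector<float>& frame) {
    const std::size_t n = frame.size();
    if (n < 2) return;  // nothing sensible to do for 0 or 1 samples
    const double two_pi = 6.283185307179586;
    for (std::size_t i = 0; i < n; ++i) {
        const double w = 0.54 - 0.46 * std::cos(two_pi * i / (n - 1));
        frame[i] = static_cast<float>(frame[i] * w);
    }
}
```

The first and last samples are scaled by 0.08 and the centre by 1.0, which gives the low amplitude at the edges described above.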
Next, an FFT is computed for each window using the FFTW library. If the FFT input has size N, the output also consists of N complex numbers. The modulus is then calculated for each frequency from 0 to the Nyquist frequency, yielding N/2+1 real numbers, which are the amplitudes.
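To make the N/2+1 count concrete, the sketch below computes the moduli with a naive O(N^2) DFT. This is a stand-in for illustration only: the program uses FFTW for the transform, and both the function name and the lack of amplitude normalization here are assumptions.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Moduli of bins 0..N/2 of a real frame, computed with a naive DFT.
// Not normalized: a full-scale cosine that falls exactly on a bin
// gives a modulus of N/2.
std::vector<double> magnitude_spectrum(const std::vector<double>& frame) {
    const std::size_t n = frame.size();
    std::vector<double> mags;
    if (n == 0) return mags;
    const double two_pi = 6.283185307179586;
    for (std::size_t k = 0; k <= n / 2; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle =
                -two_pi * static_cast<double>(k * t) / static_cast<double>(n);
            sum += frame[t] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        mags.push_back(std::abs(sum));  // modulus of the complex bin
    }
    return mags;
}
```

For an input of 8 samples the result has 5 values, one for each bin from DC to the Nyquist frequency.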
Once we have the frequency data, we find the local amplitude maxima of each frame. This gives a list of the frequencies whose amplitude is greater than that of their neighbours. From those we discard the ones whose amplitude is smaller than a threshold, by default -60 dB (approx. 0.001). We also discard a configurable number of frequencies at the extremes (those near the DC and Nyquist frequencies). Finally, the maxima are sorted and the weakest ones are discarded until fewer than 60 maxima remain (by default). This process of obtaining the maxima is repeated for each frame or window of the original signal.
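The selection of maxima can be sketched like this. The names Peak and pick_peaks are hypothetical, the threshold is given as a linear amplitude, and the sketch keeps at most max_peaks maxima, which is an assumption about how the limit is applied.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Peak {
    std::size_t bin;   // frequency index
    double amplitude;  // linear amplitude
};

// Pick the local maxima of one frame's amplitude spectrum.
// threshold: minimum linear amplitude (0.001 is about -60 dB).
// margin:    number of bins ignored at each end (near DC and Nyquist).
// max_peaks: the weakest maxima are dropped until at most this many remain.
std::vector<Peak> pick_peaks(const std::vector<double>& mags, double threshold,
                             std::size_t margin, std::size_t max_peaks) {
    std::vector<Peak> peaks;
    const std::size_t lo = std::max<std::size_t>(margin, 1);
    for (std::size_t k = lo; k + 1 < mags.size() && k + margin < mags.size(); ++k) {
        if (mags[k] > mags[k - 1] && mags[k] > mags[k + 1] && mags[k] >= threshold)
            peaks.push_back({k, mags[k]});
    }
    // Strongest first, then cut off the weakest ones.
    std::sort(peaks.begin(), peaks.end(),
              [](const Peak& a, const Peak& b) { return a.amplitude > b.amplitude; });
    if (peaks.size() > max_peaks) peaks.resize(max_peaks);
    // Restore ascending frequency order for the linking step.
    std::sort(peaks.begin(), peaks.end(),
              [](const Peak& a, const Peak& b) { return a.bin < b.bin; });
    return peaks;
}
```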
We now have the interesting frequencies for each frame. Each frequency represents a sinusoid and will control an oscillator. To avoid abrupt changes in the oscillator output as frequency and amplitude change from one frame to the next, we try to link (join) the maxima of one frame with the maxima of the next frame. This allows smooth transitions in the amplitude and frequency of the sinusoids built from the maxima data, producing lines (oscillators, sinusoids) that last without interruption, with only smooth changes, for one or more frames. Not every maximum can be continued in the next frame, though. Sinusoids (frequencies) that last fewer than a configurable number of windows (frames) are discarded. Optionally, we can refuse to link two frequencies if the difference between the one in the current frame and the one in the next frame is more than a configurable number of indexes.
The algorithm for linking, that is, for obtaining the transitions, is as follows. First, for each frequency of the current frame, we find the lowest frequency of the next frame that is greater than or equal to it, and also the highest frequency of the next frame that is less than or equal to it. These are the candidates to continue the current frequency. All candidates for continuation from one frame to the next are put in a vector and sorted by frequency difference. Then, starting from the lowest difference (a difference of 0 means that the next frame contains a frequency equal to one in the current frame), the candidates are linked one by one and erased from the vector, so that the same frequency is not used more than once.
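A minimal sketch of this linking step for one pair of frames follows. The names Candidate and link_frames are hypothetical, candidates are skipped rather than erased, and ties between equal differences are resolved in favour of the lower frequency of the current frame; these details are assumptions, not necessarily what the program does.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Candidate {
    std::size_t cur;   // position in the current frame's list
    std::size_t next;  // position in the next frame's list
    long diff;         // absolute difference in frequency indexes
};

// For each frequency of the current frame, returns the position of the linked
// frequency in the next frame, or -1 if it is not continued. Both inputs hold
// frequency indexes in ascending order. A negative max_jump disables the limit.
std::vector<int> link_frames(const std::vector<long>& cur,
                             const std::vector<long>& next, long max_jump) {
    std::vector<Candidate> cands;
    for (std::size_t i = 0; i < cur.size(); ++i) {
        // Lowest frequency of the next frame that is >= cur[i].
        auto up = std::lower_bound(next.begin(), next.end(), cur[i]);
        if (up != next.end())
            cands.push_back({i, static_cast<std::size_t>(up - next.begin()),
                             *up - cur[i]});
        // Highest frequency that is <= cur[i], unless it is the same entry.
        auto after = std::upper_bound(next.begin(), next.end(), cur[i]);
        if (after != next.begin() && (after - 1) != up)
            cands.push_back({i, static_cast<std::size_t>(after - 1 - next.begin()),
                             cur[i] - *(after - 1)});
    }
    std::stable_sort(cands.begin(), cands.end(),
                     [](const Candidate& a, const Candidate& b) { return a.diff < b.diff; });
    std::vector<int> links(cur.size(), -1);
    std::vector<bool> used(next.size(), false);
    for (const Candidate& c : cands) {
        if (links[c.cur] != -1 || used[c.next]) continue;  // already taken
        if (max_jump >= 0 && c.diff > max_jump) continue;  // jump too large
        links[c.cur] = static_cast<int>(c.next);
        used[c.next] = true;
    }
    return links;
}
```

For a current frame with indexes 50 and 150 and a next frame with only index 100, both are candidates at a distance of 50, the first one is linked and the second one stays unlinked.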
This process of linking the maxima is done for each pair of adjacent windows (frames) obtained from the original sound signal.
Finally, the program can save the analysis data in a text file. This is a simple text file with the format that the program handles:
    channels 1 rate 44100 window_size 2048 advance 1024
    channel 0 : 50 100 0.5 ; 150 -1 0.1 ;
    channel 0 : 100 -1 0.4 ;
    channel 0 :
In this example, the header says that the signal has 1 channel, a sampling rate of 44100 Hz, a window size of 2048 samples, and an advance of 1024 samples (which implies an overlap of 50%). The body has one line for each frame and channel (in this case there is only one channel). Each line lists the partials (sinusoids) present in that frame, giving for each one its frequency, the linked partial of the next frame (if any), and its amplitude. In this example, there are two sinusoids in the first frame. One has frequency index 50 (which corresponds to 50*44100/2048 Hz) and amplitude 0.5 (the maximum amplitude is 1.0), and is linked to the sinusoid with index 100 in the next frame. The other has frequency index 150 (150*44100/2048 Hz) and amplitude 0.1, and is not linked to any sinusoid of the following frame (-1 in the file). In the second frame there is only one sinusoid, with frequency index 100 (100*44100/2048 Hz) and amplitude 0.4, also not linked (-1 in the file). The last frame has no sinusoids, meaning that it is silent.
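As a rough illustration, one body line of this format could be read as shown below, together with the conversion from frequency index to Hz. The grammar is inferred only from the example above, and Partial, parse_frame_line and bin_to_hz are hypothetical names, not the program's own.

```cpp
#include <sstream>
#include <string>
#include <vector>

struct Partial {
    int bin;           // frequency index
    int next_bin;      // linked partial in the next frame, -1 if none
    double amplitude;  // linear amplitude, 1.0 is the maximum
};

// Parse one body line such as "channel 0 : 50 100 0.5 ; 150 -1 0.1 ;".
// Returns false if the "channel N :" prefix or a ";" separator is missing.
bool parse_frame_line(const std::string& line, int& channel,
                      std::vector<Partial>& out) {
    std::istringstream in(line);
    std::string word, sep;
    if (!(in >> word >> channel >> sep) || word != "channel" || sep != ":")
        return false;
    out.clear();
    Partial p;
    while (in >> p.bin >> p.next_bin >> p.amplitude >> sep) {
        if (sep != ";") return false;
        out.push_back(p);
    }
    return true;
}

// Convert a frequency index to Hz: index * rate / window_size.
double bin_to_hz(double bin, double sample_rate, double window_size) {
    return bin * sample_rate / window_size;
}
```

With the header values of the example, bin_to_hz(50, 44100, 2048) gives about 1076.66 Hz.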
In this phase, analysis data are converted again to a series of samples, so that we have a sound more or less similar to the original sound. This phase is also parameterized.
Analysis data can be obtained by loading a text file (with the above format), or using the data that is in memory after doing the analysis phase.
The first parameter applied is equalization. This is a simple step, because the sound representation we have is already expressed in terms of frequencies and amplitudes: we only have to multiply the amplitude of each partial (sinusoid) by the gain that the equalization settings assign to its frequency.
Next, we optionally change the sampling rate of the synthesized sound. To do that, we modify the advance of each frame and multiply the increment in radians/sample of each partial (sinusoid) by a factor. Also optionally, we apply time scaling, simply by altering the advance of each frame. At the same time, if the user requested it, the frequency of each sinusoid is changed as indicated by the frequency scaling and frequency shifting settings.
When additive synthesis is done, there is the option (active by default) of silencing the sinusoids that, for whatever reason, end up with a frequency below 0 or above the Nyquist frequency, which avoids aliasing. In the same phase, there is the option of interpolating the frequency or the amplitude of sinusoids that last more than one frame. Interpolating the amplitude is important: without it, there would be a jump between two frames that could produce audible noise.
When a sinusoid ends, there are two options to avoid distortion: one is to extend the sinusoid until it crosses zero, and the other is to prolong the last value of the sinusoid with an exponential decay by a given factor. This avoids the audible 'click' that appears when a sine ends at the end of a frame with a value other than zero, because the signal drops to zero immediately afterwards.
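The second option could look like the sketch below, under one possible reading of the description: the last sample value is repeatedly multiplied by the decay factor. The function name and the floor_level stopping condition are assumptions added for illustration.

```cpp
#include <cmath>
#include <vector>

// Tail appended after a partial ends: each sample is the previous one times
// `factor` (0 < factor < 1), until its magnitude falls below `floor_level`.
// Assumption: the program may stop the tail under a different condition.
std::vector<float> decay_tail(float last_value, float factor, float floor_level) {
    std::vector<float> tail;
    if (factor <= 0.0f || factor >= 1.0f || floor_level <= 0.0f) return tail;
    float v = last_value * factor;
    while (std::fabs(v) >= floor_level) {
        tail.push_back(v);
        v *= factor;
    }
    return tail;
}
```

The tail is added to the output right after the last sample of the partial, so the signal fades out instead of dropping to zero at once.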
Additive synthesis is done by calculating a sinusoid with the frequency and amplitude envelope of each partial obtained in the analysis phase. This is implemented by adding a sine for each partial, with its phase and amplitude changing gradually while the partial lasts. Conceptually, for each frame and partial:
    for (int i = 0; i < advance; i++)
        buffer[i] += amplitude[i] * sin(phase[i]);
The current implementation, however, does not calculate a table of amplitudes and phases. Instead, we add to the current phase a value that depends on the current frequency, and the current amplitude is calculated as a linear interpolation between its starting and ending values in the frame. The interpolation is applied only if it is selected in the settings.
Now the rest of the synthesis parameters are applied: compression, expansion, and amplitude normalization.
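A minimal sketch of that table-free approach, for one partial in one frame, is shown next. The function name and parameters are hypothetical. The frequency is given as a phase increment in radians/sample and is interpolated linearly here as well, which is an assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Add one partial to one output frame of `advance` samples.
// phase:  running phase in radians, updated so that the next frame continues
//         where this one stopped.
// w0, w1: phase increment (radians/sample) at the start and end of the frame.
// a0, a1: amplitude at the start and end of the frame.
void add_partial(std::vector<float>& buffer, std::size_t advance, double& phase,
                 double w0, double w1, double a0, double a1) {
    if (buffer.size() < advance) buffer.resize(advance, 0.0f);
    for (std::size_t i = 0; i < advance; ++i) {
        const double t = static_cast<double>(i) / static_cast<double>(advance);
        const double amplitude = a0 + (a1 - a0) * t;  // linear interpolation
        const double w = w0 + (w1 - w0) * t;
        buffer[i] += static_cast<float>(amplitude * std::sin(phase));
        phase += w;  // advance the phase according to the current frequency
    }
}
```

Passing the same values for w0 and w1, or for a0 and a1, corresponds to synthesis without interpolation.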
Compression and expansion work similarly. Compression reduces the dynamic range by lowering the amplitude of the parts of the sound that have high intensity. Expansion increases the dynamic range by lowering the amplitude of the parts that have low intensity. To do this, the absolute value of the signal is calculated first, and a low-pass filter that depends on the attack and release time settings is applied to it. When this new signal exceeds (or falls below) the established threshold, the compressor (or the expander) acts according to the settings, applying a gain of less than 1.0 to reduce the amplitude.
Amplitude normalization is needed because we sometimes obtain a sound with very high or very low amplitudes, and converting the samples from floating point to 16-bit integers can then cause a loss of signal-to-noise ratio (when amplitudes are very low) or overflow (when amplitudes are very high). On overflow, when a sample above 1.0 is converted to a 16-bit integer, it wraps around to a negative value, producing an unwanted audible 'click'. Normalization avoids these problems.
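The compressor side can be sketched as follows; the expander would apply the gain below the threshold instead. The one-pole smoothing, the ratio parameter and all names are assumptions chosen for illustration, since the document does not detail the filter or the gain law.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One-pole smoothing coefficient for a time constant given in seconds.
double smoothing_coeff(double seconds, double sample_rate) {
    if (seconds <= 0.0) return 0.0;  // no smoothing at all
    return std::exp(-1.0 / (seconds * sample_rate));
}

// Downward compressor: the part of the envelope above `threshold` is divided
// by `ratio` (hypothetical setting, expected to be greater than 1).
void compress(std::vector<float>& x, double sample_rate, double threshold,
              double ratio, double attack_s, double release_s) {
    if (ratio <= 0.0) return;
    const double ca = smoothing_coeff(attack_s, sample_rate);
    const double cr = smoothing_coeff(release_s, sample_rate);
    double env = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double level = std::fabs(x[i]);      // absolute value
        const double c = (level > env) ? ca : cr;  // attack while rising
        env = c * env + (1.0 - c) * level;         // low-pass filter of |x|
        double gain = 1.0;
        if (env > threshold && env > 0.0) {
            const double target = threshold + (env - threshold) / ratio;
            gain = target / env;  // below 1.0 whenever ratio > 1
        }
        x[i] = static_cast<float>(x[i] * gain);
    }
}
```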
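A minimal sketch of peak normalization followed by the conversion to 16 bits is given below. The names are hypothetical. The clamp in to_int16 is an extra safeguard added for this sketch, not something the document attributes to the program.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Scale the signal so that its largest absolute sample equals `peak`.
void normalize(std::vector<float>& x, float peak) {
    float max_abs = 0.0f;
    for (float s : x) max_abs = std::max(max_abs, std::fabs(s));
    if (max_abs == 0.0f) return;  // silence: nothing to scale
    const float gain = peak / max_abs;
    for (float& s : x) s *= gain;
}

// Convert one sample to a 16-bit integer, clamping instead of wrapping around.
std::int16_t to_int16(float s) {
    const float clamped = std::max(-1.0f, std::min(1.0f, s));
    return static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
}
```

After normalize(x, 1.0f) every sample lies within [-1.0, 1.0], so the conversion uses the full 16-bit range without overflow.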
The last part is to save the signal, from main memory to a file, thanks to the sndfile library. The user can select parameters like file format (WAV, AIFF, ...), and sample format (signed 16-bit, floating point, ...).