This blog post describes my experience and work on audio upmixing filters in VLC during the 2020 edition of Google Summer of Code, under the mentorship of Alexandre Janniaux (unidan), Thomas Guillem (tguillem) and Jean-Baptiste Kempf (j-b).
The code can be found here.
A Little Background on How Modules in VLC work
VLC has a core and a lot of modules. These modules are almost all independent of each other and provide most of VLC's functionality. When a filter is selected, the associated module is loaded. A module generally has four parts:
- Module description: This section defines the name of the module, a short description of what the module does, any parameters the module requires, the capability of the module, and the callbacks of the module.
  - The name of the module is the term that is displayed in the GUI.
  - The description appears on hovering over the module in the GUI.
  - Parameters of the module are the values passed to the module that help it perform its function. They usually have default values. For example, a video sharpening filter requires a sharpening strength.
  - Capability defines the type and priority of the module. When a module of a certain type is requested, all modules with that capability are tried in descending order of priority until a suitable one is found. For example, a user-selected audio filter has the capability ("audio filter", 0).
  - The callbacks of the module are the functions that are called when the module is loaded into and unloaded from memory. They include the Open() function and sometimes the Destroy() function.
- Open(): This function is called when a module is loaded into memory. It contains checks to validate the input to the module, initialises the internal structures of the module and calls the DoWork() function, iterating over the audio input in block_t format.
- DoWork(): This function receives the input audio samples in block_t format. block_t is an internal core VLC structure which contains a certain number of audio samples. DoWork() performs operations on this block_t structure and returns the block_t to the Open() function. Along with the samples, DoWork() also receives the internal data structures, if any, from the Open() function.
- Destroy(): This is an optional callback defined in the module description. It frees the memory occupied by the internal data structures of the module.
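To make the priority rule concrete, here is a toy model of "try the modules with a given capability in descending order of priority until one opens". It is only a sketch: `toy_module`, `toy_module_need`, `toy_open_ok` and `toy_open_fail` are hypothetical names invented for this illustration and are not part of VLC's API.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a module entry (not VLC's real type). */
typedef struct {
    const char *name;
    const char *capability;
    int priority;
    int (*open)(void); /* returns 0 on success, non-zero on failure */
} toy_module;

/* Example callbacks used to exercise the lookup. */
int toy_open_ok(void)   { return 0; }
int toy_open_fail(void) { return -1; }

/* qsort comparator: higher priority first. */
static int cmp_priority_desc(const void *a, const void *b)
{
    const toy_module *ma = a;
    const toy_module *mb = b;
    return (mb->priority > ma->priority) - (mb->priority < ma->priority);
}

/* Sort by descending priority, then return the first module with the
 * requested capability whose open() succeeds, or NULL if none does. */
const toy_module *toy_module_need(toy_module *mods, size_t n, const char *cap)
{
    qsort(mods, n, sizeof *mods, cmp_priority_desc);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(mods[i].capability, cap) != 0)
            continue;
        if (mods[i].open() == 0)
            return &mods[i];
    }
    return NULL;
}
```

With a priority-10 module whose `open` fails and a priority-0 module whose `open` succeeds, the lookup falls through to the priority-0 module.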
What is upmixing
Upmixing is a process that takes an audio input with a certain number of channels and converts it into an audio output with a greater number of channels. In this case, I upmixed stereo audio, i.e. audio with 2 channels (L R), into 4 or more channels (L R C S). There were two major aspects to my GSoC project:
- Generic upmixing.
- Matrix-based upmixing.
Generic Upmixing
Generic upmixing uses mathematical and statistical tools to generate center and surround channels from the left and right channels. I worked on two such algorithms:
Passive Surround Decoding
Passive Surround Decoding is the simplest of the upmixing methods. It is a passive method that derives the center and surround channels from the sum and difference of the left and right channels.
Center [i] = ( X_left[i] + X_right[i] ) * 0.5 ;
Surround [i] = ( X_left[i] - X_right[i] ) * 0.5 ;
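The two formulas above can be sketched as a small C function. The interleaved sample layout (L R in, L R C S out) and the name `passive_upmix` are assumptions made for this example; the actual filter works on VLC's block_t and its own channel order.

```c
#include <stddef.h>

/* Passive surround decode.
 * in:  interleaved stereo samples (L R L R ...), 2 * frames floats
 * out: interleaved 4-channel samples (L R C S ...), 4 * frames floats */
void passive_upmix(const float *in, float *out, size_t frames)
{
    for (size_t i = 0; i < frames; i++) {
        float left  = in[2 * i];
        float right = in[2 * i + 1];

        out[4 * i + 0] = left;                   /* L is passed through */
        out[4 * i + 1] = right;                  /* R is passed through */
        out[4 * i + 2] = (left + right) * 0.5f;  /* C: half the sum */
        out[4 * i + 3] = (left - right) * 0.5f;  /* S: half the difference */
    }
}
```

Identical left and right samples end up entirely in the center channel, while samples of opposite sign end up entirely in the surround channel.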
Principal Component Analysis (WiP)
Principal Component Analysis is a statistical tool that is used for dimensionality reduction. Given a number of points in a two-dimensional space, principal component analysis finds two orthonormal vectors that best fit these data points: the first is the direction that minimises the average squared distance to the points, and the second is perpendicular to it. These vectors are the two eigenvectors of the covariance matrix of the points, and their eigenvalues are generally unequal. In audio upmixing, the eigenvector with the larger eigenvalue represents the correlated part, i.e. the center channel, while the eigenvector with the smaller eigenvalue represents the uncorrelated part, i.e. the surround channel.
In the filter, I first calculated the sums and products of the two channels. I used this data to find the covariance matrix and its two eigenvectors. These were then multiplied with the left and right channels to obtain the center and the surround channels. To smooth over the difference in weights from one block to the next, a fading procedure was used. The block was divided into 64 equal parts, and the coefficients of the previous block were blended in proportionately to make the transition as smooth as possible.
The issue with this filter was that the fading procedure did not remove the artifacts caused by the changing weights. The volume kept fluctuating, becoming extremely loud in some passages and almost silent in others. I am testing normalization and other methods to try to fix this issue.
Matrix-based Upmixing
Matrix-based upmixing uses decoding matrices to extract multichannel audio from stereo audio. Unlike generic upmixing, it extracts these channels from encoded audio, i.e. multichannel audio that was encoded into stereo channels for transmission. Even though information is lost in the encoding, the decoder produces an acceptable approximation of the center and surround channels.
Under GSoC, I worked on two matrix-based upmixers :
- Dolby Prologic
- Dolby Prologic II
Dolby Prologic
Dolby Prologic is used to decode audio encoded with Dolby Surround. It is a 4:2:4 matrix, producing 4 channels from stereo audio which was encoded from 4 channels. It uses the following decoding matrix :
| | L | R | C | S |
|---|---|---|---|---|
| Lt | 0.412956 | 0.073144 | 0.707107 | 0.707107 |
| Rt | 0.073144 | 0.412956 | 0.707107 | -0.707107 |
This algorithm was already implemented in VLC’s dev branch in modules/audio_filter/channel_mixer/dolby.c. I changed the calculations of the center and the surround channels slightly, reducing the volume of the center and surround channels by 3 dB and the left and right channels by 6 dB.
I used HandBrake to encode the sample audio and tested the Pro Logic decoding with it.
Dolby Prologic II
Dolby Prologic II is the successor of Dolby Prologic. It is a 5:2:5 matrix, producing 5 channels from stereo audio which was encoded from 5 channels. In short, whereas Dolby Prologic has one rear channel, Dolby Prologic II has 2 rear channels. The decoding matrix of Dolby Prologic II is:
| | L | R | C | LS | RS |
|---|---|---|---|---|---|
| Lt | 0.412956 | 0.073144 | 0.343724 | -0.294947 | -0.178698 |
| Rt | 0.073144 | 0.412956 | 0.343724 | 0.242277 | 0.205033 |
This matrix is the inverse of the encoding matrix. It was implemented in the same file as Dolby Prologic, as it uses similar filter checks. A parameter was provided to the user through which they could select either of these filters.
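Applying either decoding matrix amounts to a weighted sum per output channel: out = Lt × (Lt row) + Rt × (Rt row). The sketch below uses the coefficients from the two tables and selects the matrix through a parameter. The names `decode_matrix`, `PROLOGIC`, `PROLOGIC2` and `matrix_decode`, the interleaved layout and the output channel order are assumptions for this example, and the row/column reading of the tables is my interpretation.

```c
#include <stddef.h>

enum { DOLBY_MAX_OUT = 5 };

/* One decoding matrix: a coefficient per output channel for Lt and for Rt. */
typedef struct {
    size_t channels;
    float lt[DOLBY_MAX_OUT];
    float rt[DOLBY_MAX_OUT];
} decode_matrix;

/* Output order: L, R, C, S */
static const decode_matrix PROLOGIC = {
    4,
    { 0.412956f, 0.073144f, 0.707107f,  0.707107f },
    { 0.073144f, 0.412956f, 0.707107f, -0.707107f },
};

/* Output order: L, R, C, LS, RS */
static const decode_matrix PROLOGIC2 = {
    5,
    { 0.412956f, 0.073144f, 0.343724f, -0.294947f, -0.178698f },
    { 0.073144f, 0.412956f, 0.343724f,  0.242277f,  0.205033f },
};

/* in:  interleaved stereo (Lt Rt Lt Rt ...), 2 * frames floats
 * out: interleaved multichannel, m->channels * frames floats */
void matrix_decode(const decode_matrix *m, const float *in, float *out,
                   size_t frames)
{
    for (size_t i = 0; i < frames; i++) {
        float lt = in[2 * i];
        float rt = in[2 * i + 1];
        for (size_t ch = 0; ch < m->channels; ch++)
            out[i * m->channels + ch] = lt * m->lt[ch] + rt * m->rt[ch];
    }
}
```

With the Prologic matrix, an input where Lt equals Rt produces no surround output, because the two surround coefficients cancel.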
Other related operations
Surround channel delay
The precedence effect is an acoustic phenomenon in which two sounds are perceived as a single sound if one is delayed by 50 ms or less relative to the other. Due to this effect, the listener is able to perceive the space and direction of the sound.
This was implemented for the Dolby matrices by delaying the surround channels to provide the precedence effect. The user can choose a delay of up to 100 ms. The sampling rate of the audio is used to convert the delay into a number of samples. For example, at a sampling rate of 48 kHz, the delay can range from 48 samples (1 ms) to 4800 samples (100 ms).
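The conversion from a delay in milliseconds to a number of samples is a single multiplication and division. A minimal sketch follows; the function name `delay_ms_to_samples` is made up for this example.

```c
#include <stddef.h>

/* Number of samples corresponding to delay_ms at the given sampling rate.
 * Integer arithmetic: any fractional sample is truncated. */
size_t delay_ms_to_samples(unsigned delay_ms, unsigned sample_rate_hz)
{
    return (size_t)delay_ms * sample_rate_hz / 1000u;
}
```

At 48 kHz this gives 48 samples for 1 ms and 4800 samples for 100 ms, matching the figures above.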
Initially, I had written a version which performed swapping with a delay line, initialised to zero. However, this was a very expensive operation, as all samples went through the delay line, significantly increasing the CPU usage. Next, I modified the delay line to store only the excess data. However, this led to a lot of memory problems as well as issues with delays longer than one block_t (VLC’s internal stream payload structure). Finally, I settled on storing the block_t structures in a queue until they were completely processed and sent to the output. At first, this created a lot of memory issues, segmentation faults and erroneous audio. After a lot of troubleshooting, I was able to get it to work.
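For illustration, here is a minimal sketch of a zero-initialised delay line in the spirit of the first approach, where every sample is swapped through a circular buffer. It is not the queue-of-block_t design that was finally used, and `delay_line`, `delay_init`, `delay_process` and `delay_free` are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

/* Circular buffer holding the last `len` samples of one channel. */
typedef struct {
    float *buf;
    size_t len;
    size_t pos;
} delay_line;

/* Allocate a zero-filled delay line. Returns 0 on success, -1 on failure. */
int delay_init(delay_line *d, size_t delay_samples)
{
    d->buf = delay_samples ? calloc(delay_samples, sizeof *d->buf) : NULL;
    d->len = delay_samples;
    d->pos = 0;
    return (delay_samples != 0 && d->buf == NULL) ? -1 : 0;
}

/* Push one sample in and get back the sample from `len` calls ago. */
float delay_process(delay_line *d, float x)
{
    if (d->len == 0)
        return x; /* no delay requested */
    float y = d->buf[d->pos];
    d->buf[d->pos] = x;
    d->pos = (d->pos + 1) % d->len;
    return y;
}

void delay_free(delay_line *d)
{
    free(d->buf);
    d->buf = NULL;
    d->len = 0;
    d->pos = 0;
}
```

Because the buffer starts out zeroed, the first `len` output samples are silence, after which the input reappears unchanged but delayed.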
Implementing this procedure took quite some time. I repeatedly ran into odd edge cases, even when I thought I had accounted for every use case. Eventually, I was able to overcome them, thanks to the help and encouragement of my mentors.
Hilbert transform (WiP)
The rear channels are shifted by 90 degrees when they are encoded to Dolby Prologic II. Hence, the rear channels generated by the decoding matrix should be shifted back by 90 degrees to obtain the original data. This is achieved by performing the Hilbert transform. This was done in accordance with the MATLAB documentation, using FFmpeg’s avfft.h.
First, we perform a Fast Fourier Transform on the audio. Next, we double the first half of the output and set the second half to zero. This data is then passed through an Inverse Fast Fourier Transform. The resulting data is used as the rear channels.
The main issue with this algorithm is that there is noise in the output. I have run various tests on it, and the noise persists.
Experience
This summer helped me hone my skills as a C developer. I learnt how to read and understand huge codebases, like VLC, MPlayer (for the qualification task) and ffmpeg (for the Hilbert transform). I learnt a lot about memory management in C. I also learnt a lot about Makefiles, automake and the automake configurations, while working to integrate ffmpeg’s avfft and adding checks. Working so closely with a media application like VLC has also vastly increased my understanding of digital audio and the principles of audio filters. I have also gained insights into a lot of theoretical and mathematical topics, such as Fourier transforms, covariance and normalization.
Working with Alexandre, Thomas and Jean-Baptiste has been a great experience. They have provided resources, motivation and help whenever required. They are very understanding and patient. From the beginning, the emphasis was on what I learned, which was very refreshing.
This has been one of my most enjoyable and fruitful summers. I shall continue to work and finish my projects and collaborate closely with VideoLAN’s VLC. I have recommended and will recommend Google Summer of Code, especially VideoLAN, to my peers.