Replay Gain - A Proposed Standard

RMS Energy

What we don't want to do!

It's easy to calculate the RMS energy over an entire audio file. For example, Cool Edit Pro (from Syntrillium) does this in its Analyze: Statistics box. Unfortunately, this value doesn't give a good indication of the perceived loudness of a signal. It's closer than the peak amplitude, but it's still not good enough. For this reason, we have to calculate the RMS energy on a moment-by-moment basis (as described on this page), then do something useful with all that data (as described on the next page).

General concept

The signal is chopped into 50ms long blocks.
Then, for each block:
Every sample value is squared (multiplied by itself).
The mean of the squared values is taken.
The square root of the average is calculated.

If you read those steps backwards, it's obvious why it's called Root Mean Square (RMS) averaging. Basically, that's all we have to do.
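The steps above can be sketched in a few lines of Python. This is only an illustration, not the ReplayGain code itself: the function name block_rms, its parameters, and the choice to drop a trailing partial block are all assumptions made for this example.

```python
import math

def block_rms(samples, sample_rate, block_ms=50):
    """Return one RMS value per complete block of block_ms milliseconds."""
    block_len = sample_rate * block_ms // 1000  # samples per block
    if block_len <= 0:
        raise ValueError("block length must be at least one sample")
    rms_values = []
    # Step through the signal one whole block at a time. A trailing
    # partial block is dropped (an assumption of this sketch).
    for start in range(0, len(samples) - block_len + 1, block_len):
        block = samples[start:start + block_len]
        squared = [x * x for x in block]           # square every sample
        mean_square = sum(squared) / block_len     # take the mean
        rms_values.append(math.sqrt(mean_square))  # take the square root
    return rms_values

# Hypothetical test signal: 0.1 s at half scale, then 0.1 s at full scale.
signal = [0.5] * 2000 + [-1.0] * 2000
print(block_rms(signal, sample_rate=20000))  # four blocks of 1000 samples
```

Note that the sign of the samples does not matter, because every sample is squared before averaging.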

Averaging time

The block length of 50ms was chosen after studying the effect of values between 25ms and 1s. 25ms was too short to accurately reflect the perceived loudness of some sounds. Beyond 50ms there was little change (after statistical processing). For this reason, 50ms was chosen.
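To make the 50ms figure concrete, the block length can be converted to a number of samples. The sketch below is illustrative only: the helper name and the sample rates are assumptions chosen for the example, not values taken from the proposal.

```python
def block_length_in_samples(sample_rate_hz, block_ms=50):
    """Number of whole samples in a block of block_ms milliseconds."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return sample_rate_hz * block_ms // 1000

for rate in (44100, 48000):
    print(rate, block_length_in_samples(rate))
# prints:
# 44100 2205
# 48000 2400
```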

Stereo files

The only difficulty lies in what to do with stereo files. We could sum them to mono before calculating the RMS energy, but then any out-of-phase components (having the opposite signal on each channel) would cancel out to zero (i.e. silence). That's not how we perceive them, so it's not a good solution.
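The cancellation problem is easy to reproduce. In the sketch below, the test tone, the helper rms, and the decision to average (rather than simply add) the two channels are assumptions made for illustration. Either way of forming the mono signal gives silence.

```python
import math

def rms(samples):
    """RMS value of a list of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

# Hypothetical test signal: a sine tone on the left channel and the
# exact inverse of it on the right channel (fully out of phase).
left = [math.sin(2 * math.pi * k / 50) for k in range(1000)]
right = [-x for x in left]

# Mixing to mono before measuring: every sample cancels exactly.
mono = [(l + r) / 2 for l, r in zip(left, right)]

print(rms(left))   # about 0.707: each channel clearly carries a signal
print(rms(mono))   # 0.0: the mono mix is digital silence
```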

The alternative is to calculate two RMS values (one for each channel) and then add them. Unfortunately, a linear addition still doesn't match what our ears do. To demonstrate this, consider a mono (single channel) audio track. We replay it over one loudspeaker, and remember how loud it sounds. If we now replay it over two loudspeakers, how large should the signal to each speaker be such that, overall, the sound is still as loud as before? You'd think the answer would be half as large (since we have two speakers, that's what a linear addition would suggest), but if you try it, you'll find that the answer is about 3/4.

We get the right answer if we add the mean squared values of the two channels before calculating the square root. In mixing pan-pot terms, we're using "equal power" rather than "equal voltage". If we also assume that any mono (single channel) signal will always be replayed over two loudspeakers, we can treat a mono signal as a pair of identical stereo signals. Hence a mono signal gives (a+a)/2 (i.e. a), while a stereo signal gives (a+b)/2, where a and b are the mean squared values for each channel. After this, we carry out the square root and conversion to dB.
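A small sketch of this combination rule follows. The function names are hypothetical. The equal_power_gain helper is an extra illustration of the "equal power" idea, under the assumption that the powers from identical speakers simply add.

```python
import math

def combine_mean_squares(mean_sq_left, mean_sq_right=None):
    """Combine per-channel mean squared values as (a + b) / 2."""
    # A mono signal is treated as two identical channels: (a + a) / 2 = a.
    if mean_sq_right is None:
        mean_sq_right = mean_sq_left
    return (mean_sq_left + mean_sq_right) / 2

def equal_power_gain(num_speakers):
    """Per-speaker amplitude that keeps the total power unchanged."""
    return 1 / math.sqrt(num_speakers)

print(combine_mean_squares(0.25))        # mono: 0.25
print(combine_mean_squares(0.25, 0.0))   # signal on one channel only: 0.125
print(equal_power_gain(2))               # about 0.707
```

Because the channels are squared before they are combined, out-of-phase content adds to the total instead of cancelling. The equal-power gain for two speakers, about 0.71, is close to the figure of roughly 3/4 quoted above.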

Implementation

In the MATLAB implementation, the RMS calculation is carried out by the following lines (modified here for clarity) from ReplayGain.m:

      % Note: despite their names, the Vrms_* variables hold mean-square
      % values; the square root is folded into the dB conversion.
      % Mono signal: just do the one channel
      if channels==1,
         Vrms_all(this_block)=mean(inaudio(start_block:end_block).^2);
      % Stereo signal: average the mean-square values of both channels
      elseif channels==2,
         Vrms_left=mean(inaudio(start_block:end_block,1).^2);
         Vrms_right=mean(inaudio(start_block:end_block,2).^2);
         Vrms_all(this_block)=(Vrms_left+Vrms_right)/2;
      end
      % Convert to dB
      Vrms_all=10*log10(Vrms_all+10^-10);

In the last line, the addition of ten to the power minus ten (10^-10) prevents the calculation of log(0) (which would give an error) during periods of digital silence. A level of approximately -100 dB is calculated instead, which (on this scale) is below the noise floor of a 16-bit recording.
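The effect of the small offset can be checked directly. The Python sketch below mirrors the last MATLAB line. The names SILENCE_OFFSET and mean_square_to_db are made up for this example, and the error behaviour shown is that of Python's math.log10, not necessarily MATLAB's.

```python
import math

SILENCE_OFFSET = 1e-10  # same value as 10^-10 in the MATLAB line

def mean_square_to_db(mean_square):
    """Convert a mean squared value to dB, guarding against log10(0)."""
    return 10 * math.log10(mean_square + SILENCE_OFFSET)

print(mean_square_to_db(0.0))  # digital silence: about -100 dB
print(mean_square_to_db(0.5))  # full-scale sine wave: about -3 dB

# Without the offset, Python refuses to take the logarithm of zero.
try:
    math.log10(0.0)
except ValueError as err:
    print("log10(0) failed:", err)
```

For any realistic signal level the offset is far too small to change the result noticeably.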

10*log10(signal) is the same as 20*log10(square_root(signal)). Thus we carry out the square root and the conversion to dB in one simple step, without actually having to call a square root function.
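The identity follows from log10(square_root(x)) = 0.5*log10(x). As a quick numerical check, the sketch below compares both routes for a few arbitrary mean squared values. The function names and test values are assumptions for illustration.

```python
import math

def db_from_mean_square(mean_square):
    """dB value computed directly from the mean squared value."""
    return 10 * math.log10(mean_square)

def db_via_square_root(mean_square):
    """dB value computed the long way: square root first, then 20*log10."""
    return 20 * math.log10(math.sqrt(mean_square))

for value in (1e-6, 0.25, 0.5, 1.0):
    direct = db_from_mean_square(value)
    long_way = db_via_square_root(value)
    # The two results agree to within floating-point rounding.
    print(value, direct, long_way)
```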

Suggestions and further work

This whole process can probably be extensively optimised in a lower-level language. I leave this task to whoever decides to implement ReplayGain!

The block length of 50ms could be adjusted if extensive trials showed a different length was more appropriate. However, 50ms seems to work well, and makes sense on both psychoacoustic and signal processing grounds.