Essay - "Digital Audio: What Does This
Term Really Mean?" - February, 1998
By John Busenitz
If you're like me, you are probably numb to the word
"digital". It rates right up there with terms like "Generation X",
"paparazzi", and "au pair". "Digital" is used to denote the
messiah as well as satan, and is inadequate for both. There is no such thing as
"digital-ready". In fact, "digital", which loosely means "based
on numbers," doesn't really describe what it is meant to. Rather, it actually
describes three things: discrete-time, quantization, and binary-based manipulation and
storage.
Yeah, I apologize for sounding like the dorky engineering person I am, but this
technobabble most accurately describes what is happening. And I'm not so dorky. It's not
like I play Internet MUDs on Friday nights, and have every Weird Al recording. But enough
about me. Basically, what is happening with digital is chopping stuff up into pieces. The
music, that is. "Digital" is all about taking something that is so detailed and
big, like actual music, or the microphone feed, and approximating it with numbers. Which
means it is not "perfect sound forever" despite what the marketing people who
coined this cute but wrong phrase might have you believe. The engineers (to toot my own
horn) always knew better.
How is this significant? Well, as you are all probably subconsciously aware, it means
smaller, cheaper, and Some Other Things. We will get into those later. Is this vastly
different from analog, like a vinyl record? No, not really. Because they all are a limited
picture of the thing they represent, which is the musical signal.
Think of an image on a TV screen or a monitor. If you look at it from far away, it appears
continuous, pretty much like one of your family album snapshots. As you get closer, you
can see the picture is made of dots, i.e., it is quantized, just like space-time, and is
composed of little parts, or quantities. Furthermore, it's quantized horizontally as dots
(the "horizontal resolution") and vertically as lines (the 525 horizontal scanning
lines, the "vertical resolution"). Of course, you could call the horizontal
scanning lines big rectangular dots if you want, but the point is, the image on our
regular NTSC TVs is quantized in the X and Y direction, and is therefore, already digital!
If we really want to get philosophical, electrical current is the passing of discrete
electrons, nerves in our ears and brain are passing discrete impulses, and sound waves are
individual oxygen and nitrogen molecules striking our eardrums, so they are all
discrete in some sense. But this essay is not for the purpose of starting flame wars.
For CDs, instead of being quantized in 2-dimensions of space (horizontal and vertical),
the signal is quantized in time and also level (voltage, i.e., volume). The actual sound
of the music, after passing through the microphones, is recorded, or sampled, at precise
instants in time. As is more or less intuitive, the more often things are sampled, the
better they can be represented (and decoded into a reasonable facsimile of the original
signal). According to theory, we have to sample at (at least) twice the highest frequency
we wish to record. This was suggested by Nyquist several decades ago, and the result is
termed the Nyquist Criterion. The idea is that a sine wave can be completely reproduced
using as few as two samples for every period (one complete sine wave). So now we have to
make sure that we
sample at a rate high enough to record everything we can hear. For current CDs, the
samples are made 44,100 times each second (44.1 kHz), although the original digital tape
recording master may be at a higher sampling rate.
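The criterion is easy to see numerically. Here's a little sketch in plain Python (no audio libraries, and the 1 kHz tone is just a made-up test frequency) that samples a sine at the CD rate and then reconstructs its value *between* two sample instants using the textbook sinc (Whittaker-Shannon) interpolation:

```python
import math

def sample(f, fs, n):
    """n samples of a sine at frequency f, taken at rate fs."""
    return [math.sin(2 * math.pi * f * k / fs) for k in range(n)]

def reconstruct(samples, fs, t):
    """Whittaker-Shannon (sinc) interpolation at an arbitrary time t."""
    total = 0.0
    for k, x in enumerate(samples):
        arg = math.pi * (t * fs - k)
        total += x * (1.0 if arg == 0.0 else math.sin(arg) / arg)
    return total

fs = 44100.0                      # CD sampling rate
f = 1000.0                        # a 1 kHz tone, comfortably under fs/2
xs = sample(f, fs, 2000)

# Evaluate the reconstruction *between* two sample instants and compare it
# to the true sine at that moment.
t = 500.5 / fs
print(abs(reconstruct(xs, fs, t) - math.sin(2 * math.pi * f * t)))  # tiny
```

The point is that the in-between value was never stored, yet it comes back anyway, because a tone below half the sampling rate is completely determined by its samples.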
If the actual music signal changes faster than it is being sampled, there are problems.
What results is called "aliasing", the original information going under a new
name. Like wagon-wheel spokes in a western that appear to turn backwards, aliasing is the
high-frequency (quickly-changing) information showing up backwards. Not backwards in time,
but in frequency: it folds back down into the audible band. Obviously, this messes things
all up. So what we have to do is filter the musical signal so that no information is passed
through at a higher frequency than half of the sampling frequency. For CDs, this means
removing (filtering) everything above 22.05 kHz (half of 44.1 kHz).
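Here is a toy demonstration of that folding, using the CD rate: a 30 kHz tone (above the 22.05 kHz limit) produces exactly the same samples as a phase-flipped 14.1 kHz tone, so once sampled, the two are indistinguishable:

```python
import math

fs = 44100.0                      # CD sampling rate
f_high = 30000.0                  # a tone above fs/2 = 22050 Hz
f_alias = fs - f_high             # 14100 Hz: the "new name"

high  = [math.sin(2 * math.pi * f_high  * k / fs) for k in range(100)]
alias = [-math.sin(2 * math.pi * f_alias * k / fs) for k in range(100)]

# Sampled at 44.1 kHz, the 30 kHz tone is indistinguishable from a
# phase-flipped 14.1 kHz tone -- the samples are identical.
print(max(abs(a - b) for a, b in zip(high, alias)))   # essentially zero
```

No filter after the fact can undo this, which is why the filtering has to happen before the A/D converter ever sees the signal.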
Now that I have discussed the time quantization in CDs, let's talk about the other
dimension of quantization, namely, level, or amplitude. The sampling rate describes the
"chopping up" of a signal in terms of time, but there's also the amplitude at
each of those instances to record. As you might expect, more quantization intervals are
better, but a huge number of them are not necessary, and take up a lot of space on the
disc. And, since the actual value (voltage) has to be rounded up or down to the nearest
value (one of 65,536, i.e., 2^16, values represented by 16-bit "words" in CDs),
there is some amount of error, called quantization error. The way to take care of that is
to add a little bit of noise (not the music that teenagers listen to today, but sort of a
hissy, rushing-water sound a bit like what one can hear between FM stations). This
randomizes the errors so they aren't as noticeable, in part because noise is tuned out by
the ear. This sort of noise is called "dither", a term which I believe originates
with that crazy Dagwood's mean ol' boss. Anyway, dither decorrelates the quantization
noise from the musical signal, which makes it less audible.
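A sketch of why this works, using triangular (TPDF) dither, which is one common choice, not necessarily what any given mastering chain uses. A test tone whose peak is only 0.4 LSB sits below the smallest 16-bit step, so undithered rounding turns it into pure silence; with dither, the waveform survives (and an average over many passes makes that visible):

```python
import math, random

random.seed(1)
LSB = 1.0 / 32768                 # one step of a 16-bit quantizer

def quantize(x):
    return round(x / LSB) * LSB

n = 1000
# A test tone whose peak is only 0.4 LSB -- real detail, but below one step.
signal = [0.4 * LSB * math.sin(2 * math.pi * k / 100) for k in range(n)]

# Without dither, everything rounds to zero: the quiet detail simply vanishes.
undithered = [quantize(x) for x in signal]
print(all(v == 0.0 for v in undithered))          # True

# TPDF dither: the sum of two uniform randoms, spanning about +/- 1 LSB.
def dithered(x):
    d = (random.random() + random.random() - 1.0) * LSB
    return quantize(x + d)

# Averaging many dithered quantizations recovers the sub-LSB waveform.
avg = [sum(dithered(x) for _ in range(400)) / 400 for x in signal]
worst = max(abs(a - s) for a, s in zip(avg, signal)) / LSB
print(worst < 0.2)                                # True: a fraction of an LSB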
All of what I just described - high-frequency limiting, and finite, discrete, levels - are
basically what happens with all of our ways of recording and playing music, such as vinyl
records, tapes, DAT, and so on. The methodology is different, but the fundamentals are
similar. With vinyl, the cutter head and the stylus can't move faster than a certain
speed, so there is a high-frequency limit. And they can't move with perfect precision and
accuracy (and vinyl can't record such, anyway), so there is noise and inaccuracy in the
amplitude. This means that, just like with "digital", there is a limit to how
small a change in amplitude is realizable, because noise obscures anything smaller than
itself.
So the limits we end up with for CD digital audio are a bandwidth of 22.05 kHz and a
signal-to-noise ratio (SNR) of 96 dB (this means that the music can be, at most, 96 dB
louder than the noise). These are the absolute best, theoretical limitations. In the
beginning of digital audio, they were unobtainable, but with modern technology (such as
oversampling, noise shaping, etc.) we can better push the envelope. The question at hand
is, are these limitations good enough? Or do we need to increase bandwidth and SNR?
In this authors opinion, the answer to both questions is yes AND no. No, 16 bits are not
quite adequate, under the quietest listening conditions. Our sensitivity in the midrange
is just a little beyond that. However, nobody listens under the quietest listening
conditions. And noise shaping, which increases resolution in the midrange where we are
more sensitive to sounds, at the cost of less resolution at higher frequencies where we
don't hear as well, pretty much solves the problem. Precious little music could even take
advantage of it. And, more importantly, the industry is not using the current 16-bit
standard to its incredible potential. So why in the world could we expect them to do
better with an even more sophisticated standard?
These reasons are enough for me to think that we should stick with the same word length
(number of bits) and use any extra information capacity to work on the Achilles heel of
current stereo, which is spatial reproduction. This is not to say, however, that a longer
word length shouldn't be used in the recording and processing stages. During those
periods, one cannot be sure what the loudest level will be, so the engineers need a lot of
headroom (unused bits) to make sure there won't be any clipping. And digital signal
processing can make use of a longer word length during the mathematical computations, to
make sure that resolution doesn't go down below 16 bits when something is digitally
attenuated, among other operations. But, once the final result is had, it will surely have
less than a 96 dB dynamic range, which one can nicely fit into a 16-bit system. Another
issue is that very few, if any, electronics can support 24 bits of resolution, at the
current consumer voltage. The inherent noise floor of the electrical components themselves
(resistors and semiconductors) is too high to allow such high signal/noise ratios. And
24-bit D/A converters use that only in name, not performance, for the most part. What is
the point in such high medium resolution if the electronics can't support it?
As for increasing the sampling rate, there are similar arguments. It is the bandwidth of
the whole signal chain that matters, not just the recording medium. What combination of
microphone, preamplifiers, power amplifiers, and speakers have a bandwidth of 20 kHz?
Precious few. This really puts a damper on increasing bandwidth even if we assumed that
people can hear higher than 20 kHz. And there are no reliable studies that support such
assumptions.
There is, though, a bit of reasoning for higher sampling rates. In order not to cause
problems (aliasing and imaging), the signal going into the A/D converter and coming out of
the D/A converter must be low-pass filtered (attenuate the higher frequency portion of the
signal). To make sure nothing above half the sampling frequency is present, a very sharp
(slope) filter must be used. This causes problems like ripple in the frequency response
and wacky phase behavior at the higher frequencies. So, if we move the sampling frequency
up, those problems won't be nearly as audible; in fact, we can use a simpler filter at a
lower frequency. However, a technique called "oversampling" helps with this
problem. Oversampling inserts fake samples in between the real samples. With more samples,
we must sample at a faster rate to reproduce the signal properly. The fake samples are
ultimately filtered out and are too high in frequency to hear. And when we sample at a
higher frequency, Ta-daaa! The filter can be one that doesn't cause audible coloration.
However, there is still something to be said for a medium that is flat to just beyond the
upper frequency limit of audibility. This means that it won't, in addition to the
high-frequency limiting of other components, contribute to the audibility of the
cumulative effects of high-frequency limiting of the system as a whole. If every component
is a few decibels quieter at 20 kHz, it can all add up and become audibly improved.
There are a number of well-recorded CDs that illustrate the potential of CD sound. The PGM
discs, made by the late Gabe Wiener, are especially representative of what can happen when
the recording engineer actually knows what he is doing. I own "The Buxtehude
Project" and the Ricercar harpsichord album, which exhibit incredibly detailed and
pure sonics. There are sure to be other recordings of similar quality, which further
proves the point that the problem is not with CD technolgy, but the implementation
thereof, not to mention the abundance of poor recordings that give the current technology
a bad name.
You can tell that I am somewhat skeptical of blindly increasing the sampling rate and word
length. Not because doing such is bad in and of itself, but because it will require a
compromise in other, more important, areas, like the number of channels and the amount of
music. I would rather have better spatial reproduction and/or more music than ostensibly
"improved" sonics that are really more hype than reality. With noise shaping and
oversampling, we can solve any problems that are even hinted at with straight 16-bit flat
dither 44.1 kHz sampling CD digital audio. In any case, DVD can theoretically support a
multitude of sampling rates and word lengths, so everyone can be satisfied . . .
theoretically. At any rate, I've no doubt that many of you may disagree on these issues,
and I welcome the fuel for future discussion.
John Busenitz
� Copyright 1998, Secrets of Home Theater & High
Fidelity
Return to Table of Contents for this Issue.