Posts on Factual Audio

How sound travels

etienne@edechamps.fr (Etienne Dechamps) — Sun, 03 Jun 2018 17:53:00 +0100

In previous posts about the acoustic realm, we’ve assumed that sound propagates into free space, willfully ignoring the pesky issue of obstacles or walls getting in the way. Such simplifying assumptions will only get us so far, if only for the fact that the inside of your living room is not levitating in free space. We need to take a close look at what happens when sound comes into contact with something other than air. But before we go there, we need to examine exactly how sound propagates in air — or any other fluid for that matter; that is the focus of this post. From there it will be easier to explain what happens when obstacles enter the picture.

Plane waves

In previous posts, we’ve been assuming sound is being produced by a point source. Conceptually, point sources are like infinitely small balloons (spheres) that contract and expand to produce a sound wave around them, thus producing a spherical wavefront.

While these concepts served us well, there are cases, such as in this post, where the technicalities of dealing with spheres can get in the way of understanding. For this reason, in this post we’re going to visualize a wavefront as a flat plane (or a line in 2D). This is called a plane wave. To create such a wavefront, one could use a vibrating membrane, just like in a typical loudspeaker driver, in lieu of a pulsating balloon.

At first glance this approach might seem disconnected from the spherical waves we’ve been using before. This is not so: there are ways to bridge these two. Specifically, a plane source can be represented as an infinite set of point sources: [note] The general idea of using a set of point sources to approximate a complex wavefront goes much farther than this, as evidenced by the Huygens-Fresnel principle. [end note]

Furthermore, a sufficiently small portion of a spherical wavefront can be approximated as a planar wavefront:

Part of what makes plane waves simpler is that they are not subject to the inverse square law, because, contrary to an expanding sphere, sound power doesn’t get spread over a larger surface as the wave progresses. This is consistent with the above diagram: when looking at a spherical wavefront on a scale that is much smaller than the radius, then, by definition, the distance to the source doesn’t change much, and as a result the inverse square law can be neglected. This progressively ceases to be true as we zoom out. In the remainder of this post we are also going to ignore the small effect of air attenuation to avoid distractions.
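A quick numeric sketch of that last point, assuming a point source whose sound pressure falls as 1/r (i.e. the inverse square law for intensity) and a 1 kHz wavelength of 0.343 m (assumed speed of sound 343 m/s). It shows how much the level changes across one wavelength at various distances from the source. The function name `level_change_db` is purely illustrative.

```python
import math

def level_change_db(r, delta):
    """Change in sound pressure level (dB) between distances r and r + delta
    from a point source, assuming pressure proportional to 1/r."""
    return 20 * math.log10(r / (r + delta))

wavelength = 0.343  # m, roughly one wavelength at 1 kHz (assumed c = 343 m/s)
for r in (1.0, 10.0, 100.0):
    print(f"r = {r:5.0f} m: {level_change_db(r, wavelength):+.3f} dB over one wavelength")
```

At 100 m the change over one wavelength is only a few hundredths of a decibel, which is why a small patch of a distant spherical wavefront behaves much like a plane wave.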

When dealing with plane waves, it is often assumed, for simplicity’s sake, that the wavefront has infinite extent, i.e. it’s a literal geometrical plane. Said differently, it is assumed that the planes — and, by extension, the sound source that’s producing them — extend much farther than the phenomenon we’re studying, such that it doesn’t matter what’s happening at the edges. As usual, this simplifying assumption does not fundamentally affect the phenomena that we’ll be studying in this post; it just makes things more focused and easier to visualize.

Modeling air

Our goal in this post is to understand what’s going on in a volume of air as sound travels through it.

“What’s going on in a volume of air” is one of the most vague and all-encompassing questions one could possibly ask. Let’s say we want to study the propagation of a simple 1000 Hz sine wave, which has a wavelength of ~340 millimeters. A cubic millimeter of air contains about 10¹⁶ (ten million billion) molecules. Since air is a gas, these molecules are constantly moving around in mostly random fashion. Clearly, trying to understand what’s going on by looking at individual molecules — on a microscopic scale — sounds like a tall order.

Instead, it makes more sense to look at the behavior of air on a macroscopic scale. For the distances we are dealing with, the number of molecules is large enough that the overall behavior emerging from their random interactions can be statistically summarized through a few simple metrics, such as density, temperature and pressure. Among these metrics, the one we care about the most is of course pressure — remember that pressure variation is the very definition of sound itself.

That said, for the purposes of our investigation, it wouldn’t make sense to go to the other extreme and study the entire volume of air as a single unit. Indeed, sound waves are pressure fluctuations that propagate through space; if we only look at the overall pressure of the entire volume of air, information about these local fluctuations is lost.

Therefore, it looks like the best way to approach this problem is to strike a compromise. We will arbitrarily divide the space into a number of individual “pockets” of air of constant mass, each with its own properties. The idea is to study a complex system by breaking it down into simple constituent parts and examining their interactions. [note] This general approach is not just useful for educational purposes, mind you. It forms the basis of how physical laws are applied by engineers and scientists to study the behavior of complex systems, using techniques like the finite element method or the finite difference method. The acoustics of loudspeakers and concert halls, for instance, are often designed using such methods. [end note] These pockets of air should be small enough that we can clearly see the propagation of the sound wave through space, but not too small lest we lose sight of the big picture. Here, for our 1000 Hz wave, it would be enough to have pockets that are, say, around 17 mm in size, such that a single wavelength is broken down into 20 pockets or so.
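The pocket size above follows from simple arithmetic. This sketch assumes a speed of sound of 343 m/s and reuses the order-of-magnitude figure of 10¹⁶ molecules per cubic millimeter quoted earlier; under those assumptions a single pocket still holds roughly 5 × 10¹⁹ molecules, which is why statistical quantities like pressure remain meaningful at that scale. The helper name is illustrative.

```python
SPEED_OF_SOUND = 343.0     # m/s, assumed value for air at room temperature
MOLECULES_PER_MM3 = 1e16   # order of magnitude quoted in the text

def pocket_size_m(frequency_hz, pockets_per_wavelength=20):
    """Edge length of a cubic pocket so that one wavelength spans
    `pockets_per_wavelength` pockets."""
    wavelength = SPEED_OF_SOUND / frequency_hz
    return wavelength / pockets_per_wavelength

size = pocket_size_m(1000.0)
molecules = MOLECULES_PER_MM3 * (size * 1000) ** 3  # cube volume in mm^3
print(f"pocket size: {size * 1000:.1f} mm, molecules per pocket: {molecules:.0e}")
```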

Initial setup

Let’s set the stage as a single vertical plane wave source, i.e. an infinitely large membrane, situated on the left and vibrating laterally in a free field. The goal is to study what’s happening to the air pockets situated on the right of that membrane. Here is a 2D projection along with a grid of cubic air pockets:

Air is under ambient (atmospheric) pressure, which in practice means that our little air pockets want to expand outwards; in other words, each air pocket exerts a force on its neighbors — that force is the very definition of what pressure is. It is represented by black arrows on the above plot, for one example air pocket.

However, since all our air pockets apply the same force on each other, everything cancels out, and nothing happens — the gas is at rest. This is still true in the presence of the membrane, if we assume that it is locked into position. [note] If we assume that there is air on the left of the membrane, then there would be pressure on the membrane from the left side too, and it would have no reason to move anyway. This is not pictured above as I want to focus on what’s happening on the right side of the membrane. [end note] In this initial state, pressure is the same everywhere; therefore the sound (relative) pressure is zero everywhere, and everything is silent. In the remainder of this post I will mostly ignore ambient pressure to focus on sound pressure.

What happens if we start moving the membrane?

Compression

As already decided, we’re going to make our membrane vibrate at a frequency of 1 kHz. We also need to decide on an amplitude of vibration: let’s say 15 micrometers (peak). This is called the displacement of the membrane. [note] This might seem very small, but as we’ll see below, even a minuscule displacement can sound quite loud. This is the reason why a loudspeaker membrane doesn’t seem to be vibrating to the naked eye, except perhaps in low-frequency subwoofers, which, for equal sound pressure, require higher displacement because the wavelength is longer. [end note] For the purposes of this post, we’re going to assume that the motion of the membrane is irresistible, i.e. it will move at its own pace whether the surrounding air likes it or not. [note] I’m not going to discuss this assumption here, because the behavior of sound sources is a story for a different post. I’ll just mention in passing that this is consistent with how a typical loudspeaker driver would behave. [end note]

Like a bulldozer, the membrane will displace any air particle that stands in its way. These particles will accumulate in front of the membrane; as a result, the air pockets in front of the membrane (column A) become denser. But remember that we’ve defined our air pockets as having constant mass; if they become denser, that necessarily means that their volume will decrease. In other words, we have compression. Behold: we are now entering the world of thermodynamics.

This is what things look like when the sine wave is at 5% of its cycle. About 50 microseconds have elapsed, and the membrane has moved by about 5 micrometers. [note] Yes, the membrane has traveled a third of the way in 5% of the time. That’s because a sine wave starts fast, but slows down as it approaches its peak. [end note] Notice the compression of column A:
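The 5 µm figure, and the "third of the way" remark, can be checked directly, assuming the membrane follows x(t) = X sin(2πft) starting from its rest position, with X = 15 µm and f = 1 kHz as chosen above.

```python
import math

FREQUENCY = 1000.0         # Hz
PEAK_DISPLACEMENT = 15e-6  # m (15 µm peak)

def membrane_displacement(t):
    """Membrane displacement at time t (s), assuming a sine that starts at zero."""
    return PEAK_DISPLACEMENT * math.sin(2 * math.pi * FREQUENCY * t)

x = membrane_displacement(50e-6)  # 5% of a 1 ms cycle
print(f"{x * 1e6:.2f} um, {x / PEAK_DISPLACEMENT:.0%} of peak")
```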

Note that, in order to make the compression visible, these plots are not to scale: the displacement amplitude is magnified by a factor of 1,000 compared to reality.

Remember that a gas is just a collection of fast-moving particles that are constantly colliding with one another and with its boundaries. If one of our constant-mass gas pockets shrinks in volume, then the particles it contains, being confined into a smaller space, will collide with each other and with the surroundings more often and with more energy. This means that the temperature and pressure of the air pocket will increase.

These changes in volume come and go at such a rapid pace that the air pockets don’t have any time to exchange heat with each other; this means we are dealing with an adiabatic process. [ref] Beranek, Leo L., Acoustic measurements, §2.2.J (page 49). Beranek, Leo L., Acoustics: Sound Fields and Transducers, §2.2.2 (page 24). Also see “Why are sound waves adiabatic?” for a more detailed treatment of the question. [end ref] This makes things easier, because it means the pressure can be deduced directly from the volume. Indeed, in such a process, pressure is inversely proportional to volume to the power of the adiabatic index, which in the case of air is about 1.4.

So let’s apply this reasoning to the situation shown on the plot above. Remember that our pockets are normally ~17 mm in length, but the ones in column A have been compressed by ~5 µm lengthwise, a ~0.03% reduction in volume. This is such a tiny change that the power law can be approximated as a linear relationship: the relative increase in pressure is simply the relative decrease in volume multiplied by the adiabatic index (1.4 × 0.03% ≈ 0.04%). In fact, this approximation is precisely what allows us to assume air is linear so long as sound pressure is not increased to eardrum-splitting levels. [ref] Beranek, Leo L., Acoustic measurements, §2.8 (page 80). Thuras, “Extraneous Frequencies Generated in Air Carrying Intense Sound Waves”, The Journal of the Acoustical Society of America 1935 6:3, 173–180. [end ref]

And with that, finally, we can quantify the average sound pressure in column A at this instant: it’s about 0.04% of ambient pressure, or ~40 Pa (~125 dB SPL). This value shows up as red on the plot.
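A short sketch of this calculation, assuming standard atmospheric pressure (101325 Pa), an adiabatic index of 1.4, a 17 mm pocket compressed by 5 µm, and the usual 20 µPa reference for dB SPL. It computes both the exact adiabatic result (P·V^γ constant) and the linearized one, to show how little they differ at this scale. Function names are illustrative.

```python
import math

AMBIENT_PRESSURE = 101325.0  # Pa, assumed standard atmosphere
GAMMA = 1.4                  # adiabatic index of air
P_REF = 20e-6                # Pa, reference pressure for dB SPL

def sound_pressure_exact(volume_ratio):
    """Adiabatic law: P * V**GAMMA is constant, so P = P0 * (V / V0)**-GAMMA."""
    return AMBIENT_PRESSURE * (volume_ratio ** -GAMMA - 1)

def sound_pressure_linear(volume_ratio):
    """Small-change approximation: dP / P0 is about -GAMMA * dV / V0."""
    return AMBIENT_PRESSURE * GAMMA * (1 - volume_ratio)

def spl_db(pressure_pa):
    return 20 * math.log10(abs(pressure_pa) / P_REF)

pocket_length = 0.017   # m
compression = 5e-6      # m
ratio = 1 - compression / pocket_length  # about a 0.03% volume reduction

p_exact = sound_pressure_exact(ratio)
p_lin = sound_pressure_linear(ratio)
print(f"exact: {p_exact:.2f} Pa, linear: {p_lin:.2f} Pa, {spl_db(p_lin):.1f} dB SPL")
```

Both values come out at roughly 42 Pa (about 126 dB SPL), agreeing to within a few hundredths of a percent and consistent with the rounded figures above.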

So far so good. What happens next?

Propagation

Let’s go back once again to the definition of pressure: it’s the force that air pockets exert on their neighbors. We’ve just concluded that pressure inside the air pockets in column A has increased, which means they are pushing outwards with more force compared to the air at ambient pressure.

This additional force has no effect on the membrane, since we’ve decided that its motion is irresistible. It doesn’t have any effect along column A either, because all the air pockets in this column are in the same boat, so the forces they exert on each other cancel out.

The same cannot be said of column B, however. That column is feeling less force from column C on the right, which is still at ambient pressure, than from pressurized column A on the left. This force doesn’t cancel out and is acting to push column B towards the right. According to Newton’s second law, this will result in the air in B accelerating to the right — the amount of acceleration is the total force divided by the mass (density) of air.
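To get a feel for the magnitude involved, here is Newton's second law applied to a single pocket of column B at the instant described above. It assumes a unit cross-sectional area, a 17 mm pocket, an air density of 1.204 kg/m³, and the ~40 Pa excess pressure from column A, with column C still at ambient pressure. This is only the initial acceleration, before B's own compression starts pushing back. The helper name is hypothetical.

```python
AIR_DENSITY = 1.204  # kg/m^3, assumed value for air at about 20 °C

def pocket_acceleration(pressure_difference_pa, pocket_length_m):
    """Newton's second law for a pocket of unit cross-sectional area:
    net force = pressure difference * area, mass = density * area * length."""
    return pressure_difference_pa / (AIR_DENSITY * pocket_length_m)

a = pocket_acceleration(40.0, 0.017)
print(f"{a:.0f} m/s^2")
```

The result is close to 2,000 m/s², roughly 200 times the acceleration of gravity; yet because it only acts for tiny fractions of a millisecond, the resulting displacements stay in the micrometer range.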

One interesting observation we can make at this point is that air particles in B are not being displaced in lockstep with the particles in A: the compression in A determines the acceleration of the air in B, not its compression directly. This hints at a time delay between events in A and events in B, which in the end just means the speed of sound is finite.

Since we know the force (pressure) from A, and we know the mass of an air pocket, we could be tempted to use Newton’s formula to determine the average acceleration of air particles in column B. From there, we should be able to determine particle velocity, and thus displacement, in B, after a certain amount of time has passed. We could then deduce the change in volume, and from there the change in pressure using the same reasoning as above, thus completing the cycle.

Now, it would be great if it was that simple, but it’s not. The problem is that while that “certain amount of time” passes, the parameters of our problem are continuously evolving, throwing our neat calculation into disarray. The membrane is not standing still during this time and keeps moving to the right, increasing pressure in A further, which changes the force that A exerts on B. Furthermore, we need to keep in mind what happens in B while it’s being compressed: we’ve determined earlier that a decrease in volume results in a proportional increase in pressure. The additional pressure in B pushes back against A, reducing the total force that B is subjected to. [note] In that sense the air in B behaves like a spring: indeed, pushback force (here, additional pressure in B) is proportional to distance from equilibrium (here, change in volume), which means we can directly apply Hooke’s law. This is why sound propagation is often explained by using the analogy of weights interconnected by springs. [end note] So, to recap, we have:

The mounting pressure in A, which exerts an increasing force on B, compelling it to accelerate;
The mass (density) of air in B, which constitutes inertia, reducing the acceleration;
The mounting pressure in B, pushing back against the force from A, gradually reducing the acceleration as B gets compressed.

Numerically, the only direct way to figure out what happens at this point is to advance our “simulation” by very small time increments, recalculating the various forces, velocities and displacements at every step of the way. This would of course be quite tedious, which is why mathematical tools were invented precisely to solve this kind of problem: calculus, and in particular differential equations. If we combine Newton’s second law with the law governing adiabatic processes, we end up with the acoustic wave equation. Solving that equation at t = 50 µs results in the above plot, and if we advance time by a further 50 µs, we get the following:
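Here is a minimal sketch of that step-by-step approach, using a standard 1D staggered-grid finite-difference scheme (chosen here for illustration; it is not necessarily how the plots were produced). It assumes an air density of 1.204 kg/m³, a speed of sound of 343 m/s, 1 mm cells, and the 1 kHz, 15 µm membrane from above. The velocity update is Newton's second law and the pressure update is the linearized adiabatic law; together they form a discretized acoustic wave equation. The names `simulate` and `membrane_velocity` are hypothetical.

```python
import math

# Assumed physical constants for air at room temperature
RHO = 1.204          # density, kg/m^3
C = 343.0            # speed of sound, m/s
K = RHO * C ** 2     # bulk modulus (gamma * P0), Pa

FREQ = 1000.0        # membrane frequency, Hz
X_PEAK = 15e-6       # peak membrane displacement, m

def membrane_velocity(t):
    """Time derivative of the displacement X_PEAK * sin(2*pi*f*t)."""
    w = 2 * math.pi * FREQ
    return X_PEAK * w * math.cos(w * t)

def simulate(t_end, dx=1e-3, n_cells=400):
    """Staggered-grid finite differences for the linear acoustic equations:
         rho * du/dt = -dp/dx   (Newton's second law)
         dp/dt = -K * du/dx     (linearized adiabatic compression)
    p[i] is the sound pressure in pocket i; u[i] is the air velocity on its
    left face. u[0] is driven by the membrane; u[n_cells] is a rigid wall,
    far enough away that the wave never reaches it here."""
    dt = dx / C                       # Courant number 1: no numerical dispersion in 1D
    p = [0.0] * n_cells
    u = [0.0] * (n_cells + 1)
    steps = round(t_end / dt)
    for n in range(steps):
        u[0] = membrane_velocity((n + 0.5) * dt)  # velocities live at half steps
        for i in range(1, n_cells):
            u[i] -= dt / (RHO * dx) * (p[i] - p[i - 1])
        for i in range(n_cells):
            p[i] -= K * dt / dx * (u[i + 1] - u[i])
    return p, steps * dt

p, t = simulate(250e-6)
print(f"t = {t * 1e6:.0f} us, pressure at membrane: {p[0]:.2f} Pa, peak: {max(p):.2f} Pa")
```

With the time step set to exactly one cell per sound-travel time, this 1D scheme shifts the waveform one cell to the right per step, so the pressure profile is a delayed copy of the membrane's motion. Run to about 250 µs, the front has moved about 86 mm while the pressure right at the membrane is close to zero.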

The acoustic wave equation is the very essence of the phenomenon of sound propagation. It provides the fundamental basis for many properties of sound waves, including the fact that perturbations are allowed to travel through space undisturbed — this is illustrated by the 5 µm displacement of the membrane from 50 µs ago simply moving to the right with no other changes. [note] Of course, in the real world there would be attenuation, as we’ve seen in a previous post, but that’s still a linear change that doesn’t affect the shape of the wave. [end note] The sum of two solutions that satisfy the equation — i.e. two waves — is itself a wave: that’s the foundation for the superposition principle and acoustical interference phenomena.

The equation also provides a way to quantify the speed of propagation. Remember that the adiabatic index — or, more generally, the bulk modulus, or compressibility — determines how much pressure (force) results from a given change in volume, and that force increases acceleration of air particles; meanwhile, their mass creates inertia, which decreases it. Therefore, it should come as no surprise that the formula for the speed of sound, deduced from the wave equation, is a simple combination of these two parameters.
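The formula in question is c = √(K/ρ), where K is the bulk modulus (for an adiabatic ideal gas, K = γP₀) and ρ is the density. A quick check, assuming standard atmospheric pressure and an air density of 1.204 kg/m³ (air at about 20 °C):

```python
import math

def speed_of_sound(bulk_modulus, density):
    """c = sqrt(K / rho): stiffness speeds the wave up, inertia slows it down."""
    return math.sqrt(bulk_modulus / density)

GAMMA = 1.4
P0 = 101325.0   # Pa, assumed standard atmosphere
RHO = 1.204     # kg/m^3, assumed (air at about 20 °C)
c = speed_of_sound(GAMMA * P0, RHO)
print(f"{c:.0f} m/s")
```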

Towards steady state

Let’s fast forward to peak displacement of the membrane, at 25% of the cycle, t = 250 µs:

One thing to note here is that sound pressure in front of the membrane has decreased to zero. This is because the membrane is barely moving at this point, as it has reached the crest and is about to rush back towards its initial position. As a result, the air in front of the membrane had time to expand back to ambient pressure. This leads to the observation that when displacement is at peak, pressure is near zero, and vice-versa. An equivalent way to state this, since we’re dealing with a sine wave, is that displacement lags 90° behind pressure.
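One way to see the 90° relationship numerically is the standard plane-wave relation p = ρcu (sound pressure equals density times speed of sound times particle velocity). This post doesn't derive it, but it follows from the wave equation for a wave traveling in a single direction. Right at the membrane, u is the membrane's velocity, i.e. the derivative of its sinusoidal displacement. Constants are assumed values for air at room temperature; function names are illustrative.

```python
import math

RHO, C = 1.204, 343.0          # assumed air density (kg/m^3) and speed of sound (m/s)
FREQ, X_PEAK = 1000.0, 15e-6   # membrane frequency (Hz) and peak displacement (m)
W = 2 * math.pi * FREQ

def displacement(t):
    return X_PEAK * math.sin(W * t)

def pressure_at_membrane(t):
    # Plane-wave relation p = rho * c * u, with u the membrane velocity
    return RHO * C * X_PEAK * W * math.cos(W * t)

for t_us in (0, 250, 500, 750, 1000):
    t = t_us * 1e-6
    print(f"{t_us:5d} us  displacement {displacement(t) * 1e6:+6.2f} um  "
          f"pressure {pressure_at_membrane(t):+6.2f} Pa")
```

The pressure peaks when the displacement crosses zero, and is zero when the displacement peaks at 250 µs.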

Let’s fast forward further, to 50% of the cycle, t = 500 µs:

The membrane went back to its original position. By doing so, it briefly leaves a vacuum in its wake, leading to an imbalance of forces, since the membrane is not pushing back against air pressure anymore. As a result, the air to the right of the membrane expands further to fill the gap: we have rarefaction, the opposite of compression.

And finally, after a full cycle, t = 1,000 µs:

And we’re back to where we started. At this point in time, if the membrane stops moving, the leftmost columns will again expand, restoring normal ambient pressure and pushing the high pressure region to the right. If the membrane continues pushing right instead, a new sine wave cycle begins.

Acoustical interference

etienne@edechamps.fr (Etienne Dechamps) — Sun, 25 Mar 2018 20:35:00 +0100

Previously, I described how audio signals can sum together and produce various results depending on frequency, amplitude and phase. We’ve seen that these rules apply in any realm — digital, analog, or acoustic — as long as the medium is linear.

This is pretty much the end of the story for the digital and analog realms, where the signal can simply be represented as a variation of amplitude over time, i.e. a single waveform. The acoustic realm, on the other hand, doesn’t just have a time dimension, it has spatial dimensions as well. So far, when discussing wave summation and interference we’ve been ignoring these spatial dimensions, instead focusing on sound pressure at a single point in space only. Just like the previous post focused on how distance in space can affect the amplitude of a signal, it seems appropriate to discuss the concepts of summation and interference in a spatial context.

In this post we’ll look at the interaction between two spatially separated sound sources that, for the sake of simplicity, are both radiating the same 250 Hz sine wave. We’ll use a 2D projection because 3D space doesn’t matter too much for this post and it makes things easier to visualize. Furthermore, just like in the previous post, we’re going to keep things clear and focused by assuming free field conditions, i.e. sound propagates unimpeded without meeting any obstacles such as walls.

Two sources, one listener

Let’s set the stage as, say, a 10-meter square area with some arbitrary reference listening position equidistant from both sources. Something like this:

As one might expect, the sound from sources A and B is going to sum at the listening position. To understand what’s going to happen there, we can refer back to the summation rules we’ve discussed in a previous post. Since the two initial waves are sine waves, we know the result is going to be a sine wave too, and the resulting amplitude and phase will depend on the amplitude and phase of the two waves.

So what are the amplitude and phase of the two constituent waves? Well, for starters, they depend on the characteristics of the sources. For the sake of simplicity we could assume the two sources are identical; for example they could be two closely-matched units of the same loudspeaker model. In fact, we’re going to go one step further and assume that the sources are not only identical but are also monopoles, i.e. they radiate sound in the same way in all directions. This way we don’t have to worry about changes in amplitude and phase depending on the angle at which the listener is facing the source. In a real-world situation we wouldn’t be able to get away with these assumptions, but they don’t really affect the basic principles described in this post; they just make it easier to calculate the results.

Now that the source characteristics are out of the way, we have to consider what happens between the sources and the listener. Here the situation is simple — since the two waves are propagating through the same medium and the listener is sitting at equal distance from both sources, there is no reason to believe their relative amplitude and phase would be affected.

From there we can deduce that both waves will arrive at the listener with the same amplitude and phase, and we can apply the appropriate rule to conclude that the two waves are going to interfere constructively at the listening position.

So far so good. What if we move the listener to the side?

In that case the listener is not sitting at equal distance from A and B anymore. This has two immediate consequences. First, due to the inverse square law, the wave from B will have higher amplitude than the wave from A at the listening position. However, because the distances are still somewhat similar in this example, the difference is benign — a mere ~1 dB. For this reason, we’re going to ignore this effect for now.

More importantly, we should point out that the speed of sound is not infinite, which means sound from B will arrive earlier than sound from A at the listening position. In the conditions of a typical room, sound travels through air at around 343 m/s. In our example, sound from A has to travel an additional 0.7 m, which means it will arrive around 2 milliseconds late.

Now, such a small delay might seem inconsequential at first, but consider this: 2 ms is exactly half the period of the 250 Hz sine wave we’re using in our example. Which means that when the wave from A arrives at the listening position, its phase is exactly opposite (180°) relative to the wave from B. If we look at the summation rules for this scenario, we deduce that destructive interference will occur, and since the amplitudes of the two waves are (almost) the same, we can conclude that there is no sound at the listening position!
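The arithmetic above can be checked with a small phasor sketch: each sine wave is treated as a complex number whose angle is its phase, and the level of the sum is expressed relative to one wave alone. The speed of sound (343 m/s) and the helper names are assumptions for illustration.

```python
import cmath
import math

C = 343.0  # m/s, assumed speed of sound

def arrival_delay(path_difference_m):
    return path_difference_m / C

def summed_level_db(phase_difference_rad, amplitude_ratio=1.0):
    """Level of the sum of two equal-frequency sine waves relative to the
    first wave alone, treating each wave as a phasor (complex amplitude)."""
    total = 1 + amplitude_ratio * cmath.exp(1j * phase_difference_rad)
    # Floor the magnitude to avoid log10(0) in case of exact cancellation
    return 20 * math.log10(max(abs(total), 1e-12))

f = 250.0
delay = arrival_delay(0.7)
phase = 2 * math.pi * f * delay
print(f"delay {delay * 1e3:.2f} ms, phase difference {math.degrees(phase):.0f} deg")
print(f"in phase: {summed_level_db(0.0):+.1f} dB")
print(f"this position: {summed_level_db(phase):+.1f} dB")
print(f"with 1 dB mismatch: {summed_level_db(phase, 10 ** (-1 / 20)):+.1f} dB")
```

Exactly in phase, the sum is +6 dB. With the 0.7 m path difference, the phase difference is about 184°, close enough to 180° to produce a deep dip; modeling the ~1 dB inverse-square mismatch through `amplitude_ratio` still yields a deep dip rather than total silence.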

Interference patterns

One observation that we can make at this point is that the nature of the interference depends on the difference in propagation delays between the two waves (i.e. the path difference). If the delay difference is negligible compared to the period, or if it’s close to an integer multiple of the period, we get constructive interference; at the other extreme, if it falls halfway between two such multiples, we get destructive interference. We’ve also seen that, in space, due to the speed of sound, delay and distance are intricately linked. Thus our two example sources interact to create interference patterns where sound amplitude and phase vary wildly from one point in space to the next. In other words, we have a complex sound field that shows patterns of constructive and destructive interference:

The above plot assumes each of the two sources individually produces 0 dB at every point in space and ignores the inverse square law. If we instead assume that each source can produce 0 dB at a distance of 1 m and apply the inverse square law, we get a more realistic result:
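A steady-state sound field like the ones plotted can be approximated by applying the same phasor idea at every point: each monopole contributes exp(-jkr), optionally scaled by 1/r so that it produces 0 dB at 1 m. This is a sketch; the source positions below are hypothetical (the exact layout used for the plots isn't stated), as are the function names, and the speed of sound is assumed to be 343 m/s.

```python
import cmath
import math

C = 343.0  # m/s, assumed speed of sound

def level_db(point, sources, freq, inverse_square=True):
    """Steady-state level at `point` from identical in-phase monopoles at `sources`.
    Each source contributes a phasor exp(-j*k*r). With inverse_square=True its
    amplitude is 1/r (0 dB at 1 m); otherwise each source is 0 dB everywhere."""
    k = 2 * math.pi * freq / C   # wavenumber, rad/m
    total = 0j
    for sx, sy in sources:
        r = math.hypot(point[0] - sx, point[1] - sy)
        amplitude = 1 / r if inverse_square else 1.0
        total += amplitude * cmath.exp(-1j * k * r)
    return 20 * math.log10(max(abs(total), 1e-12))

# Hypothetical layout: two sources 2 m apart inside a 10 m x 10 m area
sources = [(4.0, 0.0), (6.0, 0.0)]
for x in range(0, 11, 2):
    row = [level_db((x, y), sources, 250.0, inverse_square=False) for y in range(1, 11, 3)]
    print(" ".join(f"{v:+6.1f}" for v in row))
```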

We can also look at this phenomenon from the perspective of wave propagation by examining the evolution of local sound pressure over time. Paul Falstad’s excellent simulator can be used to do just that.

Transient sound field

In the example above, we’ve determined that interference occurs because the sound from one of the sources arrives 2 milliseconds later than the sound from the other source. I then showed the resulting sound field after the sound from both sources has traversed the whole area, and the sound field has stabilized: this is called the steady state. This raises the question: what about the transient state, i.e. the window of time just before the sound from the second source arrives at the listener?

In the post where we discussed wave summation, I carefully avoided that question by assuming that the waves were continuous; that is, they have no beginning and no end. Let’s depart from that assumption for a moment and assume that our sound wave has a well-defined start time: our initial condition is that the sound field is completely silent, and our two ideal 250 Hz sine wave sources are switched on simultaneously at t = 0. To make things more obvious, I also rearranged the sources to increase the path difference:

Initially, the listener is not hearing anything because the sound from the first source has not reached it yet. At around 8 ms, the sound from the first source has arrived and the listener starts hearing a wave of about -9 dB amplitude. At around 18 ms, the sound from the second source arrives, interferes with the sound from the first source, and the sound level at the listening position drops to about -14 dB. Shown differently, this is the sound that reaches the listener:
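The transient can be reproduced with a small time-domain sketch. The geometry of the rearranged layout isn't given, so this assumes arrival times of exactly 8 ms and 18 ms (the approximate values quoted above), a speed of sound of 343 m/s, and sources that each produce amplitude 1 (0 dB) at 1 m with 1/r decay. The 10 ms gap is 2.5 periods at 250 Hz, so once both waves are present they are in opposite phase. Function names are hypothetical.

```python
import math

C = 343.0      # m/s, assumed speed of sound
FREQ = 250.0   # Hz

def switched_on_sine(t, delay, amplitude):
    """Sine wave that starts at t = delay and is silent before that."""
    if t < delay:
        return 0.0
    return amplitude * math.sin(2 * math.pi * FREQ * (t - delay))

# Assumed arrival times; amplitudes follow 1/r with r = C * delay (0 dB at 1 m)
delays = (0.008, 0.018)
amplitudes = tuple(1 / (C * d) for d in delays)

def received(t):
    return sum(switched_on_sine(t, d, a) for d, a in zip(delays, amplitudes))

def peak_level_db(t_start, t_end, step=1e-6):
    """Peak level (dB) of the received signal, sampled every `step` seconds."""
    n = int(round((t_end - t_start) / step))
    peak = max(abs(received(t_start + i * step)) for i in range(n))
    return 20 * math.log10(peak) if peak > 0 else float("-inf")

print(f"0-8 ms:   {peak_level_db(0.0, 0.008):.1f} dB")
print(f"8-18 ms:  {peak_level_db(0.008, 0.018):.1f} dB")
print(f"18-40 ms: {peak_level_db(0.018, 0.040):.1f} dB")
```

Under these assumptions the three windows come out as silence, roughly -9 dB, and roughly -14 dB, in line with the description above.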

If our listener was a microphone, the above shows the waveform that it would capture. But in the real world we don’t look at sounds — we hear them. This immediately raises a number of very important questions. What if the listener is not a microphone but a human being? What would they hear in this scenario? Would they notice the transient state at all? Would they hear a difference if we only used a single source of similar steady-state amplitude?

These questions are not about physics anymore — they’re about perception. In other words, we are stepping into the field of psychoacoustics. The answers to these questions depend on many factors, such as frequency, transient state duration, and relative amplitude. They also depend on the shape of the signal: if, in our example, we used a very short, impulse-like signal instead of a sine wave, then the sound from the first source would have ended before the sound from the second source reaches the listener. In such a scenario there is no interference at all, the listener hears the same signal twice, and there is no steady state to speak of! As if that wasn’t enough, perception is also affected by the angle at which sound arrives at the listener. [ref] Toole, Floyd E., Sound Reproduction: Loudspeakers and Rooms, chapter 7. [end ref] Needless to say, this is a vast topic to which we will come back in more detail in future posts.

Caution: When making your own acoustical measurements using a microphone and an analyzer, you should always keep in mind that, by default, most of the results that are displayed (in particular the frequency response) are only valid for the steady state. For this reason you should exercise extreme caution when interpreting these measurements and drawing conclusions from them with regard to how the system will sound to a human listener. This can be quite misleading, especially at high frequencies, and especially if the system under test has long transient states (such as a room, where sound travels distances measured in meters). More details in later posts.

The wavelength

I mentioned above that, in the acoustic realm, delay (or duration) and distance are two sides of the same coin: distance differences cause delay differences which, relative to the period of the wave, cause interference patterns. We can simplify this story by sidestepping the concept of delay entirely; instead, we could choose to only deal with distances.

In order to do that, we need to convert the wave period, which is a duration, into a distance. Using the speed of sound, we can compute the distance that sound travels in the duration of one period; this is called the spatial period, or more commonly, the wavelength. The formula is very straightforward: the wavelength is the period multiplied by the speed of sound (or the speed of sound divided by the frequency). As usual, calculators are available.

Using the wavelength, we can rephrase the interference phenomena described in the previous section purely in terms of distances. For example, we can say that if the distance difference between the listener and the two sources is close to an integer multiple of the wavelength, then constructive interference will occur. Furthermore, the wavelength of a 250 Hz wave is ~1.4 m, which explains why, in our initial example, destructive interference occurs when the distance difference approaches half that wavelength (0.7 m).
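The wavelength formula is a one-liner; this sketch assumes a speed of sound of 343 m/s and evaluates it at the frequencies mentioned in these posts.

```python
C = 343.0  # m/s, assumed speed of sound

def wavelength(freq_hz):
    """Spatial period: the distance sound travels in one period (C * T = C / f)."""
    return C / freq_hz

for f in (20, 250, 1000, 20000):
    print(f"{f:>6} Hz: {wavelength(f):8.3f} m")
```

At 250 Hz this gives about 1.37 m, half of which (about 0.69 m) is the path difference for destructive interference in the initial example; the 20 Hz and 20 kHz values bound the audible range.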

Distance between sources

We’ve observed above that sound from two spatially separated sources can combine to form an alternating constructive and destructive interference pattern. This is true for the specific example that I’ve chosen, but things do not necessarily turn out that way in the general case. Indeed the sound field might adopt a different shape if any of our initial assumptions (i.e. identical, monopole sources) are broken.

More interestingly, the distance between our two sources also matters. Remember that this particular interference pattern is inherently caused by the path difference between the sources and the listener — if the path difference approaches half a wavelength, destructive interference occurs. Consider this: basic geometry tells us that the path difference cannot be more than the distance between the sources themselves. [note] This maximum is reached if the listener sits on the line that goes through both sources, such that one source lies behind the other from the perspective of the listener. [end note] If that distance is less than a quarter of a wavelength, then destructive interference simply cannot occur anywhere in space, and we are left with only constructive interference:

From this we can conclude that interference patterns tend to disappear if the sources are moved closer to each other. [note] This is why two subwoofers (low frequency, long wavelength) will act just like a single subwoofer with twice the output if they are positioned right next to each other. Things are not that simple at higher frequencies, because the wavelengths are shorter. [end note] Since wavelength is inversely proportional to frequency, lowering the frequency has the same effect.
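To put numbers on this, here is a sketch (hypothetical `min_sum_db` helper, assumed speed of sound 343 m/s) that returns the lowest level two identical in-phase monopoles can produce anywhere in space, relative to one source alone. Since the path difference can take any value between zero and the source spacing, the worst case is either a full null (when a half-wavelength path difference is reachable) or the sum at the maximum path difference. Inverse-square amplitude differences are ignored.

```python
import math

C = 343.0  # m/s, assumed speed of sound

def min_sum_db(spacing_m, freq_hz):
    """Lowest level, relative to a single source, of two identical in-phase
    monopoles anywhere in space, ignoring inverse-square amplitude differences.
    Two equal phasors separated by phase phi sum to 2 * |cos(phi / 2)|."""
    max_phase = 2 * math.pi * freq_hz * spacing_m / C
    if max_phase >= math.pi:  # a half-wavelength path difference is reachable: a null exists
        return float("-inf")
    return 20 * math.log10(2 * math.cos(max_phase / 2))

# Hypothetical example: two sources 0.5 m apart
for f in (50.0, 171.5, 2000.0):
    print(f"{f:7.1f} Hz: worst case {min_sum_db(0.5, f):+.1f} dB")
```

At exactly a quarter wavelength of spacing (0.5 m at 171.5 Hz), the sum never drops below +3 dB; the 0.5 m spacing and the example frequencies are arbitrary illustrations.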

Caution: Two sound sources can interact in very different ways depending on the wavelengths (frequencies) involved. This is important because the audible range covers a huge span of wavelengths: from 17 meters (20 Hz) down to 17 millimeters (20 kHz). This in turn means that the shape of the sound field can vary wildly depending on which part of the frequency spectrum we are looking at. For this reason it is imperative to clearly state the frequency range of interest when discussing acoustical phenomena in real-world situations — otherwise the discussion makes no sense.

Keeping our distance

etienne@edechamps.fr (Etienne Dechamps) — Wed, 28 Feb 2018 22:53:00 +0000

This post will focus on a specific aspect of the acoustic realm: how the amplitude of a sound wave is altered as it travels through free space. This is useful to gain a better understanding of how loudspeaker systems behave; naturally, it is not as useful when discussing headphones, as in this case there is not much space for sound to propagate into.

To make things easier to understand, I will use a simple unrealistic scenario as a basic starting point. Then, I will elaborate on this scenario so that we get closer and closer to a realistic model of how sound propagates in the real world. Here are our initial assumptions:

Sound is propagating through a perfect lossless medium. In other words, the medium (in our case, air) opposes no resistance to the passage of sound, and there is no dissipation of sound energy, such as in the form of heat. This is not completely true in practice, and we will revisit this assumption later in this post.
Sound is coming from a single, small point in space, not from multiple points at the same time. Such a source is called a point source. Perfect point sources don’t exist in the real world, but they are a good approximation as long as the listener sits in the far field; that is to say, the distance to the source is large compared to the size of the source itself. [note] If we can’t make that assumption, then an alternative is to decompose a complex source into a number of individual point sources whose sound sums together at the listening position; this approach is known as the Huygens-Fresnel principle. [end note]
Sound does not reflect off nearby surfaces. Think of the sound source as levitating in mid-air, with no floor, no walls, no ceiling, basically nothing for sound to bounce off of. This situation is known as free field or free space conditions. Strictly speaking this is implied by the previous assumption, since after all a reflection is itself a sound source — thus the sound would not be coming from a single point anymore.

This last assumption is of course egregiously wrong when sound propagates in a room, which is the main reason why the scenarios described in this post are somewhat unrealistic. The rules described below should be seen as building blocks to gain a better understanding of how sound propagates through space; they can’t be applied as-is for sound waves propagating through a room where reflections are present. We will tackle that problem in a later post.

Caution: Keep in mind that sound pressure, the metric used to quantify the amplitude of sound in the acoustic realm, refers to pressure variations at a given point in space. It does not refer to some overall amount of sound power that is being radiated away from a source. Local sound pressure depends on where the listener is; in contrast, total sound power is a property of the source only. [note] When a specific sound pressure level is mentioned in the specifications of a sound source (e.g. a loudspeaker), it is always relative to a specific distance from the source (typically 1 meter), at a specific angle (typically on-axis, i.e. 0°), under specific conditions (typically half-space, i.e. with the loudspeaker against a wall). Such a number is valid for this point only. [end note]

One dimension

To start with, let’s consider the case where sound is propagating down a narrow tube. In fact, as an approximation we will assume for the sake of the discussion that the tube is infinitely narrow, such that we don’t have to deal with any reflections off the sides of the tube. (This might seem like a strange place to start, but bear with me for a moment.) We could simulate such a scenario by attaching a tube to a loudspeaker, like so:

A more entertaining analogy is the tin can telephone, where the propagation medium is not air, but a piece of string.

In such a setup, the problem is one-dimensional: there is nowhere the sound can go but down the line. Furthermore, since we’ve assumed earlier that the medium is lossless, there really is no reason for any energy to get lost as sound propagates down the tube. Therefore the sound pressure at points A and B (as well as any other point along the tube) is equal. It’s as simple as that.

Two dimensions

Let’s go one step further and add a second spatial dimension into the mix. One could imagine that our loudspeaker is sandwiched between two parallel planes, where the space between the two planes is infinitesimal:

Another possible analogy is a vibrating piece of paper — a two-dimensional upgrade to the tin can telephone from the previous section.

In this setup things get more complicated, because the loudspeaker is radiating sound in many directions at once. This can be conceptualized as expanding circles called wavefronts. As a wavefront gets farther and farther away from the sound source, its circumference increases. However, energy does not magically appear out of thin air as this happens, so the same amount of sound energy gets spread over a larger circle. In the diagram above, the wavefronts represented by the inner and outer circles carry the same amount of energy in total, but since the outer circle is larger, the sound pressure at a given point on the circle is lower. [note] This reasoning does not require that the source be a monopole, i.e. that it radiates sound equally in all directions. In other words, sound energy is not necessarily evenly distributed along the circle. In general, loudspeakers are not monopole sources; at higher frequencies, they will typically radiate more sound out the front than out the back. Directivity makes no difference as far as this post is concerned, but it is an important concept in other contexts. [end note]

We can easily quantify the rate at which sound pressure decreases as the wavefront gets wider. In the above example, the distance between the source and point A is exactly half the distance to point B. In other words, the radius of the outer circle is double the radius of the inner circle. The circumference of a circle is proportional to its radius, so doubling the radius is equivalent to doubling the circumference. Doubling the circumference means that a given amount of sound power gets spread over twice the area, which in turn means that the sound power per unit area (i.e. the sound intensity) is halved. Sound intensity varies with the square of sound pressure, [note] This is because sound intensity is the product of sound pressure and particle velocity, and particle velocity is proportional to sound pressure. As a result, sound pressure is effectively multiplied by itself. This is analogous to electric power dissipated in a resistor being proportional to the square of voltage. [end note] therefore the ratio of sound pressure between the two circles is the square root of 2, or 3 dB.

This leads us to conclude that, when sound propagates along two spatial dimensions, and keeping in mind the important assumptions we’ve made so far, a doubling of the distance to the sound source results in a 3 dB drop in sound pressure level.

Three dimensions

Finally, let’s enter the real world and its three spatial dimensions. As one would expect, our circles are now spheres:

From this point forward the reasoning is very similar to the two-dimensional case, except this time we’re dealing with surfaces, not lines. The surface area of a sphere is proportional to the square of the radius, so doubling the radius is equivalent to quadrupling the surface area. Quadrupling the surface area means that a given amount of sound power gets spread over an area four times as large, which in turn means the sound intensity is divided by 4. Therefore, the ratio of sound pressure between the two spheres is the square root of 4, which is, well, 2, or 6 dB.

This leads us to conclude that, keeping in mind the important assumptions we’ve made so far, a doubling of the distance to the sound source results in a 6 dB drop in sound pressure level (SPL). This fundamental result, which applies not just to sound waves but to various other physical phenomena, is called the inverse-square law. [note] This is because intensity decreases with the square of the distance. Strictly speaking, since we’re talking about pressure this should be called the inverse-proportional law, but that term is seldom used in practice. [end note] Calculators exist to compute the drop for any arbitrary distance ratio.
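The same reasoning generalizes to one, two or three dimensions and to any distance ratio, which is essentially what such calculators do. Here is a minimal sketch under the idealized assumptions of this post (lossless medium, point source, free field); `spreading_loss_db` is a hypothetical name.

```python
import math


def spreading_loss_db(distance_ratio: float, dimensions: int) -> float:
    """Drop in sound pressure level when the distance to the source is multiplied by distance_ratio.

    Idealized conditions from this post: lossless medium, point source, free field.
    The wavefront grows as r ** (dimensions - 1) (constant in a tube, a circle in 2D,
    a sphere in 3D), so intensity falls by that factor; since intensity varies with
    the square of pressure, the level drop in dB is 10 * log10 of the intensity ratio.
    """
    if dimensions not in (1, 2, 3):
        raise ValueError("dimensions must be 1, 2 or 3")
    intensity_ratio = distance_ratio ** (dimensions - 1)
    return 10 * math.log10(intensity_ratio)


for dims in (1, 2, 3):
    print(f"{dims}D, doubling the distance: {spreading_loss_db(2, dims):.2f} dB drop")
```

For example, going from 1 m to 10 m in three dimensions, `spreading_loss_db(10, 3)`, gives a 20 dB drop.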

Air attenuation

At the beginning of this post, we’ve made the assumption that sound is propagating through a lossless medium. But remember that sound is a pressure wave; by definition, its propagation involves the movement of particles that comprise the medium. In the case of air, that’s mostly oxygen and nitrogen. Because such movement goes against the natural equilibrium of the medium, it requires an expenditure of energy, which is dissipated as heat. As a result, the total energy carried by a sound wave tends to decrease as it travels through the air, in addition to the loss of sound pressure caused by the inverse-square law described in the previous section. [ref] Engineering Acoustics, “Attenuation of sound waves”; Smith, Julius O., Physical Audio Signal Processing, “Air Absorption”. [end ref] [note] Contrary to what some may believe, this doesn’t seem to have anything to do with “loops of energy” that somehow “supply the body with vital life force”, however inspiring that might be. [end note]

The physical phenomena at play here are non-trivial. The precise amount of sound absorption per unit of distance traveled in air depends on ambient pressure, temperature and humidity, requiring the use of a calculator. Most importantly, it increases with frequency, which means, somewhat shockingly, that air itself exhibits frequency response distortion. Here’s what things look like for various atmospheric conditions: [ref] Calculated according to ISO 9613–1:1993, Acoustics — Attenuation of sound during propagation outdoors — Part 1: Calculation of the absorption of sound by the atmosphere. [end ref]

Immediately, we can observe that below a few kHz, there is not much attenuation to speak of; it is practically negligible at low and medium frequencies. This should dispel the common misconception that attenuation of sound with distance is due to absorption by the air — in reality, that phenomenon is dwarfed by the inverse-square law that we’ve discussed above.

On the other hand, absorption cannot be neglected for high frequencies. In conditions typical of a living room (20 °C, 50% relative humidity), a 10 kHz sound wave will be down ~1.5 dB after 10 meters in addition to the inverse square law and other phenomena. It’s small, but it’s not zero, and it tends to go downhill fast over the last octave of the audible frequency range.
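For the curious, here is a sketch of the pure-tone absorption formula from ISO 9613-1, transcribed into Python for illustration. Treat it as an unverified transcription and check the constants against the standard before relying on the numbers. It covers atmospheric absorption only (no spreading loss, no reflections), and `air_absorption_db_per_m` is a hypothetical name.

```python
import math

# Reference conditions used by the ISO 9613-1 formula (as transcribed for this sketch).
REFERENCE_PRESSURE_KPA = 101.325
REFERENCE_TEMPERATURE_K = 293.15  # 20 °C
TRIPLE_POINT_WATER_K = 273.16


def air_absorption_db_per_m(frequency_hz, temperature_c=20.0,
                            relative_humidity_pct=50.0,
                            pressure_kpa=REFERENCE_PRESSURE_KPA):
    """Pure-tone atmospheric absorption coefficient in dB per meter."""
    t = temperature_c + 273.15
    p_rel = pressure_kpa / REFERENCE_PRESSURE_KPA
    t_rel = t / REFERENCE_TEMPERATURE_K

    # Molar concentration of water vapor (in percent), derived from relative
    # humidity via the saturation vapor pressure.
    exponent = -6.8346 * (TRIPLE_POINT_WATER_K / t) ** 1.261 + 4.6151
    h = relative_humidity_pct * 10 ** exponent / p_rel

    # Relaxation frequencies of oxygen and nitrogen, in Hz.
    fr_o = p_rel * (24 + 4.04e4 * h * (0.02 + h) / (0.391 + h))
    fr_n = p_rel * t_rel ** -0.5 * (
        9 + 280 * h * math.exp(-4.170 * (t_rel ** (-1 / 3) - 1)))

    f2 = frequency_hz ** 2
    classical = 1.84e-11 / p_rel * t_rel ** 0.5  # classical and rotational losses
    relaxation = t_rel ** -2.5 * (
        0.01275 * math.exp(-2239.1 / t) / (fr_o + f2 / fr_o)
        + 0.1068 * math.exp(-3352.0 / t) / (fr_n + f2 / fr_n))
    return 8.686 * f2 * (classical + relaxation)


for f in (100, 1000, 4000, 10000, 20000):
    print(f"{f:>6} Hz: {10 * air_absorption_db_per_m(f):.3f} dB over 10 m")
```

At 20 °C and 50% relative humidity, this sketch gives roughly 1.6 dB over 10 m at 10 kHz, in line with the figure quoted above, and only a few thousandths of a dB at 100 Hz.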

Mixing waves

etienne@edechamps.fr (Etienne Dechamps) — Tue, 30 Jan 2018 22:48:00 +0000

It is very common in real-life systems to see signals being combined in a process known as additive mixing, sometimes referred to simply as mixing, summation, addition or interference. Examples include your computer playing a notification sound on top of your music or reducing the number of channels in an audio stream before playback (downmixing); however, the most complex — and most intriguing — examples come from the acoustic realm, in which sound from multiple sources combines at a given point in space. [note] In real-world home audio systems this happens literally all the time. If you’re not convinced, consider that sound bounces off walls, and the interference that this produces has huge implications for acoustics. But that’s a story for another post. [end note]

For the sake of this discussion, we’ll assume that the system that is combining these signals is, for all intents and purposes, linear. In the digital and analog realms this is a safe assumption to make as long as non-linear distortion is kept in check. In particular the peak amplitude of the resulting signal needs to be low enough to avoid clipping. [note] In fact, you might be compelled to use the calculations described below precisely to ensure that you won’t drive your system into clipping; indeed this is a common concern when mixing signals digitally or electronically. [end note] To a much lesser extent, this also applies to the acoustic realm: air can be assumed to be linear insofar as we’re not dealing with extreme eardrum-busting pressures. [note] The textbook case where that assumption breaks is when large amounts of sound energy are forced down a narrow tube. One example is the throat (i.e. the loudspeaker-facing part) of the horn in a horn loudspeaker, which is subject to enormous dynamic pressures from the compression driver. (See Kleiner, Eargle, Thuras.) [end note]

What’s nice about linear systems is that they follow the superposition principle: summing two signals can be done, quite literally, by taking the values of both waveforms at a given point in time and adding one to the other. It’s as simple as that (just don’t forget about the sign). That being said, it would be nice to have a general idea of what to expect when summing waves of specific shapes. This is especially important in cases where the two signals being mixed might be related to each other, as is often the case. I’ll start with very simple scenarios and make my way to the less trivial cases.

Same frequency, same phase

Let’s say we want to sum two pure sine waves of identical frequency and phase. [note] It is important that the frequencies match exactly, otherwise you will end up with a surprising (and counter-intuitive) pattern called a beat. Fortunately this is rarely a problem in practice because in a typical home audio system all signals use a consistent timing reference (often originating from a single digital clock signal), which means their frequencies are always in precise lockstep. [end note]

The above plot illustrates the fact that when two sines of identical frequency are mixed together, the result is always a sine of that same frequency. [ref] This follows from mathematical properties of sines. [end ref] Furthermore, in this particular example, the peaks of the two sine waves are aligned because they have the same phase. Since they’re aligned, it follows naturally that the peak amplitude of the resulting signal is simply the sum of the peak amplitudes of the two original signals.

We’ve seen previously that the RMS amplitude of a sine wave is simply its peak amplitude divided by a constant, so we can in turn deduce that the RMS amplitude of the resulting signal is the sum of the RMS amplitudes of the original signals. The resulting amplitude is higher than the amplitudes we started with, which is why this is called constructive interference. [note] The general concepts of constructive and destructive interference are not limited to sine waves. They can be used to describe the interaction between any waveforms as long as they are coherent. Two sine waves of identical frequency are always coherent. Note that there seems to be a lot of confusion around the terms “coherent” and “in phase”. According to the Wikipedia definition two coherent waveforms do not necessarily have the same phase, yet the term “coherent sum” is often thrown around to specifically describe perfect constructive interference. [end note]

The sum of two sine waves that have identical frequency and phase is itself a sine wave of that same frequency and phase. The resulting amplitude (peak or RMS) is simply the sum of the amplitudes.

Same frequency, opposite phase

Now let’s take the same scenario as above, but this time one of the two waves is 180° out of phase, i.e. it is inverted.

This time the peaks of one wave are aligned with the valleys of the other wave. This means their amplitudes subtract instead of add, which is why this is called destructive interference or cancellation. In the special case where the two waves have exactly the same amplitude, they cancel each other out perfectly, and the signal vanishes — the result is silence. Active noise control, the technology used in noise-canceling headphones, is an example of this phenomenon.

The sum of two sine waves that have identical frequency and opposite phase is itself a sine wave of that same frequency. The resulting amplitude (peak or RMS) is the difference between the two amplitudes. The resulting phase is the phase of the wave with the larger amplitude.

Same frequency, arbitrary phase

When dealing with two sine waves of identical frequency, but neither identical nor opposite phase, we find ourselves deprived of a simple rule of thumb; there’s no avoiding the math in this case. However, the following plots will give you an idea as to what to expect. They describe the effect of summing two sine waves that differ by a particular amplitude ratio (line color) and phase (horizontal axis). The “same phase” case that we’ve seen in the first section is located in the middle of the horizontal axis, while its sides illustrate the “opposite phase” case that we’ve seen in the previous section.

The sum of two sine waves that have identical frequency is itself a sine wave of that same frequency. The resulting amplitude and phase depend on the amplitude ratio and phase of the two waves.
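The math in question is not very scary if we use a standard trick that this post has not introduced so far: represent each sine wave as a phasor, i.e. a complex number whose magnitude is the amplitude and whose angle is the phase. Summing two sine waves of the same frequency then boils down to adding two complex numbers. Here is a minimal sketch (hypothetical function name; phases in degrees):

```python
import cmath
import math


def sum_sines(amp1, phase1_deg, amp2, phase2_deg):
    """Sum two sine waves of the same frequency; returns (amplitude, phase in degrees).

    Each sine is represented as a phasor, a complex number whose magnitude is the
    amplitude and whose angle is the phase. Adding the phasors gives the phasor of
    the resulting sine. Amplitudes can be peak or RMS, as long as both use the same.
    """
    total = (cmath.rect(amp1, math.radians(phase1_deg))
             + cmath.rect(amp2, math.radians(phase2_deg)))
    return abs(total), math.degrees(cmath.phase(total))


print(sum_sines(5, 0, 3, 0))    # same phase
print(sum_sines(5, 0, 3, 180))  # opposite phase
print(sum_sines(5, 0, 3, 90))   # a quarter cycle apart
```

The first two calls reproduce the rules from the previous sections (amplitudes of 8 and 2 when summing amplitudes of 5 and 3). The third shows an in-between case: √34 ≈ 5.8, at a phase of about 31°.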

Different frequencies

So far we’ve only been dealing with pure sine waves of identical frequency. What about the sum of sine waves of different frequencies?

In this case, as illustrated by the above plot, the result is not a sine wave. We can see that the peak amplitude is the sum of the peak amplitudes of the two waves — here, 8 (5+3). This makes sense, because two sine waves of different frequency will, given enough time, eventually line up with each other such that their peaks are aligned, even briefly. [note] Here I’m assuming that the signal is of infinite duration, or at least sufficiently long that it makes little difference. If the signal is of limited duration, the peaks will not necessarily have the opportunity to line up, and the peak amplitude will therefore be lower. [end note]

RMS amplitude is less trivial. We can’t use the same trick as before because we’re not dealing with a single sine wave anymore — we’re dealing with a different wave shape. Mathematically there are many ways to derive the answer, but perhaps the most straightforward is to use Parseval’s theorem, which basically states that the mean square of a waveform is equal to the sum of the mean squares of the sine waves (of different frequencies) that comprise it. In the above example, the RMS amplitudes of the original sine waves are 5/√2 ≈ 3.5 and 3/√2 ≈ 2.1, so the RMS total is √(12.5 + 4.5) = √17, which is approximately 4.1.

The sum of two sine waves of different frequencies is not a sine wave. The peak amplitude of the resulting wave is the sum of the peak amplitudes of both sine waves. The RMS amplitude of the resulting wave is the RMS total of the amplitudes of both sine waves.
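These rules can be checked numerically by sampling the waveform and measuring it directly. Here is a minimal sketch, assuming a 48 kHz sample rate and one second of signal. The frequencies (1 kHz and 5 kHz) are chosen for illustration so that the peaks of both sines happen to line up exactly on a sample within that second.

```python
import math

SAMPLE_RATE = 48000  # samples per second (assumed)


def sampled_sines(components, duration_s=1.0):
    """Sample a sum of sine waves given as (peak amplitude, frequency in Hz) pairs."""
    n = int(SAMPLE_RATE * duration_s)
    return [sum(amp * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)
                for amp, freq in components)
            for i in range(n)]


def rms(samples):
    return math.sqrt(sum(x * x for x in samples) / len(samples))


wave = sampled_sines([(5, 1000), (3, 5000)])
rms_total = math.sqrt((5 / math.sqrt(2)) ** 2 + (3 / math.sqrt(2)) ** 2)

print(f"measured peak: {max(abs(x) for x in wave):.3f}")
print(f"measured RMS:  {rms(wave):.3f}")
print(f"RMS total:     {rms_total:.3f}")
```

The measured peak is 8, and the measured RMS matches the RMS total, √17 ≈ 4.12, rather than the direct sum of the RMS amplitudes (≈ 5.66).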

Arbitrary signals

Since any signal can always be decomposed into a sum of sine waves using the Fourier transform, the above sections de facto provide a general solution that can be applied to any pair of signals: decompose both signals, add the sine waves at each frequency using the rules from the first three sections above, sum them back using the rule from the fourth section, and you’re done. [note] Granted, that’s not really easier than just adding the two waveforms directly, is it — especially since computing the Fourier transform without the help of a computer is, well, extremely hard. The point was to demonstrate the variety of approaches that can be used to think about these sums. Depending on the situation, some of these techniques may be easier to apply than others. [end note]

Caution: If you know the amplitude spectrum for the two signals that you want to sum, you might be tempted to simply add them together in the frequency domain, by summing the amplitude at each frequency. In most cases that’s incorrect, because at each frequency you are summing sine waves without knowing their respective phase — the actual result might be less than the sum of the amplitudes. The correct result can only be obtained by summing the raw complex output of the Fourier transform, which includes phase information, before extracting the amplitude information.
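Here is a small illustration of that pitfall. It uses a single bin of a naive discrete Fourier transform computed by hand, which is fine for a demonstration; real code would use an FFT library. The two signals are the same sine wave with opposite phase, so the correct result is silence.

```python
import cmath
import math

N = 64    # samples in the analysis window (assumed)
BIN = 4   # the test sine completes exactly 4 cycles in the window


def dft_bin(samples, k):
    """One bin of the discrete Fourier transform: a complex value carrying amplitude and phase."""
    n = len(samples)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))


a = [math.sin(2 * math.pi * BIN * i / N) for i in range(N)]
b = [-x for x in a]  # the same sine wave with opposite phase

naive = abs(dft_bin(a, BIN)) + abs(dft_bin(b, BIN))        # wrong: phase discarded
complex_sum = abs(dft_bin(a, BIN) + dft_bin(b, BIN))       # right: add, then take amplitude
direct = abs(dft_bin([x + y for x, y in zip(a, b)], BIN))  # transform of the mixed waveform

print(f"summing amplitude spectra: {naive:.1f}")
print(f"summing complex spectra:   {complex_sum:.1f}")
print(f"spectrum of the mix:       {direct:.1f}")
```

Adding amplitudes gives 64 (twice the single-sine value of 32). Adding the complex values, or transforming the summed waveform, correctly gives zero.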

That said, there are cases where we can get away with a simpler approach. Consider a scenario in which the two signals are completely and utterly unrelated to each other; for example, it could be two noise signals, or two separate pieces of music, or the voice of a commentator on top of a soundtrack, or a notification sound playing on top of other content. The trick in that case is to see the two signals as two independent random processes, in the mathematical sense of the term. [note] The usual caveats apply, including the simplifying assumptions that the signals are truly unpredictable and extend infinitely in time. This is an approximation, albeit one that works well in practice. [end note]

From that perspective, peak amplitude is intuitively the sum of peak amplitudes, since, given enough time, two peaks will eventually align. As for RMS, the trick is to see that the mean square of a zero-mean signal is just another name for its variance, of which RMS amplitude is the square root (which makes it the standard deviation). The variance of the sum of two independent random variables is the sum of the variances, so we can directly deduce that the RMS amplitude of the sum of two unrelated signals is the root of the sum of the squared RMS amplitudes — i.e. the RMS total. [note] The RMS total is also known as the power sum. This makes sense because power is proportional to the square of the amplitude. [end note] In that sense, summing two unrelated signals is similar to summing two sine waves of different frequencies as described in the previous section.

The peak amplitude of the sum of two unrelated signals is the sum of the peak amplitudes. The RMS amplitude of the sum is the RMS total of the amplitudes.
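A quick simulation illustrates this. It uses two independent Gaussian noise signals as stand-ins for unrelated signals. That is an assumption: real music is not Gaussian noise, but the RMS argument only needs independence and zero mean.

```python
import math
import random

rng = random.Random(1234)  # fixed seed so that runs are repeatable
N = 200_000

# Two unrelated (independent), zero-mean noise signals with RMS amplitudes of about 1 and 0.5.
noise_a = [rng.gauss(0.0, 1.0) for _ in range(N)]
noise_b = [rng.gauss(0.0, 0.5) for _ in range(N)]
mix = [a + b for a, b in zip(noise_a, noise_b)]


def rms(samples):
    return math.sqrt(sum(x * x for x in samples) / len(samples))


print(f"measured RMS of the mix: {rms(mix):.3f}")
print(f"RMS total:               {math.hypot(rms(noise_a), rms(noise_b)):.3f}")
print(f"direct sum of the RMS:   {rms(noise_a) + rms(noise_b):.3f}")
```

The measured RMS lands close to √(1² + 0.5²) ≈ 1.12, not 1.5.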

A word about decibels

At this point, I should warn you about decibels. Due to their logarithmic nature, decibels are great for manipulating ratios and multiplying things, but they tend to get in the way when doing additions. As a gentle reminder, the sum of, say, ×2 (6 dB) and ×5 (14 dB) is ×7, which is 17 dB, not 20 dB.

In fact, it gets even more confusing when you realize that what I just did in this example was directly summing two values, which is only a good idea in some of the cases that I’ve described above: namely, perfect constructive interference, or when dealing with peak amplitudes. It is not the correct approach when dealing with RMS totals. Which is really unfortunate, because decibel values are typically expressed in RMS, so the intuitive direct approach is wrong in many cases.

Fortunately, decibels can also help us in their own way, as they make it easier to express the amplitude of the sum when the ratio of the amplitudes of the original signals is known. For example, ×2 is +6 dB, which can help when summing signals of identical amplitude. What’s much more interesting, though, is that the RMS total of two signals of identical amplitude turns out to be their RMS amplitude multiplied by the square root of 2… which, conveniently, happens to be approximately +3 dB. If we combine that with the knowledge we gained in the earlier sections, we can say, for example, that summing two unrelated signals of equal amplitude results in a peak amplitude of +6 dB and an RMS amplitude of +3 dB above the original signals. Isn’t that nice?

In fact, we can summarize what we’ve just learned as follows:

Sum of two signals of equal amplitude	Peak	RMS
Sine waves of identical frequency and phase	+6 dB	+6 dB
Sine waves of identical frequency and opposite phase	-∞ dB	-∞ dB
Sine waves of different frequencies	+6 dB	+3 dB
Unrelated signals	+6 dB	+3 dB

These results can be extended to arbitrary ratios of amplitudes, and calculators exist to help you do just that, be it for the RMS total case or the direct (peak) case.

I’ll leave you with one last rule of thumb: decibels being a logarithmic unit, the contribution of a given term to the sum falls quickly as the decibels decrease. For example, the direct sum of 0 dB and -20 dB is only 0.8 dB. This can be used to dismiss the contribution of the smaller terms as negligible. [note] It’s also why low-level noise, which normally sits at -80 dB or less relative to the signal, has a negligible effect on the overall amplitude. [end note]
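The figures above can be reproduced with a minimal calculator of the kind mentioned earlier. This is a sketch with hypothetical function names, working on amplitude levels in dB (i.e. 20·log10 of the amplitude ratio).

```python
import math


def db_to_ratio(level_db):
    """Amplitude ratio for a level in dB (20 * log10 convention)."""
    return 10 ** (level_db / 20)


def ratio_to_db(ratio):
    return 20 * math.log10(ratio) if ratio > 0 else -math.inf


def direct_sum_db(*levels_db):
    """Amplitudes add directly: peak amplitudes, or perfectly in-phase sine waves."""
    return ratio_to_db(sum(db_to_ratio(level) for level in levels_db))


def rms_total_db(*levels_db):
    """Power sum: RMS of unrelated signals, or of sine waves of different frequencies."""
    return ratio_to_db(math.sqrt(sum(db_to_ratio(level) ** 2 for level in levels_db)))


print(f"{ratio_to_db(2 + 5):.1f} dB")       # x2 plus x5 is x7, not 6 dB + 14 dB
print(f"{direct_sum_db(0, 0):+.2f} dB")     # two equal signals, direct (peak) sum
print(f"{rms_total_db(0, 0):+.2f} dB")      # two equal signals, RMS total
print(f"{direct_sum_db(0, -20):+.2f} dB")   # a term 20 dB down, direct sum
print(f"{rms_total_db(0, -20):+.2f} dB")    # a term 20 dB down, RMS total
```

The equal-amplitude calls reproduce the +6 dB and +3 dB entries of the table. A term 20 dB down adds about 0.8 dB in the direct case, and only about 0.04 dB in the RMS total case.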

Reasoning about phase

etienne@edechamps.fr (Etienne Dechamps) — Sun, 21 Jan 2018 23:18:15 +0000

So far, when discussing the various frequency components of a signal — i.e. the sine waves that together make up the signal — I’ve mostly focused on the amplitude of these sine waves. In order to keep things simple and short, I’ve been purposefully evading the subject of phase, an often-overlooked property of waveforms. Now it’s time to face the music (pun intended) and fill this gap, as this concept will be useful in future posts.

The above plot shows four different sine waves. Contrary to what I’ve discussed previously on this blog, these waves differ in neither frequency nor amplitude. Instead, their cycles are offset from each other in time. In other words, they have different phase.

Phase indicates what part of the wave cycle is occurring at a given point in time. When not specified, that moment is conventionally defined as the origin of time (i.e. t=0). For example, in the above plot, the phase of the solid blue sine wave is zero, because its cycle starts at the origin. In contrast, at the same instant, the other sine waves are at a quarter, a half, and three quarters of their cycles, respectively.

When a sine wave reaches the end of its cycle, it’s right back where it started and a new cycle begins. In that way, progressing through the cycles of a sine wave is a bit like running around in circles. Hence it is no surprise that the terminology used when reasoning about phase is that of turns and angles. It seems natural in that context to define a full cycle as 360 degrees — or, alternatively, 2π radians — and from there, the other sine waves shown above could be defined as fractions of a full cycle: 90° (½π rad), 180° (π rad) and 270° (³⁄₂π rad). [note] From a mathematical perspective, the analogy goes much farther than this. These concepts are all variations on the same common theme. [end note]

Phase in the frequency domain

We’ve seen previously that the Fourier transform can be used to decompose any signal into a number of constituent sine waves — one per discrete frequency. In addition to the amplitude of each sine wave, the output of the Fourier transform (i.e. the spectrum) also contains their phase. For example, here is the phase information that the Fourier transform produces for each of the four above signals: [note] For the sake of consistency and to avoid confusion, I cheated a bit here — the real components of the Fourier transform are cosines, not sines, so strictly speaking the output should be offset by 90°. This is purely arbitrary and doesn’t matter one bit, though. [end note]

We’ve also seen previously that a linear system can alter the amplitude of the frequency components that flow through it. In the same way, a linear system can also alter their phase (which is not the same thing as delaying them — see below), and the extent of these alterations can be shown on a plot, called the phase response. Here is the phase response of a system that is similar to the example from that previous post:

Phase and delay

The sine waves I’m using as examples have a frequency of 1 kHz, which means that a cycle completes in 1 millisecond. From that perspective, it is tempting to think about phase as a time offset; for example, one might say that a 1 kHz wave with a phase of 90° is offset by one quarter of a millisecond relative to the reference. This quantity is known as the phase delay.

This is where things get tricky and misleading, though. One subtlety that is often overlooked when dealing with such concepts is that, mathematically speaking, phase is a property of a periodic signal, which implies a continuous wave of infinite duration with no beginning and no end. Real-world signals of course do not meet these criteria. For most intents and purposes this does not really matter, but in some cases it does, and this is one of those. Specifically, it makes it very easy to misinterpret the meaning of phase delay.

Imagine that, in the above plot, the sine waves went on forever on both sides of the figure. What does “delay” even mean in that context? I could say that the 90° wave is delayed by 0.25 ms relative to the 0° wave, but I could just as well flip things around and say that the 0° wave is delayed by 0.75 ms relative to the 90° wave. Since both signals extend infinitely to the left, it makes no sense to imply that one started before the other. I could even go one step further and say that the 90° wave is delayed by, say, 10.25 ms (10 full cycles, plus a quarter cycle) and it would still mean the same thing. For this reason, the word “delay” needs to be handled very carefully in this context. [note] For an illustration of what can happen when people get confused about these concepts, see this epic trainwreck of a debate where 46 participants spend 456 posts fighting to the death over the deep philosophical meaning of the word “delay”. The math is easy; it’s interpreting the results that’s hard. [end note] Following the same logic, in terms of the phase itself, 90° is equivalent to -270°, 450°, -630°, etc. [note] In fact, when using complex numbers to do signal analysis, these are not just equivalent: they are the exact same number, landing in the same spot on the complex plane. [end note] [note] You might come across phase response plots where the values span more than 360°. This is called unwrapped phase and is meant mostly as a visual aid, making the graph more readable by avoiding sudden jumps at the boundaries of the range, and making some calculations easier. The underlying data is the same. [end note]
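The ambiguity is easy to see in code. Here is a minimal sketch with hypothetical function names. It wraps any phase into the (-180°, 180°] range and lists a few of the time offsets that are all consistent with the same phase at a single frequency.

```python
def wrap_phase_deg(phase_deg):
    """Map any phase onto the (-180, 180] degree range."""
    wrapped = phase_deg % 360.0
    return wrapped - 360.0 if wrapped > 180.0 else wrapped


def candidate_offsets_ms(phase_deg, frequency_hz, cycles=range(-1, 3)):
    """Time offsets (in ms) that all produce the same phase at a single frequency.

    A phase of phase_deg at frequency_hz is consistent with an offset of
    (phase_deg / 360 + k) periods, for any whole number of cycles k.
    """
    period_ms = 1000.0 / frequency_hz
    return [(phase_deg / 360.0 + k) * period_ms for k in cycles]


print([wrap_phase_deg(p) for p in (90, -270, 450, -630)])  # all the same phase
print(candidate_offsets_ms(90, 1000))
print(candidate_offsets_ms(90, 1000, cycles=[10]))
```

At 1 kHz, a 90° phase is equally consistent with -0.75 ms, 0.25 ms, 1.25 ms, 10.25 ms and so on. Nothing in the phase value alone tells them apart.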

Then again, one might still be tempted to argue that this is a mostly theoretical distinction: after all, any real-world device that changes the phase of the signal that passes through it has to apply some kind of delay, right?

Well, not necessarily. As a counter-example, consider the case of a very basic device that inverts the polarity of the signal in the analog realm. In other words, it changes the sign of the voltage, which could be as simple as swapping two wires. The opposite of a sine wave is that same sine wave shifted by 180°, so in effect, this device shifts the phase of every frequency component of the input signal by 180°. One can even say that it has a phase delay of 0.5 ms at 1 kHz (and 1 ms at 500 Hz, and 0.25 ms at 2 kHz, and so on). Yet it would be impossible to say that this device delays the signal for any reasonable definition of “delay” (i.e. in terms of information or energy propagation). It physically can’t, since there is no memory (buffer) for it to hold the signal in, and energy is coming out as soon as it gets in. This device exhibits phase shift and thus phase delay, yet there is no actual delay.

For these reasons, in general, it is not possible to know the true delay of a device by making a single phase shift measurement at a single frequency. [note] A pure delay produces a phase shift at every frequency equal to frequency times delay, expressed in cycles (i.e. 360° × frequency × delay) — which, incidentally, is very effective at making a mess in phase response plots. But delay cannot be directly recovered from the phase shift at a single frequency, because, as explained above, there is loss of information — the phase shift is constrained to a 360° range, making its interpretation ambiguous. [end note] It is, however, often possible to get more information about delay if a number of phase shift measurements are taken at various frequencies, such as by looking at a phase response plot like the one from the previous section. More specifically, this can be used to compute the group delay of the device (the negative slope of the phase response with respect to frequency), albeit with a number of caveats related to measurement accuracy. Some devices, especially those that exhibit amplitude frequency response distortion like the above example, have a group delay that itself varies with frequency, which makes interpretation even trickier. [note] If you’re not convinced how tricky this is, consider that the example I’ve used above (which is quite mundane, really) has negative group delay in the low frequencies. Let that sink in for a moment. That doesn’t seem physically possible, but it is. [end note]
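Here is a sketch of the idea under stated assumptions. Phase is measured (wrapped) at frequencies 100 Hz apart. It is unwrapped by assuming the phase never jumps by more than 180° between neighboring measurements, which only holds for delays shorter than 5 ms with this frequency step (one of those measurement caveats). Group delay is then estimated from the slope: minus the change in phase, in cycles, divided by the change in frequency. The sketch compares a hypothetical pure 1.25 ms delay with the polarity inverter described above; all function names are hypothetical.

```python
def wrap_deg(phase_deg):
    """Wrap to (-180, 180], which is all a single phase measurement can report."""
    wrapped = phase_deg % 360.0
    return wrapped - 360.0 if wrapped > 180.0 else wrapped


def unwrap_deg(phases_deg):
    """Remove 360-degree jumps, assuming neighboring values differ by less than 180 degrees."""
    unwrapped = [phases_deg[0]]
    for p in phases_deg[1:]:
        unwrapped.append(unwrapped[-1] + wrap_deg(p - unwrapped[-1]))
    return unwrapped


def group_delay_ms(freqs_hz, phases_deg):
    """Finite-difference group delay: minus the phase slope (in cycles) versus frequency."""
    ph = unwrap_deg(phases_deg)
    return [(p0 - p1) / 360.0 / (f1 - f0) * 1000.0
            for f0, f1, p0, p1 in zip(freqs_hz, freqs_hz[1:], ph, ph[1:])]


freqs = [100.0 * i for i in range(1, 21)]  # 100 Hz to 2 kHz in 100 Hz steps
delay_s = 0.00125                           # hypothetical pure delay of 1.25 ms
pure_delay = [wrap_deg(-360.0 * f * delay_s) for f in freqs]
inverter = [180.0 for _ in freqs]           # polarity inversion: 180 degrees at every frequency

print(group_delay_ms(freqs, pure_delay))
print(group_delay_ms(freqs, inverter))
# At 1 kHz alone, a 0.5 ms delay and the inverter report the same (wrapped) phase.
print(wrap_deg(-360.0 * 1000.0 * 0.0005), wrap_deg(180.0))
```

The pure delay yields 1.25 ms at every step, whereas the inverter yields zero: a phase shift without any delay. At 1 kHz alone, though, both report a phase of ±180°, so a single measurement cannot tell them apart.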

Noise and distortion

etienne@edechamps.fr (Etienne Dechamps) — Tue, 21 Nov 2017 21:05:00 +0100

As the audio signal makes its way through the different realms of the system, it travels through various digital, analog, and acoustic components that alter the signal in various ways. Some of these alterations might or might not be audible, or might only be audible under certain conditions. In most scenarios relevant to this blog they are undesirable side effects from limitations in the components that make up the system, though in some specific cases they can be deliberately introduced in pursuit of a specific goal (e.g. equalization).

In order to build a high quality audio system, it is necessary to keep signal degradation (i.e. unwanted alteration) under control, and this requires a good understanding of what these alterations might be, what causes them, and how to avoid or alleviate them. Ever since the advent of sound reproduction more than a century ago, this topic has been the subject of great debate, some of it thoughtful and innovative, some misguided or downright counter-productive. Hopefully this blog will do more of the former and less of the latter — but for now, this post serves as a brief introduction to the issues at hand.

Signal alterations can be divided into three broad categories: noise, frequency response distortion, and non-linear distortion. [note] When used by itself without qualification, the term “distortion” can refer to some or all of these categories, depending on context. [end note] Real-world systems exhibit all three kinds in varying amounts. What follows is a brief overview of each; future posts will look at them more closely.

Noise

Noise describes an alteration of the signal in which a separate, unrelated signal is added (i.e. mixed in, superposed) to the original signal. [note] This is the definition I’ll use throughout this blog, pursuant to IEC 60268–2. In other contexts noise might be used in a more specific way (e.g. broadband noise only), or in a more general way (e.g. signal differences introduced by non-linear distortion are considered to be part of the noise signal). [end note] That additional signal has its own characteristics including amplitude and frequency content (spectrum), which are combined with the characteristics of the original signal. Noise is often quantified by comparing the amplitude of the noise to the amplitude of the signal, a metric known as the signal-to-noise ratio (the higher the better).
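Here is a minimal Python sketch of the signal-to-noise ratio. It assumes that both signals are compared using their RMS amplitudes, expressed in decibels. The 1 kHz tone, the 48 kHz sample rate and the noise level (white noise with an RMS of about 0.05) are made-up values for illustration.

```python
import math
import random

def rms(values):
    """Root mean square amplitude of a list of samples."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB, comparing RMS amplitudes."""
    return 20.0 * math.log10(rms(signal) / rms(noise))

fs = 48000
rng = random.Random(0)                                   # fixed seed for repeatability
signal = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(fs)]  # 1 kHz sine, 1 s
noise = [rng.gauss(0.0, 0.05) for _ in range(fs)]       # white noise, RMS about 0.05
noisy = [s + w for s, w in zip(signal, noise)]          # what actually comes out
print(f"SNR: {snr_db(signal, noise):.1f} dB")           # about 23 dB
```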

Noise can appear in all three of the audio realms. In the digital realm it can take the form of dithering noise for example, though modern digital systems provide good enough performance that noise sits comfortably below the threshold of audibility. This is not always the case in the analog realm, where noise problems are the most common, the most objectionable, and the most pernicious — often the result of complex electromagnetic interference phenomena, subtle hardware defects, or compatibility issues. Finally, the acoustic realm is rife with often overlooked sources of noise from ordinary life, from the rumbling of an air conditioning unit to the occasional car driving down the street.

Depending on amplitude and frequency content, noise might or might not constitute a problem in practice. For example, low-level broadband noise (whether white or colored) will often go unnoticed because its spectrum is roughly similar to typical ambient noise that we are all continuously subjected to in our daily lives. [ref] See Albert Donald G., Decato Stephen N., “Acoustic and seismic ambient noise measurements in urban and rural areas”, Applied Acoustics, 119, 135–143, (2017) for examples of ambient noise spectra. [end ref] The same cannot be said of narrowband noise concentrated in specific frequencies. [note] Unfortunately that distinction is lost when noise measurements are condensed into a single number (such as signal-to-noise ratio), discarding spectral distribution in the process. [end note] Furthermore, narrowband noise is more likely to affect the perception of minute detail in the original signal, due to an auditory phenomenon known as masking.

Waveform and spectrum of a sine wave affected by white noise. The noise spectrum, circled in red, is often called the “noise floor”. [note] From that spectrum plot you might be tempted to conclude that the signal-to-noise ratio is about 50 dB. That would be wrong — it’s actually much worse, around 17 dB. You can’t read it directly from the graph because the noise is spread across multiple frequencies. This is a very common mistake when reading spectrum plots, which I might describe in more detail in a future post. [end note]
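The gap in the note comes from how the noise is displayed. A spectrum plot shows the noise power per frequency bin, whereas SNR compares the total noise power to the signal. This rough sketch assumes white noise spread evenly over the bins and ignores windowing effects. The bin counts are hypothetical, because the analysis settings behind the plot are not given.

```python
import math

def per_bin_snr_db(total_snr_db, num_bins):
    """Apparent SNR read off a single spectrum bin, assuming the noise power
    is spread evenly over num_bins bins while the sine sits in one bin."""
    return total_snr_db + 10.0 * math.log10(num_bins)

for bins in (256, 2048, 16384):
    print(f"{bins:>6} bins: noise floor {per_bin_snr_db(17.0, bins):.1f} dB below the tone")
```

The same 17 dB signal can therefore show a noise floor anywhere from about 41 to 59 dB below the tone, depending only on the analysis resolution.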

In a real-world scenario, noise is only really noticeable when the original signal is relatively quiet, such as when there is a break in a piece of music, because all that remains is the noise itself. Conversely, noise is inaudible when a significantly loud signal is playing, again because of masking (but this time in reverse). In practice, I tend to abide by the following rule of thumb: if you can’t hear anything when playing a silent signal, then your system is probably fine as far as noise is concerned.

Frequency response distortion

In the first post on this site, I explained that an audio signal can be decomposed into a number of constituent signals of various frequencies. One way an audio component can alter the signal is by changing the amplitude of (i.e. applying gain to) some of these frequencies more than others. This relationship between frequency and gain is known as the frequency response.

If this relationship is constant, i.e. the same amplitude multiplier is always applied to a given frequency regardless of the shape of the signal, then we are dealing with a linear time-invariant system. We can plot the frequency response on a graph, known as a frequency response graph (or, more technically, a Bode plot). [note] One thing that I’ve omitted to keep things simple is that a linear time-invariant system is not just allowed to change the amplitude of individual frequency components, it can also change their phase. This is conveyed through the phase response. Technically the term “frequency response” encompasses both magnitude response and phase response, though the latter is often dismissed for lack of relevance in most audio discussions. More on the phase response in the next post. [end note]

The frequency response of a system that amplifies frequencies around 1 kHz by about 6 dB. This particular type of distortion is called a resonance.
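One way to model a resonance like this one is the “peaking EQ” biquad from Robert Bristow-Johnson’s Audio EQ Cookbook. The sketch below evaluates the magnitude response of that filter at a few frequencies. The choice of filter, the Q of 2 and the 48 kHz sample rate are all assumptions. This is not the system plotted above.

```python
import cmath
import math

def peaking_biquad(f0, gain_db, q, fs):
    """Peaking-EQ biquad coefficients (b, a) following the RBJ Audio EQ Cookbook."""
    amp = 10.0 ** (gain_db / 40.0)          # square root of the linear peak gain
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b = (1.0 + alpha * amp, -2.0 * math.cos(w0), 1.0 - alpha * amp)
    a = (1.0 + alpha / amp, -2.0 * math.cos(w0), 1.0 - alpha / amp)
    return b, a

def magnitude_db(b, a, f, fs):
    """Gain in dB of the filter H(z) = B(z)/A(z) at frequency f."""
    z1 = cmath.exp(-2j * math.pi * f / fs)   # z^-1 on the unit circle
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return 20.0 * math.log10(abs(num / den))

FS = 48000.0
b, a = peaking_biquad(f0=1000.0, gain_db=6.0, q=2.0, fs=FS)
for f in (20, 200, 500, 1000, 2000, 5000, 20000):
    print(f"{f:>6} Hz: {magnitude_db(b, a, f, FS):+5.2f} dB")
```

Plotting these values against a logarithmic frequency axis gives a curve of the same general shape as the one in the figure: flat far from 1 kHz, with a +6 dB bump centered on it.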

How is this useful? Well, remember that the frequency domain is often more useful than the time domain when it comes to understanding how audio signals are perceived. Frequency response tells us how a system alters the frequency components of the signal that flows through it. That makes it one of the most powerful tools in the audio engineer’s toolbox.

Studies show that frequency response is extremely important when it comes to assessing audio quality in real-world scenarios. For example it has been shown that the human auditory system is capable of detecting frequency response variations as tiny as 0.1 dB [ref] Toole, F. and Olive, S., “The Modification of Timbre by Resonances: Perception and Measurement”, J. Audio Eng. Soc., 36(3), 122–141, (1988). From Fig. 9: coherent sum of 0 dB and -40 dB sources is ~0.1 dB. [end ref] , and that it is by far the most important criterion when it comes to assessing the quality of a loudspeaker [ref] Olive, Sean, “A Multiple Regression Model For Predicting Loudspeaker Preference Using Objective Measurements: Part 1 — Listening Test Results”, presented at the 116th AES Convention, Berlin, Germany, preprint 6113, (May 2004). [end ref] . Make no mistake: this metric is a huge deal, and I expect most posts on this blog will make use of it in one way or another.
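The ~0.1 dB figure in the Toole and Olive reference comes from adding two in-phase components: one at 0 dB and one 40 dB below it, which is ×0.01 in amplitude. A quick Python check of that arithmetic, assuming perfectly coherent (in-phase) summation:

```python
import math

def db_to_amplitude(db):
    """Convert a level change in dB to an amplitude ratio."""
    return 10 ** (db / 20)

# Two in-phase (coherent) components: one at 0 dB and a much weaker copy at -40 dB.
total = db_to_amplitude(0.0) + db_to_amplitude(-40.0)
print(f"level change: {20 * math.log10(total):.3f} dB")  # 0.086 dB, i.e. ~0.1 dB
```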

The digital and analog realms typically contribute very little in the way of frequency response distortion. Virtually all of it is found in the acoustic realm, as even the best loudspeakers and headphones struggle to keep their frequency response within a few dB of the intended target [ref] Many examples can be found in the SoundStage database (loudspeakers) and the InnerFidelity database (headphones). [end ref] . It gets worse: the acoustics of home listening rooms (room modes especially) can result in low frequency alterations on the order of 10 dB or more! [ref] Literally any frequency response measurement made in a small room will show this phenomenon. One example can be found in Leduc Michel, 2009, “How Does Listening Room Acoustics Affect Sound Quality?”, Audioholics (graph under “Standing waves” section). A series of representative examples can be found in Toole, Floyd E., Sound Reproduction: Loudspeakers and Rooms, figure 13.9. [end ref] No wonder these are often said to be the “weakest links” of the audio chain…

Non-linear distortion

Ideally, audio components should meet the definition of a linear system as described above; that is, they should be able to accurately track the input signal, such that the output is precisely proportional to the input. Of course that is not the case in practice. Besides noise (which we’ve already covered), consider, for example, that the driver inside a loudspeaker is made from imperfect materials that have imperfect physical properties, so its movement will not perfectly track the signal, instead giving rise to non-linear distortion. One example in the analog realm is crossover distortion that can occur in certain types of amplifiers.

Fortunately, any reasonable audio system will be at least approximately linear, so we can still use the frequency response to reason about the system. At the same time, we need to keep ourselves honest and account for any remaining non-linear behavior that might alter the signal in ways that frequency response and noise measurements will not predict. This is appropriately called non-linear distortion, also known as amplitude distortion.

This definition makes non-linear distortion a very open-ended category, as indeed a signal can be distorted in an infinite variety of ways. Since it is not possible to run an infinite number of tests, measuring and quantifying the non-linear behavior of a system can be quite challenging. That said, most non-linear distortion comes in well-known shapes and forms, so a few standard measurements are usually good enough to characterize the performance of an audio system.

One important aspect of non-linear distortion is that, contrary to frequency response distortion, it can cause new frequencies to appear that weren’t present in the original signal. By far the most well-known type of non-linear distortion is harmonic distortion, where new frequencies appear that are whole multiples (harmonics) of the frequencies in the original signal. It is often summarized in a single number, total harmonic distortion (THD), which indicates the total amplitude of the introduced harmonics relative to the original signal (more precisely, the fundamental). A related phenomenon is intermodulation distortion, where multiple frequency components in the signal interact with each other to create new frequencies in a specific pattern.
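Here is a minimal sketch of how a non-linearity creates harmonics. A sine is passed through the transfer curve y = x - 0.1x³. This curve is a hypothetical stand-in and does not model any particular device. The harmonic amplitudes are then measured with a direct DFT over exactly one period, so that DFT bin k corresponds to harmonic k.

```python
import cmath
import math

N = 64  # samples in exactly one period, so DFT bin k is harmonic k

def harmonic_amplitudes(samples, max_harmonic):
    """Peak amplitude of harmonics 1..max_harmonic via a direct DFT of one period."""
    n = len(samples)
    amps = []
    for k in range(1, max_harmonic + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples))
        amps.append(2 * abs(acc) / n)
    return amps

def thd(amps):
    """Root-sum-square of harmonics 2 and up, relative to the fundamental."""
    return math.sqrt(sum(a * a for a in amps[1:])) / amps[0]

clean = [math.sin(2 * math.pi * i / N) for i in range(N)]
distorted = [x - 0.1 * x ** 3 for x in clean]   # hypothetical, mildly non-linear "device"

amps = harmonic_amplitudes(distorted, 10)
for k, a in enumerate(amps, start=1):
    if a > 1e-9:
        print(f"harmonic {k}: amplitude {a:.4f}")
print(f"THD: {thd(amps):.2%}")
```

This curve is symmetric, so only odd harmonics appear. Because sin³θ = (3 sin θ - sin 3θ)/4, the output consists of just a fundamental at 0.925 and a 3rd harmonic at 0.025, for a THD of about 2.7%.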

At the beginning of this section I described examples of non-linear distortion that can occur during normal system operation. However, the most egregious, obvious, and audible non-linear distortion issues occur when the system is driven beyond its limits; that is, signal amplitude is pushed too far and the system is unable to keep up. When this happens the peaks of the waveform cannot be reproduced faithfully and the system “bottoms out”, a phenomenon known as clipping. To mention a few examples, this can happen in the digital realm (which is otherwise immune to most other forms of non-linear distortion) when a value exceeds the largest possible sample value; in the analog realm due to exceeding the power limits of an amplifier; or in the acoustic realm due to exceeding the movement limits of a driver (excursion).

Waveform and spectrum of a 1 kHz sine wave showing symptoms of “hard clipping”, which produces odd-order harmonics (circled in red). The THD in this example is about 7%. [note] Here’s how this number is calculated. First, take the RMS amplitude of the harmonics: 3 kHz: -26 dB (0.051); 5 kHz: -32 dB (0.025); 7 kHz: -46 dB (0.0047); 9 kHz: -47 dB (0.0043); … The total RMS amplitude of the harmonics (the square root of the sum of their squares) is 0.057. The RMS amplitude of the fundamental is 0.78. The THD is the ratio of these two numbers, which is 0.073, or 7.3%. [end note]
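The calculation in the note can be reproduced in a few lines of Python. The linear values are the ones read off the plot. As in the note, harmonics above 9 kHz are left out.

```python
import math

harmonics = [0.051, 0.025, 0.0047, 0.0043]   # RMS amplitudes at 3, 5, 7 and 9 kHz
fundamental = 0.78                           # RMS amplitude at 1 kHz

total = math.sqrt(sum(h * h for h in harmonics))  # combined RMS of the harmonics
thd = total / fundamental
print(f"harmonics: {total:.3f}, THD: {thd:.3f} ({thd:.1%})")
```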

So how audible is non-linear distortion? Well, that depends. Because non-linear distortion can take many forms, there is no simple answer. For example, distortion in the lower frequencies is less audible than in higher frequencies, and it is less audible if the newly introduced frequencies are close to the fundamental (thanks, once again, to masking). [ref] Temme, Steve, “Application Note: Audio Distortion Measurements”, Brüel & Kjær, 1992. [end ref] Condensing non-linear distortion measurements into a single, simplistic THD number, as is often done, certainly doesn’t help. With that in mind, real-world examples tend to suggest that a THD of 10% is likely to be audible, 1% is borderline, and 0.1% is unlikely to be audible. [ref] Audioholics, “Human Hearing - Distortion Audibility Part 3”, 2005. [end ref]

Amplitude, quantified

etienne@edechamps.fr (Etienne Dechamps) — Tue, 21 Nov 2017 21:04:00 +0100

In the last series of posts I’ve been focusing on the concept of amplitude, first defining it as the strength of an audio signal, then describing its physical manifestations as the signal makes its way through the audio playback chain, and lastly explaining how decibels are used as a way to manipulate these numbers.

The attentive reader might have observed that, back in that very first post, I deliberately left out an important part for later. The issue that still needs to be addressed is exactly how to quantify amplitude, i.e. how to calculate its value for a given audio signal. Let’s go back to our original example of a pure tone (sine) waveform, but with a vertical scale this time:

Now let me ask you a question: what is the amplitude of that signal?

Peak amplitude

You might be tempted to answer that question with “1.0” [note] Well, unless you are an audio engineer and you know what’s up. But in that case, what are you doing here? [end note] , because that’s as far as the curve travels from its middle point. Or perhaps you might answer “2.0”, because that’s the total height of the waveform (-1.0 to 1.0).

These are not the only possible answers (as we’ll see below), but they are valid answers nonetheless. The former answer (“1.0”) is called the peak amplitude of the signal. The latter answer (“2.0”) is called the peak-to-peak amplitude of the signal.

Peak-to-peak amplitude is twice the peak amplitude. [note] Strictly speaking that’s a bit of an oversimplification, because it assumes that audio signals are symmetrical, but they often aren’t — see Hetrich, Wayne L., “Real-World Audio Wave Form Asymmetries and the Effect on the Audio Chain”, presented at the 55th AES Convention, New York, NY, USA, preprint 1193, (October 1976). However this rarely matters in practice. [end note] In practice peak amplitude is more widely used than peak-to-peak amplitude. [note] One notable exception is marketing material (including manufacturer-provided specifications), where peak-to-peak amplitude is often used because the number looks bigger — don’t be fooled! [end note]

The problem with peak amplitude

You might be wondering why we can’t simply stop there. After all, we just came up with a quantitative definition of amplitude, so our job here is done, right?

Not so fast. Let’s not forget that amplitude is used in a variety of contexts for a variety of calculations and comparisons. We need to make sure that the definition we came up with (peak amplitude) is appropriate for what we’re going to use it for.

Peak amplitude is appropriate in some contexts. For example, if you’re trying to determine whether a digital signal is going to clip, peak amplitude is definitely the metric you should use to make that determination. But in most cases, what we’re most interested in is the amount of energy that is being conveyed in that audio signal on average. [note] “Energy” is used here in a generic sense, as physically speaking there is no “energy” in a digital signal for example. However it does map directly to the physical definition of energy when the signal enters the analog or acoustic realms, and since the acoustic realm is really all that matters in the end, it makes sense to use that term to describe audio signals in general. [end note] Peak amplitude fares poorly in that scenario. To understand why, let’s look at an extreme example of a signal that is very different from a sine wave:

That signal has the same peak amplitude as the previous example. Yet, it’s easy to see that it conveys less energy: it’s mostly silence only interrupted by a train of narrow peaks. That makes peak amplitude ill-suited for estimating the overall strength of the signal.

RMS to the rescue

To solve this problem, we need a different metric. Ideally, we want to compute some kind of average of the signal. We can’t use the mean — that would just amount to zero, since the positive and negative parts of the signal would cancel each other out.

As it turns out, there is a standard way to compute a suitable average for an audio signal (or any alternating signal, for that matter): the root mean square (RMS). It’s a simple formula: we square the signal values, sum the squares, divide the result by the number of values, and then finally we take the square root of that number. [note] For the sake of example, I’m assuming a discrete-time signal here. [end note] Because the values are squared, the positive and negative parts of the signal add up instead of canceling each other. [note] This discussion assumes that the signal is centered on zero — i.e. that there is no DC offset. In audio this is almost always a safe assumption to make. Incidentally, RMS when used in this context is the same thing as standard deviation. [end note]
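The formula translates directly into code. Here is a minimal Python sketch for a discrete-time signal that uses one period of a sine with a peak amplitude of 1.0. The resolution of 1000 samples is arbitrary.

```python
import math

def rms(values):
    """Root mean square: square each value, average, then take the square root."""
    return math.sqrt(sum(v * v for v in values) / len(values))

N = 1000
sine = [math.sin(2 * math.pi * n / N) for n in range(N)]   # one period, peak 1.0
print(f"|mean|: {abs(sum(sine) / N):.6f}")   # positive and negative halves cancel
print(f"RMS:    {rms(sine):.4f}")            # 0.7071, i.e. 1/sqrt(2)
```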

If we apply that formula to the first example, we end up with ~0.707. More generally, for a pure sine wave (and only for a pure sine wave!), the math tells us RMS amplitude is equal to peak amplitude divided by the square root of two (√2). Or, when working in decibels, that’s peak amplitude minus ~3 dB.

When applied to the second example, we end up with ~0.424. As expected, we get a lower value as the signal conveys less energy. In other words, the ratio of peak amplitude to RMS amplitude — known as the crest factor [note] Sometimes informally — and somewhat incorrectly — referred to as dynamic range in some contexts. [end note] — is different because the shape of the waveform is different.
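Once RMS is available, computing the crest factor is straightforward. The sketch below compares a sine with a square wave. The square wave is an extra example added for contrast. The pulse-train signal from the figure is not reproduced because its exact shape is not specified.

```python
import math

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

def crest_factor_db(values):
    """Ratio of peak amplitude to RMS amplitude, in decibels."""
    peak = max(abs(v) for v in values)
    return 20.0 * math.log10(peak / rms(values))

N = 1000
sine = [math.sin(2 * math.pi * n / N) for n in range(N)]
square = [1.0 if n < N // 2 else -1.0 for n in range(N)]
print(f"sine:   {crest_factor_db(sine):.2f} dB")    # ~3.01 dB (peak/RMS = sqrt(2))
print(f"square: {crest_factor_db(square):.2f} dB")  # 0.00 dB (peak equals RMS)
```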

Closing thoughts

We’ve seen that there is more than one way to quantify the amplitude of a signal, and that different approaches will produce different results depending on the shape of the waveform. Depending on the context in which the numbers are used, some approaches might be more appropriate than others.

In practice, the method used to calculate the amplitude is often stated near the unit. For example, “Vrms”, “Vp”, “Vpp”. Otherwise, it is often assumed that RMS was used. In particular, levels expressed in decibels (e.g. “dBV”) are virtually always RMS [ref] IEC 60027–3:2002, Letter symbols to be used in electrical technology — Part 3: Logarithmic and related quantities, and their units, §4.1 [end ref] (with the possible exception of dBFS, which sadly is ambiguous).

What about loudness? One could see loudness as the way we humans measure the amplitude of the sounds that reach our ears. Because the human auditory system is extremely complex, it is not easy to estimate, in general, how loud a given signal will be perceived to be. Of the two approaches that I’ve described, RMS is the one that approximates loudness best, but it is still a very crude estimation. Nonetheless, in practice, RMS amplitude is often used as a poor man’s proxy for loudness due to its simplicity. More perceptually accurate ways to estimate loudness would typically involve weighting or even more advanced processes such as those described in ITU-R BS.1770. But that’s a story for another post.

Decibels, explained

etienne@edechamps.fr (Etienne Dechamps) — Tue, 21 Nov 2017 21:03:00 +0100

In the previous post, I introduced a number of physical quantities that are used to describe the amplitude of an audio signal:

In the digital domain, it is a sample value;
In the analog domain, it is voltage in volts (V);
In the acoustic domain, it is a pressure difference in pascals (Pa).

However, in the audio literature, marketing materials and equipment specifications, these are not the units that are typically used. Instead, one often finds such quantities expressed in decibels (dB) or a related unit. There are good reasons for this, and they have to do with how we humans perceive loudness.

On the usefulness of ratios

In audio, we care more about ratios of quantities (e.g. “×2”) than absolute values (e.g. “3 V”). To understand why, here’s an example:

The difference in loudness between 0.1 Pa and 0.2 Pa is very noticeable.
The difference in loudness between 1.0 Pa and 1.1 Pa is barely noticeable.

In both cases the absolute difference is the same: 0.1 Pa. However, the ratio is very different; in the former case it’s ×2, in the latter case it’s ×1.1. This makes a compelling case for using ratios, not differences, when comparing amplitudes.

There is another advantage to using ratios. Remember that all three quantities (sample value, voltage, and sound pressure) are proportional to each other when the audio signal moves from one realm to the next. [note] Assuming ideal conditions, i.e. no noise or distortion. [end note] This means that a given ratio applies in all three domains: twice the sample value is also twice the voltage and twice the sound pressure. This is very convenient because it means that a given ratio (often called gain) has the same meaning regardless of context.

Now, if only there was a unit that made working with ratios easier…

Enter the decibel

The decibel (dB) is, in its purest form, a dimensionless unit that represents a ratio between two quantities, just like “×” or “%”. What sets the decibel apart is that it is also a logarithmic unit, meaning that its value is proportional to the logarithm of the ratio, not the ratio itself. This will be clearer with examples:

Amplitude ratio	Decibel equivalent
×0.01	-40 dB
×0.1	-20 dB
×0.5	-6 dB
×1	0 dB
×2	+6 dB
×4	+12 dB
×8	+18 dB
×10	+20 dB
×100	+40 dB

Because the decibel is a logarithmic unit, it behaves differently from more conventional linear units. The trick is not to fight its logarithmic nature, but to embrace it. Here’s how:

The most useful ratios to keep in mind are +6 dB (×2) and +20 dB (×10).
To invert the ratio, just change the sign: for example, -6 dB is the same as dividing by 2.
When combining ratios, decibels don’t multiply; they add up. For example, ×4 (2×2) is 12 dB (6+6), not 36 (6×6). This might seem strange, but this property makes decibels very convenient to use in practice — it’s easier to add than to multiply.
0 dB means ×1, i.e. no change in amplitude.
×0 is -∞ dB (negative infinity). You might have seen this as the “mute” position on some volume knobs.
For less trivial cases, online calculators are available.
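For a quick check of the table and the rules above, here is a minimal Python sketch. It covers amplitude (field-quantity) ratios only, i.e. 20·log10. The function names are illustrative.

```python
import math

def ratio_to_db(ratio):
    """Amplitude (field-quantity) ratio to decibels; x0 maps to -infinity."""
    return -math.inf if ratio == 0 else 20.0 * math.log10(ratio)

def db_to_ratio(db):
    """Decibels back to an amplitude ratio."""
    return 10.0 ** (db / 20.0)

for r in (0.01, 0.1, 0.5, 1, 2, 4, 8, 10, 100):
    print(f"x{r:<5} -> {ratio_to_db(r):+7.2f} dB")

# Combining gains: multiplying ratios is the same as adding decibels.
print(f"x4: {ratio_to_db(2 * 2):.2f} dB = {ratio_to_db(2):.2f} dB + {ratio_to_db(2):.2f} dB")
# Inverting a gain just flips the sign.
print(f"x0.5: {ratio_to_db(0.5):.2f} dB")
print(f"x0: {ratio_to_db(0)} dB")  # -inf, the "mute" position
```

The table rounds its values: ×2 is really about +6.02 dB, so “doubling is +6 dB” is a convenient approximation.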

Caution: All of the above assumes decibels are used to express ratios of field quantities. Digital sample value, voltage, and sound pressure are examples of field quantities. When dealing with power quantities (i.e. watts), however, there is a catch: in that case, +6 dB is ×4, not ×2 (×2 is +3 dB). This might seem strange and confusing, but once again there is an explanation: in practice power is proportional to the square of the field quantity. So, for example, when voltage is doubled, power quadruples. Or, said differently, if voltage is increased by 6 dB, then power increases by… 6 dB, which is why the rule makes sense.
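The short sketch below shows why both conventions agree. It uses a hypothetical 8 Ω purely resistive load, for which P = V²/R. Doubling the voltage is +6 dB as a field quantity, and the resulting fourfold increase in power is also +6 dB as a power quantity.

```python
import math

def power_ratio_to_db(ratio):
    """Power-quantity ratio to decibels: 10*log10, not 20*log10."""
    return 10.0 * math.log10(ratio)

def field_ratio_to_db(ratio):
    """Field-quantity (e.g. voltage) ratio to decibels."""
    return 20.0 * math.log10(ratio)

resistance = 8.0                     # hypothetical purely resistive load, in ohms
v1, v2 = 2.0, 4.0                    # doubling the voltage...
p1, p2 = v1 ** 2 / resistance, v2 ** 2 / resistance  # ...quadruples the power
print(f"voltage: {field_ratio_to_db(v2 / v1):.2f} dB, power: {power_ratio_to_db(p2 / p1):.2f} dB")
print(f"doubling power alone: {power_ratio_to_db(2):.2f} dB")
```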

Using decibels for absolute values

In theory, one could stick with linear units when dealing with absolute values (e.g. 2 V), and use decibels when dealing with ratios (e.g. ×2). However, when a calculation involves both, the mental gymnastics can be challenging. [note] Not convinced? Try calculating 2 V × -11 dB by hand. [end note]

It would make more sense to use decibels for everything, including absolute quantities. Fortunately, that’s easy: we just need to agree on a reference, and then express in decibels the ratio of the quantity we wish to convey to that reference. The resulting decibel value is called level. For example, if the reference is 1 V, then the level of 2 V is +6 dB. [note] And the calculation from the previous note becomes 6 dB - 11 dB = -5 dB. Much easier! [end note]
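Working through the footnote’s example in code shows the full round trip. The sketch uses the 1 V reference that dBV is based on. The function names are illustrative.

```python
import math

def volts_to_dbv(volts):
    """Voltage level relative to the 1 V reference (dBV)."""
    return 20.0 * math.log10(volts / 1.0)

def dbv_to_volts(level_dbv):
    """dBV level back to volts."""
    return 1.0 * 10.0 ** (level_dbv / 20.0)

level = volts_to_dbv(2.0)             # about +6 dBV
result = dbv_to_volts(level - 11.0)   # apply a -11 dB gain by simple subtraction
print(f"2 V = {level:.2f} dBV; after -11 dB: {level - 11.0:.2f} dBV = {result:.3f} V")
```

That is about 0.56 V, which matches 2 V × 10^(-11/20) without any multiplication done by hand.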

In practice, the reference value is indicated by a suffix affixed to the unit. Here are some references commonly used in the three audio realms:

Quantity	Reference	Equivalent level
Sound pressure	20 µPa	0 dB SPL
Voltage	1 V	0 dBV
Voltage	~0.77 V	0 dBu [note] Yes, there are two different references in common use for voltages. That’s sad, but that’s the way it is. To convert from one to the other, add or subtract ~2.2 dB. [end note]
Sample value	Full scale	0 dBFS [note] In other words, 0 dBFS is the maximum level before digital clipping (truncation) occurs. Thus valid amplitudes have negative dBFS values. Or at least that’s how it’s supposed to work. The definition of dBFS can be quite fuzzy and ambiguous, as explained in that Wikipedia page. Caveat emptor. [end note]
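The ~2.2 dB offset between dBu and dBV follows from the two references. The sketch below assumes the conventional origin of the dBu reference, which is 1 mW dissipated in 600 Ω. That gives √0.6 ≈ 0.7746 V, consistent with the ~0.77 V in the table.

```python
import math

DBU_REF_V = math.sqrt(0.001 * 600.0)   # 1 mW into 600 ohms, about 0.7746 V
DBV_REF_V = 1.0
OFFSET_DB = 20.0 * math.log10(DBV_REF_V / DBU_REF_V)  # how far 0 dBV sits above 0 dBu

def dbv_to_dbu(level_dbv):
    return level_dbv + OFFSET_DB

def dbu_to_dbv(level_dbu):
    return level_dbu - OFFSET_DB

print(f"0 dBu = {DBU_REF_V:.4f} V")
print(f"0 dBV = {dbv_to_dbu(0.0):+.2f} dBu")
```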

Sometimes the suffix is omitted and has to be inferred from the context. For example, it is fairly common to find “dB SPL” written as simply “dB”, especially in mainstream publications. This is best avoided as it can lead to confusion between ratios and levels. Conversely, “dBr” (“relative”) is sometimes used to make it explicit that decibels are used to express a ratio, not a level.

Additional suffixes and variants are often used to convey additional information. The most common examples include “peak” or “RMS”, which denote different ways of looking at the amplitude of the signal, and “dBA”, which indicates the use of A-weighting. More on these in later posts.

From bytes to your eardrum

etienne@edechamps.fr (Etienne Dechamps) — Tue, 21 Nov 2017 21:02:00 +0100

In the previous post, I described how an audio signal is represented. Now, let’s discuss the various physical forms that audio signals take as they travel through each stage of an audio playback system.

For the sake of this discussion, I am going to assume the most common and straightforward use case: playing a digital stream over loudspeakers (or headphones). By “digital stream” I mean any audio signal that is processed by a computer or computer-like system; that can be anything including an MP3 file, online video, online music streaming, the soundtrack of a Blu-ray disc, etc. This does not include analog media such as vinyl discs or cassette tapes.

Before this digital content can reach your eardrums, it has to go through a series of steps. Between these steps, the audio signal is materialized in different ways depending on which part of the audio “pipeline” we are looking at. In this post I refer to these concrete representations as realms [note] “realm” is not a widely used term — the term “domain” is normally used. However, I felt that this could create confusion with time domain and frequency domain, which are completely unrelated concepts. [end note] . I am going to start at the source and then make my way through to the listener.

To keep things clear and simple, the example signal I’ll use throughout this post is a monophonic (one channel) 1 kHz sine wave. For all intents and purposes, each additional channel can be assumed to act like a completely separate audio signal that takes a similar path through the system.

The digital realm

It all starts within the digital device, which can be any computer or computer-like gadget (PC, smartphone, Bluetooth receiver, etc.). Most devices read or receive audio data in digitally compressed form. Popular digital compression algorithms include MP3, AAC and FLAC.

Digital compression [note] Not to be confused with dynamic range compression, which is completely unrelated. [end note] is a complicated subject, which I won’t dig into further in this post. In any case, the data first goes through a decoder which converts the compressed signal into uncompressed form, which looks like this:

This plot shows that, in the digital realm, audio is not represented by a continuous, smoothly changing signal — instead, all we have are regularly-spaced “snapshots” that indicate what the signal amplitude is at some point in time. This is called a discrete signal, and the “snapshots” are called samples. In this example, we have 44100 samples every second, or more formally, the sample rate is 44.1 kHz. Such a sample rate is standard for music — other types of content, such as movies or games, use a slightly higher rate, 48 kHz, for mostly historical reasons.

Because memory is not infinite, each sample value has a finite precision. In practice, each sample is typically converted to a signed integer with a precision, or bit depth, of 16 bits. This process is called quantization. A 16-bit signed integer can take a value from -32768 to +32767. Samples outside of this range cannot be represented, and will be clamped to the nearest possible value; this is called digital clipping, and is best avoided as it sounds quite bad. A signal that peaks at the highest possible amplitude without clipping is called a full-range or full-scale signal [ref] IEC 61606–1:2009, Digital audio parts — Basic measurement methods of audio characteristics — General, §3.1.10 [end ref] .
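The sketch below generates the 1 kHz example at 44.1 kHz and quantizes it to 16-bit signed integers, clamping anything out of range. Scaling by 32767, so that +1.0 maps to the largest positive value, is an assumption. Converters and libraries do not all agree on this detail.

```python
import math

FS = 44100                       # sample rate, Hz
FREQ = 1000                      # test tone, Hz
FULL_SCALE = 32767               # assumed scaling: +1.0 -> 32767
INT16_MIN, INT16_MAX = -32768, 32767

def quantize(x):
    """Map a float in [-1, 1] to a 16-bit signed integer, clamping (clipping) the rest."""
    value = round(x * FULL_SCALE)
    return max(INT16_MIN, min(INT16_MAX, value))

samples = [quantize(math.sin(2 * math.pi * FREQ * n / FS)) for n in range(FS)]
print(samples[:4])
print(quantize(1.5), quantize(-1.5))   # out of range: clipped to 32767 and -32768
```

With this scaling, the first few samples come out as 0, 4653, 9211, 13583.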

Finally, the signal is physically represented simply by transmitting the value of each point, or sample, one after the other. For example, the above signal is transmitted as 4653, 9211, 13583, etc. in the form of binary numbers. This way of transmitting the signal is called pulse-code modulation.
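To give an idea of what that stream of binary numbers can look like, this snippet packs the first few sample values as 16-bit little-endian integers using Python’s `struct` module. Little-endian byte order is an assumption: it is what WAV files use, but other links may order bytes differently, and real interfaces add framing around the samples.

```python
import struct

samples = [0, 4653, 9211, 13583]    # first values of the example signal

# Pack each sample as a little-endian 16-bit signed integer, one after the other.
pcm = struct.pack(f"<{len(samples)}h", *samples)
print(pcm.hex(" "))                 # 00 00 2d 12 fb 23 0f 35

# Reading the bytes back recovers exactly the same values.
assert list(struct.unpack(f"<{len(pcm) // 2}h", pcm)) == samples
```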

This section just scratched the surface of how digital audio works. The details of how sampled signals behave in practice are often counter-intuitive; as a result, misrepresentation of digital audio phenomena is quite commonplace in the audiophile community, leading to confusion and misguided advice. Digital audio expert Chris Montgomery produced a series of videos that explain these complex phenomena with very clear and straightforward examples — it is a highly recommended resource if you wish to learn more about the digital realm.

The analog realm

Loudspeakers and headphones cannot receive a digital signal; it has to be converted to an analog signal first. This conversion is done in an electronic circuit appropriately named the digital-to-analog converter, or DAC. This is where computer engineering ends and electrical engineering begins. The main task of the DAC is to take each sample value and convert it to some electrical voltage on its output pins. The resulting signal looks like the following:

Caution: In the plot above, the unit used for the vertical scale is the volt. In other words, the amplitude of the audio signal in the analog domain is defined by its voltage. It is defined neither by current nor by power. Even when the signal is used as the input of a loudspeaker, it is still voltage that determines the sound that comes out; power dissipation is a consequence, not a cause, of the audio signal flowing through the loudspeaker. As Pat Brown elegantly puts it: “power is drawn, not applied”. [note] Another way to state this is to say that properly engineered analog audio devices act as voltage sources (or sinks), which are connected to each other by way of impedance bridging. [end note]

The DAC took our discrete signal and converted it into a continuous electrical signal, whose voltage is (hopefully) proportional to the digital sample value. The central (mean) value of the signal, called the DC offset, is zero volts; the signal swings around that central value, alternating between positive and negative voltage. In this example, our full-scale digital stream was converted to an analog signal that swings between -1.41 V and +1.41 V. Depending on the specific model of DAC used, its volume control setting (if any) and the signals involved, these numbers can vary — typical peak amplitude can go as low as 0.5 V [ref] Wikipedia, Nominal levels, peak amplitude for consumer audio [end ref] or as high as 2.8 V [ref] IEC 61938:2013, Guide to the recommended characteristics of analogue interfaces to achieve interoperability, §8.2.1 [end ref] .
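The sample-to-voltage mapping described above can be sketched as a simple linear function. This is an idealized DAC that assumes 16-bit samples, with the largest positive value (32767) corresponding to the +1.41 V peak from the example. Real DACs also apply reconstruction filtering and are only approximately linear. The constant and function names are hypothetical.

```python
FULL_SCALE_CODE = 32767   # largest positive 16-bit sample value (assumed bit depth)
FULL_SCALE_PEAK_V = 1.41  # peak output voltage at digital full scale (from the example)


def sample_to_voltage(sample, full_scale_code=FULL_SCALE_CODE, peak_v=FULL_SCALE_PEAK_V):
    """Idealized DAC: output voltage proportional to the sample value, zero DC offset."""
    return sample / full_scale_code * peak_v


for s in (0, 4653, 9211, 13583, 32767, -32767):
    print(f"{s:>6} -> {sample_to_voltage(s):+.3f} V")
```

The sign of each sample maps directly to the sign of the voltage. That is why the analog signal swings symmetrically around 0 V.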

The amount of current or power transferred from the source of an analog signal (e.g. a DAC) to the equipment plugged in at the other end of the cable (the load, e.g. a loudspeaker) is determined by the impedance of the load, also known as the input impedance. According to Ohm’s law, the lower the impedance, the more current, and therefore power, will be required to sustain a given voltage.

DACs, as well as most other types of analog audio equipment (such as filters or mixers), are not designed to provide significant amounts of power. Instead, they are meant to be connected to a high-impedance load, normally 20 kΩ or higher [ref] IEC 61938:2013, Guide to the recommended characteristics of analogue interfaces to achieve interoperability, §8.2.1 [end ref] . This means that the load is acting much like a voltmeter or oscilloscope — it is “peeking” at the input voltage without drawing significant power from it. Such a signal that carries some voltage but very little power is called a line-level signal.
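Ohm’s law makes it easy to quantify the difference between these two kinds of loads. The sketch below takes the example sine wave peaking at 1.41 V, which is about 1.0 V RMS because the RMS value of a sine is its peak divided by √2. It treats each load as a pure resistance, which is a simplification. The 4 Ω figure is the loudspeaker impedance discussed in the next paragraph.

```python
import math


def current_and_power(v_rms, impedance_ohm):
    """Ohm's law for a purely resistive load: I = V / Z and P = V**2 / Z."""
    current_a = v_rms / impedance_ohm
    power_w = v_rms ** 2 / impedance_ohm
    return current_a, power_w


# RMS voltage of the example sine wave, which peaks at 1.41 V.
v_rms = 1.41 / math.sqrt(2)

line_i, line_p = current_and_power(v_rms, 20_000)  # line-level input, 20 kΩ
spk_i, spk_p = current_and_power(v_rms, 4)         # loudspeaker, 4 Ω

print(f"20 kΩ load: {line_i * 1e3:.3f} mA, {line_p * 1e3:.3f} mW")
print(f"4 Ω load:   {spk_i * 1e3:.1f} mA, {spk_p * 1e3:.1f} mW")
print(f"the 4 Ω load draws {spk_p / line_p:.0f} times more power")
```

The 4 Ω load draws 5000 times more power at the same voltage. This is why line-level outputs are not designed to drive loudspeakers directly.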

On the other hand, loudspeakers (and headphones to a lesser extent) are low-impedance devices, often between 4 Ω and 8 Ω in the case of speakers. This is because they operate under a relatively low voltage, but require a lot of power. For example, most speakers will happily produce comfortably loud sound with as little as 6 V, but might consume as much as 9 watts while doing so [note] From the numbers given, a keen eye will deduce that this example speaker has an impedance of 4 Ω. One thing to note is that loudspeaker impedance is highly dependent on the frequency of the signal, making the use of one number an oversimplification. The impedance that manufacturers advertise, called the rated impedance, may be up to 1.25 times the minimum impedance of the speaker across its rated frequency range; put differently, the minimum impedance must not fall below 80% of the rated impedance (see IEC 60268-5:2003, Loudspeakers, §16.1). [end note] . Line-level equipment is not designed to provide such a large amount of power.
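The deduction in the note above is Ohm’s law rearranged. Here is a minimal sketch. It assumes the 6 V figure is an RMS voltage and that the speaker behaves as a pure resistance at the frequency in question, which the note itself calls an oversimplification.

```python
def impedance_from_voltage_and_power(v_rms, power_w):
    """Resistive load: P = V**2 / Z, therefore Z = V**2 / P."""
    return v_rms ** 2 / power_w


z = impedance_from_voltage_and_power(6.0, 9.0)  # example figures from the text
current_a = 6.0 / z                              # I = V / Z
print(f"impedance: {z:g} Ω, current: {current_a:g} A")  # impedance: 4 Ω, current: 1.5 A
```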

This problem is solved by using a power amplifier. This is a component that conveniently provides a high-impedance input for connecting line-level equipment, while exposing an output that is capable of providing large amounts of power, such as 10 W or more, to a low-impedance load. [note] In practice, most amplifiers are also capable of increasing the voltage (amplitude) of the signal; this is called the gain of the amplifier. This is because most loudspeakers require voltages that are somewhat higher than line level in order to play loud enough. Still, the primary goal of a power amplifier is to provide power, not to increase voltage. [end note] Such outputs provide so-called speaker-level signals.
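The example figures from this post can be used to estimate the voltage gain an amplifier would need, expressed as a simple ratio. This assumes both figures are RMS values of the same sine wave (the 1.41 V peak DAC output is about 1.0 V RMS). The result is only illustrative.

```python
import math


def voltage_gain(v_out, v_in):
    """Amplifier gain expressed as a plain ratio of output to input voltage."""
    return v_out / v_in


dac_output_rms = 1.41 / math.sqrt(2)  # example line-level signal, about 1.0 V RMS
speaker_drive_rms = 6.0               # example speaker voltage from the text
gain = voltage_gain(speaker_drive_rms, dac_output_rms)
print(f"required gain: about {gain:.1f}x")  # required gain: about 6.0x
```

A gain of about 6 is modest. Most of the amplifier's work goes into supplying the current (1.5 A into 4 Ω at 6 V) that the low-impedance load demands.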

In some home audio systems, the DAC and the amplifier are integrated into a single device, which is called an integrated amplifier or, more commonly, an AV receiver (AVR).

The acoustic realm

Finally, in order to reach your ears, the analog signal must be converted to an acoustic signal, that is, actual sound waves. This is accomplished using a device called an electroacoustic transducer, or driver. The output of a driver when excited with our example signal, as measured at some point in front of it, might look like the following:

Note the change of vertical scale. We’re not dealing with voltage anymore; amplitude takes the form of sound pressure instead. Indeed, sound is a physical phenomenon in which transient changes in pressure (compression, rarefaction) produced by the vibration of a sound source propagate through the space around it. In other words, it is a longitudinal wave. Sound pressure, expressed in pascals (Pa), quantifies the local, dynamic deviation from normal atmospheric pressure at a given point in time and space. The human ear is equipped to detect these changes, which are then (finally!) perceived as sound by the human brain.

An ideal transducer will produce sound pressure proportional to the voltage applied to it, like in the above waveform. However, it is difficult to design a driver that is capable of doing that across the entire range of audible frequencies. Consequently, drivers are usually specialized for a portion of that range; the common types are referred to as subwoofers, woofers, midranges and tweeters, in order of increasing frequency.

In order to reproduce the entire range of human hearing, several of these drivers — often two or three — are assembled inside a single “box”, called the enclosure. In most designs the drivers are mounted flush with one side of the box, which is called the front baffle. An electrical circuit called a crossover splits the input signal into the frequency ranges appropriate for each driver. The resulting device is called a loudspeaker.
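At heart, a crossover is a pair of complementary filters. The Python sketch below is a deliberately simplified digital analogy; real passive crossovers are analog circuits built from inductors and capacitors and typically use steeper slopes. It applies a first-order (one-pole) low-pass filter for the woofer band and takes whatever remains as the tweeter band, so the two bands always sum back to the input. The 2 kHz crossover frequency, the 48 kHz sample rate and all function names are assumptions chosen for illustration.

```python
import math


def one_pole_lowpass(signal, cutoff_hz, sample_rate_hz):
    """First-order IIR low-pass filter (discretized RC filter)."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)  # RC time constant for this cutoff
    dt = 1.0 / sample_rate_hz
    alpha = dt / (rc + dt)                # smoothing factor in (0, 1)
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)              # move part of the way toward the input
        out.append(y)
    return out


def two_way_crossover(signal, crossover_hz, sample_rate_hz):
    """Split a signal into a low band (woofer) and its complement (tweeter)."""
    low = one_pole_lowpass(signal, crossover_hz, sample_rate_hz)
    high = [x - lo for x, lo in zip(signal, low)]
    return low, high


def tone(freq_hz, sample_rate_hz, n_samples):
    """Pure tone of amplitude 1."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz) for n in range(n_samples)]


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


FS = 48_000
for freq in (100, 10_000):
    low, high = two_way_crossover(tone(freq, FS, 4_800), crossover_hz=2_000, sample_rate_hz=FS)
    print(f"{freq:>6} Hz tone -> woofer RMS {rms(low):.3f}, tweeter RMS {rms(high):.3f}")
```

The 100 Hz tone ends up mostly in the woofer band and the 10 kHz tone mostly in the tweeter band. A first-order split like this one still leaks a noticeable amount of each tone into the other band, which is one reason practical designs use steeper filters.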

What I’ve described here is called a passive loudspeaker, which is the most common type in consumer “Hi-Fi” systems. Sometimes the amplifier and speaker are integrated into the same device; this is called an active or powered speaker. Examples include professional “studio monitor” speakers, which have line-level inputs. Other products, such as “Bluetooth speakers”, go one step further and throw in a DAC as well for a completely integrated solution.

Headphones are a special case and typically only have one driver per channel, which makes them simpler. Conceptually, a headphone is akin to a miniature loudspeaker. Because they sit right next to the ear, they only need to move the small volume of air in front of it rather than fill an entire room; therefore they require much less power to operate (often less than 1 mW).

One notable aspect of the acoustic realm is that sound propagates in all three dimensions — the audio signal (sound pressure) is not the same at every point in space. In particular, speakers exhibit radiation patterns that vary with angle and frequency, and the sound they emit can bounce off surfaces (reflection). This in turn means that they interact with their environment (the listening room, or, in the case of headphones, the listener’s head) in ways that are complex and difficult to predict but nonetheless have an enormous impact on how the radiated sound will be perceived by a human listener. This makes choosing and configuring a speaker system quite the challenge. Hopefully, future posts on this blog will provide some pointers.

Anatomy of an audio signal

etienne@edechamps.fr (Etienne Dechamps) — Tue, 21 Nov 2017 21:01:00 +0100

This inaugural Factual post starts from first principles, laying down the foundations needed to reason about audio signals. I will then build on these principles in the posts to follow.

The time domain

An audio signal is an oscillating phenomenon: it is defined by a quantity that alternately increases and decreases over time while staying in the vicinity of some central value.

Out of the infinity of shapes that an audio signal can take, probably the simplest is a pure tone, also called a sine wave after the mathematical function that describes it. Here is an example of a sine wave:

The horizontal axis is time, which is why it is often said that this representation shows the signal in the time domain (another term is waveform). The above signal oscillates around the central value represented by the horizontal line. According to the horizontal scale, this particular signal repeats once every millisecond: this duration is its period, and each repetition is called a cycle. Said differently, the signal repeats one thousand times per second: it has a frequency of 1000 hertz (Hz). In order to be audible, the frequency of the signal must sit roughly between 20 Hz and 20 kHz: this interval is known as the audible range of the human hearing system.
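The relationship between period and frequency (f = 1/T) can be checked numerically. The sketch below samples a 1 kHz sine at an assumed rate of 48 kHz. At that rate, one period spans exactly 48 samples, so the waveform repeats every 48 samples. The function names and sample rate are illustrative assumptions.

```python
import math


def sine_wave(frequency_hz, amplitude, n_samples, sample_rate_hz):
    """Sample a pure tone: amplitude * sin(2*pi*f*t), with t = n / sample_rate."""
    return [amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz)
            for n in range(n_samples)]


def is_audible(frequency_hz, low_hz=20.0, high_hz=20_000.0):
    """Rough audible range of human hearing, as stated in the text."""
    return low_hz <= frequency_hz <= high_hz


FREQUENCY_HZ = 1000.0
SAMPLE_RATE_HZ = 48_000
period_s = 1.0 / FREQUENCY_HZ                          # T = 1 / f
samples_per_period = round(period_s * SAMPLE_RATE_HZ)  # 48 samples

wave = sine_wave(FREQUENCY_HZ, 1.0, 4 * samples_per_period, SAMPLE_RATE_HZ)
print(f"period: {period_s * 1000:g} ms")               # period: 1 ms
print(is_audible(FREQUENCY_HZ), is_audible(30_000))   # True False
```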

In the image above, the height of the signal is known as the amplitude (the term magnitude is also used). I deliberately left out the vertical scale and unit because they depend on the context; more on this in the post about audio realms. Another complication is that the “height” of the curve can be defined in several ways, which are explored in a separate post.

Amplitude is related to loudness, in the sense that if we take a signal and increase its amplitude (by amplifying its oscillations), the human hearing system will perceive the signal to be louder. Likewise, if we decrease its amplitude (by attenuating), it will be perceived as being quieter.

Caution: This relationship between amplitude and loudness does not necessarily hold when comparing signals that have differing frequencies. This is due to the fact that the human hearing system does not perceive all frequencies as being equally loud [note] The effect can be quantified using equal-loudness contours. [end note] . For example, if you listen to a 30 Hz tone and then to a 2 kHz tone of equal amplitude, the latter will sound much louder than the former.

Of course, most audio content is not a pure tone. In practice, an audio signal for, say, music, might look like this:

As the above image shows, a musical signal is way more complex than a pure tone. And that’s not even a complicated musical piece — this is pianist Joohyun Park, solo, playing a single note [note] Specifically, this is one of the first notes played at the beginning of the Allegro track from The Music of Battlestar Galactica for Solo Piano. [end note] . What’s really problematic, however, is that this representation doesn’t seem to relate to our perception at all — to the naked eye, it doesn’t look like a musical note played on a piano.

The frequency domain

In order to make sense of such complex signals, we need a better way to look at the data. Fortunately, the above signal can be decomposed into a number of pure tones of various frequencies and amplitudes, thanks to the superposition principle. The mathematical tool used to do the decomposition is called the Fourier transform. For example, if we were to apply the Fourier transform to our first pure tone example, the result could be represented as follows:

The vertical axis is still amplitude, but the horizontal axis has changed — it now represents frequency. This representation shows the signal in the frequency domain, or, in other words, it shows the spectral density (often simply called spectrum) of the signal.

A keen eye might have noticed that the horizontal axis is using a logarithmic scale, which is commonplace for this type of plot. This scale provides a better view of how we perceive sound: it is very easy to hear the difference between a 100 Hz tone and a 200 Hz tone, but the same cannot be said about 10000 Hz and 10100 Hz tones, even though the difference is still 100 Hz. This is because in the former case, there is a 100% increase, while in the latter case, the increase is only 1%. In other words, the human auditory system perceives relative change, as opposed to absolute change [note] This is consistent with other human senses, as predicted by the Weber-Fechner law. [end note] . The term octave is used to describe a frequency factor of two; for example, the range 2 kHz to 8 kHz is two octaves wide. The term decade is also sometimes used, and describes a tenfold increase in frequency.
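Because perception follows ratios rather than differences, octave and decade counts are logarithms of the frequency ratio. Here is a minimal sketch (the function names are hypothetical):

```python
import math


def octaves(f_low_hz, f_high_hz):
    """Width of a frequency range in octaves: how many times the frequency doubles."""
    return math.log2(f_high_hz / f_low_hz)


def decades(f_low_hz, f_high_hz):
    """Width of a frequency range in decades: how many times the frequency grows tenfold."""
    return math.log10(f_high_hz / f_low_hz)


print(octaves(2_000, 8_000))                       # 2.0
print(octaves(100, 200), octaves(10_000, 10_100))  # 1.0 versus roughly 0.014
# Audible range: about 10 octaves, or 3 decades.
print(f"{octaves(20, 20_000):.2f} octaves, {decades(20, 20_000):.2f} decades")
```

The same 100 Hz step is a full octave at the low end but only about 1/70 of an octave near 10 kHz. This matches the perceptual comparison in the paragraph above.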

The above plot is showing us that the signal can be decomposed into a single 1 kHz tone, but we already knew that. What’s more interesting is what happens when we apply the Fourier transform to the musical signal:

This plot tells us that our musical example can be decomposed into a 260 Hz tone with high amplitude, combined with 520 Hz and 780 Hz tones of lower amplitude.

Such a result is typical of a recording of an instrument playing a single note. The first tone, at 260 Hz, is called the fundamental and indicates the pitch of the sound, in other words the note being played: middle C (C4 in scientific pitch notation) in this example. The 520 Hz and 780 Hz tones, because they are integer multiples of the fundamental, are called harmonics. The human hearing system uses them to help determine the timbre of the instrument. If the same note were being played on, say, a flute or a violin, the frequency of the fundamental would be the same but the relative amplitudes of the harmonics would be different.

This is interesting because we can directly relate what we see on the plot to how the sound will be perceived, i.e. what the signal sounds like. Of course, interpreting this data still requires some effort: most people wouldn’t be able to tell “of course that’s a piano playing middle C” just by eyeballing the above image. Furthermore, if I had used a more complex example (such as a symphonic orchestra playing in unison), the spectrum would have been just as unreadable as the waveform. Nevertheless, in practice, the spectrum often provides a much more useful view from a perceptual perspective. This is why audio engineers will often ignore the time domain, instead focusing their efforts on the frequency domain.
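The decomposition described above can be reproduced on a synthetic signal with a discrete Fourier transform (DFT), the sampled counterpart of the Fourier transform. The sketch below uses a naive O(N²) DFT for transparency; in practice one would use an FFT routine such as `numpy.fft.rfft`. The test signal is hypothetical, not the actual piano recording. It combines a 260 Hz fundamental with 520 Hz and 780 Hz harmonics at made-up amplitudes of 1.0, 0.5 and 0.25. The 2600 Hz sample rate and 260-sample length are chosen so that each tone falls exactly on a DFT bin.

```python
import cmath
import math

SAMPLE_RATE = 2600  # Hz (assumed); Nyquist frequency is 1300 Hz, above 780 Hz
N = 260             # 0.1 s of signal, so DFT bins are 10 Hz apart


def dft(samples):
    """Naive DFT: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    n_total = len(samples)
    return [sum(x * cmath.exp(-2j * math.pi * k * n / n_total) for n, x in enumerate(samples))
            for k in range(n_total)]


# Hypothetical note: fundamental plus two weaker harmonics (frequency Hz -> amplitude).
components = {260: 1.0, 520: 0.5, 780: 0.25}
signal = [sum(a * math.sin(2 * math.pi * f * n / SAMPLE_RATE) for f, a in components.items())
          for n in range(N)]

spectrum = dft(signal)
bin_hz = SAMPLE_RATE / N
# Positive-frequency amplitudes, scaled so that a sine of amplitude A reads as A.
amplitudes = {round(k * bin_hz): 2 * abs(spectrum[k]) / N for k in range(1, N // 2)}
peaks = sorted(f for f, a in amplitudes.items() if a > 0.01)
print(peaks)  # [260, 520, 780]
```

The two weaker peaks sit at exactly 2× and 3× the fundamental, which is what makes them harmonics.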

Frequency-domain data can be converted back to the time domain using the appropriately-named inverse Fourier transform. One might wonder if any information gets lost during these conversions. From a purely mathematical point of view, the answer is no, but there is a catch. The above plots do not show the full output of the Fourier transform. In reality, the result of the Fourier transform includes amplitude information (which is shown above) and phase information (which I omitted). Amplitude determines the strength of the constituent tones, while phase indicates which part of the waveform cycle occurs at a specific point in time. As long as phase information is not discarded, it is possible to recover the original waveform, intact, simply by applying the inverse transform. That said, aside from a few specific scenarios (such as signal summation), phase is not nearly as prominent as amplitude in practical audio discussions, which is why I won’t dig further into it in this introductory post. I will revisit this topic in a later post.
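The claim that nothing is lost as long as phase is kept can be demonstrated with a discrete transform pair. This sketch uses a naive DFT and inverse DFT on a hypothetical two-tone test signal with arbitrary phase offsets. Passing the full complex spectrum through the inverse transform recovers the original samples. Keeping only the amplitude of each component (i.e. discarding phase) produces a different waveform with the same amplitude spectrum.

```python
import cmath
import math


def dft(samples):
    """Naive DFT: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    n_total = len(samples)
    return [sum(x * cmath.exp(-2j * math.pi * k * n / n_total) for n, x in enumerate(samples))
            for k in range(n_total)]


def idft(spectrum):
    """Naive inverse DFT: x[n] = (1/N) * sum_k X[k] * exp(+2j*pi*k*n/N)."""
    n_total = len(spectrum)
    return [sum(c * cmath.exp(2j * math.pi * k * n / n_total) for k, c in enumerate(spectrum))
            / n_total
            for n in range(n_total)]


N = 64
# Hypothetical test signal: two tones with arbitrary phase offsets.
signal = [math.sin(2 * math.pi * 3 * n / N + 0.7)
          + 0.5 * math.cos(2 * math.pi * 5 * n / N - 1.2)
          for n in range(N)]

spectrum = dft(signal)

# Full spectrum (amplitude and phase): the inverse transform recovers the waveform.
recovered = [z.real for z in idft(spectrum)]

# Amplitude only (phase discarded): same amplitudes, different waveform.
amplitude_only = [z.real for z in idft([abs(c) for c in spectrum])]

print(max(abs(a - b) for a, b in zip(signal, recovered)))       # close to 0 (rounding only)
print(max(abs(a - b) for a, b in zip(signal, amplitude_only)))  # clearly non-zero
```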