Roger B. Dannenberg | Audio Latency 2

Summary

My previous blog described the sources of audio delay or latency over digital networks. I have measured actual delay using Jacktrip, one of several systems designed for low-latency audio over networks. The measurements show substantial improvement over platforms such as Zoom, but the latency still seems high in terms of intimate music performance.

About the Author

Roger B. Dannenberg is a Professor of Computer Science at Carnegie Mellon University. He is known for a broad range of research in Computer Music, including the creation of interactive computer accompaniment systems, languages for computer music, music understanding systems, and music composing software. He is a co-creator of Audacity, perhaps the most widely used music editing software.

In my previous blog, I painted a pretty bleak picture of the prospects for music performance over the Internet. In this update, I report on some actual measurements using some software that is working pretty hard to keep latency down. The results are encouraging but not great.

This data was collected with the help of my friend and amazing tuba player Roger Day, a.k.a. “Professor Beautiful,” and using Jacktrip software. There is a lot more to do, so I will either update this or write another blog soon (I hope).

The story and claim, as I left it, was that you should expect 100 to 150 ms latency even if you try hard, but I will try to update that here.

New Information

First, several friends wrote to me after my previous blog, claiming good experience with this or that software. Especially interesting was a challenge to some of my measurements. At least some of these challenges had merit.

Mistake One

My first mistake was assuming a good way to measure “best case” latency within the city (for you New Yorkers, that's Pittsburgh if you please) would be to ping Carnegie Mellon University, practically a neighbor at less than 2 miles, not to mention its two 10 Gbps connections to the Internet. When my 30 ms ping time surprised others, a little more research (using traceroute) showed that to get across Forbes Avenue and back, my packets travel to both the D.C. area and Cleveland, in both directions! I assume there is just no shorter path between my Verizon connection and Carnegie Mellon’s Cogent and XO connections. So it was wrong to assume this is typical or best case within a city.

Mistake Two

I also based my measurements on a WiFi connection. I knew that WiFi was not as good as a wired Ethernet connection, but I probably underestimated the number of lost or delayed packets. It seems that, at least with my equipment, delays in WiFi are frequent enough to require much more buffering and higher latency, whereas I thought it would just cause a few more dropouts.

Mistake Three

I always assumed the audio input/output on my MacBook Pro had very low latency that only depended on internal buffer sizes, which are determined by applications. My friend Belinda Thom did some experiments that indicated otherwise. It is hard to know exactly where all latency originates, but we can compare the difference between the built-in MacBook Pro audio and an external audio interface. With low-latency settings on Logic Pro and direct monitoring, I measured 29 ms end-to-end latency with MacBook Pro audio and 11 ms end-to-end latency using an external USB audio interface (a relatively inexpensive Behringer UMC22). Thus, you can save an additional 18 ms using an external interface!

Aside: Why would Apple add audible delays to their built-in hardware? You would think where Apple has total control over the hardware and interface, it would be very simple to eliminate unnecessary latency. But I’ve noticed newer MacBooks do not have a single clock for microphone and speaker sampling. That must mean that engineers decided it was cheaper to decouple these components, and thus the microphone and speaker do not have exactly the same sample rate. Any pro audio interface must have a single clock for input and output because you cannot simply drop extra samples or duplicate missing samples when one sample stream gets ahead of the other. Apple solves this problem using software to resample one stream to match the other stream when necessary. This is a pretty expensive operation both in terms of computation and latency, but I suppose the hardware guys saved a few pennies per machine and that would more than pay for the sophisticated resampling software and interfaces needed to work around this limitation. But after all that, there is no way to recover the lost 18 ms of latency. Maybe management decided serious users would use external interfaces anyway – they have higher quality, and my laptop does not even have an audio line input jack, so it is obviously not intended for any serious audio work.

New Prediction

My previous estimate was 100 to 150 ms latency, but we should be able to make some corrections. I was able to ping Roger Day’s computer in the neighborhood with about a 13 ms delay, so we’re saving 17 ms already. I do not have a good feeling for the impact of WiFi, but since I am seeing frequent jumps in ping times by 10 to 15 ms, maybe we could expect another 15 ms improvement. Finally, by using an external audio interface, we can shave off another 18 ms. That would bring 100 ms down to 50 ms. OK, enough theory, let’s measure this for real!
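The arithmetic behind this revised estimate can be tallied in a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the WiFi figure is the rough guess from the paragraph above, not a measurement.

```python
# Start from the low end of the earlier 100-150 ms estimate and subtract
# each correction described above.
previous_estimate_ms = 100

savings_ms = {
    "shorter network path (30 ms ping -> 13 ms ping)": 30 - 13,
    "wired Ethernet instead of WiFi (rough guess)": 15,
    "external USB audio interface (29 ms -> 11 ms)": 29 - 11,
}

revised_estimate_ms = previous_estimate_ms - sum(savings_ms.values())

for source, saved in savings_ms.items():
    print(f"{source}: -{saved} ms")
print(f"revised estimate: {revised_estimate_ms} ms")
```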

Measuring Latency

After getting Jacktrip running on two macOS computers, Roger Day and I set up 4 different configurations, with buffer sizes ranging from 64 frames to 512 frames. (A frame is essentially a sample. You probably recognize the term sample rate from CD audio which is represented by 44,100 samples per second, but that is really 44,100 samples per second per channel. To avoid ambiguity, we say a frame is all the samples in one sample period, whether you have one, two or 64 channels. So the CD audio frame rate is 44,100 frames per second.)
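To get a feel for these buffer sizes, it helps to convert frames to milliseconds. The sketch below assumes the CD frame rate of 44,100 frames per second; the rate actually used in our Jacktrip session may have been different, and the function name is just for illustration.

```python
def buffer_duration_ms(frames: int, frame_rate_hz: float = 44_100.0) -> float:
    """Return how long a buffer of `frames` frames lasts, in milliseconds."""
    return 1000.0 * frames / frame_rate_hz

# The four buffer sizes we tested, assuming 44,100 frames per second.
for frames in (64, 128, 256, 512):
    print(f"{frames:>3} frames = {buffer_duration_ms(frames):.2f} ms")
```

At that rate, a single 64-frame buffer holds about 1.5 ms of audio and a 512-frame buffer holds about 11.6 ms.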

We both used external USB audio interfaces and wired Ethernet connections, we have the same ISP, and we live in the same neighborhood, so these are pretty ideal conditions for low-latency audio.

To measure latency, I used a true end-to-end test: I put my microphone and headphones very close, and Roger Day put his headphones right onto his microphone. When I tapped on my microphone, the signal went to Day's headphones, into his microphone (with negligible latency: sound travels an inch in 75 microseconds), then back to my headphones. I stuck a pocket recorder in real close to pick up both the taps and the headphone sounds, again within a couple inches of the microphone and headphones. The setup is very simple, but since the actual measurement device is a digital recorder, the measurements are certainly accurate enough for our purposes. Here’s what it looks like:

Importing the digital recordings into Audacity (my favorite editor 😀), I could easily estimate the round trip delay. Here is what that looks like:

The highlighted region is 44 samples, which is about 1 ms for reference.

Below is a summary of my measurements. This is just one test, and although we have reduced most sources of latency to a practical minimum, we are still at the mercy of the network itself (in this case, Verizon’s), which exhibits considerable delays and variability. Your numbers could be better or worse.

Network Ping Times

First, to characterize the network, here are some statistics on ping times: a round trip time to send a packet to the server (Day’s computer) and receive a reply:

Number of pings: 207
Maximum ping time: 13.2 ms
Minimum ping time: 1.2 ms
Median ping time: 5.4 ms
Mean ping time: 5.8 ms
Standard deviation: 3.0 ms

In some sense, the only number that matters is maximum ping time, because, as explained in my previous blog, the maximum tells you how much audio to buffer to avoid dropouts. So it’s great that a web page refresh would see a mean round-trip delay of under 6 ms, but 13.2 ms is more relevant to network audio.
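For anyone who wants to produce the same kind of summary from their own ping log, here is a minimal sketch using Python's standard statistics module. The sample values are made up for illustration and are not the 207 pings reported above. The sketch uses the sample standard deviation, which is an assumption about how the figure above was computed.

```python
import statistics

def summarize_pings(rtts_ms):
    """Summarize a list of round-trip ping times given in milliseconds."""
    return {
        "count": len(rtts_ms),
        "max": max(rtts_ms),  # the number that drives audio buffering
        "min": min(rtts_ms),
        "median": statistics.median(rtts_ms),
        "mean": statistics.mean(rtts_ms),
        "stdev": statistics.stdev(rtts_ms),  # sample standard deviation
    }

# Made-up values for illustration only, NOT the measured data.
example_rtts_ms = [1.2, 4.8, 5.4, 5.9, 6.3, 13.2]
summary = summarize_pings(example_rtts_ms)

for name, value in summary.items():
    print(f"{name}: {value:.1f}" if isinstance(value, float) else f"{name}: {value}")
```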

Here are the actual audio delays we measured under different configurations:

Frames per Buffer	Round-Trip Latency (ms)	Quality
64	34	Unusable
128	50	Poor
256	89	Good
512	129	Good

The “Quality” column is a subjective indication of audio quality. Only the 256 and 512 frames-per-buffer settings seemed usable. The smaller settings had too many late or dropped packets, which made the sound break up and crackle.

Here is a graph of the data:

With these numbers, we can do a little more analysis to see the impact of buffer sizes. First, latency grows linearly with buffer size, but there is an offset of about 25 ms seen as the Y-intercept in the graph. Some of this is simply due to audio conversion: every analog-to-digital and digital-to-analog converter adds some delay, and we are going through 4 conversions per round trip. Depending on converters, this delay is almost certainly less than 1 ms per converter. There could be a few samples buffered in converter hardware, but that’s negligible.
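The Y-intercept can be checked with a straight-line fit to the four measurements in the table. The sketch below uses ordinary least squares, which is my choice for illustration; reading the intercept off a graph by eye gives much the same answer.

```python
# (frames per buffer, round-trip latency in ms) pairs from the table above.
measurements = [(64, 34), (128, 50), (256, 89), (512, 129)]

def least_squares_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope_ms_per_frame, intercept_ms = least_squares_line(measurements)
print(f"slope: {slope_ms_per_frame:.4f} ms per frame of buffer size")
print(f"Y-intercept: {intercept_ms:.1f} ms")
```

With only four points the fit is rough, but it gives a slope of about 0.21 ms per frame and an intercept of about 24.7 ms, consistent with the 25 ms offset.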

The big concern is jitter on the network. As the buffer size gets smaller and smaller, a working system will still have to have additional buffers to tolerate the 13.2 ms worst-case round-trip network time to avoid dropouts. Note that the worst case may be much worse than 13.2 ms.

More on Network Latency

I observed 19 out of 207 ping times greater than 10 ms, or about 10%. About 1% of times exceeded 13 ms. It’s reasonable to assume the cause was one-way delays with the other direction taking about half the mean ping time, or about 3 ms. Thus, we are actually seeing 7 ms one-way times about 10% of the time, and over 10 ms one-way times about 1% of the time. Why does this matter? Remember that we have buffers in each direction to handle the worst case, which is at least 10 ms. Assuming 10 ms of buffers in each direction, we get total round-trip audio buffers of 20 ms, which is in the ballpark of the 25 ms Y-intercept on the graph.
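The reasoning above can be written out as a short calculation. This is a sketch of the same back-of-the-envelope estimate, assuming (as the text does) that a slow ping is slow in only one direction while the other direction takes the typical time; the names are illustrative.

```python
mean_ping_ms = 5.8
typical_one_way_ms = round(mean_ping_ms / 2)  # about 3 ms

def slow_direction_ms(round_trip_ms, other_direction_ms=typical_one_way_ms):
    """Estimate the one-way time of the slow direction of a ping."""
    return round_trip_ms - other_direction_ms

slow_pings = 19
total_pings = 207
fraction_over_10_ms = slow_pings / total_pings  # roughly 10%

one_way_at_10_ms_ping = slow_direction_ms(10)  # 7 ms
one_way_at_13_ms_ping = slow_direction_ms(13)  # 10 ms

# Buffer for the worst case (at least 10 ms) in each direction.
round_trip_buffer_ms = 2 * one_way_at_13_ms_ping

print(f"pings over 10 ms: {100 * fraction_over_10_ms:.0f}%")
print(f"one-way estimates: {one_way_at_10_ms_ping} ms and {one_way_at_13_ms_ping} ms")
print(f"round-trip buffering: {round_trip_buffer_ms} ms")
```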

However, while the Y-intercept seems consistent with actual network jitter, Jacktrip claims there are 4 packets in its buffers, which would imply a Y-intercept close to zero because the total amount of buffering should be proportional to frames per buffer.

There are still lots of things to explore. Jacktrip has an option to buffer more packets, which should allow smaller buffers, lowering the audio IO latency and putting the buffers where you need them to counter network latency. Jacktrip can also send redundant packets if packets are getting lost. Jacktrip can collect statistics, which might tell us more about what the network is doing.

Rather than sending duplicate packets, there are some clever error correction schemes that work even better, but no error correction is implemented in Jacktrip to my knowledge.
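To illustrate the general idea, here is a minimal sketch of one simple scheme, XOR parity. It is not something Jacktrip does, and it is far from the cleverest option, but it shows how one extra packet per group can replace any single lost packet in that group. The function names and packet contents are made up for illustration, and all packets are assumed to have the same length.

```python
def xor_parity(packets):
    """Return a parity packet: the byte-wise XOR of equal-length packets."""
    parity = bytearray(len(packets[0]))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)

def recover_missing(received, parity):
    """Rebuild the single missing packet (marked None) from the rest plus parity."""
    present = [p for p in received if p is not None]
    return xor_parity(present + [parity])

# Sender: three audio packets plus one parity packet.
group = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(group)

# Receiver: the second packet was lost, but it can be rebuilt.
rebuilt = recover_missing([group[0], None, group[2]], parity)
print(rebuilt == group[1])
```

Here the overhead is one extra packet for every three, compared with doubling the traffic when every packet is sent twice. The catch is that the receiver must wait for the rest of the group before it can rebuild a lost packet, which itself costs some latency.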

Latency Allocation

Given all this, what do I think now? Let's think about where latency appears and try to break things down by allocating time as needed. We know audio input/output takes 11 ms in Logic Pro, but I am optimistic that this could be trimmed, especially since 32-frame buffers represent less than 1 ms of audio, so I will allocate 5 ms each for input and output. Based on ping time, my network seems to have a worst-case delay of at least 10 ms each way, but let’s increase that to 15. That gives us 5 + 15 + 5 each way, for a total of 50 ms. I can imagine this getting a little better for very well-behaved networks or substantially worse for longer connections with greater packet delays and drops.
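Written out as a budget, the allocation looks like this. The numbers are the allocations proposed above, which are targets rather than measurements, and the names are only for illustration.

```python
# Proposed one-way latency budget, in milliseconds.
one_way_budget_ms = {
    "audio input": 5,
    "network (worst case, padded from 10 to 15)": 15,
    "audio output": 5,
}

one_way_ms = sum(one_way_budget_ms.values())
round_trip_ms = 2 * one_way_ms  # the same budget applies in each direction

print(f"one way: {one_way_ms} ms")
print(f"round trip: {round_trip_ms} ms")
```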

Wrapping Up

Some simple measurements I made in July indicated one might hope to achieve 100 to 150 ms round-trip audio latency over the network using off-the-shelf technology. Actual measurements show you can do at least a little better. Part of the difference is a combination of things I got wrong in the first measurements: WiFi packet loss seems to cause more than just occasional glitches, built-in laptop audio hardware may add a lot of latency, and crossing from one ISP to another may add some long-distance travel, even if your final destination is nearby. In my defense, I'll say my 100-150 ms estimate based on connecting to Carnegie Mellon was correct, but now that I see that connection involves a very long network path, it is not so interesting – except to CMU faculty and students!

With my ISP, 50 ms seems possible over short distances to other clients of the same ISP, but this will require a better understanding of network timing and maybe a different software implementation. For now, even though Jacktrip is at least a good attempt to do everything right, we only achieved about 90 ms round-trip latency with reasonable quality in terms of dropouts. (Well, unreasonable for recordings, but maybe OK for just playing together.) It would be nice to have a more complete picture including a good explanation for where all the time is going now. There may be some implementation problems leading to an extra 20 to 25 ms of delay (the Y-intercept), and possibly sending redundant packets or using error correction could overcome some of the network problems.

Finally, we have not yet played together over our connection. I have heard plenty of reports and clips of people playing, but there is never any believable quantitative data to go along with examples. My sense is that people are dealing with more network latency than they would put up with in a room where they can just move closer together. That is encouraging. I look forward to some music making using what we have so we can describe the experience in terms of actual measured latency.