Beat Tracking

Getting a computer music system to listen to a performance and derive the tempo and the locations of beats.
See also Music Understanding



Dannenberg and Mont-Reynaud, “Following an Improvisation in Real Time,” in Proceedings of the 1987 International Computer Music Conference. International Computer Music Association, 1987.
This is where I got started. Bernard Mont-Reynaud is mostly responsible for the beat-tracking work described in this paper. Our simple efforts here are based on earlier work by Bernard and by Longuet-Higgins.

Allen and Dannenberg, “Tracking Musical Beats in Real Time,” in Proceedings of the International Computer Music Conference, Glasgow, Scotland, September 1990. International Computer Music Association, 1990. pp. 140-143.

This work addressed problems encountered in the previous work (Dannenberg and Mont-Reynaud, see above). We used a simple model that seems to underlie just about every beat-tracking (and phase-locked loop) system: when a note onset is earlier than predicted, increase the tempo, and when the onset is later than expected, decrease the tempo. The two parameters are “how much to increase/decrease” and “at what deviation do you want to just ignore note onsets rather than try to adjust to match them.” My system with Mont-Reynaud did not work well, and it seemed to be very sensitive to these parameters. Paul Allen and I systematically explored the parameter space and showed that, at least for our test set, no parameter combination worked well.
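
To make that model concrete, here is a minimal sketch in Python. The function name, the gain parameter (how much to increase or decrease the tempo), the window parameter (the deviation beyond which onsets are ignored), and all default values are illustrative choices, not taken from the paper.

    # Minimal sketch of the tempo-adjustment rule described above.
    # "gain" and "window" are the two parameters discussed in the text;
    # their names and default values here are illustrative only.
    def track_beats(onset_times, initial_period=0.5, gain=0.1, window=0.1):
        period = initial_period      # current beat period in seconds
        next_beat = onset_times[0]   # predicted time of the next beat
        beats = []
        for onset in onset_times:
            # Commit predicted beats that pass before this onset arrives.
            while next_beat + window < onset:
                beats.append(next_beat)
                next_beat += period
            deviation = onset - next_beat
            if abs(deviation) <= window:
                # Early onset (deviation < 0): shorten the period (speed up).
                # Late onset (deviation > 0): lengthen the period (slow down).
                period += gain * deviation
                next_beat = onset + period
                beats.append(onset)
            # Onsets farther than "window" from a prediction are ignored.
        return beats

For example, track_beats([0.0, 0.52, 1.01, 1.55]) settles on a period slightly longer than the initial 0.5 seconds. As noted above, the behavior turns out to be quite sensitive to the gain and window settings.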

We then set out to do better by using beam search to track multiple hypotheses about the tempo and phase. This approach has been very influential, especially in David Rosenthal's work. There is a longer, unpublished version of this paper (see below).

ABSTRACT: Identifying the temporal location of downbeats is a fundamental musical skill. Observing that previous attempts to automate this process are constrained to hold a single current notion of beat timing and placement, we find that they will fail to predict beats and not recover beyond the point at which the first mistake is made. We propose a new model that uses beam search to consider multiple interpretations of the performance. At any time, predictions of beat timing and placement are made according to the most credible of many interpretations under consideration.
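
As a toy illustration of the multiple-hypothesis idea, the Python sketch below runs a beam search over (score, period, next-beat) hypotheses. The scoring penalties, the beam width, and the tempo-adjustment constant are placeholders, not the formulation used in the paper.

    # Toy beam search over beat hypotheses. Scores, beam width, and the
    # adjustment constant are placeholders, not the paper's formulation.
    import heapq

    BEAM_WIDTH = 10

    def extend(hypotheses, onset):
        """Each hypothesis is (score, period, next_beat). Extend each one
        with the plausible interpretations of a new onset, then keep the
        most credible few."""
        candidates = []
        for score, period, next_beat in hypotheses:
            # Interpretation 1: the onset is the next beat; adjust tempo.
            deviation = onset - next_beat
            new_period = period + 0.1 * deviation
            candidates.append(
                (score - abs(deviation), new_period, onset + new_period))
            # Interpretation 2: the onset is off the beat; keep the
            # hypothesis unchanged, at a small cost in credibility.
            # (A real tracker would also consider subdivisions, etc.)
            candidates.append((score - 0.05, period, next_beat))
        return heapq.nlargest(BEAM_WIDTH, candidates)

    def track(onsets, initial_period=0.5):
        hypotheses = [(0.0, initial_period, onsets[0] + initial_period)]
        for onset in onsets[1:]:
            hypotheses = extend(hypotheses, onset)
        return hypotheses[0]   # the most credible interpretation

Because several interpretations survive each step, one locally bad decision does not derail the tracker, which is exactly the failure mode of the single-hypothesis model.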

[Adobe Acrobat (PDF) Version] [Postscript Version]


Allen and Dannenberg, “Tracking Musical Beats in Real Time,” unpublished manuscript, 1990.
This is the full version of the paper above. A shorter version was published in the 1990 ICMC proceedings (see above). I am not quite sure of the sequence, but I believe that this version came first, and we then cut it down to meet the page restrictions of the conference proceedings.

[Adobe Acrobat (PDF) Version] [Postscript Version]


Dannenberg, “Toward Automated Holistic Beat Tracking, Music Analysis, and Understanding,” in Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), London: Queen Mary, University of London, 2005, pp. 366-373.
ABSTRACT: Most music processing attempts to focus on one particular feature or structural element such as pitch, beat location, tempo, or genre. This hierarchical approach, in which music is separated into elements that are analyzed independently, is convenient for the scientific researcher, but is at odds with intuition about music perception. Music is interconnected at many levels, and the interplay of melody, harmony, and rhythm is important in perception. As a first step toward more holistic music analysis, music structure is used to constrain a beat tracking program. With structural information, the simple beat tracker, working with audio input, shows a large improvement. The implications of this work for other music analysis problems are discussed.
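
To illustrate the constraint concretely: if structure analysis reports that the span [a0, a1) of a recording repeats starting at time b0, then beats found in the first span can be carried over to seed or check its repetition. The Python sketch below shows only that transfer step, under assumed span boundaries; it is not the algorithm from the paper.

    # If [a0, a1) repeats starting at b0, beats detected in the first
    # span constrain (here, simply seed) the repetition. Illustrative
    # only; not the paper's algorithm.
    def transfer_beats(beats, a0, a1, b0):
        return [b0 + (t - a0) for t in beats if a0 <= t < a1]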

[Adobe Acrobat (PDF) Version]


Hu and Dannenberg, “Bootstrap Learning for Accurate Onset Detection,” Machine Learning 65(2-3) (December 2006), pp. 457-471.

ABSTRACT: Supervised learning models have been applied to create good onset detection systems for musical audio signals. However, this always requires a large set of labeled training examples, and hand-labeling is quite tedious and time consuming. In this paper, we present a bootstrap learning approach to train an accurate note onset detection model. Audio alignment techniques are first used to find the correspondence between a symbolic music representation (such as MIDI data) and an acoustic recording. This alignment provides an initial estimate of note boundaries which can be used to train an onset detector. Once trained, the detector can be used to refine the initial set of note boundaries and training can be repeated. This iterative training process eliminates the need for hand-labeled audio. Tests show that this training method can improve an onset detector initially trained on synthetic data.

[Adobe Acrobat (PDF) Version] [Online publication at Machine Learning journal website.]


Hu and Dannenberg, “A Bootstrap Method for Training an Accurate Audio Segmenter,” in Proceedings of the Sixth International Conference on Music Information Retrieval, London, UK, September 2005. London: Queen Mary, University of London & Goldsmiths College, University of London, 2005. pp. 223-229.

Computer alignment is used to get an initial estimate of the locations of onsets (this requires a MIDI file or other symbolic score of the music in the audio). Using these locations as training data, machine learning is used to build an onset detector. The onset detector can then be used to create its own training data, thus “pulling itself up by its own bootstraps,” resulting in further improvements.
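
In outline, the training loop looks something like the Python sketch below. The three helper functions are hypothetical stand-ins for the alignment, training, and relabeling components described in the papers, not functions from their actual code.

    # Sketch of the bootstrap training loop described above.
    # align_score_to_audio, train_onset_detector, and relabel_onsets are
    # hypothetical placeholders for the papers' components.
    def bootstrap_train(audio, score, iterations=3):
        # Initial labels: align the symbolic score to the recording.
        labels = align_score_to_audio(score, audio)
        detector = None
        for _ in range(iterations):
            # Train on the current, possibly noisy, onset labels...
            detector = train_onset_detector(audio, labels)
            # ...then let the detector refine its own training data.
            labels = relabel_onsets(detector, audio, labels)
        return detector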

This paper and the journal article above are similar, but the journal version has more detail and also reports on some experiments with polyphonic onset detection.

ABSTRACT: Supervised learning can be used to create good systems for note segmentation in audio data. However, this requires a large set of labeled training examples, and hand-labeling is quite difficult and time consuming. A bootstrap approach is introduced in which audio alignment techniques are first used to find the correspondence between a symbolic music representation (such as MIDI data) and an acoustic recording. This alignment provides an initial estimate of note boundaries which can be used to train a segmenter. Once trained, the segmenter can be used to refine the initial set of note boundaries and training can be repeated. This iterative training process eliminates the need for hand-segmented audio. Tests show that this training method can improve a segmenter initially trained on synthetic data.

[Adobe Acrobat (PDF) Version]