These notes demonstrate the approach with a concrete example, and describe variations.
In all cases, we are considering discrete sequences of symbols, e.g. “How similar is ABCXDefG to AYBCDEZFG?” We might say these are similar because they both contain ABCDG in that order, or, if we consider e similar to E and f similar to F, we could say both contain letters similar to ABCDEFG. For musical applications, the symbols can be an encoding of musical pitches, chords, intervals, inter-onset intervals, etc.
Notice that an insertion in string A is equivalent to a deletion in string B: either can account for a 1-character difference. The naive way to compute edit distance is to generate all possible deletion variants of both string A and string B, and find the longest string they have in common. E.g. if string A is ABC and string B is AC, we compute all variants of string A: ABC, AB, AC, BC, A, B, C (and the empty string), and all variants of string B: AC, A, C (and the empty string), and look for the longest match, AC. But this is exponential in time: if a string has length N, each element is either deleted or not, so we have 2^N cases to check!
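To make the exponential blow-up concrete, here is a small Python sketch of the naive method; the function names are ours, just for illustration:

    from itertools import combinations

    def all_deletion_variants(s):
        # Every string obtainable from s by deletions: 2**len(s) of them,
        # including s itself and the empty string.
        variants = set()
        for k in range(len(s) + 1):
            for idx in combinations(range(len(s)), k):
                variants.add("".join(s[i] for i in idx))
        return variants

    def longest_common_variant(a, b):
        # Intersect the two variant sets and keep a longest member.
        common = all_deletion_variants(a) & all_deletion_variants(b)
        return max(common, key=len)

    print(longest_common_variant("ABC", "AC"))  # prints "AC"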
Dynamic programming builds a solution incrementally. The main principles that allow a faster-than-exponential solution are that we are only interested in the best solution, and that we can build upon partial solutions because edit distance is additive.
The edit distance problem is closely related to the Longest Common Subsequence (LCS) problem, where you want to find an ordered sequence that shares the most elements with both A and B. The difference is that edit distance adds up penalties for edits (so fewer edits are better), whereas LCS adds up matches between A and B (so more is better).
Here is a quick preview of how this can work. Imagine that we are computing the edit distance between AXBCDEF and ABCYDEF. Since the last characters of AXBCDEF and ABCYDEF match, we can reduce the problem to finding the edit distance between AXBCDE and ABCYDE. Now we've reduced the problem to a simpler one. The key to dynamic programming is that we will solve all simpler problems first, and, surprisingly, there are only N^2 of them, not an exponential number.
This gives us a recursive way to compute edit distance. Let's let α and β represent arbitrary strings, so αX is a string ending in X, and βY is another string ending in Y. Here's where you have to really pay attention and think hard because it is the heart of dynamic programming:
The edit distance from αX to βY is the minimum of:
- the edit distance from α to βY, plus 1 (delete X),
- the edit distance from αX to β, plus 1 (delete Y),
- the edit distance from α to β, plus 0 if X matches Y, else plus 2 (delete both X and Y).
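As a sketch, this recursion translates directly into Python (mismatches cost 2, following the all-deletions model):

    def edit_distance_recursive(a, b):
        if not a:
            return len(b)  # delete every character of b
        if not b:
            return len(a)  # delete every character of a
        return min(
            edit_distance_recursive(a[:-1], b) + 1,   # delete the last char of a
            edit_distance_recursive(a, b[:-1]) + 1,   # delete the last char of b
            edit_distance_recursive(a[:-1], b[:-1])
                + (0 if a[-1] == b[-1] else 2))       # match, or delete both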
This recursion still takes exponential time if implemented naively. We introduce a simple trick: store the edit distances between all pairs of prefixes of the two strings so we only compute each one once. There are N^2 such pairs, so the total work (and space) will be N^2, assuming we are comparing two strings of length N. (Yes, it's NxM for strings of lengths N and M.)
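One way to store the prefix distances, shown here as a sketch, is top-down memoization using Python's functools.lru_cache, which caches each prefix pair the first time it is computed:

    from functools import lru_cache

    def edit_distance_memo(a, b):
        @lru_cache(maxsize=None)
        def d(i, j):
            # Edit distance between prefixes a[:i] and b[:j].
            if i == 0:
                return j  # delete all j characters of b's prefix
            if j == 0:
                return i  # delete all i characters of a's prefix
            return min(d(i - 1, j) + 1,
                       d(i, j - 1) + 1,
                       d(i - 1, j - 1) + (0 if a[i - 1] == b[j - 1] else 2))
        return d(len(a), len(b))

    print(edit_distance_memo("AXBCDEF", "ABCYDEF"))  # prints 2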
Finally, the algorithm will compute a matrix like the following.
        A   X   B   C   D   E   F
    A   0   1   2   3   4   5   6
    B   1   2   1   2   3   4   5
    C   2   3   2   1   2   3   4
    Y   3   4   3   2   3   4   5
    D   4   5   4   3   2   3   4
    E   5   6   5   4   3   2   3
    F   6   7   6   5   4   3   2
Each cell of the matrix is an edit distance between two prefixes of the input strings. E.g. the edit distance from prefix AXB of the first string to prefix ABC of the second string is 2, marked with brackets below:
        A   X   B   C   D   E   F
    A   0   1   2   3   4   5   6
    B   1   2   1   2   3   4   5
    C   2   3  [2]  1   2   3   4
    Y   3   4   3   2   3   4   5
    D   4   5   4   3   2   3   4
    E   5   6   5   4   3   2   3
    F   6   7   6   5   4   3   2
Notice that the bottom right element of the matrix is the edit distance between the complete strings A and B. This is the final output of the algorithm.
The core of the algorithm, its inner loop, computes each cell of the matrix. Here is some pseudo-code showing the calculation for each row r, column c, assuming the strings to compare are A and B:
M[r, c] = min(M[r-1, c] + 1,
M[r, c-1] + 1,
M[r-1, c-1] + (2 if A[c] != B[r] else 0))
You should see if you can execute this code in your head to obtain the bracketed 2 in the table above. Search for examples in the table where the minimum value comes from the other terms of the expression.
One final detail is the “boundary condition” for the first row and column. Assuming zero-based indexing, when we compute row 0 or column 0, this pseudo-code will try to access column and row indices of -1. This would fail in a direct implementation. Intuitively, a row or column of -1 corresponds to the edit distance between a string and the empty string. The edit distance is just the length of the non-empty string, since we need to delete each element to obtain a matching empty string.
In my implementation of the pseudo-code, I replace direct access to array M with a function call that tests for the special cases of row < 0 and col < 0:
    def mget(r, c):
        if r >= 0 and c >= 0:
            return M[r][c]
        elif r == -1:
            return c + 1  # add 1 because index is zero-based, string length is one greater
        else:  # must be c == -1
            return r + 1
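Putting the loop and mget together gives a complete, runnable Python sketch; it fills M exactly as in the table above for our example strings:

    def edit_distance_matrix(A, B):
        # Rows index B, columns index A, as in the tables above.
        M = [[0] * len(A) for _ in range(len(B))]

        def mget(r, c):
            if r >= 0 and c >= 0:
                return M[r][c]
            elif r == -1:
                return c + 1  # distance from A's prefix to the empty string
            else:  # c == -1
                return r + 1  # distance from B's prefix to the empty string

        for r in range(len(B)):
            for c in range(len(A)):
                M[r][c] = min(mget(r - 1, c) + 1,
                              mget(r, c - 1) + 1,
                              mget(r - 1, c - 1) + (2 if A[c] != B[r] else 0))
        return M

    M = edit_distance_matrix("AXBCDEF", "ABCYDEF")
    print(M[-1][-1])  # prints 2, the bottom-right cell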
A common variation allows replacement (substitution) as a single edit rather than charging for two deletions. The inner loop becomes:

    M[r, c] = min(mget(r-1, c) + 1,
                  mget(r, c-1) + 1,
                  mget(r-1, c-1) + (1 if A[c] != B[r] else 0))
We simply replaced 2 (deletion from string A + deletion from string B, total cost 2) with 1 (replacement to make A[c] match B[r], total cost 1).
We can generalize the costs further, with an insertion cost CI, a deletion cost CD, and a replacement cost CR:
M[r, c] = min(M[r-1, c] + CI,
M[r, c-1] + CD,
M[r-1, c-1] + (CR if A[c] != B[r] else 0))
For a real example: if string A represented pitches extracted from audio, and there was a high likelihood of detecting repetitions, e.g. instead of the true melody ABCD you might expect the transcription to be AABBBCCDDDD, then you could assign a very low (or zero) deletion cost CD to reduce the penalty for these transcription errors.
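Here is a runnable sketch of the generalized recurrence; the parameter names ci, cd, cr, and the choice to scale the boundary costs by the per-edit costs, are our assumptions, not fixed by the recurrence above:

    def edit_distance_general(A, B, ci=1.0, cd=1.0, cr=1.0):
        M = [[0.0] * len(A) for _ in range(len(B))]

        def mget(r, c):
            if r >= 0 and c >= 0:
                return M[r][c]
            elif r == -1:
                return (c + 1) * cd  # delete A's whole prefix (assumed boundary cost)
            else:  # c == -1
                return (r + 1) * ci  # insert B's whole prefix (assumed boundary cost)

        for r in range(len(B)):
            for c in range(len(A)):
                M[r][c] = min(mget(r - 1, c) + ci,
                              mget(r, c - 1) + cd,
                              mget(r - 1, c - 1) + (cr if A[c] != B[r] else 0.0))
        return M[-1][-1]

    # Cheap deletions from A absorb the repeated transcription symbols:
    print(edit_distance_general("AABBBCCDDDD", "ABCD", cd=0.1))  # about 0.7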
What if we want to think of edits as random mutations, each with some probability? Probabilities multiply along an alignment, while edit distance adds costs. Fortunately, there is a simple trick: just use the negative logs of the probabilities as costs. Now the edit distance is the sum of negative log probabilities, so the minimal edit distance gives us the negative log of the probability of the most likely random mutation from string A to string B. It's a bit technical, and arguably not the perfect answer you want, but in practice this is a very useful interpretation of the numbers and provides a nice basis for estimating costs. For example, you can actually use real-life data to estimate replacement costs, e.g. what's the likelihood someone will sing a half-step off vs. a whole-step off, etc.
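As a small illustration with made-up numbers (these probabilities are hypothetical, not measured data):

    import math

    # Hypothetical replacement probabilities: a singer is more likely to be
    # a half-step off than a whole-step off.
    p_half_step = 0.10
    p_whole_step = 0.02

    # Costs are negative log probabilities: adding costs multiplies
    # probabilities, so the minimum-cost alignment is the most likely one.
    cost_half_step = -math.log(p_half_step)    # about 2.30
    cost_whole_step = -math.log(p_whole_step)  # about 3.91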
Recall that we had to deal with the “boundary condition” problem in our core algorithm description. Instead of accessing column -1 (which did not exist), we interpreted -1 as meaning the cost of deleting every character up to the current row, or the cost of matching a prefix to the empty string.
To allow for matching a query to any substring of string B, we can simply modify the cost returned when we access column -1. We want zero cost to skip a prefix of B and then start matching, so we can simply return 0 as the cost whenever we need to access M[r, -1].
We also want to ignore arbitrary suffixes of string B. To do this, we simply scan down the last column and pick the lowest number (edit distance) as our final result. That way, we do not consider the additional insertion costs charged against us for trying to match the remainder (suffix) of string B.
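Here is a sketch of this substring-matching variation, assuming replacement cost 1 as in the variation above:

    def best_substring_match(A, B):
        # A is the query; rows index B, columns index A.
        M = [[0] * len(A) for _ in range(len(B))]

        def mget(r, c):
            if r >= 0 and c >= 0:
                return M[r][c]
            elif c == -1:
                return 0      # skipping any prefix of B is free
            else:  # r == -1
                return c + 1  # unmatched query characters still cost

        for r in range(len(B)):
            for c in range(len(A)):
                M[r][c] = min(mget(r - 1, c) + 1,
                              mget(r, c - 1) + 1,
                              mget(r - 1, c - 1) + (1 if A[c] != B[r] else 0))

        # Ignore any suffix of B: take the best value in the last column.
        return min(M[r][len(A) - 1] for r in range(len(B)))

    print(best_substring_match("BCD", "AXBCDEF"))  # prints 0: BCD occurs exactly in B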
This approach is called Dynamic Time Warping. It is a special case of dynamic programming, and all of the techniques described above can be applied.