1. String Matching
1.1 definitions
- å = alphabet
- X,Y,Z = strings over å
- XY = string X concatenated with string Y
- X3 = XXX
- |X| = length of string X
- Y(i) = i th element of Y
- n = |Y|
- m = |X|
1.2 string matching problem
- input: pattern X , and string Y
- output: does X occur in Y ? $Z,W:ZXW = Y
The obvious solution is to check for X starting at each Y(i) , giving O(nm)
Less obvious solutions by KMP(Knuth, Morris, and Pratt) and BM (Boyer, Moore)
are O(m+n) proceed in 2 phases:
- analyze X so that a failure to match at point Y(j) starting at Y(i) will allow you
to restart the pattern match later in Y than Y(i+1)
- Do the pattern match.
1.3 more definitions
- X a period of Y if Y is a prefix of X¥ and |X| < |Y|
- Y is periodic if P £ m/2
1.4 periodicity lemma
If Y has periods p,q with p+q £ m then Y has period GCD(p,q) .
1.4.1 proof:
if p = q done.
else, assume p > q
y has period p-q or Y(i) = Y(i+(p-q)) for 1 £ i £ m-(p-q)
- Y(i) = Y(i+p) for 1 £ i £ m-p
- Y(i) = Y(i-q) for q+1 £ i £ m
- 1+2 Þ Y(i-q) = Y(i+p-q) for q+1 £ i £ m-p+q
- 2+3 Þ Y(i) = Y(i-q) = Y(i+(p-q)) for q+1 £ i £ m-(p-q)
- You can also get: Y(i) = Y(i+p) = Y(i+p-q) for 1 £ i £ m-p
We just need m-p ³ q to satisfy p+q £ m . Now, apply Euclids theorem to get that Y has period
GCD(p,q)
1.5 finding periods
To find the period, P , of Y , just offset Y by P and check for matching overlaps.
1.6 Another lemma:
|Y| = m with minimum period P and Z with |Z| = n ³ m then
- if Y occurs in Z at i and j then |i-j| ³ p
- if Y occurs in Z at i and i+d where d £ m-p then p divides d .
1.6.1 proof:
if i-j ³ m then we are done.
The rest of the proof is by pictures.
1.7 The witness array
|Y| = m and is of size min{P,m/2}
- fwit(1) = 0
- fwit(i) = k £ m+1-i s.t. Y(k) ¹ Y(i+k-1)
2. The algorithm
Duel(i,j)
input: pattern Y plus a witness array and a text, Z with |i-j| £ (p,m/2)
output: return i or j
- compare Y to Z starting at Z(i) and Z(j) . Check the witness point.
We have eliminated one element of a pair of patterns in unit time.
3. text analysis in the nonperiodic case
Input: pattern Y with witness vector, text Z
- Divide Z into chunks, Ti of size m/2
- In parallel, check if Y starts in Ti using Duel .
- Eliminate all but one occurence by using Duel .
- Check the last case by brute force.
3.1 Timing analyses
- steps 2 and 3 are T(n,m) = log(m) and W(n,m) = m*n/m = O(n)
- step 4 T(n) = log(n) and W(n,m) = n/m*m = n
So the total is T(n) = log(n) and W(n) = n
File translated from TEX by TTH, version 1.30.