Statistical Signal Processing (Detection and Estimation)
The goal of hypothesis testing (detection) is to decide which hypothesis (among n hypotheses) is true, where n can be any integer greater than 1. A hypothesis is a statement about a random event. Note that a hypothesis need not be a random variable, which is a function mapping an elementary event to a real number.
The goal of estimation is to extract a quantity of interest from noisy data/signals. Estimation problems can be classified into the following categories:
- Parameter/point estimation: a parameter is a point in an n-dimensional vector space (e.g., Euclidean space, Hilbert space, or Krein space), where n can be any positive integer. A parameter could be a deterministic continuous quantity, a discrete random variable, or a continuous random variable, but it cannot be a deterministic discrete quantity. If the parameter is a discrete random variable, the estimation problem becomes a hypothesis test. Note that hypothesis testing is not a subset of parameter estimation, since a hypothesis need not be a random variable.
- (Time-varying) signal estimation: the signal could be discrete-time or continuous-time, and it could also be deterministic or stochastic. A basic method for signal estimation is the Kalman filter. For continuous-time signal estimation, Ito stochastic calculus is used, which is fairly advanced.
- Density estimation: there are two approaches, i.e., parametric and non-parametric. Parametric methods are the same as those in parameter estimation. Non-parametric methods use various kernel functions (like the weighted windowing used in FIR filter design, spectral estimation, non-linear image processing, etc.); see the sketch after this list.
- Confidence interval estimation
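As a concrete illustration of the non-parametric approach mentioned above, here is a minimal kernel density estimation sketch in Python; the function name, the Gaussian kernel, the Silverman bandwidth rule, and the sample data are illustrative assumptions, not something prescribed in these notes.

# Minimal sketch of non-parametric density estimation with a Gaussian kernel.
# Names and the bandwidth rule are illustrative assumptions.
import numpy as np

def kde_gaussian(samples, query_points, bandwidth=None):
    """Kernel density estimate p_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h)."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Silverman's rule of thumb for a Gaussian kernel (1-D data)
        bandwidth = 1.06 * samples.std(ddof=1) * n ** (-1 / 5)
    u = (np.asarray(query_points)[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel K(u)
    return kernel.sum(axis=1) / (n * bandwidth)

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)           # noisy samples
grid = np.linspace(-3, 7, 200)
p_hat = kde_gaussian(data, grid)                          # estimated density on grid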
Key techniques:
All detection/estimation problems are optimization problems with respect to one of the criteria below. So whenever someone talks about an optimal detector/estimator, don't forget to ask which optimality criterion is used.
- Maximum likelihood criterion: (used for both
detection and estimation)
- LRT (Likelihood Ratio Test) for simple hypothesis testing: the threshold is 1, that is, decide on the hypothesis with the larger likelihood (see the sketch below). It can also be used for multiple-hypothesis testing.
- GLRT (Generalized LRT) for composite hypothesis testing: take the maximum (supremum) of the likelihood function over the parameter set as the generalized likelihood function.
- ML estimator for estimation: used when the prior distribution is unknown or when the parameter to be estimated is deterministic. An ML estimate is generally biased, but it is consistent (asymptotically unbiased) and asymptotically efficient (asymptotically achieving the Cramer-Rao bound).
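Here is a minimal sketch of the LRT with threshold 1 for two simple hypotheses about a Gaussian mean; the means, variance, and sample size are illustrative assumptions.

# Minimal LRT sketch for two simple hypotheses about the mean of Gaussian data:
#   H0: x ~ N(0, 1)   vs.   H1: x ~ N(1, 1)
# With the maximum-likelihood threshold of 1, we decide the hypothesis with the
# larger likelihood. The means/variance are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def lrt_decide(x, mu0=0.0, mu1=1.0, sigma=1.0, threshold=1.0):
    """Return 1 if H1 is decided, 0 if H0, based on the likelihood ratio."""
    # Work with log-likelihoods for numerical stability.
    ll0 = norm.logpdf(x, loc=mu0, scale=sigma).sum()
    ll1 = norm.logpdf(x, loc=mu1, scale=sigma).sum()
    return int(ll1 - ll0 > np.log(threshold))

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=1.0, size=20)   # data actually drawn under H1
print(lrt_decide(x))                          # expected to print 1 most of the time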
- Neyman-Pearson criterion: (used for detection
only)
- Maximize the detection probability (called the power) under a constraint on the false alarm probability (called the size). In general this is a constrained optimization problem and can be solved by the Lagrange multiplier method.
- A typical solution to the Neyman-Pearson problem (the above constrained optimization problem) is the LRT/GLRT: determine the threshold from the false alarm probability, then use this threshold in the LRT/GLRT (see the sketch below).
- Insight: the Neyman-Pearson criterion achieves a tradeoff between two contradictory requirements: detection probability and false alarm probability. The larger the allowed false alarm probability, the larger the achievable detection probability.
- Fundamental implication: whenever there are two contradictory design requirements, the Neyman-Pearson criterion can be used to formulate a constrained optimization problem, and the optimal solution gives the best tradeoff between the two contradictory requirements.
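A small sketch of the "set the threshold from the false alarm probability" step for detecting a known DC level in white Gaussian noise; the signal level A, noise variance, sample size, and size alpha are assumptions.

# Neyman-Pearson sketch for detecting a known DC level in white Gaussian noise:
#   H0: x[k] ~ N(0, sigma^2)   vs.   H1: x[k] ~ N(A, sigma^2),  k = 1..n
# The LRT reduces to comparing the sample mean with a threshold chosen so that
# the false alarm probability equals alpha. A, sigma, n, alpha are assumptions.
import numpy as np
from scipy.stats import norm

A, sigma, n, alpha = 0.5, 1.0, 25, 0.05

# Threshold on the sample mean: P(mean > gamma | H0) = alpha
gamma = (sigma / np.sqrt(n)) * norm.isf(alpha)

# Resulting detection probability (power) on the ROC
p_d = norm.sf((gamma - A) * np.sqrt(n) / sigma)
print(f"threshold = {gamma:.3f}, P_FA = {alpha}, P_D = {p_d:.3f}")

def np_detector(x):
    """Decide H1 (return 1) if the sample mean exceeds the NP threshold."""
    return int(np.mean(x) > gamma)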
- Minimum variance unbiased (MVUB) estimation
criterion:
- Every efficient estimator is MVUB, but
the converse is not true. That is, an estimator may be MVUB
but not achieve the Cramer-Rao
bound.
- Fisher information matrix: the expectation of the second derivative of the Shannon codeword length log(1/p(X|\theta)).
- I(\theta) = E[s(\theta,X)*s^T(\theta,X)] = E[d^2 log(1/p(X|\theta))/d\theta^2], where the score s(\theta,X) = d log(p(X|\theta))/d\theta and s^T is the transpose of s.
- Cramer-Rao bound: the estimation error covariance matrix C = E[(\hat{\theta}-\theta)*(\hat{\theta}-\theta)^T] >= I^{-1}(\theta), where I^{-1}(\theta) is the inverse of I(\theta) and the inequality means that C - I^{-1}(\theta) is positive semidefinite.
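A quick numerical check of the Cramer-Rao bound for estimating a Gaussian mean; the parameter values are assumptions, and the sample mean is used as the (efficient) estimator.

# Cramer-Rao bound sketch: estimating the mean theta of N(theta, sigma^2) from
# n i.i.d. samples. The Fisher information is I(theta) = n / sigma^2, so the
# CRB is sigma^2 / n; the sample mean attains it (it is efficient). The values
# of theta, sigma, n are illustrative assumptions.
import numpy as np

theta, sigma, n, trials = 2.0, 1.5, 50, 20000
crb = sigma**2 / n                                   # inverse Fisher information

rng = np.random.default_rng(2)
x = rng.normal(theta, sigma, size=(trials, n))
theta_hat = x.mean(axis=1)                           # ML / sample-mean estimator
empirical_var = theta_hat.var(ddof=1)

print(f"CRB = {crb:.4f}, empirical variance of estimator = {empirical_var:.4f}")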
- Bayesian criterion: (used for both detection
and estimation)
- Used for estimating a random parameter
(not for estimating a deterministic parameter).
- Assume that the prior distribution is
known and the cost/risk function is given.
- Minimize the average cost/risk. This is because the cost is generally a random variable (due to the randomness of the noise) and cannot be minimized directly, so we choose to minimize the average cost instead. If the cost is a discrete random variable, we have another option: the minimax criterion.
- If the cost function is quadratic, the Bayes estimator, called the Minimum Mean Square Error (MMSE) estimator, is the conditional mean of the posterior PDF (i.e., the conditional distribution of the parameter given the observations).
- The nice thing about MMSE is that the MMSE estimator is linear if the signal and observation are jointly Gaussian (see the sketch below). This has a similar flavor to quadratic programming (in optimization theory), which also reduces to solving linear equations, since the derivative/gradient of a quadratic cost is linear.
- For non-Gaussian signal/noise, the MMSE estimator is generally nonlinear. Because of the simplicity of linear estimation, one often uses the linear MMSE (LMMSE) estimator, which is sub-optimal compared with the full MMSE estimator.
- If the cost function is uniform (hit-or-miss), the Bayes estimator reduces to the MAP (Maximum A Posteriori probability) estimator. A point at which a density achieves its maximum value is termed a mode of the PDF, so the MAP estimate is the mode of the posterior PDF. Note that the reduction from the Bayes estimator to MAP requires that the threshold (width) of the uniform cost function be very small.
- If the cost function is the absolute error, the Bayes estimator, termed the Minimum Mean Absolute Error (MMAE) estimator, is the median of the posterior PDF.
- In summary, the MMSE, MMAE, and MAP estimates are the mean, median, and mode of the posterior PDF, respectively. In general, Bayes estimates are features of the posterior PDF. This is one reason why Bayesian estimators are widely used in pattern recognition.
- For a Gaussian posterior PDF, MMSE = MMAE = MAP.
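A minimal sketch of the jointly Gaussian case, where the posterior is Gaussian and the MMSE, MMAE, and MAP estimates all coincide with the posterior mean; the prior and noise parameters are assumptions.

# Bayesian estimation sketch for a Gaussian prior and Gaussian noise:
#   theta ~ N(mu0, s0^2),   x[k] = theta + w[k],  w[k] ~ N(0, sn^2), k = 1..n
# The posterior is Gaussian, so MMSE = MMAE = MAP = posterior mean, and the
# estimator is linear in the observations. All parameter values are assumptions.
import numpy as np

mu0, s0, sn, n = 0.0, 2.0, 1.0, 10

rng = np.random.default_rng(3)
theta_true = rng.normal(mu0, s0)
x = theta_true + rng.normal(0.0, sn, size=n)

# Posterior precision and mean (standard Gaussian conjugate update)
post_prec = 1.0 / s0**2 + n / sn**2
post_mean = (mu0 / s0**2 + x.sum() / sn**2) / post_prec

print(f"true theta = {theta_true:.3f}, MMSE/MAP/MMAE estimate = {post_mean:.3f}, "
      f"posterior variance = {1.0 / post_prec:.3f}")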
- Minimax criterion: (used for both detection
and estimation)
- Game-theoretical approach: the minimax detector/estimator is based on zero-sum game theory. Suppose there are two players: an experimentalist and Mother Nature. Mother Nature chooses the least favorable prior PMF (probability mass function). Then the minimax detector/estimator reduces to the Bayes detector/estimator with respect to the least favorable prior, so the minimax detector/estimator can be regarded as a special case of the Bayes detector/estimator (see the sketch below).
- The minimax detector/estimator is applicable to both discrete and continuous random variables. For a continuous r.v., the strategies are played over arbitrary PDFs. (In other words, discretize the range of the r.v. to obtain a PMF for the resulting discrete r.v.; in the limit this discrete r.v. approaches the continuous r.v., so we can play with the PMF of the discrete r.v. in place of the PDF of the continuous r.v.)
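A small numerical sketch of the least-favorable-prior idea for a binary Gaussian detection problem with 0-1 cost; the grid search and all parameter values are illustrative assumptions.

# Minimax detection sketch (zero-sum game view): for the binary Gaussian problem
#   H0: x ~ N(0, 1)   vs.   H1: x ~ N(1, 1),  0-1 cost,
# "Mother Nature" picks the least favorable prior p1, i.e., the prior that
# maximizes the Bayes risk of the corresponding Bayes detector. Numbers are
# illustrative assumptions; the search is a simple grid scan.
import numpy as np
from scipy.stats import norm

mu = 1.0
p1_grid = np.linspace(0.01, 0.99, 981)

# Bayes rule: decide H1 if x > threshold(p1); threshold from the LRT with 0-1 cost
thresholds = mu / 2 + np.log((1 - p1_grid) / p1_grid) / mu
p_fa = norm.sf(thresholds)                 # P(decide H1 | H0)
p_miss = norm.cdf(thresholds - mu)         # P(decide H0 | H1)
bayes_risk = (1 - p1_grid) * p_fa + p1_grid * p_miss

i = np.argmax(bayes_risk)                  # least favorable prior
print(f"least favorable p1 = {p1_grid[i]:.2f}, minimax risk = {bayes_risk[i]:.3f}")
# At the least favorable prior the two error probabilities balance: P_FA ~ P_miss.
print(f"P_FA = {p_fa[i]:.3f}, P_miss = {p_miss[i]:.3f}")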
Types of estimators
- Maximum likelihood estimator
- Bayesian estimator
- Moment estimator:
- set the sample moments equal to the ensemble moments; solve the resulting equations to obtain estimates of the parameters of interest.
- Least squares estimator:
- does not require a prior distribution
- minimize the quadratic cost, for given observations Y, over the parameter of interest \theta (see the sketch below).
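Minimal sketches of a least squares estimator (via a least-squares solve of the normal equations) and a moment estimator; the linear model, noise level, and the exponential example are assumptions.

# Least squares sketch: estimate theta in the linear model y = H @ theta + noise
# by minimizing ||y - H @ theta||^2. No prior distribution is needed. The model
# size and noise level below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
H = rng.normal(size=(n, p))                    # known observation matrix
theta_true = np.array([1.0, -2.0, 0.5])
y = H @ theta_true + 0.1 * rng.normal(size=n)  # noisy observations

# Solve the normal equations (H^T H) theta = H^T y via a least-squares solver
theta_ls, *_ = np.linalg.lstsq(H, y, rcond=None)
print("LS estimate:", np.round(theta_ls, 3))

# Moment estimator example: for x ~ Exponential(rate lam), E[x] = 1/lam, so
# equating the sample mean to the ensemble mean gives lam_hat = 1 / mean(x).
x = rng.exponential(scale=1 / 2.5, size=5000)  # true rate 2.5 (assumed)
print("moment estimate of rate:", round(1 / x.mean(), 3))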
Minimum Description Length (MDL) model selection
If someone asks you whether a Markov chain is a reasonable model for time series analysis, you should ask: compared to what? Without a baseline solution, anything could be a reasonable starting point.
So the first step in modeling is selecting a model. The second step is to estimate the parameters of the model.
MDL can be used as a criterion for model selection. It measures the universal code length of the data given a model. One can compare two different models based on their MDLs computed from the available data and favor the one with the shorter code length. However, it is not clear how to optimize over a model set (a functional space) containing infinitely many models based on MDL. Note that MDL is used for comparing given models in a parameterized family, rather than for searching for the best model over the whole functional space.
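A small sketch of MDL-style model comparison using the common two-part (BIC-like) approximation of the code length; the candidate models and the data are illustrative assumptions.

# MDL model-selection sketch using the common two-part (BIC-style) approximation:
#   description length ~ -log-likelihood + (k/2) * log(n)
# We compare two candidate regression models (constant vs. straight line) on the
# same data and favor the shorter code length. Data and models are assumptions.
import numpy as np

rng = np.random.default_rng(5)
n = 200
t = np.linspace(0, 1, n)
y = 1.0 + 2.0 * t + 0.3 * rng.normal(size=n)      # data actually from a line

def description_length(y, design):
    """Two-part code length for a linear-Gaussian model with ML noise variance."""
    n, k = design.shape
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coeffs) ** 2)
    neg_log_lik = 0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return neg_log_lik + 0.5 * k * np.log(n)      # data cost + parameter cost

mdl_const = description_length(y, np.ones((n, 1)))                  # k = 1
mdl_line = description_length(y, np.column_stack([np.ones(n), t]))  # k = 2
print(f"MDL(constant) = {mdl_const:.1f}, MDL(line) = {mdl_line:.1f}")
# The line model should win (smaller description length) on this data.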
A Kalman filter requires 1) a linear system equation and 2) Gaussian noise. The linearity of the system equation preserves the Gaussianity of the process (a Gauss-Markov process; actually, an AR process). Since the state process is a Gauss-Markov process, the Kalman filter only has to propagate the mean and covariance. Kalman filters can also be used to estimate the state of a time-varying linear system: just change the system matrices from constant ones to time-varying ones.
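A minimal scalar Kalman filter sketch showing that only the mean and variance of the state are propagated; the state-space parameters are assumptions.

# Minimal Kalman filter sketch for a scalar linear-Gaussian state-space model:
#   x[k] = a * x[k-1] + w[k],  w ~ N(0, q)      (state equation)
#   y[k] = x[k] + v[k],        v ~ N(0, r)      (observation equation)
# Only the mean and variance of the state are propagated. The parameter values
# are illustrative assumptions.
import numpy as np

a, q, r, n = 0.95, 0.1, 0.5, 200

rng = np.random.default_rng(6)
x = np.zeros(n)
y = np.zeros(n)
for k in range(1, n):
    x[k] = a * x[k - 1] + rng.normal(0, np.sqrt(q))
    y[k] = x[k] + rng.normal(0, np.sqrt(r))

x_hat, p = 0.0, 1.0                     # initial state estimate and variance
estimates = []
for k in range(n):
    # Predict
    x_pred = a * x_hat
    p_pred = a * a * p + q
    # Update with the new observation y[k]
    gain = p_pred / (p_pred + r)        # Kalman gain
    x_hat = x_pred + gain * (y[k] - x_pred)
    p = (1 - gain) * p_pred
    estimates.append(x_hat)

rmse = np.sqrt(np.mean((np.array(estimates) - x) ** 2))
print(f"filter RMSE = {rmse:.3f}, observation noise std = {np.sqrt(r):.3f}")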
"Graphical models are a marriage between
probability theory and graph theory. They provide a natural tool for dealing
with two problems that occur throughout applied mathematics and engineering --
uncertainty and complexity -- and in particular they are playing an increasingly
important role in the design and analysis of machine learning algorithms.
Fundamental to the idea of a graphical model is the notion of modularity -- a
complex system is built by combining simpler parts. Probability theory provides
the glue whereby the parts are combined, ensuring that the system as a whole is
consistent, and providing ways to interface models to data. The graph theoretic
side of graphical models provides both an intuitively appealing interface by
which humans can model highly-interacting sets of variables as well as a data
structure that lends itself naturally to the design of efficient general-purpose
algorithms.
Many of the classical multivariate probabilistic
systems studied in fields such as statistics, systems engineering, information
theory, pattern recognition and statistical mechanics are special cases of the
general graphical model formalism -- examples include mixture models, factor
analysis, hidden Markov models, Kalman filters and Ising models. The graphical
model framework provides a way to view all of these systems as instances of a
common underlying formalism. This view has many advantages -- in particular,
specialized techniques that have been developed in one field can be transferred
between research communities and exploited more widely. Moreover, the graphical
model formalism provides a natural framework for the design of new systems."
--- Michael Jordan, 1998.
EM algorithm vs. BCJR
EM is a sub-optimal algorithm for computing maximum likelihood or MAP estimates. What EM obtains is a fixed point (typically a local optimum), rather than the desired global optimum. Each EM iteration is guaranteed not to decrease the likelihood, so the iterations converge to such a fixed point.
BCJR is an algorithm that computes a posteriori probabilities exactly.
EM is an iterative algorithm while BCJR is a
sequential algorithm.
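A minimal EM sketch for a two-component one-dimensional Gaussian mixture, illustrating that each iteration does not decrease the likelihood and the iterates settle at a fixed point; all parameter values are assumptions.

# Minimal EM sketch for a two-component 1-D Gaussian mixture with known, equal
# variances: each iteration is guaranteed not to decrease the likelihood, and
# the iterates converge to a fixed point (usually a local optimum, not
# necessarily the global ML solution). All parameter values are assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n, sigma = 1000, 1.0
z = rng.random(n) < 0.4                               # latent component labels
x = np.where(z, rng.normal(-2, sigma, n), rng.normal(3, sigma, n))

# Initial guesses for the means and the mixing weight
mu = np.array([-1.0, 1.0])
w = 0.5
for _ in range(50):
    # E-step: posterior responsibility of component 0 for each sample
    p0 = w * norm.pdf(x, mu[0], sigma)
    p1 = (1 - w) * norm.pdf(x, mu[1], sigma)
    gamma = p0 / (p0 + p1)
    # M-step: re-estimate the means and the mixing weight
    mu[0] = np.sum(gamma * x) / np.sum(gamma)
    mu[1] = np.sum((1 - gamma) * x) / np.sum(1 - gamma)
    w = gamma.mean()

print(f"estimated means = {np.round(mu, 2)}, weight = {w:.2f}")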