Stochastic Process

Classes of stochastic processes:

White noise

A continuous-time process is called white noise if for arbitrary n, sampling at arbitrary time instants t_1, t_2, ..., t_n, the resulting random variables, X_{t_1}, X_{t_2}, ..., X_{t_n} are independent, i.e., their joint pdf f(x_1, x_2, ..., x_n)= f(x_1)*f(x_2)*...*f(x_n).
marginal distribution is enough to determine joint pdf of all orders.

Gaussian processes:

used to model noise

white Gaussian noise: marginal pdf is Gaussian.
colored (and wide sense stationary) Gaussian noise: characterized by marginal distribution and autocorrelation R(\tau).

heavily used in communication theory and signal processing, due to 1) Gaussian assumption is valid in many practical situations, and 2) easy to obtain close-form solutions with Gaussian processes. E.g., Q function and Kalman filter.

Poisson processes:

used to model arrival processes
heavily used in queueing theory, due to 1) Poisson assumption is valid in many practical situations, and 2) easy to obtain close-form solutions with Poisson processes. E.g., M/M/1 and Jackson networks.

Renewal processes

used to model arrival processes
heavily used in queueing theory, e.g., M/G/1, G/M/1, G/G/1

Markov processes:

the queue in M/M/1 is a Markov process.

Semi-Markov processes

the queue in M/G/1 and G/M/1 is a semi-Markov process.

Random walk
Brownian motion
Wiener process
Diffusion process
Self similar process, long range dependence (LRD) process, short range dependence (SRD) process

Similar to Probability theory, the theory of stochastic process can be developed with non-measure theoretic probability theory or measure theoretic probability theory.
How to characterize a stochastic process:

Use n-dimensional pdf (or cdf or pmf) of n random variable at n randomly selected time instants. (It is also called nth-order pdf). Generally, the n-dimensional pdf is time varying. If it is time invariant, the stochastic process is stationary in the strict sense.

To characterize the transient behavior of a queueing system (rather than the equilibrium behavior), we use time-varying marginal cdf F(q,t) of the queue length Q(t). Then the steady-state distribution F(q) is simply the limit of F(q,t) as t goes to infinity.

Use moments: expectation, auto-correlation, high-order statistics
Use spectrum:

power spectral density: Fourier transform of the second-order moment
bi-spectrum: Fourier transform of the third-order moment
tri-spectrum: Fourier transform of the fourth-order moment
poly-spectrum.

Limit Theorems:

Ergodic theorems: sufficient condition for ergodic property. A process possesses ergodic property if the time/empirical averages converge (to a r.v. or deterministic value) in some sense (almost sure, in probability, and in p-th mean sense).

Laws of large numbers
Mean Ergodic Theorems in L^p space
Necessary condition for limiting sampling averages to be constants instead of random variable: the process has to be ergodic. (not ergodic property)

Central limit theorems: sufficient condition for normalized time averages converge to a Gaussian r.v. in distribution.

Laws of large numbers

Weak law of large numbers (WLLN)

Sample means converge to a numerical value (not necessarily statistical mean) in probability.

Strong law of large numbers (SLLN)

Sample means converge to a numerical value (not necessarily statistical mean) with probability 1.
(SLLN/WLLN) If X1, X2, ... are i.i.d. with finite mean \mu, then sample means converge to \mu with probability 1 and in probability.
(Kolmogorov): If {X_i} are i.i.d. r.v.'s with E[|X_i|]<infinity and E[X_i]= \mu, then sample means converge to \mu with probability 1.

For {X_i} are i.i.d. r.v.'s with E[|X_i|]<infinity, E[X_i]= \mu, and Var(X_i)=infinity, then sample means converge to \mu with probability 1. But the variance of sample means does not converge to 0. Actually, the variance of sample means is infinity. This is an example that convergence almost sure does not imply convergence in mean square sense.

Mean Ergodic Theorems:

Sample means converge to a numerical value (not necessarily statistical mean) in mean square sense.
A stochastic process is said to be mean ergodic if its sample means converge to the expectation.

Central limit theorems (CLT)

Normalized sample means converge to a Gaussian random variable in distribution.

Normalized by the standard deviation of the sample mean.
Like lim_{x goes to 0}x/x=1 (the limit of 0/0 is a constant), CLT characterizes that as n goes to infinity, (S_n-E(X)/(\sigma/sqrt(n)) converges to a r.v. N(0,1), i.e., convergence in distribution. By abusing notation a bit, lim_{n goes to \infty}(S_n-E(X)/(\sigma/sqrt(n)) = Y, the limit of 0/0 is a r.v.
SLLN/WLLN is about the re-centered sample mean converging to 0. CLT is about the limit of 0/0.

(Lindeberg-Levy): If {x_i} are i.i.d. and have finite mean m and finite variance \sigma^2 (\neq 0), then

the CDF of [(\sum_{i=1}^n x_i /n) - m]/(\sigma/\sqrt{n}) converges to a Gaussian distribution with mean 0 and unity variance.

Comments: WLLN/SLLN does not require finite variance but they obtain convergence with probability and in probability, respectively; stronger than convergence in distribution in CLT. Why?
The difference is that in WLLN/SLLN, sample means converge to a deterministic value rather than a random variable as in CLT. Since CLT also requires finite variance, CLT gives a stronger result than WLLN/SLLN. That is, WLLN/SLLN only tell that sample means converge to a deterministic value but WLLN/SLLN do not tell how sample means converge to the deterministic value (in what distribution?). CLT tells that the sample mean is asymptotically Gaussian distributed.
Implication of CLT: The aggregation of random effects follows Gaussian distribution. We can use Gaussian approximation/assumption in practice and enjoy the ease of doing math with Gaussian r.v.'s.
Proof1, Proof2
What's the intuition of CLT? Why do we have this phenomenon (sample means converge to a Gaussain r.v.)?

Non-identical case: If {x_i} are independent but not necessarily identically distributed, and if each x_i << \sum_{i=1}^n x_i for sufficiently large n, then

the CDF of [(\sum_{i=1}^n x_i /n) - m]/(\sigma/\sqrt{n}) converges to a Gaussian distribution with mean 0 and unity variance.

Non-independent case:
There are some theorems which treat the case of sums of non-independent variables, for instance the m-dependent central limit
theorem, the martingale central limit theorem and the central limit theorem for mixing processes.

How to characterize the correlation structure of a stochastic process?

auto-correlation function R(t1,t2)=E[X(t1)X(t2)]

For wide-sense (covariance) stationary process, R(\tau) = R(t1,t1+\tau) for all t1 \in R.

If the process is a white noise with zero mean, R(\tau) is a Dirac delta function, the magnitude of which is the double-sided power spectrum density of the white noise. Note that the variance of a r.v. at any time in a white process is infinity.
If R(\tau) is a Dirac delta function, then the r.v.'s at any two different instant are orthogonal.

In discrete time, we have similar conclusions:

If a random sequence consists of i.i.d. r.v.'s, R(n) is a Kronecker delta function, the magnitude of which is the second moment of a r.v.
If R(n) is a Kronecker delta function, then the r.v.'s at any two different instant are orthogonal.

R(t1,t2) characterizes orthogonality between a process' two r.v.'s at different instant.
Cross-reference: temporal (time) autocorrelation function of a deterministic process (which is energy limited):

R(\tau)= \int_{-infinity}^{+infinity} X(t)*X(t+\tau) dt
Discrete time: R(n) = \sum_{i=-infinity}^{+infinity} X(i)*X(i+n)

auto-covariance function C(t1,t2)=E[(X(t1)-E[X(t1)])(X(t1)-E[X(t1)])]

For wide-sense (covariance) stationary process, C(\tau) = C(t1,t1+\tau) for all t1 \in R.

If the process is a white noise, C(\tau) is a Dirac delta function, the magnitude of which is the double-sided power spectrum density of the white noise.
If C(\tau) is a Dirac delta function, then the r.v.'s at any two different instant are (linearly) uncorrelated.

In discrete time, we have similar conclusions:

If a random sequence consists of i.i.d. r.v.'s, C(n) is a Kronecker delta function, the magnitude of which is the variance of a r.v.
If C(n) is a Kronecker delta function, then the r.v.'s at any two different instant are uncorrelated.

C(t1,t2) characterizes linear correlation between a process' two r.v.'s at different instant.

C(t1,t2)>0: positively correlated
C(t1,t2)<0: negatively correlated
C(t1,t2)=0: uncorrelated

mixing coefficient:

Why is Toeplitz matrix important?

The covariance matrix of any wide-sense stationary discrete-time process is Toeplitz.
A n-by-n Toeplitz matrix T = [t_{i,j}], where t_{i,j}=a_{i-j}, and a_{-(n-1)}, a_{-(n-2)}, ... a_{n-1} are constant numbers. Only 2n-1 numbers are enough to specify a n-by-n Toeplitz matrix.
In a word, the diagonals of a Toeplitz matrix is constant (constant-along-diagonals).
Toeplitz matrix is constant-along-diagonals; circulant matrix is constant-along-diagonals and symmetric along diagonals.

Toeplitz matrix: "right shift without rotation"; circulant matrix: "right shift with rotation".

Why is circulant/cyclic matrix important?

Circulant matrices are used to approximate the behavior of Toeplitz matrices.
A n-by-n circulant matrix C=[c_{i,j}], where each row is a cyclic shift of the row above it. Denote the top row by {c_0, c_1, ..., c_{n-1}}. Other rows are cyclic shifts of this row. Only n numbers are enough to specify a n-by-n circulant matrix.
Circulant matrices are an especially tractable class of matrices since their inverse, product, and sums are also circulants and it is straightforward to construct inverse, product, and sums of circulants. The eigenvalues of such matrices can be easily and exactly found.

Empirical/sample/time average (mean)
Borel-Cantelli theorem

The first Borel-Cantelli theorem

If sum_{n=1}^{\infty} Pr{A_n}<\infty, then Pr{limsup_{n goes to \infty} A_n}=0.
Intuition (using a queue with infinite buffer size): if the sum of probability of queue length=n is finite, then the probability that the actual queue length is infinite is 0, i.e., the actual/expected queue length is finite with probability 1.
E[Q]=sum_{n=1}^{\infty} Pr{Q>=n}, this is another way to compute expectation.

The second Borel-Cantelli theorem

If {A_n} are independent and sum_{i=1}^n Pr{A_i} diverges, then Pr{limsup_{n goes to \infty} A_n}=1.

1. First-order result (about mean): WLLN/SLLN

Question: Can we use the sample mean to estimate the expectation?

WLLN studies the sufficient conditions for or (in probability). Let. If WLLN is satisfied for every , the estimator is called a consistent estimator. .

(Markov) If , then ; i.e., convergence in mss implies convergence in probability.

For wide sense stationary LRD process with and , we have . So in this LRD case, sample mean is a consistent estimator.

(Khintchin) If are iid random variables with (and possibly infinite variance), then . Note that here converges in probability even if does not converge in mss (infinite variance case).

SLLN studies the sufficient conditions for (with probability 1).

(Kolmogorov) If are iid random variables with (and possibly infinite variance), then . Note that here converges with probability 1 even if does not converge in mss (infinite variance case).

2. Second-order result (about variance): convergence in mss and CLT

Mean ergodic theorem: .
Mean ergodic theorem (wide sense stationary, WSS): Let be WSS with . If and covariance, then .

For LRD process with and , we have . So in this LRD case, as the number of samples increases, the variance of the estimator (sample mean) reduces.
LRD with covariance matrix has , i.e., the variance of the estimator does not reduce due to more samples.
LRD with covariance matrix has , i.e., the variance of the estimator does not reduce to 0 due to infinite samples.
SRD process always has since implies .
Note that mean ergodic theorem only tells the condition for the variance of the sample mean to converge but does not tell the convergence rate. We expect a good estimator has a fast convergence rate, i.e., for a fixed n, it should have small variance; in terms of convergence rate, we know the relation iid>SRD>LRD holds, i.e., iid sequence has the highest convergence rate.

CLT:

Motivation: WLLN tells the first-order behavior of the estimator, sample mean; i.e., sample means converge to the expectation and we have unbiased estimator since Mean ergodic theorem tells the second-order behavior of the estimator; i.e., the variance of the sample means converges to 0. The next question is: what about the distribution of the estimator? This is answered by CLT.
(Lindeberg-Levy CLT): If are iid random variables with and , then , where Y is a normal random variable with mean and variance .

Denote . I.e., , where is a normal random variable with zero mean and unity variance.

How to use CLT: given the normal distribution of the estimator (sample means), we can evaluate the performance of the estimator (e.g., confidence interval). That is, CLT provides approximations to, or limits of, performance measures (e.g., confidence interval) for the estimator, as the sample size gets large. So, we can compute the confidence interval according to the normal distribution.

Since , we also have . Then we can compute the confidence interval of .

CLT does not apply to self-similar processes since the randomness does not average out for self-similar processes. The sample means have the same (non-zero) variance; i.e., even if the number of samples goes to infinity, the variance of the sample mean does not go to zero.
CLT only use the expectation and variance of the sample mean since the sample mean is asymptotically normal. High order statistics of the sample mean are ignored. Why is the limiting distribution Gaussian? In the proof using moment-generating function, we see that the high-order terms are gone since first order and second order statistics dominate the value of the moment-generating function as n goes to infinity. Intuitively, first order and second order statistics can completely characterize the distribution in the asymptotic region, since the randomness is averaged out; we only need to capture the mean and variance (energy) of the sample mean in the asymptotic domain. Gaussian distribution is maximum entropy distribution under the constraints on mean and variance.
What about LRD, subexponential distribution, heavy-tail distribution?

Large deviation theory

Motivation: WLLN tells But it does not tell how fast it converges to zero. Large deviation theory characterizes the convergence rate. Why is this important? Because in many applications, we need to compute the probability of a rare event like where is large.
Large deviation principle (LDP) is about computing a small probabilityfor sufficiently large n, more specifically, or for a>E[X]. LDP characterizes the probability of a large deviation of in the order about the expectation n*E[X]; i.e., a large deviation of in the order about the expectation E[X].
In contrast, CLT is for computing . It characterizes the probability of a small deviation in the order about the expectation. As n goes to infinity, goes to 0; so it is a small deviation with respect to a fixed expectation. On the other hand, this deviation is in the same order of the variance of the sample mean, so that this deviation normalized by the variance is still finite.
For LRD process with and , even though we have WLLN, i.e., , the probability does not satisfy large deviation principle (i.e., the convergence rate is not exponential). We need to compress the time axis to obtain large deviation principle.

White Gaussian noise:

The auto-correlation function: .
The power spectral density:
The marginal PDF is Gaussian with zero mean, infinite variance. If the white Gaussian noise passes a filer with frequency response H(f) in frequency band [-B,B], the resulting marginal PDF is Gaussian with zero mean, variance .

Given a state equation and observation equation of a system,

x_{n+1} = A* x_n + v_n

y_n = B * x_n

where A and B are invertible square matrices, and v_n are i.i.d. and independent of x_n, then the observation y_n is a Markov chain (not a general HMM). Since y_{n+1}=B* x_{n+1} = B*A* x_n + B* v_n = B*A* B^{-1} * y_n + B* v_n, hence y_n is a Markov chain, i.e., the next state is independent of the past, given the current state.