Theoretical Foundation

In this section we study the distribution of the $ i$-th gene and the construction of a confidence interval for the localization parameter associated with that distribution.

Let $ \boldsymbol{\beta}$ be the set of $ N$ individuals with $ p$ genes that make up the population and $ \boldsymbol{\beta^*} \subset \boldsymbol{\beta}$ the set of the best $ n$ individuals. If we assume that the genes $ \beta^*_i$ of the individuals belonging to $ \boldsymbol{\beta^*}$ are independent random variables with a continuous distribution $ H(\beta^*_i)$ with a localization parameter $ \mu_{\beta_i^*}$, we can define the model

$\displaystyle \beta^*_i=\mu_{\beta_i^*}+e_i, \quad \text{for } i=1,\ldots,p,$ (1)

where $ e_i$ is a random variable. If we suppose that, for each gene $ i$, the best $ n$ individuals form a random sample $ \{\beta_{i,1}^*,\beta_{i,2}^*,\ldots,\beta_{i,n}^*\}$ of the distribution of $ \beta^*_i$, then the model takes the form

$\displaystyle \beta^*_{ij}=\mu_{\beta_i^*}+e_{ij}, \quad \text{for } i=1,\ldots,p \text{ and } j=1,\ldots,n.$ (2)
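The indexing in model (2) can be illustrated with a minimal Python sketch. It assumes the population is stored as a list of individuals (each a list of $ p$ gene values) and that "best" means lowest value of a fitness function; the function names `gene_samples` and `sphere`, the fitness itself, and the sample data are hypothetical and not taken from the text.

```python
from typing import Callable, List, Sequence


def gene_samples(
    population: Sequence[Sequence[float]],
    fitness: Callable[[Sequence[float]], float],
    n: int,
) -> List[List[float]]:
    """Return, for each gene i, the sample [beta*_{i,1}, ..., beta*_{i,n}].

    Assumption: lower fitness is better, so the best n individuals are
    the n with the smallest fitness values.
    """
    best = sorted(population, key=fitness)[:n]
    # Transpose: row j of `best` is an individual, column i is gene i.
    return [list(column) for column in zip(*best)]


def sphere(individual: Sequence[float]) -> float:
    """Hypothetical fitness used only for this illustration."""
    return sum(x * x for x in individual)


# N = 4 individuals with p = 2 genes each (made-up values).
population = [
    [0.9, 2.1],
    [1.1, 1.9],
    [3.0, -1.0],
    [1.0, 2.0],
]

samples = gene_samples(population, sphere, n=3)
```

Here `samples[i]` plays the role of the random sample of the $ i$-th gene, with one entry per selected individual $ j$.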

Using this model, we analyze an estimator of the localization parameter for the $ i$-th gene based on the minimization of the dispersion function induced by the $ L_2$ norm. The squared $ L_2$ norm of the error vector $ e_i=(e_{i1},\ldots,e_{in})$ is defined as

$\displaystyle \Vert e_i\Vert _2^2 = \sum_{j=1}^n(e_{ij})^2,$ (3)

hence the dispersion induced by the $ L_2$ norm in model (2) is

$\displaystyle D_2(\mu_{\beta_i^*})=\sum_{j=1}^n(\beta_{ij}^* - \mu_{\beta_i^*})^2,$ (4)

and the estimator of the localization parameter $ \mu_{\beta_i^*}$ is:

$\displaystyle \hat{\mu}_{\beta_i^*}=\arg\min D_2(\mu_{\beta_i^*})= \arg\min \sum_{j=1}^n(\beta_{ij}^* - \mu_{\beta_i^*})^2.$ (5)

Taking the steepest-descent direction, that is, the negative gradient of the dispersion,

$\displaystyle S_2(\mu_{\beta_i^*})=-{\partial D_2(\mu_{\beta_i^*}) \over \partial \mu_{\beta_i^*}},$ (6)

we obtain

$\displaystyle S_2(\mu_{\beta_i^*}) = 2 \sum_{j=1}^n (\beta_{ij}^* - \mu_{\beta_i^*}),$ (7)

and setting (7) equal to 0 gives $ \sum_{j=1}^n \beta_{ij}^* = n\,\mu_{\beta_i^*}$, which yields

$\displaystyle \hat{\mu}_{\beta_i^*} = {\sum_{j=1}^n \beta_{ij}^* \over n} = \bar{\beta}_i^*.$ (8)

Thus, the estimator of the localization parameter for the $ i$-th gene based on the minimization of the dispersion function induced by the $ L_2$ norm is the sample mean of $ \beta^*_i$ over the best $ n$ individuals [KS77], that is, $ \hat{\mu}_{\beta_i^*}=\bar{\beta}_i^*$.
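As a quick numerical check of this result, the following Python sketch evaluates the dispersion $ D_2$ of equation (4) at the sample mean and at a few nearby candidate values. The function names and the sample values are illustrative assumptions only.

```python
from typing import Sequence


def dispersion(sample: Sequence[float], mu: float) -> float:
    """D_2(mu): sum of squared deviations of the sample from mu, as in (4)."""
    return sum((x - mu) ** 2 for x in sample)


def l2_location(sample: Sequence[float]) -> float:
    """Closed-form minimizer of D_2: the sample mean, as in (8)."""
    return sum(sample) / len(sample)


# Made-up values of one gene across the best n = 4 individuals.
sample = [1.1, 1.0, 0.9, 1.4]
mu_hat = l2_location(sample)

# Moving away from the sample mean in either direction should not
# decrease the dispersion.
candidates = [mu_hat + d for d in (-0.1, -0.01, 0.01, 0.1)]
is_minimum = all(
    dispersion(sample, mu_hat) <= dispersion(sample, c) for c in candidates
)
```

The check only probes a handful of points, so it illustrates the closed form rather than proving it; the proof is the derivation in (6) to (8).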

The sample mean is a linear estimator, so it is unbiased and consistent, and it follows a normal distribution $ N(\mu_{\beta^*_i},\sigma^2_{\beta^*_i}/n)$ when the distribution of the genes $ H(\beta^*_i)$ is normal. Under this hypothesis, we construct a two-sided confidence interval for the localization of the genes of the best $ n$ individuals, using the studentization method, the mean as the localization parameter, and the standard deviation $ S_{\beta_i^*}$ as the dispersion parameter:

$\displaystyle I^{CI} = \left[ \bar{\beta}_i^* - t_{n-1,\alpha/2}{S_{\beta_i^*} \over \sqrt{n}}; \bar{\beta}_i^* + t_{n-1,\alpha/2} {S_{\beta_i^*} \over \sqrt{n}} \right]$ (9)

where $ t_{n-1,\alpha/2}$ is the critical value of Student's $ t$ distribution with $ n-1$ degrees of freedom that leaves probability $ \alpha/2$ in the upper tail, and $ 1-\alpha $ is the confidence coefficient, that is, the probability that the interval contains the true value of the population mean.
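A minimal Python sketch of interval (9) follows. It assumes that $ S_{\beta_i^*}$ is the sample standard deviation with denominator $ n-1$, and it takes the critical value $ t_{n-1,\alpha/2}$ as an argument supplied by the caller (for example from a table) rather than computing it. The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def confidence_interval(
    sample: Sequence[float], t_crit: float
) -> Tuple[float, float]:
    """Two-sided interval: mean -/+ t_crit * S / sqrt(n), as in (9).

    `t_crit` is t_{n-1, alpha/2}; the caller must supply the value that
    matches len(sample) - 1 degrees of freedom and the chosen alpha.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("at least two observations are required")
    centre = mean(sample)
    # statistics.stdev uses the n - 1 denominator (assumed form of S).
    half_width = t_crit * stdev(sample) / sqrt(n)
    return centre - half_width, centre + half_width


# Made-up values of one gene across the best n = 5 individuals.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]

# Tabulated critical value for 4 degrees of freedom and alpha = 0.05.
T_CRIT_4_DF_95 = 2.776

low, high = confidence_interval(sample, T_CRIT_4_DF_95)
```

Passing the critical value in explicitly keeps the sketch free of external dependencies; a statistics library could supply the quantile instead.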

Domingo 2005-07-11