Theoretical Foundation

In this section we study the distribution of the $i$-th gene and the construction of a confidence interval for the localization parameter associated with that distribution.

Let $\boldsymbol{\beta}$ be the set of individuals with genes that make up the population and $\boldsymbol{\beta^*} \subset \boldsymbol{\beta}$ the set of the best individuals. If we assume that the genes $\beta^*_i$ of the individuals belonging to $\boldsymbol{\beta^*}$ are independent random variables with a continuous distribution $H(\beta^*_i)$ with a localization parameter $\mu_{\beta_i^*}$, we can define the model

$\displaystyle \beta^*_i=\mu_{\beta_i^*}+e_i, \quad \text{for} \quad i=1,\ldots,p,$

(1)

where $e_i$ is a random variable. If we suppose that, for each gene $i$, the $n$ best individuals form a random sample $\{\beta_{i,1}^*,\beta_{i,2}^*,\ldots,\beta_{i,n}^*\}$ of the distribution of $\beta^*_i$, then the model takes the form

$\displaystyle \beta^*_{ij}=\mu_{\beta_i^*}+e_{ij}, \quad \text{for} \quad i=1,\ldots,p \quad \text{and} \quad j=1,\ldots,n.$

(2)

Using this model, we analyze an estimator of the localization parameter for the $i$-th gene based on the minimization of the dispersion function induced by the $L_2$ norm. The squared $L_2$ norm of the errors is defined as

$\displaystyle \Vert e_i\Vert _2^2 = \sum_{j=1}^n(e_{ij})^2,$

(3)

hence the associated dispersion induced by the $L_2$ norm in model (2) is

$\displaystyle D_2(\mu_{\beta_i^*})=\sum_{j=1}^n(\beta_{ij}^* - \mu_{\beta_i^*})^2,$

(4)

and the estimator of the localization parameter $\mu_{\beta_i^*}$ is:

$\displaystyle \hat{\mu}_{\beta_i^*}=\arg\min D_2(\mu_{\beta_i^*})= \arg\min \sum_{j=1}^n(\beta_{ij}^* - \mu_{\beta_i^*})^2.$

(5)

Using the steepest gradient descent method for the minimization,

$\displaystyle S_2(\mu_{\beta_i^*})=-{\partial D_2(\mu_{\beta_i^*}) \over \partial \mu_{\beta_i^*}},$

(6)

we obtain

$\displaystyle S_2(\mu_{\beta_i^*}) = 2 \sum_{j=1}^n (\beta_{ij}^* - \mu_{\beta_i^*}),$

(7)

and setting (7) equal to 0 yields

$\displaystyle \hat{\mu}_{\beta_i^*} = {\sum_{j=1}^n \beta_{ij}^* \over n} = \bar{\beta}_i^*.$

(8)

So, the estimator of the localization parameter for the $i$-th gene based on the minimization of the dispersion function induced by the $L_2$ norm is the sample mean of the observations $\beta^*_{ij}$ [KS77], that is, $\hat{\mu}_{\beta_i^*}=\bar{\beta}_i^*$.
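
As a numerical check of (5)-(8), the following Python sketch minimizes $D_2$ by gradient descent for a single gene and compares the result with the sample mean. The sample values, the step size, the iteration count, and the function names are illustrative assumptions and are not taken from the original method.

```python
from statistics import fmean


def dispersion_gradient(sample, mu):
    """S_2(mu) from (7): the negative derivative of D_2 evaluated at mu."""
    return 2.0 * sum(b - mu for b in sample)


def minimize_dispersion(sample, mu0=0.0, step=None, iterations=200):
    """Gradient descent on D_2; step and iterations are illustrative choices."""
    n = len(sample)
    if step is None:
        step = 0.25 / n  # any step in (0, 1/n) converges for this quadratic
    mu = mu0
    for _ in range(iterations):
        # Move along the negative gradient of D_2, i.e. along S_2(mu).
        mu += step * dispersion_gradient(sample, mu)
    return mu


# Hypothetical values of one gene across the n = 5 best individuals.
gene_sample = [1.2, 0.8, 1.0, 1.4, 0.6]
mu_hat = minimize_dispersion(gene_sample)
sample_mean = fmean(gene_sample)
print(mu_hat, sample_mean)
```

Because $D_2$ is a quadratic in $\mu_{\beta_i^*}$, the iteration contracts toward the unique zero of $S_2$, which is the sample mean in (8); in practice the closed form makes the iteration unnecessary.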

The sample mean is a linear estimator¹ that is unbiased² and consistent³, and it follows a normal distribution $N(\mu_{\beta^*_i},\sigma^2_{\beta^*_i}/n)$ when the distribution of the genes $H(\beta^*_i)$ is normal. Under this hypothesis, we construct a bilateral (two-sided) confidence interval for the localization of the genes of the best individuals, using the studentization method, with the mean as the localization parameter and the standard deviation $S_{\beta_i^*}$ as the dispersion parameter:

$\displaystyle I^{CI} = \left[ \bar{\beta}_i^* - t_{n-1,\alpha/2}{S_{\beta_i^*} \over \sqrt{n}}; \bar{\beta}_i^* + t_{n-1,\alpha/2} {S_{\beta_i^*} \over \sqrt{n}} \right]$

(9)

where $t_{n-1,\alpha/2}$ is the critical value of Student's $t$ distribution with $n-1$ degrees of freedom, and $1-\alpha$ is the confidence coefficient, that is, the probability that the interval contains the true value of the population mean.
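
The interval (9) can be computed for one gene as in the Python sketch below. The function name, the sample values, and the choice $\alpha = 0.05$ are assumptions introduced for illustration. The critical value $t_{4,0.025} \approx 2.776$ (for $n = 5$) is taken from a standard table and passed in explicitly, so the sketch depends only on the standard library; $S_{\beta_i^*}$ is assumed to be the sample standard deviation with divisor $n-1$.

```python
from math import sqrt
from statistics import fmean, stdev


def confidence_interval(sample, t_crit):
    """Two-sided interval (9): mean -/+ t_crit * S / sqrt(n)."""
    n = len(sample)
    center = fmean(sample)
    # stdev uses the n - 1 divisor (assumed definition of S here).
    half_width = t_crit * stdev(sample) / sqrt(n)
    return center - half_width, center + half_width


# Hypothetical values of one gene across the n = 5 best individuals.
gene_sample = [1.2, 0.8, 1.0, 1.4, 0.6]
T_CRIT = 2.776  # tabulated t_{4, 0.025}, i.e. alpha = 0.05 with n - 1 = 4
lower, upper = confidence_interval(gene_sample, T_CRIT)
print(lower, upper)
```

For these illustrative values the interval is roughly $[0.61, 1.39]$, centered on the sample mean $1.0$. A smaller $\alpha$ or a smaller $n$ gives a larger critical value and therefore a wider interval.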

Domingo 2005-07-11