Project GATTACA

Molecular Determinants of a Phenotype

Whenever a new phenotype is recognized (e.g. resistance to a new drug emerges, changes occur in the host range, etc.), it is important to determine the molecular changes responsible for the new phenotype. This process involves several steps. These include identifying the specific protein(s) involved and then locating the specific amino acid changes that are necessary and sufficient for converting a wildtype organism to one with the new phenotype. For well-studied viruses, the first step can be fairly straightforward since the number of possible proteins is small and their functions may all be known. This is especially true in cases of drug resistance where the target protein is known in advance. However, locating the specific substitutions within that protein can be more daunting.

A typical process for determining the relevant locations experimentally involves choosing a single pair of sequences (one wildtype, one mutant), identifying all of the differences between them, and then doing a series of mutation experiments in an attempt to identify the minimal set of differences that can change wildtype to mutant and vice versa. This approach has two drawbacks. One is that it can be costly in both time and resources, especially if there are many differences between the wildtype and the mutant. The other is that it is not guaranteed to find all of the relevant positions, for two reasons. First, it is typically infeasible to test all possible combinations of mutations, so a heuristic search is performed instead. Second, there may be key positions that happen to have the same amino acid in the chosen wildtype and mutant strains and therefore never appear as differences. To address these issues, we propose that computational analysis of all available data prior to experimentation might be able to (1) reduce the number of experiments needed, (2) optimize the selection of experiments to perform, and (3) identify a wider range of key positions.

Our computational procedure involves an exhaustive search through all possible explanations (or hypotheses) for the change in phenotype. Each hypothesis consists of a particular set of positions in the protein and optionally a particular condition on the amino acid in each position. For example, one hypothesis might be stated as "alanine at position 42 and a hydrophobic amino acid at position 178". For any reasonably sized protein, the space of all such hypotheses can get quite large, so we use biological knowledge to further restrict the hypothesis space. For a particular phenotype, it may be known that the mutations must be within a particular region, or that only a small number of changes are sufficient. For example, if the phenotype involves antibody binding, we can use knowledge about antibody-antigen interactions to say that the total number of amino acids involved should be on the order of 10, that these should be restricted to a few contiguous regions, and that only 2-4 of the 10 amino acids are critical to the binding.
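The two kinds of hypothesis described above can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the names `matches` and `position_hypotheses`, the 1-based position convention, the toy sequences, and the particular set of residues treated as hydrophobic are all assumptions made for the example.

```python
from itertools import combinations

# Assumed grouping of hydrophobic residues; other classifications exist.
HYDROPHOBIC = set("AVILMFWC")

def matches(hypothesis, sequence):
    """Return True if the sequence meets every condition in the hypothesis.

    `hypothesis` maps a 1-based position to the set of amino acids allowed there.
    """
    return all(sequence[pos - 1] in allowed for pos, allowed in hypothesis.items())

def position_hypotheses(region, max_size):
    """Enumerate position-only hypotheses: every subset of `region`
    containing between 1 and `max_size` positions."""
    for size in range(1, max_size + 1):
        for subset in combinations(region, size):
            yield frozenset(subset)

# Toy analogue of "alanine at position 42 and a hydrophobic amino acid at
# position 178", scaled down to a six-residue sequence (hypothetical data).
example = {2: {"A"}, 5: HYDROPHOBIC}
print(matches(example, "MAKTLG"))  # position 2 is A, position 5 is L
print(matches(example, "MAKTDG"))  # position 5 is D, which is not hydrophobic

# Restricting the space: only positions 1-6, at most 2 positions per hypothesis.
space = list(position_hypotheses(range(1, 7), max_size=2))
print(len(space))  # 6 single positions + 15 pairs = 21
```

Limiting the region and the maximum number of positions is what keeps the enumeration tractable; without such restrictions the number of subsets grows exponentially with protein length.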

Once the hypothesis space has been defined, our algorithm goes through every wildtype/mutant pair of sequences and eliminates any hypotheses that are inconsistent with the data. The definition of a consistent hypothesis depends on the hypothesis space. If the hypotheses are fully specified functions, meaning that they include both positions and conditions on those positions so that they can assign a label to any given sequence, then it is simply a matter of testing whether the labels generated by the hypothesis are consistent with the actual labels. If a hypothesis only specifies positions, then we currently consider it consistent if there is at least one difference between a given wildtype/mutant pair at the specified positions. In either case, once all pairs have been analyzed, the remaining hypotheses can provide an indication as to which positions are relevant. One way of seeing this is to count the remaining hypotheses that include a particular position and plot that count for each position. An example of this is seen in the figure below. This information can then be used to determine which experiments would be the most informative to perform in the lab. The results of those experiments can be fed back into the algorithm to further refine the hypothesis space and to generate a new set of experiments; this process can be repeated until an answer is determined.
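For position-only hypotheses, the elimination step and the per-position count can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the function names, the toy five-residue sequences, and the requirement that sequences are already aligned and of equal length are all hypothetical choices for the example.

```python
from collections import Counter
from itertools import combinations

def differing_positions(wildtype, mutant):
    """1-based positions where two aligned, equal-length sequences differ."""
    return {i + 1 for i, (a, b) in enumerate(zip(wildtype, mutant)) if a != b}

def filter_hypotheses(hypotheses, pairs):
    """Keep only the position-only hypotheses that are consistent with every
    pair, i.e. that include at least one position where the pair differs."""
    remaining = set(hypotheses)
    for wildtype, mutant in pairs:
        diffs = differing_positions(wildtype, mutant)
        remaining = {h for h in remaining if h & diffs}
    return remaining

def position_support(hypotheses):
    """Count how many of the given hypotheses include each position."""
    return Counter(pos for h in hypotheses for pos in h)

# Hypothetical toy data: all hypotheses of 1 or 2 positions over 5 positions,
# and two wildtype/mutant pairs.
hypotheses = [frozenset(c) for size in (1, 2)
              for c in combinations(range(1, 6), size)]
pairs = [("ACDEF", "AGDEY"),   # differs at positions 2 and 5
         ("ACDEF", "AGKEF")]   # differs at positions 2 and 3

remaining = filter_hypotheses(hypotheses, pairs)
support = position_support(remaining)
for position in sorted(support):
    print(position, support[position])
```

In this toy run, 6 of the 15 hypotheses survive, and position 2 appears in 5 of them, so it stands out in the per-position count in the same way a peak would in the plot described above.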

Figure: Reduction of the Hypothesis Space with Increasing Amounts of Data

(A) Remaining Hypotheses after Incremental Data Analysis. Hypotheses remaining after 1, 10, 50, and 100 +/- pairs analyzed.

(B) Remaining Hypotheses after All Data is Analyzed.

These figures show results from a proof-of-concept test of the algorithm. The phenotype in question is the ability of HIV-1 sequences to be neutralized by particular antibodies. The data include 711 HIV-1 Env sequences that are resistant to neutralization and 5 sequences that are susceptible. The hypotheses are defined in terms of positions and are restricted to be consistent with the known biology of antibody-antigen interactions. The figures plot the positions in the protein on the x-axis and the number of hypotheses that include that position on the y-axis. Figure (A) shows how the hypothesis space is reduced as increasing amounts of data are run through the algorithm. Figure (B) shows that after all of the data is analyzed, the hypothesis space is significantly reduced and converges on a fairly small number of positions.
