Previous Up Next

Chapter 4  Non-Binary Trees

Notung can fit a binary gene tree to a binary species tree, a binary gene tree to a non-binary species tree, or a non-binary gene tree to a binary species tree. Currently, Notung cannot compare non-binary gene trees with non-binary species trees. For a complete listing of the functions that Notung is able to perform on binary and non-binary trees, see Table 1.1.

Interpreting disagreement between gene and species trees as evidence of gene duplication and loss is widely accepted when both trees are binary. Disagreement between non-binary trees is less well-understood and there is no universally accepted approach to non-binary reconciliation. In this chapter, we briefly review current theory regarding non-binary nodes in gene and species trees and discuss how we apply these theories in Notung. If you plan to analyze either non-binary gene trees or non-binary species trees using Notung, it is recommended that you read this chapter. If you will be working solely with binary trees, you may skip ahead to the chapters describing the specific tasks you wish to perform.

For a more detailed description of Notung’s algorithm for reconciliation with non-binary species trees, see:

B. Vernot, M. Stolzer, A. Goldman, and D. Durand. Reconciliation with non-binary species trees. Journal of Computational Biology, in press, 2008.

More information on the algorithmics of reconciling, rearranging and resolving non-binary gene trees is available in:

D. Durand, B. V. Halldorsson, B. Vernot. A Hybrid Micro-Macroevolutionary Approach to Gene Tree Reconstruction. Journal of Computational Biology, 13(2):320-335, 2006.

4.1  Non-Binary Trees

A non-binary, or multifurcating, tree is a tree in which at least one node has more than two children. Such nodes are referred to as polytomies, or non-binary nodes. A polytomy can have several meanings [9]. In Notung, polytomies are represented as vertical edges with more than two children. See, for example, the polytomy in Figure 4.1.

Click on image to see larger version


Figure 4.1: Notung displays trees as cladograms. Polytomies are drawn as vertical edges with more than two children. This tree contains only one polytomy, indicated by the arrow.

A hard polytomy represents the true, simultaneous divergence of all its children. A soft polytomy, on the other hand, refers to the situation where the true pattern of divergence is binary, but there is not enough signal in the data to determine the true branching order. Soft polytomies often occur if a sequence of binary divisions proceeds rapidly and the time between these events is insufficient to accumulate informative variation.

Reconciliation relies on the observation that discordance between a binary gene tree and a binary species tree is evidence that genes diverged through processes other than speciation. These processes include gene duplication and loss, horizontal gene transfer, and incomplete lineage sorting.

Horizontal gene transfer, the transmission of genetic material from an organism in one species to the genome of an organism in another species, is a common phenomenon in prokaryotes. The extent and importance of horizontal transfer in eukaryotes is less well-understood. Like most reconciliation software, Notung does not consider horizontal gene transfer as an explanation for disagreement when reconciling binary or non-binary trees. If you believe that horizontal gene transfer played a significant role in the data set that you plan to analyze, you may wish to consider other analysis tools.

Incomplete lineage sorting refers to discordance between gene and species trees resulting from allelic variation. Since a node in the species tree represents the evolution of a population of organisms with genetic diversity, multiple alleles may be present at the locus of interest. When lineages diverge, a different allele may fix in each lineage. The resulting gene tree will be binary and will reflect the order in which new alleles arose in the ancestral population. This pattern of divergence in the genetic lineage may not correspond to the pattern of divergence in the species lineage. For example, Figure 4.2 shows three different binary branching processes of a gene tree in the context of a species polytomy.

Click on image to see larger version


Figure 4.2: Three possible outcomes of the evolution of a single genetic locus in the context of a population. Different gene families associated with the same species polytomy may have different binary branching patterns.

A true divergence between two genetic lineages corresponds to the point where allelic differences arose, not the time of speciation. Genetic divergence that greatly predates the time of speciation is referred to as deep coalescence. In Figure 4.2, for example, the divergence at x occurs much earlier than the separation of species A, B, and C, and represents deep coalescence.

The probability of incomplete lineage sorting decreases as the time between speciation events increases [12, 15, 10, 14, 8]. If branch lengths in the species tree are sufficiently long, the effect of incomplete lineage sorting on discordance between gene and species trees is negligible, and does not need to be considered. However, when the species tree is non-binary, incomplete lineage sorting is a plausible explanation for tree disagreement.

In the next section, we discuss how Notung deals with incomplete lineage sorting when reconciling binary gene trees with non-binary species trees. In the section following this, we discuss how Notung considers the multiple, possible binary histories represented by a polytomy in a gene tree and presents the most parsimonious set of events.

4.2  Fitting Binary Gene Trees to Non-Binary Species Trees

Since a species tree represents the evolution of a population of organisms, a polytomy may be either hard or soft. Hard polytomies (i.e., simultaneous divergences of three or more lineages) can result from several events, such as the isolation of subpopulations within a widespread species by sudden meteorological or geological events, or from rapid expansion of the population into open territory, resulting in reproductive isolation. Soft polytomies are frequently encountered in species trees, resulting from insufficient evidence for any particular binary branching pattern. Non-binary species trees may be common; for example, 64% of branch points in the NCBI Taxonomy Database [2] have three or more children.

Notung assumes that the probability of incomplete lineage sorting is negligible when a node in the species tree is binary. In this case, disagreement between the trees is interpreted as evidence for gene duplication or loss. In contrast, incongruence between a binary node in a gene tree and a non-binary node in a species tree can be evidence of either deep coalescence or gene duplication.

When the species tree is non-binary, Notung considers two different scenarios: cases in which disagreement can only be explained by a gene duplication (required duplications) and cases in which it is not possible to determine whether the disagreement is due to deep coalescence or gene duplication (conditional duplications). Both of these cases are illustrated in Figure 4.3.

Click on image to see larger version


   
Figure 4.3: Black squares with a “D” indicate (required) duplications. Losses are represented by dotted lines. (a) A marsupial species tree with a polytomy. (b) The phylogeny of a hypothetical gene family sampled from the same marsupial species. (c) Hypothesis 1: the disagreement between (a) and (b) can be explained by deep coalescence (node x), followed by gene duplication (node y). (d) Hypothesis 2: the disagreement between (a) and (b) can also be explained by duplication at x, followed by gene loss, followed by duplication at y. (e) The divergence at x is designated a conditional duplication (gray square) because it is not possible to determine whether the disagreement is due to duplication or deep coalescence. The divergence at y is a required duplication.

Notung implements a novel reconciliation algorithm  for non-binary species trees that distinguishes between required and conditional duplications and reports them separately.

Inferring loss events is also fundamentally different when the species tree is non-binary. When both trees are binary, an inferred loss node can always be unambiguously assigned to a specific edge in the gene tree, indicating when in the history of the gene family the loss occurred. The node is labeled with the species in which the loss occurred. However, when a loss is associated with a polytomy in the species tree, it is not generally possible to assign the loss to a single edge in the gene tree. Rather, the loss can be associated with a set of candidate edges, each of which corresponds to an alternate hypothesis regarding when the loss occurred. The inferred loss must have occurred on one of the edges in this set, but it is not possible to determine which one. Figure 4.4 shows an example of this ambiguity when assigning a gene loss in species A. This loss could be associated with any of the three colored edges indicated in Figure ??b. The three hypotheses resulting from the three possible ways of assigning the loss to an edge can be seen in Figure ??.

Click on image to see larger version


   
Figure 4.4: Losses associated with a polytomy in the species tree are ambiguous. (a) A species tree with a polytomy. (b) A gene tree drawn from the species in (a), with a loss in species A. (c) This loss can be assigned to three possible edges. Associating a loss with the green edge implies that g_A diverged first and was then lost. the blue edge implies that g_A was lost after the divergence of g_C; the red edge implies that A was lost after g_B diverged.

In a complex reconciliation with several losses, there may be many alternative hypotheses (i.e., reconciliations with different loss histories) to consider. Notung uses duplication-loss parsimony to reduce the number of candidate reconciliations. Specifically, Notung assigns each loss to a specific edge within the set of candidates, with a goal to minimize the total number of losses.

This total number of losses depends on two factors. The first is the position of the loss relative to duplications in the gene tree. Assigning a loss to an edge above a duplication implies that the loss occurred before the duplication, and only one loss is inferred. However, assigning the loss to an edge below the duplication implies that the duplication occurred first. Thus, two losses are inferred – one for each duplicated copy. Second, in some circumstances, losses in sibling species can be more parsimoniously explained by a loss in their common ancestor. The total number of losses may be reduced by assigning losses in such a way to maximize the number of cases where multiple losses can be replaced by a single loss in an ancestral species. These two factors are not independent of one another. Assigning a loss below a duplication will usually increase the total number of losses. However, in some cases, these “duplicated” losses may be combined with other losses assigned to edges below that duplication, thus reducing the total number of losses.

Two algorithms for inferring losses, one exact and the other a heuristic, have been implemented in Notung. Both algorithms are integrated with the algorithm to identify required and conditional duplications. The exact algorithm infers a history with the fewest losses, taking both of the above considerations into account. This algorithm is computationally intensive because all possible combinations of loss assignments must be considered. Its worst case running time is an exponential function of the size of the largest polytomy in the pruned species tree. In practice, the exact algorithm performs efficiently on non-binary species trees with small polytomies. However, users should be prepared for extended running times if the species tree has a polytomy with more than 12 children.

The heuristic runs significantly faster than the exact algorithm and yields the same results in many, if not most, cases. It returns only one reconciliation, which is not guaranteed to be optimal. However, in a comparison of the two methods on the 1,174 trees from TreeFam, the heuristic found an optimal solution for more than 99% of the trees. Of the seven trees where the heuristic did not find an optimal solution, in the worst case, the number of losses was overestimated by four losses from a total of 249.

NOTE: While the exact algorithm is guaranteed to return a reconciliation with a minimum number of losses, there may be more than one such optimal reconciliation; if so, Notung reports only one.

The interactive version of Notung uses the heuristic to reconcile binary gene trees with non-binary species trees. Both algorithms are available in the command line version. See Chapter 12 - Command Line Options and Batch Processing for information about these options.

4.3  Fitting a Non-Binary Gene Tree to a Binary Species Tree

In a gene tree, each lineage represents a single gene and the result of any divergence is exactly two descendant sequences. Thus, in contrast to species trees, the true branching pattern in a gene tree is always binary [8], and all multifurcations are soft polytomies. For this reason, non-binary gene trees are also referred to as unresolved trees. Some phylogeny reconstruction programs output non-binary gene trees when the true binary branching process cannot be resolved. Such uncertainty often arises if binary divisions occur too rapidly to accumulate informative variation or if the data set is noisy.

Notung’s approach to reconciling non-binary gene trees rests on the assumption that the children of a polytomy arose through an unknown series of binary divergences. Notung further assumes that, in the absence of other information, the best hypothesis for the true evolutionary history of the children of the polytomy is the binary branching pattern that entails the fewest duplications and losses; there may be more than one such binary resolution of a polytomy. The problem of reconciling non-binary gene trees reduces to finding a binary tree that agrees with the original tree everywhere except at the polytomies and has a minimal D/L Score.

The general approach is as follows: A non-binary gene tree is converted into a binary gene tree by replacing each polytomy with a temporary binary resolution. This resolution is optimal under duplication-loss parsimony, when reconciled with the appropriate binary species tree. The resolution is determined by using our rearrangement algorithm [3], which constructs an optimal duplication-loss parsimony tree in polynomial time per tree. Following rearrangement, all nodes and edges not present in the original gene tree are then removed, to obtain a reconciliation of the original non-binary gene tree. As nodes and edges are removed, any duplications or losses assigned to them are reassigned to their associated polytomy.

This process is illustrated in Figure 4.5. The optimal resolution of the polytomy at node z in the gene tree in (b) with the species tree in (a) is shown in the right subtree of (c). This entails one duplication and one loss. This information is mapped onto the original gene tree (b) to obtain the reconciled, non-binary gene tree in (d). The polytomy in the original tree represents uncertainty, as reflected in the reconciliation. The reconciled polytomy in the right subtree of (d) tells us that at least one duplication and one loss occurred in the subtree rooted at z, but the exact order of these events is unknown.

Click on image to see larger version


       
Figure 4.5: (a) A binary species tree. (b) A non-binary gene tree with genes sampled from (a). (c) Binary resolution of gene tree (b), yielding a binary tree with three duplications and three losses. (d) Gene tree (b) reconciled with species tree (a), yielding a non-binary tree with three duplications and four losses. (e) Gene tree (b) following rearrangement. The polytomy has been resolved and the weak edge has been rearranged to eliminate a duplication.

Note that multiple duplications can be assigned to a polytomy in a reconciled non-binary tree. If duplications are inferred on two or more temporary nodes in the optimal binary resolution of a polytomy, the polytomy will be assigned multiple duplications when these nodes are removed from the tree. For example, two duplications are assigned to the polytomy in the reconciled, non-binary gene tree in Figure 4.6. This differs from standard reconciliation, where every node has at most one duplication.

Click on image to see larger version


   
Figure 4.6: (a) A non-binary gene tree. (b) An optimal, binary resolution of gene tree (a) reconciled with species tree in Figure 4.5. (c) The reconciled non-binary, gene tree. The resulting tree has a polytomy with two duplications.

Rooting a non-binary gene tree

Notung can be used to infer the root of an unrooted tree by identifying the root that requires the fewest duplications and losses. In Rooting mode, when the tree is binary, each edge is assigned a root score; i.e., the D/L Score of the tree when rooted on that edge. When the gene tree is non-binary, it is also possible to root the tree on a polytomy, as shown in Figure 4.7. Placing a polytomy at the root of the tree implies that one of the edges in the true binary resolution of the polytomy is the true root.

Click on image to see larger version


   
Figure 4.7: (a) An unrooted, non-binary gene tree. (b) The rooted, binary resolution of (a) with the lowest D/L Score. Rooting the tree on any other edge would entail more duplications and losses. (c) When reconciled with species tree in Figure ??, the polytomy in (a) is the root with minimum cost.

To calculate root scores, Notung roots the tree on each edge and polytomy in turn. For each root, the rearrangement algorithm is applied to ensure that each polytomy is replaced by an optimal binary resolution. The D/L Score of the resulting tree is used as the root score for that rooting. Note that it is necessary to optimize the binary resolutions separately for each root because the D/L Score depends on the location of the root. After all edges and polytomies have been scored, the original tree is reported to the user with edges and polytomies annotated with root scores.

Note that in Reconciliation and Rooting modes, binary resolutions are used to infer duplications and losses, but the structure of the final, output tree is unchanged. In the Rearrangement and Resolve modes, Notung uses duplication-loss parsimony to transform the non-binary input tree into a binary gene tree. Resolve mode is analogous to the reconciliation method described here, with the exception that the final step of removing the added nodes and edges is not performed. The result is a reconciled binary tree that is optimal with respect to duplication-loss parsimony. For the example in Figure 4.5, the Resolve function would return the tree in (c). As there may be more than one optimal resolution, Notung presents the different histories that result in the optimal tree. See Chapter 8 - Resolve Mode for more information.

In Rearrangement mode, the rearrangement algorithm is applied not only to edges added to the tree in the resolution of polytomies, but to all edges with an edge weight below the edge weight threshold. The result is a reconciled, binary tree in which weak edges have been rearranged to minimize the D/L Score. Figure 4.5(e) shows the rearrangement of the non-binary gene tree in (b), assuming an edge weight threshold of 90.


Previous Up Next