Appendix A  File Formats

Notung can save trees in three different file formats: Newick file format, NHX file format, and Notung file format.

Newick file format specifies tree topology and node labels, but cannot be used to save reconciliation information or information about the species tree with which the gene tree was reconciled.

NHX and Notung file formats use the Newick comment field to store additional information not captured in the standard Newick specification. A reconciliation involves a gene tree, a species tree, the mapping from gene tree to species tree, and the inferred duplications and losses. Newick format stores only the gene tree. NHX format can store a gene tree, with additional information to indicate which nodes are duplications. Notung file format can store a gene tree, the species tree with which it was reconciled, and duplication and loss nodes. If you save a reconciled tree in Notung format, it will still be reconciled when you next open it in Notung.

The Notung file format holds more information, but may not be compatible with other software packages that use Newick format. The formal specification of Newick file format allows bracket-delimited comments. Programs that follow the formal specification and ignore information stored in comments will be able to read NHX or Notung format trees. However, not all programs allow comments. If you plan to use a program that does not allow Newick comments to further analyze trees saved by Notung, save your trees in standard Newick format.
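If a tree with NHX-style comments must be passed to a program that rejects comments, and resaving from Notung is not convenient, the comments can in principle be stripped from the text. The following is a minimal sketch, not part of Notung. The function name is hypothetical, and it assumes that labels are unquoted, so every square bracket in the file belongs to a comment.

```python
def strip_newick_comments(newick: str) -> str:
    """Return the tree string with every [...] comment removed.

    Assumption: labels are not quoted, so square brackets only ever
    delimit comments.
    """
    kept = []
    depth = 0  # greater than 0 while inside a bracketed comment
    for ch in newick:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(ch)
    return "".join(kept)


nhx_tree = "(gene1_Hu[&&NHX:S=Hu],(gene2_Hu[&&NHX:S=Hu],gene2_Mu[&&NHX:S=Mu])[&&NHX:D=Y]);"
print(strip_newick_comments(nhx_tree))  # (gene1_Hu,(gene2_Hu,gene2_Mu));
```

Saving in standard Newick format from Notung remains the safer route, because it does not depend on these assumptions.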

A.1  Newick File Format

Newick is widely used by phylogeny programs. PHYLIP [5], PAUP* [13], and many other programs will output trees in Newick.

The general Newick syntax looks like this:

treefile → subtree;

subtree → descendant_list [internal_node_label] [:branch_length]

descendant_list → (subtree, subtree [, subtree]) | leaf_node_name

where descendant_list is a string that specifies the organization of the subtree and internal_node_label is the label of the root of a subtree. The branch_length field gives the length of the edge from the root of the subtree to its parent. Both internal_node_label and branch_length are optional, and some programs use these fields to store other information. For example, Notung allows the user to use either of these fields to store edge weight values.
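The grammar above maps naturally onto a small recursive parser. The sketch below is illustrative only and is not Notung's own reader. The `Node` class and `parse_newick` function are hypothetical names, and the code assumes a tree without comments or quoted labels.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str = ""                  # leaf name or internal node label
    length: Optional[float] = None   # branch length to the parent, if given
    children: List["Node"] = field(default_factory=list)


def parse_newick(text: str) -> Node:
    """Parse a comment-free, unquoted Newick string into a Node tree."""
    text = "".join(text.split())  # drop all whitespace, including newlines
    pos = 0

    def peek() -> str:
        return text[pos] if pos < len(text) else ""

    def read_token() -> str:
        # Read characters up to the next structural symbol.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:;":
            pos += 1
        return text[start:pos]

    def subtree() -> Node:
        nonlocal pos
        node = Node()
        if peek() == "(":            # descendant_list: (subtree, subtree, ...)
            pos += 1
            node.children.append(subtree())
            while peek() == ",":
                pos += 1
                node.children.append(subtree())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        node.label = read_token()    # leaf name or optional internal label
        if peek() == ":":            # optional branch length
            pos += 1
            node.length = float(read_token())
        return node

    root = subtree()
    if peek() != ";":
        raise ValueError("tree must end with ';'")
    return root


tree = parse_newick("((a,b)90:0.1,c);")
clade = tree.children[0]
print(clade.label, clade.length, [child.label for child in clade.children])
```

Note that the parser keeps the internal node label as a string. Whether "90" is a name or an edge weight is a matter of interpretation, which is why Notung lets the user specify where edge weights are stored.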

Comments in Newick format are enclosed in square brackets and may appear anywhere newlines are permitted. Some programs use the comment field to store additional information that is not included in the Newick specification. By convention, this information is formatted as follows:

   [&&ApplicationID:Application_specific_comments] 

where ApplicationID indicates a specific program or format.

For more information about Newick file format, go to:

http://evolution.genetics.washington.edu/phylip/newicktree.html.

or

http://geta.life.uiuc.edu/~gary/Newicks_845_Tree_Std.html.

A.2  NHX File Format - New Hampshire eXtended

NHX File Format is based on the Newick file format, but embeds additional information about each node in the tree in the comment fields, as follows:

   [&&NHX:TagID1=value1:TagID2=value2]

where TagID1 and TagID2 can specify bootstrap values, species labels, or duplication information. This example has two tags, but NHX comments can have one or more tags. Trees saved in NHX file format include information produced by a reconciliation, including duplications and species labels, but do not record any visual annotations made in Notung. Nor do they record the species tree with which the gene tree was reconciled.

NOTE: The NHX format is case-sensitive.
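As an illustration of the comment layout and of case sensitivity, the sketch below splits a single NHX comment into its tags. The function name is hypothetical. The tags used in the example (S for species, D for duplication, B for bootstrap) are taken from the NHX tag list as commonly documented, and should be checked against the specification referenced below.

```python
import re

# Matches one complete NHX comment; the "NHX" identifier is case-sensitive.
NHX_COMMENT = re.compile(r"\[&&NHX:([^\]]*)\]")


def parse_nhx_tags(comment: str) -> dict:
    """Split '[&&NHX:Tag1=value1:Tag2=value2]' into a {tag: value} dict."""
    match = NHX_COMMENT.fullmatch(comment.strip())
    if match is None:
        raise ValueError(f"not an NHX comment: {comment!r}")
    tags = {}
    for item in match.group(1).split(":"):
        if not item:
            continue  # tolerate an empty field such as a trailing ':'
        key, _, value = item.partition("=")
        tags[key] = value  # tag names are kept exactly as written
    return tags


print(parse_nhx_tags("[&&NHX:S=Homo_sapiens:D=Y:B=90]"))
# {'S': 'Homo_sapiens', 'D': 'Y', 'B': '90'}
```

With this sketch, a comment written as `[&&nhx:S=Homo_sapiens]` is rejected, and a tag written as `s` would be stored under a different key than `S`.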

More information about NHX format, including a complete list of tags used in comment fields, can be obtained at:

http://www.genetics.wustl.edu/eddy/forester/NHX.html.

A.3  Notung File Format

Notung file format further extends the NHX format. It can record duplication marks, edge weights, and color annotations. A reconciled gene tree file saved in Notung format also has a pruned species tree embedded in it. When the reconciled gene tree is reopened in Notung, the pruned species tree can be extracted and used in the same way as any other species tree. A reconciled gene tree saved in Notung file format also stores the parameter values used, including edge weight threshold, loss cost, duplication cost, and conditional duplication cost. In addition, when a non-binary gene tree reconciled with a binary species tree has more than one optimal history, the file records which history was displayed when the tree was saved. When the gene tree is reopened in Notung, the tree for that optimal history is displayed.

To open an embedded species tree in a Notung format gene tree file:

  1. Open the Notung format gene tree file.
  2. Click the Reconciliation tab to enter reconciliation mode.
  3. Click the “Show Pruned Species Tree” button.

NOTE: None of the three file formats used in Notung embeds the alternate histories for gene trees discovered through rearrangement. When saving after rearrangement, Notung saves only the history that currently appears in the tree panel. To access the other alternate histories after opening such a file, the tree must be rearranged again in Notung.

A.4  Specifying the Species Associated with Each Gene

In order to perform reconciliation, Notung must determine the species from which each leaf taxon in the gene tree was derived. This is achieved by embedding the species name in the gene leaf label or by using information embedded in the NHX comment field.

Notung offers three different conventions for specifying the gene to species mapping, described below. Notung will attempt to guess the naming convention used; you can also specify this in the reconciliation dialog (see Chapter 5 - Reconciliation Mode).

A.5  Punctuation in Species Names

In previous versions of Notung, punctuation (-, /, _, ., \) in species names was used to indicate that Notung should look for a shorter species tag in gene names, rather than looking for the entire species name. For example, given the species name Hu.Homo_Sapiens, Notung would look for the species label “Hu” in gene names.

Because many users found this confusing, this functionality has been removed in Notung 2.6. Notung now looks for entire species names during reconciliation, which also allows users to use species names like Pan_troglodytes and Pan_paniscus in the same tree without creating a conflict. Unfortunately, this means that some trees that were used in previous versions of Notung will not work in the current version. This section explains how to change these trees so that they can be used with Notung 2.6.
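The older convention, and the kind of conflict it could cause, can be sketched as follows. This is an illustration of the rule as described above, not Notung's actual code. The function name is hypothetical, and the separator set is the punctuation listed in the text.

```python
import re

# Punctuation that older Notung versions treated as separators: - / _ . \
SEPARATORS = re.compile(r"[-/_.\\]")


def short_species_tag(species_name: str) -> str:
    """Return the part of a species name before the first punctuation mark."""
    return SEPARATORS.split(species_name, maxsplit=1)[0]


print(short_species_tag("Hu.Homo_Sapiens"))   # Hu
print(short_species_tag("Pan_troglodytes"))   # Pan
print(short_species_tag("Pan_paniscus"))      # Pan
```

Under this rule both Pan species reduce to the same tag, which is the conflict that matching on entire species names avoids.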

How do I tell if I need to convert my trees?

Any species tree with punctuation in the species names, where the full species names are not present in either the gene tree names or in NHX style species tags, will need to be converted. If your species names contain punctuation and you used them with older versions of Notung, then your trees probably fit this description. If Notung 2.6 is used to open an older Notung format tree that needs to be converted, a warning dialog will be shown.

Converting the trees

There are three ways to convert trees with punctuation in species names. The correct method to use depends on your desired outcome.

Shorten species names
This method requires changing only the species tree; gene trees should not need to be modified. In each species name, remove the first punctuation character and everything after it. For example, if the leaf labels in the gene tree are of the form “Hu-gene01”, change “Hu.Homo_sapiens” to “Hu” in the species tree. These shorter species names should now match the species labels in the gene names.

Lengthen gene names
This method requires changing the gene tree(s). Replace short species labels in gene names with full species names. For example, change “Hu-gene01” to “Hu.Homo_sapiens-gene01” in the gene tree. This solution will not work in Postfix mode if your species names contain underscores (_).

Add NHX style species tags
This method requires changing the gene trees, but does not change gene names. One benefit of this method is that switching from a very short species label to a long species label will not affect the length of gene names.

If the gene tree is already in NHX or Notung format, modify the NHX comment after each gene name. To modify an existing NHX comment, find the species tag and replace the shorter species label with the full species name. For example, “[&&NHX:S=Hu]” becomes “[&&NHX:S=Hu_Homo_sapiens]”.

If there are no comments in the file (i.e., the tree is in Newick format), add the following after each gene name: “[&&NHX:S=<speciesname>]”, where <speciesname> is the corresponding full species name from the species tree. For example, the gene tree:

    (gene1_Hu,
     (gene2_Hu, gene2_Mu));
  

would become:

    (gene1_Hu[&&NHX:S=Hu_Homo_sapiens],
     (gene2_Hu[&&NHX:S=Hu_Homo_sapiens], gene2_Mu[&&NHX:S=Mu_Mus_musculus]));
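
For files with many leaves, this edit can be scripted. The sketch below is one possible approach, not a Notung feature. The function name is hypothetical, and it assumes a comment-free Newick tree with unquoted labels in which the short species label follows the last underscore of each gene name, as in the example above.

```python
import re

# A leaf name is a run of label characters directly after "(" or ",".
# Internal node labels follow ")" and are therefore left untouched.
LEAF = re.compile(r"([(,]\s*)([^\s(),:;\[\]]+)")


def add_species_tags(newick: str, full_names: dict) -> str:
    """Append [&&NHX:S=<speciesname>] after each leaf with a known species."""

    def tag_leaf(match):
        prefix, leaf = match.group(1), match.group(2)
        short = leaf.rsplit("_", 1)[-1]    # assumed: label after last "_"
        species = full_names.get(short)
        if species is None:
            return match.group(0)          # unknown label: leave unchanged
        return f"{prefix}{leaf}[&&NHX:S={species}]"

    return LEAF.sub(tag_leaf, newick)


names = {"Hu": "Hu_Homo_sapiens", "Mu": "Mu_Mus_musculus"}
print(add_species_tags("(gene1_Hu,(gene2_Hu,gene2_Mu));", names))
```

Leaves whose label is not in the mapping are left unchanged, so the output should be checked for untagged genes before reconciliation.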
  

A.6  Location of Edge Weight Values

Notung uses edge weights to determine which edges are weakly supported and may be rearranged. These edge weights may correspond to bootstrap values, probabilities, branch lengths, or any other numerical indication of support.

Edge weight values can be located in one of three places in a tree file, depending on how the file was created. In Newick format, either the branch length field or the internal node name may be used to specify edge weights. Many programs store bootstrap values in the Newick node name field. In an NHX or Notung format file, edge weights can also be specified using the NHX bootstrap tag in the comment field.

The example below shows a tree with a single edge weight in each of the three tree formats:
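
As a hedged illustration, the sketch below writes one tree three ways, with the weight 90 on the edge above the clade (a, b), and reads the weight back from each location. The tree, the value 90, and the function name are made up for this sketch. B is the NHX bootstrap tag. The patterns assume simple numeric weights and unquoted labels.

```python
import re

# The same tree with one edge weight stored in each possible location.
EXAMPLES = {
    "branch length": "((a,b):90,c);",
    "node name": "((a,b)90,c);",
    "NHX bootstrap tag": "((a,b)[&&NHX:B=90],c);",
}

# One pattern per location; each captures the number attached to an
# internal node, that is, to a closing parenthesis.
PATTERNS = {
    "branch length": re.compile(r"\)[^():,;\[\]]*:([0-9.]+)"),
    "node name": re.compile(r"\)([0-9.]+)"),
    "NHX bootstrap tag": re.compile(
        r"\)[^()\[\],;]*\[&&NHX:(?:[^\]]*:)?B=([0-9.]+)"
    ),
}


def internal_edge_weights(newick: str, location: str) -> list:
    """Return the weights found on internal edges at the given location."""
    return [float(value) for value in PATTERNS[location].findall(newick)]


for location, tree in EXAMPLES.items():
    print(location, tree, internal_edge_weights(tree, location))
```

A tree such as `((a,b)90:0.1,c);` yields 90 from the node name field and 0.1 from the branch length field, which is the ambiguity discussed next.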

Confusion can arise if an input tree has edge weights in more than one type of field. This could occur, for example, in a tree that has both branch lengths and bootstrap values. Notung tries to guess the type of edge weight specification in the file, but it is not always possible for Notung to determine this unequivocally. You can specify the location explicitly using command line options (see Chapter 12 - Command Line Options and Batch Processing) or using the “Select Location of Edge Weights” dialog in the Display Options menu (see Figure A.1).


Figure A.1: The “Select Location of Edge Weights” dialog box.

To set the location of edge weights in Notung:

  1. Click “Display Options → Select Location of Edge Weights.” A dialog box appears.
  2. Select one of the radio buttons (see Figure A.1).

    The gene tree will immediately reflect the change, so you can check the tree panel to verify that the choice you selected gives the desired values.

  3. Click “Apply.”
