Data analysis hints

Selecting sequences

Taxon sampling: Tree reconstruction methods are typically more accurate when the set of sequences is more or less evenly distributed across the evolutionary distance covered by your data set. For example, sampling gene family members from human, mouse, dog, cow, panda, and platypus is a better strategy, in most cases, than sampling from many primate species and platypus, with nothing in between. Of course, you are at the mercy of sequencing efforts and may have to compromise based on what is available in GenBank.
You may find multiple isoforms of the same protein. In most cases, it is best to use only one isoform. The convention is to use the longest isoform. You may also want to remove incomplete sequences (fragments) from your data set.
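
The longest-isoform convention can be sketched in a few lines of Python. This is only an illustration, not a prescribed tool: the function names and the `gene_of` rule for grouping isoforms are assumptions. Real headers differ between databases, so you will have to supply a grouping rule that matches your own data.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def longest_isoform_per_gene(records, gene_of):
    """Keep the longest sequence for each gene.

    gene_of is a caller-supplied (hypothetical) function that maps a
    header to a gene name; ties are resolved in favor of the first
    isoform encountered.
    """
    best = {}
    for header, seq in records:
        gene = gene_of(header)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return list(best.values())


# Made-up example: headers of the form "gene|isoform".
example = ">g1|a\nMKT\n>g1|b\nMKTLL\n>g2|a\nMA\n"
kept = longest_isoform_per_gene(parse_fasta(example),
                                lambda header: header.split("|")[0])
```

Fragments still have to be judged by eye or by a length cutoff of your choosing. The sketch only selects among isoforms of the same gene.
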
Many alignment programs will truncate the FASTA header to the first 10 characters. When constructing sequence files, it is important to start the header with a short, unique name that is easy to remember. Otherwise, you will have a lot of trouble figuring out which sequence is which in your alignment and tree. Some related points:
- GeneDoc will not import a data set that has two or more sequences with the same identifier.
- FASTA headers have useful information, but the first 10 characters are not necessarily the best choice of a short identifier. You may want to put a short label immediately after the '>', but keep the rest of the header.
- The five-letter species identification codes used by UniProt are often a good way to abbreviate species information. The complete list of such codes and the corresponding binomial names is linked from the resources page.
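
The points above can be checked mechanically before you run an alignment. The following Python sketch is one possible approach; the function names, the 10-character width default, and the example headers are assumptions made for illustration. It reports identifiers that collide after truncation and puts a short label in front of each header while keeping the rest.

```python
def truncation_clashes(headers, width=10):
    """Return the truncated identifiers shared by more than one header."""
    counts = {}
    for header in headers:
        key = header[:width]
        counts[key] = counts.get(key, 0) + 1
    return sorted(key for key, n in counts.items() if n > 1)


def add_short_labels(headers, labels, width=10):
    """Put a short label at the start of each header, keeping the rest.

    Raises ValueError if a label is empty, contains whitespace, or is not
    unique within the first `width` characters.
    """
    if len(headers) != len(labels):
        raise ValueError("need exactly one label per header")
    seen = set()
    relabeled = []
    for header, label in zip(headers, labels):
        if not label or any(ch.isspace() for ch in label):
            raise ValueError(f"bad label: {label!r}")
        key = label[:width]
        if key in seen:
            raise ValueError(f"label not unique after truncation: {label!r}")
        seen.add(key)
        relabeled.append(f"{label} {header}")
    return relabeled


# Made-up headers (without the leading '>') that collide when truncated.
headers = ["hypothetical protein A", "hypothetical protein B"]
# Labels in the style gene_SPECIES, using UniProt-like species codes.
relabeled = add_short_labels(headers, ["HBA_HUMAN", "HBA_MOUSE"])
```

Running `truncation_clashes` on the original and on the relabeled headers shows whether the new labels solved the problem.
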

Multiple Sequence Alignment

A good alignment is essential to obtain a good tree. In addition to Muscle and ClustalW2, try other alignment programs. T-Coffee and MAFFT are available through the EBI webserver. ProbCons is considered one of the most accurate, so try that if its webserver is up. Inspect the resulting alignments and choose the one that seems best for your family as a starting point for manual refinement. No one program works best on all families. You may also find that one program does well with the N-terminal region and a different program works well with the C-terminal region. In this case, split your sequences in half, align each half separately, and then combine the alignments in the editing step.
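
If you do split your sequences, the bookkeeping can be scripted. The sketch below is a minimal Python illustration with assumed function names. It cuts each sequence at its midpoint unless you give a per-sequence cut position, and it joins two finished alignment blocks row by row. The alignment itself still happens in whatever programs you choose; the two aligned blocks in the example are typed in by hand.

```python
def split_sequences(seqs, split_at=None):
    """Split each sequence into an N-terminal and a C-terminal part.

    seqs maps identifier -> unaligned sequence. split_at optionally maps
    identifier -> cut position; by default each sequence is cut at its
    midpoint.
    """
    n_term, c_term = {}, {}
    for name, seq in seqs.items():
        cut = len(seq) // 2 if split_at is None else split_at[name]
        n_term[name] = seq[:cut]
        c_term[name] = seq[cut:]
    return n_term, c_term


def join_alignments(left, right):
    """Concatenate two aligned blocks row by row, matching on identifier."""
    if set(left) != set(right):
        raise ValueError("the two alignments contain different sequences")
    for block in (left, right):
        if len({len(row) for row in block.values()}) > 1:
            raise ValueError("rows within an alignment differ in length")
    return {name: left[name] + right[name] for name in left}


# Made-up sequences and hand-typed "alignments" of each half.
seqs = {"a": "MKTAYIAK", "b": "MKAYIK"}
n_term, c_term = split_sequences(seqs)
aligned_n = {"a": "MKTA", "b": "MK-A"}
aligned_c = {"a": "YIAK", "b": "YI-K"}
combined = join_alignments(aligned_n, aligned_c)
```

A useful sanity check after joining is that removing the gaps from each combined row gives back the original sequence.
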

In some cases, you may get better results by partitioning the sequences into two subsets, aligning the subsets separately, and then aligning the two alignments to each other. Clustal is one program that can align alignments. As far as I know, the initial alignment of the subsets can be carried out in any program.

Check your alignment(s) against what has been reported about conserved features of your family in the literature. If there is a published structure, make use of that, as well. Use these features (1) to decide which program gave you the best alignment and (2) to improve your alignment through manual editing (e.g., in GeneDoc). For a serious, publishable analysis of a small number of families, you should always include manual refinement of the multiple sequence alignment in your data analysis plan.

For some projects, MEME will be useful. For others, less so. If you have strong conservation throughout your alignment, you may not need MEME to guide your alignment. If you have weak conservation or big insertions and deletions in your data, MEME can be very helpful.

Trimming: Multiple alignments should be trimmed before submitting them to tree reconstruction programs. Most of the trimming should take place after manual refinement. However, if you have sequences with unalignable regions, you may want to do some preliminary trimming before alignment. For example, if some of your sequences have a long string of leading or trailing repeats, it may be better to remove those first.
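
Trimming is usually a judgment call made in an alignment editor, but a first pass can be scripted. The Python sketch below removes columns in which most sequences have a gap. Both the rule and the 0.5 threshold are assumptions chosen for illustration, not a criterion recommended here; always inspect the result by eye.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment maps identifier -> aligned sequence; all rows must have the
    same length. The 0.5 default is an arbitrary illustrative threshold.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if not rows:
        return {}
    if len({len(row) for row in rows}) != 1:
        raise ValueError("rows differ in length; is this an alignment?")
    keep = []
    for col in range(len(rows[0])):
        gaps = sum(row[col] in gap_chars for row in rows)
        if gaps / len(rows) <= max_gap_fraction:
            keep.append(col)
    return {name: "".join(row[col] for col in keep)
            for name, row in zip(names, rows)}


# Made-up alignment with gappy leading columns and one gappy internal column.
alignment = {"a": "--MKT-A", "b": "-GMKTLA", "c": "--MK--A"}
trimmed = trim_gappy_columns(alignment)
```

Setting `max_gap_fraction` to 0 keeps only gap-free columns, which is a much more aggressive trim.
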

Last modified: November 2, 2011. Maintained by Dannie Durand (durand@cs.cmu.edu).