Data analysis hints
Selecting sequences
- Taxon sampling: Tree reconstruction methods are typically
more accurate when the set of sequences is more or less evenly
distributed across the evolutionary distance covered by your data
set. For example, sampling gene family members from human, mouse,
dog, cow, panda, and platypus is a better strategy, in most cases,
than sampling from many primate species and platypus, but nothing
in between. Of course, you are at the mercy of sequencing efforts
and may have to compromise due to what is available in
GenBank.
- You may find multiple isoforms of the same protein. In most
cases, it is best to use only one isoform. The convention is to
use the longest isoform. You may also want to remove incomplete
sequences (fragments) from your data set.
- Many alignment programs will truncate the FASTA header to the first
10 characters. When constructing sequence files, it is important to
start the header with a short, unique name that is easy to
remember. Otherwise, you will have a lot of trouble figuring out
which sequence is which in your alignment and tree. Some related
points:
  - GeneDoc will not import a data set that has two or more
    sequences with the same identifier.
  - FASTA headers have useful information, but the first 10
    characters are not necessarily the best choice of a short
    identifier. You may want to put a short label immediately after
    the '>', but keep the rest of the header.
  - The species identification codes used by UniProt (at most five
    characters each) are often a good way to abbreviate species
    information. The complete list of such codes and the
    corresponding binomial names is linked from the resources page.
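To make the header advice concrete, here is a minimal Python sketch that checks whether headers would collide once truncated to 10 characters and that puts a short label right after the '>' while keeping the rest of the header. The function names (`parse_fasta`, `truncated_collisions`, `relabel`), the toy headers, and the toy sequences are all made up for illustration; the 10-character width is taken from the point above and may differ between programs.

```python
from collections import Counter


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def truncated_collisions(records, width=10):
    """Return header prefixes of length `width` shared by more than one record."""
    counts = Counter(header[:width] for header, _ in records)
    return sorted(prefix for prefix, n in counts.items() if n > 1)


def relabel(records, labels):
    """Prepend a short unique label to each header, keeping the original header."""
    if len(labels) != len(records):
        raise ValueError("need exactly one label per record")
    if len(set(labels)) != len(labels):
        raise ValueError("short labels must be unique")
    return [(f"{label} {header}", seq)
            for label, (header, seq) in zip(labels, records)]


if __name__ == "__main__":
    # Toy data: both headers share the same first 10 characters.
    text = ">hemoglobin_alpha_human\nMVLS\nPADK\n>hemoglobin_alpha_mouse\nMVLSGEDK\n"
    records = parse_fasta(text)
    print("collisions before:", truncated_collisions(records))
    relabeled = relabel(records, ["HBA_HUMAN", "HBA_MOUSE"])
    print("collisions after:", truncated_collisions(relabeled))
    for header, seq in relabeled:
        print(f">{header}\n{seq}")
```

The labels here follow the gene-plus-species pattern suggested above, but any short, unique, memorable names would do.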
Multiple Sequence Alignment
A good alignment is essential for obtaining a good tree. In addition
to MUSCLE and ClustalW2, try other alignment programs. T-Coffee
and MAFFT are available through the EBI web server. ProbCons is
considered one of the most accurate, so try that if its
web server is up. Inspect the resulting alignments and choose the
one that seems best for your family as a starting point for manual
refinement. No one program works best on all families. You may
also find that one program does well with the N-terminal region
and a different program works well with the C-terminal region. In
this case, split your sequences in half, align each half separately,
and then combine the alignments in the editing step.
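The split-and-recombine idea can be sketched in a few lines of Python. This is only an illustration, not a replacement for doing the work in an alignment editor: the function names, the dict-of-strings representation, and the per-sequence cut positions are assumptions, and in practice you would choose each cut at a position you believe is homologous across sequences.

```python
def split_sequences(seqs, cuts):
    """Split each unaligned sequence at its own cut position.

    `seqs` maps identifier -> sequence; `cuts` maps identifier -> index of
    the first residue that belongs to the C-terminal part.
    """
    n_part = {seq_id: seq[:cuts[seq_id]] for seq_id, seq in seqs.items()}
    c_part = {seq_id: seq[cuts[seq_id]:] for seq_id, seq in seqs.items()}
    return n_part, c_part


def concatenate_alignments(left, right):
    """Join two alignments of the same sequences side by side.

    Rows are matched by identifier, so the two inputs may list the
    sequences in different orders.
    """
    if set(left) != set(right):
        raise ValueError("both alignments must contain the same identifiers")
    for name, aln in (("left", left), ("right", right)):
        if len({len(row) for row in aln.values()}) > 1:
            raise ValueError(f"{name} alignment rows differ in length")
    return {seq_id: left[seq_id] + right[seq_id] for seq_id in left}


if __name__ == "__main__":
    # Toy sequences and toy alignments, standing in for real program output.
    seqs = {"a": "MKLGGA", "b": "MKTLGG"}
    n_part, c_part = split_sequences(seqs, {"a": 3, "b": 4})
    print(n_part, c_part)

    n_aligned = {"a": "MK-L", "b": "MKTL"}   # e.g. from one program
    c_aligned = {"b": "GG-", "a": "GGA"}     # e.g. from another program
    for seq_id, row in concatenate_alignments(n_aligned, c_aligned).items():
        print(seq_id, row)
```

Matching rows by identifier rather than by position is one reason unique short names matter: different programs may return the sequences in different orders.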
In some cases, you may get better results by partitioning the
sequences into two subsets, aligning the subsets separately, and then
aligning the alignments. Clustal is one program that can align
alignments. The initial alignment of the subsets can be carried
out in any program (I think).
Check your alignment(s) against what has been reported about
conserved features of your family in the literature. If there is
a published structure, make use of that, as well. Use these
features (1) to decide which program gave you the best alignment
and (2) to improve your alignment through manual editing (e.g., in
GeneDoc). For a serious, publishable analysis of a small number
of families, you should always include manual refinement of the
multiple sequence alignment in your data analysis plan.
For some projects, MEME will be useful. For others, less so.
If you have strong conservation throughout your alignment, you may
not need MEME to guide your alignment. If you have weak
conservation or big insertions and deletions in your data, MEME
can be very helpful.
Trimming: Multiple alignments should be trimmed before
submitting them to tree reconstruction programs. Most of the
trimming should take place after manual refinement. However, if
you have sequences with unalignable regions, you may want to do
some preliminary trimming before alignment. For example, if some
of your sequences have a long string of leading or trailing
repeats, it may be better to remove those first.
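One simple form of post-alignment trimming is to drop columns that are mostly gaps. The Python sketch below shows that idea only; it is not the procedure prescribed here. The function name, the dict-of-strings representation, and the 50% gap threshold are assumptions chosen for illustration, and which columns to trim remains a judgment call for your family.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Remove alignment columns whose gap fraction exceeds `max_gap_fraction`.

    `alignment` maps identifier -> aligned sequence; all rows must have the
    same length. Returns a new dict with the same identifiers.
    """
    ids = list(alignment)
    if not ids:
        return {}
    length = len(alignment[ids[0]])
    if any(len(alignment[seq_id]) != length for seq_id in ids):
        raise ValueError("alignment rows differ in length")

    keep = []
    for col in range(length):
        gaps = sum(1 for seq_id in ids if alignment[seq_id][col] in gap_chars)
        # Keep the column if it is not too gappy (threshold is an assumption).
        if gaps / len(ids) <= max_gap_fraction:
            keep.append(col)

    return {seq_id: "".join(alignment[seq_id][col] for col in keep)
            for seq_id in ids}


if __name__ == "__main__":
    # Toy alignment: columns 3 and 4 are mostly gaps.
    toy = {"a": "MK--LV", "b": "MKA-LV", "c": "M---LV", "d": "MK--LV"}
    for seq_id, row in trim_gappy_columns(toy).items():
        print(seq_id, row)
```

Leading or trailing repeats, as in the example above, are a different case: those are better removed from the unaligned sequences by hand, since a gap threshold will not recognize them.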
Last modified: November 2, 2011. Maintained by Dannie Durand (durand@cs.cmu.edu).