This page contains links to a preliminary version of the CLEAX program written in C++. The program automatically detects population structures, identifies the population history, and learns divergence time and admixture fraction.
Version History:
Version 1.0 | Capable of inferring only three-population admixture/non-admixture scenarios. Assumes that the population with the least supporting weight from the observed data is the admixed population. |
Version 2.0 | Capable of inferring admixture/non-admixture scenarios for three or more populations. Automatically infers population history by incorporating all possible admixed/non-admixed scenarios into the MCMC chain. |
Compatibility: The program has been compiled and tested on both Windows and Linux with GNU C++ compilers and GNU make. Although it has only been tested on those two platforms, it should work on any machine with an ANSI C++ compiler.
Source Code: cleax-2.0.tgz
Compilation: To compile the program, go to the build directory and type:
make clean
make all
This should produce a program called cleax (or cleax.exe in Windows) in the build directory.
Mode | Program execution mode. Four modes are currently supported: Normal, ConsensusOnly, MarkovOnly, and ComputeOnly. Normal mode reads SNP data from a ConsensusInputFile, automatically identifies subpopulations, and then infers their history. ConsensusOnly mode reads a ConsensusInputFile and performs only the automatic identification of subpopulations. MarkovOnly mode reads a specialized MCMCInputFile containing model bipartitions and their associated weights, and performs only the history inference. ComputeOnly mode reads SNP data from a ConsensusInputFile and a set of model bipartitions from a ModelPartitionsInputFile, then uses the SNP data to compute the weight associated with each model bipartition. (Default: Normal) |
ConsensusInputFile Required for Normal/ConsensusOnly/ComputeOnly | Location of the genetic variation data. The program assumes that the input is a space-delimited bi-allelic variation dataset in which 0 represents one allele and 1 represents the other. (See examples/example-0.6-0.05-0.2.hap for an example.) |
MCMCInputFile Required for MarkovOnly | Location of the input file used in MarkovOnly mode. The file consists of two sections: Weights and Models. The Weights section begins with a line containing the word "Weights", followed by a line of weights associated with the k model bipartitions. Weights are separated by one or more spaces. The Models section begins with a line containing the word "Models", followed by k lines of model bipartitions. Each model bipartition line consists of 0s and 1s without any spaces. |
ModelPartitionsInputFile Required for ComputeOnly | Location of the input file used in ComputeOnly mode. The file lists the k model bipartitions whose weights the user wants computed. Each line represents one model bipartition, written as a string of 0s and 1s without any spaces. |
OutputFile (required) | Location of the file to which the program will write its output. |
NumGenealogies | Number of genealogies, m, that the program assumes are sufficient to describe the entire sequence set. The default value is 30. Ideally, the number of genealogies should be at least as many as the number of recombinant sites. |
NumEMIters | Number of simulated annealing/expectation-maximization iterations the program runs before returning the best-scoring consensus tree. The default value is 1000. |
NumMCMCIters | Number of MCMC iterations the program will sample before returning the average expected parameters. The default is 20,000 iterations. |
Penalty | Penalty score added to the tree score to penalize large, complicated consensus trees. The default is the number of samples. A large penalty steers the algorithm toward simple consensus trees with few subpopulations, while a small penalty favors trees with more subpopulations. A small penalty can lead to over-fitting on small datasets. In its current form, the program assumes there will be 3 model bipartitions (assuming a 3-population evolutionary model), so the penalty is not a critical factor in the current version. |
PopSize* | Effective population size. If the effective population size, the mutation rate, and the sequence length are specified, the program will use the specified parameters to estimate the expected number of mutations. Otherwise, the program will incorporate effective population size, mutation rate, and sequence length into the MCMC chain. |
SeqLength* | Sequence length. If the effective population size, the mutation rate, and the sequence length are specified, the program will use the specified parameters to estimate the expected number of mutations. Otherwise, the program will incorporate effective population size, mutation rate, and sequence length into the MCMC chain. |
MutationRate* | Neutral mutation rate. If the effective population size, the mutation rate, and the sequence length are specified, the program will use the specified parameters to estimate the expected number of mutations. Otherwise, the program will incorporate effective population size, mutation rate, and sequence length into the MCMC chain. |
*These three parameters together determine theta, which is used to compute the expected number of variant sites. By default, CLEAX samples theta. To fix theta instead, all three parameters must be specified.
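As a rough illustration of how the three starred parameters could combine, the sketch below assumes the common diploid convention theta = 4 * PopSize * MutationRate * SeqLength (with the mutation rate given per site per generation) and the standard neutral-model expectation for the number of variant sites among n sequences, theta * (1 + 1/2 + ... + 1/(n-1)). Both formulas and the function names computeTheta and expectedVariantSites are assumptions made for illustration; this page does not state which convention CLEAX uses internally.

```cpp
#include <cassert>
#include <cstddef>

// ASSUMPTION: diploid convention, theta = 4 * Ne * mu * L, where mu is the
// per-site, per-generation mutation rate. CLEAX may use a different scaling.
double computeTheta(double popSize, double mutationRate, double seqLength) {
    assert(popSize > 0.0 && mutationRate > 0.0 && seqLength > 0.0);
    return 4.0 * popSize * mutationRate * seqLength;
}

// Expected number of variant (segregating) sites among numSequences sampled
// sequences under the standard neutral model:
//   E[S] = theta * sum_{i=1}^{n-1} 1/i
// ASSUMPTION: a single panmictic population; this is not CLEAX's own code.
double expectedVariantSites(double theta, std::size_t numSequences) {
    assert(numSequences >= 2);
    double harmonic = 0.0;
    for (std::size_t i = 1; i < numSequences; ++i) {
        harmonic += 1.0 / static_cast<double>(i);
    }
    return theta * harmonic;
}
```

Under these assumptions, illustrative values of PopSize = 10000, MutationRate = 1e-8, and SeqLength = 100000 give theta = 40.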
To use the program, one would execute the following command:
./cleax path-to-property-file
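Before running the command above in MarkovOnly mode, it can be useful to check that the MCMCInputFile follows the layout described in the parameter table: a "Weights" line, one line of k space-separated weights, a "Models" line, and then k lines of 0s and 1s. The following C++ sketch is one way to perform that check. The names McmcInput and parseMcmcInput are hypothetical and are not part of CLEAX; tolerating blank lines and trailing whitespace is likewise an assumption.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the two sections of an MCMCInputFile.
struct McmcInput {
    std::vector<double> weights;      // one weight per model bipartition
    std::vector<std::string> models;  // each entry is a string of '0'/'1'
};

// Remove trailing spaces, tabs, and line-ending characters.
std::string rstrip(std::string s) {
    const std::size_t end = s.find_last_not_of(" \t\r\n");
    s.erase(end == std::string::npos ? 0 : end + 1);
    return s;
}

// Returns true and fills `out` if the stream matches the documented layout.
bool parseMcmcInput(std::istream& in, McmcInput& out) {
    out.weights.clear();
    out.models.clear();
    std::string line;

    // Weights section: header line, then one line of numbers.
    if (!std::getline(in, line) || rstrip(line) != "Weights") return false;
    if (!std::getline(in, line)) return false;
    std::istringstream fields(rstrip(line));
    double w = 0.0;
    while (fields >> w) out.weights.push_back(w);
    // A non-numeric token stops extraction before end of line.
    if (!fields.eof() || out.weights.empty()) return false;

    // Models section: header line, then one bipartition per line.
    if (!std::getline(in, line) || rstrip(line) != "Models") return false;
    while (std::getline(in, line)) {
        line = rstrip(line);
        if (line.empty()) continue;  // ASSUMPTION: blank lines are ignored
        if (line.find_first_not_of("01") != std::string::npos) return false;
        out.models.push_back(line);
    }

    // The description requires k weights and k model lines.
    return out.models.size() == out.weights.size();
}
```

To check a file on disk, open it with std::ifstream and pass the stream to parseMcmcInput. A false return means the file deviates from the layout above, for example because the number of weights and the number of model lines differ.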
Questions, comments, and bug reports may be sent to the authors at mingchit@andrew.cmu.edu or russells@andrew.cmu.edu. Please note, however, that development of this code is a research project which is aimed at creating theoretical methods for computational genomics, not at producing production quality code. This code is being released to allow others to review, experiment with, and improve upon these methods. The code is not suitable for mission critical work and should not be used as if it were. The code and all associated materials are provided as is, with no warranty of any kind, explicit or implicit, and no explicit or implicit promise of support.