Gene Frequencies and Continuous Character Data Programs
© Copyright 1986-2000 by the University of Washington. Written by Joseph Felsenstein. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.
The programs in this group use gene frequencies and quantitative character values. One (CONTML) constructs maximum likelihood estimates of the phylogeny, another (GENDIST) computes genetic distances for use in the distance matrix programs, and the third (CONTRAST) examines correlation of traits as they evolve along a given phylogeny.
When the gene frequencies data are used in CONTML or GENDIST, this involves the following assumptions:
How these assumptions affect the methods will be seen in my papers on inference of phylogenies from gene frequency and continuous character data (Felsenstein, 1973b, 1981c, 1985c).
The input formats are fairly similar to the discrete-character programs, but with one difference. When CONTML is used in the gene-frequency mode (its usual, default mode), or when GENDIST is used, the first line contains the number of species (or populations) and the number of loci and the options information. There then follows a line which gives the numbers of alleles at each locus, in order. This must be the full number of alleles, not the number of alleles which will be input: i. e. for a two-allele locus the number should be 2, not 1. There then follow the species (population) data, each species beginning on a new line. The first 10 characters are taken as the name, and thereafter the values of the individual characters are read free-format, preceded and separated by blanks. They can go to a new line if desired, though of course not in the middle of a number. Missing data is not allowed - an important limitation. In the default configuration, for each locus, the numbers should be the frequencies of all but one allele. The menu option A (All) signals that the frequencies of all alleles are provided in the input data -- the program will then automatically ignore the last of them. So without the A option, for a three-allele locus there should be two numbers, the frequencies of two of the alleles (and of course it must always be the same two!). Here is a typical data set without the A option:
whereas here is what it would have to look like if the A option were invoked:
The first line has the number of species (or populations) and the number of loci. The second line has the number of alleles for each of the 3 loci. The species lines have names (filled out to 10 characters with blanks) followed by the gene frequencies of the 2 alleles for the first locus, the 3 alleles for the second locus, and the 2 alleles for the third locus. You can start a new line after any of these allele frequencies, and continue to give the frequencies on that line (without repeating the species name).
If all alleles of a locus are given, it is important to have them add up to 1. Roundoff of the frequencies may cause the program to conclude that the numbers do not sum to 1, and stop with an error message.
While many compilers may be more tolerant, it is probably wise to make sure that each number, including the first, is preceded by a blank, and that there are digits both preceding and following any decimal points.
CONTML and CONTRAST also treat quantitative characters (the continuous-characters mode in CONTML, which is option C). It is assumed that each character is evolving according to a Brownian motion model, at the same rate, and independently. In reality it is almost always impossible to guarantee this. The issue is discussed at length in my review article in Annual Review of Ecology and Systematics (Felsenstein, 1988a), where I point out the difficulty of transforming the characters so that they are not only genetically independent but have independent selection acting on them. If you are going to use CONTML to model evolution of continuous characters, then you should at least make some attempt to remove genetic correlations between the characters (usually all one can do is remove phenotypic correlations by transforming the characters so that there is no within-population covariance and so that the within-population variances of the characters are equal -- this is equivalent to using Canonical Variates). However, this will only guarantee that one has removed phenotypic covariances between characters. Genetic covariances could only be removed by knowing the coheritabilities of the characters, which would require genetic experiments, and selective covariances (covariances due to covariation of selection pressures) would require knowledge of the sources and extent of selection pressure in all variables.
CONTRAST is a program designed to infer, for a given phylogeny that is provided to the program, the covariation between characters in a data set. Thus we have a program in this set that allow us to take information about the covariation and rates of evolution of characters and make an estimate of the phylogeny (CONTML), and a program that takes an estimate of the phylogeny and infers the variances and covariances of the character changes. But we have no program that infers both the phylogenies and the character covariation from the same data set.
In the quantitative characters mode, a typical small data set would be:
Note that in the latter case, there is no line giving the numbers of alleles at each locus. In this latter case no square-root transformation of the coordinates is done: each is assumed to give directly the position on the Brownian motion scale.
For further discussion of options and modifiable constants in CONTML, GENDIST, and CONTRAST see the documentation files for those programs.