Molecular Sequence Programs

(c) Copyright 1986-2000 by The University of Washington. Written by Joseph Felsenstein. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.

These programs estimate phylogenies from protein sequence or nucleic acid sequence data. PROTPARS uses a parsimony method intermediate between Eck and Dayhoff's method (1966) of allowing transitions between all amino acids and counting those, and Fitch's (1971) method of counting the number of nucleotide changes that would be needed to evolve the protein sequence. DNAPARS uses the parsimony method allowing changes between all bases and counting the number of those. DNAMOVE is an interactive parsimony program allowing the user to rearrange trees by hand and see where characters states change. DNAPENNY uses the branch-and-bound method to search for all most parsimonious trees in the nucleic acid sequence case. DNACOMP adapts to nucleotide sequences the compatibility (largest clique) approach. DNAINVAR does not directly estimate a phylogeny, but computes Lake's (1987) and Cavender's (Cavender and Felsenstein, 1987) phylogenetic invariants, which are quantities whose values depend on the phylogeny. DNAML does a maximum likelihood estimate of the phylogeny (Felsenstein, 1981a). DNAMLK is similar to DNAML but assumes a molecular clock. DNADIST computes distance measures between pairs of species from nucleotide sequences, distances that can then be used by the distance matrix programs FITCH and KITSCH. RESTML does a maximum likelihood estimate from restriction sites data. SEQBOOT allows you to read in a data set and then produce multiple data sets from it by bootstrapping, delete-half jackknifing, or by permuting within sites. This then allows most of these methods to be bootstrapped or jackknifed, and for the Permutation Tail Probability Test of Archie (1989) and Faith and Cranston (1991) to be carried out.

The input and output format for RESTML is described in its document files. In general its input format is similar to those described here, except that the one-letter codes for restriction sites is specific to that program and is described in that document file. Since the input formats for the eight DNA sequence and two protein sequence programs apply to more than one program, they are described here. Their input formats are standard, making use of the IUPAC standards. .sp 2 .ce INTERLEAVED AND SEQUENTIAL FORMATS

The sequences can continue over multiple lines; when this is done the sequences must be either in "interleaved" format, similar to the output of alignment programs, or "sequential" format. These are described in the main document file. In sequential format all of one sequence is given, possibly on multiple lines, before the next starts. In interleaved format the first part of the file should contain the first part of each of the sequences, then possibly a line containing nothing but a carriage-return character, then the second part of each sequence, and so on. Only the first parts of the sequences should be preceded by names. Here is a hypothetical example of interleaved format:

5 42 Turkey AAGCTNGGGC ATTTCAGGGT Salmo gairAAGCCTTGGC AGTGCAGGGT H. SapiensACCGGTTGGC CGTTCAGGGT Chimp AAACCCTTGC CGTTACGCTT Gorilla AAACCCTTGC CGGTACGCTT GAGCCCGGGC AATACAGGGT AT GAGCCGTGGC CGGGCACGGT AT ACAGGTTGGC CGTTCAGGGT AA AAACCGAGGC CGGGACACTC AT AAACCATTGC CGGTACGCTT AA

while in sequential format the same sequences would be:

5 42 Turkey AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT Salmo gairAAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT H. SapiensACCGGTTGGC CGTTCAGGGT ACAGGTTGGC CGTTCAGGGT AA Chimp AAACCCTTGC CGTTACGCTT AAACCGAGGC CGGGACACTC AT Gorilla AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA

Note, of course, that a portion of a sequence like this:

300 AAGCGTGAAC GTTGTACTAA TRCAG

is perfectly legal, assuming that the species name has gone before, and is filled out to full length by blanks. The above digits and blanks will be ignored, the sequence being taken as starting at the first base symbol (in this case an A). This should enable you to use output from many multiple-sequence alignment programs with only minimal editing.

In interleaved format the present versions of the programs may sometimes have difficulties with the blank lines between groups of lines, and if so you might want to retype those lines, making sure that they have only a carriage-return and no blank characters on them, or you may perhaps have to eliminate them. The symptoms of this problem are that the programs complain that the sequences are not properly aligned, and you can find no other cause for this complaint.

INPUT FOR THE DNA SEQUENCE PROGRAMS

The input format for the DNA sequence programs is standard: the data have A's, G's, C's and T's (or U's). The first line of the input file contains the number of species and the number of sites. As with the other programs, options information may follow this. Following this, each species starts on a new line. The first 10 characters of that line are the species name. There then follows the base sequence of that species, each character being one of the letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V, W, X, Y, ?, or - (a period was also previously allowed but it is no longer allowed, because it sometimes is used in different senses in other programs). Blanks will be ignored, and so will numerical digits. This allows GENBANK and EMBL sequence entries to be read with minimum editing.

These characters can be either upper or lower case. The algorithms convert all input characters to upper case (which is how they are treated). The characters constitute the IUPAC (IUB) nucleic acid code plus some slight extensions. They enable input of nucleic acid sequences taking full account of any ambiguities in the sequence.

Symbol	Meaning

A	Adenine
G	Guanine
C	Cytosine
T	Thymine
U	Uracil
Y	pYrimidine	(C or T)
R	puRine	(A or G)
W	"Weak"	(A or T)
S	"Strong"	(C or G)
K	"Keto"	(T or G)
M	"aMino"	(C or A)
B	not A	(C or G or T)
D	not C	(A or G or T)
H	not G	(A or C or T)
V	not T	(A or C or G)
X,N,?	unknown	(A or C or G or T)
O	deletion
-	deletion

INPUT FOR THE PROTEIN SEQUENCE PROGRAMS

The input for the protein sequence programs is fairly standard. The first line contains the number of species and the number of amino acid positions (counting any stop codons that you want to include). These are followed on the same line by the options. The only options which need information in the input file are U (User Tree) and W (Weights). They are as described in the main documentation file. If the W (Weights) option is used there must be a W in the first line of the input file.

Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats. The I option selects between them. The sequences can have internal blanks in the sequence but there must be no extra blanks at the end of the terminated line. Note that a blank is not a valid symbol for a deletion.

The protein sequences are given by the one-letter code used by the late Margaret Dayhoff's group in the Atlas of Protein Sequences, and consistent with the IUB standard abbreviations. In the present version it is:

Symbol	Stands for

A	ala
B	asx
C	cys
D	asp
E	glu
F	phe
G	gly
H	his
I	ileu
J	(not used)
K	lys
L	leu
M	met
N	asn
O	(not used)
P	pro
Q	gln
R	arg
S	ser
T	thr
U	(not used)
V	val
W	trp
X	unknown amino acid
Y	tyr
Z	glx
*	nonsense (stop)
?	unknown amino acid or deletion
-	deletion

where "nonsense", and "unknown" mean respectively a nonsense (chain termination) codon and an amino acid whose identity has not been determined. The state "asx" means "either asn or asp", and the state "glx" means "either gln or glu" and the state "deletion" means that alignment studies indicate a deletion has happened in the ancestry of this position, so that it is no longer present. Note that if two polypeptide chains are being used that are of different length owing to one terminating before the other, they can be coded as (say)

             HIINMA*????
             HIPNMGVWABT

since after the stop codon we do not definitely know that there has been a deletion, and do not know what amino acid would have been there. If DNA studies tell us that there is DNA sequence in that region, then we could use "X" rather than "?". Note that "X" means an unknown amino acid, but definitely an amino acid, while "?" could mean either that or a deletion. Otherwise one will usually want to use "?" after a stop codon, if one does not know what amino acid is there. If the DNA sequence has been observed there, one probably ought to resist putting in the amino acids that this DNA would code for, and one should use "X" instead, because under the assumptions implicit in this either the parsimony or the distance methods, changes to any noncoding sequence are much easier than changes in a coding region that change the amino acid

Here are the same one-letter codes tabulated the other way 'round:

Amino acid	One-letter code

ala	A
arg	R
asn	N
asp	D
asx	B
cys	C
gln	Q
glu	E
gly	G
glx	Z
his	H
ileu	I
leu	L
lys	K
met	M
phe	F
pro	P
ser	S
thr	T
trp	W
tyr	Y
val	V
deletion	-
nonsense (stop)	*
unknown amino acid	X
unknown (incl. deletion)	?

THE OPTIONS

The programs allow options chosen from their menus. Many of these are as described in the main documentation file, particularly the options J, O, U, T, W, and Y. (Although T has a different meaning in the programs DNAML and DNADIST than in the others).

The U option indicates that user-defined trees are provided at the end of the input file. This happens in the usual way, except that for PROTPARS, DNAPARS, DNACOMP, and DNAMLK, the trees must be strictly bifurcating, containing only two-way splits, e. g.: ((A,B),(C,(D,E)));. For DNAML and RESTML it must have a trifurcation at its base, e. g.: ((A,B),C,(D,E));. The root of the tree may in those cases be placed arbitrarily, since the trees needed are actually unrooted, though they look different when printed out. The program RETREE should enable you to reroot the trees without having to hand-edit or retype them. For DNAMOVE the U option is not available (although there is an equivalent feature which uses rooted user trees).

A feature of the nucleotide sequence programs other than DNAMOVE is that they save time and computer memory space by recognizing sites at which the pattern of bases is the same, and doing their computation only once. Thus if we have only four species but a large number of sites, there are (ignoring ambiguous bases) only about 256 different patterns of nucleotides (4 x 4 x 4 x 4) that can occur. The programs automatically count how many occurrences there are of each and then only needs to do as much computation as would be needed with 256 sites, even though the number of sites is actually much larger. If there are ambiguities (such as Y or R nucleotides), these are also handled correctly, and do not cause trouble. The programs store the full sequences but reserve other space for bookkeeping only for the distinct patterns. This saves space. Thus the programs will run very effectively with few species and many sites. On larger numbers of species, if rates of evolution are small, many of the sites will be invariant (such as having all A's) and thus will mostly have one of four patterns. The programs will in this way automatically avoid doing duplicate computations for such sites.