SEQBOOT -- Bootstrap, Jackknife, or Permutation Resampling
Bootstrapping algorithm, version 3.6a3

Settings for this run:
  D      Sequence, Morph, Rest., Gene Freqs?  Molecular sequences
  J  Bootstrap, Jackknife, Permute, Rewrite?  Bootstrap
  B      Block size for block-bootstrapping?  1 (regular bootstrap)
  R                     How many replicates?  100
  W              Read weights of characters?  No
  C                Read categories of sites?  No
  F     Write out data sets or just weights?  Data sets
  I             Input sequences interleaved?  Yes
  0      Terminal type (IBM PC, ANSI, none)?  (none)
  1       Print out the data at start of run  No
  2     Print indications of progress of run  Yes

  Y to accept these or type the letter for one to change
The user selects options by typing one of the letters in the left column, and continues to do so until all options are correctly set. Then the program can be run by typing Y.
It is important to select the correct data type (the D selection). Each time D is typed the program will change data type, proceeding successively through Molecular Sequences, Discrete Morphological Characters, Restriction Sites, and Gene Frequencies. Some of these will cause additional entries to appear in the menu. If the Molecular Sequences or Restriction Sites setting is chosen, the I (Interleaved) option appears in the menu (and since Molecular Sequences is the default, it appears in the first menu). It is the usual I option discussed in the Molecular Sequences document file and in the main documentation files for the package, and is on by default.
If the Restriction Sites option is chosen the menu option E appears, which asks whether the input file contains a third number on the first line of the file, for the number of restriction enzymes used to detect these sites. This is necessary because data sets for RESTML need this third number, but other programs do not, and SEQBOOT needs to know what to expect.
If the Gene Frequencies option is chosen, a menu option A appears, which allows the user to specify that all alleles at each locus are in the input file. The default setting is that one allele is absent at each locus.
The J option allows the user to select Bootstrapping, Delete-Half-Jackknifing, or the Archie-Faith permutation of species within characters. It changes successively among these three each time J is typed.
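As an illustration of what permuting species within characters means, here is a minimal Python sketch. It is not SEQBOOT's own code: the function name `permute_within_characters` and the representation of the data as a list of equal-length strings are assumptions made for this example. Each character (column) is shuffled independently across species, so the states present in a column are preserved while their association with particular species is randomized.

```python
import random

def permute_within_characters(matrix, rng):
    """Shuffle each column of a character matrix independently.

    matrix: list of equal-length strings, one per species (assumed layout).
    rng:    a random.Random instance, so runs can be made reproducible.
    """
    n_species = len(matrix)
    n_chars = len(matrix[0])
    columns = []
    for j in range(n_chars):
        column = [row[j] for row in matrix]
        rng.shuffle(column)  # reassign this character's states among species
        columns.append(column)
    # Reassemble one string per species from the shuffled columns.
    return ["".join(col[i] for col in columns) for i in range(n_species)]

data = ["AACAAC", "AACCCC", "ACCAAC", "CCACCA", "CCAAAC"]
permuted = permute_within_characters(data, random.Random(4333))
```

Because every column is shuffled separately, the count of each state per character is unchanged in the permuted data set.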
The B option selects the Block Bootstrap. When you select option B the program will ask you to enter the block length. When the block length is 1, this means that we are doing regular bootstrapping rather than block-bootstrapping.
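The idea of block-bootstrapping can be sketched in Python as follows. This is a sketch under stated assumptions rather than a description of SEQBOOT's internals: the function name `block_bootstrap_sites` is hypothetical, and the choices to start each block at a uniformly random site, to wrap around the end of the sequence, and to truncate the result to the original number of sites are assumptions of this example.

```python
import random

def block_bootstrap_sites(n_sites, block_size, rng):
    """Return n_sites site indices drawn as blocks of consecutive sites.

    Assumptions of this sketch: blocks start at a uniformly random site,
    wrap around the end of the sequence, and the last block is truncated
    so that exactly n_sites indices are returned.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    n_blocks = -(-n_sites // block_size)  # ceiling division
    picked = []
    for _ in range(n_blocks):
        start = rng.randrange(n_sites)
        picked.extend((start + k) % n_sites for k in range(block_size))
    return picked[:n_sites]

# With block_size == 1 this reduces to the regular bootstrap:
# each site is drawn independently with replacement.
regular = block_bootstrap_sites(6, 1, random.Random(1))
blocked = block_bootstrap_sites(6, 3, random.Random(1))
```

With a block size greater than 1, neighboring sites are kept together in the sample, which is the point of block-bootstrapping when adjacent sites are correlated.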
The R option allows the user to set the number of replicate data sets. This defaults to 100. Most statisticians would be happiest with 1000 to 10,000 replicates in a bootstrap, but 100 gives a rough picture. You will have to decide this based on how long a running time you are willing to tolerate.
The W (Weights) option allows weights to be read from a file whose default name is "weights". The weights follow the format described in the main documentation file. Weights can only be 0 or 1, and act to select the characters (or sites) that will be used in the resampling, the others being ignored and always omitted from the output data sets. Note: At present, if you use W together with the F (just weights) option, you write a file of weights, but with only weights for the sites that had input weights of 1, the others being omitted. Thus if you had 100 characters, and gave 60 of them weights of 1, when you produce the output weights these will only have 60 weights, not 100. Thus they could only be used together with a data file that had been edited to remove the sites that you gave 0 weights to. This is clumsy and we need to correct it.
The C (Categories) option can be used with molecular sequence programs to allow assignment of sites or amino acid positions to user-defined rate categories. The assignment of rates to sites is then made by reading a file whose default name is "categories". It should contain a string of digits 1 through 9. A new line or a blank can occur after any character in this string. Thus the categories file might look like this:
The only use of the Categories information in SEQBOOT is that they are sampled along with the sites (or amino acid positions) and are written out onto a file whose default name is "outcategories", which has one set of categories information for each bootstrap or jackknife replicate.
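Sampling the categories along with the sites can be pictured with the following Python sketch. The function name `resample_sites_and_categories` and the example sequences and categories string are hypothetical, chosen only for illustration. The key point is that one list of sampled positions is applied both to the sequences and to the categories, so each resampled site keeps its own category.

```python
import random

def resample_sites_and_categories(sequences, categories, rng):
    """Bootstrap sites and carry each site's rate category along with it.

    sequences:  list of equal-length strings (assumed layout).
    categories: string of digits, one per site.
    """
    n_sites = len(categories)
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    new_sequences = ["".join(seq[i] for i in picks) for seq in sequences]
    new_categories = "".join(categories[i] for i in picks)
    return new_sequences, new_categories

# Hypothetical example input.
seqs = ["AACAAC", "CCACCA"]
cats = "112233"
new_seqs, new_cats = resample_sites_and_categories(seqs, cats, random.Random(7))
```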
The F option is a particularly important one. It selects whether the program writes out multiple data sets or multiple sets of weights. If your data set is large, a file with (say) 1000 such data sets can be very large and may use up too much space on your system. If you choose the F option, the program will instead produce a weights file with multiple sets of weights. The default name of this file is "outweights". Except for some programs that cannot handle multiple sets of weights, the programs have an M (multiple data sets) option that asks the user whether to use multiple data sets or multiple sets of weights. If the latter is selected when running those programs, they read one data set, but analyze it multiple times, each time reading a new set of weights. As both bootstrapping and jackknifing can be thought of as reweighting the characters, this accomplishes the same thing (the multiple weights option is not available for Archie-Faith permutation). As the file with multiple sets of weights is much smaller than a file with multiple data sets, this can be an attractive way to save file space. When multiple sets of weights are chosen, they reflect the sampling as well as any set of weights that was read in, so that you can use SEQBOOT's W option as well.
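The statement that bootstrapping can be thought of as reweighting the characters can be made concrete with a short Python sketch. The function names `bootstrap_weights` and `expand` are hypothetical, and the sketch ignores any input weights. A bootstrap replicate is represented by how many times each site was drawn; expanding a sequence by those counts gives the same set of sites a resampled data set would contain, apart from their order.

```python
import random
from collections import Counter

def bootstrap_weights(n_sites, rng):
    """Draw n_sites sites with replacement and return the count per site."""
    counts = Counter(rng.randrange(n_sites) for _ in range(n_sites))
    return [counts.get(i, 0) for i in range(n_sites)]

def expand(sequence, weights):
    """Repeat each site as many times as its weight (0 drops the site)."""
    return "".join(ch * w for ch, w in zip(sequence, weights))

weights = bootstrap_weights(6, random.Random(0))
replicate = expand("AACAAC", weights)
```

Storing one small list of counts per replicate instead of a full copy of the data is what makes the weights file so much smaller.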
The 0 (Terminal type) option is the usual one.
The data files read by SEQBOOT are the standard ones for the various kinds of data. For molecular sequences the sequences may be either interleaved or sequential, and similarly for restriction sites. Restriction sites data may either have or not have the third argument, the number of restriction enzymes used. Discrete morphological characters are always assumed to be in sequential format. Gene frequencies data start with the number of species and the number of loci, and then follow that by a line with the number of alleles at each locus. The data for each locus may either have one entry for each allele, or omit one allele at each locus. The details of the formats are given in the main documentation file, and in the documentation files for the groups of programs.
The only option that can be present in the input file is F (Factors), and only in the case of binary (0,1) characters. The Factors option allows us to specify that groups of binary characters represent one multistate character. When sampling is done they will be sampled or omitted together, and when permutations of species are done they will all have the same permutation, as would happen if they really were just one column in the data matrix. For further description of the F (Factors) option see the Discrete Characters Programs documentation file.
The output file will contain the data sets generated by the resampling process. Note that, when Gene Frequencies data is used or when Discrete Morphological characters with the Factors option are used, the number of characters in each data set may vary. It may also vary if there are an odd number of characters or sites and the Delete-Half-Jackknife resampling method is used, for then there will be a 50% chance of choosing (n+1)/2 characters and a 50% chance of choosing (n-1)/2 characters.
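The odd-number case of the Delete-Half-Jackknife can be sketched in Python as follows. The function name `delete_half_jackknife` is hypothetical, and returning the kept characters in their original order is an assumption of this example. For an even number of characters n, exactly n/2 are kept; for an odd n, (n+1)/2 or (n-1)/2 are kept with equal probability.

```python
import random

def delete_half_jackknife(n_chars, rng):
    """Return the sorted indices of the characters kept in one replicate."""
    if n_chars % 2 == 0:
        keep = n_chars // 2
    else:
        # Odd n: choose (n+1)/2 or (n-1)/2 with probability 1/2 each.
        keep = (n_chars + rng.choice((1, -1))) // 2
    # Sample without replacement, then restore the original order.
    return sorted(rng.sample(range(n_chars), keep))

kept_even = delete_half_jackknife(6, random.Random(3))  # always 3 characters
kept_odd = delete_half_jackknife(7, random.Random(3))   # 3 or 4 characters
```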
The order of species in the data sets in the output file will vary randomly. This is a precaution to help the programs that analyze these data avoid any result which is sensitive to the input order of species from showing up repeatedly and thus appearing to have evidence in its favor.
The numerical options 1 and 2 in the menu also affect the output file. If 1 is chosen (it is off by default) the program will print the original input data set on the output file before the resampled data sets. I cannot actually see why anyone would want to do this. Option 2 toggles the feature (on by default) that prints out up to 20 times during the resampling process a notification that the program has completed a certain number of data sets. Thus if 100 resampled data sets are being produced, every 5 data sets a line is printed saying which data set has just been completed. This option should be turned off if the program is running in background and silence is desirable. At the end of execution the program will always (whatever the setting of option 2) print a couple of lines saying that output has been written to the output file.
The program runs moderately quickly, though more slowly when the Permutation resampling method is used than with the others.
I hope in the future to include code to pass on the Ancestors option from the input file (for use in programs MIX and DOLLOP) to the output file, a serious omission in the current version.
    5    6
Alpha     AACAAC
Beta      AACCCC
Gamma     ACCAAC
Delta     CCACCA
Epsilon   CCAAAC
(If Replicates are set to 10 and seed to 4333)
    5    6
Alpha     ACAAAC
Beta      ACCCCC
Gamma     ACAAAC
Delta     CACCCA
Epsilon   CAAAAC
    5    6
Alpha     AAAACC
Beta      AACCCC
Gamma     CCAACC
Delta     CCCCAA
Epsilon   CCAACC
    5    6
Alpha     ACAAAC
Beta      ACCCCC
Gamma     CCAAAC
Delta     CACCCA
Epsilon   CAAAAC
    5    6
Alpha     ACCAAA
Beta      ACCCCC
Gamma     ACCAAA
Delta     CAACCC
Epsilon   CAAAAA
    5    6
Alpha     ACAAAC
Beta      ACCCCC
Gamma     ACAAAC
Delta     CACCCA
Epsilon   CAAAAC
    5    6
Alpha     AAAACA
Beta      AAAACC
Gamma     AAACCA
Delta     CCCCAC
Epsilon   CCCCAA
    5    6
Alpha     AAACCC
Beta      CCCCCC
Gamma     AAACCC
Delta     CCCAAA
Epsilon   AAACCC
    5    6
Alpha     AAAACC
Beta      AACCCC
Gamma     AAAACC
Delta     CCCCAA
Epsilon   CCAACC
    5    6
Alpha     AAAAAC
Beta      AACCCC
Gamma     CCAAAC
Delta     CCCCCA
Epsilon   CCAAAC
    5    6
Alpha     AACCAC
Beta      AACCCC
Gamma     AACCAC
Delta     CCAACA
Epsilon   CCAAAC