FastGap homepage
Finn Borchsenius, Department of Biological Sciences,
Last updated 07 February 2012
DOWNLOAD FastGap 1.2 (zip file)
If you find the program useful please send me an email
Cite as: Borchsenius, F. 2009. FastGap 1.2.
Department of Biosciences, Aarhus University, Denmark.
Published online at http://www.aubot.dk/FastGap_home.htm
Introduction
FastGap is a Windows executable program for fast and efficient assembly of DNA
sequence alignment files in BioEdit fasta format into #NEXUS format ready for
analysis in programs such as PAUP* or MrBayes. In the process, gap or indel
characters can be coded using the simple method of Simmons & Ochoterena (2000)
and added to the data file as separate partitions.
The program was built with Borland C++ Builder and needs the following resource
file to be present in order to run:
VCL35.BPL
That file is supplied together with the Windows executable. Place it in the same
folder as the FastGap program.
The Windows interface of FastGap is simple and largely self-explanatory. One or
several sequence alignment files can be added to the assembly list using a
standard Windows file open dialog. The list of files can also be edited manually
if necessary. The present version accepts at most 9 files, but this limit can
easily be changed if there is a need to do so. When the Make command is
executed, each sequence alignment is written to the #NEXUS file as a separate
partition defined in a SETS block. Nucleotide partitions are named region1_nuc,
region2_nuc, etc. If gaps are coded, they are added to the #NEXUS file as
separate partitions following each sequence alignment. Gap partitions are also
defined in the SETS block and named region1_gaps, region2_gaps, etc. A set
including all gap partitions is also defined (charset ‘gaps’) to facilitate fast
inclusion and exclusion of gap characters in PAUP*. A list of all gaps that have
been coded and their first occurrence is written to the #NEXUS file as a list of
comments. I owe inspiration for the format of that list to Young and Healy
(2003).
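The partition layout described above can be illustrated with a minimal Python
sketch. It is not FastGap's own code: the function name sets_block and its two
arguments are hypothetical, and the exact layout FastGap writes may differ. It
only shows how character ranges follow from the region lengths when each gap
partition is placed directly after its sequence alignment.

```python
def sets_block(nuc_lengths, gap_counts):
    """Build a NEXUS SETS block for concatenated regions (illustrative only).

    nuc_lengths[i] is the number of alignment columns in region i+1;
    gap_counts[i] is the number of coded gap characters that follow it.
    """
    lines = ["BEGIN SETS;"]
    gap_ranges = []
    pos = 1  # NEXUS character numbering starts at 1
    for i, (n_nuc, n_gap) in enumerate(zip(nuc_lengths, gap_counts), start=1):
        lines.append(f"  CHARSET region{i}_nuc = {pos}-{pos + n_nuc - 1};")
        pos += n_nuc
        if n_gap:  # a region without coded gaps gets no gap partition
            rng = f"{pos}-{pos + n_gap - 1}"
            lines.append(f"  CHARSET region{i}_gaps = {rng};")
            gap_ranges.append(rng)
            pos += n_gap
    if gap_ranges:
        # One set covering every gap partition, as with charset 'gaps'
        lines.append(f"  CHARSET gaps = {' '.join(gap_ranges)};")
    lines.append("END;")
    return "\n".join(lines)


print(sets_block([14, 10], [3, 0]))
```

With a 14-column first region carrying 3 coded gaps and a 10-column second
region without gaps, region1_nuc covers characters 1-14, region1_gaps 15-17 and
region2_nuc 18-27.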
Other program settings include:
From version 1.2 the program interprets ‘?’, ‘N’ and ‘n’ in the input file as
missing data, i.e., ‘uncertain whether gap or nucleotide’. A hyphen ‘-’
represents a gap. All other characters are treated as nucleotides. Prior
versions used only a single missing data character.
An error message will appear if the specified sequence alignment files cannot be
found or opened, or if the number of taxa varies among files. The program does
not check that all lines of an input sequence alignment file contain the same
number of nucleotide characters. Such errors will be detected only when you try
to execute the generated #NEXUS file in PAUP*. Upon successful assembly, a
preview of the #NEXUS file is displayed for inspection. Note that the file
cannot be edited from the FastGap window. If you detect an error, correct the
input files and assemble them again. If you wish to modify the #NEXUS file after
assembly, open it with your favourite text editor.
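Since sequence lengths are not checked by FastGap, a check before assembly can
save a round trip through PAUP*. The following is a minimal Python sketch of
such a check. The helper names read_fasta and length_problems are hypothetical,
and the sketch assumes that a BioEdit fasta file is an ordinary FASTA file with
‘>’ header lines, which may not cover every detail of the BioEdit format.

```python
def read_fasta(path):
    """Return a list of (name, sequence) pairs from a FASTA file."""
    records = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                records.append([line[1:], ""])   # start a new record
            elif line and records:
                records[-1][1] += line.strip()   # sequences may span lines
    return [(name, seq) for name, seq in records]


def length_problems(records):
    """List (name, length) for taxa whose length differs from the first taxon."""
    if not records:
        return []
    expected = len(records[0][1])
    return [(name, len(seq)) for name, seq in records if len(seq) != expected]
```

An empty list from length_problems(read_fasta("region1.fas")) means all
sequences in that (hypothetical) file have the same length.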
One common source of error is an incorrectly formatted input file. It must be in
the BioEdit standard fasta format. If you experience problems, try opening your
alignment in BioEdit, save it in a different format (e.g., GenBank file ‘.gb’),
then re-open it in BioEdit and save it once more in fasta format. That should
ensure that the format is one FastGap can read.
Another common source of error is space characters in the taxon names. FastGap
will interpret any text following the first space as nucleotide characters.
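One possible workaround, sketched here in Python, is to replace the spaces with
underscores before assembly. The function name fix_header is hypothetical, and
using underscores is a common convention rather than something FastGap
requires. The function expects one line of the input file without its trailing
newline.

```python
def fix_header(line):
    """Replace whitespace inside a FASTA header line with underscores."""
    if not line.startswith(">"):
        return line  # sequence lines are left untouched
    # split() also drops leading/trailing blanks and collapses repeated spaces
    return ">" + "_".join(line[1:].split())


print(fix_header(">Genus species voucher 123"))
```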
The output format is intended for direct use in PAUP*. If your aim is to analyse
the file in MrBayes, you need to manually delete the line
OPTIONS GAPMODE=MISSING;
from the DATA block. You may also wish to add a MrBayes block with the necessary
specifications for your analysis. The data partitions specified in the SETS
block can be copied to the MrBayes block if you intend to analyse a partitioned
model.
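If many files need this treatment, the deletion can be scripted. The Python
sketch below is only an illustration: the function name strip_gapmode is
hypothetical, and it assumes that the statement sits on a line of its own
exactly as quoted above, which should be verified against your own output file.

```python
def strip_gapmode(nexus_text):
    """Return the NEXUS text without the 'OPTIONS GAPMODE=MISSING;' line."""
    kept = [
        line
        for line in nexus_text.splitlines()
        # ASSUMPTION: the statement occupies a whole line by itself
        if line.strip().upper() != "OPTIONS GAPMODE=MISSING;"
    ]
    return "\n".join(kept) + "\n"
```

All other lines, including the SETS block, are passed through unchanged.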
Gap coding algorithm
FastGap scores gap or indel characters according to the simple method described
by Simmons and Ochoterena (2000). The results from FastGap are no different from
those obtained with GapCoder (Young and Healy 2003) or the online Gap Recoder
program (except that the latter places gap characters in a different order). The
main point of FastGap is that it supplies a Windows interface for concatenating
several independent sequence alignment files while simultaneously performing gap
coding. Furthermore, FastGap reads unmodified BioEdit fasta files. These two
features make FastGap very efficient for #NEXUS file assembly by users of
BioEdit and PAUP*/MrBayes on a PC platform, irrespective of whether gaps are
coded or not.
Under the simple method, gaps are considered homologous if and only if they
start and end at the same positions in the sequence alignment. The computational
approach to gap coding applied in FastGap is initiated by a search for the first
gap character in the data matrix. The search starts at position 1 of sequence 1
and proceeds down across sequences before moving to the next position. When a
gap is located, its starting and ending positions are recorded and written to a
list of unique gaps maintained in program memory. Then a decision on how to code
the gap is made for each sequence in the matrix. The rules governing this
procedure are (see figure):
1) If a sequence has a gap starting and ending at exactly the same positions as
the gap being coded, then the gap is scored as present (default value A).
2) If a sequence has a nucleotide character at either the starting OR the ending
position of the gap being coded, then the gap is scored as absent (default value
C).
3) If a sequence has a gap that starts at the same or an earlier position than
the gap being coded AND ends at the same or a later position, without matching
it exactly, then the gap is scored as unknown (default value -).
4) If a sequence has a gap that starts and ends at the same positions as the gap
being coded but is bordered by missing data (‘?’, ‘N’, ‘n’), then the gap is
scored as unknown. This is also the case if the sequence has missing data in the
gap positions.
Having coded the first gap, the search proceeds for other unique gaps until the
last position of the matrix is reached. Ambiguity codes other than ‘N’ and ‘n’
are handled identically to nucleotide characters. Leading and trailing gap
characters are not coded as gaps. Likewise, gaps with missing data on either
side are not coded, since their length and position cannot be defined exactly.
The source code that handles file concatenation and gap coding in FastGap can be
downloaded here. Note that this cannot be compiled directly; you will need to
supply code for your own console (or GUI) interface to interact with the
program.
Fig. 1. Example of gap coding. Three unique gaps (pink, blue, green) are identified and coded:
GAAC------ATGC 01-
GAAC------TTGC 01-
GAAC---CCTTTGC 001
GAA---------GC 1--
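The search order and the four scoring rules can be sketched in Python as
follows. This is an illustration written from the description on this page, not
the FastGap source code, and all function names (gap_runs, unique_gaps, score,
code_gaps) are hypothetical. Two points are assumptions because the rules do not
state them: missing data inside the gap positions is checked before rule 2, and
a sequence with gap characters at both ends of the coded gap but nucleotides in
between is scored as absent.

```python
MISSING = set("?Nn")  # missing-data symbols named on this page
GAP = "-"


def gap_runs(seq):
    """Yield (start, end) for every run of '-' in seq (0-based, inclusive)."""
    i = 0
    while i < len(seq):
        if seq[i] != GAP:
            i += 1
            continue
        j = i
        while j + 1 < len(seq) and seq[j + 1] == GAP:
            j += 1
        yield i, j
        i = j + 1


def unique_gaps(seqs):
    """Return codable gaps in order of first occurrence, scanning column by column."""
    last = len(seqs[0]) - 1
    first_row = {}
    for row, seq in enumerate(seqs):
        for start, end in gap_runs(seq):
            if start == 0 or end == last:
                continue  # leading and trailing gaps are not coded
            if seq[start - 1] in MISSING or seq[end + 1] in MISSING:
                continue  # extent is undefined next to missing data
            first_row.setdefault((start, end), row)
    # Earlier column first; within a column, the sequence nearer the top first
    return sorted(first_row, key=lambda g: (g[0], first_row[g]))


def score(seq, start, end, present="A", absent="C", unknown="-"):
    """Score one sequence for the gap spanning start..end."""
    span = seq[start:end + 1]
    if any(c in MISSING for c in span):
        return unknown  # rule 4; ASSUMPTION: takes precedence over rule 2
    if seq[start] != GAP or seq[end] != GAP:
        return absent  # rule 2: nucleotide at the start or end position
    if set(span) != {GAP}:
        return absent  # ASSUMPTION: nucleotides inside the span, not covered by the rules
    left, right = seq[start - 1], seq[end + 1]
    if left == GAP or right == GAP:
        return unknown  # rule 3: the sequence has a longer, overlapping gap
    if left in MISSING or right in MISSING:
        return unknown  # rule 4: same gap, but bordered by missing data
    return present  # rule 1: identical start and end


def code_gaps(seqs, **symbols):
    """Return one string of gap codes per sequence."""
    gaps = unique_gaps(seqs)
    return ["".join(score(s, a, b, **symbols) for a, b in gaps) for s in seqs]
```

Called on the four sequences of Fig. 1 with present="1" and absent="0",
code_gaps returns the codes shown there: 01-, 01-, 001 and 1--. The gaps are
found in the order positions 4-12, 5-10 and 5-7.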