The purpose of this library (which defines namespace Sequence) is to provide a set of routines for handling biological sequence data, with an emphasis on how evolutionary geneticists handle data. The intent is not to provide a means of writing sequence-format conversion routines; better systems already exist for those tasks (namely the bioperl project, http://www.bioperl.org). Rather, I intend these libraries to be used as a basis for writing programs that perform many of the computational tasks common in evolutionary genetics, a field whose methods are becoming more important to genome analysis and genome comparison.
Most of the routines are written with nucleotide data in mind, since that is what I deal with the most. The fundamental sequence object is defined by the class Sequence::Seq, which declares a sequence interface and a pure virtual interface for I/O. There are also routines to translate sequences into peptides (Sequence::Translate).
In practice, sequence data can come in the form of aligned blocks, and the templates defined in namespace Sequence::Alignment provide the foundation for dealing with such data. The virtual base template class Sequence::AlignStream defines an interface for how alignment I/O must work (as Sequence::SeqStream does for single sequences). An example of alignment I/O is defined for ClustalW format alignments in the template function Sequence::ClustalW.
The library also contains several classes for evolutionary genetic analyses. Classes of particular interest are:
1.) Sequence::PolySNP -- analyze molecular population genetic data
2.) Sequence::Comeron95 -- calculate Ka and Ks by Comeron's (1995) scheme
3.) Sequence::Kimura80 -- calculate divergence by Kimura's (1980) method.
libsequence, copyright Kevin Thornton, University of Chicago, 2002
This library is distributed under the terms of the GNU General Public License (GPL) (http://www.gnu.org). This means it's free, and that you have access to the source code. If you modify the library, you must distribute those modifications under the same terms. The GPL is included in the file COPYING in the root of the source directory for the project, and you should read it if you have any questions (particularly if you are a commercial user, as it will affect you the most).
Most importantly, this library is distributed with no warranty either explicitly stated or implied.
Development of this library has benefited from discussion with several people. Dick Hudson and Eli Stahl provided feedback and much discussion on calculations of summary statistics when there are more than 2 states at a site, and Sequence::PolySNP is the result of those discussions. Dick Hudson and Jeff Wall contributed C code that was adapted into namespace Sequence::Recombination. The coalescent simulation engine is only a slight modification of Hudson's original code. Gerry Wyckoff provided a table of Grantham's distances that are the basis for Sequence::Grantham, and he also provided thousands of comparisons using human/mouse divergence to test the output of Sequence::Comeron95. I should also thank my PhD advisor, Manyuan Long, for indulging me the time to work on this when the PCR was running.
This library has been compiled and tested on a wide variety of Unix systems, including various flavors of Linux (http://www.debian.org), Apple's OS X (http://www.apple.com), and Solaris (http://www.sun.com) systems using g++ 2.9x. Older versions even compiled under Windows using Visual C++, but I don't have access to that platform, and so I will not track portability to it. The library is known to compile with the gcc 2.9x, 3.x, and 4.x compilers.
As of libsequence 1.5.6, compiler optimizations for Apple G4 and G5 processor systems can be used. On a G4, configuring the source code with ./configure --enable-G4=yes sets the options -mcpu=G4, -mpowerpc, and -mpowerpc-gpopt. The option --enable-G5=yes sets -mcpu=G5, -mpowerpc64, and -mpowerpc-gpopt.
libsequence takes advantage of many current features of C++, and your compiler needs to support them. The most important of these are namespaces and templates (including the STL algorithms).
libsequence requires BOOST (http://www.boost.org) to compile. Note that there are no link-time dependencies on BOOST, only compile-time dependencies. That means that you only need to install the BOOST headers, not the run-time libraries.
Installing from source is done with the standard 3 commands:
    ./configure
    make
    make install
If you are not familiar with these commands, please consult your local Unix expert.
Profiling may be enabled by running the configure script with --enable-profile=yes. Please remember that accurate profiling of libraries generally requires static linkage (rather than dynamic).
By default, the library is compiled without debugging symbols, with NDEBUG defined (which disables any assertions), and with -O3 to optimize the code. If you wish to enable debugging capabilities, run ./configure with the flag --enable-debug=yes. This adds -g to the compiler flags, leaves NDEBUG undefined, and does not optimize the resulting object code. Please note that compiling with debugging is only recommended for developers, since it makes the code really big and slow.
For those of you unfamiliar with it, NDEBUG is a preprocessor symbol with a special meaning in C and C++. It means "not debugging." In C, compiling with NDEBUG defined (gcc -DNDEBUG foo.c) turns every call to assert() into a no-op, and this behavior is identical in C++. By default, the library compiles with -DNDEBUG (see Debugging).
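The effect of NDEBUG can be seen in a small stand-alone sketch. Nothing here comes from libsequence: the function names assertions_enabled and compiled_with_ndebug are made up for illustration. The sketch relies only on the standard rule that assert() does not evaluate its argument when NDEBUG is defined before <cassert> is included.

```cpp
#include <cassert>

// Hypothetical helper (not part of libsequence): reports whether this
// translation unit was compiled with NDEBUG defined.
bool compiled_with_ndebug()
{
#ifdef NDEBUG
    return true;
#else
    return false;
#endif
}

// Hypothetical helper (not part of libsequence): returns true only if
// assert() actually evaluates its argument in this translation unit.
bool assertions_enabled()
{
    bool evaluated = false;
    // With NDEBUG defined, assert(expr) expands to a no-op and expr is
    // never evaluated, so `evaluated` stays false.
    assert((evaluated = true));
    (void)evaluated; // silence unused-variable warnings under NDEBUG
    return evaluated;
}
```

Compiled with g++ -DNDEBUG, assertions_enabled() returns false; compiled without that flag, it returns true. This is also why an expression with side effects should never be placed inside assert() in real code.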
All header files in this library define classes/functions/etc. in namespace Sequence. There are also "sub" namespaces, such as Alignment. None of these are brought into scope by default.
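Since nothing is brought into scope automatically, client code has to qualify names or import them itself. The sketch below shows the usual options using a stand-in namespace called Demo. Demo, its Seq struct, and SameLength are hypothetical and only mirror the nested layout described above; they are not the library's real declarations.

```cpp
#include <string>

// Hypothetical stand-in for a library namespace with a nested
// "sub" namespace. Not libsequence's actual code.
namespace Demo
{
    struct Seq
    {
        std::string name;
        std::string residues;
    };

    namespace Alignment
    {
        // Placeholder routine: true if both sequences have equal length.
        inline bool SameLength(const Seq &a, const Seq &b)
        {
            return a.residues.size() == b.residues.size();
        }
    }
}

// Option 1: fully qualify every name.
bool check_qualified(const Demo::Seq &a, const Demo::Seq &b)
{
    return Demo::Alignment::SameLength(a, b);
}

// Option 2: a using-declaration imports one name into one scope.
bool check_using_declaration(const Demo::Seq &a, const Demo::Seq &b)
{
    using Demo::Alignment::SameLength;
    return SameLength(a, b);
}

// Option 3: a namespace alias shortens a long nested name.
bool check_alias(const Demo::Seq &a, const Demo::Seq &b)
{
    namespace aln = Demo::Alignment;
    return aln::SameLength(a, b);
}
```

Keeping the using-declaration or alias inside a function body, rather than at file scope in a header, avoids leaking names into every file that includes that header.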
C++ offers two built-in ways to handle errors. The first is the assert() macro inherited from C, and the second is C++ exception handling. This library uses both, but with an emphasis on assertions over exceptions. The reason for this has to do with both efficiency (all the checks to see if we need to throw an exception can get expensive) and code size (including SeqExceptions.h in every file starts to make the library bloated). A better reason, however, has to do with the programming logic. Much of the code to analyze data assumes, for instance, that the data are aligned (implying that all sequences in a data file are the same length). The library provides a function to check whether all data read into a vector (a vector<Sequence::Fasta *>, for instance) are sequences of the same length (see Sequence::Alignment::IsAlignment). Thus, it is a programmer error, rather than a library error, to start analyzing data without first checking that they are aligned. However, the library will check sequence lengths (and a lot of other things) if it is compiled with debugging enabled. The checks are done by assert(), which calls abort() to terminate the program if the assertion is false. The exceptions thrown by the library, in contrast, deal with errors that a programmer cannot reasonably be expected to guard against in advance, such as badly formatted data, user input that is unsupported for one reason or another, etc.
As far as I know, everything in this library is up to snuff with respect to ISO C++. All the design methods I use are straight from Stroustrup's "The C++ Programming Language" or Meyers' "Effective C++" (both from Addison-Wesley). The library compiles under g++ 3.1.1 ( http://gcc.gnu.org ) with both -ansi and -pedantic flags, so that's a good sign at least. Reports of any portability problems are appreciated. Emailing me fixes for the problems may actually earn you a beer. In addition, the coalescent simulation code included in this package is implemented in C, and has been modified to successfully compile with both -ansi and -pedantic flags.
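The division of labor between assertions and exceptions can be sketched as follows. This is an illustration of the policy only, not the library's code: all_same_length, count_variable_sites, and parse_nucleotides are hypothetical names, plain std::string stands in for the library's sequence classes, and the set of accepted characters is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical check (a stand-in, not Sequence::Alignment::IsAlignment):
// true if every sequence has the same length as the first.
bool all_same_length(const std::vector<std::string> &seqs)
{
    for (std::size_t i = 1; i < seqs.size(); ++i)
        if (seqs[i].size() != seqs[0].size())
            return false;
    return true;
}

// Analysis routine. Aligned input is a precondition the caller must
// establish, so a violation is a programmer error: checked by assert()
// in debug builds and not checked at all when NDEBUG is defined.
std::size_t count_variable_sites(const std::vector<std::string> &seqs)
{
    assert(all_same_length(seqs));
    if (seqs.empty())
        return 0;
    std::size_t n = 0;
    for (std::size_t site = 0; site < seqs[0].size(); ++site)
        for (std::size_t i = 1; i < seqs.size(); ++i)
            if (seqs[i][site] != seqs[0][site])
            {
                ++n;   // at least two states at this site
                break; // count each site once
            }
    return n;
}

// Input routine. Malformed data comes from outside the program, so it
// is reported with an exception that the caller can catch.
// The accepted alphabet below is an assumption for this sketch.
std::string parse_nucleotides(const std::string &raw)
{
    const std::string allowed = "ACGTNacgtn-";
    for (char c : raw)
        if (allowed.find(c) == std::string::npos)
            throw std::invalid_argument(
                std::string("unsupported character in sequence: ") + c);
    return raw;
}
```

A caller would typically wrap the parsing step in a try block, test all_same_length() once on the result, and only then hand the data to the analysis routines.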
I have never programmed an application using threads, so to be safe, one should assume the library is not thread safe. I will look into this in the future, if I have time.
To compile programs using this library, one must obviously include the appropriate headers from the library. Currently, there is no "lazy man's header" that includes all the headers from this package. The reason for this is discussed in Item 34 of Scott Meyers' book "Effective C++". Basically, there are a lot of headers, and including them all everywhere makes things take forever to compile.
To link against the library, pass -lsequence when linking your object code (for example, g++ -o myprog myprog.o -lsequence, where myprog is a placeholder name).