8/2/2019 EST Tutorial
1/24
EST CLUSTERING
TUTORIAL
ISMB 1999
Presenting Tutors:
Win Hide and Alan Christoffels
Authors:
Win Hide, Rob Miller, Andrey Ptitsyn, Janet Kelso, Chellapa
Gopallakrishnan and Alan Christoffels
1 Introduction
The tutorial is organized to provide an introductory framework for understanding the concepts
and underlying reasoning for Expressed Sequence Tag (EST) clustering and consensus
generation and, in addition, to enable readers and attendees to design and implement an
EST consolidation system. It is based on what we have learned over the past three years about
clustering technologies and the issues surrounding the field. The tutorial includes examples
drawn from the authors' experience, which consists mostly of STACK,
STACK_PACK and the algorithms used in these systems. The methods employed by these
systems will be the focus of the demonstration and review session in the second half of the
tutorial. Many of the tasks we discuss can also be performed by other systems, and we will
attempt to compare system strategies as objectively as possible.
Questions on the material or on the implementation of clustering systems can be directed to the
authors:
Win Hide : winhide@sanbi.ac.za www.sanbi.ac.za/stack
Alan Christoffels: alan@sanbi.ac.za
1 Introduction
2 The need for EST clustering
3 General Aspects of EST data
3.1 Generation of ESTs
3.1.1 What is an EST?
3.2 EST data quality
4 Overview of clustering and consensus generation
4.1 What is an EST cluster?
4.2 Loose and stringent clustering
4.3 Data apprehension and input format
4.4 Pre-processing
4.5 Initial clustering
4.6 Assembly
4.7 Alignment processing
4.8 Cluster joining
4.8.1 Clone joining
4.8.2 Available parents
4.9 Output
5 Summary of Introduction
6 Implementation Strategies
6.1 EST pre-processing strategies
6.1.1 Screening out repeats and vector sequence
6.2 Masking strategies
7 EST Clustering methods
7.1 Clustering and Statistical cluster analysis
7.2 Using common search engines
7.2.1 Alignment scoring methods: BLAST and FASTA
7.3 Purpose-built alignment based clustering methods
7.4 Non-alignment based scoring methods: D2-cluster
7.5 Pre-indexing methods
8 Systems and Algorithms
8.1 TIGR_ASSEMBLER
8.2 UniGene
8.3 STACK and STACK_PACK
8.4 Other datasystems
8.5 Strategies for keeping data 'current'
9 Cluster Assembly and Processing
9.1 Processing alignments
10 Clone Linking
11 A working clustering system
11.1 Expression counts
11.2 Consensus sequences
11.3 Alternate expression-form characterisation
11.4 SNP detection
11.5 Identification of genes expressed in the cluster project
11.6 Identification of genes specifically expressed in a chosen library or tissue
12 Brief introduction to the STACK_PACK clustering system
12.1 STACK processing
12.2 Data bin strategy and masking
12.2.1 Detecting and viewing alternate splicing
12.2.2 Adding using STACK_PACK
12.3 References in order of mention
12.4 Appendix
2 The need for EST clustering
With easy access to the technology for generating expressed sequence tags (ESTs), several
groups have sequenced from thousands to several hundred thousand ESTs. Currently the
majority of available human coding sequence is in the form of ESTs, and the need
to discover the full-length cDNA of each human gene is frustrated by the partial nature of
this data. There is significant value in attempting to consolidate gene sequences as
they are produced, in lieu of a yet-to-be-completed reference sequence. ESTs offer a rapid and
inexpensive route to gene discovery, reveal expression and regulation data (Vasmatis, et al,
1998), highlight gene sequence diversity and splicing (Wolfberg and Landsman, 1997), and
may identify more than half of known human genes (Hillier, et al, 1996). The price of the
high-volume, high-throughput nature of the data, however, is that ESTs contain high error
rates (Aaronson, et al, 1996), do not have a defined protein product, are not curated in a
highly annotated form and present only a raw substrate for sequence matching. Unfortunately,
most EST data remain unprocessed, and thus do not yield the high-value
sequence consensus information that they contain. The low-quality sequence data can
be much improved; achieving quality information requires pre-processing, clustering
and post-processing of the results. One goal of these projects is the construction of
gene indices, in which all transcripts are partitioned into index classes such that transcripts are
put into the same index class if and only if they represent the same gene or gene isoform.
Accurate gene indexing facilitates gene expression studies as well as inexpensive and early
gene sequence discovery through the assembly of ESTs derived from genes that have
yet to be positionally cloned or obtained directly through genomic sequencing.
3 General Aspects of EST data
3.1 Generation of ESTs
3.1.1 What is an EST?
Adams et al (1991) published a widely read article describing the use of ESTs. An EST, or
Expressed Sequence Tag, is a small portion of an entire gene: a fragment of a cDNA clone that
has been sequenced. The process by which ESTs are manufactured requires the construction
of a cDNA library from an mRNA population. Baldo et al (1996) have provided a detailed
description of how libraries are constructed and how normalization and library subtraction
can be used to increase the relative representation of less abundantly transcribed mRNAs.
The reverse transcriptase used to manufacture each cDNA in the library will eventually fall
off the template (Figure 1), and this terminates the production of the cDNA. Thus a series of
3'-delimited cDNA fragments of differing lengths may be produced for each mRNA that is a
viable template in the library. The length of the cDNA will vary, and this is an important
factor in the coverage obtained for each mRNA template of an available gene. Usually, several
hundred to several thousand clones are isolated at random from a given cDNA library. Clones
are sequenced a single time, from one or both ends of the DNA insert, using universal primers
that are complementary to the vector at the multiple cloning site. The M13 forward primer may
be located near the 5' or the 3' end of the cloned insert, depending on how the inserts were
directionally cloned. Only 300-500 readable bases are produced from each sequencing read,
and yet a full gene transcript may be several thousand bases long. ESTs thus provide a
"tag level" association with an expressed gene sequence, trading quality and total sequence
length for the high number of genes that can be tagged in a given amount of time.
Figure 1 Manufacture of an EST
3.2 EST data quality
Generation of EST data results in 'low quality' sequence information. A single read is
generated for each EST, and as such it will contain errors from each step of its generation.
These can include errors in clone orientation and in the associated clone ID, chimeras, and
missing 3' or 5' reads. Because the data are single-pass, unedited sequences, they are also
subject to errors caused by compressions and basecalling problems, which result in frameshifts.
The Washington University website details common aspects of EST error (footnote 1). EST
sequence has regions of high quality very close to regions of low quality, where quality can
be defined as the number of correctly sequenced bases within a known window of reference. It
is possible to utilise poor quality sequence as long as relevant strategies for maximising
its utility are adopted.
4 Overview of clustering and consensus generation
EST clustering is performed as a staged process in which the 'clustering information' used
becomes less and less definitive. Initially, sequence identity provides a good guide to cluster
membership. Shared annotation then provides joining information that can be of more variable
quality. Thus the number of accurately clustered ESTs is heavily dependent on a strategy that
can assign cluster membership based on verifiable criteria; sequence identity is currently the
most useful of these.
Clustering can be performed with or without sequence consensus generation. It is preferable,
although more difficult, to manufacture a consensus sequence from each cluster. The
clustering overview will briefly describe processes that result in consensus sequence
generation.
4.1 What is an EST cluster?
A cluster is fragmented EST data (DNA or protein) and, if known, gene sequence data,
consolidated, placed in correct context and indexed by gene, such that all expressed data
concerning a single gene is in a single index class, and each index class contains the
information for only one gene (Burke, Davison, Hide, Submitted, Genome Research).
4.2 Loose and stringent clustering
ESTs by their nature have a degree of erroneous sequence data, complicated by short length
and some mis-annotation. Stringent one-pass assembly methods tend to result in fewer,
shorter consensus sequences. Looser systems for clustering result in larger, more 'sloppy'
clusters, with various expressed forms being represented within each cluster. Each approach
has its advantages and disadvantages. Stringent clustering provides greater initial fidelity, at a
cost of lower coverage of expressed gene data and a lower inclusion rate of expressed gene
forms. Loose clustering provides greater coverage and greater inclusion of alternate
expressed forms, at a cost of lower fidelity data and possible inclusion of paralogous
expressed genes. We detail here a loose strategy and compare it with the other strategies
described below. Stringent clustering is performed by algorithms such as TIGR_ASSEMBLER, and
looser clustering by the processes used to build UniGene and STACK. The
latter methods tend to be broken down into several sub-processes, which gives more
flexibility but also more complexity.
4.3 Data apprehension and input format
Data sources for clustering can be in-house, proprietary, public database or a hybrid of these.
It is therefore important to establish an INPUT DATA FORMAT that is consistent
between these sources. The format needs to carry the input accession number of a
sequence (or a temporary accession, such as a sequence-run id, if not yet submitted),
information on location with respect to poly A (3' or 5') and, importantly, the CLONE ID from
which the EST was generated. SANBI uses FASTA format input files that contain the above
information in the header.
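As a rough illustration of such a header convention, the sketch below pulls the three required fields out of a FASTA header line. The `5-PRIME` / `3-PRIME` token follows the header shown in Figure 3a; the `CLONE:` field name, the `EstHeader` type and the `parse_header` function are assumptions made for this example and are not the actual SANBI header layout.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EstHeader:
    accession: str            # first token after '>'
    read_end: Optional[str]   # "5" or "3", None if the header does not say
    clone_id: Optional[str]   # None if no clone field is present


def parse_header(line: str) -> EstHeader:
    """Extract accession, read end and clone ID from one FASTA header line.

    The 'CLONE:' field is a hypothetical convention used only in this sketch.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    body = line[1:].strip()
    accession = body.split()[0]
    end = re.search(r"\b([35])-PRIME\b", body)
    clone = re.search(r"\bCLONE:\s*(\S+)", body)
    return EstHeader(
        accession=accession,
        read_end=end.group(1) if end else None,
        clone_id=clone.group(1) if clone else None,
    )
```

Whatever field names are chosen, the point is that every source is converted to the same layout before clustering, so that later steps such as clone joining can rely on the fields being present.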
The suggested steps in EST clustering are as follows:
4.4 Pre-processing
Sequences are masked for repeats and vector, and formatted for the clustering engine.
Sequence quality is often assessed at this step: a sequence is accepted only if a minimum
number of residues lie above a known quality threshold. At SANBI, all masked sequence data
above 50bp in length is accepted for clustering. This low input threshold is possible because
the clustering method(s) can work successfully with more erroneous data. NCBI
discards ESTs with a window of less than 100bp of 'clean' data.
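A minimal sketch of such an acceptance filter follows. It assumes that "clean" data means the longest contiguous run of residues that are not mask characters, and that a read passes when that run reaches the threshold; both are interpretations made for this example, and the function names are hypothetical. The 50bp and 100bp values come from the text above.

```python
def longest_clean_run(seq: str, mask_chars: str = "NXnx") -> int:
    """Length of the longest stretch of seq that contains no mask character."""
    best = run = 0
    for ch in seq:
        run = 0 if ch in mask_chars else run + 1
        best = max(best, run)
    return best


def accept(seq: str, min_clean: int = 50) -> bool:
    """Accept a masked read if its longest clean run is at least min_clean bases."""
    return longest_clean_run(seq) >= min_clean
```

Calling `accept(seq, min_clean=100)` would mimic the stricter window described for NCBI, again under the same assumptions.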
4.5 Initial clustering
An initial clustering is performed based on a measure of high sequence identity.
4.6 Assembly
Assembly is either part of the initial clustering (as in TIGR_ASSEMBLER) or separated
into clustering followed by assembly, the latter performed by a specialist assembly package
such as PHRAP (with assessment of residue quality turned off) or CAP2 / 3 (footnote 2).
4.7 Alignment processing
Aligned clusters, particularly those generated by a loose clustering engine, need to be
processed for errors and alternate forms of expressed sequences. Consensus generation may
be a result of this step (as in STACK), or a consensus can be accepted directly from the
assembly step. Consensi are chosen based on maximal length. See section 9.1.
4.8 Cluster joining
Once clustering is complete, clusters and/or cluster consensi can be joined further using
available annotation.
4.8.1 Clone joining
The most powerful cluster joining method is clone-joining, which utilises the physically
shared clone id between 3' and 5' EST fragments sequenced from the same starting clone. A
different approach would be to link clone-related sequences in the pre-processing phase, but
this may increase errors and processing time requirements at the clustering and assembly
stages. By either approach, linking by clone annotation is an error-prone step in the EST
consolidation process as it relies entirely on the accuracy of the sequence annotation and the
uniqueness of clone IDs if data from disparate sources is to be used.
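Clone joining after clustering can be sketched as a union of clusters that share at least one clone ID. The input layout (a mapping from a cluster name to the set of clone IDs of its member ESTs) and the function name are assumptions made for this illustration; it is not the STACK implementation.

```python
from collections import defaultdict
from typing import Dict, List, Set


def join_by_clone(cluster_clones: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group clusters that are linked, directly or transitively, by a shared clone ID."""
    parent = {c: c for c in cluster_clones}

    def find(x: str) -> str:
        # Follow parent links to the representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen: Dict[str, str] = {}   # clone ID -> first cluster that contained it
    for cluster, clones in cluster_clones.items():
        for clone in clones:
            if clone in first_seen:
                parent[find(cluster)] = find(first_seen[clone])
            else:
                first_seen[clone] = cluster

    groups: Dict[str, Set[str]] = defaultdict(set)
    for cluster in cluster_clones:
        groups[find(cluster)].add(cluster)
    return list(groups.values())
```

Because the join trusts clone IDs completely, a single mis-annotated or non-unique clone ID merges unrelated clusters, which is the error risk described above.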
4.8.2 Available parents
If a parent mRNA sequence is available (non-EST) it can be used to physically link EST
cluster(s) via sequence comparison.
4.9 Output
Figure 2 Cartoon of clustering steps
Defining an output format for the clustering process is problematic. The information required
often includes alignment (alternate splices, polymorphisms and error assessment), raw cluster
membership, and contextual links. Nonetheless, results must be easily incorporated into
existing software packages, which in general have not been designed to support the
complexity or evolving nature of clustered EST data. For STACK, individual cluster
alignments (including consensi and sub-consensi) are presented in an alignment format
(GDE), while consensus sequences and clone-linked sets are stored in FASTA format for use
as BLAST-searchable databases. A distributable FASTA file is in development, as is a
'reduced' format for easy distribution. Development of data formats and systems for accessing
information about EST clusters is an ongoing project at SANBI.
5 Summary of Introduction
In this section we have addressed the value of clustering ESTs for the elucidation of complete
sequences not yet available through traditional methods. The nature and quality of EST data
have been assessed with reference to the limitations that they place on sequence processing.
We have given an overview of EST clustering procedures, showing the flow of EST sequence
information through the stages of pre-processing, initial clustering, assembly, alignment
processing, cluster joining and output. The following section will examine each of these
stages in more detail and address some of the issues encountered.
6 Implementation Strategies
6.1 EST pre-processing strategies
6.1.1 Screening out repeats and vector sequence
Clustering requires that a certain degree of identity be preserved between the members of a
defined cluster. The most common problem in EST clustering is contamination with a
sequence that is common to several members of the input EST data set but not unique to a
specific group. Common classes of contaminating sequence are:
6.1.1.1 Repeats
Any uncharacterised expressed sequence set is likely to contain repeat sequences. For
instance, in humans the most common element is the Alu repeat element. In newly sequenced
genomes, such as plant and other eukaryote systems, repeat sequences represent a common
and frustrating clustering problem. Repeat databases provide a resource against which repeats
can be detected. Their value depends on continuing curation and on the detection of
novel repeats in genomes.
6.1.1.2 Cloning vector fragments
Vector sequences can skew clustering even if only a small vector fragment remains in each read.
6.1.1.3 Low complexity sequences
Low complexity sequences (poly A tracts, AT repeats etc) also have the potential to provide an
artifactual basis for cluster membership. The problem is more significant for strategies that
employ alignable similarity in the first-pass cluster assignment. Word-based cluster
assignment can be modified to give low weight to low complexity words. In recent EST
clustering applications being developed at SANBI (d2-cluster, ASSA) and elsewhere, there is
no need to mask low-complexity DNA, as such regions tend to have a highly redundant
oligonucleotide composition. Because these sequence comparison algorithms scale
oligonucleotides according to their potential information content, highly redundant oligos are
given very low weight and low-complexity regions are therefore effectively excluded from
consideration.
6.2 Masking strategies
The most effective method to remove contaminants is to compare each read against a
reference database of repeats (RepBase, footnote 3) and vector sequences (VecBase, footnote 4)
using an algorithm that is reasonably fast and accurate. XBLAST (NCBI tools) and Cross-Match, an
implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green (footnote 5), have
been used successfully, with Cross-Match demonstrating greater flexibility and sensitivity
than XBLAST. Repeat-Masker (footnote 6) uses Cross-Match as its masking tool. DUST is used at
NCBI for this procedure.
In cases where a direct identity is found with a repeat or vector subsequence, a 'mask residue'
can be substituted into the read. The resulting runs of NNNNs (DNA) or XXXXXs (protein)
will be ignored by most clustering engines (Figure 3a and 3b).
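The substitution step itself is simple once the matching regions are known. The sketch below assumes that a search tool has already reported the matches as zero-based, half-open (start, end) coordinates on the read; that coordinate convention and the function name are assumptions of this example, not the output format of XBLAST or Cross-Match.

```python
from typing import Iterable, Tuple


def mask_regions(seq: str, regions: Iterable[Tuple[int, int]], mask_char: str = "N") -> str:
    """Replace every base inside the given half-open intervals with mask_char."""
    chars = list(seq)
    for start, end in regions:
        # Clip each interval to the read so out-of-range hits cannot raise.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)
```

For a protein read one would pass `mask_char="X"`; Figure 3b uses a lowercase `x` for the masked DNA read.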
A problem arises when an EST library is presented that is from a novel organism for which
the repeats have not been characterised. In this instance it may be necessary to employ 'blind'
repeat masking if an algorithm is available. Repeat masking is necessary if the repeats are
large enough to represent a source for artifactual contamination.
An exploitable feature of sequence contamination in loose clustering is that, if the tools work
as intended, the contaminated sequences and all related sequences will be clustered
together. There is no automatic method to identify a contaminated cluster, but once it is
identified only that cluster needs to be decomposed into its original sequences and
re-processed (no other cluster will be affected by the sequences in the contaminated cluster).
This may not be the case for stringent clustering systems, where a more cautious repair
strategy is required.
Figure 3a EST sequence in FASTA format.
>T27784 g609882 | T27784 CLONE_LIB: Human Endothelial cells. LEN: 337 b.p. FILE
gbest3.seq 5-PRIME DEFN: EST16067 Homo sapiens cDNA 5' end
AAGACCCCCGTCTCTTTAAAAATATATATATTTTAAATATACTTAAATATATATTT
CTAATATCTTTAAATATATATATATATTTNAAAGACCAATTTATGGGAGANTTGC
ACACAGATGTGAAATGAATGTAATCTAATAGANGCCTAATCAGCCCACCATGTTC
TCCACTGAAAAATCCTCTTTCTTTGGGGTTTTTCTTTCTTTCTTTTTTGATTTTGCAC
TGGACGGTGACGTCAGCCATGTACAGGATCCACAGGGGTGGTGTCAAATGCTATT
GAAATTNTGTTGAATTGTATACTTTTTCACTTTTTGATAATTAACCATGTAAAAAATG
Figure 3b EST sequence read after masking with XBLAST or Cross-Match against VecBase + RepBase
>T27784 g609882 | T27784 CLONE_LIB: Human Endothelial cells. LEN: 337 b.p. FILE
gbest3.seq 5-PRIME DEFN: EST16067 Homo sapiens cDNA 5' endxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxTATTTNAAAGACCAATTTATGGGAGANTTGCACACAGATGTGAAATGAATG
TAATCTAATAGANGCCTAATCAGCCCACCATGTTCTCCACTGAAAAATCCTCTTTC
TTTGGGGTTTTTCTTTCTTTCTTTTTTGATTTTGCACTGGACGGTGACGTCAGCCAT
GTACAGGATCCACAGxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxAACCATGTAAAAAATG
7 EST Clustering methods
7.1 Clustering and Statistical cluster analysis
EST clustering is not classical cluster analysis as understood by statisticians. In cluster
analysis, practically all statistical aspects concern the classification, and the justification of
the classification, of objects with complicated relationships. This may be why we use the word
"clustering" instead of "cluster analysis". In EST clustering the fundamental problem, the
selection of a metric, is reduced to a simple binary digit: sequences either match or
don't. To cluster EST data we look for near-identical matches. Essentially, the fragments we
put together belong to the same DNA, and any mismatches mostly result from
misreads and represent noise. The inter-cluster distance, like the inter-object distance, is
also reduced to binary: in an ideal situation, the clusters of ESTs related to one gene should
be infinitely close to each other, those related to different genes should be infinitely distant,
and no intermediate state should make sense. With the distance measure simplified in this way,
the problem of estimating clustering quality also reduces to a trivial task: the closer we get to
detecting all exactly matching fragments, the better. Thus, from the statistical point of view,
clustering ESTs is a trivialised version of cluster analysis. This does not make the job any
easier.
7.2 Using common search engines
Small projects, clustering a few dozen, a few hundred or even a few thousand ESTs, are indeed
'trivial' and can be approached with standard contig assembly tools or even multiple
alignment. The real challenge is the amount of data now available in public and private
databases and waiting to be clustered. With millions of ESTs to consider, obtaining
the trivial binary distance between fragments is far from a trivial job, even for available
supercomputers. Modern tools of sequence comparison (the well-known Smith-Waterman,
FASTA and BLAST) were mostly built for a different purpose: searching. They are all
variations of an alignment algorithm, i.e. one that positions sequence elements (nucleotides
or groups of nucleotides) against each other so as to maximize some score. The purpose of this
process is to detect and quantitatively measure the similarity (distance) between any two
sequences compared. Smith-Waterman is the most exhaustive and computationally expensive
tool, offering the best sensitivity and detecting weak similarities. FASTA and BLAST
trade some sensitivity for speed. As mentioned previously, the distance measure
in EST clustering is reduced to binary, so it is only necessary to detect a perfect or
near-perfect match. Extension penalties and gapping manipulation become less important in an
initial assessment of pairwise identity. It is therefore important to 'head for speed' over
sensitive comparison. Applying a banded Smith-Waterman to clusters that have already been
compared is a tenable approach for subsequent consensus generation.
7.2.1 Alignment scoring methods: BLAST and FASTA
BLAST is an algorithm efficiently implemented on many platforms. Although not developed
specially for clustering, BLAST sequence comparison is widely used as a first step in EST
clustering because it is readily available and flexible enough to be tuned for the task by
changing default parameters (wrappers for iterative BLAST comparison exist and are
accessible via the web, see appendix). Using the standard BLAST II application, available
from NCBI anonymous FTP, it is possible to set up a stringent set of match parameters as
follows:
-e expectation value set to 0,
-G cost to open a gap can be increased,
-E cost to extend a gap can be increased,
-q mismatch penalty increased,
-r match reward increased,
-a number of processors to use can be adjusted,
-W word size (found on NCBI web) set for longer words (default 11)
FASTA is less widely used, but can also be applied for the same purpose. Generally, it allows
the same type of variation in parameters as BLAST: increasing the ktup parameter
increases speed, and raising the threshold picks up only the strongest similarities.
7.3 Purpose-built alignment based clustering methods
The field has recently seen the emergence of many new algorithms in development, but
dedicated production algorithms are still few. We will not attempt to review them all here, as
new material is appearing rapidly.
ICAtools, last distributed in 1997 by Jeremy Parsons, represents an alignment-based
systematic approach to clustering. This group of algorithms was one of the first to become
available, has been used at major sequencing centres and is useful for data reduction. The
system is, however, not as complete as others that have been developed subsequently. ICAtools
were developed as part of the UK human genome mapping project (Alwen, 1990).
According to Jeremy Parsons, the ICAtools are a set of programs that could be of use to
anyone doing medium-to-large scale DNA sequencing projects. The system has several tools
but was originally designed to address database redundancy, then adapted for genomic
fragments and finally for EST fragments.
ICAass uses a BLASTN type of algorithm to perform 'database pruning', assessing whether one
sequence is a sub-set of another. N2tool constitutes a dedicated clustering tool that relies on
indexing. It uses an indexed file format and local alignment to compare all the submitted
sequences with each other to find those which share any region of similarity. ICAtool indexes
DNA sequences into clusters which share local sequence similarities. ICAass takes a
size-sorted (longest first) file of sequences and searches for those sequences which are
approximately repeated within the length of another. ICAmatches attempts to explain why
sequences have been clustered together by using novel sequence alignment. N2tool has been
utilised in the manufacture of the Washington University Merck ESTs, for identification of
artifactual sequences. A new generation of these tools is being developed and employed at EBI
(footnote 7), and employs CORBA and Java for database interrogation (JESAM: Parsons and
Rodriguez Tome, personal communication).
EST-BUILDS
Another common-sense system recently published by Gill et al (1998) is a 'dynamic build'
system, where a seed EST is used to match with others via a build-blast strategy. This method
works well for single seed clusters, but is not implemented for large scale database building.
7.4 Non-alignment based scoring methods: D2-cluster
D2-cluster is a word multiplicity comparison method that utilises an agglomerative algorithm
specifically developed for rapidly and accurately partitioning transcript
databases into index classes by clustering ESTs and full-length sequences according to
minimal linkage or transitive closure rules. An agglomerative clustering method is one in which
every sequence begins in its own cluster and the final clustering is constructed through a
series of mergers. Here the mergers follow "minimal linkage", sometimes called "single
linkage" or "transitive closure". The term transitive closure refers to the property that any two
sequences with a given level of similarity will be in the same cluster; hence A and B are in the
same cluster even if they share no similarity, provided there exists a sequence C with enough
similarity to both A and B. The criterion for joining clusters is the detection of two sequences
that share a window of (Window_Size) bases that is (Stringency) percent or more identical.
The only criterion for clustering is sequence overlap; source or annotation information is
not used. To detect the overlap criterion we use the d2 algorithm and set parameters and
threshold values as described in (Torney et al, 1990; Hide et al, 1994; Wu et al, 1997). The
initial and final state of the algorithm is a partition of the input sequences where each
sequence is in a cluster and no sequence appears in more than one cluster.
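The transitive closure (single linkage) rule can be illustrated with a small union-find sketch. This is not the d2_cluster implementation: the pairwise test `shares_window` is a hypothetical stand-in that requires an exact shared substring, whereas the real criterion is a window of (Window_Size) bases at (Stringency) percent identity, and the all-against-all loop is only practical for toy inputs.

```python
from typing import Callable, Dict, List, Sequence


def single_linkage_clusters(
    seqs: Sequence[str], similar: Callable[[str, str], bool]
) -> List[List[int]]:
    """Partition sequence indices by the transitive closure of `similar`."""
    parent = list(range(len(seqs)))  # every sequence starts in its own cluster

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            # Merge two clusters as soon as one pair of members passes the test.
            if find(i) != find(j) and similar(seqs[i], seqs[j]):
                parent[find(j)] = find(i)

    clusters: Dict[int, List[int]] = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def shares_window(a: str, b: str, window: int = 8) -> bool:
    """Toy criterion (assumption): an exact shared substring of `window` bases."""
    words = {a[i:i + window] for i in range(len(a) - window + 1)}
    return any(b[i:i + window] in words for i in range(len(b) - window + 1))


# A and B share nothing, but C overlaps both, so all three end up together.
A, B, C, D = "AAAAAAAACCCC", "GGGGTTTTTTTT", "AAAAAAAATTTTTTTT", "CGCGCGCGCGCG"
print(single_linkage_clusters([A, B, C, D], shares_window))
```

Whatever the comparison order, the result is a partition: each sequence appears in exactly one cluster.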
D2-cluster uses an approach of word matching within a window, together with a measure of
the multiplicity (if any) of that word within a window. The principal concept is that it doesn't
attempt an alignment, not even in a reduced form. The results of comparison are derived
directly from the comparison of word composition (word identity and multiplicity) of 2
sequence windows. Thus, the algorithm can be significantly faster than BLAST. Speed comes
with a price: to collect significant statistics, the fragments must be long enough (about 100
bp) and only very high similarities can be detected (above 90% identity within a window).
D2-cluster is used to produce initial loose clusters in the STACK clustering system. We have
determined that results of d2_cluster alone are between 8% and 20% less fragmented than
UniGene (Burke, Davison, Hide, submitted, Genome Research) and the STACK datasystem
produces clean clusters that are 16% less fragmented than UniGene (Miller et al, submitted,
Genome Research 1999).
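To make the idea of comparing word composition concrete, the sketch below scores two windows by the sum of squared differences in word multiplicity. The squared-difference form and the default word size of 6 are assumptions made for illustration; the parameters and thresholds actually used are those of the cited d2 papers, and the function names here are hypothetical.

```python
from collections import Counter


def word_counts(window: str, k: int = 6) -> Counter:
    """Multiplicity of every overlapping k-base word in a window."""
    return Counter(window[i:i + k] for i in range(len(window) - k + 1))


def d2_distance(win_a: str, win_b: str, k: int = 6) -> int:
    """Sum of squared differences in word multiplicity (assumed form of d2)."""
    ca, cb = word_counts(win_a, k), word_counts(win_b, k)
    # Counter returns 0 for absent words, so words unique to one window count fully.
    return sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys())


# No alignment is computed: only word identity and multiplicity are compared.
print(d2_distance("ACGTACGT", "ACGTACGT", k=2))
print(d2_distance("ACGTACGT", "ACGTACGA", k=2))
```

A score of zero means the two windows have identical word composition; larger scores indicate more divergent windows.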
7.5 Pre-indexing methods
The size of the datasets has effectively precluded their use on workstation architectures.
Indexing is one approach that allows for less computationally intense operations. Indexing of
sequences allows one to store ready tables of words and their positions in a sequence, so the
comparison algorithm has more than half of the work done already when it starts. Publicly
available tools are expected in the near future; for instance, 'QUASAR', announced at
RECOMB99, is termed 'Q-gram Alignment based on suffix arrays'. This
algorithm is designed to quickly detect sequences with high similarity to the query in a
context where many searches are conducted on one database. The database is presented in the
pre-processed (indexed) form and similarity detection is based on the exact matching of short
substrings (q-grams) (Burkhardt et al, 1999).
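A ready table of words and their positions can be sketched with an ordinary hash table. Note the substitution: QUASAR itself is described as using suffix arrays, while this illustration uses a Python dictionary, and the function names and default word length are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WordIndex = Dict[str, List[Tuple[str, int]]]


def build_word_index(db: Dict[str, str], q: int = 8) -> WordIndex:
    """Pre-process the database once: word -> list of (sequence name, position)."""
    index: WordIndex = defaultdict(list)
    for name, seq in db.items():
        for pos in range(len(seq) - q + 1):
            index[seq[pos:pos + q]].append((name, pos))
    return index


def candidate_hits(index: WordIndex, query: str, q: int = 8) -> Dict[str, int]:
    """Count exact q-base word matches between the query and each database entry."""
    hits: Dict[str, int] = defaultdict(int)
    for pos in range(len(query) - q + 1):
        for name, _ in index.get(query[pos:pos + q], ()):
            hits[name] += 1
    return dict(hits)


db = {"est1": "ACGTACGTAA", "est2": "TTTTTTTTTT"}
index = build_word_index(db, q=4)          # built once, reused for many queries
print(candidate_hits(index, "CGTACG", q=4))
```

Only sequences with enough word hits would then be passed to a slower, more sensitive comparison.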
8 Systems and Algorithms
8.1 TIGR_ASSEMBLER
Developed for the assembly of large shotgun projects, TIGR assembler has also been
employed to manufacture consensus sequence assemblies for TIGR-THC from ESTs.
The assembler was originally designed for large shotgun projects, but is suited for EST
assembly. It employs a standard rapid oligonucleotide content comparison to reduce search
time. Pairwise comparisons generate a list of potential (end-)overlaps. Non-repeat fragments
seed subsequent assembly of listed overlaps.
TIGR Human Gene Index (HGI, http://www.tigr.org) employs the strict assembly method of
TIGR_ASSEMBLER (Sutton et al, 1995), grouping highly related sequences and
consequently producing very accurate consensus sequences with a minimum of chimerism or
other contamination. This method discards under-represented and divergent or noisy
sequences in favour of confidence based on transcript redundancy, but in doing so can
eliminate related sequences from clusters which might provide examples of alternative
splicing or other valuable forms of sequence diversity.
For TIGR THC, assemblies and ESTs are clustered according to a two stage process by the
THC_BUILD script (G. Sutton):
1) BLAST and FASTA are used to identify all sequence overlaps,
2) All overlaps are stored in a relational database, and
3) Transitive closure groups are formed and subjected to assembly using TIGR assembler
(Sutton et al, 1995) such that joining occurs when sequences share at least 95% identity
over at least 40 bases (http://www.tigr.org/hgi/hgi_info.html).
The presence of sequence repeats affects both the order and stringency of joining. The
resulting assemblies are referred to as tentative human consensus sequences (THCs). The
assembler also imposes special matching constraints on the ends of sequences and a minimum
sequence identity within an index group. The strictness of the matching criterion has the
advantage of often preventing chimerism and contamination from tainting index groups but
results in a more fragmented representation of the data that is less able to incorporate
error prone sequence. This strictness often prevents sequences with sufficient diversity from
being combined, so that sufficiently divergent ESTs that sample alternative splice forms of
the same gene are kept in different assemblies, but they are linked as being splice variants in
those cases where the ESTs match sequenced genes with known isoforms in EGAD. Most
sequence assembly programs share similar properties with TIGR. The PHRAP package
incorporates sequence quality data derived from sequence traces into the assembly process (P.
Green, personal communication), allowing for the incorporation of higher error data, but the
edge matching and maximum mismatch criteria still must be satisfied. In general, the results
of assembly programs are not invariant with respect to presentation order (Burke, Davison,
Hide, submitted, Genome Research).
Table 1. Comparison of TIGR_THC with SANBI STACK_PACK methodology.

Methodology        Input Sequences    Singleton Groups    % Singleton Groups
TIGR Gene Index    626 163            135 140             21.83
STACK_PACK         415 833            58 070              13.96

STACK_PACK analysis of UniGene clusters resulted in a fragmentation rate just over half of the
TIGR index.
8.2 UniGene
The most widely known effort is UniGene (Boguski et al, 1995) from NCBI. UniGene will be
replaced by the RefSeq and LocusLink8 projects as the availability of genomic sequence
increases.
The UniGene project originally took the fingerprinting characteristic of 3' UTRs (and hence
3' ESTs) as a paradigm and indexed genes by clustering 3' end EST sequences with mRNA
data extracted from GenBank (Benson et al, 1994). 5' end ESTs were added to the clusters
using clone information. To enhance the speed of the clustering phase of the project a
two-phase searching process was used for overlap detection. First, two sequences were marked
for further comparison if they shared two common words of length 13 separated by no more
than 2 bases. The sequences selected for further comparison were then compared with a
constrained Smith-Waterman local alignment algorithm. The membership of clusters in
recent versions of UniGene, however, suggests that additional criteria are being used to
determine cluster membership (Burke, Davison and Hide, submitted, Genome Research).
UniGene clustering proceeds in several stages, with each stage adding less reliable data to the
results of the preceding stage. 'Reliable' refers to the ordering pairwise identity > annotation >
shared clone, etc.
Contaminant screening of vector sequences and repetitive elements and mitochondrial and
ribosomal sequences is performed using NCBI's DUST. After screening, a sequence must
contain at least 100 informative bp to be a candidate for entry into UniGene (informative
length). This initial criterion differs from the 'all in' strategy employed at SANBI where
alignment is not required for clustering. Sources for UniGene include GenBank Genomic,
dbEST and GenBank mRNAs. GenBank Genomic sequences are electronically spliced exons
(vGenes).
Initial clustering is performed by comparing the set of gene sequences (mRNA or genomic
sequences, many of which are complete CDSs) with itself. Sequence pairs which are
sufficiently similar are grouped together to form initial clusters. EST to gene links and EST
to EST links are added to these clusters. The set of ESTs is compared with the set of genes
using WHALE (http://nucleus.cshl.org/meetings/98genome_absstat.htm), and sufficiently
similar sequence pairs are added to the clusters (MEGABLAST Unpublished, figure 4). Any
new links which would join two distinct clusters from the preceding stage (that is, join two
sets of genes not linked to form one cluster without the addition of ESTs) are discarded.
Figure 4, from Wagner et al. Abstracts 1999 Genome Sequencing and Biology. P340. Cold Spring Harbor
Laboratory.
Any resulting cluster which does not contain a sequence with a polyadenylation signal or two
labelled 3' ESTs is discarded. Clusters which pass these criteria are called anchored clusters,
since their 3' end is presumed to be known.
The drawbacks of this method match the benefits of the strict matching technique in that
chimerism and other artefacts can join unrelated clusters, and furthermore the diversity within
a grouping can make the generation of a consensus sequence complex (Schuler et al, 1996),
often resulting in accepting only the longest representative of an index class as its consensus.
8.3 STACK and STACK_PACK
As described in Hide et al (1998) and in Miller et al (Genome Research, submitted),
STACK_PACK makes use of dbEST and clusters ESTs within tissue source categories.
STACK_PACK extends the application of the loose clustering approach to a global level,
defining index classes by the total number of (possibly disconnected) matching 6-base words
rather than by alignment to previously identified class members. The approach is robust with
respect to EST quality data and increases the capacity to link splice-related ESTs without
concern for non-overlapping regions. The related but loose clusters are subsequently
processed by strict assembly and analysis tools to identify, characterise and isolate any
sequence divergence. The clustering algorithm (d2_cluster) is discrete from the assembly tool
(PHRAP) and identifies ESTs that are greater than 96% identical over a window of 150
bases. PHRAP subsequently aligns them to provide an assembly for subsequent alignment
consensus processing. STACK uses all available raw sequence data during cluster generation
and uses CRAW (Burke et al, 1998) to generate consensi and to annotate polymorphic regions
and alternative splicing forms in the clusters. Contigproc then chooses the longest consensus
of highest quality for output. Clusters are viewed with VIZ to allow initial analytical
assessment.
8.4 Other datasystems
The Genexpress Index (Houlgatte et al, 1995) and The Merck Gene Index (Williamson et al,
1995) group sequences into clusters based on sequence overlap above a given alignment
threshold as UniGene does. The index was originally manufactured using Fasta comparisons
with 3' sequence only. No searchable index can be found on the Web for Merck Gene Index.
8.5 Strategies for keeping data 'current'
The high throughput nature of EST production ensures that the available databases grow
almost daily, and the value of a consolidated EST database is a direct function of how current
the represented dataset is. UniGene's use of full-length gene sequences to seed the initial
clusters and rejection of linking ESTs ensures that new EST data will only add to existing
clusters, while the stringent clustering approach of TIGR Human Gene Index similarly limits
the cluster joining effect new sequences may have. Purely loose clustering strategies such as
that employed by STACK are more vulnerable to cluster joining as database revisions are
released. In either case, production time requirements for the consolidated database have
resulted in the need for a dynamic "ADD" facility whereby new sequences can be added
without regenerating all clusters. Of note in this regard is that ADDs will perpetuate any
previous processing errors, while a full production run increases confidence in the final result.
Every STACK (tissue-based) generation requires the clustering of the entire GenBank EST
dataset. dbEST is increasing enormously with each GenBank release and this has impacted
database generation because the input data to the clustering step has exceeded 100 000
sequences per tissue subset. Crossmatch finds overlapping regions at the ends of
approximately 30% of D2-clusters. These newly formed clusters are confirmed by closer
examination of the sequence alignments.
We reduce the input data to D2 cluster using crossmatch. New ESTs are extracted from
GenBank after each bimonthly release. These ESTs are partitioned into tissue subdivisions.
The ESTs are initially compared to existing STACK consensi using crossmatch. The STACK
consensi that find matching ESTs are collapsed to their individual ESTs and combined with
their corresponding new ESTs. The ESTs are thus reduced by 20-60% prior to D2 cluster
processing. The approximately 20-40% of ESTs that do not find matching members using
crossmatch are fed into D2 cluster. The D2 clusters are renamed taking into account the
STACK-Ids that are present already. These new clusters together with the crossmatch
expanded clusters are assembled using PHRAP. The STACK clusters that were expanded using
crossmatch are removed from the alignment file for the previous STACK release. This
reduced file is appended to the new alignments for the crossmatch and D2 clusters. At this
stage, the alignments are fed into the STACK_PACK system.
9 Cluster Assembly and Processing
Multiple sequence alignment is a complex problem which has been tackled by many groups.
UniGene avoids it entirely by only listing the accession IDs of related sequences in the
database. A current standard for assembly of fragments from shotgun cloning of a single
gene is Phil Green's PHRAP, but even this cannot be expected to cope with the problems of
quality and sequence divergence observed in loose EST clusters. PHRAP will reject
sequences it considers to be "unrelated" and output them in separate files, regardless of the
findings of the clustering engine.
9.1 Processing alignments
In the real world of noisy data, no alignment engine performs as well in every case as the
human scientist's eye might be able to suggest; for high-volume, high-throughput applications
this problem can be compensated for by further processing of cluster assemblies, while
alignment editors such as Genetic Data Environment (GDE/SeqLab from GCG) can be
invaluable for improvement "by hand" of individual cases. Nonetheless, assembled EST
clusters provide confidence and quality information - even in the absence of original trace
data - from which high accuracy, extended length consensus sequences can be constructed for
the vast majority of base positions. Given the realities of loose EST clusters, we find it
necessary to post-process cluster assemblies with CRAW, a tool which groups related
sequences within an assembly and generates consensus sequences for each subset.
How is a consensus decision made? In the light of the significant divergence that can occur
between alternately spliced forms of a gene, majority consensus rule generation is
inappropriate. Rather, consensus sub-forms need to be clearly derived and extracted using the free
publicly available viewer VIZ9 and, on licensing, CRAWVIEW (figure 5).
From Burke et al 1998:
Figure 5 Alternate splicing and chimerism provide subsequence alignments that bear valuable
information. An alternate form of expressed sequence is detected in ovarian tumor library using
sub-consensus analysis.
10 Clone Linking
All ESTs generated from the same cDNA clone correspond to a single gene. Each EST
obtained from GenBank is searched for clone identification so that we can trace the
transcripts corresponding to the same gene. 87% of ESTs currently have documented clone
information. We utilise this information to extend the length of the cluster consensi by
joining clusters that have ESTs that share clone-IDs.
For a gene that is not yet fully sequenced, achievement of a representative consensus
sequence from clustered EST data requires the joining of the available 5' and 3' read consensi.
Given that the clone ID information is solely annotation based and may have namespace
overlaps depending on the data source(s), this step is best handled near the end of the
processing pipeline so that errors detected in the future can be repaired with a minimum of
re-processing. Furthermore, unless a specific 5'-3' pair can be identified as a seed for each gene
consensus, the procedure is transitive in nature and may lead to extensive clone-linked
networks whose biological significance remains to be explored. The basic algorithm for clone
linking used in STACK is:
form a queue consisting of an initial cluster
do {
for each EST with a clone ID, add any cluster containing an EST
with a matching clone ID to the queue
} until no new clusters are added
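The queue-based procedure above can be sketched as a breadth-first search. This is a minimal illustration, not the STACK code: the mapping from cluster name to the set of clone IDs carried by its ESTs, and the cluster and clone names used in the example, are hypothetical.

```python
from collections import deque
from typing import Dict, Set


def clone_linked_set(seed: str, cluster_clones: Dict[str, Set[str]]) -> Set[str]:
    """Return every cluster reachable from `seed` through shared clone IDs."""
    # Invert the mapping once: clone ID -> clusters containing an EST from it.
    clone_to_clusters: Dict[str, Set[str]] = {}
    for cluster, clones in cluster_clones.items():
        for clone in clones:
            clone_to_clusters.setdefault(clone, set()).add(cluster)

    linked = {seed}
    queue = deque([seed])  # the queue starts with the initial cluster
    while queue:           # stops when no new clusters are added
        cluster = queue.popleft()
        for clone in cluster_clones[cluster]:
            for other in clone_to_clusters[clone]:
                if other not in linked:
                    linked.add(other)
                    queue.append(other)
    return linked


clusters = {
    "c1": {"IMAGE:1", "IMAGE:2"},
    "c2": {"IMAGE:2", "IMAGE:3"},
    "c3": {"IMAGE:3"},
    "c4": {"IMAGE:9"},
}
print(sorted(clone_linked_set("c1", clusters)))
```

Because linking is transitive, c1 and c3 end up in the same set even though they share no clone ID directly, which is how extensive clone-linked networks can arise.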
When a closed set of clone-linked consensi has been identified, they may be ordered 5'-
unassigned-3' based on a majority rule from the EST annotations in each cluster. To form a
final consensus sequence in STACK, the non-redundant best (see Cluster Assembly and
Processing above) cluster consensi are joined by linker segments of 20 Ns. This choice was
made based on the word size employed by BLAST, so that alignment breaks would be
preferentially inserted at these linker regions.
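The ordering and joining step can be sketched as follows. It assumes each cluster's ESTs carry an end annotation of "5'", "3'" or "unassigned", that ties in the majority vote are broken arbitrarily, and that consensi are plain strings; the function and variable names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

ORDER = {"5'": 0, "unassigned": 1, "3'": 2}
LINKER = "N" * 20  # linker segment of 20 Ns between consensi


def majority_end(annotations: List[str]) -> str:
    """Most common end annotation among the ESTs of one cluster."""
    return Counter(annotations).most_common(1)[0][0]


def join_clone_linked(consensi: Dict[str, str], est_ends: Dict[str, List[str]]) -> str:
    """Order consensi 5' - unassigned - 3' by majority rule and join with linkers."""
    ordered = sorted(consensi, key=lambda c: ORDER[majority_end(est_ends[c])])
    return LINKER.join(consensi[c] for c in ordered)


consensi = {"a": "TTTT", "b": "ACGT"}
est_ends = {"a": ["3'", "3'", "5'"], "b": ["5'"]}
print(join_clone_linked(consensi, est_ends))
```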
11 A working clustering system
Why is clustering being performed? What is the primary output required? Designing a system
for clustering has to be tightly linked to the immediately desired outcome. Common
requirements include:
11.1 Expression counts
11.2 Consensus sequences
11.3 Alternate expression-form characterisation
11.4 SNP detection
11.5 Identification of genes expressed in the cluster project
11.6 Identification of genes specifically expressed in a chosen library or tissue.
Each of the above requires specific data manipulations. Direct clustering will yield expression
counts, assembly will yield consensi, alignment processing and viewing will yield alternate
splices and SNPs, and improved consensus length generation and consensus joining will yield
more correctly identified gene expression forms. In the design of STACK_PACK we have
tried to address each of these direct outputs.
12 Brief introduction to the STACK_PACK clustering system
In this section we will familiarise the user with the STACK_PACK structure and describe the
major components and processes in the light of the implementation issues we have described
above. The principal design we have chosen centres around a hierarchical clustering
approach. We initially chose this approach for CPU limiting reasons. We have found
subsequently that hierarchical clustering allows for flexible scaling with centralised or
distributed processing, viewing and analysis of consensus clusters at any defined level: from
original EST library/tissue cluster to whole index cluster view. The approach lends itself well
to adding of new ESTs at the appropriate level. We have found it particularly powerful in
analysis of alternate splice events specific to particular tissue or expression 'bins'. (With the
advent of index based and tree-based methods, faster initial clustering implementations will
allow for non-hierarchical approaches. Without a sophisticated, dynamic 'ADD'
implementation any approach loses effectiveness.)
We detail tools specific to the STACK_PACK process. Tools used by the system such as
PHRAP are described elsewhere.
12.1 STACK processing
12.2 Data bin strategy and masking
Data splitting: The initial step is to partition the input set into manageable subsets. This is a
function of available CPU time and machine capacity; currently up to 500,000 sequences may
be clustered if one has access to SGI/Cray resources. For SANBI's STACK database, we
perform initial partitioning on the basis of available bin (usually tissue) annotations for each
sequence, for a phase I process into bin consensi. Later the resulting consensus sequences can
be clustered and the final assemblies built into indices.
Masking: Cross-Match, XBLAST, comparison mask-databases. Input sequences are Fasta
format with headers that contain accession number of the EST and CLONEID.
Clustering: D2-Cluster (as described above) and Crossmatch.
Assembly: PHRAP (our assembler of choice), but any preferred assembler can be
successfully employed as long as the input data is Fasta format. Output data can be parsed
from PHRAP .ace files and also from .gde files for processing into CRAW which requires
.gde format.
Alignment analysis: CRAW processes the alignments for groups that share consensus and
processes them into quality/difference partitions.
The CRAW tool allows for selective assessment of alignments. It measures and displays
consistency. As a significant proportion of alignments resulting from assembled clusters can
contain non-optimal consensi and 'junk' as well as chimeras, alternate splices and
polymorphisms, it is important to perform directed analysis of cluster alignments.
CRAW consistency
A group of more than two sequences is consistent if and only if every sequence in the group is
pairwise-consistent with the consensus sequence derived for the group. If several unrelated
sequences result in a poor quality alignment, a simple majority consensus generation rule
might sample the sequences such that a consensus is generated that is consistent with each
individual sequence in the group even when some sequence pairs or subsets contain a high
degree of mismatching. To prevent this, the consensus sequences are generated by an early
bias weighted method. We partition a group of sequences into sub-groups such that every
sub-group is consistent. Note that we must look for a maximal sub-grouping because the
trivial solution of breaking a group of N sequences into N singleton sub-groups is consistent
by our definitions. Thus, we find an optimal partition such that as many sequences as
possible are assigned to sub-groups. In our method, a sequence group is input and multiple
alignments and consensus sequences are output. The amount of information that can be lost
to artifact or alternate gene forms is bounded by the fact that no sequence contains a window
of length W of less than (100*SIM) percent identity with the consensus. Regions of internal
sequence are found by looking for regions in the alignment where windows of gapped region
size (G_R_S) contain over ceiling(GAP_PERCENT*G_R_S) gaps. An analysis in the paper
describing CRAW used parameters: G_R_S = 15 and GAP_PERCENT = 0.9 (Burke et al,
1998).
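The gapped-region test can be sketched as a sliding window over one aligned row. This is an illustration of the stated rule only, not CRAW itself: it assumes gaps are written as "-", reads "over" as strictly greater than, and merges overlapping windows into half-open spans; the function name is hypothetical.

```python
import math
from typing import List, Tuple


def gapped_regions(
    row: str, g_r_s: int = 15, gap_percent: float = 0.9
) -> List[Tuple[int, int]]:
    """Return merged [start, end) spans whose windows exceed the gap threshold."""
    threshold = math.ceil(gap_percent * g_r_s)  # 14 for the published parameters
    spans: List[Tuple[int, int]] = []
    for start in range(len(row) - g_r_s + 1):
        if row[start:start + g_r_s].count("-") > threshold:
            if spans and start <= spans[-1][1]:
                # Window overlaps or abuts the previous span: extend it.
                spans[-1] = (spans[-1][0], start + g_r_s)
            else:
                spans.append((start, start + g_r_s))
    return spans


row = "ACGT" + "-" * 20 + "ACGT"
print(gapped_regions(row))
```

With G_R_S = 15 and GAP_PERCENT = 0.9 the threshold is 14, so under this reading only windows consisting entirely of gap characters are flagged.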
12.2.1 Detecting and viewing alternate splicing
ALIGNMENT CONTAINS INCONSISTENCY: Strong Secondary Consensus Found.
One position equals 20 bases.
X if more than 2 bases ( 10 percent) disagree with consensus sequences.
N if more than 2 positions are unknown.
"-" if more than 14 positions are gap characters.
0 200 400 600 800 1000 1141
| | | | | | |
------------------1112226666666666666666----------------- 6 AA205280 (brain)
------------------1112226-------------------------------- 6 AA205467 (brain)
------------------1112226666666666666666----------------- 6 cons. for 6
-111111111111NNNNN44444444------------------------------- 4 N62129
-1111111111144441---------------------------------------- 4 AI262897
-1111111111114441N44444444------------------------------- 4 cons. for 4
---------------11111111111111313333333------------------- 3 AA476710 (repro)
-----------------311111111111313333333333333------------- 3 R.C.AA902555(hemat)NCI Kid3
-----------------13111111111131333333333333333----------- 3 R.C.AA128258(repro)
----------------------111111131333333333333333----------- 3 R.C.N94727
-------------------------11113133333333333333333--------- 3 AA129065
------------------------------1333333333333333----------- 3 R.C.AA243505
-------------------------------133333333333333----------- 3 R.C.AA127007(repro)
---------------------------------3333333333333----------- 3 R.C.N43925
-----------------------------------------33333----------- 3 AA442275 (repro)
---------------111111111111113133333333333333333--------- 3 cons. for 3
-1111111111111111111122---------------------------------- 2 AA262073
-11111111111111111111222--------------------------------- 2 AA831176
-11111111111111111111222222222--------------------------- 2 AA810752(hemato)NCI_GCB1
-11111111111111111111222222222--------------------------- 2 cons. for 2
NN11111-------------------------------------------------- 1 AA199678
11111111111111111111------------------------------------- 1 AA768208
-11111111111--------------------------------------------- 1 AA126628
-1111111111111111111111111111---------------------------- 1 N63563
-111111111111111111111----------------------------------- 1 AA730260
-111111111111111111111----------------------------------- 1 AA872419(hemato) tumors
-11111111111111111111111111111111------------------------ 1 AA128314
-1111111111111111---------------------------------------- 1 AA732483
-11111111111111111111------------------------------------ 1 AA993331
--111111111111111111111111------------------------------- 1 N34852
---11111111111111111------------------------------------- 1 AA322575
------------1111111111----------------------------------- 1 AI267543
111111111111111111111111111111111------------------------ 1 cons. for 1
-1111111111111111111155---------------------------------- 0 AA743074(hemato)NCI GCB1
Figure 6 CRAW analysis of a STACK whole body index cluster that shows similarity to
a genomic clone AC004106. Tissue origins are shown in brackets and clone origin is
underlined. Primary consensus (11111) shows significant similarity to clathrin coat
adaptor complex sigma1B protein. The secondary consensi (2222, 333333, 666666) show
significant similarity to two exons 34000-36000 and 54100-54200 (putative alternate
splice regions in genomic clone AC004106). Three ESTs (AA743074, AA205280 and
AA128258) (bold) were reported to represent putative alternate splice transcripts on
genomic clone AC004106 (Bouck et al., 1999). This CRAW analysis would suggest that
there are more ESTs, in addition to the three reported by Bouck, that sample the
putative alternate transcripts for the clathrin coat complex.
Consensus processing: CONTIGPROC
Once consensus sequence(s) have been generated by CRAW, it is necessary to attempt to
choose the most appropriate consensus according to gene isoform.
CONTIGPROC independently partitions the aligned sequences amongst the CRAW consensi,
then ranks the consensi according to number of assigned sequences and number of called
bases. Contigproc reads a cluster assembly and the associated consensus sequences generated
by CRAW, then assigns each EST in the assembly to its 'best consensus' - the consensus that
has the most contributing ESTs. A round of elimination follows, in which consensi
representing only a single (or even no) EST are removed, then the consensi are ranked
according to the number of ESTs assigned to each (ties are broken by consensus sequence
length considering only clear ATCG base calls). The remaining consensi are logged with the
best consensus in the GIO and GDE file formats which support representation of sequence
alignment data. Finally, the internal cluster representation is output in the supported file
formats (GIO-NCGR, STACK-FASTA, SANIGENE-FASTA, STACK-GDE, SANIGENE-GDE).
The 5' or 3' orientation of each cluster is determined by a vote of the individual EST
annotations, and all output consensi are arranged to read 5' to 3'. Low quality consensus
regions, defined as two N's followed by at least thirteen IUPAC codes with four or fewer clear
A, T, C, or G calls, are replaced by a single run of 10 Ns.
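The elimination and ranking rounds described for CONTIGPROC can be sketched as below. This covers only the ranking rule, not EST assignment, orientation voting or low-quality masking; the data layout (consensus name to sequence, consensus name to assigned EST names) and all names are hypothetical.

```python
from typing import Dict, List


def clear_length(consensus: str) -> int:
    """Length counting only clear A, T, C or G base calls."""
    return sum(base in "ACGT" for base in consensus.upper())


def rank_consensi(consensi: Dict[str, str], assigned: Dict[str, List[str]]) -> List[str]:
    """Drop consensi with one or no ESTs, then rank by EST count and clear length."""
    kept = [name for name in consensi if len(assigned.get(name, [])) > 1]
    return sorted(
        kept,
        # More ESTs first; ties broken by longer clear length, then by name.
        key=lambda n: (-len(assigned[n]), -clear_length(consensi[n]), n),
    )


consensi = {"c1": "ACGTNNAC", "c2": "ACGTACGT", "c3": "ACGT"}
assigned = {"c1": ["e1", "e2"], "c2": ["e3", "e4"], "c3": ["e5"]}
print(rank_consensi(consensi, assigned))
```

The first name in the returned list plays the role of the best consensus; the final tie-break on name is an addition here so that the ordering is deterministic.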
Clone-linking:
More than 1 end may exist for each clone ID, and rather than tracing from EST1 -> clone ID
-> EST2, we simply form groups of ESTs for each clone ID and then check for any links
between these sets for extended clone link networks.
Output system
GDE format is appropriate for the presentation of cluster assembly data, but does not support
significant regions of non-overlapping sequences as in the case of the clone-linked data. Such
features are well supported by GSDB's GIO format, but this is not widely accepted by
software in the sequence processing field. FASTA format sequence data is ubiquitously
accepted, but captures only a single sequence in each record. As a result, SANBI supplies
appropriate data in all three formats, and further distinguishes clusters which are/are not
joined together in the clone linking phase for the GIO and FASTA output sets.
Data Structures
12.2.1.1 Cluster assembly for maximum consensus quality
Further processing of cluster assemblies (see CONTIGPROC/CRAW below) ensures high
throughput consistency of clustering alignment, while alignment editors such as GDE can be
invaluable for improvement "by hand" of individual cases. Nonetheless, assembled EST
clusters provide confidence and quality information - even in the absence of original trace
data - from which high accuracy, extended length consensus sequences can be constructed for
the vast majority of base positions.
Output restriction for generation of higher quality sequence:
SANIGENE dataset generation strategies
The exact SANIGENE membership criterion is:
At each base position of the consensus sequence
A = number of EST bases that agree with consensus
B = number of EST bases that disagree with consensus
(A-B) >= 2
Only ESTs assigned to the top CRAW consensus are considered, i.e. hopefully only the "good
sequences" from a cluster are in this set to begin with. (For any STACK or SANIGENE
cluster we mainly only work with the top CRAW consensus and the ESTs assigned to it, i.e.
CRAW subconsensi and their ESTs are dumped off to the GDE files.)
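The per-position criterion can be checked column by column. The sketch assumes the ESTs are already aligned to the top consensus as equal-length strings and that "-" marks a column an EST does not cover (so it counts neither as agreeing nor as disagreeing); the function name is hypothetical.

```python
from typing import List


def sanigene_mask(consensus: str, aligned_ests: List[str]) -> List[bool]:
    """Per consensus column, True when (A - B) >= 2."""
    mask = []
    for col, cons_base in enumerate(consensus):
        bases = [est[col] for est in aligned_ests if est[col] != "-"]
        agree = sum(base == cons_base for base in bases)   # A
        disagree = len(bases) - agree                       # B
        mask.append(agree - disagree >= 2)
    return mask


consensus = "ACGT"
ests = ["ACGT", "ACG-", "AT--"]
print(sanigene_mask(consensus, ests))
```

In the example the second column fails because one of its three EST bases disagrees (A - B = 1), and the last column fails because it is covered by a single read.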
Generation of high confidence consensi
By restricting the final output sequence to only those regions represented by at least two
reads, as in the SANBI SANIGENE dataset, high confidence consensi can be generated with
a longer length (due to the rejection of clusters with small numbers of ESTs) in the average
case.
Rejection strategies.
To avoid redundancy in the final sets for BLAST searching, only the best consensus from
CRAW processing is used in the FASTA data sets (see Cluster Assembly and Processing
above).
The simple rule of thumb is that ESTs contain regions of very low quality, but the savvy
bioinformaticist can make up for this deficiency and even improve confidence overall by
studying the alignment of related sequences. In the real world of noisy data, no alignment
engine performs as well in every case as the human scientist's eye might be able to suggest;
for high-volume, high-throughput applications this problem can be compensated for by
further processing of cluster assemblies (see CONTIGPROC/CRAW below), while alignment
editors such as GDE can be invaluable for improvement "by hand" of individual cases.
Nonetheless, assembled EST clusters provide confidence and quality information - even in
the absence of original trace data - from which high accuracy, extended length consensus
sequences can be constructed for the vast majority of base positions. By restricting the final
output sequence to only those regions represented by at least two reads, as in the SANBI
SANIGENE dataset, high confidence consensi can be generated with a longer length (due to
the rejection of clusters with small numbers of ESTs) in the average case. The judicious
definition of linker segments to join regions of high-confidence, multiple read sequence data
in the final consensus sequence can actually improve results for BLAST searches, by allowing
the natural separation of High-scoring Segment Pairs (HSPs) in the search algorithm. This
result is over and above the improvement obtained by eliminating regions of low quality reads
and other production errors. Even when utilising an assembly engine prone to mis-alignment
and error when confronted with the problems commonly observed in EST data, cluster
assembly provides a net benefit in maximising consensus sequence quality: the cleanest
regions will be identified and highlighted, while the remainder are isolated in preparation for
appropriate strategies for cluster alignment processing.
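The restriction to regions covered by at least two reads, with linker segments joining the retained regions, can be sketched as follows. This is a minimal illustration only, not the STACK or CRAW implementation: the function name, the use of "-" to mark positions a read does not cover, the simple majority vote and the run of N characters used as a linker are all assumptions made for the example.

```python
from collections import Counter

GAP = "-"  # assumed marker for "this read does not cover this column"


def high_confidence_consensus(aligned_reads, min_reads=2, linker="N" * 10):
    """Majority-vote consensus over columns covered by >= min_reads reads.

    aligned_reads: equal-length strings from a multiple alignment.
    Columns with too little coverage are dropped; each dropped run lying
    between two retained regions is replaced by a single linker segment.
    """
    if not aligned_reads:
        return ""
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")

    segments = []  # completed high-confidence regions
    current = []   # bases of the region currently being extended
    for col in range(width):
        bases = [read[col] for read in aligned_reads if read[col] != GAP]
        if len(bases) >= min_reads:
            # Most frequent base in the column (hypothetical voting rule).
            current.append(Counter(bases).most_common(1)[0][0])
        elif current:
            # Coverage dropped below threshold: close the current region.
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return linker.join(segments)


# Hypothetical three-read alignment: only the two ends have double coverage.
reads = [
    "ACGT------TTGA",
    "ACGTAC----TTGA",
    "--------GG----",
]
consensus = high_confidence_consensus(reads, linker="NNN")
```

A cluster consisting of a single read yields an empty consensus under this rule, which corresponds to the rejection of clusters with small numbers of ESTs described above.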
12.2.2 Adding using STACK_PACK
See above.
12.2.2.1 Accession number management under dynamic addition
ESTs are passed against existing singletons and consensi. As ESTs are added to an existing
clustered database, it is likely that new sequences will join previously separate clusters. The
STACK strategy is to decompose the existing clusters into their constituent ESTs, then
reprocess these complete, new clusters from the assembly step onwards. In this process the
previous cluster IDs are removed from the database and new IDs are created. A problem
develops, however, in that existing sequence analysis and visualisation tools rely on accession
numbers remaining unchanged as the sequence databases grow (the original targets,
experimentally determined sequences, would not be expected to change significantly over
time). The only method we have identified to cope with this problem is the annotation of
"secondary accession" fields for each cluster, listing the older cluster IDs each has subsumed.
This strategy is mirrored in the system used with the IS databank consortium by the use of
accession versions. This is a new field where the first number is the never-changing accession
number, followed by a period and a version number. The version number starts at one, and
increases by one each time the sequence changes (i.e. is merged, where the largest cluster
keeps its primary accession and the smaller cluster's accession is subsumed as a secondary
accession). We remain able to extract clusters based on constituent accessioned ESTs.
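The accession scheme described above can be sketched as a small data structure. This is an illustrative sketch under stated assumptions, not the STACK database schema: the class and function names, the "STK" and "AA" identifiers, and the tie-breaking rule for clusters of equal size are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    accession: str   # never-changing primary accession
    ests: set        # accessions of the constituent ESTs
    version: int = 1 # starts at one, +1 each time the sequence changes
    secondary: list = field(default_factory=list)  # subsumed cluster IDs

    @property
    def versioned_id(self):
        # Accession, a period, then the version number.
        return f"{self.accession}.{self.version}"


def merge(a, b):
    """Merge two clusters; the larger keeps its primary accession.

    Assumption: on a tie in size, the first argument is kept.
    """
    keep, absorbed = (a, b) if len(a.ests) >= len(b.ests) else (b, a)
    return Cluster(
        accession=keep.accession,
        ests=keep.ests | absorbed.ests,
        version=keep.version + 1,
        secondary=keep.secondary + [absorbed.accession] + absorbed.secondary,
    )


def find_by_est(clusters, est_accession):
    """Extract the clusters that contain a given accessioned EST."""
    return [c for c in clusters if est_accession in c.ests]


# Hypothetical identifiers, for illustration only.
big = Cluster("STK0001", {"AA001", "AA002", "AA003"})
small = Cluster("STK0002", {"AA004"})
merged = merge(big, small)
```

After the merge, tools holding the older ID "STK0002" can still resolve it through the secondary accession list, while the version suffix signals that the consensus for "STK0001" has changed.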
Virtual protein manufacture: Currently, EST_SCAN
(http://www.ch.embnet.org/software/ESTScan.html), FRAMEFINDER (Guy Slater, personal
communication) and other tools are being evaluated for efficacy.
12.3 References in order of mention.
Vasmatzis, G., M. Essand, U. Brinkmann, B. Lee, and I. Pastan. 1998. Discovery of three
genes specifically expressed in human prostate by expressed sequence tag database analysis.
Proc. Natl. Acad. Sci. USA 95(1):300-304.
Wolfsberg, T.G. and D. Landsman. 1997. A comparison of expressed sequence tags (ESTs) to
human genomic sequences. Nucleic Acids Research 25(8):1626-1632.
Hillier, L., N. Clark, T. Dubuque, K. Elliston, M. Hawkins, M. Holman, M. Hultman, T.
Kucaba, M. Le, G. Lennon, M. Marra, J. Parsons, L. Rifkin, T. Rohlfing, M. Soares, F. Tan,
E. Trevaskis, R. Waterston, A. Williamson, P. Wohldmann, and R. Wilson. 1996. Generation
and Analysis of 280,000 Human Expressed Sequence Tags. Genome Research 6:807-828.
Aaronson, J.S., B. Eckman, R.A. Blevins, J.A. Borkowski, J. Myerson, S. Imran, and K.O.
Elliston. 1996. Toward the Development of a Gene Index to the Human Genome: An
Assessment of the Nature of High-throughput EST Sequence Data. Genome Research 6:829-845.
Adams, M.D., J.M. Kelley, J.D. Gocayne, M. Dubnick, M.H. Polymeropoulos, H. Xiao, C.R.
Merril, A. Wu, B. Olde, R.F. Moreno, A.R. Kerlavage, W.R. McCombie, and J.C. Venter.
1991. Complementary DNA Sequencing: Expressed Sequence Tags and Human Genome
Project. Science 252:1651-1656.
Bonaldo, M.F., G. Lennon, and M.B. Soares. 1996. Normalization and Subtraction: Two
Approaches to Facilitate Gene Discovery. Genome Research 6:791-806.
Alwen, J. 1990. United Kingdom human genome mapping project: development, components,
coordination and management, and international links of the project. Genomics 6:386-388.
Gill RW, Hodgman TC, Littler CB, Oxer MD, Montgomery DS, Taylor S, Sanseau P.
Advanced Technology & Informatics Unit, Glaxo-Wellcome Medicines Research Centre,
Stevenage, Herts, UK. rwg23944@ggr.co.uk
Torney, D.C., et al. 1990. Computation of d2: A Measure of Sequence Dissimilarity.
Computers and DNA, SFI Studies in the Sciences of Complexity, vol. VII, eds. G. Bell and T.
Marr, Addison-Wesley.
Hide, W., J. Burke, and D. Davison. 1994. Biological Evaluation of d2, an Algorithm for
High-performance Sequence Comparison. J. Comp. Bio. 1:199-215.
Wu, T.J., J.P. Burke, and D.B. Davison. 1997. A Measure of DNA Sequence Dissimilarity
Based on Mahalanobis Distance Between Frequencies of Words. Biometrics 53:1431-1439.
Burkhardt, S., A. Crauser, P. Ferragina, H-P. Lenhof, E. Rivals, and M. Vingron. 1999.
q-gram Based Database Searching Using a Suffix Array (QUASAR). Proceedings of the 3rd
International Conference on Computational Molecular Biology (RECOMB99), Lyon, France.
Strelets, V.B., A.A. Ptitsyn, L. Milanesi, and H.A. Lim. 1994. Data bank homology search
algorithm with linear computation complexity. Comp. Appl. Biosci. 10(3):319-322.
Sutton, G., O. White, M.D. Adams, and A.R. Kerlavage. 1995. TIGR Assembler: A New Tool
for Assembling Large Shotgun Sequencing Projects. Genome Science and Technology 1:9-18.
Burke, J.P., H. Wang, W. Hide, and D. Davison. 1998. Alternative Gene Form Discovery
and Candidate Gene Selection from Gene Indexing Projects. Genome Research 8:276-290.
Boguski, M.S. and G.D. Schuler. 1995. ESTablishing a Human Transcript Map. Nature
Genetics 10:369-371.
Benson D.A., M.S. Boguski, D.J. Lipman, and J. Ostell. 1994. GenBank. Nucleic Acids
Research 22: 3441-3444.
Hide, W., J. Burke, A. Christoffels, and R. Miller. 1997. A novel approach towards a
comprehensive consensus representation of the expressed human genome. In Genome
Informatics 1997, pp. 187-196. Satoru Miyano and Toshihisa Takagi, Eds. Universal Academy
Press Inc., Tokyo, Japan. ISSN 0919-9454.
Houlgatte, R., R. Mariage-Samson, S. Duprat, A. Tessier, S. Bentolila, B. Lamy, and C. Auffray.
1995. The GenExpress Index: A Resource for Gene Discovery and the Genic Map of the
Human Genome. Genome Research 5:272-304.
Williamson et al. 1995. Merck Gene Index
Green 1996, PHRAP
Bouck et al 1999 Trends in Genetics 15 (4): 159-162
12.4 Appendix
STACK Schema.
UniGene WHALE tool: http://www.ncbi.nlm.nih.gov/Sicotte/whale.html
UniGene and d2_cluster comparisons: Fragmentation as a result of stricter clustering.
Clustering with d2_cluster shows a significant improvement over UniGene.
d2_cluster and UniGene produce results that are between 83% and 90% identical, and
d2_cluster results are between 8% and 20% less fragmented than UniGene. The above figure
demonstrates a cluster that has been fragmented by UniGene and is kept together by
d2_cluster.
d2_cluster measurably decreases the rate of fragmentation, producing around 20% fewer
singleton sequences (6463 to 5198) and reducing the overall number of clusters by 10%
(15225 to 13756). Generally the numbers of smaller clusters are reduced while larger clusters
appear with slightly greater frequency (see table: Burke, Davison, Hide, submitted, Genome
Research).
Table of Cluster Sizes

Cluster size          UniGene (Rat-build 19)    d2_cluster
Singleton clusters    6463                      5189
2                     3496                      3298
3-4                   3002                      2971
5-8                   1635                      1602
9-16                  491                       531
17-32                 111                       127
33-64                 21                        27
65-128                3                         8
129-256               3                         2
257-512               1                         0
Total Clusters        15225                     13756
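
The percentage reductions quoted for the comparison can be checked with a short calculation. The sketch below uses the singleton and total-cluster counts as printed in the Table of Cluster Sizes; the function name is an assumption made for the example.

```python
def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, as a percentage."""
    return 100.0 * (before - after) / before


# Counts as printed in the table: UniGene (Rat-build 19) vs d2_cluster.
singletons = percent_reduction(6463, 5189)
total_clusters = percent_reduction(15225, 13756)

print(f"singleton clusters reduced by {singletons:.1f}%")
print(f"total clusters reduced by {total_clusters:.1f}%")
```

Both values are consistent with the rounded figures of around 20% and 10% given in the text.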
EST links
STACK tools:
www.sanbi.ac.za/stack
www.sanbi.ac.za/stack_pack
* ICA Tools: http://sunny.ebi.ac.uk/ebi/EST/
* basic explanation of clustering: http://sunny.ebi.ac.uk/~jparsons/work/est_assembly/est_assembly_intro.html
* clustering rat sequences (good overview): http://goliath.ifrc.mcw.edu/EST/contents.html
* the UniGene database, probably the most well-known EST database: http://www.ncbi.nlm.nih.gov/UniGene/index.html
* starting information when creating STACK: http://www.ncbi.nlm.nih.gov/dbEST/index.html
* the TIGR THC database, another view of the consensus world: http://www.tigr.org/tdb/hgi/hgi.html
* other EST links: http://industry.ebi.ac.uk/~muilu/EST/EST_links.html
* integrate EST information: http://www.ncbi.nlm.nih.gov/LocusLink/index.html
Websites referenced:
1 http://genome.wustl.edu/est/esthmpg.html
2 http://hercules.tigem.it/ASSEMBLY/capdoc.html
3 http://www.girinst.org/
4 ftp://ncbi.nlm.nih.gov/repository/vector/
5
6
7 http://sunny.ebi.ac.uk/EST/
8 http://www.ncbi.nlm.nih.gov/LocusLink/
9 ftp.sanbi.ac.za/stack/software/viz.tar