EST CLUSTERING TUTORIAL

ISMB 1999

Presenting Tutors:

Win Hide and Alan Christoffels

Authors:

Win Hide, Rob Miller, Andrey Ptitsyn, Janet Kelso, Chellapa Gopallakrishnan and Alan Christoffels



1 Introduction

The tutorial provides an introductory framework for understanding the concepts and underlying reasoning of Expressed Sequence Tag (EST) clustering and consensus generation and, in addition, aims to enable readers and attendees to design and implement an EST consolidation system. It is based on what we have learned over the past three years about clustering technologies and the issues surrounding the field. The tutorial includes examples drawn from the authors' experience, which consists mostly of STACK, STACK_PACK and the algorithms used in these systems. The methods employed by these systems will be the focus of the demonstration and review session in the second half of the tutorial. Many of the tasks we discuss can also be performed by other systems, and we will attempt to compare system strategies as objectively as possible.

Questions on the material or on the implementation of clustering systems can be directed to the authors:

Win Hide : [email protected] www.sanbi.ac.za/stack

Alan Christoffels: [email protected]



1 Introduction
2 The need for EST clustering
3 General Aspects of EST data
  3.1 Generation of ESTs
    3.1.1 What is an EST?
  3.2 EST data quality
4 Overview of clustering and consensus generation
  4.1 What is an EST cluster?
  4.2 Loose and stringent clustering
  4.3 Data apprehension and input format
  4.4 Pre-processing
  4.5 Initial clustering
  4.6 Assembly
  4.7 Alignment processing
  4.8 Cluster joining
    4.8.1 Clone joining
    4.8.2 Available parents
  4.9 Output
5 Summary of Introduction
6 Implementation Strategies
  6.1 EST pre-processing strategies
    6.1.1 Screening out repeats and vector sequence
  6.2 Masking strategies
7 EST Clustering methods
  7.1 Clustering and Statistical cluster analysis
  7.2 Using common search engines
    7.2.1 Alignment scoring methods: BLAST and FASTA
  7.3 Purpose-built alignment based clustering methods
  7.4 Non-alignment based scoring methods: D2-cluster
  7.5 Pre-indexing methods
8 Systems and Algorithms
  8.1 TIGR_ASSEMBLER
  8.2 UniGene
  8.3 STACK and STACK_PACK
  8.4 Other datasystems
  8.5 Strategies for keeping data 'current'
9 Cluster Assembly and Processing
  9.1 Processing alignments
10 Clone Linking
11 A working clustering system
  11.1 Expression counts
  11.2 Consensus sequences
  11.3 Alternate expression-form characterisation
  11.4 SNP detection
  11.5 Identification of genes expressed in the cluster project
  11.6 Identification of genes specifically expressed in a chosen library or tissue
12 Brief introduction to the STACK_PACK clustering system
  12.1 STACK processing
  12.2 Data bin strategy and masking
    12.2.1 Detecting and viewing alternate splicing
    12.2.2 Adding using STACK_PACK
  12.3 References in order of mention
  12.4 Appendix



2 The need for EST clustering

With easy access to the technology for generating expressed sequence tags (ESTs), several groups have sequenced from thousands to several hundred thousand ESTs. Currently the majority of the coding portion of the human genome is available only in the form of ESTs, and the need to discover the full-length cDNA of each human gene is frustrated by the partial nature of this data. There is significant value in attempting to consolidate gene sequences as they are produced, in lieu of a yet-to-be-completed reference sequence. ESTs offer a rapid and inexpensive route to gene discovery, reveal expression and regulation data (Vasmatis et al, 1998), highlight gene sequence diversity and splicing (Wolfberg and Landsman, 1997), and may identify more than half of known human genes (Hillier et al, 1996). The price of the high-volume, high-throughput nature of the data, however, is that ESTs contain high error rates (Aaronson et al, 1996), do not have a defined protein product, are not curated in a highly annotated form and present only a raw substrate for sequence matching. Unfortunately, most EST data remain unprocessed and thus do not provide the high-value consensus sequence information that they contain. The low quality sequence data can be much improved, and achieving quality information requires pre-processing, clustering and post-processing of the results. One goal of these projects is the construction of gene indices, in which all transcripts are partitioned into index classes such that transcripts are put into the same index class if and only if they represent the same gene or gene isoform. Accurate gene indexing facilitates gene expression studies as well as inexpensive and early gene sequence discovery through the assembly of ESTs derived from genes that have yet to be positionally cloned or obtained directly through genomic sequencing.

3 General Aspects of EST data

3.1 Generation of ESTs

3.1.1 What is an EST?

Adams et al (1991) published a widely read article describing the use of ESTs. An EST, or Expressed Sequence Tag, is a small portion of an entire gene: a fragment of a cDNA clone that has been sequenced. The process by which ESTs are manufactured requires the construction of a cDNA library from mRNA. Baldo et al (1996) have provided a detailed description of how libraries are constructed and how normalization and library subtraction can be used to increase the relative representation of less abundantly transcribed mRNAs. The reverse transcriptase used to manufacture each cDNA in the library will eventually fall off the template (Figure 1), and this terminates the production of the cDNA. Thus a series of 3'-delimited cDNA fragments of differing lengths may be produced for each mRNA that is a viable template in the library. The length of the cDNA will vary, and this is an important factor in developing coverage for each mRNA template of an available gene. Usually, several hundred to several thousand clones are isolated at random from a given cDNA library. Clones are sequenced a single time, from one or both ends of the DNA insert, using universal primers which are complementary to the vector at the multiple cloning site. The M13 forward primer may be located near the 5' or the 3' end of the cloned insert, depending on how the inserts were directionally cloned. Only 300-500 readable bases are produced from each sequencing read, and yet a full gene transcript may be several thousand bases long. ESTs thus provide a "tag level" association with an expressed gene sequence, trading quality and total sequence length for the high quantity of genes which can be tagged in a given amount of time.



Figure 1 Manufacture of an EST

3.2 EST data quality

Generation of EST data results in 'low quality' sequence information. A single read is generated for each EST, and as such it will contain errors introduced at each step of its generation. These can include errors in clone orientation and associated clone ID, chimeras, and missing 3' or 5' reads. Because the data are single-pass, unedited sequences, they are also subject to errors caused by compressions and basecalling problems, which result in frameshifts. The Washington University website details common aspects of EST error [1]. EST sequence has regions of high quality very close to regions of low quality, where quality can be defined as the number of correctly sequenced bases within a known window of reference. It is possible to utilise poor quality sequence as long as relevant strategies for maximising its utility are adopted.
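The window-based definition of quality can be illustrated with a small Python sketch. It assumes, purely for illustration, that the read has already been aligned without gaps to a trusted reference of the same length; the function name `window_quality` and the window size are hypothetical and are not taken from any of the systems discussed here.

```python
def window_quality(read, reference, window=50):
    """Count correctly sequenced bases in each sliding window.

    Assumption: `read` and `reference` are pre-aligned, gap-free and of
    equal length. Ambiguous calls such as 'N' never count as correct.
    """
    if len(read) != len(reference):
        raise ValueError("read and reference must have equal length")
    correct = [int(r == t and r in "ACGT")
               for r, t in zip(read.upper(), reference.upper())]
    if len(correct) < window:
        return []
    # Running sum: add the base entering the window, drop the one leaving.
    total = sum(correct[:window])
    scores = [total]
    for i in range(window, len(correct)):
        total += correct[i] - correct[i - window]
        scores.append(total)
    return scores
```

Neighbouring windows with sharply different scores correspond to the adjacent high and low quality regions described above.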

4 Overview of clustering and consensus generation

EST clustering is performed as a process that utilises 'clustering information' of progressively less definitive kinds. Initially, sequence identity provides a good guide to cluster membership. Shared annotation provides joining information that can be of more variable quality. Thus the number of accurately clustered ESTs depends heavily on a strategy that assigns cluster membership based on verifiable criteria; sequence identity is currently the most useful of these.

Clustering can be performed with or without consensus sequence generation. It is preferable, although more difficult, to manufacture a consensus sequence from each cluster. The clustering overview will briefly describe the processes that result in consensus sequence generation.

4.1 What is an EST cluster?

A cluster is fragmented EST data (DNA or protein) and, if known, gene sequence data, consolidated, placed in correct context and indexed by gene such that all expressed data concerning a single gene is in a single index class, and each index class contains the information for only one gene (Burke, Davison, Hide, submitted, Genome Research).

4.2 Loose and stringent clustering

ESTs by their nature contain a degree of erroneous sequence data, complicated by short length and some mis-annotation. Stringent one-pass assembly methods tend to result in fewer, shorter consensus sequences. Looser clustering systems result in larger, more 'sloppy' clusters, with various expressed forms represented within each cluster. Each approach has its advantages and disadvantages. Stringent clustering provides greater initial fidelity, at a cost of lower coverage of expressed gene data and a lower inclusion rate of expressed gene forms. Loose clustering provides greater coverage and greater inclusion of alternate expressed forms, at a cost of lower fidelity data and the possible inclusion of paralogous expressed genes. We detail here a loose strategy, with comparison to the other strategies described below. Stringent clustering is performed by algorithms such as TIGR_ASSEMBLER, and looser clustering by the processes employed in UniGene and STACK manufacture. The latter methods tend to be broken down into several sub-processes, which gives more flexibility but also makes them more complicated.

4.3 Data apprehension and input format

Data sources for clustering can be in-house, proprietary, public database or a hybrid of these. It is therefore important to establish an INPUT DATA FORMAT that is consistent between these sources. The format needs to include the accession number of a sequence (or a temporary accession, such as a sequence-run ID, if not yet submitted), information on location with respect to the poly A tail (3' or 5') and, importantly, the CLONE ID from which the EST was generated. SANBI uses FASTA format input files that contain the above information in the header.
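As a rough illustration of such a consistent input format, the Python sketch below parses FASTA records whose headers carry the three required items. The header layout used here (`>ACCESSION clone=ID end=3|5`) is a hypothetical convention invented for this example; the actual SANBI header layout is not specified in this tutorial.

```python
from dataclasses import dataclass


@dataclass
class EstRecord:
    accession: str
    clone_id: str
    end: str        # "3" or "5", relative to the poly A tail
    sequence: str


def _make_record(header, chunks):
    # Hypothetical layout: ">ACCESSION clone=CLONE_ID end=5"
    fields = header.split()
    if not fields:
        raise ValueError("empty FASTA header")
    tags = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    if "clone" not in tags or tags.get("end") not in ("3", "5"):
        raise ValueError("header lacks clone ID or 3'/5' end: " + header)
    return EstRecord(fields[0], tags["clone"], tags["end"], "".join(chunks))


def parse_est_fasta(text):
    """Parse FASTA text into EstRecord objects, rejecting incomplete headers."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records
```

Rejecting records that lack a clone ID or read direction at this stage keeps the later clone-joining step from silently working on incomplete annotation.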

The steps suggested in EST clustering are as follows:

4.4 Pre-processing

Sequences are masked for repeats and vector, and formatted for the clustering engine. Sequence quality is often assessed at this step: a sequence is accepted if it has a minimum number of residues above a known quality threshold. At SANBI, all masked sequence data above 50bp in length is accepted for clustering. SANBI's input acceptance approach has a low threshold stringency, as the clustering method(s) can work successfully with more erroneous data. NCBI discards ESTs with a window of less than 100bp of 'clean' data.
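The two acceptance rules mentioned above can be sketched as a simple filter. This is only one possible reading of the text: it assumes that "length" means the number of unmasked residues and that a 'clean' window is an uninterrupted run of unambiguous bases. The function names and defaults are hypothetical.

```python
import re

MASK_CHARS = "NXnx"  # mask residues written by the masking step


def unmasked_length(seq):
    """Number of residues that are not mask characters."""
    return sum(1 for base in seq if base not in MASK_CHARS)


def longest_clean_window(seq):
    """Length of the longest uninterrupted run of A, C, G or T."""
    runs = re.findall(r"[ACGTacgt]+", seq)
    return max((len(run) for run in runs), default=0)


def accept_read(seq, min_unmasked=50, min_clean_window=None):
    """Low-stringency rule by default; pass min_clean_window=100
    to mimic a stricter clean-window rule."""
    if unmasked_length(seq) < min_unmasked:
        return False
    if min_clean_window is not None:
        return longest_clean_window(seq) >= min_clean_window
    return True
```

A read with 160 unmasked bases split by a single ambiguous call passes the first rule but fails a 100bp clean-window rule, which shows how much the choice of threshold affects the input set.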

4.5 Initial clustering

An initial clustering is performed based on a measure of high sequence identity.

4.6 Assembly

Assembly is either part of the initial clustering (as in TIGR_ASSEMBLER) or is separated into clustering followed by assembly performed by a specialist assembly package such as PHRAP (with assessment of residue quality turned off) or CAP2/3 [2].

4.7 Alignment processing

Aligned clusters, particularly those generated by a loose clustering engine, need to be processed for errors and alternate forms of expressed sequences. Consensus generation may be a result of this step (as in STACK), or a consensus can be accepted directly from the assembly step. Consensi are chosen based on maximal length. See section 9.1.

4.8 Cluster joining

Once clustered, clusters and/or cluster consensi can be further joined by available annotation-based approaches.

4.8.1 Clone joining

The most powerful cluster joining method is clone-joining, which utilises the clone ID physically shared between 3' and 5' EST fragments sequenced from the same starting clone. A different approach would be to link clone-related sequences in the pre-processing phase, but this may increase errors and processing time requirements at the clustering and assembly stages. By either approach, linking by clone annotation is an error-prone step in the EST consolidation process, as it relies entirely on the accuracy of the sequence annotation and, if data from disparate sources is to be used, on the uniqueness of clone IDs.
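Clone joining amounts to merging any clusters that contain ESTs annotated with the same clone ID. The Python sketch below shows one way to do this with a union-find structure; it is an illustration under the assumption that clone IDs are unique across sources, and the function name and input layout are hypothetical rather than part of STACK.

```python
def join_clusters_by_clone(cluster_clones):
    """Group cluster IDs that share at least one clone ID.

    cluster_clones: dict mapping cluster ID -> iterable of clone IDs
    found among that cluster's ESTs. Returns a sorted list of sorted
    groups of cluster IDs.
    """
    parent = {cid: cid for cid in cluster_clones}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # clone ID -> first cluster in which it appeared
    for cid, clones in cluster_clones.items():
        for clone in clones:
            if clone in first_seen:
                root_a, root_b = find(cid), find(first_seen[clone])
                if root_a != root_b:
                    parent[root_a] = root_b
            else:
                first_seen[clone] = cid

    groups = {}
    for cid in cluster_clones:
        groups.setdefault(find(cid), set()).add(cid)
    return sorted(sorted(group) for group in groups.values())
```

Because a single mis-annotated clone ID is enough to merge two unrelated clusters, the joined groups are best kept as links between clusters rather than used to overwrite the sequence-based clustering.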

4.8.2 Available parents

If a parent mRNA sequence is available (non-EST) it can be used to physically link EST

cluster(s) via sequence comparison.



4.9 Output

Figure 2 Cartoon of clustering steps

Defining an output format for the clustering process is problematic. The information required often includes alignments (alternate splices, polymorphisms and error assessment), raw cluster membership, and contextual links. Nonetheless, results must be easily incorporated into existing software packages, which in general have not been designed to support the complexity or evolving nature of clustered EST data. For STACK, individual cluster alignments (including consensi and sub-consensi) are presented in an alignment format (GDE), while consensus sequences and clone-linked sets are stored in FASTA format for use as BLAST-searchable databases. A distributable FASTA file is in development, as is a 'reduced' format for easy distribution. Development of data formats and systems for accessing information about EST clusters is an ongoing project at SANBI.

5 Summary of Introduction

In this section we have addressed the value of clustering ESTs for the elucidation of complete sequences not yet available using traditional methods. The nature and quality of EST data have been assessed with reference to the limitations that these place on sequence processing. An overview of EST clustering procedures has been presented, indicating the flow of EST sequence information through the stages of pre-processing, initial clustering, assembly, alignment processing, cluster joining and output. The following section will examine each of these stages in more detail and address some of the issues encountered.

6 Implementation Strategies

6.1 EST pre-processing strategies

6.1.1 Screening out repeats and vector sequence

Clustering requires that a certain degree of identity be preserved between the members of a defined cluster. The most common problem in EST clustering is contamination with a sequence that is common to several members of the input EST data set but not unique to a specific group. Common classes of contaminating sequence are:



6.1.1.1 Repeats

Any uncharacterised expressed sequence set is likely to contain repeat sequences. For instance, in humans the most common element is the Alu repeat element. In newly sequenced genomes, such as plant and other eukaryote systems, repeat sequences represent a common and frustrating clustering problem. Repeat databases provide a valuable resource against which repeats can be detected, but they depend on continuing curation and on the detection of novel repeats in genomes.

6.1.1.2 Cloning vector fragments

Vector sequences can skew clustering even if only a small vector fragment remains in each read.

6.1.1.3 Low complexity sequences

Low complexity sequences (poly A tracts, AT repeats etc) also have the potential to provide an artifactual basis for cluster membership. The problem is more significant for strategies that employ alignable similarity in the first-pass cluster assignment. Word-based cluster assignment can be modified to give low weight to low complexity words. In recent EST clustering applications being developed at SANBI (d2-cluster, ASSA) and elsewhere, there is no need to mask low-complexity DNA, as such regions tend to have a highly redundant oligonucleotide composition. Because these sequence comparison algorithms scale oligonucleotides according to their potential information content, highly redundant oligos are given very low weight and low-complexity regions are therefore effectively excluded from consideration.
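The idea that redundant word composition carries little information can be illustrated with the Shannon entropy of the words in a window. The Python sketch below is a stand-in chosen for illustration only; it is not the weighting scheme used by d2-cluster or ASSA, which is not described in this tutorial, and the function names are hypothetical.

```python
import math
from collections import Counter


def word_counts(seq, k):
    """Multiplicity of every overlapping word of length k."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def word_entropy(seq, k=3):
    """Shannon entropy (bits) of the word composition of seq.

    Low values indicate a redundant, low-complexity composition.
    """
    counts = word_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A poly A tract scores zero and an AT repeat scores at most one bit for any word length, whereas mixed sequence scores much higher, so a comparison that scales words by such a measure gives little credit to matches within low-complexity regions.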

6.2 Masking strategies

The most effective method to remove contaminants is to compare each read against reference databases of repeats (RepBase [3]) and vector sequences (VecBase [4]) using an algorithm that is reasonably fast and accurate. XBLAST (NCBI tools) and Cross-match, an implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green [5], have been used successfully, with Cross-match demonstrating greater flexibility and sensitivity than XBLAST. Repeat-Masker [6] uses Cross-match as its masking tool. DUST is used at NCBI for this procedure.

In cases where a direct identity is found with a repeat or vector subsequence, a 'mask residue' can be substituted into the read. The resulting runs of NNNNs (DNA) or XXXXXs (protein) will be ignored by most clustering engines (Figure 3a and 3b).
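The substitution step itself is simple once the matching regions are known. The Python sketch below assumes that an upstream search has already reported the matches as 0-based, half-open (start, end) intervals; the function name and the interval convention are hypothetical and do not reflect the output format of XBLAST or Cross-match.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace each (start, end) interval of seq with the mask residue.

    Intervals are 0-based and half-open; overlapping intervals are allowed.
    """
    masked = list(seq)
    for start, end in intervals:
        if start < 0 or end > len(seq) or start > end:
            raise ValueError("interval out of range: %r" % ((start, end),))
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)
```

For example, `mask_intervals("ACGTACGTAC", [(0, 3), (7, 10)])` returns `"NNNTACGNNN"`; the read keeps its original length and coordinates.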

A problem arises when an EST library is presented that is from a novel organism for which

the repeats have not been characterised. In this instance it may be necessary to employ 'blind'

repeat masking if an algorithm is available. Repeat masking is necessary if the repeats are

large enough to represent a source for artifactual contamination.

An exploitable feature of sequence contamination in loose clustering is that, if the tools work as intended, the contaminated sequences and all related sequences will be clustered together. There is no automatic method to identify a contaminated cluster, but once one is identified, only that cluster needs to be decomposed into its original sequences and re-processed (no other cluster will be affected by the sequences in the contaminated cluster). This may not be the case for stringent clustering systems, where a more cautious repair strategy is required.

Figure 3a EST sequence in FASTA format.

>T27784 g609882 | T27784 CLONE_LIB: Human Endothelial cells. LEN: 337 b.p. FILE gbest3.seq 5-PRIME DEFN: EST16067 Homo sapiens cDNA 5' end

AAGACCCCCGTCTCTTTAAAAATATATATATTTTAAATATACTTAAATATATATTT

CTAATATCTTTAAATATATATATATATTTNAAAGACCAATTTATGGGAGANTTGC

ACACAGATGTGAAATGAATGTAATCTAATAGANGCCTAATCAGCCCACCATGTTC

TCCACTGAAAAATCCTCTTTCTTTGGGGTTTTTCTTTCTTTCTTTTTTGATTTTGCAC

TGGACGGTGACGTCAGCCATGTACAGGATCCACAGGGGTGGTGTCAAATGCTATT

GAAATTNTGTTGAATTGTATACTTTTTCACTTTTTGATAATTAACCATGTAAAAAATG



Figure 3b EST sequence read after masking with XBLAST or Cross-match against VecBase + RepBase

>T27784 g609882 | T27784 CLONE_LIB: Human Endothelial cells. LEN: 337 b.p. FILE gbest3.seq 5-PRIME DEFN: EST16067 Homo sapiens cDNA 5' end

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxTATTTNAAAGACCAATTTATGGGAGANTTGCACACAGATGTGAAATGAATG

TAATCTAATAGANGCCTAATCAGCCCACCATGTTCTCCACTGAAAAATCCTCTTTC

TTTGGGGTTTTTCTTTCTTTCTTTTTTGATTTTGCACTGGACGGTGACGTCAGCCAT

GTACAGGATCCACAGxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxAACCATGTAAAAAATG

7 EST Clustering methods

7.1 Clustering and Statistical cluster analysis

Clustering is not exactly classical cluster analysis as understood by statisticians. In cluster analysis, practically all statistical aspects concern the classification, and the justification of the classification, of objects with complicated relationships. This may be why we use the word "clustering" instead of cluster analysis. In the case of clustering, the fundamental problem is the selection of a metric, and here the metric is reduced to a simple binary digit, i.e. sequences either match or don't. To cluster EST data we look for near-identical matches. Essentially, the fragments we put together belong to the same DNA, and possible mismatches mostly result from misreads and represent noise. The inter-cluster distance, as well as the inter-object distance, is also reduced to binary: in an ideal situation, the clusters of ESTs related to one gene should be infinitely close to each other while those related to different genes are infinitely distant, and no intermediate states should make sense. Following the simplification of the distance measure, the problem of estimating clustering quality also reduces to a trivial task: the closer we get to detecting all exactly matching fragments, the better. Thus, from the statistical point of view, clustering ESTs is a trivialised version of cluster analysis. But that does not make the job any easier.

7.2 Using common search engines

Small projects, clustering a few dozen, a few hundred or even thousands of ESTs, are indeed 'trivial' and can be approached with standard tools for contig assembly or even multiple alignment. The real challenge is the amount of data now available in public and private databases and waiting to be clustered. With millions of ESTs to consider, obtaining the trivial binary distance between fragments is far from a trivial job, even for available supercomputers. Modern sequence comparison tools (the well-known Smith-Waterman, FASTA and BLAST) are mostly built for a different purpose: searching. They are all variations of an alignment algorithm, i.e. they find the positioning of sequence elements (nucleotides or groups of nucleotides) against each other that maximizes some score. The purpose of this process is to detect and measure quantitatively the similarity (distance) between any 2 sequences compared. Smith-Waterman is the most exhaustive and computationally expensive tool, offering the best sensitivity and detecting weak similarities. FASTA and BLAST trade some sensitivity for speed. As mentioned previously, the distance measure in EST clustering is reduced to binary; it is therefore only necessary to detect a near or perfect match. Extension penalties and gapping manipulation become less important in an initial assessment of pairwise identity. It is therefore important to 'head for speed' over sensitive comparison. Use of a banded Smith-Waterman on already compared clusters is a tenable approach for further consensus generation.
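To give a sense of scale, a naive all-against-all comparison of n sequences needs n(n-1)/2 pairwise tests. The short Python calculation below is only an illustration; the figure of one million ESTs is an assumed round number, not a count taken from any database.

```python
def pairwise_comparisons(n):
    """Number of unordered sequence pairs in an all-against-all comparison."""
    return n * (n - 1) // 2


# Assumed round figure of one million ESTs, for illustration only.
comparisons = pairwise_comparisons(1_000_000)
print(comparisons)  # 499999500000
```

Roughly 5 x 10^11 pairwise tests for one million sequences is the reason the speed of each individual test matters more than its sensitivity.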



7.2.1 Alignment scoring methods: BLAST and FASTA

BLAST is an algorithm efficiently implemented on many platforms. Although not developed specifically for clustering, BLAST sequence comparison is widely used for initial EST clustering because it is readily available and flexible enough to be tuned for the task by changing the default parameters (wrappers for iterative BLAST comparison exist and are accessible via the web; see appendix). Using the standard BLAST II application, available from NCBI by anonymous FTP, it is possible to set up a stringent match set of parameters as follows:

-e expectation value set to 0

-G cost to open a gap can be increased

-E cost to extend a gap can be increased

-q mismatch penalty increased

-r match reward increased

-a number of processors to use can be adjusted

-W word size (found on NCBI web) set up for longer words (default 11)
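The parameter list above might be assembled into a command line as in the Python sketch below. It assumes the legacy NCBI `blastall` executable with its `-p`, `-i` and `-d` options (program, query file and database), which are not mentioned in the text above. All numeric values are hypothetical placeholders chosen only to show the direction of each change, including a very small expectation value in place of a literal 0; they are not recommended settings. The command is only built here, not executed.

```python
def stringent_blast_command(query, database, evalue=1e-20, gap_open=6,
                            gap_extend=3, mismatch=-5, match=2,
                            word_size=20, processors=1):
    """Build an argument list for a stringent blastn comparison.

    All default values are illustrative assumptions, not tuned settings.
    """
    return [
        "blastall", "-p", "blastn",     # assumed legacy executable
        "-i", query, "-d", database,    # query file and formatted database
        "-e", str(evalue),              # expectation value close to 0
        "-G", str(gap_open),            # cost to open a gap
        "-E", str(gap_extend),          # cost to extend a gap
        "-q", str(mismatch),            # mismatch penalty
        "-r", str(match),               # match reward
        "-a", str(processors),          # number of processors
        "-W", str(word_size),           # word size (default 11)
    ]


command = stringent_blast_command("ests.fa", "estdb")
print(" ".join(command))
```

Keeping the values as function arguments makes it easy to test how cluster membership changes as each parameter is tightened.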

FASTA is less widely used, but can also be applied for the same purpose. Generally, it allows the same type of variation in parameters as BLAST: the k-tup parameter can be increased to increase speed, and a threshold raised to pick up only the strongest similarities.

7.3 Purpose-built alignment based clustering methods

The field has recently seen the emergence of many new algorithms in development, but dedicated production algorithms are still few. We will not attempt to review them all at this time, as a growing body of new material has appeared very recently.

ICAtools, last distributed in 1997 by Jeremy Parsons, represent an alignment-based systematic approach to clustering. This group of algorithms was one of the first to become available; it has been used at major sequencing centres and is useful for data reduction. The system is, however, not as complete as others that have been developed subsequently. ICAtools were developed as part of the UK human genome mapping project (Alwen, 1990).

According to Jeremy Parsons, the ICAtools are a set of programs that could be of use to anyone doing medium-to-large scale DNA sequencing projects. The system has several tools; it was originally designed to address database redundancy, then adapted for genomic fragments and finally for EST fragments.

ICAass uses a BLASTN type of algorithm to perform 'database pruning' to assess whether one sequence is a sub-set of another. N2tool constitutes a dedicated clustering tool that relies on indexing. It uses an indexed file format and local alignment to compare all the submitted sequences with each other to find those which share any region of similarity. ICAtool indexes DNA sequences into clusters which share local sequence similarities. ICAass takes a size-sorted (longest first) file of sequences and searches for those sequences which are approximately repeated within the length of another. ICAmatches attempts to explain why sequences have been clustered together by using novel sequence alignment. N2tool has been utilised in the Washington University Merck EST manufacture, for identification of artifactual sequences. A new generation of these tools is being developed and employed at EBI [7], and employs CORBA and Java for database interrogation (JESAM: Parsons and Rodriguez Tome, personal communication).

EST-BUILDS

Another common-sense system recently published by Gill et al (1998) is a 'dynamic build'

system, where a seed EST is used to match with others via a build-blast strategy. This method

works well for single seed clusters, but is not implemented for large scale database building.

7.4 Non-alignment based scoring methods: D2-clusterD2-cluster is a word multiplicity comparison method that utilises an agglomerative algorithm

that has been specifically developed for rapidly and accurately partitioning transcript

databases into index classes by clustering ESTs and full-length sequences according to

minimal linkage or transitive closure rules. Agglomerative clustering method means thatevery sequence begins in its own cluster and the final clustering is constructed through a

series of mergers that may be described in terms of minimal linkage, sometimes called single

linkage or transitive closure". The term transitive closure refers to the property that any two

sequences with a given level of similarity will be in the same cluster, hence A and B are in the


11/24

same cluster even if they share no similarity but there exists a sequence C with enough

similarity to both A and B. The criterion for joining clusters is the detection of two sequences

that share a window of (Window_Size) bases that is (Stringency) percent or more identical.

The only criterion for clustering is sequence overlap and source or annotation information is
not used. To detect the overlap criterion we use the d2 algorithm and set parameters and

threshold values as described in (Torney et al, 1990; Hide, et al, 1994; Wu et al, 1997). The

initial and final state of the algorithm is a partition of the input sequences where each

sequence is in a cluster and no sequence appears in more than one cluster.
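The transitive closure rule described above can be sketched with a union-find structure. This is a minimal illustration in Python and not the d2_cluster implementation: the names `single_linkage` and `share_word` are hypothetical, and a toy "share any 4-base word" test stands in for the Window_Size/Stringency criterion.

```python
from itertools import combinations

def single_linkage(seqs, similar):
    """Partition seqs by transitive closure of the pairwise `similar` test."""
    parent = list(range(len(seqs)))  # every sequence starts in its own cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if similar(seqs[i], seqs[j]):
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, s in enumerate(seqs):
        clusters.setdefault(find(i), []).append(s)
    return sorted(clusters.values())

# Toy stand-in for the overlap criterion: the sequences share any k-base word.
def share_word(a, b, k=4):
    words = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in words for i in range(len(b) - k + 1))

A, B, C, D = "AAAACCCC", "GGGGTTTT", "CCCCGGGG", "ACGTACGT"
print(single_linkage([A, B, C, D], share_word))
```

In this toy input A and B share no word, yet both share a word with C, so all three end up in one cluster while D remains a singleton.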

D2-cluster uses an approach of word matching within a window, together with a measure of
the multiplicity (if any) of that word within a window. The principal concept is that it doesn't

attempt an alignment, not even in a reduced form. The results of comparison are derived

directly from the comparison of word composition (word identity and multiplicity) of 2

sequence windows. Thus, the algorithm can be significantly faster than BLAST. Speed comes

with a price: to collect significant statistics, the fragments must be long enough (about 100

bp) and only very high similarities can be detected (above 90% identity within a window).
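A comparison of word composition without alignment can be sketched as follows. The scoring used here, the sum of squared differences in word multiplicity between two windows, is one common formulation and is assumed for illustration; the function names and the default word size of 6 are choices of this sketch, not a description of the d2_cluster code or its thresholds.

```python
from collections import Counter

def word_counts(window, k=6):
    """Multiplicity of every overlapping k-base word in a window."""
    return Counter(window[i:i + k] for i in range(len(window) - k + 1))

def d2_distance(win_a, win_b, k=6):
    """Sum over all words of the squared difference in multiplicity."""
    ca, cb = word_counts(win_a, k), word_counts(win_b, k)
    return sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys())

# Identical windows score 0; the score grows as word composition diverges.
print(d2_distance("ACGTACGTAC", "ACGTACGTAC"))
print(d2_distance("AAAAAAAA", "CCCCCCCC"))
```

No alignment is computed at any point: only the word tables of the two windows are compared, which is where the speed advantage comes from.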

D2-cluster is used to produce the initial loose clusters in the STACK clustering system. We have
determined that the results of d2_cluster alone are between 8% and 20% less fragmented than
UniGene (Burke, Davison, Hide, submitted, Genome Research) and that the STACK datasystem
produces clean clusters that are 16% less fragmented than UniGene (Miller et al, submitted,
Genome Research 1999).

7.5 Pre-indexing methods

The size of the datasets has effectively precluded their use on workstation architectures.
Indexing is one approach that allows for less computationally intense operations. Indexing of
sequences allows one to store ready tables of words and their positions in a sequence, so the
comparison algorithm has more than half of the work done already when it starts. Publicly
available tools are expected in the near future. For instance, 'QUASAR', announced at
RECOMB99, is termed 'Q-gram Alignment based on suffix arrays'. This
algorithm is designed to quickly detect sequences with high similarity to the query in a
context where many searches are conducted on one database. The database is presented in
pre-processed (indexed) form and similarity detection is based on the exact matching of short
substrings (q-grams) (Burkhardt, et al 1999).
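The idea of a ready table of words and their positions can be sketched as below. Note the swap: this sketch uses a plain hash table rather than the suffix arrays QUASAR is built on, and the names `build_index` and `candidates`, the word length of 4 and the `min_hits` cut-off are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(db, q=4):
    """Map each q-gram to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in db.items():
        for pos in range(len(seq) - q + 1):
            index[seq[pos:pos + q]].append((seq_id, pos))
    return index

def candidates(query, index, q=4, min_hits=2):
    """Ids of database sequences with at least min_hits exact q-gram matches."""
    hits = defaultdict(int)
    for pos in range(len(query) - q + 1):
        for seq_id, _ in index.get(query[pos:pos + q], ()):
            hits[seq_id] += 1
    return sorted(s for s, n in hits.items() if n >= min_hits)

db = {"e1": "ACGTACGTTT", "e2": "GGGGCCCCAA"}
index = build_index(db)          # built once, reused for every query
print(candidates("TTACGTAC", index))
```

The index is built once per database, so each later query only pays for table look-ups on its own q-grams.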

8 Systems and Algorithms

8.1 TIGR_ASSEMBLER

Developed for the assembly of large shotgun projects, TIGR assembler is also suited to EST
assembly and has been employed to manufacture consensus sequence assemblies for
TIGR-THC from ESTs. It employs a standard rapid oligonucleotide content comparison to
reduce search time. Pairwise comparisons generate a list of potential (end-)overlaps.
Non-repeat fragments seed subsequent assembly of listed overlaps.

TIGR Human Gene Index (HGI, http://www.tigr.org) employs the strict assembly method of
TIGR_ASSEMBLER (Sutton, et al, 1995), grouping highly related sequences and
consequently producing very accurate consensus sequences with a minimum of chimerism or
other contamination. This method discards under-represented and divergent or noisy
sequences in favour of confidence based on transcript redundancy, but in doing so can
eliminate related sequences from clusters which might provide examples of alternative
splicing or other valuable forms of sequence diversity.

For TIGR THC, assemblies and ESTs are clustered according to a two stage process by the
THC_BUILD script (G. Sutton):

1) BLAST and FASTA are used to identify all sequence overlaps,
2) All overlaps are stored in a relational database, and
3) Transitive closure groups are formed and subjected to assembly using TIGR assembler
   (Sutton et al, 1995) such that joining occurs when sequences share at least 95% identity
   over at least 40 bases (http://www.tigr.org/hgi/hgi_info.html).

The presence of sequence repeats affects both the order and stringency of joining. The

resulting assemblies are referred to as tentative human consensus sequences (THCs). The



assembler also imposes special matching constraints on the ends of sequences and a minimum

sequence identity within an index group. The strictness of matching criterion has the

advantage of often preventing chimerism and contamination from tainting index groups but

results in a more fragmented representation of the data that is less able to incorporate
error prone sequence. This strictness often disallows the combination of sequences with

sufficient diversity so that sufficiently divergent ESTs that sample alternative splice forms of

the same gene are kept in different assemblies but they are linked as being splice variants in

those cases where the ESTs match sequenced genes with known isoforms in EGAD. Most

sequence assembly programs share similar properties with TIGR. The PHRAP package
incorporates sequence quality data derived from sequence traces into the assembly process (P.
Green, personal communication), allowing for the incorporation of higher error data, but the
edge matching and maximum mismatch criteria must still be satisfied. In general, the results
of assembly programs are not invariant with respect to presentation order (Burke, Davison,
Hide, submitted, Genome Research).

Table 1. Comparison of TIGR_THC with SANBI STACK_PACK methodology.

Methodology        Input Sequences   Singleton Groups   % Singleton Groups
TIGR Gene Index    626 163           135 140            21.83
STACK_PACK         415 833           58 070             13.96

STACK_PACK analysis of UniGene clusters resulted in a fragmentation rate just over half of the
TIGR index.

8.2 UniGene

The most widely known effort is UniGene (Boguski et al, 1995) from NCBI. UniGene will be
replaced by RefSeq and LocusLink8 projects as the availability of genomic sequence
increases.

The UniGene project originally took the fingerprinting characteristic of 3' UTRs (and hence
3' ESTs) as a paradigm and indexed genes by clustering 3' end EST sequences with mRNA
data extracted from GenBank (Benson et al, 1994). 5' end ESTs were added to the clusters

using clone information. To enhance the speed of the clustering phase of the project a two-

phase searching process was used for overlap detection. First, two sequences were marked

for further comparison if they shared two common words of length 13 separated by no more

than 2 bases. The sequences selected for further comparison were then compared with a

constrained Smith-Waterman local alignment algorithm. The membership of clusters in

recent versions of UniGene, however, suggests that additional criteria are being used to

determine cluster membership (Burke, Davison and Hide, submitted, Genome Research).

UniGene clustering proceeds in several stages, with each stage adding less reliable data to the
results of the preceding stage. 'Reliable' refers to the ordering pairwise identity > annotation >
shared clone, etc.

Contaminant screening of vector sequences, repetitive elements, and mitochondrial and
ribosomal sequences is performed using NCBI's DUST. After screening, a sequence must

contain at least 100 informative bp to be a candidate for entry into UniGene (informative

length). This initial criterion differs from the 'all in' strategy employed at SANBI where

alignment is not required for clustering. Sources for UniGene include GenBank Genomic,

dbEST and GenBank mRNAs. GenBank Genomic sequences are electronically spliced exons

(vGenes).

Initial clustering is performed by comparing the set of gene sequences (mRNA or genomic

sequences, many of which are complete CDSs) with itself. Sequence pairs which are

sufficiently similar are grouped together to form initial clusters. EST to gene links and EST

to EST links are added to these clusters. The set of ESTs is compared with the set of genes

using WHALE (http://nucleus.cshl.org/meetings/98genome_absstat.htm), and sufficiently

similar sequence pairs are added to the clusters (MEGABLAST Unpublished, figure 4). Any



new links which would join two distinct clusters from the preceding stage (that is, join two

sets of genes not linked to form one cluster without the addition of ESTs) are discarded.

Figure 4, from Wagner et al. Abstracts 1999 Genome Sequencing and Biology. P340. Cold Spring Harbor
Laboratory.

Any resulting cluster which does not contain a sequence with a polyadenylation signal or two
labelled 3' ESTs is discarded. Clusters which pass these criteria are called anchored clusters,
since their 3' end is presumed to be known.

The drawbacks of this method mirror the benefits of the strict matching technique:
chimerism and other artefacts can join unrelated clusters, and furthermore the diversity within
a grouping can make the generation of a consensus sequence complex (Schuler, et al, 1996),
often resulting in accepting only the longest representative of an index class as its consensus.

8.3 STACK and STACK_PACK

As described in Hide et al (1998) and in Miller, et al (Genome Research, submitted),
STACK_PACK makes use of dbEST and clusters ESTs within tissue source categories.
STACK_PACK extends the application of the loose clustering approach to a global level,
defining index classes by the total number of (possibly disconnected) matching 6-base words
rather than by alignment to previously identified class members. The approach is robust with
respect to EST quality data and increases the capacity to link splice-related ESTs without
concern for non-overlapping regions. The related but loose clusters are subsequently
processed by strict assembly and analysis tools to identify, characterise and isolate any
sequence divergence. The clustering algorithm (d2_cluster) is discrete from the assembly tool
(PHRAP) and identifies ESTs that are greater than 96% identical over a window of 150
bases. PHRAP subsequently aligns them to provide an assembly for subsequent alignment
consensus processing. STACK uses all available raw sequence data during cluster generation

and uses CRAW (Burke, et al, 1998) to generate consensi, to annotate polymorphic regions

and alternative splicing forms in the clusters. Contigproc then chooses the longest consensus

of highest quality for output. Clusters are viewed with VIZ to allow initial analytical

assessment.

8.4 Other datasystems

The Genexpress Index (Houlgatte, et al, 1995) and the Merck Gene Index (Williamson, et al,
1995) group sequences into clusters based on sequence overlap above a given alignment
threshold, as UniGene does. The index was originally manufactured using Fasta comparisons
with 3' sequence only. No searchable index can be found on the Web for Merck Gene Index.

8.5 Strategies for keeping data 'current'

The high throughput nature of EST production ensures that the available databases grow
almost daily, and the value of a consolidated EST database is a direct function of how current
the represented dataset is. UniGene's use of full-length gene sequences to seed the initial
clusters and rejection of linking ESTs ensures that new EST data will only add to existing
clusters, while the stringent clustering approach of TIGR Human Gene Index similarly limits
the cluster joining effect new sequences may have. Purely loose clustering strategies such as
that employed by STACK are more vulnerable to cluster joining as database revisions are
released. In either case, production time requirements for the consolidated database have
resulted in the need for a dynamic "ADD" facility whereby new sequences can be added
without regenerating all clusters. Of note in this regard is that ADDs will perpetuate any
previous processing errors, while a full production run increases confidence in the final result.

Every STACK (tissue-based) generation requires the clustering of the entire GenBank EST

dataset. dbEST is increasing enormously with each GenBank release and this has impacted

database generation because the input data to the clustering step has exceeded 100 000

sequences per tissue subset. Crossmatch finds overlapping regions at the ends of

approximately 30% of D2-clusters. These newly formed clusters are confirmed by closer
examination of the sequence alignments.

We reduce the input data to D2 cluster using crossmatch. New ESTs are extracted from
GenBank after each bimonthly release. These ESTs are partitioned into tissue subdivisions.
The ESTs are initially compared to existing STACK consensi using crossmatch. The STACK
consensi that find matching ESTs are collapsed to their individual ESTs and combined with
their corresponding new ESTs. The ESTs are thus reduced by 20-60% prior to D2 cluster
processing. The approximately 20-40% of ESTs that do not find matching members using
crossmatch are fed into D2 cluster. The D2 clusters are renamed taking into account the

STACK-Ids that are present already. These new clusters together with the crossmatch

expanded clusters are assembled using PHRAP. The stack clusters that were expanded using

crossmatch are removed from the alignment file for the previous STACK release. This

reduced file is appended to the new alignments for the crossmatch and D2 clusters. At this

stage, the alignments are fed into the STACK_PACK system.
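The first step of this update procedure, splitting new ESTs into those that extend existing clusters and those that still need clustering, might be sketched as follows. The function `plan_update`, the `consensus_hits` mapping (a stand-in for parsed crossmatch output) and the identifiers in the example are hypothetical.

```python
def plan_update(new_ests, consensus_hits):
    """Split new ESTs into those joining existing clusters and the rest.

    consensus_hits maps an EST id to the existing cluster id whose consensus
    it matched; ESTs with no match are absent from the mapping.
    """
    expanded = {}    # cluster id -> new ESTs to add before reassembly
    unmatched = []   # ESTs passed on to the clustering step
    for est in new_ests:
        cluster = consensus_hits.get(est)
        if cluster is None:
            unmatched.append(est)
        else:
            expanded.setdefault(cluster, []).append(est)
    return expanded, unmatched

expanded, unmatched = plan_update(
    ["est1", "est2", "est3"],
    {"est1": "cluster_a", "est3": "cluster_a"},
)
print(expanded, unmatched)
```

Only the clusters named in `expanded` need to be collapsed to their ESTs and reassembled; everything in `unmatched` goes through clustering as usual.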

9 Cluster Assembly and Processing

Multiple sequence alignment is a complex problem which has been tackled by many groups.
UniGene avoids it entirely by only listing the accession IDs of related sequences in the
database. A current standard for assembly of fragments from shotgun cloning of a single

gene is Phil Green's PHRAP, but even this cannot be expected to cope with the problems of

quality and sequence divergence observed in loose EST clusters. PHRAP will reject

sequences it considers to be "unrelated" and output them in separate files, regardless of the

findings of the clustering engine.

9.1 Processing alignments

In the real world of noisy data, no alignment engine performs as well in every case as the
human scientist's eye might be able to suggest; for high-volume, high-throughput applications
this problem can be compensated for by further processing of cluster assemblies, while
alignment editors such as Genetic Data Environment (GDE/SeqLab from GCG) can be
invaluable for improvement "by hand" of individual cases. Nonetheless, assembled EST
clusters provide confidence and quality information - even in the absence of original trace
data - from which high accuracy, extended length consensus sequences can be constructed for
the vast majority of base positions. Given the realities of loose EST clusters, we find it
necessary to post-process cluster assemblies with CRAW, a tool which groups related
sequences within an assembly and generates consensus sequences for each subset.

How is a consensus decision made? In the light of the significant divergence that can occur
between alternately spliced forms of a gene, majority consensus rule generation is
inappropriate. Rather, consensus sub-forms need to be clearly derived and extracted using the
free, publicly available viewer VIZ9 and, on licensing, CRAWVIEW (figure 5).

From Burke et al 1998:

Figure 5: Alternate splicing and chimerism provide subsequence alignments that bear valuable
information. An alternate form of expressed sequence is detected in an ovarian tumor library using
sub-consensus analysis.

10 Clone Linking

All ESTs generated from the same cDNA clone correspond to a single gene. Each EST

obtained from GenBank is searched for clone identification so that we can trace the

transcripts corresponding to the same gene. 87% of ESTs currently have documented clone

information. We utilise this information to extend the length of the cluster consensi by

joining clusters that have ESTs that share clone-IDs.

For a gene that is not yet fully sequenced, achievement of a representative consensus
sequence from clustered EST data requires joining the available 5' and 3' read consensi.
Given that the clone ID information is solely annotation based and may have namespace
overlaps depending on the data source(s), this step is best handled near the end of the
processing pipeline so that errors detected in the future can be repaired with a minimum of
re-processing. Furthermore, unless a specific 5'-3' pair can be identified as a seed for each gene

consensus, the procedure is transitive in nature and may lead to extensive clone-linked

networks whose biological significance remains to be explored. The basic algorithm for clone

linking used in STACK is:

    form a queue consisting of an initial cluster
    do {
        for each EST with a clone ID, add any cluster containing an EST
        with a matching clone ID to the queue
    } until no new clusters are added
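The queue-based procedure above can be written as a breadth-first search. This is a minimal Python sketch rather than the STACK code: the function name `clone_linked_set` and the `cluster_clones` mapping (cluster id to the set of clone IDs found on its ESTs) are assumptions made for illustration.

```python
from collections import deque

def clone_linked_set(seed, cluster_clones):
    """Return every cluster reachable from seed through shared clone IDs."""
    # Invert the mapping once: clone ID -> clusters holding an EST from it.
    by_clone = {}
    for cluster, clones in cluster_clones.items():
        for clone in clones:
            by_clone.setdefault(clone, set()).add(cluster)

    linked = {seed}
    queue = deque([seed])             # the queue starts with the initial cluster
    while queue:                      # stops when no new clusters are added
        current = queue.popleft()
        for clone in cluster_clones[current]:
            for other in by_clone[clone] - linked:
                linked.add(other)
                queue.append(other)
    return linked

clusters = {
    "c1": {"k1"},
    "c2": {"k1", "k2"},
    "c3": {"k2"},
    "c4": {"k9"},
    "c5": set(),      # cluster whose ESTs carry no clone ID
}
print(sorted(clone_linked_set("c1", clusters)))
```

Here c1 and c3 share no clone ID directly but are both linked to c2, which illustrates the transitive nature of the procedure noted in the text.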

When a closed set of clone-linked consensi has been identified, they may be ordered 5'-

unassigned-3' based on a majority rule from the EST annotations in each cluster. To form a

final consensus sequence in STACK, the non-redundant best (see Cluster Assembly and

Processing above) cluster consensi are joined by linker segments of 20 Ns. This choice was

made based on the word size employed by BLAST, so that alignment breaks would be

preferentially inserted at these linker regions.
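Joining ordered consensi with linker segments could look like the sketch below. The linker length of 20 Ns is taken from the text; the orientation labels, the function name `join_consensi` and the input layout are assumptions made for illustration.

```python
LINKER = "N" * 20  # linker length stated in the text

def join_consensi(consensi):
    """Join consensi ordered 5' -> unassigned -> 3' with N linkers.

    consensi is a list of (orientation, sequence) pairs, where orientation
    is "5", "unassigned" or "3" (illustrative labels).
    """
    rank = {"5": 0, "unassigned": 1, "3": 2}
    ordered = sorted(consensi, key=lambda item: rank[item[0]])
    return LINKER.join(seq for _, seq in ordered)

joined = join_consensi([("3", "GGGTTT"), ("5", "AAACCC")])
print(joined)
```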

11 A working clustering system

Why is clustering being performed? What is the primary output required? Designing a system
for clustering has to be tightly linked to the immediately desired outcome. Common
requirements include:



11.1 Expression counts
11.2 Consensus sequences
11.3 Alternate expression-form characterisation
11.4 SNP detection
11.5 Identification of genes expressed in the cluster project
11.6 Identification of genes specifically expressed in a chosen library or tissue.

Each of the above requires specific data manipulations. Direct clustering will yield expression
counts, assembly will yield consensi, alignment processing and viewing will yield alternate

splices and SNPs, and improved consensus length generation and consensus joining will yield

more correctly identified gene expression forms. In the design of STACK_PACK we have

tried to address each of these direct outputs.

12 Brief introduction to the STACK_PACK clustering system

In this section we will familiarise the user with the STACK_PACK structure and describe the
major components and processes in the light of the implementation issues we have described
above. The principal design we have chosen centres around a hierarchical clustering
approach. We initially chose this approach for CPU limiting reasons. We have found
subsequently that hierarchical clustering allows for flexible scaling with centralised or
distributed processing, viewing and analysis of consensus clusters at any defined level: from
original EST library/tissue cluster to whole index cluster view. The approach lends itself well
to adding of new ESTs at the appropriate level. We have found it particularly powerful in
analysis of alternate splice events specific to particular tissue or expression 'bins'. (With the
advent of index based and tree-based methods, faster initial clustering implementations will
allow for non-hierarchical approaches. Without a sophisticated, dynamic 'ADD'
implementation any approach loses effectiveness.)

We detail tools specific to the STACK_PACK process. Tools used by the system such as
PHRAP are described elsewhere.

12.1 STACK processing

12.2 Data bin strategy and masking

Data splitting: The initial step is to partition the input set into manageable subsets. This is a
function of available CPU time and machine capacity; currently up to 500,000 sequences may
be clustered if one has access to SGI/Cray resources. For SANBI's STACK database, we
perform initial partitioning on the basis of available bin (usually tissue) annotations for each
sequence, for a phase I process into bin consensi. Later the resulting consensus sequences can
be clustered and the final assemblies built into indices.

Masking: Cross-Match, XBLAST, comparison mask-databases. Input sequences are Fasta

format with headers that contain accession number of the EST and CLONEID.

Clustering: D2-Cluster (as described above) and Crossmatch.

Assembly: PHRAP is our assembler of choice, but any preferred assembler can be
successfully employed as long as the input data is in Fasta format. Output data can be parsed
from PHRAP .ace files and also from .gde files for processing into CRAW, which requires
.gde format.

Alignment analysis: CRAW processes the alignments for groups that share consensus and

processes them into quality/difference partitions.

The CRAW tool allows for selective assessment of alignments. It measures and displays
consistency. As a significant proportion of alignments resulting from assembled clusters can
contain non-optimal consensi and 'junk' as well as chimeras, alternate splices and
polymorphisms, it is important to perform directed analysis of cluster alignments.

CRAW consistency

A group of more than two sequences is consistent if and only if every sequence in the group is



pairwise-consistent with the consensus sequence derived for the group. If several unrelated

sequences result in a poor quality alignment, a simple majority consensus generation rule

might sample the sequences such that a consensus is generated that is consistent with each

individual sequence in the group even when some sequence pairs or subsets contain a high
degree of mismatching. To prevent this, the consensus sequences are generated by an early

bias weighted method. We partition a group of sequences into sub-groups such that every

sub-group is consistent. Note that we must look for a maximal sub-grouping because the

trivial solution of breaking a group of N sequences into N singleton sub-groups is consistent

by our definitions. Thus, we find an optimal partition such that as many sequences as
possible are assigned to sub-groups. In our method, a sequence group is input and multiple
alignments and consensus sequences are output. The amount of information that can be lost
to artifact or alternate gene forms is bounded by the fact that no sequence contains a window
of length W of less than (100*SIM) percent identity with the consensus. Regions of internal
sequence are found by looking for regions in the alignment where windows of gapped region
size (G_R_S) contain over ceiling(GAP_PERCENT*G_R_S) gaps. An analysis in the paper
describing CRAW used parameters G_R_S = 15 and GAP_PERCENT = 0.9 (Burke, et al,
1998).
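The gapped-window test can be sketched as below. This is an illustration, not the CRAW source: the function name `gapped_windows` is hypothetical, and "over" is read as strictly greater than, which is an assumption. With G_R_S = 15 and GAP_PERCENT = 0.9 the threshold is ceiling(13.5) = 14, so under this reading a window must consist entirely of gap characters to be flagged.

```python
import math

def gapped_windows(row, g_r_s=15, gap_percent=0.9):
    """Start offsets of windows in an aligned row that exceed the gap threshold."""
    threshold = math.ceil(gap_percent * g_r_s)
    return [start for start in range(len(row) - g_r_s + 1)
            if row[start:start + g_r_s].count("-") > threshold]

row = "ACGT" + "-" * 16 + "ACGT"   # 16 gap columns inside an aligned read
print(gapped_windows(row))
```

Lowering `gap_percent` makes the test more permissive, flagging windows that are only partly gapped.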

12.2.1 Detecting and viewing alternate splicing

ALIGNMENT CONTAINS INCONSISTENCY:
Strong Secondary Consensus Found.
One position equals 20 bases.

X if more than 2 bases ( 10 percent) disagree with consensus sequences.

N if more than 2 positions are unknown.

"-" if more than 14 positions are gap characters.

0 200 400 600 800 1000 1141

| | | | | | |

------------------1112226666666666666666----------------- 6 AA205280 (brain)

------------------1112226-------------------------------- 6 AA205467 (brain)

------------------1112226666666666666666----------------- 6 cons. for 6

-111111111111NNNNN44444444------------------------------- 4 N62129

-1111111111144441---------------------------------------- 4 AI262897

-1111111111114441N44444444------------------------------- 4 cons. for 4

---------------11111111111111313333333------------------- 3 AA476710 (repro)

-----------------311111111111313333333333333------------- 3 R.C.AA902555(hemat)NCI Kid3

-----------------13111111111131333333333333333----------- 3 R.C.AA128258(repro)

----------------------111111131333333333333333----------- 3 R.C.N94727

-------------------------11113133333333333333333--------- 3 AA129065

------------------------------1333333333333333----------- 3 R.C.AA243505

-------------------------------133333333333333----------- 3 R.C.AA127007(repro)

---------------------------------3333333333333----------- 3 R.C.N43925

-----------------------------------------33333----------- 3 AA442275 (repro)

---------------111111111111113133333333333333333--------- 3 cons. for 3

-1111111111111111111122---------------------------------- 2 AA262073

-11111111111111111111222--------------------------------- 2 AA831176

-11111111111111111111222222222--------------------------- 2 AA810752(hemato)NCI_GCB1

-11111111111111111111222222222--------------------------- 2 cons. for 2

NN11111-------------------------------------------------- 1 AA199678

11111111111111111111------------------------------------- 1 AA768208

-11111111111--------------------------------------------- 1 AA126628

-1111111111111111111111111111---------------------------- 1 N63563

-111111111111111111111----------------------------------- 1 AA730260

-111111111111111111111----------------------------------- 1 AA872419(hemato) tumors

-11111111111111111111111111111111------------------------ 1 AA128314

-1111111111111111---------------------------------------- 1 AA732483

-11111111111111111111------------------------------------ 1 AA993331


--111111111111111111111111------------------------------- 1 N34852

---11111111111111111------------------------------------- 1 AA322575

------------1111111111----------------------------------- 1 AI267543

111111111111111111111111111111111------------------------ 1 cons. for 1

-1111111111111111111155---------------------------------- 0 AA743074(hemato)NCI GCB1

Figure 6: CRAW analysis of a STACK whole body index cluster that shows similarity to
a genomic clone AC004106. Tissue origins are shown in brackets and clone origin is
underlined. Primary consensus (11111) shows significant similarity to clathrin coat
adaptor complex sigma1B protein. The secondary consensi (2222, 333333, 666666) show
significant similarity to two exons, 34000-36000 and 54100-54200 (putative alternate
splice regions in genomic clone AC004106). Three ESTs (AA743074, AA205280 and
AA128258) (bold) were reported to represent putative alternate splice transcripts on
genomic clone AC004106 (Bouck et al., 1999). This CRAW analysis would suggest that
there are more ESTs, in addition to the three reported by Bouck, that sample the
putative alternate transcripts for the clathrin coat complex.

Consensus processing: CONTIGPROC

Once consensus sequence(s) have been generated by CRAW, it is necessary to attempt to
choose the most appropriate consensus according to gene isoform.

CONTIGPROC independently partitions the aligned sequences amongst the CRAW consensi,
then ranks the consensi according to number of assigned sequences and number of called
bases. Contigproc reads a cluster assembly and the associated consensus sequences generated
by CRAW, then assigns each EST in the assembly to its 'best consensus' - the consensus that

has the most contributing ESTs. A round of elimination follows, in which consensi

representing only a single (or even no) ESTs are removed, then the consensi are ranked

according to the number of ESTs assigned to each (ties are broken by consensus sequence

length considering only clear ATCG base calls). The remaining consensi are logged with the

best consensus in the GIO and GDE file formats which support representation of sequence

alignment data. Finally, the internal cluster representation is output in the supported file

formats (GIO-NCGR, STACK-FASTA, SANIGENE-FASTA, STACK-GDE, SANIGENE-

GDE). The 5' or 3' orientation of each cluster is determined by a vote of the individual EST

annotations, and all output consensi are arranged to read 5' to 3'. Low quality consensus

regions, defined as two N's followed by at least thirteen IUPAC codes with four or less clear

A, T, C, or G calls, are replaced by a single run of 10 Ns.
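The elimination and ranking rounds can be sketched as follows. This is not the CONTIGPROC source: the function name `rank_consensi` and the `assigned` mapping (consensus sequence to the ESTs assigned to it) are assumptions, and the earlier step of assigning each EST to its best consensus is taken as already done.

```python
def rank_consensi(assigned):
    """Rank consensi by EST support, breaking ties on clear base calls."""
    def clear_calls(seq):
        # Count only unambiguous A, C, G, T calls in the consensus.
        return sum(seq.upper().count(base) for base in "ACGT")

    # Elimination round: drop consensi representing a single EST or none.
    kept = [(cons, ests) for cons, ests in assigned.items() if len(ests) > 1]
    # Ranking: most ESTs first, then most clear base calls.
    kept.sort(key=lambda item: (len(item[1]), clear_calls(item[0])), reverse=True)
    return [cons for cons, _ in kept]

assigned = {
    "ACGTNN": ["e1", "e2"],
    "ACGTAC": ["e3", "e4"],
    "ACG": ["e5"],
    "AAAAAAAAAA": ["e6", "e7", "e8"],
}
print(rank_consensi(assigned))
```

The first element of the returned list plays the role of the best consensus; the remainder correspond to the consensi that are logged alongside it.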

Clone-linking:

More than one end may exist for each clone ID, and rather than tracing from EST1 -> clone ID
-> EST2, we simply form groups of ESTs for each clone ID and then check for any links
between these sets for extended clone link networks.

Output system

GDE format is appropriate for the presentation of cluster assembly data, but does not support

significant regions of non-overlapping sequences as in the case of the clone-linked data. Such

features are well supported by GSDB's GIO format, but this is not widely accepted by

software in the sequence processing field. FASTA format sequence data is ubiquitously

accepted, but captures only a single sequence in each record. As a result, SANBI supplies

appropriate data in all three formats, and further distinguishes clusters which are/are not
joined together in the clone linking phase for the GIO and FASTA output sets.

Data Structures

12.2.1.1 Cluster assembly for maximum consensus quality

Further processing of cluster assemblies (see CONTIGPROC/CRAW below) ensures high
throughput consistency of clustering alignment, while alignment editors such as GDE can be
invaluable for improvement "by hand" of individual cases. Nonetheless, assembled EST
clusters provide confidence and quality information - even in the absence of original trace
data - from which high accuracy, extended length consensus sequences can be constructed for
the vast majority of base positions.

Output restriction for generation of higher quality sequence:

SANIGENE dataset generation strategies

The exact SANIGENE membership criterion is:

At each base position of the consensus sequence
    A = number of EST bases that agree with consensus
    B = number of EST bases that disagree with consensus
    (A-B) >= 2

Only ESTs assigned to the top CRAW consensus are considered, i.e. hopefully only the "good
sequences" from a cluster are in this set to begin with. (For any STACK or SANIGENE
cluster we mainly work only with the top CRAW consensus and the ESTs assigned to it, i.e.
CRAW subconsensi and their ESTs are dumped off to the GDE files.)
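The per-position criterion can be sketched as below. The function name `sanigene_mask` and the convention that '-' marks a position a read does not cover are assumptions of this sketch; how the real pipeline treats gaps and ambiguity codes is not stated in the text.

```python
def sanigene_mask(consensus, reads):
    """Flag consensus positions where agreeing minus disagreeing bases >= 2.

    reads are aligned rows of the same length as consensus; '-' marks
    positions a read does not cover (an assumption of this sketch).
    """
    keep = []
    for pos, cons_base in enumerate(consensus):
        column = [r[pos] for r in reads if r[pos] != "-"]
        agree = sum(base == cons_base for base in column)   # A
        disagree = len(column) - agree                       # B
        keep.append(agree - disagree >= 2)
    return keep

mask = sanigene_mask("ACGT", ["ACG-", "ACGT", "AGG-"])
print(mask)
```

In the example the second position fails because two agreeing reads are offset by one disagreeing read, and the last position fails because it is covered by a single read.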

Generation of high confidence consensi

By restricting the final output sequence to only those regions represented by at least two

reads, as in the SANBI SANIGENE dataset, high confidence consensi can be generated with

a longer length (due to the rejection of clusters with small numbers of ESTs) in the average

case.

Rejection strategies.

To avoid redundancy in the final sets for BLAST searching, only the best consensus from

CRAW processing is used in the FASTA data sets (see Cluster Assembly and Processing

above)

The simple rule of thumb is that ESTs contain regions of very low quality, but the savvy

bioinformaticist can make up for this deficiency and even improve confidence overall by

studying the alignment of related sequences. In the real world of noisy data, no alignment

engine performs as well in every case as the human scientist's eye might be able to suggest;

for high-volume, high-throughput applications this problem can be compensated for by

further processing of cluster assemblies (see CONTIGPROC/CRAW below), while alignment

editors such as GDE can be invaluable for improvement "by hand" of individual cases.

Nonetheless, assembled EST clusters provide confidence and quality information - even in

the absence of original trace data - from which high accuracy, extended length consensus

sequences can be constructed for the vast majority of base positions. By restricting the final

output sequence to only those regions represented by at least two reads, as in the SANBI

SANIGENE dataset, high confidence consensi can be generated with a longer length (due to

the rejection of clusters with small numbers of ESTs) in the average case. The judicious

definition of linker segments to join regions of high-confidence, multiple read sequence data

in the final consensus sequence can actually improve results for BLAST searches, by allowing

the natural separation of High-scoring Segment Pairs (HSPs) in the search algorithm. This

result is over and above the improvement obtained by eliminating regions of low quality reads
and other production errors. Even when utilising an assembly engine prone to mis-alignment

and error when confronted with the problems commonly observed in EST data, cluster

assembly provides a net benefit in maximising consensus sequence quality: the cleanest

regions will be identified and highlighted, while the remainder are isolated in preparation for

appropriate strategies for cluster alignment processing.
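The restriction to regions represented by at least two reads, joined by linker segments, can be sketched as follows. This is a minimal illustration and not the SANIGENE implementation: the function name, the per-position `coverage` list (number of reads aligned at each consensus position) and the use of a run of `N` characters as the linker are all assumptions made for the example.

```python
from typing import List


def high_confidence_consensus(consensus: str,
                              coverage: List[int],
                              min_reads: int = 2,
                              linker: str = "N" * 10) -> str:
    """Keep only consensus regions covered by at least `min_reads` reads.

    The surviving regions are joined with `linker` (an assumed run of N
    characters) so that a search tool can treat them as separate segments.
    """
    if len(consensus) != len(coverage):
        raise ValueError("consensus and coverage must have the same length")

    segments = []   # completed high-confidence regions
    current = []    # bases of the region being built
    for base, depth in zip(consensus, coverage):
        if depth >= min_reads:
            current.append(base)
        elif current:
            # Coverage dropped below the threshold: close the open region.
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return linker.join(segments)


# Hypothetical 10-base consensus with its per-position read depth.
example = high_confidence_consensus(
    "ACGTACGTAC", [1, 2, 2, 2, 1, 1, 3, 3, 2, 1], linker="NNN")
print(example)  # CGTNNNGTA
```

Positions covered by a single read are dropped, so a cluster in which no position reaches the threshold yields an empty string and would be rejected.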



12.2.2 Adding using STACK_PACK

See above.

12.2.2.1 Accession number management under dynamic addition

ESTs are passed against existing singletons and consensi. As ESTs are added to an existing
clustered database, it is likely that new sequences will join previously separate clusters. The
STACK strategy is to decompose the existing clusters into their constituent ESTs, then
reprocess the complete, new clusters from the assembly step onwards. In this process the
previous cluster IDs are removed from the database and new IDs are created. A problem
develops, however, in that existing sequence analysis and visualisation tools rely on accession
numbers remaining unchanged as the sequence databases grow (the original targets,
experimentally determined sequences, would not be expected to change significantly over
time). The only method we have identified to cope with this problem is the annotation of
"secondary accession" fields for each cluster, listing the older cluster IDs each has subsumed.
This strategy is mirrored in the system used with the IS databank consortium through
accession versions. This is a new field in which the first number is the never-changing
accession number, followed by a period and a version number. The version number starts at
one and increases by one each time the sequence changes (i.e. when clusters are merged, in
which case the largest cluster keeps its primary accession and the smaller cluster's accession
is subsumed as a secondary accession). We remain able to extract clusters based on
constituent accessioned ESTs.
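The bookkeeping described above (a stable primary accession, a version number that starts at one, and secondary accessions recording subsumed clusters) can be sketched as follows. This is an illustrative sketch only, not STACK_PACK code: the class names, method names and the `CL0001`-style cluster IDs are hypothetical, and the re-assembly of the merged cluster is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Cluster:
    accession: str                                       # primary accession, never changes
    ests: Set[str] = field(default_factory=set)          # constituent EST accessions
    version: int = 1                                     # starts at one
    secondary: List[str] = field(default_factory=list)   # subsumed cluster IDs

    @property
    def versioned_id(self) -> str:
        return f"{self.accession}.{self.version}"


class ClusterIndex:
    def __init__(self) -> None:
        self.clusters: Dict[str, Cluster] = {}

    def add(self, cluster: Cluster) -> None:
        self.clusters[cluster.accession] = cluster

    def merge(self, acc_a: str, acc_b: str) -> Cluster:
        """Merge two clusters; the larger keeps its primary accession."""
        if acc_a == acc_b:
            raise ValueError("cannot merge a cluster with itself")
        a, b = self.clusters[acc_a], self.clusters[acc_b]
        keep, drop = (a, b) if len(a.ests) >= len(b.ests) else (b, a)
        keep.ests |= drop.ests
        # Record the subsumed ID, plus anything it had subsumed earlier.
        keep.secondary += [drop.accession] + drop.secondary
        keep.version += 1
        del self.clusters[drop.accession]
        return keep

    def resolve(self, accession: str) -> Optional[Cluster]:
        """Find a cluster by its primary or any secondary accession."""
        if accession in self.clusters:
            return self.clusters[accession]
        for cluster in self.clusters.values():
            if accession in cluster.secondary:
                return cluster
        return None

    def find_by_est(self, est: str) -> Optional[Cluster]:
        """Extract the cluster that contains a given accessioned EST."""
        for cluster in self.clusters.values():
            if est in cluster.ests:
                return cluster
        return None


index = ClusterIndex()
index.add(Cluster("CL0001", {"EST_A", "EST_B", "EST_C"}))
index.add(Cluster("CL0002", {"EST_D"}))
merged = index.merge("CL0001", "CL0002")
print(merged.versioned_id, merged.secondary)  # CL0001.2 ['CL0002']
```

A tool holding the old ID `CL0002` can still reach the merged cluster through `resolve`, which is the purpose of the secondary accession field.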

Virtual protein manufacture: Currently, EST_SCAN

(http://www.ch.embnet.org/software/ESTScan.html), FRAMEFINDER (Guy Slater, personal

communication) and other tools are being evaluated for efficacy.

12.3 References in order of mention.

Vasmatzis, G., M. Essand, U. Brinkmann, B. Lee, and I. Pastan. 1998. Discovery of three
genes specifically expressed in human prostate by expressed sequence tag database analysis.
Proc. Natl. Acad. Sci. USA 95(1):300-304.

Wolfsberg, T.G. and D. Landsman. 1997. A comparison of expressed sequence tags (ESTs) to
human genomic sequences. Nucleic Acids Research 25(8):1626-1632.

Hillier, L., N. Clark, T. Dubuque, K. Elliston, M. Hawkins, M. Holman, M. Hultman, T.

Kucaba, M. Le, G. Lennon, M. Marra, J. Parsons, L. Rifkin, T. Rohlfing, M. Soares, F. Tan,

E. Trevaskis, R. Waterston, A. Williamson, P. Wohldmann, and R. Wilson. 1996. Generation

and Analysis of 280,000 Human Expressed Sequence Tags. Genome Research 6:807-828.

Aaronson, J.S., B. Eckman, R.A. Blevins, J.A. Borkowski, J. Myerson, S. Imran, and K.O.

Elliston. 1996. Toward the Development of a Gene Index to the Human Genome: An

Assessment of the Nature of High-throughput EST Sequence Data. Genome Research 6:829-845.

Adams, M.D., J.M. Kelley, J.D. Gocayne, M. Dubnick, M.H. Polymeropoulos, H. Xiao, C.R.
Merril, A. Wu, B. Olde, R.F. Moreno, A.R. Kerlavage, W.R. McCombie, and J.C. Venter.
1991. Complementary DNA Sequencing: Expressed Sequence Tags and Human Genome
Project. Science 252:1651-1656.

Bonaldo, M.F., G. Lennon, and M.B. Soares. 1996. Normalization and Subtraction: Two
Approaches to Facilitate Gene Discovery. Genome Research 6:791-806.

Alwen, J. 1990. United Kingdom human genome mapping project: development, components,
coordination and management, and international links of the project. Genomics 6:386-388.

Gill RW, Hodgman TC, Littler CB, Oxer MD, Montgomery DS, Taylor S, Sanseau P.
Advanced Technology & Informatics Unit, Glaxo-Wellcome Medicines Research Centre,
Stevenage, Herts, UK. [email protected]

Torney, D.C., et al. 1990. Computation of d2: A Measure of Sequence Dissimilarity.
Computers and DNA, SFI Studies in the Sciences of Complexity, vol. VII, eds. G. Bell and T.
Marr, Addison-Wesley.

Hide, W., J. Burke, and D. Davison. 1994. Biological Evaluation of d^2, an Algorithm for
high-performance Sequence Comparison. J. Comp. Bio. 1:199-215.

Wu, T.J., J.P. Burke, and D.B. Davison. 1997. A Measure of DNA Sequence Dissimilarity

Based on Mahalanobis Distance Between Frequencies of Words. Biometrics 53:1431-1439.

Burkhardt, S., A. Crauser, P. Ferragina, H.-P. Lenhof, E. Rivals, and M. Vingron. 1999.
q-gram Based Database Searching Using a Suffix Array (QUASAR). Proceedings of the 3rd
International Conference on Computational Molecular Biology (RECOMB99), Lyon, France.

Strelets, V.B., A.A. Ptitsyn, L. Milanesi, and H.A. Lim. 1994. Data bank homology search
algorithm with linear computation complexity. Comp. Appl. Biosci. 10(3):319-322.

Sutton, G., O. White, M.D. Adams, and A.R. Kerlavage. 1995. TIGR Assembler: A New Tool
for Assembling Large Shotgun Sequencing Projects. Genome Science and Technology 1:9-18.

Burke, J.P., H. Wang, W. Hide, and D. Davison. 1998. Alternative Gene Form Discovery
and Candidate Gene Selection from Gene Indexing Projects. Genome Research 8:276-290.

Boguski, M.S. and G.D. Schuler. 1995. ESTablishing a Human Transcript Map. Nature

Genetics 10:369-371.

Benson, D.A., M.S. Boguski, D.J. Lipman, and J. Ostell. 1994. GenBank. Nucleic Acids

Research 22: 3441-3444.

Hide, W., J. Burke, A. Christoffels, and R. Miller. 1997. A novel approach towards a
comprehensive consensus representation of the expressed human genome. In Genome
Informatics 1997, pp. 187-196. Satoru Miyano and Toshihisa Takagi, Eds. Universal Academy
Press Inc., Tokyo, Japan. ISSN 0919-9454.

Houlgatte, R., R. Mariage-Samson, S. Duprat, A. Tessier, S. Bentolila, B. Lamy, and C. Auffray.
1995. The GenExpress Index: A Resource for Gene Discovery and the Genic Map of the
Human Genome. Genome Research 5:272-304.

Williamson et al. 1995. Merck Gene Index

Green 1996, PHRAP

Bouck et al 1999 Trends in Genetics 15 (4): 159-162



12.4 Appendix

STACK Schema.

UniGene WHALE tool: http://www.ncbi.nlm.nih.gov/Sicotte/whale.html

UniGene and d2_cluster comparisons: Fragmentation as a result of stricter clustering.

There is significant improvement in clustering with d2_cluster vs UniGene.

d2_cluster and UniGene produce results that are between 83% and 90% identical, and
d2_cluster results are between 8% and 20% less fragmented than UniGene. The above figure
demonstrates a cluster that has been fragmented by UniGene and is kept together by
d2_cluster.

d2_cluster measurably decreases the rate of fragmentation, producing around 20% fewer
singleton sequences (6463 to 5198) and reducing the overall number of clusters by 10%
(15225 to 13756). Generally the numbers of smaller clusters are reduced while larger clusters
appear with slightly higher frequency (see table: Burke, Davison, Hide, submitted, Genome
Research).

Table of Cluster Sizes

Cluster size          UniGene (Rat-build 19)    d2_cluster
Singleton clusters    6463                      5189
2                     3496                      3298
3-4                   3002                      2971
5-8                   1635                      1602
9-16                  491                       531
17-32                 111                       127
33-64                 21                        27
65-128                3                         8
129-256               3                         2
257-512               1                         0
Total Clusters        15225                     13756
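The percentages quoted for the comparison can be checked directly from the Table of Cluster Sizes. The short sketch below uses the counts exactly as printed in the table; the variable and function names are illustrative only.

```python
# (cluster size bin, UniGene Rat-build 19 count, d2_cluster count), as printed in the table.
CLUSTER_SIZES = [
    ("1", 6463, 5189),
    ("2", 3496, 3298),
    ("3-4", 3002, 2971),
    ("5-8", 1635, 1602),
    ("9-16", 491, 531),
    ("17-32", 111, 127),
    ("33-64", 21, 27),
    ("65-128", 3, 8),
    ("129-256", 3, 2),
    ("257-512", 1, 0),
]
# Totals as stated in the last row of the table.
UNIGENE_TOTAL, D2_TOTAL = 15225, 13756


def percent_reduction(before: int, after: int) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


singleton_drop = percent_reduction(CLUSTER_SIZES[0][1], CLUSTER_SIZES[0][2])
total_drop = percent_reduction(UNIGENE_TOTAL, D2_TOTAL)
# Size bins in which d2_cluster reports more clusters than UniGene.
grew = [size for size, unigene, d2 in CLUSTER_SIZES if d2 > unigene]

print(f"singletons: {singleton_drop:.1f}% fewer")  # singletons: 19.7% fewer
print(f"all clusters: {total_drop:.1f}% fewer")    # all clusters: 9.6% fewer
print("bins that grew:", grew)
```

The bins that grow are 9-16 through 65-128, which matches the statement that smaller clusters become less numerous while larger ones become slightly more frequent.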

EST links

STACK tools:

www.sanbi.ac.za/stack

www.sanbi.ac.za/stack_pack

* ICA Tools: http://sunny.ebi.ac.uk/ebi/EST/

* basic explanation of clustering:
http://sunny.ebi.ac.uk/~jparsons/work/est_assembly/est_assembly_intro.html

* clustering rat sequences (good overview): http://goliath.ifrc.mcw.edu/EST/contents.html

* the UniGene database, probably the most well-known EST database:
http://www.ncbi.nlm.nih.gov/UniGene/index.html

* starting information when creating STACK: http://www.ncbi.nlm.nih.gov/dbEST/index.html

* the TIGR THC database, another view of the consensus world:
http://www.tigr.org/tdb/hgi/hgi.html

* other EST links: http://industry.ebi.ac.uk/~muilu/EST/EST_links.html

* integrate EST information: http://www.ncbi.nlm.nih.gov/LocusLink/index.html

Websites referenced:

1 http://genome.wustl.edu/est/esthmpg.html

2 http://hercules.tigem.it/ASSEMBLY/capdoc.html

3 http://www.girinst.org/

4 ftp://ncbi.nlm.nih.gov/repository/vector/

5

6

7 http://sunny.ebi.ac.uk/EST/

8 http://www.ncbi.nlm.nih.gov/LocusLink/

9 ftp.sanbi.ac.za/stack/software/viz.tar
