EST Tutorial


  • 8/2/2019 EST Tutorial


    EST CLUSTERING TUTORIAL

    ISMB 1999

    Presenting Tutors:

    Win Hide and Alan Christoffels

    Authors:

    Win Hide, Rob Miller, Andrey Ptitsyn, Janet Kelso, Chellapa

    Gopallakrishnan and Alan Christoffels


    1 Introduction

    The tutorial is organized to provide an introductory framework for understanding the concepts and underlying reasoning for Expressed Sequence Tag (EST) clustering and consensus generation and, in addition, to enable readers and attendees to design and implement an EST consolidation system. It is based on what we have learned over the past three years about clustering technologies and the issues surrounding the field. The tutorial includes examples drawn from the authors' experience, which consists mostly of STACK, STACK_PACK and the algorithms used in these systems. The methods employed by these systems will be the focus of the demonstration and review session in the second half of the tutorial. Many of the tasks we discuss can also be performed by other systems, and we will attempt to compare system strategies as objectively as possible.

    Questions on the material or implementation of clustering systems can be directed to the authors:

    Win Hide : [email protected] www.sanbi.ac.za/stack

    Alan Christoffels: [email protected]


    1 Introduction
    2 The need for EST clustering
    3 General Aspects of EST data
    3.1 Generation of ESTs
    3.1.1 What is an EST?
    3.2 EST data quality
    4 Overview of clustering and consensus generation
    4.1 What is an EST cluster?
    4.2 Loose and stringent clustering
    4.3 Data apprehension and input format
    4.4 Pre-processing
    4.5 Initial clustering
    4.6 Assembly
    4.7 Alignment processing
    4.8 Cluster joining
    4.8.1 Clone joining
    4.8.2 Available parents
    4.9 Output
    5 Summary of Introduction
    6 Implementation Strategies
    6.1 EST pre-processing strategies
    6.1.1 Screening out repeats and vector sequence
    6.2 Masking strategies
    7 EST Clustering methods
    7.1 Clustering and Statistical cluster analysis
    7.2 Using common search engines
    7.2.1 Alignment scoring methods: BLAST and FASTA
    7.3 Purpose-built alignment based clustering methods
    7.4 Non-alignment based scoring methods: D2-cluster
    7.5 Pre-indexing methods
    8 Systems and Algorithms
    8.1 TIGR_ASSEMBLER
    8.2 UniGene
    8.3 STACK and STACK_PACK
    8.4 Other datasystems
    8.5 Strategies for keeping data 'current'
    9 Cluster Assembly and Processing
    9.1 Processing alignments
    10 Clone Linking
    11 A working clustering system
    11.1 Expression counts
    11.2 Consensus sequences
    11.3 Alternate expression-form characterisation
    11.4 SNP detection
    11.5 Identification of genes expressed in the cluster project
    11.6 Identification of genes specifically expressed in a chosen library or tissue
    12 Brief introduction to the STACK_PACK clustering system
    12.1 STACK processing
    12.2 Data bin strategy and masking
    12.2.1 Detecting and viewing alternate splicing
    12.2.2 Adding using STACK_PACK
    12.3 References in order of mention
    12.4 Appendix


    2 The need for EST clustering

    With easy access to the technology for generating expressed sequence tags (ESTs), several groups have sequenced from thousands to several hundred thousand ESTs. Currently the majority of the available coding sequence is in the form of ESTs, and the need to discover the full-length cDNA of each human gene is frustrated by the partial nature of this data delivery. There is significant value in attempting to consolidate gene sequences as they are produced, in lieu of a yet-to-be-completed reference sequence. ESTs offer a rapid and inexpensive route to gene discovery, reveal expression and regulation data (Vasmatis et al, 1998), highlight gene sequence diversity and splicing (Wolfberg and Landsman, 1997), and may identify more than half of known human genes (Hillier et al, 1996). The price of the high-volume, high-throughput nature of the data, however, is that ESTs have high error rates (Aaronson et al, 1996), do not have a defined protein product, are not curated in a highly annotated form and present only a raw substrate for sequence matching. Unfortunately, most EST data remain unprocessed and thus do not yield the high-value consensus sequence information they contain. The low-quality sequence data can be much improved, and achieving quality information requires pre-processing, clustering and post-processing of the results. One goal of these projects is the construction of gene indices, in which all transcripts are partitioned into index classes such that transcripts are put into the same index class if and only if they represent the same gene or gene isoform. Accurate gene indexing facilitates gene expression studies as well as inexpensive and early gene sequence discovery through the assembly of ESTs derived from genes that have yet to be positionally cloned or obtained directly through genomic sequencing.

    3 General Aspects of EST data

    3.1 Generation of ESTs

    3.1.1 What is an EST?

    Adams et al (1991) published a widely read article describing the use of ESTs. An EST, or Expressed Sequence Tag, is a small portion of an entire gene: a fragment of a cDNA clone that has been sequenced. The process by which ESTs are manufactured requires the construction of a cDNA library from mRNA. Baldo et al (1996) have provided a detailed description of how libraries are constructed and how normalization and library subtraction can be used to increase the relative representation of less abundantly transcribed mRNAs. The reverse transcriptase used to manufacture each cDNA in the library will eventually fall off the template (Figure 1), and this will terminate the production of the cDNA. Thus a series of length-differentiated, 3'-delimited cDNA fragments may be produced for each mRNA that is a viable template in the library. The length of the cDNA will vary, and this is an important factor in the development of coverage for each mRNA template of an available gene. Usually, several hundred to several thousand clones are isolated at random from a given cDNA library. Clones are sequenced a single time, from one or both ends of the DNA insert, using universal primers that are complementary to the vector at the multiple cloning site. The M13 forward primer may be located near the 5' or the 3' end of the cloned insert, depending on how the inserts were directionally cloned. Only 300-500 readable bases are produced from each sequencing read, yet a full gene transcript may be several thousand bases long. ESTs thus provide a "tag level" association with an expressed gene sequence, trading quality and total sequence length for the high quantity of genes that can be tagged in a given amount of time.


    Figure 1 Manufacture of an EST

    3.2 EST data quality

    Generation of EST data results in 'low quality' sequence information. A single read is generated for each EST, and as such it will contain errors from each step of its generation. These can include clone orientation, associated clone ID chimeras and missing 3' and 5' reads. Because the data are single-pass unedited sequences, they are also subject to errors caused by compressions and basecalling problems resulting in frameshifts. The Washington University website details common aspects of EST error (footnote 1). EST sequence has regions of high quality very close to regions of low quality, where quality can be defined as the number of correctly sequenced bases within a known window of reference. It is possible to utilise poor quality sequence as long as relevant strategies for maximising its utility are taken.
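
    The window-based definition of quality can be illustrated with a minimal sketch. It assumes the read has already been placed against a trusted reference without gaps, which is rarely true of real ESTs; the function name and the default window length are hypothetical and chosen only for illustration.

```python
def window_quality(read: str, reference: str, window: int = 20) -> list[int]:
    """Count correctly called bases in every sliding window.

    Assumes read and reference are pre-aligned, ungapped and of equal
    length (an illustrative simplification, not a real EST workflow).
    """
    if len(read) != len(reference):
        raise ValueError("read and reference must have equal length")
    # 1 where the base call agrees with the reference, 0 otherwise;
    # an ambiguous call (N) never counts as correct.
    correct = [int(r == t and r != "N")
               for r, t in zip(read.upper(), reference.upper())]
    if len(correct) < window:
        return []
    total = sum(correct[:window])
    scores = [total]
    for i in range(window, len(correct)):
        total += correct[i] - correct[i - window]  # slide by one base
        scores.append(total)
    return scores
```

    Windows whose count falls well below the window length mark the low quality stretches that sit next to high quality ones.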

    4 Overview of clustering and consensus generation

    EST clustering proceeds in stages that draw on 'clustering information' that is less and less definitive. Initially, sequence identity provides a good guide to cluster membership. Shared annotation provides joining information that can be of more variable quality. Thus the number of accurately clustered ESTs is heavily dependent on a strategy that can assign cluster membership based on verifiable criteria; sequence identity is currently the most useful of these.

    Clustering can be performed with or without sequence consensus generation. It is preferable, although more difficult, to manufacture a consensus sequence from each cluster. The clustering overview will briefly describe processes that result in consensus sequence generation.

    4.1 What is an EST cluster?

    A cluster is fragmented EST data (DNA or protein) and, if known, gene sequence data, consolidated, placed in correct context and indexed by gene such that all expressed data concerning a single gene are in a single index class, and each index class contains the information for only one gene (Burke, Davison and Hide, submitted, Genome Research).

    4.2 Loose and stringent clustering

    ESTs by their nature contain a degree of erroneous sequence data, complicated by short length and some mis-annotation. Stringent one-pass assembly methods tend to result in fewer, shorter consensus sequences. Looser clustering systems result in larger, more 'sloppy' clusters, with various expressed forms represented within each cluster. Each approach has its advantages and disadvantages. Stringent clustering provides greater initial fidelity, at a cost of lower coverage of expressed gene data and a lower inclusion rate of expressed gene forms. Loose clustering provides greater coverage and greater inclusion of alternate expressed forms, at a cost of lower fidelity data and the possible inclusion of paralogous expressed genes. We detail here a loose strategy, with comparison to the other strategies described below. Stringent clustering is performed by algorithms such as TIGR_ASSEMBLER, and looser clustering by the processes used to manufacture UniGene and STACK. The latter methods tend to be broken down into several sub-processes, which gives more flexibility but also makes them more complicated.

    4.3 Data apprehension and input format

    Data sources for clustering can be in-house, proprietary, public database or a hybrid of these. It is therefore important to establish an INPUT DATA FORMAT that is consistent between these sources. The format needs to include the accession number of each sequence (or a temporary accession, such as a sequence-run ID, if not yet submitted), information on location with respect to the poly A tail (3' or 5') and, importantly, the CLONE ID from which the EST was generated. SANBI uses FASTA format input files that contain the above information in the header.
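
    As an illustration of a consistent input format, the sketch below pulls the three required items out of a FASTA header. The exact SANBI header layout is not given here, so the field markers are assumptions: the accession is taken to be the first token, the read end is taken from a `5-PRIME` or `3-PRIME` token (as in the header of Figure 3a), and the `CLONE:` tag is a purely hypothetical marker for the clone ID.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EstHeader:
    accession: str
    end: Optional[str]       # "5" or "3"; None when the header does not say
    clone_id: Optional[str]  # None when no clone ID is present


def parse_header(line: str) -> EstHeader:
    """Extract accession, read end and clone ID from one FASTA header line."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    fields = line[1:].split()
    if not fields:
        raise ValueError("header has no accession")
    end_match = re.search(r"\b([35])-PRIME\b", line)
    # "CLONE:" is a hypothetical tag; it does not match "CLONE_LIB:".
    clone_match = re.search(r"\bCLONE:\s*(\S+)", line)
    return EstHeader(
        accession=fields[0],
        end=end_match.group(1) if end_match else None,
        clone_id=clone_match.group(1) if clone_match else None,
    )
```

    Returning `None` for a missing field, rather than guessing, keeps incomplete annotation visible to the later clone-joining step.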

    The steps suggested in EST clustering are as follows:

    4.4 Pre-processing

    Sequences are masked for repeats and vector, and formatted for the clustering engine. Sequence quality is often assessed at this step: a sequence is accepted only if a minimum number of residues lie above a known quality threshold. All masked sequence data above 50bp in length is accepted for clustering. SANBI's input acceptance approach has a low threshold stringency because the clustering method(s) can work successfully with more erroneous data. NCBI discards ESTs with a window of less than 100bp of 'clean' data.
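
    The two acceptance rules can be sketched as a small filter. This is an interpretation rather than either group's actual code: it assumes that "length" means the number of unmasked residues, that a 'clean' window is an uninterrupted run free of mask or ambiguity characters, and the function names are hypothetical.

```python
import re

# Runs of mask or ambiguity characters (N for DNA, X for protein-style masks).
MASKED_RUN = re.compile(r"[NnXx]+")


def unmasked_length(seq: str) -> int:
    """Number of residues left once masked characters are removed."""
    return len(MASKED_RUN.sub("", seq))


def longest_clean_window(seq: str) -> int:
    """Length of the longest run containing no masked characters."""
    return max((len(part) for part in MASKED_RUN.split(seq)), default=0)


def passes_filter(seq: str, min_unmasked: int = 50,
                  min_clean_window: int = 0) -> bool:
    """Low-stringency rule by default; set min_clean_window=100 for the
    stricter clean-window style of rule."""
    return (unmasked_length(seq) > min_unmasked
            and longest_clean_window(seq) >= min_clean_window)
```

    With the defaults a read is kept if more than 50 unmasked bases remain; passing `min_clean_window=100` mimics the stricter rule.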

    4.5 Initial clustering

    An initial clustering is performed based on a measure of high sequence identity.

    4.6 Assembly

    Assembly is either part of the initial clustering (as in TIGR_ASSEMBLER) or a separate step, in which clustering is followed by assembly with a specialist assembly package such as PHRAP (with assessment of residue quality turned off) or CAP2 / 3 (footnote 2).

    4.7 Alignment processing

    Aligned clusters, particularly those generated by a loose clustering engine, need to be processed for errors and alternate forms of expressed sequences. Consensus generation may be a result of this step (as in STACK), or a consensus can be accepted directly from the assembly step. Consensi are chosen based on maximal length. See section 9.1.

    4.8 Cluster joining

    Once clustered, clusters and/or cluster consensi can be further joined by available annotative approaches.

    4.8.1 Clone joining

    The most powerful cluster joining method is clone-joining, which utilises the physically shared clone ID between 3' and 5' EST fragments sequenced from the same starting clone. A different approach would be to link clone-related sequences in the pre-processing phase, but this may increase errors and processing time requirements at the clustering and assembly stages. With either approach, linking by clone annotation is an error-prone step in the EST consolidation process, as it relies entirely on the accuracy of the sequence annotation and, if data from disparate sources are used, on the uniqueness of clone IDs.
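
    Clone joining can be sketched with a union-find structure: any two clusters containing ESTs from the same clone are merged. This is an illustrative sketch, not the STACK implementation; the input mappings and the function name are hypothetical, and clone IDs are assumed to be unique across data sources.

```python
from collections import defaultdict


def join_by_clone(cluster_of: dict[str, str],
                  clone_of: dict[str, str]) -> list[set[str]]:
    """Group cluster IDs that are linked by a shared clone ID.

    cluster_of maps EST accession -> cluster ID.
    clone_of   maps EST accession -> clone ID (ESTs lacking one are absent).
    """
    parent = {cluster: cluster for cluster in cluster_of.values()}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    seen_clone: dict[str, str] = {}  # clone ID -> a cluster already holding it
    for est, clone in clone_of.items():
        if est not in cluster_of:
            continue  # EST was never clustered, so nothing to link
        root = find(cluster_of[est])
        if clone in seen_clone:
            parent[root] = find(seen_clone[clone])  # merge the two clusters
        else:
            seen_clone[clone] = root

    groups: dict[str, set[str]] = defaultdict(set)
    for cluster in parent:
        groups[find(cluster)].add(cluster)
    return sorted(groups.values(), key=sorted)
```

    Because a single mis-annotated clone ID merges two unrelated clusters, the returned groups are best treated as candidate links to be reviewed.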

    4.8.2 Available parents

    If a parent mRNA sequence is available (non-EST) it can be used to physically link EST

    cluster(s) via sequence comparison.


    4.9 Output

    Figure 2 Cartoon of clustering steps

    Defining an output format for the clustering process is problematic. The information required often includes alignments (alternate splices, polymorphisms and error assessment), raw cluster membership and contextual links. Nonetheless, results must be easily incorporated into existing software packages, which in general have not been designed to support the complexity or evolving nature of clustered EST data. For STACK, individual cluster alignments (including consensi and sub-consensi) are presented in an alignment format (GDE), while consensus sequences and clone-linked sets are stored in FASTA format for use as BLAST-searchable databases. A distributable FASTA file is in development, as is a 'reduced' format for easy distribution. Development of data formats and systems for accessing information about EST clusters is an ongoing project at SANBI.

    5 Summary of Introduction

    In this section we have addressed the value of clustering ESTs for the elucidation of complete sequences not yet available using traditional methods. The nature and quality of EST data have been assessed with reference to the limitations that these place on sequence processing. We have given an overview of EST clustering procedures, indicating the flow of EST sequence information through the stages of pre-processing, initial clustering, assembly, alignment processing, cluster joining and output. The following section will examine each of these stages in more detail and address some of the issues encountered.

    6 Implementation Strategies

    6.1 EST pre-processing strategies

    6.1.1 Screening out repeats and vector sequence

    Clustering requires that a certain degree of identity be preserved between each member of a defined cluster. The most common problem in EST clustering is contamination with a sequence common to several members of the input EST data set but not unique to a specific group. Common classes of contaminating sequence are:


    6.1.1.1 Repeats

    Any uncharacterised expressed sequence set is likely to contain repeat sequences. For instance, in humans the most common element is the ALU repeat element. In newly sequenced genomes, such as plant and other eukaryote systems, repeat sequences represent a common and frustrating clustering problem. Repeat databases provide a resource against which repeats can be detected. The repeat databases are dependent on continuing curation and detection of novel repeats in genomes and thus provide a valuable resource.

    6.1.1.2 Cloning vector fragments

    Vector sequences can skew clustering even if only a small vector fragment remains in each read.

    6.1.1.3 Low complexity sequences

    Low complexity sequences (poly A tracts, AT repeats etc) also have the potential to provide an artifactual basis for cluster membership. The problem is more significant for strategies that employ alignable similarity in the first pass cluster assignment. Word based cluster assignment can be modified to give low weight to low complexity words. In recent EST clustering applications being developed at SANBI (d2-cluster, ASSA) and elsewhere, there is no need to mask low-complexity DNA, as such regions tend to have a highly redundant oligonucleotide composition. Because these sequence comparison algorithms scale oligonucleotides according to their potential information content, highly redundant oligos are given very low weight and low-complexity regions are therefore excluded from consideration.
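
    The idea of giving low weight to low complexity words can be illustrated with a small sketch. The weighting used by d2-cluster or ASSA is not specified here, so this example swaps in a simple stand-in: each word is weighted by the Shannon entropy of its base composition, scaled to the range 0 to 1. All names and the default word length are hypothetical.

```python
import math
from collections import Counter


def word_weight(word: str) -> float:
    """Entropy of the word's base composition, scaled so that 1.0 means
    all four bases are equally used and 0.0 means a homopolymer.
    This is an illustrative stand-in, not the d2-cluster weighting."""
    if not word:
        return 0.0
    n = len(word)
    counts = Counter(word.upper())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / 2.0  # 2 bits is the maximum for a 4-letter alphabet


def kmers(seq: str, k: int) -> set[str]:
    """All distinct words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def weighted_shared_words(a: str, b: str, k: int = 6) -> float:
    """Sum of weights of the words shared by two sequences."""
    shared = kmers(a.upper(), k) & kmers(b.upper(), k)
    return sum(word_weight(w) for w in shared)
```

    Under this scheme two reads that share only a poly A tract score zero, so the shared tract cannot by itself place them in the same cluster.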

    6.2 Masking strategies

    The most effective method to remove contaminants is to compare each read against a reference database of repeats (RepBase, footnote 3) and vector sequences (VecBase, footnote 4) using an algorithm that is reasonably fast and accurate. XBLAST (NCBI tools) and Cross-Match, an implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green (footnote 5), have been used successfully, with Cross-Match demonstrating greater flexibility and sensitivity than XBLAST. Repeat-Masker (footnote 6) contains Cross-Match as a tool for masking. DUST is used at NCBI for this procedure.

    In cases where a direct identity is found with a repeat or vector subsequence, a 'mask residue' can be substituted into the read. The resulting runs of NNNNs (DNA) or XXXXXs (protein) will be ignored by most clustering engines (Figure 3a and 3b).
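
    Substituting mask residues is mechanically simple once the matching regions are known. The sketch below assumes the comparison tool has already reported hits as 0-based, half-open (start, end) coordinates on the read; that coordinate convention and the function name are assumptions made for illustration.

```python
def mask_intervals(seq: str, hits: list[tuple[int, int]],
                   mask_char: str = "N") -> str:
    """Replace every residue covered by a hit with the mask character.

    hits are 0-based, half-open (start, end) intervals; overlapping or
    out-of-range intervals are tolerated.
    """
    masked = list(seq)
    for start, end in hits:
        start = max(start, 0)
        end = min(end, len(seq))
        for i in range(start, end):
            masked[i] = mask_char
    return "".join(masked)
```

    For example, `mask_intervals("ACGTACGTAC", [(0, 3), (8, 10)])` returns `"NNNTACGTNN"`; pass `mask_char="X"` for protein sequence.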

    A problem arises when an EST library is presented that is from a novel organism for which

    the repeats have not been characterised. In this instance it may be necessary to employ 'blind'

    repeat masking if an algorithm is available. Repeat masking is necessary if the repeats are

    large enough to represent a source for artifactual contamination.

    An exploitable feature of sequence contamination in loose clustering is that, if the tools work as intended, then the contaminated sequences and all related sequences will be clustered together. There is no automatic method to identify a contaminated cluster, but once it is identified only that cluster needs to be decomposed into its original sequences and re-processed (no other cluster will be affected by the sequences in the contaminated cluster). This may not be the case for stringent clustering systems, where a more cautious repair strategy is required.

    Figure 3a EST sequence in FASTA format.

    >T27784 g609882 | T27784 CLONE_LIB: Human Endothelial cells. LEN: 337 b.p. FILE

    gbest3.seq 5-PRIME DEFN: EST16067 Homo sapiens cDNA 5' end

    AAGACCCCCGTCTCTTTAAAAATATATATATTTTAAATATACTTAAATATATATTT

    CTAATATCTTTAAATATATATATATATTTNAAAGACCAATTTATGGGAGANTTGC

    ACACAGATGTGAAATGAATGTAATCTAATAGANGCCTAATCAGCCCACCATGTTC

    TCCACTGAAAAATCCTCTTTCTTTGGGGTTTTTCTTTCTTTCTTTTTTGATTTTGCAC

    TGGACGGTGACGTCAGCCATGTACAGGATCCACAGGGGTGGTGTCAAATGCTATT

    GAAATTNTGTTGAATTGTATACTTTTTCACTTTTTGATAATTAACCATGTAAAAAATG


    Figure 3b EST sequence read after masking with XBLAST or Cross-Match against VecBase + RepBase

    >T27784 g609882 | T27784 CLONE_LIB: Human Endothelial cells. LEN: 337 b.p. FILE

    gbest3.seq 5-PRIME DEFN: EST16067 Homo sapiens cDNA 5' end

    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    xxxxxTATTTNAAAGACCAATTTATGGGAGANTTGCACACAGATGTGAAATGAATG

    TAATCTAATAGANGCCTAATCAGCCCACCATGTTCTCCACTGAAAAATCCTCTTTC

    TTTGGGGTTTTTCTTTCTTTCTTTTTTGATTTTGCACTGGACGGTGACGTCAGCCAT

    GTACAGGATCCACAGxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxAACCATGTAAAAAATG

    7 EST Clustering methods

    7.1 Clustering and Statistical cluster analysis

    EST clustering is not exactly classical cluster analysis as understood by statisticians. In cluster analysis, practically all the statistical effort goes into classifying objects with complicated relationships and justifying that classification. This may be why we use the word "clustering" instead of cluster analysis. In EST clustering the fundamental problem is the selection of a metric, and that metric reduces to a single binary digit: sequences either match or they don't. To cluster EST data we look for near-identical matches. Essentially, the fragments we put together belong to the same DNA, and any mismatches mostly result from misreads and represent noise. The inter-cluster distance, like the inter-object distance, is also reduced to binary: in an ideal situation, clusters of ESTs related to one gene should be infinitely close to each other, those related to different genes infinitely distant, and no intermediate state should make sense. With the distance measure simplified in this way, estimating clustering quality also reduces to a trivial task: the closer we get to detecting all exactly matching fragments, the better. Thus, from the statistical point of view, clustering ESTs is a trivialised version of cluster analysis. That does not make the job any easier.

    7.2 Using common search engines

    Small projects, clustering a few dozen, a few hundred or even a few thousand ESTs, are indeed 'trivial' and can be approached with standard tools for contig assembly or even multiple alignment. The real challenge is the amount of data now available in public and private databases and waiting to be clustered. With millions of ESTs to consider, obtaining the trivial binary distance between fragments is far from a trivial job, even for available supercomputers. Modern sequence comparison tools (the well-known Smith-Waterman, FASTA and BLAST) were mostly built for a different purpose: searching. They are all variations of an alignment algorithm, i.e. they find the positioning of sequence elements (nucleotides or groups of nucleotides) against each other that maximizes some score. The purpose of this process is to detect and quantitatively measure the similarity (distance) between any two sequences compared. Smith-Waterman is the most exhaustive and computationally expensive tool, giving the best sensitivity and detecting weak similarities. FASTA and BLAST trade some sensitivity for speed. As mentioned previously, the distance measure in EST clustering is reduced to binary, so it is only necessary to detect a perfect or near-perfect match. Gap extension penalties and gapping manipulation become less important in an initial assessment of pairwise identity. It is therefore important to 'head for speed' over sensitive comparison. Applying a banded Smith-Waterman to already compared clusters is a tenable approach for subsequent consensus generation.


    7.2.1 Alignment scoring methods: BLAST and FASTA

    BLAST is an algorithm efficiently implemented on many platforms. Although not developed specifically for clustering, BLAST sequence comparison is widely used as a first step in EST clustering because it is readily available and flexible enough to be tuned for the task by changing the default parameters (wrappers for iterative BLAST comparison exist and are accessible via the web; see appendix). Using the standard BLAST II application, available from NCBI anonymous FTP, it is possible to set up a stringent set of match parameters as follows:

    -e expectation value set to 0

    -G cost to open a gap can be increased

    -E cost to extend a gap can be increased

    -q mismatch penalty increased

    -r match reward increased

    -a number of processors to use can be adjusted

    -W word size (found on NCBI web) set up for longer words (default 11)
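
    The parameter list above can be turned into a command line along the following lines. This sketch only builds the argument list and does not run anything. It assumes the legacy `blastall` executable with `-p blastn`, `-i` (query file) and `-d` (database) in addition to the flags listed; every numeric value is a placeholder chosen for illustration and is not a recommendation from this tutorial. In particular, a very small expectation value stands in for "set to 0".

```python
def stringent_blastn_args(query: str, database: str, *,
                          evalue: float = 1e-20,
                          gap_open: int = 8,
                          gap_extend: int = 4,
                          mismatch: int = -5,
                          match: int = 2,
                          cpus: int = 1,
                          word_size: int = 20) -> list[str]:
    """Build an argument list for a stringent nucleotide comparison.

    All default values are illustrative assumptions, to be tuned against
    data for which the correct clustering is known.
    """
    return [
        "blastall", "-p", "blastn",
        "-i", query,
        "-d", database,
        "-e", str(evalue),      # expectation value threshold
        "-G", str(gap_open),    # cost to open a gap
        "-E", str(gap_extend),  # cost to extend a gap
        "-q", str(mismatch),    # mismatch penalty (negative)
        "-r", str(match),       # match reward
        "-a", str(cpus),        # number of processors
        "-W", str(word_size),   # word size (default 11)
    ]
```

    The returned list can be handed to `subprocess.run` on a machine where the program is installed; keeping the values as keyword arguments makes it easy to compare several stringency settings on the same input.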

    FASTA is less widely used, but can also be applied for the same purpose. Generally, it allows the same type of parameter variation as BLAST: increasing the k-tup parameter increases speed, and raising the score threshold picks up only the strongest similarities.

    7.3 Purpose-built alignment based clustering methods

    The field has recently seen the emergence of many new algorithms in development, but dedicated production algorithms are still few. We will not attempt to review them all at this time, as new material is appearing rapidly.

    ICAtools, last distributed in 1997 by Jeremy Parsons, represents an alignment-based systematic approach to clustering. This group of algorithms was one of the first to become available; it has been used at major sequencing centres and is useful for data reduction. The system is, however, not as complete as others that have been developed subsequently. ICAtools was developed as part of the UK human genome mapping project (Alwen, 1990). According to Jeremy Parsons, the ICAtools are a set of programs that could be of use to anyone doing medium-to-large scale DNA sequencing projects. The system has several tools, but was originally designed for database redundancy and adapted first for genomic fragments and finally for EST fragments.

    ICAass uses a BLASTN type of algorithm to perform 'database pruning', assessing whether one sequence is a sub-set of another. N2tool constitutes a dedicated clustering tool that relies on indexing. It uses an indexed file format and local alignment to compare all the submitted sequences with each other to find those which share any region of similarity. ICAtool indexes DNA sequences into clusters which share local sequence similarities. ICAass takes a size-sorted (longest first) file of sequences and searches for those sequences which are approximately repeated within the length of another. ICAmatches attempts to explain why sequences have been clustered together by using novel sequence alignment. N2tool has been utilised in the Washington University Merck EST manufacture, for identification of artifactual sequences. A new generation of these tools is being developed and employed at EBI (footnote 7), and employs CORBA and Java for database interrogation (JESAM: Parsons and Rodriguez Tome, personal communication).

    EST-BUILDS

    Another common-sense system recently published by Gill et al (1998) is a 'dynamic build'

    system, where a seed EST is used to match with others via a build-blast strategy. This method

    works well for single seed clusters, but is not implemented for large scale database building.

    7.4 Non-alignment based scoring methods: D2-cluster

    D2-cluster is a word multiplicity comparison method built on an agglomerative algorithm that was specifically developed for rapidly and accurately partitioning transcript databases into index classes, clustering ESTs and full-length sequences according to minimal linkage or transitive closure rules. "Agglomerative" means that every sequence begins in its own cluster and the final clustering is constructed through a series of mergers; these mergers may be described in terms of minimal linkage, sometimes called single linkage or transitive closure. The term transitive closure refers to the property that any two sequences with a given level of similarity will be in the same cluster; hence A and B are in the same cluster even if they share no similarity, provided there exists a sequence C with enough similarity to both A and B. The criterion for joining clusters is the detection of two sequences that share a window of (Window_Size) bases that is (Stringency) percent or more identical. The only criterion for clustering is sequence overlap; source or annotation information is not used. To detect the overlap criterion we use the d2 algorithm and set parameters and threshold values as described in (Torney et al, 1990; Hide et al, 1994; Wu et al, 1997). The initial and final state of the algorithm is a partition of the input sequences in which each sequence is in a cluster and no sequence appears in more than one cluster.
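    The transitive closure rule described above can be illustrated with a small union-find sketch. This is not the d2_cluster implementation: the function name single_linkage_clusters and its inputs are hypothetical, and the list of linked pairs is assumed to have been produced already by whatever overlap test is in use.

```python
from collections import defaultdict


def single_linkage_clusters(sequence_ids, linked_pairs):
    """Partition sequence_ids by transitive closure over linked_pairs.

    linked_pairs is assumed to hold (id, id) tuples for sequences that
    already passed the similarity criterion (hypothetical input).
    """
    # Agglomerative start: every sequence begins in its own cluster.
    parent = {seq_id: seq_id for seq_id in sequence_ids}

    def find(seq_id):
        # Walk to the cluster representative, compressing the path as we go.
        while parent[seq_id] != seq_id:
            parent[seq_id] = parent[parent[seq_id]]
            seq_id = parent[seq_id]
        return seq_id

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for seq_id in sequence_ids:
        clusters[find(seq_id)].add(seq_id)
    return sorted(sorted(members) for members in clusters.values())


# A and B share no direct link, but both are linked to C.
example_clusters = single_linkage_clusters(
    ["A", "B", "C", "D"],
    [("A", "C"), ("C", "B")],
)
print(example_clusters)
```

    Every sequence ends up in exactly one cluster, which matches the partition property stated above; A and B are joined only through C.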

    D2-cluster uses word matching within a window, together with a measure of the multiplicity (if any) of each word within that window. The principal concept is that it does not attempt an alignment, not even in a reduced form. The results are derived directly from comparing the word composition (word identity and multiplicity) of two sequence windows. Thus, the algorithm can be significantly faster than BLAST. Speed comes at a price: to collect significant statistics, the fragments must be long enough (about 100 bp), and only very high similarities can be detected (above 90% identity within a window).
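    As a rough illustration of comparing word composition rather than aligning, the sketch below scores two windows by the sum of squared differences in word counts. That scoring form is our assumption of the commonly cited d2 measure, not a transcription of d2_cluster; the function names, the brute-force window scan and the default word and window sizes are illustrative only.

```python
from collections import Counter


def word_counts(window, word_size):
    """Count every overlapping word of length word_size in a window."""
    return Counter(
        window[i:i + word_size] for i in range(len(window) - word_size + 1)
    )


def d2_window_distance(window_a, window_b, word_size=6):
    """Sum of squared differences in word multiplicity between two windows."""
    counts_a = word_counts(window_a, word_size)
    counts_b = word_counts(window_b, word_size)
    words = set(counts_a) | set(counts_b)
    # Counter returns 0 for words that are absent from one window.
    return sum((counts_a[w] - counts_b[w]) ** 2 for w in words)


def min_d2(seq_a, seq_b, window_size=100, word_size=6):
    """Smallest window-to-window distance over all window pairs (brute force)."""
    def windows(seq):
        if len(seq) <= window_size:
            return [seq]
        return [seq[i:i + window_size]
                for i in range(len(seq) - window_size + 1)]

    return min(
        d2_window_distance(wa, wb, word_size)
        for wa in windows(seq_a)
        for wb in windows(seq_b)
    )
```

    A distance of zero means the two windows have identical word composition; a production implementation would update counts incrementally as the window slides rather than recounting each time.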

    D2-cluster is used to produce the initial loose clusters in the STACK clustering system. We have determined that the results of d2_cluster alone are between 8% and 20% less fragmented than UniGene (Burke, Davison, Hide, submitted, Genome Research), and that the STACK datasystem produces clean clusters that are 16% less fragmented than UniGene (Miller et al, submitted, Genome Research 1999).

    7.5 Pre-indexing methods

    The size of the datasets has effectively precluded their use on workstation architectures. Indexing is one approach that allows for less computationally intense operations. Indexing of sequences allows one to store ready-made tables of words and their positions in a sequence, so the comparison algorithm has more than half of its work done already when it starts. Publicly available tools are expected in the near future; for instance, 'QUASAR' ('Q-gram Alignment based on Suffix Arrays') was announced at RECOMB99. This algorithm is designed to quickly detect sequences with high similarity to the query in a context where many searches are conducted on one database. The database is presented in a pre-processed (indexed) form and similarity detection is based on the exact matching of short substrings (q-grams) (Burkhardt et al, 1999).
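    The idea of a pre-computed word table can be sketched as follows. This is a simplified illustration only: it uses a plain dictionary rather than the suffix arrays that QUASAR is named for, and the function names and toy sequences are hypothetical.

```python
from collections import defaultdict


def build_qgram_index(database, q):
    """Map each q-gram to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for offset in range(len(seq) - q + 1):
            index[seq[offset:offset + q]].append((seq_id, offset))
    return index


def count_shared_qgrams(query, index, q):
    """Count exact q-gram hits per database sequence for one query."""
    hits = defaultdict(int)
    for offset in range(len(query) - q + 1):
        # .get avoids inserting empty entries into the index during lookup.
        for seq_id, _ in index.get(query[offset:offset + q], ()):
            hits[seq_id] += 1
    return dict(hits)


# Toy database: the index is built once and reused for every query.
database = {"est1": "ACGTACGTAA", "est2": "TTTTGGGGCC"}
index = build_qgram_index(database, q=4)
shared = count_shared_qgrams("CGTACG", index, q=4)
print(shared)
```

    Sequences with many hits become candidates for a full comparison, so the expensive step is only run on a small fraction of the database.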

    8 Systems and Algorithms

    8.1 TIGR_ASSEMBLER

    Developed for the assembly of large shotgun projects, TIGR assembler has also been employed to manufacture consensus sequence assemblies for TIGR-THC from ESTs. Although originally designed for large shotgun projects, it is suited to EST assembly. It employs a standard rapid oligonucleotide content comparison to reduce search time. Pairwise comparisons generate a list of potential (end-)overlaps. Non-repeat fragments seed subsequent assembly of the listed overlaps.

    The TIGR Human Gene Index (HGI, http://www.tigr.org) employs the strict assembly method of TIGR_ASSEMBLER (Sutton et al, 1995), grouping highly related sequences and consequently producing very accurate consensus sequences with a minimum of chimerism or other contamination. This method discards under-represented and divergent or noisy sequences in favour of confidence based on transcript redundancy, but in doing so it can eliminate related sequences from clusters which might provide examples of alternative splicing or other valuable forms of sequence diversity.

    For TIGR THC, assemblies and ESTs are clustered according to a two stage process by the THC_BUILD script (G. Sutton):

    1) BLAST and FASTA are used to identify all sequence overlaps,
    2) All overlaps are stored in a relational database, and
    3) Transitive closure groups are formed and subjected to assembly using TIGR assembler (Sutton et al, 1995) such that joining occurs when sequences share at least 95% identity over at least 40 bases (http://www.tigr.org/hgi/hgi_info.html).

    The presence of sequence repeats affects both the order and stringency of joining. The resulting assemblies are referred to as tentative human consensus sequences (THCs). The assembler also imposes special matching constraints on the ends of sequences and a minimum sequence identity within an index group. The strictness of the matching criterion has the advantage of often preventing chimerism and contamination from tainting index groups, but it results in a more fragmented representation of the data that is less able to incorporate error-prone sequence. This strictness often prevents sufficiently divergent sequences from being combined, so that ESTs that sample alternative splice forms of the same gene are kept in different assemblies; they are, however, linked as splice variants in those cases where the ESTs match sequenced genes with known isoforms in EGAD. Most sequence assembly programs share similar properties with TIGR assembler. The PHRAP package incorporates sequence quality data derived from sequence traces into the assembly process (P. Green, personal communication), allowing for the incorporation of higher-error data, but the edge matching and maximum mismatch criteria must still be satisfied. In general, the results of assembly programs are not invariant with respect to presentation order (Burke, Davison, Hide, submitted, Genome Research).

    Table 1. Comparison of TIGR_THC with SANBI STACK_PACK methodology.

    Methodology        Input Sequences    Singleton Groups    % Singleton Groups
    TIGR Gene Index    626 163            135 140             21.83
    STACK_PACK         415 833            58 070              13.96

    STACK_PACK analysis of UniGene clusters resulted in a fragmentation rate just over half of that of the TIGR index.

    8.2 UniGene

    The most widely known effort is UniGene (Boguski et al, 1995) from NCBI. UniGene will be replaced by the RefSeq and LocusLink (footnote 8) projects as the availability of genomic sequence increases.

    The UniGene project originally took the fingerprinting characteristic of 3' UTRs (and hence 3' ESTs) as a paradigm and indexed genes by clustering 3' end EST sequences with mRNA data extracted from GenBank (Benson et al, 1994). 5' end ESTs were added to the clusters using clone information. To enhance the speed of the clustering phase of the project, a two-phase searching process was used for overlap detection. First, two sequences were marked for further comparison if they shared two common words of length 13 separated by no more than 2 bases. The sequences selected for further comparison were then compared with a constrained Smith-Waterman local alignment algorithm. The membership of clusters in recent versions of UniGene, however, suggests that additional criteria are being used to determine cluster membership (Burke, Davison and Hide, submitted, Genome Research).
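    The first, word-based phase of such a search can be sketched as a cheap pre-filter. The sketch rests on one reading of the criterion, which is our assumption: two shared 13-base words lying on the same diagonal, with the second word starting no more than 2 bases after the first one ends. The function names are hypothetical and this is not the UniGene code.

```python
def shared_word_hits(seq_a, seq_b, word_size=13):
    """Return the set of (offset in seq_a, offset in seq_b) for identical words."""
    positions_b = {}
    for j in range(len(seq_b) - word_size + 1):
        positions_b.setdefault(seq_b[j:j + word_size], []).append(j)
    hits = set()
    for i in range(len(seq_a) - word_size + 1):
        for j in positions_b.get(seq_a[i:i + word_size], ()):
            hits.add((i, j))
    return hits


def passes_word_prefilter(seq_a, seq_b, word_size=13, max_gap=2):
    """True if two shared words sit on one diagonal, at most max_gap bases apart.

    Assumption: "separated by no more than 2 bases" is read as the gap
    between the end of the first word and the start of the second.
    """
    hits = shared_word_hits(seq_a, seq_b, word_size)
    for i, j in hits:
        for gap in range(max_gap + 1):
            step = word_size + gap
            if (i + step, j + step) in hits:
                return True
    return False
```

    Only pairs that pass this test would go on to the slower constrained local alignment in the second phase.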

    UniGene clustering proceeds in several stages, with each stage adding less reliable data to the results of the preceding stage. 'Reliable' refers to the ordering pairwise identity > annotation > shared clone, etc.

    Contaminant screening of vector sequences, repetitive elements, and mitochondrial and ribosomal sequences is performed using NCBI's DUST. After screening, a sequence must contain at least 100 informative bp (its informative length) to be a candidate for entry into UniGene. This initial criterion differs from the 'all in' strategy employed at SANBI, where alignment is not required for clustering. Sources for UniGene include GenBank Genomic, dbEST and GenBank mRNAs. GenBank Genomic sequences are electronically spliced exons (vGenes).

    Initial clustering is performed by comparing the set of gene sequences (mRNA or genomic sequences, many of which are complete CDSs) with itself. Sequence pairs which are sufficiently similar are grouped together to form initial clusters. EST to gene links and EST to EST links are added to these clusters. The set of ESTs is compared with the set of genes using WHALE (http://nucleus.cshl.org/meetings/98genome_absstat.htm), and sufficiently similar sequence pairs are added to the clusters (MEGABLAST, unpublished; figure 4). Any new links which would join two distinct clusters from the preceding stage (that is, join two sets of genes not linked to form one cluster without the addition of ESTs) are discarded.

    Figure 4. From Wagner et al. Abstracts 1999 Genome Sequencing and Biology. P340. Cold Spring Harbor Laboratory.

    Any resulting cluster which does not contain a sequence with a polyadenylation signal or two labelled 3' ESTs is discarded. Clusters which pass these criteria are called anchored clusters, since their 3' end is presumed to be known.

    The drawbacks of this method mirror the benefits of the strict matching technique: chimerism and other artefacts can join unrelated clusters, and furthermore the diversity within a grouping can make the generation of a consensus sequence complex (Schuler et al, 1996), often resulting in accepting only the longest representative of an index class as its consensus.

    8.3 STACK and STACK_PACK

    As described in Hide et al (1998) and in Miller et al (Genome Research, submitted), STACK_PACK makes use of dbEST and clusters ESTs within tissue source categories. STACK_PACK extends the application of the loose clustering approach to a global level, defining index classes by the total number of (possibly disconnected) matching 6-base words rather than by alignment to previously identified class members. The approach is robust with respect to EST quality data and increases the capacity to link splice-related ESTs without concern for non-overlapping regions. The related but loose clusters are subsequently processed by strict assembly and analysis tools to identify, characterise and isolate any sequence divergence. The clustering algorithm (d2_cluster) is discrete from the assembly tool (PHRAP) and identifies ESTs that are greater than 96% identical over a window of 150 bases. PHRAP subsequently aligns them to provide an assembly for subsequent alignment consensus processing. STACK uses all available raw sequence data during cluster generation and uses CRAW (Burke et al, 1998) to generate consensi and to annotate polymorphic regions and alternative splicing forms in the clusters. Contigproc then chooses the longest consensus of highest quality for output. Clusters are viewed with VIZ to allow initial analytical assessment.

    8.4 Other datasystems

    The Genexpress Index (Houlgatte et al, 1995) and The Merck Gene Index (Williamson et al, 1995) group sequences into clusters based on sequence overlap above a given alignment threshold, as UniGene does. The index was originally manufactured using Fasta comparisons with 3' sequence only. No searchable index can be found on the Web for Merck Gene Index.

    8.5 Strategies for keeping data 'current'

    The high throughput nature of EST production ensures that the available databases grow almost daily, and the value of a consolidated EST database is a direct function of how current the represented dataset is. UniGene's use of full-length gene sequences to seed the initial clusters, and its rejection of linking ESTs, ensures that new EST data will only add to existing clusters, while the stringent clustering approach of the TIGR Human Gene Index similarly limits the cluster joining effect new sequences may have. Purely loose clustering strategies such as that employed by STACK are more vulnerable to cluster joining as database revisions are released. In either case, production time requirements for the consolidated database have resulted in the need for a dynamic "ADD" facility whereby new sequences can be added without regenerating all clusters. Of note in this regard is that ADDs will perpetuate any previous processing errors, while a full production run increases confidence in the final result.

    Every STACK (tissue-based) generation requires the clustering of the entire GenBank EST dataset. dbEST is increasing enormously with each GenBank release, and this has impacted database generation because the input data to the clustering step has exceeded 100 000 sequences per tissue subset. Crossmatch finds overlapping regions at the ends of approximately 30% of D2-clusters. These newly formed clusters are confirmed by closer examination of the sequence alignments.

    We reduce the input data to D2-cluster using crossmatch. New ESTs are extracted from GenBank after each bimonthly release. These ESTs are partitioned into tissue subdivisions. The ESTs are initially compared to existing STACK consensi using crossmatch. The STACK consensi that find matching ESTs are collapsed to their individual ESTs and combined with their corresponding new ESTs. The ESTs are thus reduced by 20-60% prior to D2-cluster processing. The approximately 20-40% of ESTs that do not find matching members using crossmatch are fed into D2-cluster. The D2 clusters are renamed, taking into account the STACK IDs that are present already. These new clusters, together with the crossmatch expanded clusters, are assembled using PHRAP. The STACK clusters that were expanded using crossmatch are removed from the alignment file for the previous STACK release. This reduced file is appended to the new alignments for the crossmatch and D2 clusters. At this stage, the alignments are fed into the STACK_PACK system.

    9 Cluster Assembly and Processing

    Multiple sequence alignment is a complex problem which has been tackled by many groups. UniGene avoids it entirely by only listing the accession IDs of related sequences in the database. A current standard for assembly of fragments from shotgun cloning of a single gene is Phil Green's PHRAP, but even this cannot be expected to cope with the problems of quality and sequence divergence observed in loose EST clusters. PHRAP will reject sequences it considers to be "unrelated" and output them in separate files, regardless of the findings of the clustering engine.

    9.1 Processing alignments

    In the real world of noisy data, no alignment engine performs as well in every case as the human scientist's eye might be able to suggest. For high-volume, high-throughput applications this problem can be compensated for by further processing of cluster assemblies, while alignment editors such as Genetic Data Environment (GDE/SeqLab from GCG) can be invaluable for improvement "by hand" of individual cases. Nonetheless, assembled EST clusters provide confidence and quality information - even in the absence of original trace data - from which high accuracy, extended length consensus sequences can be constructed for the vast majority of base positions. Given the realities of loose EST clusters, we find it necessary to post-process cluster assemblies with CRAW, a tool which groups related sequences within an assembly and generates consensus sequences for each subset.

    How is a consensus decision made? In the light of the significant divergence that can occur between alternately spliced forms of a gene, majority consensus rule generation is inappropriate. Rather, consensus sub-forms need to be clearly derived and extracted using the free, publicly available viewer VIZ (footnote 9) and, on licensing, CRAWVIEW (figure 5).

    Figure 5. From Burke et al 1998: Alternate splicing and chimerism provide subsequence alignments that bear valuable information. An alternate form of expressed sequence is detected in an ovarian tumor library using sub-consensus analysis.

    10 Clone Linking

    All ESTs generated from the same cDNA clone correspond to a single gene. Each EST obtained from GenBank is searched for clone identification so that we can trace the transcripts corresponding to the same gene. 87% of ESTs currently have documented clone information. We utilise this information to extend the length of the cluster consensi by joining clusters that have ESTs that share clone IDs.

    For a gene that is not yet fully sequenced, achieving a representative consensus sequence from clustered EST data requires joining the available 5' and 3' read consensi. Given that the clone ID information is solely annotation based and may have namespace overlaps depending on the data source(s), this step is best handled near the end of the processing pipeline so that errors detected in the future can be repaired with a minimum of re-processing. Furthermore, unless a specific 5'-3' pair can be identified as a seed for each gene consensus, the procedure is transitive in nature and may lead to extensive clone-linked networks whose biological significance remains to be explored. The basic algorithm for clone linking used in STACK is:

    form a queue consisting of an initial cluster
    do {
        for each EST with a clone ID, add any cluster containing an EST
        with a matching clone ID to the queue
    } until no new clusters are added
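    The queue-based procedure can be sketched in runnable form as below. This is an illustration under stated assumptions, not the STACK source: the function name clone_linked_clusters and the cluster_to_clones mapping (cluster ID to the set of clone IDs carried by its ESTs) are hypothetical.

```python
from collections import deque


def clone_linked_clusters(seed_cluster, cluster_to_clones):
    """Return every cluster reachable from seed_cluster through shared clone IDs."""
    # Invert the mapping so each clone ID points at the clusters that contain it.
    clone_to_clusters = {}
    for cluster_id, clone_ids in cluster_to_clones.items():
        for clone_id in clone_ids:
            clone_to_clusters.setdefault(clone_id, set()).add(cluster_id)

    linked = {seed_cluster}
    queue = deque([seed_cluster])
    while queue:  # the loop ends once no new clusters are added
        cluster_id = queue.popleft()
        for clone_id in cluster_to_clones.get(cluster_id, ()):
            for neighbour in clone_to_clusters[clone_id]:
                if neighbour not in linked:
                    linked.add(neighbour)
                    queue.append(neighbour)
    return linked


# c1 and c3 share no clone directly, but both share one with c2.
clusters = {
    "c1": {"cloneA"},
    "c2": {"cloneA", "cloneB"},
    "c3": {"cloneB"},
    "c4": {"cloneZ"},
}
print(sorted(clone_linked_clusters("c1", clusters)))
```

    The example shows the transitive behaviour noted above: c1 reaches c3 only through c2, which is how extended clone-linked networks can form.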

    When a closed set of clone-linked consensi has been identified, they may be ordered 5'-

    unassigned-3' based on a majority rule from the EST annotations in each cluster. To form a

    final consensus sequence in STACK, the non-redundant best (see Cluster Assembly and

    Processing above) cluster consensi are joined by linker segments of 20 Ns. This choice was

    made based on the word size employed by BLAST, so that alignment breaks would be

    preferentially inserted at these linker regions.
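    The ordering and joining step might look like the following sketch. The labels "5'", "unassigned" and "3'", the input layout and the function names are assumptions made for illustration; ties in the majority vote are not given any special treatment here.

```python
from collections import Counter

LINKER = "N" * 20  # linker length taken from the text above
ORIENTATION_ORDER = {"5'": 0, "unassigned": 1, "3'": 2}


def majority_orientation(est_annotations):
    """Pick the most common end label among a cluster's EST annotations."""
    label, _ = Counter(est_annotations).most_common(1)[0]
    return label


def join_consensi(clusters):
    """Join consensi in 5'-unassigned-3' order with runs of 20 Ns.

    clusters: list of (consensus sequence, list of per-EST end labels).
    """
    ordered = sorted(
        clusters,
        key=lambda item: ORIENTATION_ORDER[majority_orientation(item[1])],
    )
    return LINKER.join(consensus for consensus, _ in ordered)


joined = join_consensi([
    ("GGGG", ["3'", "3'", "5'"]),
    ("AAAA", ["5'", "5'"]),
])
print(joined)
```

    Because the linker is longer than a typical search word, a database search against the joined sequence cannot seed a match across two unrelated consensi.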

    11 A working clustering system

    Why is clustering being performed? What is the primary output required? The design of a clustering system has to be tightly linked to the immediately desired outcome. Common requirements include:

    11.1 Expression counts
    11.2 Consensus sequences
    11.3 Alternate expression-form characterisation
    11.4 SNP detection
    11.5 Identification of genes expressed in the cluster project
    11.6 Identification of genes specifically expressed in a chosen library or tissue.

    Each of the above requires specific data manipulations. Direct clustering will yield expression counts, assembly will yield consensi, alignment processing and viewing will yield alternate splices and SNPs, and improved consensus length generation and consensus joining will yield more correctly identified gene expression forms. In the design of STACK_PACK we have tried to address each of these direct outputs.

    12 Brief introduction to the STACK_PACK clustering system

    In this section we familiarise the user with the STACK_PACK structure and describe the major components and processes in the light of the implementation issues described above. The principal design we have chosen centres around a hierarchical clustering approach. We initially chose this approach because CPU resources were limiting. We have subsequently found that hierarchical clustering allows for flexible scaling with centralised or distributed processing, and for viewing and analysis of consensus clusters at any defined level, from the original EST library/tissue cluster to the whole index cluster view. The approach lends itself well to adding new ESTs at the appropriate level. We have found it particularly powerful in the analysis of alternate splice events specific to particular tissue or expression 'bins'. (With the advent of index-based and tree-based methods, faster initial clustering implementations will allow for non-hierarchical approaches. Without a sophisticated, dynamic 'ADD' implementation any approach loses effectiveness.)

    We detail tools specific to the STACK_PACK process. Tools used by the system such as PHRAP are described elsewhere.

    12.1 STACK processing

    12.2 Data bin strategy and masking

    Data splitting: The initial step is to partition the input set into manageable subsets. This is a function of available CPU time and machine capacity; currently up to 500,000 sequences may be clustered if one has access to SGI/Cray resources. For SANBI's STACK database, we perform initial partitioning on the basis of available bin (usually tissue) annotations for each sequence, for a phase I process into bin consensi. Later the resulting consensus sequences can be clustered and the final assemblies built into indices.

    Masking: Cross-Match, XBLAST, comparison mask-databases. Input sequences are in Fasta format with headers that contain the accession number of the EST and the CLONEID.

    Clustering: D2-Cluster (as described above) and Crossmatch.

    Assembly: PHRAP is our assembler of choice, but any preferred assembler can be successfully employed as long as the input data is in Fasta format. Output data can be parsed from PHRAP .ace files and also from .gde files for processing into CRAW, which requires .gde format.

    Alignment analysis: CRAW processes the alignments for groups that share consensus and processes them into quality/difference partitions.

    The CRAW tool allows for selective assessment of alignments. It measures and displays consistency. As a significant proportion of alignments resulting from assembled clusters can contain non-optimal consensi and 'junk' as well as chimeras, alternate splices and polymorphisms, it is important to perform directed analysis of cluster alignments.

    CRAW consistency

    A group of more than two sequences is consistent if and only if every sequence in the group is pairwise-consistent with the consensus sequence derived for the group. If several unrelated sequences result in a poor quality alignment, a simple majority consensus generation rule might sample the sequences such that a consensus is generated that is consistent with each individual sequence in the group even when some sequence pairs or subsets contain a high degree of mismatching. To prevent this, the consensus sequences are generated by an early bias weighted method. We partition a group of sequences into sub-groups such that every sub-group is consistent. Note that we must look for a maximal sub-grouping because the trivial solution of breaking a group of N sequences into N singleton sub-groups is consistent by our definitions. Thus, we find an optimal partition such that as many sequences as possible are assigned to sub-groups. In our method, a sequence group is input and multiple alignments and consensus sequences are output. The amount of information that can be lost to artifact or alternate gene forms is bounded by the fact that no sequence contains a window of length W of less than (100*SIM) percent identity with the consensus. Regions of internal sequence are found by looking for regions in the alignment where windows of gapped region size (G_R_S) contain over ceiling(GAP_PERCENT*G_R_S) gaps. An analysis in the paper describing CRAW used the parameters G_R_S = 15 and GAP_PERCENT = 0.9 (Burke et al, 1998).
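    The two window tests described above can be sketched as follows. This is our reading of the text rather than CRAW itself: the function names are hypothetical, sequences are assumed to be already aligned to the consensus column by column, and "over ceiling(GAP_PERCENT*G_R_S)" is taken literally as a strict greater-than.

```python
import math


def is_pairwise_consistent(sequence, consensus, window, sim):
    """True if no window of aligned columns falls below sim identity.

    sim is a fraction (e.g. 0.8), corresponding to (100*SIM) percent.
    """
    if len(sequence) != len(consensus):
        raise ValueError("sequence and consensus must be aligned to equal length")
    for start in range(len(sequence) - window + 1):
        matches = sum(
            1
            for a, b in zip(sequence[start:start + window],
                            consensus[start:start + window])
            if a == b
        )
        if matches < sim * window:
            return False
    return True


def gapped_region_starts(aligned_row, g_r_s=15, gap_percent=0.9):
    """Start offsets of windows whose gap count exceeds ceil(gap_percent * g_r_s)."""
    threshold = math.ceil(gap_percent * g_r_s)  # 14 for the published parameters
    return [
        start
        for start in range(len(aligned_row) - g_r_s + 1)
        if aligned_row[start:start + g_r_s].count("-") > threshold
    ]
```

    With G_R_S = 15 and GAP_PERCENT = 0.9 the threshold is 14, so under this literal reading only windows made up entirely of gap characters are reported.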

    12.2.1 Detecting and viewing alternate splicing

    ALIGNMENT CONTAINS INCONSISTENCY: Strong Secondary Consensus Found.
    One position equals 20 bases.
    X if more than 2 bases (10 percent) disagree with consensus sequences.
    N if more than 2 positions are unknown.
    "-" if more than 14 positions are gap characters.

    0         200       400       600       800       1000  1141
    |         |         |         |         |         |     |

    ------------------1112226666666666666666----------------- 6 AA205280 (brain)

    ------------------1112226-------------------------------- 6 AA205467 (brain)

    ------------------1112226666666666666666----------------- 6 cons. for 6

    -111111111111NNNNN44444444------------------------------- 4 N62129

    -1111111111144441---------------------------------------- 4 AI262897

    -1111111111114441N44444444------------------------------- 4 cons. for 4

    ---------------11111111111111313333333------------------- 3 AA476710 (repro)

    -----------------311111111111313333333333333------------- 3 R.C.AA902555(hemat)NCI Kid3

    -----------------13111111111131333333333333333----------- 3 R.C.AA128258(repro)

    ----------------------111111131333333333333333----------- 3 R.C.N94727

    -------------------------11113133333333333333333--------- 3 AA129065

    ------------------------------1333333333333333----------- 3 R.C.AA243505

    -------------------------------133333333333333----------- 3 R.C.AA127007(repro)

    ---------------------------------3333333333333----------- 3 R.C.N43925

    -----------------------------------------33333----------- 3 AA442275 (repro)

    ---------------111111111111113133333333333333333--------- 3 cons. for 3

    -1111111111111111111122---------------------------------- 2 AA262073

    -11111111111111111111222--------------------------------- 2 AA831176

    -11111111111111111111222222222--------------------------- 2 AA810752(hemato)NCI_GCB1

    -11111111111111111111222222222--------------------------- 2 cons. for 2

    NN11111-------------------------------------------------- 1 AA199678

    11111111111111111111------------------------------------- 1 AA768208

    -11111111111--------------------------------------------- 1 AA126628

    -1111111111111111111111111111---------------------------- 1 N63563

    -111111111111111111111----------------------------------- 1 AA730260

    -111111111111111111111----------------------------------- 1 AA872419(hemato) tumors

    -11111111111111111111111111111111------------------------ 1 AA128314

    -1111111111111111---------------------------------------- 1 AA732483

    -11111111111111111111------------------------------------ 1 AA993331

    --111111111111111111111111------------------------------- 1 N34852

    ---11111111111111111------------------------------------- 1 AA322575

    ------------1111111111----------------------------------- 1 AI267543

    111111111111111111111111111111111------------------------ 1 cons. for 1

    -1111111111111111111155---------------------------------- 0 AA743074(hemato)NCI GCB1

    Figure 6. CRAW analysis of a STACK whole body index cluster that shows similarity to a genomic clone, AC004106. Tissue origins are shown in brackets and clone origin is underlined. The primary consensus (11111) shows significant similarity to clathrin coat adaptor complex sigma1B protein. The secondary consensi (2222, 333333, 666666) show significant similarity to two exons, 34000-36000 and 54100-54200 (putative alternate splice regions in genomic clone AC004106). Three ESTs (AA743074, AA205280 and AA128258) (bold) were reported to represent putative alternate splice transcripts on genomic clone AC004106 (Bouck et al., 1999). This CRAW analysis would suggest that there are more ESTs, in addition to the three reported by Bouck, that sample the putative alternate transcripts for the clathrin coat complex.

    Consensus processing: CONTIGPROC

    Once consensus sequence(s) have been generated by CRAW, it is necessary to attempt to choose the most appropriate consensus according to gene isoform. CONTIGPROC independently partitions the aligned sequences amongst the CRAW consensi, then ranks the consensi according to number of assigned sequences and number of called bases. Contigproc reads a cluster assembly and the associated consensus sequences generated by CRAW, then assigns each EST in the assembly to its 'best consensus' - the consensus that has the most contributing ESTs. A round of elimination follows, in which consensi representing only a single EST (or none) are removed; the consensi are then ranked according to the number of ESTs assigned to each (ties are broken by consensus sequence length, considering only clear ATCG base calls). The remaining consensi are logged with the best consensus in the GIO and GDE file formats, which support representation of sequence alignment data. Finally, the internal cluster representation is output in the supported file formats (GIO-NCGR, STACK-FASTA, SANIGENE-FASTA, STACK-GDE, SANIGENE-GDE). The 5' or 3' orientation of each cluster is determined by a vote of the individual EST annotations, and all output consensi are arranged to read 5' to 3'. Low quality consensus regions, defined as two Ns followed by at least thirteen IUPAC codes with four or fewer clear A, T, C, or G calls, are replaced by a single run of 10 Ns.
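    The elimination and ranking rules can be sketched as below. This is a simplified illustration, not CONTIGPROC: the function names and input dictionaries are hypothetical, the assignment of ESTs to consensi is assumed to have been made already, and the low quality masking step is not covered.

```python
def clear_base_count(consensus):
    """Number of unambiguous A, T, C or G calls in a consensus sequence."""
    return sum(consensus.upper().count(base) for base in "ATCG")


def rank_consensi(consensi, assignments):
    """Drop consensi with at most one EST, then rank the rest.

    consensi: {consensus name: sequence}
    assignments: {EST id: name of the consensus the EST was assigned to}
    Ranking is by EST count, with ties broken by clear base count.
    """
    est_counts = {name: 0 for name in consensi}
    for name in assignments.values():
        est_counts[name] += 1
    # Elimination round: consensi backed by a single EST (or none) are removed.
    survivors = [name for name in consensi if est_counts[name] > 1]
    return sorted(
        survivors,
        key=lambda name: (est_counts[name], clear_base_count(consensi[name])),
        reverse=True,
    )


ranked = rank_consensi(
    {"c1": "ACGTNN", "c2": "ACGTAC", "c3": "ACG"},
    {"e1": "c1", "e2": "c1", "e3": "c2", "e4": "c2", "e5": "c3"},
)
print(ranked)
```

    In the example, c3 is eliminated because only one EST supports it, and c2 ranks above c1 because the two tie on EST count but c2 has more clear base calls.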

    Clone-linking:

    More than one end may exist for each clone ID, and rather than tracing from EST1 -> clone ID -> EST2, we simply form groups of ESTs for each clone ID and then check for any links between these sets for extended clone link networks.

    Output system

    GDE format is appropriate for the presentation of cluster assembly data, but does not support significant regions of non-overlapping sequences as in the case of the clone-linked data. Such features are well supported by GSDB's GIO format, but this is not widely accepted by software in the sequence processing field. FASTA format sequence data is ubiquitously accepted, but captures only a single sequence in each record. As a result, SANBI supplies appropriate data in all three formats, and further distinguishes clusters which are and are not joined together in the clone linking phase for the GIO and FASTA output sets.

    Data Structures

    12.2.1.1 Cluster assembly for maximum consensus quality

    Further processing of cluster assemblies (see CONTIGPROC/CRAW below), ensures highthroughput consistency of clustering alignment while alignment editors such as GDE can be

    invaluable for improvement "by hand" of individual cases. Nonetheless, assembled EST

    clusters provide confidence and quality information - even in the absence of original trace

    data - from which high accuracy, extended length consensus sequences can be constructed for

    the vast majority of base positions.

    Output restriction for generation of higher quality sequence:

    SANIGENE dataset generation strategies

    The exact SANIGENE membership criterion is:

    At each base position of the consensus sequence

    A = number of EST bases that agree with consensus

    B = number of EST bases that disagree with consensus

    (A-B) >= 2

    Only ESTs assigned to the top CRAW consensus are considered, i.e. ideally only the "good

    sequences" from a cluster are in this set to begin with. (For any STACK or SANIGENE

    cluster we work mainly with the top CRAW consensus and the ESTs assigned to it; CRAW

    subconsensi and their ESTs are written out to the GDE files.)
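The (A-B) >= 2 test can be illustrated with a minimal Python sketch. It assumes a simplified alignment in which every EST is a string padded with "-" to the consensus length; that layout and both function names are assumptions made for the example, not the CRAW data structures.

```python
def sanigene_mask(consensus, aligned_ests, min_margin=2, gap="-"):
    """Return one boolean per consensus position: True where (A - B) >= min_margin.

    aligned_ests: EST strings padded with `gap` to the consensus length
                  (hypothetical, simplified alignment layout).
    """
    keep = []
    for pos, cons_base in enumerate(consensus):
        column = [est[pos] for est in aligned_ests if est[pos] != gap]
        agree = sum(1 for base in column if base == cons_base)  # A
        disagree = len(column) - agree                           # B
        keep.append(agree - disagree >= min_margin)
    return keep


def passing_regions(consensus, keep):
    """Collect maximal runs of consensus positions that pass the criterion."""
    regions, start = [], None
    for pos, ok in enumerate(keep):
        if ok and start is None:
            start = pos
        elif not ok and start is not None:
            regions.append(consensus[start:pos])
            start = None
    if start is not None:
        regions.append(consensus[start:])
    return regions
```

Note that a position covered by a single EST can never pass, since A is at most 1 there; the criterion therefore implies a minimum of two agreeing reads.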

    Generation of high confidence consensi

    By restricting the final output sequence to only those regions represented by at least two

    reads, as in the SANBI SANIGENE dataset, high confidence consensi can be generated with

    a longer length (due to the rejection of clusters with small numbers of ESTs) in the average

    case.

    Rejection strategies.

    To avoid redundancy in the final sets for BLAST searching, only the best consensus from

    CRAW processing is used in the FASTA data sets (see Cluster Assembly and Processing

    above).

    The simple rule of thumb is that ESTs contain regions of very low quality, but the savvy

    bioinformaticist can make up for this deficiency and even improve confidence overall by

    studying the alignment of related sequences. In the real world of noisy data, no alignment

    engine performs as well in every case as the human scientist's eye might be able to suggest;

    for high-volume, high-throughput applications this problem can be compensated for by

    further processing of cluster assemblies (see CONTIGPROC/CRAW below), while alignment

    editors such as GDE can be invaluable for improvement "by hand" of individual cases.

    Nonetheless, assembled EST clusters provide confidence and quality information - even in

    the absence of original trace data - from which high accuracy, extended length consensus

    sequences can be constructed for the vast majority of base positions. By restricting the final

    output sequence to only those regions represented by at least two reads, as in the SANBI

    SANIGENE dataset, high confidence consensi can be generated with a longer length (due to

    the rejection of clusters with small numbers of ESTs) in the average case. The judicious

    definition of linker segments to join regions of high-confidence, multiple read sequence data

    in the final consensus sequence can actually improve results for BLAST searches, by allowing

    the natural separation of High-scoring Segment Pairs (HSPs) in the search algorithm. This

    result is over and above the improvement obtained by eliminating regions of low quality reads

    and other production errors. Even when utilising an assembly engine prone to mis-alignment

    and error when confronted with the problems commonly observed in EST data, cluster

    assembly provides a net benefit in maximising consensus sequence quality: the cleanest

    regions will be identified and highlighted, while the remainder are isolated in preparation for

    appropriate strategies for cluster alignment processing.
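The idea of joining multiple-read regions with linker segments can be sketched as follows. The per-position read depth input, the function name and the choice of a run of ten Ns as the linker are assumptions for illustration; the text does not specify how linkers are defined in practice.

```python
def high_confidence_consensus(consensus, depth, min_reads=2, linker="N" * 10):
    """Keep regions covered by >= min_reads reads and join them with a linker.

    depth: number of reads covering each consensus position
           (hypothetical input, same length as `consensus`).
    """
    if len(consensus) != len(depth):
        raise ValueError("consensus and depth must have the same length")
    segments, current = [], []
    for base, d in zip(consensus, depth):
        if d >= min_reads:
            current.append(base)
        elif current:
            # A poorly covered position closes the current segment.
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return linker.join(segments)
```

A linker made of ambiguity codes should not contribute to a match, so the high-confidence regions on either side would be expected to show up as separate HSPs in a BLAST search.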

    12.2.2 Adding using STACK_PACK

    See above.

    12.2.2.1 Accession number management under dynamic addition

    ESTs are passed against existing singletons and consensi. As ESTs are added to an existing

    clustered database, it is likely that new sequences will join previously separate clusters. The

    STACK strategy is to decompose the existing clusters to their constituent ESTs, then

    reprocess these complete, new clusters from the assembly step onwards. In this process the

    previous cluster IDs are removed from the database and new IDs created, but a problem

    develops in that existing sequence analysis and visualisation tools rely on accession numbers

    to remain unchanged as the sequence databases grow (the original targets, experimentally

    determined sequences, would not be expected to change significantly over time). The only

    method we have identified to cope with this problem is the annotation of "secondary

    accession" fields for each cluster, listing the older cluster IDs each has subsumed. This

    strategy is mirrored in the system used with the IS databank consortium by the use of accession

    versions. This is a new field where the first number is the never changing accession number,

    followed by a period and a version number. The version number starts at one, and increases

    by one each time the sequence changes (i.e. is merged, where the largest cluster keeps its

    primary accession, and the smaller cluster's accession is subsumed as a secondary accession).

    We remain able to extract clusters based on constituent accessioned ESTs.
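A minimal sketch of this bookkeeping is given below. The Cluster class, its field names and the example accession strings are hypothetical; only the rules (version starts at one, the larger cluster keeps its primary accession, the smaller one becomes a secondary accession) are taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    accession: str                 # never-changing primary accession
    ests: set                      # constituent EST accessions
    version: int = 1               # starts at one
    secondary: list = field(default_factory=list)  # subsumed cluster IDs

    @property
    def versioned(self):
        """Accession, a period, then the version number."""
        return f"{self.accession}.{self.version}"


def merge(a, b):
    """Merge two clusters; the larger one keeps its primary accession."""
    keep, drop = (a, b) if len(a.ests) >= len(b.ests) else (b, a)
    return Cluster(
        accession=keep.accession,
        ests=keep.ests | drop.ests,
        version=keep.version + 1,  # the sequence changed, so bump the version
        secondary=keep.secondary + [drop.accession] + drop.secondary,
    )
```

A tool holding the old ID of the smaller cluster can still resolve it through the secondary list, and any cluster can be retrieved from one of its constituent ESTs by testing membership in `ests`.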

    Virtual protein manufacture: Currently, EST_SCAN

    (http://www.ch.embnet.org/software/ESTScan.html), FRAMEFINDER (Guy Slater, personal

    communication) and other tools are being evaluated for efficacy.

    12.3 References in order of mention.

    Vasmatzis, G., M. Essand, U. Brinkmann, B. Lee, and I. Pastan. 1998. Discovery of three

    genes specifically expressed in human prostate by expressed sequence tag database analysis.

    Proc. Natl. Acad. Sci. USA 95(1):300-304.

    Wolfsberg, T.G. and D. Landsman. 1997. A comparison of expressed sequence tags (ESTs) to

    human genomic sequences. Nucleic Acids Research 25(8):1626-1632.

    Hillier, L., N. Clark, T. Dubuque, K. Elliston, M. Hawkins, M. Holman, M. Hultman, T.

    Kucaba, M. Le, G. Lennon, M. Marra, J. Parsons, L. Rifkin, T. Rohlfing, M. Soares, F. Tan,

    E. Trevaskis, R. Waterston, A. Williamson, P. Wohldmann, and R. Wilson. 1996. Generation

    and Analysis of 280,000 Human Expressed Sequence Tags. Genome Research 6:807-828.

    Aaronson, J.S., B. Eckman, R.A. Blevins, J.A. Borkowski, J. Myerson, S. Imran, and K.O.

    Elliston. 1996. Toward the Development of a Gene Index to the Human Genome: An

    Assessment of the Nature of High-throughput EST Sequence Data. Genome Research 6:829-845.

    Adams, M.D., J.M. Kelley, J.D. Gocayne, M. Dubnick, M.H. Polymeropoulos, H. Xiao,

    C.R. Merril, A. Wu, B. Olde, R.F. Moreno, A.R. Kerlavage, W.R. McCombie, and J.C. Venter.

    1991. Complementary DNA Sequencing: Expressed Sequence Tags and Human Genome

    Project. Science 252: 1651-1656.

    Bonaldo, M.F., G. Lennon, and M.B. Soares. 1996. Normalization and Subtraction: Two

    Approaches to Facilitate Gene Discovery. Genome Research 6:791-806.

    Alwen, J. 1990. United Kingdom human genome mapping project: development, components,

    coordination and management, and international links of the project. Genomics 6:386-388.

    Gill RW, Hodgman TC, Littler CB, Oxer MD, Montgomery DS, Taylor S, Sanseau P.

    Advanced Technology & Informatics Unit, Glaxo-Wellcome Medicines Research Centre,

    Stevenage, Herts, UK. [email protected]

    Torney, D.C., et al. 1990. Computation of d2: A Measure of Sequence Dissimilarity.

    Computers and DNA, SFI Studies in the Sciences of Complexity, vol. VII, eds. G. Bell and T.

    Marr, Addison-Wesley.

    Hide, W., J. Burke, and D. Davison. 1994. Biological Evaluation of d^2, an Algorithm for

    high-performance Sequence Comparison. J. Comp. Bio. 1:199-215.

    Wu, T.J., J.P. Burke, and D.B. Davison. 1997. A Measure of DNA Sequence Dissimilarity

    Based on Mahalanobis Distance Between Frequencies of Words. Biometrics 53:1431-1439.

    Burkhardt, S., A. Crauser, P. Ferragina, H-P. Lenhof, E. Rivals, and M. Vingron. 1999. q-gram

    Based Database Searching Using a Suffix Array (QUASAR). Proceedings of the 3rd International

    Conference on Computational Molecular Biology (RECOMB99), Lyon, France.

    Strelets, V.B., A.A. Ptitsyn, L. Milanesi, and H.A. Lim. 1994. Data bank homology search

    algorithm with linear computation complexity. Comp. Appl. Biosci. 10(3):319-322.

    Sutton, G., O. White, M.D. Adams, and A.R. Kerlavage. 1995. TIGR Assembler: A New Tool

    for Assembling Large Shotgun Sequencing Projects. Genome Science and Technology 1:9-18.

    Burke, J.P., H. Wang, W. Hide, and D. Davison. 1998. Alternative Gene Form Discovery

    and Candidate Gene Selection from Gene Indexing Projects. Genome Research 8:276-290.

    Boguski, M.S. and G.D. Schuler. 1995. ESTablishing a Human Transcript Map. Nature

    Genetics 10:369-371.

    Benson D.A., M.S. Boguski, D.J. Lipman, and J. Ostell. 1994. GenBank. Nucleic Acids

    Research 22: 3441-3444.

    Hide, W., J. Burke, A. Christoffels, and R. Miller. 1997. A novel approach towards a

    comprehensive consensus representation of the expressed human genome. In Genome

    Informatics 1997, pp. 187-196. Satoru Miyano and Toshihisa Takagi, Eds. Universal Academy

    Press Inc., Tokyo, Japan. ISSN 0919-9454.

    Houlgatte R., R. Mariage-Samson, S. Duprat, A. Tessier, S. Bentolila, B. Lamy, C. Auffray.

    1995. The GenExpress Index: A Resource for Gene Discovery and the Genic Map of the

    Human Genome. Genome Research 5: 272-304.

    Williamson et al. 1995. Merck Gene Index

    Green 1996, PHRAP

    Bouck et al 1999 Trends in Genetics 15 (4): 159-162

    12.4 Appendix

    STACK Schema.

    UniGene WHALE tool: http://www.ncbi.nlm.nih.gov/Sicotte/whale.html

    UniGene and d2_cluster comparisons: Fragmentation as a result of stricter clustering.

    There is significant improvement in clustering with d2_cluster vs UniGene.

    d2_cluster and UniGene produce results that are between 83% and 90% identical, and

    d2_cluster results are between 8% and 20% less fragmented than UniGene. The above figure

    demonstrates a cluster that has been fragmented by UniGene and is kept together by

    d2_cluster.

    d2_cluster measurably decreases the rate of fragmentation, producing around 20% fewer

    singleton sequences (6463 to 5189) and reducing the overall number of clusters by 10%

    (15225 to 13756). Generally the numbers of smaller clusters are reduced while larger clusters

    appear with slightly larger frequency (see table: Burke, Davison, Hide, Submitted, Genome

    Research).

    Table of Cluster Sizes

    Cluster size          UniGene (Rat-build 19)    d2_cluster
    Singleton clusters    6463                      5189
    2                     3496                      3298
    3-4                   3002                      2971
    5-8                   1635                      1602
    9-16                  491                       531
    17-32                 111                       127
    33-64                 21                        27
    65-128                3                         8
    129-256               3                         2
    257-512               1                         0
    Total Clusters        15225                     13756
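The percentage reductions quoted for d2_cluster can be checked directly from the singleton and total rows of the table. The short Python sketch below only redoes that arithmetic; the function name is ours.

```python
def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Values taken from the Table of Cluster Sizes (UniGene vs d2_cluster).
singleton_drop = percent_reduction(6463, 5189)
cluster_drop = percent_reduction(15225, 13756)

print(f"Singletons reduced by {singleton_drop:.1f}%")
print(f"Total clusters reduced by {cluster_drop:.1f}%")
```

This gives roughly 19.7% fewer singletons and 9.6% fewer clusters overall, in line with the rounded figures of 20% and 10%.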

    EST links:

    STACK tools:

    www.sanbi.ac.za/stack

    www.sanbi.ac.za/stack_pack

    * ICA Tools http://sunny.ebi.ac.uk/ebi/EST/

    * basic explanation of clustering

    http://sunny.ebi.ac.uk/~jparsons/work/est_assembly/est_assembly_intro.html

    * clustering rat sequences (good overview) http://goliath.ifrc.mcw.edu/EST/contents.html

    * the UniGene database, probably the most well-known EST database:

    http://www.ncbi.nlm.nih.gov/UniGene/index.html

    * starting information when creating STACK: http://www.ncbi.nlm.nih.gov/dbEST/index.html

    * the TIGR THC database, another view of the consensus world:

    http://www.tigr.org/tdb/hgi/hgi.html

    * other EST links: http://industry.ebi.ac.uk/~muilu/EST/EST_links.html

    * integrate EST information http://www.ncbi.nlm.nih.gov/LocusLink/index.html

    Websites referenced:

    1 http://genome.wustl.edu/est/esthmpg.html

    2 http://hercules.tigem.it/ASSEMBLY/capdoc.html

    3 http://www.girinst.org/

    4 ftp://ncbi.nlm.nih.gov/repository/vector/

    5

    6

    7 http://sunny.ebi.ac.uk/EST/

    8 http://www.ncbi.nlm.nih.gov/LocusLink/

    9 ftp.sanbi.ac.za/stack/software/viz.tar