The Annotation, Mapping, Expression and Network (AMEN) suite of tools for molecular systems biology

  • Published on
    30-Sep-2016

  • View
    215

  • Download
    1

Embed Size (px)

Transcript

  • ral

    ssBioMed CentBMC Bioinformatics

    Open AcceSoftwareThe Annotation, Mapping, Expression and Network (AMEN) suite of tools for molecular systems biologyFrdric Chalmel and Michael Primig*

    Address: Institut National de la Sant et de la Recherche Mdicale (Inserm) Unit 625, Groupe d'Etude de la Reproduction chez l'Homme et les Mammifres, Institut Fdratif de Recherche 140; Universit de Rennes 1, Campus de Beaulieu, F-35042 Rennes, France

    Email: Frdric Chalmel - frederic.chalmel@rennes.inserm.fr; Michael Primig* - michael.primig@rennes.inserm.fr

    * Corresponding author

    AbstractBackground: High-throughput genome biological experiments yield large and multifaceteddatasets that require flexible and user-friendly analysis tools to facilitate their interpretation by lifescientists. Many solutions currently exist, but they are often limited to specific steps in the complexprocess of data management and analysis and some require extensive informatics skills to beinstalled and run efficiently.

    Results: We developed the Annotation, Mapping, Expression and Network (AMEN) software asa stand-alone, unified suite of tools that enables biological and medical researchers with basicbioinformatics training to manage and explore genome annotation, chromosomal mapping, protein-protein interaction, expression profiling and proteomics data. The current version providesmodules for (i) uploading and pre-processing data from microarray expression profilingexperiments, (ii) detecting groups of significantly co-expressed genes, and (iii) searching forenrichment of functional annotations within those groups. Moreover, the user interface is designedto simultaneously visualize several types of data such as protein-protein interaction networks inconjunction with expression profiles and cellular co-localization patterns. We have successfullyapplied the program to interpret expression profiling data from budding yeast, rodents and human.

    Conclusion: AMEN is an innovative solution for molecular systems biological data analysis freelyavailable under the GNU license. The program is available via a website at the Sourceforge portalwhich includes a user guide with concrete examples, links to external databases and helpfulcomments to implement additional functionalities. We emphasize that AMEN will continue to bedeveloped and maintained by our laboratory because it has proven to be extremely useful for ourgenome biological research program.

    BackgroundHigh-throughput DNA sequencing, microarray-basedmRNA expression profiling, proteomics experiments andprotein-protein interaction assays have been yielding

    scale expression profiling using microarrays is among themost popular experimental approaches in genome biol-ogy and therefore optimized methods are available for allkey analytical steps. They include raw data pre-processing,

    Published: 6 February 2008

    BMC Bioinformatics 2008, 9:86 doi:10.1186/1471-2105-9-86

    Received: 12 October 2007Accepted: 6 February 2008

    This article is available from: http://www.biomedcentral.com/1471-2105/9/86

    2008 Chalmel and Primig; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.Page 1 of 11(page number not for citation purposes)

    large and complex datasets that need to be integrated withfunctional information at the gene- or genome level. Large

    quality control and normalization [1,2], identification ofdifferentially expressed genes during static or time-course

  • BMC Bioinformatics 2008, 9:86 http://www.biomedcentral.com/1471-2105/9/86

    conditions [3-5], gene clustering [6-8] and searching forsignificant over- or under-representation of functionalannotation in expression clusters [9-11]. The Bioconductorproject provides numerous software packages developedin R that are devoted to high-throughput analysis tasks[12-14]. However, installing and running them requiresextensive programming skills that are not yet common-place among life scientists. To alleviate this problem, pro-grams with a convenient Graphical User Interface (GUI)have been developed that facilitate functional analyses inmost cases limited to annotation by the Gene Ontology(GO) consortium [10] or restricted to a set of genomes[15,16]. Other tools correlate expression with chromo-somal localization [17-21], protein-protein interaction[22] or pathway data [23].

    In an attempt to combine analysis steps many web-basedapplications have been developed [24-30]. They are freeand do not require maintenance work. However, theiraccessibility and speed depend upon web-traffic, serveravailability and the specifications of the analyses proce-dure. Moreover, web-based systems usually provide pre-configured and inflexible approaches to data analysis andoften do not include advanced options to combine differ-ent types of high-throughput data. In order to addressthese issues and to allow for integrated exploration of dif-ferent types of data we have developed the Annotation,Mapping, Expression, and Network (AMEN) program thatenables users to explore and analyse multifaceted high-throughput biological data. It includes a suite of tools andalgorithms for which parameters can be fine-tuned andanalysis steps ordered as required. AMEN covers arraydata management, analysis and interpretation in a man-ner similar to EXPANDER 2.0 [29]. However, our softwareincludes more options to combine different types of dataand enables users to import, not only genome annotationand transcriptome-, but also proteome and interactomedata without species restrictions.

    ImplementationThe AMEN software architecture consists of four layersimplemented in Tcl/Tk (Figure 1) [31]. The first layer pro-vides modules for uploading, formatting and pre-process-ing expression, annotation, chromosomal mapping andprotein-protein interaction data. The second layer is theuser-friendly GUI of the main application window thatemploys popup menus for Project, Upload data, Tools, Viewsand Options functionalities (Figure 2). Six panels provideaccess to lists of items such as probe IDs, genes, proteins(Group), and data from RNA/protein profiling experi-ments (Expression), genome annotation (Annotation),protein-protein and protein-DNA interaction (Interac-tion), chromosomal localization (Mapping) as well as the

    (Up/Down), mark items (Select all/Deselect), change thecontent of a panel (Add/Remove), change item names(Modify) or change the file content (Edit). Selected itemsin each panel highlighted in yellow are combined into dif-ferent workflows by the user. For example, selectingmouse genes (Annotation panel) showing differentialexpression in testis (DET) and peak signals in somatic Ser-toli cells (SO) compared to mitotic (MI), meiotic (ME)and post-meiotic (PM) germ cells (Group panel) andSpermatogenesis data (Expression panel) (see Figure 2)yields a graphical display of RNA profiling signals gener-ated by the module controlled via the Views > Expressiondata > Profiles menu (Figure 3). Selecting protein networkdata (Interaction panel) enables users to display interac-tion patterns (see Figure 4 in reference [32]). Selectingchromosomal localization (Mapping panel) and statisti-cal items enables users to correlate expression and map-ping information or to reveal a link betweentranscriptional patterns and roles in biological processes(see reference [33] for more details). In the background,the third layer automatically creates and runs scripts forthe statistical computing environment R which executestatistical calculations and clustering methods imple-mented in Bioconductor packages. The fourth layer dis-plays the output based on Tk scripts and the graphrendering software GraphViz.

    Our program requires Tcl/Tk, R and GraphViz programsavailable for all frequently used operating systems such asMS Windows, UNIX, Linux and Mac OS X to be pre-installed. Detailed downloading and installation instruc-tions are accessible via the Sourceforge website. We alsoprovide a Windows version of the software that pre-installs Tcl/Tk, R and GraphViz embedded into it. Notethat AMEN supersedes goCluster, a much simpler toolpreviously developed in our laboratory [34]. We havedecided to discontinue goCluster because it lacked keyanalysis features and the cost for further development andmaintenance outweighed the benefit for our lab and thecommunity. We emphasize that AMEN is frequentlyupdated because this software is a key tool for our ongo-ing biomedical studies. Its structure greatly facilitates theimplementation of new modules. Indeed, a single Tclcode line is sufficient to include additional functionalitiesinto the main GUI.

    Results and DiscussionData uploadingThe typical workflow involves five types of modules avail-able in the current release: data uploading and pre-processing, statistical filtration, clustering, functionalmining, and visualization. Data are imported and com-bined within an analysis project using the main applica-Page 2 of 11(page number not for citation purposes)

    output of the statistical module (Statistic). Function but-tons below each panel enable users to scroll item lists

    tion window (Figure 2). It includes six panelscorresponding to different input data: items (such as

  • BMC Bioinformatics 2008, 9:86 http://www.biomedcentral.com/1471-2105/9/86

    genes, transcripts, proteins or probe identifiers), expres-sion signal, functional annotation, (protein-protein)interaction, chromosomal location and statistical data.During this process items (such as probe set IDs) are auto-matically associated with other data types (including genesymbols, chromosomal position and GO terms). Thisinterface makes it easy to access the data and to design anoptimal analysis procedure. Data are input as text files intab-delimited compatible format to ensure compatibilitywith all operating systems.

    Group dataUsers can upload pre-selected lists of items (called "main

    clustering, and/or visualization modules (see Figure 3 and4).

    Expression dataCurrently, expression data quality control and normaliza-tion modules are implemented for commercial Affymetrixhigh density oligonucleotide microarrays (GeneChips)and Illumina Gene Expression BeadArrays. Methods forbackground correction, variance stabilization and nor-malization methods are MAS5.0, RMA, GCRMA and RSN[35]. It is also possible to upload pre-normalized expres-sion datasets as long as they are represented as tab-delim-ited matrices whose rows and columns contain main

    The AMEN architectureFigure 1The AMEN architecture. A flow-chart diagram of the software and work-flow is shown.Page 3 of 11(page number not for citation purposes)

    entries", such as probe, transcript, gene, or protein identi-fiers) or they can obtain such lists via statistical filtration,

    items (usually probe identifiers) and experimental condi-tion names, respectively.

  • BMC Bioinformatics 2008, 9:86 http://www.biomedcentral.com/1471-2105/9/86

    Annotation, Interaction and Mapping dataFunctional information for transcriptome (Affymetrix andIllumina CSV annotation files), proteome (InternationalProtein Index, EBI and NCBI whole-proteome data files),interactome (Proteomics Standards Initiative-MolecularInteractions 2.5 files used by IntAct, MINT, and BioGRID)and chromosomal mapping analysis (PSL chromosomallocation files from the UCSC web site) is imported andconverted into the appropriate file format using a straight-forward procedure [36-43].

    Statistical filtrationThese modules output lists of significantly differentiallyexpressed items (represented as transcripts or probe iden-tifiers) identified within a given set of samples. Users canselect transcripts showing strong variations across experi-mental conditions via threshold parameters includingexpression level cut-off, standard deviation or fold-change. Furthermore, a similarity search module helpsretrieve groups of co-expressed transcripts using a specificuser-defined pattern. Once a set of target transcripts isidentified, the permutation (randomization), moderatedt-test (empirical Bayes approach) and non-parametricrank-based statistic methods are employed to determine ifchanges in signal intensity are reproducible and signifi-cant. These methods are implemented in multtest, samr,

    ues according to the Hommel (control of the family wiseerror rate) or Benjamini/HochBerg [determination of theFalse Discovery Rate (FDR)] multiple testing correctionmethods [47,48].

    ClusteringClustering methods are used to classify items based ontheir overall degree of similarity across the experimentalconditions. These algorithms are notably critical for theidentification of genes that are co-expressed (showingsimilar patterns of transcription), co-regulated (sharingcommon promoter elements) or that play roles in a par-ticular biological process. Users can choose between threehierarchical clustering modules: HCLUST (hierarchical),AGNES (AGglomerative NESting) and DIANA (DIvisiveANAlysis) [8,49]. Four supervised partitioning methodsinclude k-means [50], PAM (Partitioning AroundMedoids), FANNY (Fuzzy Analysis Clustering) andCLARA (Clustering LARge Applications) [49]. We alsoincluded two unsupervised clustering modules calledMCLUST (Model-based CLUstering) and HOPACH (Hier-archical Ordered Partitioning and Collapsing Hybrid)that automatically determine the number of clusters in agiven dataset [51,52]. Finally, to estimate the quality ofthe classification or to help identify the optimal numberof clusters that yields the best separation of different

    The Main Application WindowFigure 2The Main Application Window. A screen shot of the main application window is given. A possible analysis strategy for mammalian testicular expression data is shown in the six data type panels as indicated. Four groups (clusters) of genes are defined as differentially expressed in testis and somatic (DET-SO), mitotic (DET-MI), meiotic (DET-MI) and post-meiotic (DET-PM) depending on peak expression in Sertoli cells, spermatogonia, spermatocytes and spermatids, respectively. The expression dataset (Spermatogenesis) was obtained with a GeneChip covering approximately 25000 protein-coding mouse genes for which an appropriate annotation file is selected (Mouse430_2.na22). To visualize the interaction network of proteins falling into two selected clusters (DET-MI and ME) information from three sources is combined (IntAct_MINT_BioGRID). To display the chromosomal localization of selected genes falling into given expression clusters files with gene coordinates are available with and without cytological bands (affyMOE430, affyMOE430_WithCyto). Users can choose from statistical analysis of GO term enrichment in clusters (AnnotationEnrich) or gene enrichment on chromosomes (MappingEnrich).Page 4 of 11(page number not for citation purposes)

    limma and RankProducts R packages, respectively [44-46].False positives are take...

Recommended

View more >