Kristen 1scsasc

  • 8/12/2019 Kristen 1scsasc

    Bioinformatics Databases: Fundamental Concepts of Database Technology & Data Organization

    Kristen Anton

    Director of BioInformatics

    Dartmouth Medical School

    BioInformatics @ Dartmouth Medical School

    How can data be organized?

    • Paper (i.e., in notebooks)
    • Flat files
        • Collection of data records
        • Minimal structure, no metadata
        • Application program must contain relationship information
    • Database
        • Hierarchical
        • Network
        • Relational
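The flat-file case can be sketched as follows. This is a minimal illustration, assuming a hypothetical tab-delimited file of gene records; the file contents, the column layout, and the function names are invented for the example. The point is that the file carries no description of itself, so the program has to hard-code both the column meanings and any relationship logic.

```python
import csv
import io

# Hypothetical flat file: one record per line, no header, no metadata.
FLAT_FILE = "G1\tBRCA1\t17\nG2\tTP53\t17\nG3\tEGFR\t7\n"

# The file does not say what each column means, so the application
# has to hard-code that knowledge (an assumption of this sketch).
COLUMNS = ("gene_id", "symbol", "chromosome")


def read_records(text):
    """Parse the flat file into dictionaries using the hard-coded layout."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader]


def genes_on_chromosome(records, chromosome):
    """Relationship logic lives in the program, not in the data."""
    return [r["symbol"] for r in records if r["chromosome"] == chromosome]


records = read_records(FLAT_FILE)
print(genes_on_chromosome(records, "17"))
```

If the column order in the file changes, every program that reads it must change too; a database avoids this by storing the structure alongside the data.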

    What is a relational database?

    A database composed of relations and conforming to a set of principles governing how such relations are supposed to behave (Codd's 12 Rules).

    There are many database systems that use tables but don't conform to all of the principles. These are often called semi-relational systems.

    (from Understanding SQL, Martin Gruber)

    Practically speaking...

    • A database is a body of information stored in two dimensions (rows and columns)
        • Rows are records
        • Columns are attributes of those record entities (usually!)
    • The groups of rows and columns, or tables, are largely independent of each other
    • The power of the database lies in the relationships that you construct among the tables
    • A database is self-describing: it contains metadata, which is a description of its own structure
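These points can be made concrete with a small sketch. It assumes SQLite (through Python's standard `sqlite3` module) as a stand-in for any relational DBMS; the `gene` and `sample` tables and their contents are hypothetical. It shows two independent tables, a relationship constructed at query time with a join, and the database describing its own structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE sample (
        sample_id INTEGER PRIMARY KEY,
        gene_id INTEGER REFERENCES gene(gene_id),
        tissue TEXT
    );
    INSERT INTO gene VALUES (1, 'geneA'), (2, 'geneB');
    INSERT INTO sample VALUES (10, 1, 'liver'), (11, 1, 'lung'), (12, 2, 'liver');
""")

# The relationship between the two tables is expressed in the query (a join).
joined = conn.execute(
    "SELECT g.symbol, s.tissue FROM gene g "
    "JOIN sample s ON s.gene_id = g.gene_id ORDER BY s.sample_id"
).fetchall()

# Self-describing: the catalog lists the tables, and PRAGMA table_info
# lists the columns of one table (both are SQLite-specific mechanisms).
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
gene_columns = [row[1] for row in conn.execute("PRAGMA table_info(gene)")]

print(joined)
print(tables, gene_columns)
```

Other database systems expose the same kind of metadata through their own catalogs, so the catalog queries here should be read as one example rather than a portable recipe.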

    What is a Database Management System (DBMS)?

    • A set of programs which define, administer and process databases and their associated applications
    • A scalable DBMS can run on multiple platforms (varying sizes)
    • A DBMS that supports interoperability uses industry-standard language and standard ways of exchanging data
    • Examples: Oracle, Sybase, 4D, MS Access

    Features of a Relational Database

    • Rows (records) are in no particular order
    • Columns (fields) are ordered, numbered and named; names should indicate content of the field
    • Primary key uniquely identifies each row - ensures that no row is empty, and that every row is different from every other row
    • Two-step commit process
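A short sketch of the primary-key and commit behaviour, again assuming SQLite via Python's `sqlite3` module; the `patient` table and the `try_insert` helper are hypothetical. Note that this shows a plain commit/rollback within one database, which is only a simplified illustration of the commit idea and not a distributed two-phase commit protocol.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute("INSERT INTO patient VALUES (1, 'Ann')")
conn.commit()  # make the first row permanent


def try_insert(patient_id, name):
    """Insert a row; undo the change and return False if a constraint fails."""
    try:
        conn.execute("INSERT INTO patient VALUES (?, ?)", (patient_id, name))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()
        return False


duplicate_ok = try_insert(1, "Bob")  # same key as the existing row: rejected
new_ok = try_insert(2, "Bob")        # new key: accepted

# Rows have no inherent order; ask for one explicitly with ORDER BY.
rows = conn.execute(
    "SELECT patient_id, name FROM patient ORDER BY patient_id"
).fetchall()
print(duplicate_ok, new_ok, rows)
```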

    The tool for communicating with relational databases: SQL

    • Structured Query Language (SQL)
    • A query is a question you ask the database, and SQL retrieves the appropriate answer set
    • Interactive SQL (command line) vs. RAD tool/GUI
    • Standardization issue: ANSI (American National Standards Institute)
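As an illustration of a query and its answer set, here is a minimal sketch that assumes SQLite through Python's `sqlite3` module. The `expression` table, its rows, and the 4-fold threshold are hypothetical example data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (gene TEXT, tissue TEXT, fold_change REAL)")
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?)",
    [("geneA", "liver", 5.2), ("geneB", "liver", 1.1), ("geneC", "lung", 4.5)],
)

# The question: "Which genes change more than 4-fold, and in which tissue?"
query = """
    SELECT gene, tissue
    FROM expression
    WHERE fold_change > ?
    ORDER BY gene
"""
answer_set = conn.execute(query, (4.0,)).fetchall()
print(answer_set)
```

The same `SELECT` statement could be typed at an interactive SQL prompt or generated by a GUI tool; only the way the question is delivered differs.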

    Data Types

    • Types of data indicate functions that are possible between related fields
    • Each field is assigned one data type (imposes structure on data)
    • Examples: text (CHAR, VARCHAR), number (INT, DEC); date, time, money, binary
    • Standardization issue: ANSI (American National Standards Institute)
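The idea that a field's type determines which operations make sense can be sketched as below. This assumes SQLite via Python's `sqlite3` module, and the `visit` table is hypothetical. SQLite applies declared types more loosely than a strictly typed DBMS (it stores the dates here as text), which is itself an example of the standardization issue, so treat this as an illustration rather than portable behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE visit (
        patient   VARCHAR(40),
        age       INT,
        visit_day DATE
    )
""")
conn.executemany(
    "INSERT INTO visit VALUES (?, ?, ?)",
    [("Ann", 41, "2003-05-01"), ("Bob", 59, "2003-05-08")],
)

# Numeric fields support arithmetic and aggregates.
average_age = conn.execute("SELECT AVG(age) FROM visit").fetchone()[0]

# Text fields support string functions.
names = [r[0] for r in conn.execute(
    "SELECT UPPER(patient) FROM visit ORDER BY patient")]

# Date fields support date arithmetic (days between first and last visit).
span_days = conn.execute(
    "SELECT julianday(MAX(visit_day)) - julianday(MIN(visit_day)) FROM visit"
).fetchone()[0]

print(average_age, names, span_days)
```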

    A word about database design:

    • Designing a database is not trivial
    • The value is not in the data, but in the structure
    • Design to facilitate the retrieval and interpretation of the data

    Design database for data extraction: think it through

    Example: BioInformatics Core Technology

    • Reusable core modules, with customizable components
    • Standard business logic framework controls transactions (middle layer)
    • Metadata-based back-end data storage (facilitates data sharing)

    BioInformatics Core Technology

    Data Security: High Priority

    HIPAA, FIPS 140-2 (VA), IRB requirements

    Life science has become a field which generates an enormous amount of un-integrated data.

    How can methods for data organization help to solve this problem?

    What is Data Integration?

    Creating a system which allows the extraction of a piece or set of information (query result) across multiple domains (possibly disparate data sources - flat files, databases, spreadsheets, URLs...)

    Understanding transcription factors for protein x production

    Show me all genes in the public literature that are putatively related to protein x, have more than 4-fold expression differential between affected and normal tissue and are homologous to known transcription factors.

    • Q1: Find homologs (SEQUENCE)
    • Q2: Find genes with 4-fold differential (EXPRESSION)
    • Q3: Show me genes in public literature (LITERATURE)

    (Q1 ∩ Q2 ∩ Q3)
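One way to read the combined question is that a gene must satisfy all three sub-queries at once. The sketch below illustrates that reading with plain Python sets; the gene names, fold-change values, and the `integrated_query` function are hypothetical, and a real system would fetch each set from its own sequence, expression, or literature source.

```python
# Hypothetical results from three independent sources.
sequence_hits = {"geneA", "geneB", "geneC"}     # Q1: homologs of known factors
expression = {"geneA": 6.0, "geneB": 2.5, "geneC": 4.8, "geneD": 9.1}
literature_hits = {"geneA", "geneC", "geneD"}   # Q3: linked to protein x


def integrated_query(homologs, fold_changes, literature, threshold=4.0):
    """Return the genes that satisfy all three sub-queries."""
    # Q2: keep genes whose expression differential exceeds the threshold.
    differential = {g for g, fc in fold_changes.items() if fc > threshold}
    return sorted(homologs & differential & literature)


result = integrated_query(sequence_hits, expression, literature_hits)
print(result)
```

Here `geneB` drops out because its differential is below the threshold, and `geneD` drops out because it has no sequence homology hit.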

    Approaches to Integration

    Where are the key issues addressed?

    • Federated database (poses constraints on original data sources; fragility in reliance on source systems)
    • Data warehousing (ETL layer, original data sources untouched, required understanding of domain, sophisticated update/archive processes)
    • Integrating data source profiles
    • Indexed flat files
    • Others
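The data-warehousing approach in the list above relies on an extract, transform, load (ETL) layer. The following is a minimal sketch of that layer under stated assumptions: the two sources, their field names, and the `extract`, `transform`, and `load` functions are all hypothetical, and SQLite (via `sqlite3`) stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical sources: a flat file and rows exported from a spreadsheet.
FLAT_FILE = "gene,fold_change\ngeneA,6.0\ngeneB,2.5\n"
SPREADSHEET = [
    {"Gene": "GENEA", "Citations": "12"},
    {"Gene": "GENEC", "Citations": "3"},
]


def extract():
    """Read both sources without modifying them."""
    expr_rows = list(csv.DictReader(io.StringIO(FLAT_FILE)))
    return expr_rows, SPREADSHEET


def transform(expr_rows, sheet_rows):
    """Normalize names and types so the two sources agree."""
    expr = [(r["gene"].lower(), float(r["fold_change"])) for r in expr_rows]
    lit = [(r["Gene"].lower(), int(r["Citations"])) for r in sheet_rows]
    return expr, lit


def load(conn, expr, lit):
    """Write the cleaned rows into warehouse tables."""
    conn.execute("CREATE TABLE expression (gene TEXT, fold_change REAL)")
    conn.execute("CREATE TABLE literature (gene TEXT, citations INTEGER)")
    conn.executemany("INSERT INTO expression VALUES (?, ?)", expr)
    conn.executemany("INSERT INTO literature VALUES (?, ?)", lit)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, *transform(*extract()))
rows = warehouse.execute(
    "SELECT e.gene, e.fold_change, l.citations FROM expression e "
    "JOIN literature l ON l.gene = e.gene"
).fetchall()
print(rows)
```

The transform step is where domain understanding is needed: the sketch only reconciles letter case, while real sources usually disagree on identifiers, units, and missing values as well.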

    Data Warehousing

    Metadata: one key to success

    • Data value: 55
    • Metadata values:
        • Data element name: vehicle speed
        • Unit: miles per hour
        • Description: the average velocity of a vehicle
    • Describes data types, relationships, histories, etc.
    • Back-end (supports developers), front-end (supports users and application)
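The vehicle-speed example can be written down as a small data structure. This is only a sketch: the `DataElement` and `Measurement` classes and the `describe` method are hypothetical names chosen for illustration, and the values are the ones given on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """Metadata that gives a raw value its meaning."""
    name: str
    unit: str
    description: str


@dataclass(frozen=True)
class Measurement:
    """A raw value paired with the metadata that describes it."""
    value: float
    element: DataElement

    def describe(self):
        return f"{self.element.name}: {self.value} {self.element.unit}"


vehicle_speed = DataElement(
    name="vehicle speed",
    unit="miles per hour",
    description="the average velocity of a vehicle",
)
reading = Measurement(55, vehicle_speed)
print(reading.describe())
```

Without the attached `DataElement`, the bare value 55 could be a speed, an age, or a count; the metadata is what makes it interpretable and shareable.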

    Standards: the final frontier

    • Naming conventions
    • Standard coordinate systems
    • Unify interpretations of single object types
    • Unify software solutions to the same problem (also data formats)
    • Standards for metadata (incompatible or missing metadata)

    New approach to integration: Cancer Biomarker Discovery

    • Network of distributed data silos (does not perturb data sources)
    • Centralized query and business logic servers, accessed through web interface
    • CORBA framework manages XML profile definitions across the web
    • A profile is a set of resource definitions implemented in XML for data sources residing in one or more distributed systems
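To give a feel for what a profile of resource definitions might look like, here is a sketch that parses one with Python's standard `xml.etree.ElementTree`. The XML layout, element names, attribute names, and hosts are all invented for illustration; the actual profile schema is not given in the slides, and the CORBA layer is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: element and attribute names are invented here.
PROFILE_XML = """
<profile id="example-profile">
  <resource name="expression" type="database" host="site-a.example.org"/>
  <resource name="literature" type="flatfile" host="site-b.example.org"/>
</profile>
"""


def load_profile(xml_text):
    """Return the profile id and its list of resource definitions."""
    root = ET.fromstring(xml_text.strip())
    resources = [dict(res.attrib) for res in root.findall("resource")]
    return root.get("id"), resources


profile_id, resources = load_profile(PROFILE_XML)
hosts = sorted({r["host"] for r in resources})
print(profile_id, hosts)
```

A central query server could use definitions like these to decide which distributed site to ask for which kind of data, while the sources themselves stay untouched.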
