
How to use Bio2RDF with a Virtuoso server.

ISMB2008 Technology track

François Belleau, Marc-Alexandre Nolin, Philippe Rigault

● Centre de Recherche du CHUL, Université Laval
● Département d'informatique et de génie logiciel, Université Laval

ISMB2005

Toronto, July 21, ISMB2008, CHUL research center - Laval University

Outline

● Introduction
− The data integration problem in bioinformatics
− What is the Bio2RDF Atlas
− Semantic Web 101
− The Bio2RDF approach
● How to
− install Virtuoso and Bio2RDF
− load N3 databases into the graph
− query the graph with SPARQL and text search
● Future work


How to solve the data integration problem in bioinformatics?


Vaugondy, geographer to Louis XV: a view of the world in the 18th century


The 2005 data integration problem in bioinformatics


Solution: use the Semantic Web model for data integration.


Linked data map around DBpedia, the RDF version of Wikipedia

http://wiki.dbpedia.org/Interlinking

Joanne Luciano's vision @ ISMB2005


Bio2RDF's 30 data sources loaded into the Atlas graph about human and mouse


We drag and drop databases converted to RDF into a triplestore


2008 Bio2RDF linked data map of Knowledge


Bio2RDF Semantic Web Atlas in numbers

● 30 different data sources, 30 different namespaces
− go, geneid, uniprot, pubmed, pdb, reactome, omim, etc.
● 195 namespaces referencing non-rdfized data sources
− cog, genethon, tigr, cath, goa, etc.
● 8 million topics
● 65 million triples
● 973 MB of compressed data in N3 format
− http://bio2rdf.org/download/bio2rdf-atlas-080414.n3.gz


All this knowledge with a 32 GB file and the Virtuoso server ...


Our approach

● Adopt W3C standards
− HTTP, RDF, OWL, SPARQL
● Apply the semantic web model to data integration in bioinformatics
− Follow Tim Berners-Lee's 4 rules about linked data
● Use existing software
− PiggyBank, Sesame, Virtuoso


Semantic Web 101


W3C standard: RDF

Everything is expressed as a triple.

<subject> <predicate> <object>
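As a rough illustration of the triple pattern, the sketch below serializes one statement in N-Triples syntax. The subject URI and the rdfs:label predicate appear elsewhere in these slides; the label text "hexokinase activity" and the helper name `to_ntriples` are assumptions made for the example.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def to_ntriples(subject: str, predicate: str, obj: str, obj_is_uri: bool) -> str:
    """Serialize one (subject, predicate, object) triple as an N-Triples line."""
    if obj_is_uri:
        obj_term = f"<{obj}>"
    else:
        # Plain literal: escape backslashes and double quotes.
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        obj_term = f'"{escaped}"'
    return f"<{subject}> <{predicate}> {obj_term} ."


# Illustrative triple; the label text is an assumption, not taken from the Atlas.
triple = ("http://bio2rdf.org/go:0004396", RDFS_LABEL, "hexokinase activity")
print(to_ntriples(*triple, obj_is_uri=False))
```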


W3C standard: SPARQL

A sort of SQL for RDF

SELECT ?type, count(*)
WHERE {
  ?s1 ?p1 ?o1 .
  ?o1 bif:contains "paget" .
  ?s2 ?p2 ?s1 .
  ?s2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}
ORDER BY DESC (count(*))


Tim Berners-Lee's 4 linked data rules

● Rule #1: Use URIs as names for things.
− Hexokinase is GO:0004396
● Rule #2: Use HTTP URIs so that people can look up those names.
− Dereferenceable URL http://bio2rdf.org/go:0004396
● Rule #3: When someone looks up a URI, provide useful information.
− http://bio2rdf.org/go:0004396 returns the RDF graph describing this topic
● Rule #4: Include links to other URIs so that they can discover more things.
− go:0004396 refers to pubmed:6341351
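Rules #1 and #2 amount to turning a database identifier into an HTTP URI. The sketch below does that for a `namespace:id` string. Lower-casing the namespace is an assumption inferred from the GO:0004396 example above, and `bio2rdf_uri` is a hypothetical helper, not part of the Bio2RDF software.

```python
def bio2rdf_uri(identifier: str, base: str = "http://bio2rdf.org/") -> str:
    """Map 'NAMESPACE:ID' to a dereferenceable URI of the form base + 'namespace:id'."""
    namespace, sep, local_id = identifier.partition(":")
    if not sep or not namespace or not local_id:
        raise ValueError(f"expected 'namespace:id', got {identifier!r}")
    # Assumption: namespaces are written in lower case in the URI.
    return f"{base}{namespace.lower()}:{local_id}"


print(bio2rdf_uri("GO:0004396"))
print(bio2rdf_uri("pubmed:6341351"))
```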


How does it work?


The semantic mashup effect

OR = 0 (MeSH)

OR = 1 (GeneID)

OR = 0.5 (PubMed)

mean OR = 0.5


The semantic mashup effect

OR = 0 (MeSH)

OR = 1 (GeneID)

OR = 0.5 (PubMed)

mean OR = 0.16

Triplestore


How to install Bio2RDF/Virtuoso


How to install Bio2RDF in Tomcat

● You need the Java JDK 6, downloaded from:
− http://java.sun.com/javase/downloads/index.jsp
● Download the Tomcat JSP server, version 6, from:
− http://tomcat.apache.org/download-60.cgi
● Install the Tomcat server
− Unzip the archive
− In the TOMCAT_ROOT/conf/server.xml configuration file, change <Connector port="8080" to <Connector port="80"
− Replace the content of TOMCAT_ROOT/conf/tomcat-users.xml with:

<?xml version='1.0' encoding='utf-8'?>
<tomcat-users>
  <role rolename="manager"/>
  <role rolename="admin"/>
  <user username="admin" password="" roles="admin,manager"/>
</tomcat-users>
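The port change in server.xml can also be scripted. The following is a minimal sketch using only the Python standard library. The trimmed-down `SAMPLE` document is a stand-in for a real server.xml, and `set_http_port` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def set_http_port(server_xml: str, old_port: str = "8080", new_port: str = "80") -> str:
    """Return server.xml text with the connector on old_port moved to new_port."""
    root = ET.fromstring(server_xml)
    for connector in root.iter("Connector"):
        if connector.get("port") == old_port:
            connector.set("port", new_port)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, trimmed-down server.xml used only to exercise the function.
SAMPLE = """<Server port="8005" shutdown="SHUTDOWN">
  <Service name="Catalina">
    <Connector port="8080" protocol="HTTP/1.1" redirectPort="8443"/>
    <Connector port="8009" protocol="AJP/1.3" redirectPort="8443"/>
  </Service>
</Server>"""

updated = set_http_port(SAMPLE)
print(updated)
```

Note that ElementTree drops XML comments and the XML declaration when re-serializing, so for a real server.xml a manual edit may be preferable.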


How to install Bio2RDF in Tomcat

● Define the bio2rdf.org domain name locally by adding to C:\WINDOWS\system32\drivers\etc\hosts:

127.0.0.1 localhost bio2rdf.org

● In command mode, change directory to TOMCAT_ROOT/bin and execute startup.bat; the Tomcat server will start in a new window.
● Now http://bio2rdf.org should respond.
● At http://bio2rdf.org/manager/html, connect using admin and an empty password.
● Undeploy path /
● Deploy the Bio2RDF application from http://bio2rdf.org/download/ROOT.war
● The Bio2RDF application should now be installed, and http://bio2rdf.org returns the home page.
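If http://bio2rdf.org does not respond locally, a first thing to check is the hosts entry. The sketch below tests whether hosts-file text maps a name to an address. It works on a string so it can be tried without touching the real file, and `hosts_entry_maps` is a hypothetical helper.

```python
def hosts_entry_maps(hosts_text: str, hostname: str, ip: str = "127.0.0.1") -> bool:
    """Return True if a non-comment line of hosts_text maps hostname to ip."""
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address and names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == ip and hostname in fields[1:]:
            return True
    return False


sample = "# local overrides\n127.0.0.1 localhost bio2rdf.org\n"
print(hosts_entry_maps(sample, "bio2rdf.org"))
```

To check the real file, read C:\WINDOWS\system32\drivers\etc\hosts into a string and pass it as `hosts_text`.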


How to install Virtuoso

● Download the software from SourceForge
− http://sourceforge.net/projects/virtuoso
● Install it and start the server
− Unzip the archive
− Change directory to virtuoso-opensource/database
− Run ..\bin\virtuoso-t -f in command mode
− http://localhost:8890/sparql should respond
− Connect to the Conductor using dba/dba at http://localhost:8890


How to install Virtuoso

● Go to the virtuoso\bin folder and execute isql

C:\Bio2RDF\Virtuoso\bin>isql
OpenLink Interactive SQL (Virtuoso), version 0.9849b.
Type HELP; for help and EXIT; to exit.
SQL>

● Add the needed indexes to the triplestore

CREATE BITMAP index RDF_QUAD_POGS on DB.DBA.RDF_QUAD (P,O,G,S);
CREATE BITMAP index RDF_QUAD_PSOG on DB.DBA.RDF_QUAD (P,S,O,G);
CREATE BITMAP index RDF_QUAD_SOPG on DB.DBA.RDF_QUAD (S,O,P,G);

● Create the full-text index and update it

DB.DBA.RDF_OBJ_FT_RULE_ADD ('', null, '');
DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ ();
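The three index names follow the column order they cover (POGS covers P, O, G, S, and so on). As a small sketch of that convention, the hypothetical helper below regenerates the statements from the column orders. It only builds strings to paste into isql and does not connect to Virtuoso.

```python
def bitmap_index_statement(order: str, table: str = "DB.DBA.RDF_QUAD") -> str:
    """Build a CREATE BITMAP index statement for one ordering of the S, P, O, G columns."""
    if sorted(order) != sorted("SPOG"):
        raise ValueError(f"order must be a permutation of S, P, O, G; got {order!r}")
    columns = ",".join(order)
    return f"CREATE BITMAP index RDF_QUAD_{order} on {table} ({columns});"


for order in ("POGS", "PSOG", "SOPG"):
    print(bitmap_index_statement(order))
```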


How to load data from N3 files

http://bio2rdf.org/download


How to load data from N3 files

Use the n32triplestore.pl Perl script

http://bio2rdf.org/download/n32triplestore.pl
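The slides do not show what n32triplestore.pl does internally. As an alternative, not the script's own method, an N3/Turtle file can be loaded from isql with Virtuoso's `DB.DBA.TTLP_MT` and `file_to_string_output` functions. The sketch below only builds that statement. The file path and graph name are hypothetical, the file is assumed to be already decompressed, and its directory is assumed to be listed under DirsAllowed in virtuoso.ini.

```python
def ttlp_statement(path: str, graph: str) -> str:
    """Build an isql statement that loads one Turtle/N3 file into the given graph."""

    def quote(value: str) -> str:
        # SQL string literal: double any embedded single quote.
        return "'" + value.replace("'", "''") + "'"

    return f"DB.DBA.TTLP_MT (file_to_string_output ({quote(path)}), '', {quote(graph)});"


# Hypothetical path and graph name, for illustration only.
print(ttlp_statement("C:/Bio2RDF/data/bio2rdf-atlas-080414.n3", "http://bio2rdf.org/atlas"))
```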


Bio2RDF demo using Virtuoso SPARQL server


What is known about "Paget disease"?


Popular web search engines without semantics


Integrated bioinformatics search engine


'Paget'->topic->rdf:type

SELECT ?type, count(*)
WHERE {
  ?s1 ?p1 ?o1 .
  ?o1 bif:contains "paget" .
  ?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}
ORDER BY DESC (count(*))


Submit the SPARQL query to Virtuoso server at http://bio2rdf.org/sparql
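Besides the web form, the query can be submitted over HTTP. The sketch below builds a GET URL with the standard `query` parameter. The `format` parameter and its value are an assumption about what the endpoint accepts, and `sparql_get_url` is a hypothetical helper. Nothing is sent over the network here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://bio2rdf.org/sparql"

# The 'Paget' -> topic -> rdf:type query from the slides, in Virtuoso's dialect.
QUERY = """SELECT ?type, count(*)
WHERE {
  ?s1 ?p1 ?o1 .
  ?o1 bif:contains "paget" .
  ?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}
ORDER BY DESC (count(*))"""


def sparql_get_url(endpoint: str, query: str,
                   result_format: str = "application/sparql-results+json") -> str:
    """Build a GET URL that carries the query as a URL-encoded parameter."""
    # Assumption: the endpoint honours a 'format' parameter for the result type.
    return endpoint + "?" + urlencode({"query": query, "format": result_format})


url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```

Passing `url` to `urllib.request.urlopen` would then run the query, provided the endpoint is reachable.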


Results : Number of topics by type (database) about Paget disease
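When results are requested in the W3C SPARQL JSON results format, they can be flattened into rows with a few lines of code. The sketch below assumes that format. The sample document is made up purely to exercise the parser: its type URI, variable names and count are placeholders, not actual Bio2RDF results.

```python
import json


def rows_from_sparql_json(text: str) -> list[dict[str, str]]:
    """Flatten a SPARQL JSON results document into a list of {variable: value} rows."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append({v: binding[v]["value"] for v in variables if v in binding})
    return rows


# Made-up sample input (placeholder values, not real query output).
SAMPLE = json.dumps({
    "head": {"vars": ["type", "n"]},
    "results": {"bindings": [
        {"type": {"type": "uri", "value": "http://example.org/TypeA"},
         "n": {"type": "typed-literal",
               "datatype": "http://www.w3.org/2001/XMLSchema#integer",
               "value": "2"}},
    ]},
})

rows = rows_from_sparql_json(SAMPLE)
print(rows)
```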


View results with Piggy Bank for a facet browsing experience


'Paget'->topic1<-topic2->GO->rdfs:label

SELECT ?l3, count(*)
WHERE {
  ?s1 ?p1 ?o1 .
  ?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
  ?o1 bif:contains "paget" .
  ?s2 ?p2 ?s1 .
  ?s2 <http://bio2rdf.org/bio2rdf#xGO> ?s3 .
  ?s3 <http://www.w3.org/2000/01/rdf-schema#label> ?l3 .
}
ORDER BY DESC (count(*))


Results : Number of GO terms by topic related to Paget disease


Bio2RDF's four services

● Describe a topic
− http://bio2rdf.org/go:0004396
● List of reverse links to a topic
− http://bio2rdf.org/links/go:0004396
● Full text search
− http://bio2rdf.org/search/hexokinase
● SPARQL endpoint to query
− http://bio2rdf.org/sparql
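The four URL patterns above are regular enough to build programmatically. The helpers below are a hypothetical sketch of that. Percent-encoding multi-word search terms is an assumption about how the search service expects its input.

```python
from urllib.parse import quote

BASE = "http://bio2rdf.org"
SPARQL_URL = f"{BASE}/sparql"


def describe_url(topic: str) -> str:
    """URL returning the RDF description of a topic such as 'go:0004396'."""
    return f"{BASE}/{topic}"


def links_url(topic: str) -> str:
    """URL listing the reverse links that point to a topic."""
    return f"{BASE}/links/{topic}"


def search_url(term: str) -> str:
    """URL for a full text search; the term is percent-encoded (assumption)."""
    return f"{BASE}/search/{quote(term)}"


print(describe_url("go:0004396"))
print(links_url("go:0004396"))
print(search_url("hexokinase"))
print(SPARQL_URL)
```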


Future work

● Build a community data provider in bioinformatics adopting the Banff Manifesto rules;
● Build the next version of Bio2RDF services using a distributed network of Virtuoso servers;
● Participate in the construction of the scientific linked data web to answer real science questions using SPARQL;
● Promote the adoption of the semantic web in our scientific community.


In short...

● The Bio2RDF project has rdfized many public data sources and normalized their URIs;
● This model enables scalability of complexity;
● You can drag and drop knowledge into a Virtuoso triplestore;
● RDF is useful if you can query it with SPARQL;
● If you do SQL, you can do SPARQL;
● We offer 4 new public services, try them;
● Download it and try it at home: install your own Bio2RDF/Virtuoso server.


Acknowledgments

Bioinformatics lab’s team at CHUL Research Center and the Bio2RDF community on Google.

Peter Ansell for his Java coding effort

Kingsley Idehen from OpenLink Software

Thanks to the essential annotators and data providers, and to the developers of open source projects: Sesame, Virtuoso and PiggyBank.


http://bio2rdf.org

Query the graph with SPARQL

http://bio2rdf.org/sparql

Download our software http://sourceforge.net/projects/bio2rdf/

Download the Atlas data in N3 format http://bio2rdf.org/download

Join our group http://groups.google.ca/group/bio2rdf

Contact us at [email protected]