How to use Bio2RDF with a Virtuoso server.
ISMB2008 Technology track
François Belleau, Marc-Alexandre Nolin, Philippe Rigault
● Centre de Recherche du CHUL, Université Laval
● Département d'informatique et de génie logiciel, Université Laval
Toronto, July 21, ISMB2008 CHUL research center - Laval University 3
Outline
● Introduction
− The data integration problem in bioinformatics
− What is the Bio2RDF Atlas
− Semantic web 101
− Bio2RDF approach
● How to
− install Virtuoso and Bio2RDF
− load N3 databases into the graph
− query the graph with SPARQL and text search
● Future work
How to solve the data integration problem in bioinformatics?
Vaugondy, geographer to Louis XV: view of the world in the 18th century
The 2005 data integration problem in bioinformatics
Solution: use the Semantic Web model for data integration.
Linked data map around DBpedia, the RDF version of Wikipedia
http://wiki.dbpedia.org/Interlinking
Bio2RDF's 30 data sources about human and mouse, loaded into the Atlas graph
We drag and drop databases converted to RDF into a triplestore
2008 Bio2RDF linked data map of Knowledge
Bio2RDF Semantic Web Atlas in numbers
● 30 different data sources, 30 different namespaces
− go, geneid, uniprot, pubmed, pdb, reactome, omim, etc.
● 195 namespaces referencing non-rdfized data sources
− cog, genethon, tigr, cath, goa, etc.
● 8 million topics
● 65 million triples
● 973 MB of compressed data in N3 format
− http://bio2rdf.org/download/bio2rdf-atlas-080414.n3.gz
All this knowledge with a 32 GB file and the Virtuoso server ...
Our approach
● Adopt W3C standards
− HTTP, RDF, OWL, SPARQL
● Apply the semantic web model to data integration in bioinformatics
− Follow Tim Berners-Lee's 4 rules about linked data
● Use existing software
− PiggyBank, Sesame, Virtuoso
W3C standard: RDF
Everything is expressed as a triple.
<subject> <predicate> <object>
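To make the triple model concrete, here is a minimal Python sketch that holds two triples in memory and prints them as N-Triples lines. The topic URIs come from these slides; the `xPubMed` predicate name is hypothetical (modelled on the `bio2rdf#xGO` predicate used in the queries), and the label text is only illustrative.

```python
from typing import NamedTuple

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
# Hypothetical predicate, modelled on bio2rdf#xGO; not confirmed by the slides.
X_PUBMED = "http://bio2rdf.org/bio2rdf#xPubMed"


class Triple(NamedTuple):
    subject: str             # always a URI
    predicate: str           # always a URI
    obj: str                 # a URI or a plain literal
    obj_is_uri: bool = True


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as a single N-Triples line."""
    if t.obj_is_uri:
        obj = f"<{t.obj}>"
    else:
        # Only backslashes and double quotes are escaped in this sketch.
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"<{t.subject}> <{t.predicate}> {obj} ."


triples = [
    Triple("http://bio2rdf.org/go:0004396", RDFS_LABEL, "hexokinase", obj_is_uri=False),
    Triple("http://bio2rdf.org/go:0004396", X_PUBMED, "http://bio2rdf.org/pubmed:6341351"),
]

for t in triples:
    print(to_ntriples(t))
```

Each printed line is one subject, predicate, object statement, which is the unit a triplestore indexes.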
W3C standard: SPARQL
A sort of SQL for RDF
SELECT ?type, count(*)
WHERE {
  ?s1 ?p1 ?o1 .
  ?o1 bif:contains "paget" .
  ?s2 ?p2 ?s1 .
  ?s2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}
ORDER BY DESC (count(*))
Tim Berners-Lee's 4 linked data rules
● Rule #1: Use URIs as names for things.
− Hexokinase is GO:0004396
● Rule #2: Use HTTP URIs so that people can look up those names.
− Dereferenceable URL http://bio2rdf.org/go:0004396
● Rule #3: When someone looks up a URI, provide useful information.
− http://bio2rdf.org/go:0004396 returns the RDF graph describing this topic
● Rule #4: Include links to other URIs so that they can discover more things.
− go:0004396 refers to pubmed:6341351
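Rules #1 and #2 can be illustrated with a small Python sketch that turns a database identifier such as GO:0004396 into the HTTP URI shown above. The function name is hypothetical, and lowercasing the namespace is an assumption inferred from the go:0004396 example, not a documented Bio2RDF rule.

```python
BASE = "http://bio2rdf.org/"


def bio2rdf_uri(identifier: str) -> str:
    """Map 'NAMESPACE:ID' to a URI of the form http://bio2rdf.org/namespace:id."""
    namespace, sep, local_id = identifier.partition(":")
    if not sep or not namespace or not local_id:
        raise ValueError(f"expected 'namespace:id', got {identifier!r}")
    # Assumption: namespaces are written in lowercase, as in go:0004396.
    return f"{BASE}{namespace.lower()}:{local_id}"


print(bio2rdf_uri("GO:0004396"))
print(bio2rdf_uri("pubmed:6341351"))
```

Because every topic gets a predictable HTTP URI, a link from one record to another (rule #4) is just another triple whose object is such a URI.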
The semantic mashup effect
OR = 0 (MeSH)
OR = 1 (GeneID)
OR = 0.5 (PubMed)
mean OR = 0.5
The semantic mashup effect
OR = 0 (MeSH)
OR = 1 (GeneID)
OR = 0.5 (PubMed)
mean OR = 0.16
Triplestore
How to install Bio2RDF/Virtuoso
How to install Bio2RDF in Tomcat
● You need the Java JDK 6, downloaded from:
− http://java.sun.com/javase/downloads/index.jsp
● Download the Tomcat JSP server, version 6, from:
− http://tomcat.apache.org/download-60.cgi
● Install the Tomcat server
− Unzip the archive
− In the TOMCAT_ROOT/conf/server.xml configuration file, change <Connector port="8080" to <Connector port="80"
− Replace the content of TOMCAT_ROOT/conf/tomcat-users.xml with:
<?xml version='1.0' encoding='utf-8'?>
<tomcat-users>
  <role rolename="manager"/>
  <role rolename="admin"/>
  <user username="admin" password="" roles="admin,manager"/>
</tomcat-users>
How to install Bio2RDF in Tomcat
● Define the bio2rdf.org domain name locally by adding this line to C:\WINDOWS\system32\drivers\etc\hosts:
127.0.0.1 localhost bio2rdf.org
● In command mode, change directory to TOMCAT_ROOT/bin and execute startup.bat; the Tomcat server will start in a new window.
● Now http://bio2rdf.org should respond.
● At http://bio2rdf.org/manager/html, connect using admin and an empty password.
● Undeploy path /
● Deploy the Bio2RDF application from http://bio2rdf.org/download/ROOT.war
● The Bio2RDF application should now be installed, and http://bio2rdf.org returns the home page.
How to install Virtuoso
● Download the software from SourceForge
− http://sourceforge.net/projects/virtuoso
● Install it and start the server
− Unzip the archive
− Change directory to virtuoso-opensource/database
− Run ..\bin\virtuoso-t -f in command mode
− http://localhost:8890/sparql should respond
− Connect to Conductor using dba/dba at http://localhost:8890
How to install Virtuoso
● Go to the virtuoso\bin folder and execute isql
− C:\Bio2RDF\Virtuoso\bin>isql
OpenLink Interactive SQL (Virtuoso), version 0.9849b.
Type HELP; for help and EXIT; to exit.
SQL>
● Add the needed indexes to the triplestore
− CREATE BITMAP index RDF_QUAD_POGS on DB.DBA.RDF_QUAD (P,O,G,S);
CREATE BITMAP index RDF_QUAD_PSOG on DB.DBA.RDF_QUAD (P,S,O,G);
CREATE BITMAP index RDF_QUAD_SOPG on DB.DBA.RDF_QUAD (S,O,P,G);
● Create the full text index and update it
− DB.DBA.RDF_OBJ_FT_RULE_ADD ('', null, '');
DB.DBA.VT_INC_INDEX_DB_DBA_RDF_OBJ ();
How to load data from N3 files
http://bio2rdf.org/download
How to load data from N3 files
Use the n32triplestore.pl Perl script
http://bio2rdf.org/download/n32triplestore.pl
Bio2RDF demo using Virtuoso SPARQL server
What is known about "Paget disease"?
Popular web search engines without semantics
Integrated bioinformatics search engine
'Paget'->topic->rdf:type
SELECT ?type, count(*)
WHERE {
  ?s1 ?p1 ?o1 .
  ?o1 bif:contains "paget" .
  ?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}
ORDER BY DESC (count(*))
Submit the SPARQL query to Virtuoso server at http://bio2rdf.org/sparql
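One way to submit the query programmatically is an HTTP GET with the query text in a `query` parameter, as the SPARQL protocol defines. The sketch below only builds that URL with the Python standard library and does not contact the server; the helper name is hypothetical, and whether the public endpoint still accepts this Virtuoso-specific query is not verified here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://bio2rdf.org/sparql"

# The 'Paget' -> topic -> rdf:type query from the slides (Virtuoso dialect).
QUERY = """SELECT ?type, count(*)
WHERE {
  ?s1 ?p1 ?o1 .
  ?o1 bif:contains "paget" .
  ?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
}
ORDER BY DESC (count(*))"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build the GET URL for a SPARQL request, URL-encoding the query text."""
    return f"{endpoint}?{urlencode({'query': query})}"


url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```

Passing `url` to `urllib.request.urlopen` would then send the request; the same function works against a local installation by using http://localhost:8890/sparql as the endpoint.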
Results: number of topics by type (database) about Paget disease
View results with Piggy Bank for a facet browsing experience
'Paget'->topic1<-topic2->GO->rdfs:label
SELECT ?l3, count(*)
WHERE {
  ?s1 ?p1 ?o1 .
  ?s1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
  ?o1 bif:contains "paget" .
  ?s2 ?p2 ?s1 .
  ?s2 <http://bio2rdf.org/bio2rdf#xGO> ?s3 .
  ?s3 <http://www.w3.org/2000/01/rdf-schema#label> ?l3 .
}
ORDER BY DESC (count(*))
Results: number of GO terms by topic related to Paget disease
Bio2RDF four services
● Describe a topic
− http://bio2rdf.org/go:0004396
● List of reverse links to a topic
− http://bio2rdf.org/links/go:0004396
● Full text search
− http://bio2rdf.org/search/hexokinase
● SPARQL endpoint to query
− http://bio2rdf.org/sparql
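The four URL patterns above are regular enough to generate from a topic and a search term. The following Python sketch does that; the function and key names are hypothetical, and percent-encoding the search term is an assumption for terms that contain spaces, since the slides only show a single-word search.

```python
from urllib.parse import quote

BASE = "http://bio2rdf.org"


def service_urls(topic: str, search_term: str) -> dict:
    """Return the URL of each of the four services for one topic and one term."""
    return {
        "describe": f"{BASE}/{topic}",
        "links": f"{BASE}/links/{topic}",
        # Assumption: multi-word terms are percent-encoded in the path.
        "search": f"{BASE}/search/{quote(search_term)}",
        "sparql": f"{BASE}/sparql",
    }


for name, url in service_urls("go:0004396", "hexokinase").items():
    print(name, url)
```

With a local installation and the hosts file entry described earlier, the same URLs resolve to the local Tomcat server instead of the public one.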
Future work
● Build a community data provider in bioinformatics adopting the Banff Manifesto rules;
● Build the next version of the Bio2RDF services using a distributed network of Virtuoso servers;
● Participate in the construction of the scientific linked data web to answer real science questions using SPARQL;
● Promote the adoption of the semantic web in our scientific community.
In short...
● The Bio2RDF project has rdfized many public data sources and normalized their URIs;
● This model enables scalability of complexity;
● You can drag and drop knowledge into the Virtuoso triplestore;
● RDF is useful if you can query it with SPARQL;
● If you do SQL you can do SPARQL;
● We offer 4 new public services, try them;
● Download it and try it at home: install your own Bio2RDF/Virtuoso server.
Acknowledgments
Bioinformatics lab's team at CHUL Research Center and the Bio2RDF community on Google.
Peter Ansell for his Java coding effort
Kingsley Idehen from OpenLink Software
Thanks to the essential annotators and data providers, and to the developers of the open source projects Sesame, Virtuoso and PiggyBank.
http://bio2rdf.org
Query the graph with SPARQL http://bio2rdf.org/sparql
Download our software http://sourceforge.net/projects/bio2rdf/
Download the Atlas data in N3 format http://bio2rdf.org/download
Join our group http://groups.google.ca/group/bio2rdf
Contact us at [email protected]