ACNUC training - 3 March 2010 - Pôle Informatique du LBBE - CNRS - Université de Lyon

Plan

1. Introduction

2. Querying sequence databases (60%)

3. Building your own sequence databases (30%)

4. Use of API (10%)

5. Going further

Introduction

1. History

2. A database system and a query tool

3. General principle of ACNUC

4. Access to the programs and databases

5. Workshop outline

Introduction: History

ACNUC is a database management system dedicated to biological sequences, genomic sequences in particular.

Its development began in 1980.

It serves both as a query tool and as a low-level layer for software development.

It remains the only software that lets users query, transparently, the sub-sequences of the sequences present in the databases.

Recent developments with Stéphane Delmote make it possible to query databases remotely through a socket server.

Introduction: Principle

The general principle of ACNUC relies on indexing annotated sequence files (EMBL, GenBank, SwissProt, ...).

The various annotation fields are indexed in index files (names, species, keywords, etc.) that are linked to one another through pointers.
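As a rough illustration of this indexing principle, the Python sketch below builds two small in-memory indexes (by species and by keyword) over a few annotated records and links them back to the records by position. The records, the field names and the `build_index` helper are invented for the example; the real ACNUC index files are organised differently.

```python
from collections import defaultdict

# Toy annotated records; names and fields are made up for the illustration.
records = [
    {"name": "SEQ1", "species": "mus musculus", "keywords": ["btg1", "cds"]},
    {"name": "SEQ2", "species": "homo sapiens", "keywords": ["btg1"]},
    {"name": "SEQ3", "species": "mus musculus", "keywords": ["adenylate cyclase"]},
]


def build_index(records, field):
    """Map each value of `field` to the positions of the records carrying it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        values = record[field]
        if isinstance(values, str):
            values = [values]
        for value in values:
            # The position plays the role of a pointer to the record.
            index[value].append(position)
    return dict(index)


species_index = build_index(records, "species")
keyword_index = build_index(records, "keywords")

# "Which sequences carry the keyword btg1?" is answered from the index alone.
hits = [records[p]["name"] for p in keyword_index["btg1"]]
print(hits)
```

The point of the design is that a query only touches the index, never the flat files, until the sequences themselves are needed.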

Introduction: Access to the programs and databases

The programs, the databases and the documentation are available on the PBIL website:

http://pbil.univ-lyon1.fr/

Introduction: Workshop outline

Several exercises and examples of applications will be discussed during the workshop.

This presentation and several scripts are available at:

ftp://pbil.univ-lyon1.fr/pub/in2p3/formation_acnuc/

GENERAL DOCUMENTATION:

http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html

QUERY LANGUAGE DOCUMENTATION:

http://pbil.univ-lyon1.fr/databases/acnuc/cfonctions.html#QUERYLANGUAGE

Querying sequence databases

1. First steps with ‘QueryWin’

2. The query language

• simple queries

• sequences and sub-sequences

• complex queries

3. Data extraction

• several formats

• extracting specific parts of the sequences

4. Using ‘query’

• simple scripts

• complex scripts

5. Using ‘seqinR’

• querying databases from R

First steps with QueryWin

« QueryWin » works on all platforms: Unix/Linux, Mac, Windows.

Two versions are available:

the « local version » works on local databases

the « client version » works on remote databases

Available at PBIL:

http://pbil.univ-lyon1.fr/software/query_win.html

Documentation available at PBIL

http://pbil.univ-lyon1.fr/software/doclogi/docacnuc/acnucwin/acnwian/aquerywin.html

First steps with QueryWin

Launch Query_Win - Mac version: click on the application

First steps with QueryWin

Launch Query_Win - on the clusters (local version)

launch query_win on EMBL:

>query_win embl

command window - query language

command buttons

First steps with QueryWin

Two (non-exclusive) ways of querying the database:

1. using buttons and menus

2. using the query language

Exercise 1: select mouse sequences in EMBL

• method 1:

Click on the « select » button, then on « species », and type « mus » in the window that opens.

Choose the option « build query ».

Have a look at the command window.

Execute.

Try again with the option « make list ».

First steps with QueryWin

Exercise 1 (continued)

• method 2: type « sp=mus » in the command window

IMPORTANT:

Queries built with method 1 are displayed in the query language in the command window.

This is an excellent way to learn the query language.

From now on, try to answer each question with the buttons and menus and observe how it is translated into the query language.

Little by little, you may try to use the query language directly.

One more thing: a « HELP » mode is available in Query_Win.

The query language: simple queries

All operations are possible with query_win (by clicking on buttons or using the query language)

Some simple examples :

- retrieve a sequence by its name

- retrieve a sequence by its accession number

- retrieve sequences by species or taxon

- retrieve sequences by keyword

Other examples:

- Which species is associated with this sequence?

- Which keywords are associated with this sequence?

The query language: simple queries

ACNUC query language is described here:

http://pbil.univ-lyon1.fr/databases/acnuc/cfonctions.html#QUERYLANGUAGE

Exercise 2 :

Query SwissProt

Retrieve the cat (Felis catus) sequences using the buttons

Retrieve the cat (Felis catus) sequences using the query language

Compare the results

Exercise 2bis:

Query SwissProt

Retrieve the sequences attached to the taxonomic ID (TaxonID) of the genus Felis (tid=9682)

The query language: simple queries

Exercise 3 :

Query SwissProt

Retrieve the sequences associated with the keyword « adenylate cyclase » using the buttons

Retrieve the sequences associated with the keyword « adenylate cyclase » using the query language

Check the different annotation fields. Where does « adenylate cyclase » appear?

Do the same with GenBank

The query language: simple queries

Exercise 4 :

Query GenBank

Retrieve the sequences associated with the BTG1 gene

Check the different annotation fields. Where is the information on the gene?

Do the same with SwissProt

Hint: the gene name is a keyword

The query language: simple queries

Use of the « wildcard » character: @

To retrieve the keywords beginning with « toto », search for toto@.

Exercise 5:

Retrieve the sequences associated with keywords beginning with BTG

Note: the wildcard can also be used in species names and sequence names.
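To make the behaviour of the @ character concrete, here is a small Python sketch that mimics it outside ACNUC by turning a pattern into a regular expression. It is only an analogy: the helper names are invented, and the case-insensitive matching is an assumption of this sketch rather than a documented property of the query language.

```python
import re


def wildcard_to_regex(pattern):
    """Compile a pattern in which @ stands for any run of characters."""
    # Escape every literal piece, then join the pieces with ".*".
    parts = [re.escape(part) for part in pattern.split("@")]
    # Assumption of this sketch: matching ignores case.
    return re.compile(".*".join(parts), re.IGNORECASE)


def match_all(pattern, names):
    """Return the names that match the whole pattern."""
    regex = wildcard_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]


# Made-up keyword list.
keywords = ["BTG1", "BTG2", "BTG3", "ABTG", "adenylate cyclase"]
print(match_all("BTG@", keywords))
```

Only the keywords that begin with BTG are kept; « ABTG » is rejected because the wildcard is placed at the end of the pattern only.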

The query language: sequences & sub-sequences

One of the main strengths of ACNUC is the definition and use of sequences and sub-sequences.

ID   ESCOL3_3; SV 2; circular; genomic DNA; GRV; PRO; 5498450 BP.
XX
AC   BA000007_GR;
XX
blah blah blah
XX
CC   This Genome Reviews entry was created from entry BA000007.2 in the
CC   EMBL/Genbank/DDBJ databases on 03 March 2009.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..5498450
FT                   /organism="GR Escherichia coli"
FT                   /strain="Sakai = O157:H7 = RIMD 0509952 = EHEC"
FT                   /mol_type="genomic DNA"
FT                   /chromosome="Chromosome"
FT                   /db_xref="taxon:386585"
FT   .5F1 5'ncr      1..189
FT                   /cds_name="ESCOL3_3.PE1 "
FT   .PE1 CDS        190..273
FT                   /codon_start=1
FT                   /gene_name="thrL"
FT                   /locus_tag="ECs0001"
FT                   /protein_id="BAB33424.1"
FT                   /transl_table=11
FT                   /translation="MKRISTTITTTITTTITITITTGNGAG"
FT   .3F1 3'ncr      274..353
FT                   /cds_name="ESCOL3_3.PE1 "
FT   misc_structure  215..328
FT                   /gene_name="Thr_leader"
FT                   /db_xref="Rfam:RF00506"
FT   .5F2 5'ncr      274..353
FT                   /cds_name="ESCOL3_3.PE2 "
FT   .PE2 CDS        354..2816
FT                   /codon_start=1
FT                   /gene_name="thrA"
FT                   /locus_tag="ECs0002"
FT                   /product="Aspartokinase I, homoserine dehydrogenase I "
FT                   /function="NADP or NADPH binding"
FT                   /function="amino acid binding"
FT                   biosynthetic process"
FT                   /protein_id="BAB33425.1"
FT                   /db_xref="GO:0004072"
FT                   /db_xref="UniProtKB/TrEMBL:Q8XA84"
FT                   /transl_table=11

etc.

[Figure: a sequence oriented 5’ to 3’ carrying three genes, numbered 1 to 3, each split into 5’ncr, CDS and 3’ncr sub-sequences]

The query language: sequences & sub-sequences

ACNUC defines sequences and sub-sequences.

A sequence may contain many sub-sequences.

For example, a chromosome is a sequence and its CDS are sub-sequences of it.

A sub-sequence may be of several types.

Exercise 6:

Query HOGENOMDNA (complete genomes)

Retrieve the sequences of Escherichia coli o157:h7 str. sakai

Question: what are these sequences?

Retrieve the sub-sequences of chromosome ESCOL3_3

Question: of which types are these sub-sequences?

Retrieve the CDS of chromosome ESCOL3_3

Go back to the sequence ESCOL3_3 and look for the CDS in its annotations
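The relation between a sequence and its sub-sequences can be pictured with a short Python sketch: a parent sequence plus a list of typed locations, from which each sub-sequence is cut out on demand. The data, the names (TOY.PE1, etc.) and the helper functions are invented for the illustration and say nothing about how ACNUC stores this information internally.

```python
# A made-up parent sequence; real chromosomes are millions of residues long.
parent = "ACGTACGTAGCTAGCTAGGATCCATG"

# Sub-sequences described as (name, type, start, end), 1-based and inclusive,
# the convention of EMBL/GenBank feature locations such as 190..273.
features = [
    ("TOY.5F1", "5'ncr", 1, 6),
    ("TOY.PE1", "CDS", 7, 18),
    ("TOY.3F1", "3'ncr", 19, 26),
]


def subsequence(parent, start, end):
    """Return residues start..end (1-based, inclusive) of the parent."""
    return parent[start - 1:end]


def select(features, wanted_type):
    """Keep the sub-sequences of one type, in the spirit of the criterion t=cds."""
    return [f for f in features if f[1].lower() == wanted_type.lower()]


for name, ftype, start, end in select(features, "cds"):
    print(name, subsequence(parent, start, end))
```

Only the locations are stored alongside the parent, so the sub-sequences never need to be duplicated.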

The query language: sequences & sub-sequences

A sequence is associated with a single species.

All of its sub-sequences are associated with that species.

This is not the case for keywords: a keyword may be associated with a whole sequence or with only one of its sub-sequences.

Exercise 7:

Query SwissProt

Retrieve the sequences associated with the BTG1 gene

Do the same in GenBank

What are these sequences?

Hint: the gene name is a keyword

The query language: complex queries

Combinations of criteria:

• Operators AND, OR, NOT, AND NOT

• Use of parentheses

• Crossing result lists

Exercise 8:

Query SwissProt

Retrieve the mammalian sequences

Retrieve the sequences associated with BTG1

Cross these 2 lists: list1 AND list2

Retrieve the mammalian sequences associated with BTG1 in a single query

Retrieve the mammalian sequences associated with BTG1, BTG2, BTG3 and BTG4 in a single query. How many sequences do you obtain?

Hint: beware of how OR and AND combine
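Why the grouping of AND and OR matters can be checked with plain Python sets standing in for result lists. The sequence names are invented, and the sketch makes no claim about the precedence rules of the ACNUC query language itself; it only shows that the two groupings give different answers, which is why parentheses are worth writing.

```python
# Made-up lists of sequence names, standing in for query results.
mammals = {"BTG1_HUMAN", "BTG1_MOUSE", "BTG2_HUMAN", "ALBU_BOVIN"}
btg1 = {"BTG1_HUMAN", "BTG1_MOUSE", "BTG1_CHICK"}
btg2 = {"BTG2_HUMAN", "BTG2_XENLA"}

# Crossing two lists (list1 AND list2) is a set intersection.
mammal_btg1 = mammals & btg1

# Grouped as: mammals AND (BTG1 OR BTG2)
grouped = mammals & (btg1 | btg2)

# Grouped as: (mammals AND BTG1) OR BTG2
ungrouped = (mammals & btg1) | btg2

print(sorted(grouped))
print(sorted(ungrouped))  # also contains the non-mammalian BTG2_XENLA
```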

The query language: complex queries

Other criteria:

year of publication: y<1986

author of publication: au=marley

journal: same principle

molecule: m=mRNA

organelle: o=MITOCHONDRION

type: t=CDS

host: h=homo sapiens

status (not for GenBank): st=EST

The query language: complex queries

A list of sequences can be modified according to sequence dates or sequence lengths.

Exercise 10:

Query SwissProt

Retrieve the sequences from mus

Select the sequences longer than 300 aa

Select the sequences added after the year 2000

The query language: complex queries

Exercise 11:

Query SwissProt

In which species is BTG1 found in the sequence annotations? (This does not mean that other species lack this gene.)

Solution: retrieve the sequences associated with the gene, then retrieve the species associated with these sequences.

Exercise 11bis:

Do the same in a single command line

Exercise 12:

Retrieve the names of all the strains of E. coli found in EMBL

Exercise 12bis:

Retrieve the list of eukaryotes in HOGENOMDNA.

Retrieve the list of fungi.

Hint: projecting onto species (ps)

The query language: browsing taxonomy and keywords

Both the taxonomy and the keywords are organised in a hierarchy.

These hierarchies can be browsed with the « browse » button of Query_Win.

A keyword may have a « parent ».

For example, EC numbers are keywords that all descend from the keyword « EC_Number ».

This is very useful to sort and select keywords.

You may select a parent keyword in Query_Win by clicking the « by name » button, entering the word, then clicking « exec » and « done ».

The query language: browsing taxonomy and keywords

Exercise 13:

Query SWISSPROT

Retrieve all the keywords associated with human

There are too many keywords!

We only want EC numbers:

Retrieve the keywords descending from « EC_NUMBERS »

How many are there?

Exercise 13bis:

Retrieve the EC_NUMBERS associated with human

Vocabulary:

pk list

kd list

(nk=)

The query language: complex queries

Use of files

You may use files containing:

sequence names

sequence accession numbers

keywords

species

Exercise 14:

In UniProt, retrieve the human EC numbers from the file created in exercise 13bis. Which mouse sequences are associated with these EC numbers?

Vocabulary:

fk file

un list

ps list

The query language: scanning annotations

It is possible to scan the annotations.

This is worthwhile when the word to look for is not indexed and the list of sequences to scan is not too large.

Data extraction: several formats

Exercise 15:

Query HOGENOMDNA

Select the yeast (saccharomyces) sequences

Extract the chromosome sequences in FASTA format

Extract the CDS sequences, translated into protein, in FASTA format

Data extraction: extracting parts of sequences

Exercise 16:

In HOGENOMDNA

Select the yeast (saccharomyces) sequences

Extract the CDS sequences in FASTA format

Extract the CDS sequences in EMBL format

Extract the 5’ non-coding sequences in FASTA format

Extract the first 1000 residues of each chromosome in FASTA format

Extract the 500 residues preceding each CDS in FASTA format
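For readers who want to check what the last two extractions amount to, here is a Python sketch of the coordinate arithmetic involved. It works on a toy string rather than on an ACNUC database, handles the direct strand only, and all function names are invented for the example.

```python
def first_residues(sequence, n):
    """Return the first n residues (fewer if the sequence is shorter)."""
    return sequence[:n]


def upstream(sequence, cds_start, n):
    """Return up to n residues located just before a CDS on the direct strand.

    cds_start is 1-based, as in feature locations. The region is truncated
    when the CDS lies closer than n residues to the start of the sequence.
    """
    begin = max(0, cds_start - 1 - n)
    return sequence[begin:cds_start - 1]


def to_fasta(name, sequence, width=60):
    """Format one record as FASTA with lines of at most `width` residues."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


chromosome = "AAACCCGGGTTTATGAAATGA"  # toy data, with a CDS starting at position 13
print(to_fasta("toy_upstream", upstream(chromosome, cds_start=13, n=5)))
```

A CDS on the complementary strand would need the region after its end coordinate, reverse-complemented; that case is left out of the sketch.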

Use of query

« query » is the command-line version of query_win.

Its main benefit is that it can be driven by scripts.

This helps automate the processing, which is very useful in the following cases:

- long series of queries that are tedious to retype each time: fewer errors, time saved

- use in workflows

- generic scripts reused for different purposes

- use on clusters and farms.

Use of query: launching

As with Query_Win, 2 versions are available:

local version (installed on pbil, pbil-dev and the pbil-debX workers)

client version (queries remote databases)

Both are available for Linux/Unix, MacOS and Windows.

Local version: query embl

>query embl

Client version: queryr embl

>raa_query

then choose the database, or directly:

>raa_query pbil.univ-lyon1.fr:5558/embl

Use of query: instructions

« query » uses the same query language as query_win.

However, there are small differences, especially in the management of lists.

Do not hesitate to consult the help by typing HELP.

Exercise 17

Query HOGENOMDNA (complete genomes)

Retrieve sequences of Escherichia coli o157:h7 str. sakai

Retrieve sub-sequences of chromosome ESCOL3_3

Retrieve CDS of chromosome ESCOL3_3

Use of query: instructions

Solution to exercise 17

Query HOGENOMDNA (complete genomes)

Retrieve sequences of Escherichia coli o157:h7 str. sakai

Retrieve sub-sequences of chromosome ESCOL3_3

Retrieve CDS of chromosome ESCOL3_3

Save CDS

Each instruction typed (left) is followed by its meaning (right):

query hogenomdna
sel                                       select a list (default: list1)
sp=Escherichia coli O157:h7 str. sakai    selection criterion
mod                                       modify list
list1                                     list to be modified
5                                         type of modification
sel                                       select a new list (default: list3)
n=ESCOL3_3 et t=cds                       selection criterion
save                                      save list
list3                                     list to be saved
list_cds                                  file
stop                                      exit query

Use of query: use of scripts

A script is used as follows (« banque » stands for the name of the database):

query banque << EOF
instructions
instructions
instructions
instructions
EOF
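When many such scripts have to be produced, their text can itself be generated. The Python sketch below assembles the here-document for the instructions of exercise 17; the `build_query_script` helper is invented for the illustration and only produces text, it does not run « query ».

```python
def build_query_script(bank, instructions):
    """Return the text of a shell script that feeds `instructions`
    to `query` through a here-document, one instruction per line."""
    body = "\n".join(instructions)
    return "query " + bank + " << EOF\n" + body + "\nEOF\n"


# Instructions copied from the interactive solution to exercise 17.
instructions = [
    "sel",
    "sp=Escherichia coli O157:h7 str. sakai",
    "mod",
    "list1",
    "5",
    "sel",
    "n=ESCOL3_3 et t=cds",
    "save",
    "list3",
    "list_cds",
    "stop",
]

script = build_query_script("hogenomdna", instructions)
print(script)
```

The resulting text can be written to a file and run with csh, in the same way as the example scripts of the workshop.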

Use of query: use of scripts

Exercise 18

Run the preceding exercise with a script.

In addition, extract the CDS in FASTA format.

source exemple_script_1.csh

or

csh exemple_script_1.csh

Use of query: use of scripts

Exercise 19

source exemple_script_2.csh or csh exemple_script_2.csh

sel/l=plant

Giving a name to the list makes the script easier to write and to understand.

This script selects the homologous gene families (HOGENOM families) shared by plants and cyanobacteria but not by animals.

The Arabidopsis CDS present in these families are saved and extracted in FASTA format.

Use of query: use of scripts

Use of a script with arguments:

csh exemple_script_3_bis.csh viridiplantae cyanobacteria metazoa

Use of seqinR

ACNUC databases can be queried from R with the seqinR package.

Exercise 17ter

With R:

Query HOGENOMDNA (complete genomes)

Retrieve the CDS of Escherichia coli o157:h7 str. sakai

Plot the histogram of CDS lengths

Use of seqinR

Solution to exercise 17ter

install.packages("seqinr")
library("seqinr")
choosebank("hogenomdna")
query("cds", "sp=Escherichia coli o157:h7 str. sakai et t=cds")
lengths <- lapply(cds$req, getLength)
hist(unlist(lengths))

Build your own ACNUC database

Why?

1. To store and access sequences of interest:

• selection and modification of a subset of a general-purpose database

• sequencing data

2. To allow complex queries

3. To create your own keywords and the associated hierarchy

4. To automate queries

5. To share and distribute data

Build your own ACNUC database

How to select a local database:

the indexes are in /ma_banque/index

the flat files are in /ma_banque/flat_files

Define an environment variable named after the database, holding the values of acnuc (indexes) and gcgacnuc (flat files):

setenv mabase "/ma_banque/index /ma_banque/flat_files"

query mabase

Build your own ACNUC database

Build a database from annotated data

Script: build_uniprot.csh

initf: creates the indexes

acnucgener: indexes the sequences

Documentation:

http://pbil.univ-lyon1.fr/databases/acnuc/acnuc_gestion.html

Exercise 20

build a database in SWISSPROT format

Build your own ACNUC database

Build a database from annotated data

Script: build_embl.csh

initf: creates the indexes

acnucgener: indexes the sequences

Documentation:

http://pbil.univ-lyon1.fr/databases/acnuc/acnuc_gestion.html

Exercise 21

build a database in EMBL format

Build your own ACNUC database

Defining new keywords (EMBL/GenBank only)

By default, many fields are used to define the keywords. However, it is possible to specify additional fields from which keywords are defined.

Example: search for the keyword HBG298754 in the EMBL database created previously.

The keyword is not found, even though the field /gene_family="HBG298754" exists (see sequence ECODH_1.PE2).

Exercise 22

Rebuild the database with

build_embl_customized.csh...

Query for the keyword again.

Build your own ACNUC database

Defining new keywords (EMBL/GenBank only)

Use the file « custom_policy », which should be in the directory $acnuc (index).

custom_policy file:

Qualifier = GENE_FAMILY Use_Value = True Parent_Keyword = GENE_FAMILY
Qualifier = DB_XREF Use_Value = True Parent_Keyword = CROSS REFERENCES
Qualifier = PROTEIN_ID Use_Value = True Parent_Keyword = PROTEIN IDS
Qualifier = %(C+G) Use_Value = True Parent_Keyword = CG_CONTENTS
Qualifier = LOCUS_TAG Use_Value = True Parent_Keyword = LOCUS_TAG

ECODH_1.PE2 Location/Qualifiers (length=2463 bp)

FT   CDS             337..2799
FT                   /codon_start=1
FT                   /gene_family="HBG298754"
FT                   /evidence="4: Predicted"
FT                   /gene_id="IGI03726849"
FT                   /gene_name="thrA"
FT                   /locus_tag="ECDH10B_0002"
FT                   /product="Fused aspartokinase I and homoserine
FT                   dehydrogenase I"
FT                   /function="NADP or NADPH binding"
FT                   /function="amino acid binding"
FT                   /function="homoserine dehydrogenase activity"
FT                   /biological_process="aspartate family amino acid
FT                   biosynthetic process"
FT                   /protein_id="ACB01207.1"
FT                   /db_xref="GO:0004072"
FT                   /db_xref="InterPro:IPR001048"
FT                   /db_xref="UniProtKB/TrEMBL:B1XBC7"
FT                   /transl_table=11
FT                   /%(C+G)="CG<60%"
FT                   /note="C+G content in third codon positions = 57.6 % "
//

Build your own ACNUC database

Enrich annotations and create keywords

You may enrich the annotations with suitable keywords.

For example, the following lines

FT                   /gene_family="HBG298754"
FT                   /%(C+G)="CG<60%"
FT                   /note="C+G content in third codon positions = 57.6 % "

have been added so that the database can be queried by GC content or by gene family.
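The link between such qualifiers and the keywords they produce can be sketched in Python. The sketch assumes the one-line « Qualifier = ... Use_Value = ... Parent_Keyword = ... » layout shown on the slides; the parsing functions are invented for the illustration and do not reproduce what acnucgener actually does when it reads the policy file.

```python
import re

# Assumed layout of one policy line, as printed on the slides.
POLICY_LINE = re.compile(
    r"Qualifier\s*=\s*(?P<qualifier>\S+)\s+"
    r"Use_Value\s*=\s*(?P<use_value>\w+)\s+"
    r"Parent_Keyword\s*=\s*(?P<parent>.+)"
)


def parse_policy(text):
    """Map each qualifier name (lower-cased) to its parent keyword."""
    policy = {}
    for line in text.splitlines():
        match = POLICY_LINE.match(line.strip())
        if match and match.group("use_value").lower() == "true":
            policy[match.group("qualifier").lower()] = match.group("parent").strip()
    return policy


def keywords_from_feature(feature_lines, policy):
    """Return (keyword, parent keyword) pairs for qualifiers named in the policy."""
    found = []
    for line in feature_lines:
        match = re.match(r'FT\s+/([^=]+)="([^"]*)"', line)
        if match and match.group(1).lower() in policy:
            found.append((match.group(2), policy[match.group(1).lower()]))
    return found


policy = parse_policy(
    "Qualifier = GENE_FAMILY Use_Value = True Parent_Keyword = GENE_FAMILY\n"
    "Qualifier = %(C+G) Use_Value = True Parent_Keyword = CG_CONTENTS\n"
)
feature = [
    'FT                   /gene_family="HBG298754"',
    'FT                   /%(C+G)="CG<60%"',
    'FT                   /note="C+G content in third codon positions = 57.6 % "',
]
print(keywords_from_feature(feature, policy))
```

The /note qualifier yields nothing because it is not listed in the policy, which mirrors the situation where HBG298754 could not be found before the database was rebuilt.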

Build your own ACNUC database

Enrich annotations and create keywords

Going further: modify the annotations and create an associated custom_qualifier_policy file.

Exercise 23

Modify custom_policy to generate different keywords

2 examples: custom_qualifier_policy.hogenom and custom_qualifier_policy.tp

Build your own ACNUC database

Build a database from raw sequence data (FASTA)

ACNUC databases are built from files in SwissProt, EMBL or GenBank format. A FASTA file must first be converted into one of these formats before the database can be built.

UniProt: BioPerl script

gener_prot.pl Chlre4_best_proteins_small.fasta Chlre4_best_proteins.dat CHLRE

EMBL/GenBank: readseq

http://www.ebi.ac.uk/cgi-bin/readseq.cgi

Build your own ACNUC database

Build a database from raw sequence data (FASTA)

Exercise 24

Transform the ecoli_dna.fasta file into EMBL format and build an ACNUC database

Build your own ACNUC database

Sample sequence management

You may want to run queries such as:

Retrieve all the sequences sent for sequencing on 15/02/2010

Retrieve all the sequences sent for sequencing on 15/02/2010 and associated with the « toto » experiment

Retrieve all the sequences associated with the « toto » experiment and the « tata » species

Build your own ACNUC database

Steps

Step 1: Clean and annotate the sequences

Step 2: Transform the FASTA file into an EMBL file (readseq)

Step 3: Add keywords such as the date the sequence was obtained, the experiment name, etc.

Step 4: Build the database
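Steps 2 and 3 can be pictured with the following Python sketch, which reads FASTA text and writes a pared-down EMBL-style entry carrying a KW (keyword) line. It is an illustration only: the entry contains just the ID, KW and SQ lines, the keyword names (EXP_TOTO, DATE_2010_02_15) are hypothetical, and whether acnucgener accepts so minimal an entry has not been checked. For real data, readseq remains the conversion tool named in the workshop.

```python
def read_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            words = line[1:].split()
            name, chunks = (words[0] if words else "unnamed"), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sequence_lines(sequence):
    """Lay the sequence out in EMBL style: 60 residues per line in blocks
    of 10, with the running residue count right-aligned at column 80."""
    lines = []
    for start in range(0, len(sequence), 60):
        chunk = sequence[start:start + 60]
        blocks = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        left = ("     " + " ".join(blocks)).ljust(70)
        lines.append(left + str(start + len(chunk)).rjust(10))
    return lines


def embl_like_entry(name, sequence, keywords):
    """Build a pared-down EMBL-style entry that carries a KW line."""
    seq = sequence.lower()
    counts = [seq.count(base) for base in "acgt"]
    other = len(seq) - sum(counts)
    lines = [
        "ID   %s; SV 1; linear; genomic DNA; STD; UNC; %d BP." % (name, len(seq)),
        "XX",
        "KW   %s." % "; ".join(keywords),  # hypothetical, user-chosen keywords
        "XX",
        "SQ   Sequence %d BP; %d A; %d C; %d G; %d T; %d other;"
        % ((len(seq),) + tuple(counts) + (other,)),
    ]
    lines.extend(sequence_lines(seq))
    lines.append("//")
    return "\n".join(lines) + "\n"


fasta = ">sample1 first read\nACGTACGTAC\nGGTT\n"
for name, sequence in read_fasta(fasta):
    print(embl_like_entry(name, sequence, ["EXP_TOTO", "DATE_2010_02_15"]))
```

Once the keywords sit in the flat file, queries such as "all sequences of the « toto » experiment" reduce to ordinary keyword queries on the rebuilt database.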

Use of API: API C/C++

Documentation:

General structure

http://pbil.univ-lyon1.fr/databases/acnuc/structure.html

API C (local version)

http://pbil.univ-lyon1.fr/databases/acnuc/cfonctions.html

API C (client version, access via sockets)

http://pbil.univ-lyon1.fr/databases/acnuc/raa_acnuc.html

API C++ (client version, access via sockets, Bio++)

http://pbil.univ-lyon1.fr/databases/acnuc/bpp-raa/bpp-raa.html

Examples for the local version of the C API:

http://pbil.univ-lyon1.fr/databases/acnuc/example.php

http://pbil.univ-lyon1.fr/databases/acnuc/cfonctions.html

Use of API: API C

local version

Exercise 25

test exemple1.c

/*
gcc -c exemple1.c -I /bge/banques/csrc
gcc -o exemple1 exemple1.o -L /bge/banques/csrc -lcacnucdeb
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "dir_acnuc.h"

int main(int argc, char *argv[])
{
  /* char my_taxon[] = "Bovidae"; (case ignored) */
  char my_taxon[500];
  int num, err, *list, numsp;
  int i = 2;

  if (argc == 1) {
    fprintf(stderr, "Usage: exemple1 taxon_name\n");
    exit(1);
  }
  /* rebuild a multi-word taxon name from the command-line arguments */
  strcpy(my_taxon, argv[1]);
  while (argc > i) {
    strcat(my_taxon, " ");
    strcat(my_taxon, argv[i]);
    i++;
  }
  acnucopen();

  list = (int *)calloc(lenw, sizeof(int));
  err = shkseq(my_taxon, list, 1);
  if (err == 2) {
    printf("Taxon %s does not exist in the current database.\n", my_taxon);
    exit(1);
  }
  num = 1;
  while ((num = irbit(list, num, nseq)) != 0) {
    /* here num is the rank of a seq attached to taxon my_taxon */
    readsub(num);
    printf("%s\t%s\n", my_taxon, psub->name);
  }

  free(list);
  return 0;
}

To select a local database:

« choix embl » or « choixbanque »

or else:

setenv acnuc /acnucdb/embl/index

setenv gcgacnuc /acnucdb/embl/flat_files

Use of API: API C

local version

Exercise 26

http://pbil.univ-lyon1.fr/databases/acnuc/ex_requete.php

Use of API: API C

client version (access via sockets)

Documentation:

http://pbil.univ-lyon1.fr/databases/acnuc/raa_acnuc.html

Exercise 27

http://pbil.univ-lyon1.fr/databases/acnuc/raa_acnuc.html#example

Use of API: API Python

Documentation:

http://pbil.univ-lyon1.fr/cgi-bin/raapythonhelp.csh

Going further

Install an ACNUC server

Questions?

And don't forget:

www.mangerbouger.fr