Research (Open Access)

González et al. 2003, Volume 4, Issue 6, Article R36

The mosaic structure of the symbiotic plasmid of Rhizobium etli CFN42 and its relation to other symbiotic genome compartments

Víctor González, Patricia Bustos, Miguel A Ramírez-Romero, Arturo Medrano-Soto, Heladia Salgado, Ismael Hernández-González, Juan Carlos Hernández-Celis, Verónica Quintero, Gabriel Moreno-Hagelsieb, Lourdes Girard, Oscar Rodríguez, Margarita Flores, Miguel A Cevallos, Julio Collado-Vides, David Romero and Guillermo Dávila

Address: Centro de Investigación Sobre Fijación de Nitrógeno, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México 62210.

Correspondence: Guillermo Dávila. E-mail: [email protected].

© 2003 González et al.; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.

Abstract

Background: Symbiotic bacteria known as rhizobia interact with the roots of legumes and induce the formation of nitrogen-fixing nodules. In rhizobia, essential genes for symbiosis are compartmentalized either in symbiotic plasmids or in chromosomal symbiotic islands. To understand the structure and evolution of the symbiotic genome compartments (SGCs), it is necessary to analyze their common genetic content and organization as well as to study their differences. To date, five SGCs belonging to distinct species of rhizobia have been entirely sequenced. We report the complete sequence of the symbiotic plasmid of Rhizobium etli CFN42, a microsymbiont of beans, and a comparison with other SGC sequences available.

Results: The symbiotic plasmid is a circular molecule of 371,255 base-pairs containing 359 coding sequences. Nodulation and nitrogen-fixation genes common to other rhizobia are clustered in a region of 125 kilobases. Numerous sequences related to mobile elements are scattered throughout. In some cases the mobile elements flank blocks of functionally related sequences, thereby suggesting a role in transposition. The plasmid contains 12 reiterated DNA families that are likely to participate in genomic rearrangements. Comparisons between this plasmid and complete rhizobial genomes and symbiotic compartments already sequenced show a general lack of synteny and colinearity, with the exception of some transcriptional units. There are only 20 symbiotic genes that are shared by all SGCs.

Conclusions: Our data support the notion that the symbiotic compartments of rhizobia genomes are mosaic structures that have been frequently tailored by recombination, horizontal transfer and transposition.

Published: 13 May 2003

Genome Biology 2003, 4:R36

Received: 1 November 2002
Revised: 6 March 2003
Accepted: 2 April 2003

The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2003/4/6/R36


Background

Nitrogen-fixing symbiotic bacteria grouped within the Rhizobiaceae, Phyllobacteriaceae and Bradyrhizobiaceae families are widespread in nature [1]. Ordinarily known as rhizobia, these organisms contain genomes of one or two chromosomes and several large plasmids ranging in size from about 100 kilobases (kb) to more than 2 megabases (Mb). A common feature of the genomes of rhizobia is that the genes involved in the symbiotic process are located in specific symbiotic genome compartments (SGCs), either as independent replicons known as symbiotic plasmids (pSym) or as symbiotic islands or regions within the chromosome. Complete genome sequences have been recently reported for Mesorhizobium loti MAFF303099 [2], Sinorhizobium meliloti 1021 [3-6], Bradyrhizobium japonicum USDA110 [7] and the non-nitrogen-fixing close relative Agrobacterium tumefaciens C58 [8,9]. In addition, the sequence of the pSym of Rhizobium species NGR234, pNGR234a [10], as well as that of the chromosomal symbiotic regions of B. japonicum USDA110 and M. loti R7A have been reported [11,12]. Genomic comparisons reveal that the chromosomes of S. meliloti, M. loti, and the circular chromosome of A. tumefaciens have more than 50% of orthologous genes in common [6]. A clear syntenic relationship is observed between the circular chromosomes of S. meliloti and A. tumefaciens [8,9] and, albeit to a lesser extent, synteny is also apparent when both are compared to the chromosome of M. loti [6,8,9]. These results lead to the hypothesis that rhizobial chromosomes have a common ancestral origin [6,8,9]. Other genome constituents of rhizobia (that is, other chromosomes and plasmids) are thought to be the result of subsequent events of genomic rearrangements and horizontal transfer [6,8,9], but the precise mechanisms involved in their generation have not been elucidated so far.

Here we report the complete DNA sequence of the pSym (p42d) of Rhizobium etli CFN42 and its comparative analysis with other rhizobial SGCs. R. etli is the symbiont of the common bean Phaseolus vulgaris and has been widely used as a model for metabolic and genome dynamics studies [13,14]. Its genome is composed of one chromosome and six plasmids ranging in size from 184 kb to about 600 kb [15]. The physical map of p42d was previously determined and was the basis for obtaining the entire sequence [16]. In this study we show that the SGCs are heterogeneous in sequence, gene composition and gene order. There are only 20 symbiotic genes that are shared by all SGCs. There are also some conserved gene clusters of related function that are present in some SGCs, but absent in others. Besides genes unique to a particular SGC, several orthologous genes are located in different genome contexts in other rhizobia. Other features common to all SGCs, such as reiterated genes, pseudogenes, and a large number of insertion sequences (ISs), support the view that p42d, as well as other SGCs, is a mosaic structure that may have assembled from different genome contexts, either chromosomal or plasmidic.

Results and discussion

General features of p42d

The symbiotic plasmid p42d is a circular molecule of 371,255 base-pairs (bp) (Figure 1) that belongs to the repABC type of replicator [17]. We identified 359 coding sequences (CDS), of which 63% have an assigned function, 17% have homologs in databases without an assigned function, and 20% are orphan (Figure 1, see also Additional data file 1). The CDS distribution between the two strands is asymmetrical, with 61% of them located in the minus strand. The plus and minus strands were defined according to the previously reported physical map [16]. Moreover, the plus strand contains two reiterated nifHDK gene clusters in a clockwise orientation (NRa and NRb, Figure 1). The main functional classes of genes identified are: transport, nitrogen fixation, nodulation and transcriptional regulation. Ten pseudogenes related to known genes were identified that carry deletions and frameshifts at their amino or carboxyl termini. The plasmid also contains many reiterated sequences and a large number of elements related to insertion sequences (ERIS) accounting for 10% of the entire sequence. The major reiterations (28 elements) were grouped into 12 families on the basis of their sequence similarity (see below).

The average GC content of the plasmid is 58.1%. When genes were classified as low, average or high GC content (using the mean GC ± 1 standard deviation as thresholds), we observed a clear distinction between high or low GC in some gene clusters (Figure 2a). Several hypothetical genes and the nod genes show the lowest GC values (< 55%), whereas the highest GC values (> 62%) were displayed by the genes for cytochrome P450 (CPX), tra genes (TRA), and the genes for type III (TSSIII) and type IV (TSSIV) transport secretion systems. Similarly, when the genes were classified according to poor, typical or rich codon usage (CU) (see Materials and methods for details), genes with high GC also exhibited a rich CU (Figure 2b), whereas the GC-CU correlation was found to be lower for other genes. For example, the regions that contain the nitrogenase structural genes, other nif genes (NRa, b and c, see below), and the fixNOQPGHIS genes (FIX1) showed average GC content but rich CU. The variable correlation between GC content and CU levels reveals sequence heterogeneity within p42d and suggests a dynamic structure for this plasmid, presumably as a consequence of extensive genomic rearrangements, recombination rates, lateral transfer, and relaxation or intensification of selective pressures.
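The three-way GC classification described above (mean ± 1 standard deviation as thresholds) can be illustrated with a minimal Python sketch. The gene names and sequences in the usage note are hypothetical toy data, and the use of the population standard deviation is an assumption; the text does not say which estimator was applied.

```python
from statistics import mean, pstdev


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def classify_by_gc(cds: dict) -> dict:
    """Label each CDS 'low', 'average' or 'high' GC.

    Thresholds are the mean GC of all CDS minus/plus one standard
    deviation (population SD is an assumption of this sketch).
    """
    values = {name: gc_percent(seq) for name, seq in cds.items()}
    mu = mean(values.values())
    sd = pstdev(values.values())
    low_cut, high_cut = mu - sd, mu + sd
    labels = {}
    for name, gc in values.items():
        if gc < low_cut:
            labels[name] = "low"
        elif gc > high_cut:
            labels[name] = "high"
        else:
            labels[name] = "average"
    return labels


# Hypothetical toy sequences, not taken from p42d.
toy_cds = {
    "geneA": "AAAAATTTTT",
    "geneB": "ATGCATGCAT",
    "geneC": "ATGCATGCGC",
    "geneD": "GGGGGCCCCC",
}
print(classify_by_gc(toy_cds))
```

With the plasmid values reported in the text (mean 58.1%, standard deviation 3.5%), the same rule would place the cut-offs at 54.6% and 61.6%.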

Organization of genes involved in nodulation

Most nod genes present in p42d are clustered in a region of 16 kb (NOD); however, nodA is separated from nodBC by 27 kb. The Nod factor backbone of R. etli CFN42 is an N-acetylglucosamine pentasaccharide synthesized by the common nodA and nodBC gene products. Modifications to this backbone consist of methyl and carbamoyl groups at the non-reducing end, while the reducing one is modified by the addition of a fucosyl group that is in turn acetylated [18]. The methyltransferase, fucosyltransferase and acetyltransferase activities required for these modifications are encoded by nodS, nodZ and nolL, respectively. It is unclear, however, which gene product is responsible for the carbamoylation of the Nod factor, as nodU, the most likely gene to carry out this function, is a pseudogene. The two membrane proteins encoded by nodI and nodJ (located downstream of nodBCSU) participate in the transport of the Nod factor to the outside of the cell [19]. Other genes present in p42d whose homologs in other rhizobia have a role in nodulation are nolO, nolE, nolT and nolV, the last two being part of the TSSIII system (see below). In addition to nodU, two other pseudogenes, noeI and nodQ, were identified.

The expression of nod genes depends on the activity of NodD proteins [20], which interact with specific sites known as nod boxes located upstream of the nod operons [21]. The sequence of p42d revealed three nodD genes; nodD1 is present in the NOD region while nodD2 and nodD3 are 50 kb apart. We also predict 15 potential nod boxes (see Materials and methods and Additional data file 2), seven of which are associated with almost all nod genes: nodA, nodZ, nodBCSU, nolE, nodD1, nodD2, and nodD3. The rest of the nod boxes are located proximal to genes so far unrelated to the nodulation process; namely, the genes bglS (β-glucosidase), yp108 (putative monooxygenase), and the orphans yh005, yh007 and yh050. There is also a putative nod box upstream of the gene encoding NifA, the major transcriptional regulator of the nitrogen-fixation genes. Even though the regulation of nifA is variable among rhizobia, dependence on flavonoid induction is unknown.

Organization of the genes involved in nitrogen fixation

The nif and fix genes are distributed in five regions spanning a total of 125 kb (Figure 1); the NOD cluster mentioned previously maps within this section as well. There are three copies of the nitrogenase reductase gene nifH [22], defining the three nif regions (NR) a, b and c (Figure 1). NRa contains nifHDK genes and a truncated nifE pseudogene; NRb contains the nifHDKENX genes; NRc contains nifH and a truncated nifD pseudogene. The largest reiterated regions found in p42d correspond to NRa and NRb regions that share 4,470 identical nucleotides. The NRc region of 1,131 nucleotides is identical to sequences within NRa and NRb. The orientation of NRc is inverted in relation to the direction of NRa and NRb. Recent duplications of these NR regions might underlie the unusually high sequence identity between them. Alternatively, a mechanism of 'copy-correction' resembling gene conversion may be involved in maintaining nucleotide identity [23].

The highest density of nif and fix genes in p42d occurs 10 kb upstream of NRb, in the FIX2 region. This contains the fixABCX genes that encode a flavoprotein [24] (see below); nifB, which is needed for the synthesis of the iron-molybdenum cofactor [25]; nifW and nifZ, whose products may be required for protection of the nitrogenase from oxygen [26,27]; and the genes for the regulatory proteins NifA and RpoN2. The genes nifU, nifS and hesB (also named iscN) also map in the FIX2 region. The products of these genes have been implicated in the formation of the Fe-S cluster required for nitrogenase complex function [28]. In R. etli CNPAF512, the inactivation of hesB (iscN) results in a Fix- phenotype [28]. Other genes commonly found in nif regions of rhizobia were also identified in the FIX2 region. These are the ferredoxin gene fdxN, which is essential for nitrogen fixation in S. meliloti [29], and the gene for the anaerobic transcriptional regulator FnrNd (see below). The products of nifV and nifQ have been implicated in the synthesis of the iron-molybdenum cofactor [30,31]; nevertheless, nifV is absent in p42d and nifQ is located upstream of nifHc in the NRc region.

Figure 1. Structure of the symbiotic plasmid p42d of R. etli CFN42. The structure of p42d is represented in five concentric circles. Outermost circle, relevant regions referred to in the text: NRa, b and c, regions containing nitrogenase structural genes; FIX1 and FIX2, clusters containing nitrogen-fixation genes; NOD, major cluster of nodulation genes; CPX, cluster for cytochrome P450; TRA, cluster for tra genes; REP, replicator region; TSSIII and IV, clusters for transport secretion system genes. The 125 kb region that contains most of the symbiotic genes, described in the text as a putative mobile element, is shown in green. Second circle, organization of predicted CDSs located according to the direction of transcription and color-coded as below; those transcribed on the plus strand are shown in the outer half of the circle. For each class, the number of CDSs and the percentage of the total are: hypothetical (70) 19.5% (dark red); hypothetical conserved (62) 17.3% (red); integration recombination (55) 15.3% (purple); various enzymatic functions (45) 12.3% (khaki); transport secretion systems (37) 10.3% (gray); nitrogen fixation (35) 9.8% (yellow); nodulation (18) 5% (dark blue); transcriptional regulation (15) 4.2% (light blue); plasmid maintenance (10) 2.8% (orange); electron transfer (7) 2.1% (magenta); chemotaxis (3) 0.8% (pink); and polysaccharide synthesis (2) 0.6% (green). Third circle, elements related to insertion sequences (ERIS): putative partial ISs (purple) and putative complete ISs (black). Fourth circle, reiterated DNA families. The major reiterated families (see text) are shown in different colors. Innermost circle, potential genomic rearrangements. Arrowheads indicate the sites for homologous recombination leading to genomic rearrangements. Black lines connect sites for amplification or deletion events; red lines connect sites for inversion.
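As a quick arithmetic check on the functional classes listed in the Figure 1 legend, the sketch below tallies the CDS counts and recomputes each class's share of the total. The counts and printed percentages are copied from the legend; the function name `tally` is illustrative only.

```python
# (functional class, CDS count, percentage as printed in the Figure 1 legend)
CLASSES = [
    ("hypothetical", 70, 19.5),
    ("hypothetical conserved", 62, 17.3),
    ("integration recombination", 55, 15.3),
    ("various enzymatic functions", 45, 12.3),
    ("transport secretion systems", 37, 10.3),
    ("nitrogen fixation", 35, 9.8),
    ("nodulation", 18, 5.0),
    ("transcriptional regulation", 15, 4.2),
    ("plasmid maintenance", 10, 2.8),
    ("electron transfer", 7, 2.1),
    ("chemotaxis", 3, 0.8),
    ("polysaccharide synthesis", 2, 0.6),
]


def tally(classes):
    """Return the total CDS count and, per class, the recomputed share."""
    total = sum(count for _, count, _ in classes)
    rows = [
        (name, count, printed, round(100.0 * count / total, 1))
        for name, count, printed in classes
    ]
    return total, rows


total, rows = tally(CLASSES)
print(total)
for name, count, printed, recomputed in rows:
    print(f"{name:30s} {count:3d} printed={printed:5.1f} recomputed={recomputed:5.1f}")
```

The counts add up to the 359 CDS reported for the plasmid, and the recomputed shares agree with the printed ones to within a few tenths of a percentage point.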

RpoN regulation

The RpoN (σN, also known as σ54) subunit of the RNA polymerase, encoded by rpoN, and the transcriptional activator NifA protein, encoded by nifA (both present in the FIX2 region, Figure 1), participate in the regulation of nif genes. RpoN binds to specific promoter regions and interacts with the NifA protein that binds to specific upstream activator sequences (UAS) [32]. In R. etli CNPAF512, two rpoN genes have been described [33], one located in the chromosome (rpoN1), and the other in the pSym (rpoN2). The rpoN gene found in p42d is orthologous to rpoN2. Regulation by RpoN and NifA has been demonstrated for nifHa, nifHb and nifHc [34]. We predicted, as described in Materials and methods, potential RpoN-binding sites and UAS for NifA in the upstream region of several genes (see Additional data file 3). Both types of sites were also identified upstream of other genes: the reiterated yp003, yp021 and yp099 genes that encode the recently described BacS protein, highly expressed in nodules [35]; yp010 in the putative operon for terpenoid synthesis [36]; the fixA, hesB, and cpxA5 genes; and yp104, which encodes a toxin-transport-related protein. The expression of yp003 (bacS) and hesB (iscN) has recently been shown to depend on NifA [28,35].

Figure 2. Compositional features of the coding sequences (CDS) of p42d. (a) GC content and (b) CU of the 359 CDS of p42d. Red lines indicate the averages of GC (58.1%) and CU (0.58). Blue lines indicate 1 standard deviation from the average (GC ± 3.5%, that is, 54.6% and 61.6%; CU ± 0.16, that is, 0.42 and 0.74). The highest and lowest GC values are 69.4% and 45.8%, respectively. CU values range from 0.11 to 1.00. (c) CDS distribution with the color codes for functional classes and the relevant regions described in Figure 1.

RpoN-like promoters were also predicted upstream of several genes for which no associated NifA-binding sites could be detected (see Additional data file 3). Among them are the nitrogen-fixation genes fixO, nifQ and nifB, and the genes for the putative decarboxylase, pcaC1, and alcohol dehydrogenase, xylB2. Furthermore, potential sites for RpoN were also found in several genes of unknown function. Recently, Dombrecht et al. [37] predicted RpoN promoter sites in all complete rhizobial genomes and p42d; we report here a larger set of genes potentially regulated by RpoN in p42d. It includes genes for nitrogen fixation, electron transfer, transport, and several of unknown function. The genes reported by Dombrecht et al. are mainly nif and fix genes, the ferredoxins (fdxB and N; not predicted by us), and some genes of unknown function [37]. The differences between their results and ours may be explained in part by the different strategies used to construct the weight matrices in the two studies: ours is based on only 85 RpoN promoters whose transcription start sites have been experimentally determined, rather than on the whole set of 186 promoters used by Dombrecht et al. ([37], see also [38]); see Materials and methods for details.
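To make the weight-matrix idea concrete, here is a minimal Python sketch of how a position weight matrix can be built from aligned binding sites and used to scan a sequence. It is a generic illustration, not the procedure used in this study or by Dombrecht et al.: the aligned sites are hypothetical, and the pseudocount and the uniform background frequency of 0.25 are assumptions (a GC-rich replicon such as p42d would arguably call for a non-uniform background).

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(length):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score(pwm, window):
    """Sum of per-position log-odds scores for one window."""
    return sum(col[base] for col, base in zip(pwm, window))


def scan(pwm, seq, threshold):
    """Return (position, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        s = score(pwm, seq[i:i + width])
        if s >= threshold:
            hits.append((i, s))
    return hits


# Hypothetical aligned sites, for illustration only.
toy_sites = ["TGGCAC", "TGGCAT", "TGGTAC", "CGGCAC"]
pwm = build_pwm(toy_sites)
best = score(pwm, "TGGCAC")
print(scan(pwm, "AAATGGCACAAA", threshold=best - 1e-9))
```

Because the matrix depends directly on which sites are aligned, training on 85 experimentally mapped promoters versus a broader set of 186 would be expected to shift scores and therefore the predicted sites, which is the point made in the paragraph above.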

Energy supply and anaerobic regulation

The electron flux and supply of energy for the reduction of molecular nitrogen require the flavoprotein encoded by the fixABCX genes mentioned above (FIX2 region, Figure 1), a specific cytochrome oxidase encoded by fixNOQP genes, and a cation pump encoded by fixGHIS genes. The latter two gene clusters lie in the FIX1 region (Figure 1). A second copy of fixNOQP and fixG genes has been found in the plasmid p42f [39].

In the symbiotic state, the cytochrome terminal oxidases encoded by the fixNOQP operon provide the energy required to fix nitrogen [32]. The cytochrome production is regulated in response to oxygen concentration and the products of fixLJ, fixK and fnrNd genes are also known to be involved in such regulation [40]. In R. etli, the duplicated fixNOQP operons are differentially regulated and only the fixNOQPd is required for symbiosis [39]. An inactive fixKd is present in p42d but no fixJ genes have been found in R. etli CFN42 [39]. It has been shown that FixKf controls both fixNOQP operons; loss of FixLf (presumably a fusion protein of FixL and FixJ) suppresses fixNOQPf expression, but has only a moderate effect on that of fixNOQPd [39]. Two fnrN genes have been described in R. etli CFN42; one is chromosomal (fnrNchr), and the other is on p42d (fnrNd). Both regulators participate in the activation of the operon fixNOQPd [41].

In Escherichia coli, Fnr is an oxygen-responsive global transcriptional regulator that binds to conserved boxes upstream of several genes (anaeroboxes) [42]. By computational methods we predict 45 possible anaeroboxes in p42d (see Materials and methods). In some cases there are pairs of anaeroboxes in the same region. For example, two anaeroboxes lie within the intergenic region of the divergent operons fixKd and fixNOQPd and two more were detected upstream of fixG, nocR, nodD3, and fnrNd. Other genes that display single anaeroboxes are fixX, nifW, nifU, hemN2, psiB, hesB, mcpC, teuB1 and some other genes of unknown function. Although there is no direct transcriptional evidence about the expression of these genes in microaerobic conditions, previous observations suggest that several regions of p42d are activated under these conditions [43].
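A simple way to picture an anaerobox search is a consensus scan that tolerates a few mismatches. The Python sketch below assumes the commonly cited Fnr-box consensus TTGAT-N4-ATCAA; that motif, the mismatch allowance and the function name are assumptions of this illustration, and the procedure actually used for p42d is the one given in Materials and methods.

```python
LEFT_HALF = "TTGAT"    # assumed consensus half-site
RIGHT_HALF = "ATCAA"   # its reverse complement
SPACER = 4             # unconstrained bases between the half-sites
WIDTH = len(LEFT_HALF) + SPACER + len(RIGHT_HALF)


def find_anaeroboxes(seq, max_mismatches=1):
    """Return (position, site, mismatches) for candidate boxes.

    Only the two half-sites are compared with the consensus; the
    4-base spacer is ignored. The consensus is its own reverse
    complement, so scanning one strand covers both.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - WIDTH + 1):
        window = seq[i:i + WIDTH]
        left = window[:len(LEFT_HALF)]
        right = window[len(LEFT_HALF) + SPACER:]
        mismatches = sum(a != b for a, b in zip(left, LEFT_HALF))
        mismatches += sum(a != b for a, b in zip(right, RIGHT_HALF))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


# Hypothetical toy sequence with one perfect box embedded at position 4.
print(find_anaeroboxes("GGCCTTGATCCGGATCAAGGCC", max_mismatches=0))
```

In practice, candidate boxes would also be filtered by their position relative to annotated start codons, since only sites in upstream regions are relevant to regulation.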

Complex systems for macromolecular transport

A variety of transporters, which account for 10% of the CDSs, are scattered throughout p42d. These include several partial and complete ABC transporters for sugars, as well as the type III (TSSIII) and type IV (TSSIV) large-molecule secretion systems (Figure 1).

In several pathogenic bacteria, the TSSIII translocates virulence factors into eukaryotic cells [44]. In Rhizobium this system was first found in pNGR234a [10] and it has been shown to have a role in nodulation efficiency in some host plants [45]. Genes that encode proteins implicated in this system are also present in some of the SGCs [2] and were detected in the sequence of p42d. Interestingly, a gene homologous with an elicitor of the hypersensitive response in plants, hrpW, is exclusively present in p42d. This gene might form an operon with pcrD, which encodes a calcium-binding membrane protein that is also part of the type-III secretion system.

The TSSIV encoded by the virB genes has been described in several α-proteobacterial pathogens and plant symbionts [46] (see below). It consists of a membrane channel for delivering proteins or DNA into eukaryotic cells. In p42d, a complete set of virB genes, from virB1 to virB11, is present (Figure 1). Other TSSIV correspond to the tra genes that participate in bacterial conjugation. Although p42d is not a self-conjugative plasmid [47], it contains the traACDG genes, an oriT and a truncated traI pseudogene (yp096), suggesting that p42d might have lost its self-conjugative capability.

Other functions

In addition to nifA, fnrN, rpoN2, and nodD1-3, 12 predicted genes encoding potential transcriptional regulators are present in p42d. They belong to different families, including LysR, AraC, Crp and GntR. The plasmid also encodes other functions including plasmid maintenance, electron transfer, polysaccharide biosynthesis, melanin synthesis and secondary metabolism. The sequence of p42d revealed a putative methionyl-tRNA synthetase that could represent a reiterated gene or could have another functional role (for example in antibiotic resistance) [48].

Elements related to insertion sequences (ERIS)

In general, large numbers of ERIS have been found in the symbiotic compartments of rhizobia [2,7,10,11]. The genome of S. meliloti, however, contains a relatively low abundance of these elements and their distribution is asymmetric; that is, ERIS are more abundant in the pSymA, especially near symbiotic genes [3]. In p42d, ERIS belonging to 12 known IS families comprise 10% of the entire DNA sequence. The great majority of them belong to the IS3 and IS66 families. Although most ERIS represent incomplete, presumably inactive, IS sequences, some of them are organized in complete IS elements (Figure 1).

The positions of some ERIS might suggest a role in plasmid shuffling. The 125 kb region that contains most of the symbiotic genes (Figure 1) is flanked by two complete IS elements. Both elements share identical 30 bp direct repeats at their borders, suggesting a potential transposition capability. The presence of the gene for an integrase-like protein (yp018) and the fact that the 125 kb region separates the repABC and the tra genes have prompted the idea that the entire symbiotic region could be a mobile element. Furthermore, some groups of genes flanked by ERIS might have arrived in p42d as part of composite transposons, such as the cytochrome P450 cluster (see below), the NRb region, and a putative ATPase of an ABC transporter.

Reiterated DNA families and genomic rearrangements

It has previously been shown that p42d contains several reiterated DNA sequences [16] that can recombine, leading to genomic rearrangements [14,49,50]. The sequence of the plasmid revealed a large amount of DNA reiteration. The major reiterated families were defined as those containing a continuous stretch of at least 300 nucleotides with identical sequence. There are 12 such families, with two or three members each (Figure 1). In addition to the nif family described above, five families are related to ERIS and the rest are various genes, such as those that encode the BacS protein [35], or gene fragments.
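The criterion used for the major reiterated families, a continuous stretch of at least 300 identical nucleotides, can be sketched as an exact-repeat search. The Python example below is a naive illustration rather than the tool used in the study: it indexes every window of length k by its canonical form (the lexicographically smaller of the window and its reverse complement), so that both direct and inverted copies are grouped together. It treats the sequence as linear, ignoring the circularity of the plasmid, and the toy sequence and small k in the usage note are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def repeated_seeds(seq, k=300):
    """Map each repeated k-mer (canonical form) to its occurrences.

    Each occurrence is (start, strand): '+' if the window equals the
    canonical form, '-' if it is the reverse complement. Two members
    with the same strand are direct repeats; members with different
    strands are inverted repeats.
    """
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        rc = revcomp(kmer)
        if kmer <= rc:
            index[kmer].append((i, "+"))
        else:
            index[rc].append((i, "-"))
    return {kmer: hits for kmer, hits in index.items() if len(hits) > 1}


# Hypothetical toy sequence: two direct copies and one inverted copy of GATTACA.
toy = "GATTACA" + "CCC" + "GATTACA" + "TTT" + "TGTAATC"
print(repeated_seeds(toy, k=7))
```

For a real replicon, overlapping seeds would then be merged into maximal repeats and clustered into families; a suffix-array or alignment-based tool would be a more memory-efficient choice than this dictionary of windows.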

As previously shown with pNGR234a [51], the DNA sequence allows prediction, identification and isolation of the potential rearrangements that may be generated by homologous recombination. The complete sequence of p42d will allow the identification of the precise sites of previously identified genomic rearrangements [50]. In the present study we have predicted the major potential rearrangements in p42d as previously described [51]; these include amplifications, deletions and inversions such as those illustrated in Figure 1.

In other SGCs the differences in number, organization, orientation and length of the reiterated elements predict specific genome rearrangements, as exemplified by the rearrangements that involve the nifH reiteration of p42d and pNGR234a [50,51].

Genetic information of p42d in the context of other genomes

The putative protein sequences of p42d were compared to the proteomes of several complete bacterial genomes extracted from GenBank [52] (see Materials and methods) as well as to the SGC sequences available to date (Figure 3). We identified all pairs of potential orthologs between p42d and each of the genomes analyzed, following the strategy and definition described in Materials and methods. As expected, the highest percentage of orthologs common to p42d and to any other bacterial genome was found among the nitrogen-fixing symbiotic bacteria. S. meliloti and M. loti have, respectively, 51% and 45% of the orthologs found in p42d (see Additional data file 5). Members of the α-proteobacterial subclass such as Caulobacter crescentus, Brucella melitensis and A. tumefaciens (a plant pathogenic member of the Rhizobiaceae) have from 25% to 32% of the orthologs present in p42d. The percentage of p42d orthologs within the genomes of plant pathogens varies from 17% for Xylella fastidiosa to 31% for Ralstonia solanacearum. Human bacterial pathogens such as Haemophilus influenzae and Helicobacter pylori, those with small genomes such as Rickettsia prowazekii and Mycoplasma genitalium, and the archaea compared here display the lowest number of shared orthologs. Instances of putative orthologs found in p42d and some complete bacterial genomes are shown in Figure 3a. In general, a collection of orthologs involved in diverse enzymatic activities is present in p42d and in most genomes compared here. They include the genes hemN1, hemN2, ctrE, hisC, icfA, pgmV, aatC, pcaC1, adhE, ribAB, bglS, kprS, mcpG, mcpA and mmsB (see Additional data file 1 for the assigned functions). Their identity is in most cases 50% or lower.

When we examined the distribution of orthologs in the six SGCs (see above), including p42d and using the genomes of M. loti and S. meliloti as reference, it was found that half of the hits lie in the respective SGCs and the rest are dispersed among other replicons, including the chromosomes (Table 1, Figure 3f,3g). In general, the genes for symbiosis are very well conserved in the SGCs, whereas the orthologs of genes not involved in symbiosis are distributed in nonsymbiotic plasmids and in the chromosomes (Figure 3c,3d,3e).

A total of 177 p42d CDSs (49%) have orthologs in at least one SGC. A subset of these (80 CDSs) belongs to the symbiotic region of 120 kb (Figure 3c,3d,3e,3f,3g, from NRa to NRb regions; Table 1) and the rest are interspersed in the remaining 251 kb of the plasmid. Among the SGCs compared, pNGR234a shares the highest percentage of orthologs (30%) with p42d (Table 1, Figure 3d), followed by pSymA (28%) and the SGC of M. loti R7A (Table 1, Figure 3g and 3e, respectively). The SGCs of M. loti MAFF303099 and B. japonicum share the fewest orthologs (24%) with p42d (Table 1, Figure 3f and 3c, respectively). The A. tumefaciens plasmids display the highest similarity with the TSSIV, TRA, and REP regions of p42d; the rest of the matches are distributed in the circular and the linear chromosomes (Figure 3b).

There are 20 genes common to all SGCs. These correspond exclusively to symbiotic genes, including both nitrogen fixation (nifHDKENXAB, fixABCX, fdxN, fdxB) and nodulation (nodABCIJD) genes. The essential nodBC genes, however, have possible paralogs in some plant pathogens such as A. tumefaciens, Ralstonia solanacearum and Xanthomonas. Possible paralogs of the transport genes nodIJ are present in all the genomes analyzed. In these bacterial species, putative paralogs of nod genes might participate in the synthesis and secretion of outer membrane lipopolysaccharides [53].

Conserved gene clusters in SGCs and other genomes

The fixNOQPGHIS genes common to different nitrogen-fixing symbiotic rhizobia are not always confined to the SGCs [10,11]. As mentioned above, in R. etli CFN42, the fix genes are distributed in two replicons, p42d and p42f, and some of them are reiterated, as is frequently observed in other genomes [2,3]. In S. meliloti, these genes are reiterated three times in pSymA [3] and in M. loti there are two copies of the entire operon [2].


In B. japonicum they lie outside of the SGC (410 kb) determined by Gottfert et al. [11] but are included in the equivalent 681 kb SGC of the complete genome [7]. Moreover, in Rhizobium sp. NGR234 they are chromosomal [54]. The fixNOQPGHIS cluster was identified in the genome of the plant pathogen A. tumefaciens (circular chromosome), the intracellular parasite Brucella melitensis (chromosome I), and in the free-living aquatic bacterium C. crescentus, all of them belonging to the α-proteobacterial subdivision. Among γ-proteobacteria, the plant pathogen Pseudomonas aeruginosa has this fix cluster, which is absent in E. coli. Also, orthologs of this gene cluster are conserved in R. solanacearum, a plant pathogen that belongs to the β-proteobacteria.

The fixABCX operon is highly conserved in diazotrophs as well as in a wide variety of other bacterial and archaeal species such as E. coli, Mycoplasma genitalium, Bacillus subtilis, Thermotoga maritima and Archaeoglobus fulgidus. In E. coli these fix genes are related to the carnitine pathway, but their function is unknown in the other species [55]. The ferredoxins FdxN and FdxB are always linked to nif genes in symbiotic as well as nonsymbiotic organisms. In S. meliloti, mutations in fdxN significantly impair the nitrogen-fixation process [29].

As mentioned above, the CPX cluster (9 kb, 15 CDSs) in p42d exhibits GC and CU profiles that diverge from the rest of the genes; CPX gene function is not known and no symbiotic role has so far been assigned to them. In B. japonicum some of these genes might participate in terpenoid synthesis [36]. The genes included in the CPX region showed similar organization in the SGC of M. loti (strains MAFF303099 and R7A), in pNGR234a, and in p42d. In B. japonicum, the CPX cluster was not located in the 410 kb SGC [11,36] but is present in the 680 kb SGC determined by Kaneko et al. [7]. In pSymA of S. meliloti, this cluster is partially represented by homologs of cpxP2, cpxP4, ctrE and some conserved hypothetical genes, yp013-yp015. Homologs of IS are located at the right border of the CPX region in pNGR234a and pSymA, while they are to the left of the SGC in M. loti R7A. The CPX region in p42d is flanked by ERIS, highlighting its potential for transposition.

Figure 3. Comparison of predicted proteins from p42d with those from other genomes and SGCs. Bidirectional best hits (BDBHs) between p42d and other genomes are shown. The bars in all rows represent the percentage identity (number of identities/length of the alignment) of BDBHs between p42d and the indicated genome (see below for color code). The horizontal red line in each row indicates 50% of similarity. A color code is shown for each genome or compartment. (a) Different organisms: Bacillus subtilis (dark magenta); Brucella melitensis (yellow); Caulobacter crescentus (red); Escherichia coli K12 (light magenta); Methanobacterium thermoautotrophicum (dark purple); and Ralstonia solanacearum (purple). (b) A. tumefaciens C58 circular chromosome (white), linear chromosome (pale gray), pAT (gray), and pTi (dark gray). (c) B. japonicum USDA110 SGC (turquoise). (d) pNGR234a (blue green). (e) M. loti R7A SGC (green). (f) M. loti MAFF303099 SGC (dark blue), and the rest of the chromosome (light blue). (g) S. meliloti pSymA (pale yellow), pSymB (yellow), and the chromosome (dark yellow). (h) CDS distribution for p42d with the color codes for functional classes and the relevant regions as indicated in Figure 1.

A common feature in the SGCs is the presence of either the TSSIII or the TSSIV transport secretion systems. The TSSIII is found in pNGR234a and in the SGCs of B. japonicum and M. loti MAFF303099. The TSSIV is located in pSymA of S. meliloti and the symbiotic island of M. loti R7A. Both transport systems are present in p42d. In the pTi and pRi plasmids of A. tumefaciens C58, the TSSIV system is used for transferring the T-DNA to plant cells. In the absence of T-DNA in the SGCs, the precise function of these systems is not clear. Furthermore, both TSSIII and TSSIV are found in bacterial pathogens of plants and animals as well as in some α-proteobacteria. Complete or partial TSSIII or TSSIV are present in Brucella melitensis, C. crescentus, X. citri and X. campestris, whereas in Rickettsia prowazekii, some virB genes are conserved. P. aeruginosa contains a complete TSSIII but lacks homologs of the TSSIV, while in Xylella fastidiosa, nine putative conjugative proteins of the plasmid pXF41 are clearly orthologs of the corresponding virB gene set found in other microorganisms.

Absence of synteny among SGCs

It is generally known that gene order is conserved in closely related strains and species. The six SGCs compared here, except the SGCs of the two M. loti strains, have 20-30% of genes in common according to our estimates (Table 1). Most of these genes are located within the conserved clusters described above. Furthermore, genes unique to each of the individual genomes are interspersed among genes present in all SGCs. For example, p42d contains 71 orphan genes throughout its structure.

The SGCs in M. loti strains MAFF303099 and R7A share large conserved segments that contain all the symbiotic genes [12] (Figure 4j). The colinearity is disrupted by genes unique to either of the SGCs. The smallest region that encloses the 20 common orthologous genes (essentially nod and nif genes) can be delimited to about 50 kb in pSymA, 120 kb in p42d, 250 kb in pNGR234a, 300 kb in the SGC of B. japonicum, and 320 kb in the two SGCs of M. loti (Figure 5). Such variability in gene order suggests that the SGCs have recombined frequently with other genome elements.
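One way to delimit the smallest region enclosing a set of genes is sketched below. This is an illustration under stated assumptions rather than the procedure used here: each gene is represented only by its start coordinate, the function name is hypothetical, and the `circular` flag distinguishes plasmids (where the region may wrap around the origin) from chromosomal islands treated as linear segments.

```python
def smallest_enclosing_region(starts: list[int], replicon_length: int,
                              circular: bool = True) -> int:
    """Length of the shortest stretch of the replicon containing all start coordinates."""
    pts = sorted(p % replicon_length for p in starts)
    if len(pts) < 2:
        return 0
    if not circular:
        return pts[-1] - pts[0]
    # On a circle, the shortest enclosing arc is the whole circle minus the largest gap
    gaps = [(pts[(i + 1) % len(pts)] - pts[i]) % replicon_length
            for i in range(len(pts))]
    return replicon_length - max(gaps)

# Toy coordinates on a 100-unit replicon (not real gene positions)
wrapped = smallest_enclosing_region([10, 20, 90], 100)
linear = smallest_enclosing_region([10, 20, 90], 100, circular=False)
```

In the toy example the circular answer is shorter than the linear one because the enclosing arc runs across the origin.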

Several transcriptional units that are conserved in some SGCs appear to have undergone rearrangements in others. Examples taken from the nif, fix and nod operons are illustrated in Additional data file 6. The nifHDK and nifENX operons are neighboring conserved transcriptional units in p42d, in the two SGCs of M. loti, and in pNGR234a. However, nifH and nifN are separated from their respective operons in the SGC of B. japonicum and in pSymA of S. meliloti, respectively. Similarly, nodA is located away from the nodBC genes in p42d, and nodB is distant in the SGCs of the M. loti strains. The operon fixABCX is disrupted in the SGC of B. japonicum, where fixA is in an operon with nifA. In turn, in other SGCs, nifA is commonly found in an operon with nifB and fdxN. Phylogenetic analyses of the 20 common genes in the six SGCs result in nonequivalent trees, even for genes that are organized in operons (data not shown). For example, trees derived from the genes of the operons nifHDK and fixABCX are incongruent, indicating that intraoperon recombination has also been frequent.

Table 1. Number of bidirectional best hits between pairs of SGCs or complete genomes

          p42d      pNGR234a  SGCBj     SmChr     pSymA     pSymB     MlChr     pMLa      pMLb      SGCMl
pNGR234a  120
SGCBj     88*       133
SmChr     63        86        59
pSymA     100       77        47        ND
pSymB     20        43        17        ND        ND
MlChr     62        127       66        2367      321       613
pMLa      15        29        8         18        17        10        ND
pMLb      5         12        11        8         23        12        ND        ND
SGCMl     81        116       79        ND        65        ND        ND        ND        ND
SGCR7A    101       135       89        86        73        54        30        21        2         240

Bidirectional best hits (BDBHs) were calculated in pairwise comparisons using BLASTP. All reciprocal matches with an e-value up to 1e-04 and a coverage of at least 50% of the length of the shorter CDS were collected. p42d, the symbiotic plasmid of R. etli, 371 kb, 359 CDS; pNGR234a, the symbiotic plasmid of Rhizobium sp., 536 kb, 416 CDS; SGCBj, B. japonicum USDA110 symbiotic chromosomal region, 410 kb, 388 CDS; SmChr, S. meliloti chromosome, 3,600 kb, 3,396 CDS; pSymA, S. meliloti symbiotic plasmid A, 1,354 kb, 1,295 CDS; pSymB, S. meliloti symbiotic plasmid B, 1,683 kb, 1,571 CDS; MlChr, M. loti MAFF303099 chromosome without the symbiotic island, 6,425 kb, 6,172 CDS; pMLa, M. loti MAFF303099 cryptic plasmid a, 351 kb, 320 CDS; pMLb, M. loti MAFF303099 cryptic plasmid b, 208 kb, 209 CDS; SGCMl, M. loti MAFF303099 symbiotic island, 611 kb, 580 CDS; SGCR7A, M. loti R7A symbiotic island, 502 kb, 414 CDS. ND, not determined. *The number of BDBHs with the complete genome of B. japonicum USDA110 is 150.

Conclusions

Our results indicate that p42d contains several regions that significantly deviate from the average GC content and typical CU. The plasmid harbors a large number of ERIS and several reiterated DNA families. In addition, it contains 10 pseudogenes. These features resemble those found in other SGCs. All SGCs sequenced so far are heterogeneous regarding their gene content, and the genes common to most of them are mainly those involved in nodulation and nitrogen fixation. Other common genes are present either in SGCs or in other genome locations (see above). The lack of synteny between p42d and the different SGCs analyzed gives further support to the notion that the symbiotic compartments of rhizobial genomes are mosaic structures [10], presumably assembled from regions derived from diverse genomic contexts, that might have been frequently modified as a consequence of transposition, recombination and lateral transfer events.

Materials and methods

Sequencing strategy

A minimal set of cosmids that covers the entire p42d [16] was used to generate shotgun libraries (1-2 kb mean insert size) cloned in M13 or pUC19 vectors. DNA sequencing reactions were performed using the Big-Dye Terminator kit in an automatic 373A DNA Sequencer (Applied Biosystems, Foster City, CA). Gaps were filled by a primer-walking strategy as well as by sequencing appropriate clones from pBR328 and pSUP202 libraries. A total of 6,210 readings of 450 bases on average were collected to achieve a coverage of 7× for the entire p42d.
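As a quick check of the sequencing depth, the raw fold coverage follows from the number of reads, the mean read length and the plasmid length reported in this article. The function name is hypothetical, and the calculation assumes that every base of every read contributes to the assembly, which overestimates the usable coverage.

```python
def fold_coverage(n_reads: int, mean_read_length: float, target_length: int) -> float:
    """Raw sequencing depth: total bases read divided by the length of the target."""
    return n_reads * mean_read_length / target_length

# Values reported in the text: 6,210 reads, 450 bases on average, 371,255 bp plasmid
raw_coverage = fold_coverage(6_210, 450, 371_255)
```

The result is about 7.5, in line with the reported 7× once trimmed or low-quality bases are discounted.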

Assembly

Base calling was done using the program PHRED and the assembly was obtained by PHRAP [56,57]. Graphic representation and editing of the assembly were accomplished using the CONSED program [58]. Low-quality and single-stranded regions were located, and further sequencing was done to cover these areas. An error rate of less than 1 per 10,000 bases was estimated using base qualities determined by the PHRAP assembler. To confirm the assembly, pairs of forward and reverse primers were designed and used to raise overlapping PCR products with an average size of 5 kb, covering the entire plasmid in a single circular contig. The PCR products obtained agree well with the determined sequence (data not shown).

Figure 4. Analysis of synteny among the SGCs. Pairs of orthologous proteins among different genomes or SGCs are plotted. Each protein pair is shown according to the location of the corresponding coordinate of the predicted translation start of the gene on the DNA region. The axes correspond to the total length of the respective DNA region: p42d 371,255 bp; M. loti MAFF303099 symbiotic island 610,975 bp; M. loti R7A symbiotic island 502,000 bp; S. meliloti pSymA 1,354,226 bp; pNGR234a 536,165 bp and B. japonicum symbiotic region 410,573 bp. For each group the first region mentioned corresponds to the x-axis. (a) p42d vs pNGR234a; (b) p42d vs pSymA; (c) p42d vs B. japonicum symbiotic region; (d) p42d vs M. loti MAFF303099 symbiotic island; (e) p42d vs M. loti R7A symbiotic island; (f) pNGR234a vs S. meliloti pSymA; (g) pNGR234a vs B. japonicum symbiotic region; (h) pNGR234a vs M. loti MAFF303099 symbiotic island; (i) pNGR234a vs M. loti R7A symbiotic island; (j) M. loti MAFF303099 symbiotic island vs M. loti R7A symbiotic island.
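The error-rate estimate mentioned under Assembly rests on the standard relation between a Phred-scaled quality Q and the error probability P, namely Q = -10 log10(P). The sketch below shows how per-base qualities translate into an expected error rate; the function names and the quality values are illustrative assumptions, not data from the p42d assembly.

```python
import math

def phred_to_error(q: float) -> float:
    """Error probability implied by a Phred-scaled quality value."""
    return 10 ** (-q / 10)

def expected_error_rate(qualities: list[float]) -> float:
    """Mean per-base error probability over a set of consensus quality values."""
    return sum(phred_to_error(q) for q in qualities) / len(qualities)

# Hypothetical consensus qualities; Q = 40 corresponds to 1 error per 10,000 bases
example_rate = expected_error_rate([40, 50, 60, 45])
meets_target = example_rate < 1e-4
```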

CDS prediction and annotation

The coding capacity of p42d was determined by applying GLIMMER 2.02 [59,60] iteratively to enhance the overall prediction efficiency. Given the evidence indicating that GLIMMER-based predictions are less effective in plasmids [2], our approach also took into consideration the existence of several gene classes with different codon-usage (CU) patterns [61], and a potential ribosome-binding site (RBS) specific to p42d to aid GLIMMER in the selection of start codons.

RBS prediction

An initial set of presumably functional genes with a corresponding upstream RBS was detected by running BLASTX [62] comparisons (using a maximum e-value cutoff of 0.001) of the entire plasmid against the nonredundant (nr) database [52] at the National Center for Biotechnology Information (NCBI). All matches with hypothetical or putative proteins as well as those with an upstream neighbor hit closer than 50 bp were discarded to avoid genes within operons. We took into consideration only hits displaying an identity ≥ 40%, starting at the first amino acid, and alignment coverage of at least 80% of the matched protein. We then extended the selected hits towards the 5' terminus and kept those with an upstream in-frame stop codon before any other possible start codon. This procedure left 21 hits. From the p42d sequence we extracted 20-bp regions upstream of the start codons of these hits and inferred the most probable RBS (6 bp in length) by applying the CONSENSUS program [63]. The resulting consensus matrix supported the sequence GGAGAG with an expected frequency of 2.034 × 10^-8.
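The idea of inferring a shared 6-bp motif from short upstream regions can be illustrated with a deliberately simple stand-in: counting, for every hexamer, the number of regions in which it occurs. This is not the CONSENSUS algorithm, which builds a weight matrix and tolerates mismatches; the function name and the toy sequences are hypothetical.

```python
from collections import Counter

def most_shared_kmer(regions: list[str], k: int = 6) -> tuple[str, int]:
    """Return the k-mer present in the largest number of regions, with that number."""
    counts = Counter()
    for region in regions:
        # A set ensures each k-mer is counted once per region
        counts.update({region[i:i + k] for i in range(len(region) - k + 1)})
    return counts.most_common(1)[0]

# Toy upstream regions (invented for illustration, not p42d sequence)
toy_regions = ["AAGGAGAGCT", "CCGGAGAGTA", "TTGGAGAGGC"]
motif, n_regions = most_shared_kmer(toy_regions)
```

Exact-match counting like this would miss degenerate sites, which is one reason matrix-based tools are preferred for real data.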

CDS prediction

To train GLIMMER, we took the initial output of the BLASTX comparison detailed above, and selected as training set all hits with an alignment length ≥ 100 amino acids. Again, all matches with hypothetical/putative proteins were discarded. Overlapping hits matching the same protein were merged into a single larger hit, generating a total of 183 DNA segments. The RBS and the training set obtained were then used to run GLIMMER, yielding a prediction of 460 CDSs that included 93% of the 183 segments in the training set. However, we noticed that running GLIMMER iteratively yielded better results, because it produced a lower number of predicted CDSs and a greater number of segments in the training set mapping within predicted CDSs. We applied the method of A.M.-S., G.M.-H., A. Christen and J.C.-V. (unpublished work) to split the initial set of 460 CDSs into three groups displaying poor, typical and rich codon usage. Essentially, this method quantifies the extent to which individual genes use the most abundant codons in the plasmid. Each group was used as a training set and GLIMMER was run for 20 iterations to predict CDSs ≥ 300 bp (CDSs ≥ 500 bp in the first iteration composed the training set for the second iteration, and so forth). The best prediction for each CU group was selected, and the three resulting predictions were incorporated into a single one that produced 396 CDSs and recovered 97.75% of the initial training set.

Annotation

All CDSs were manually curated using BLASTX comparisons (e-value ≤ 0.001) against the nr database. The following criteria were applied to annotate the CDSs: CDSs were tagged as hypothetical (yh) when no homolog could be detected; hypothetical conserved CDSs (yp) were those displaying strong similarity to hypothetical proteins or weak similarity to known genes; CDSs with similarity ≥ 50% along the entire length of known genes were assigned the same name as the matching gene; CDSs related to insertion sequences (IS) and transposons (yi) were compared with BLASTN and BLASTX against the IS database [64] to identify the family they belong to. These elements were also analyzed for the presence of inverted repeats at their borders applying OLIGO 6.4 [65] and BLAST2 programs. Functional classification was carried out following the categories proposed in Freiberg et al. [10]. Additional support for annotation was obtained by searching for protein domains and motifs with the Interpro suite [66]. Transmembrane domains and leader peptides were searched using the PSORT program [67]. A relational database that compiles all this information is available at [68], and Additional data file 1 shows the set of 359 annotated CDSs.

Figure 5. Distribution of the 20 genes common to all the SGCs analyzed. (a) p42d; (b) M. loti MAFF303099 SGC; (c) pNGR234a; (d) M. loti R7A SGC; (e) B. japonicum SGC; (f) S. meliloti pSymA. The color bars indicate the position of the genes. The nodulation genes nodABCDIJ are represented in blue, and the nitrogen-fixation genes nifHDKNEXAB, fixABCX and fdxBN are represented in yellow. Horizontal axis: length (nucleotides), with ticks at 0, 100,000, 200,000 and 300,000.

Transcription units, RpoN promoters and regulatory binding sites

We predicted that all CDSs in p42d are organized in 235 transcription units (TUs) by applying a previously reported distance-based methodology [69]. Binding sites were detected using upstream regions of variable length (specified in each case) for all annotated CDSs in the pSym. To identify genes potentially expressed from RpoN promoters, we compiled an initial training set containing 85 prokaryotic promoters for which the transcription start site has been experimentally mapped [38]. The CONSENSUS/PATSER set of programs [63] was then used to predict promoters 16 bp long in upstream regions of 250 bp. A final set of 37 RpoN promoters was obtained using as PATSER threshold the mean (µ) minus one standard deviation (σ) estimated from the set of 85 promoters (µ - 1σ = 6.33). Binding sites for NifA (upstream activator sequences, UAS) were predicted using seven reported sites [27,70-73] as the training set. CONSENSUS/PATSER was run to predict sites of 16 bp in length within -400 to +50 bp regions, yielding 21 sites with PATSER score ≥ 8.03 (µ - 1σ); if a more stringent threshold is used instead, several known sites are undetected. We further discarded all predicted UAS without an associated RpoN promoter. In the case of nod boxes, we applied the dyad-sweeping method [74] to a set of six reported sites [75-78] in order to pinpoint the location of potential nod boxes in p42d (as a conglomerate of five or more dyads), and then used CONSENSUS/PATSER to determine 47-bp sites within -600 to +50 bp regions.
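The grouping of CDSs into transcription units by intergenic distance, mentioned at the start of this section, can be pictured with the following simplified sketch. It does not reproduce the cited methodology [69]: the fixed 50-bp cutoff, the function name and the gene coordinates are all assumptions made for illustration.

```python
Gene = tuple[str, int, int, str]  # (name, start, end, strand)

def group_transcription_units(genes: list[Gene], max_gap: int = 50) -> list[list[Gene]]:
    """Join adjacent same-strand genes whose intergenic distance is at most max_gap."""
    units: list[list[Gene]] = []
    for gene in sorted(genes, key=lambda g: g[1]):
        previous = units[-1][-1] if units else None
        if previous and previous[3] == gene[3] and gene[1] - previous[2] <= max_gap:
            units[-1].append(gene)   # extend the current unit
        else:
            units.append([gene])     # strand change or large gap starts a new unit
    return units

# Invented coordinates for four hypothetical genes
toy_genes = [("a", 1, 100, "+"), ("b", 120, 300, "+"),
             ("c", 320, 500, "-"), ("d", 900, 1200, "-")]
toy_units = group_transcription_units(toy_genes)
```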

Seven putative nod boxes were found by these approaches; however, several known functional sites were still undetected, and thus we trained CONSENSUS/PATSER again with the seven p42d sites found. Given that the mean PATSER score for these sites is high (21.48), the usual threshold (µ - 1σ) is also correspondingly high (16.21), and thus we could not predict any additional sites. For these reasons, we decided to use as threshold the lowest PATSER score (9.15) obtained from the seven training sequences; in this way we finally predicted 15 nod boxes. CONSENSUS/PATSER programs were also applied to identify regulatory motifs for Fnr based on 30 known binding sites in E. coli extracted from RegulonDB [79]. Predictions were carried out in the -400 to +50 bp regions using as threshold a PATSER score ≥ 6.2 (µ - 1σ), which yielded 45 potential Fnr binding sites in p42d. With a stricter threshold equal to the mean PATSER score (9.77), only eight sites are detected. Nonetheless, given the evidence suggesting that there is high transcriptional activity in p42d under low-oxygen conditions [43], we decided to relax the score to allow more Fnr sites.
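The thresholds used throughout this section follow a single rule: the mean score of the training sites minus one standard deviation. A minimal sketch is given below; whether the sample or the population standard deviation was used is not stated, so the sample form is an assumption, and the function name and scores are hypothetical.

```python
from statistics import mean, stdev

def score_threshold(training_scores: list[float], n_sd: float = 1.0) -> float:
    """Cutoff defined as the mean of the training scores minus n_sd standard deviations."""
    return mean(training_scores) - n_sd * stdev(training_scores)

def passing_sites(candidate_scores: dict[str, float], cutoff: float) -> list[str]:
    """Names of candidate sites whose score reaches the cutoff."""
    return sorted(name for name, s in candidate_scores.items() if s >= cutoff)

# Invented scores for illustration
cutoff = score_threshold([8.0, 10.0, 12.0])
kept = passing_sites({"site1": 7.5, "site2": 8.0, "site3": 11.2}, cutoff)
```

Applied to the figures reported for the nod boxes, a mean of 21.48 and a cutoff of 16.21 imply a standard deviation of 5.27, which shows why a tightly scoring training set leaves little room for detecting additional sites.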

Genome comparisons

Protein sequences from different genomes or symbiotic compartments were obtained from GenBank [52]: pNGR234a U00090; B. japonicum USDA110 symbiotic region AF322012 and AF322013; S. meliloti AL591688; M. loti MAFF303099 NC_002678; M. loti R7A symbiotic island AL672111; A. tumefaciens C58 (U. Washington) AE008688 and AE008689; A. tumefaciens C58 (Cereon) AE007869 and AE007870; Ralstonia solanacearum AL646052; C. crescentus AE005673; Rickettsia prowazekii AJ235269; E. coli O157:H7 BA000007; E. coli K12 U00096; Brucella melitensis AE008917; P. aeruginosa AE004091; Xanthomonas citri AE008923; Nostoc NC_003272; Xanthomonas campestris AE008922; Synechocystis PCC6803 AB001339; Xylella fastidiosa AE003851; Borrelia burgdorferi AE000783; Buchnera sp. APS AP000398; Mycoplasma genitalium L43967; Thermotoga maritima AE000512; Aquifex aeolicus AE000657; Archaeoglobus fulgidus AE000782; Aeropyrum pernix BA000002; Methanobacterium thermoautotrophicum AE000666; Methanococcus jannaschii L77117; Methanopyrus kandleri AE009439. Most probable orthologs were detected applying a previously reported method [69]. Essentially, the method performs BLASTP pairwise comparisons against the protein sequences of p42d, and bidirectional best hits (BDBHs) are used to define the most likely orthologous genes. All BDBHs with an e-value ≤ 0.0001 and alignment coverage of at least 50% of the smaller CDS were taken into consideration.
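The BDBH criterion described above can be sketched as follows, assuming that BLASTP results are already available as tuples of query, subject, e-value, bit score and alignment length. The function names, the use of the bit score to rank hits, and the definition of coverage as alignment length over the length of the shorter protein are assumptions of this sketch rather than details confirmed by the text.

```python
Hit = tuple[str, str, float, float, int]  # (query, subject, evalue, bitscore, aln_len)

def best_hits(hits: list[Hit], len_q: dict[str, int], len_s: dict[str, int],
              max_evalue: float = 1e-4, min_cov: float = 0.5) -> dict[str, str]:
    """Best-scoring subject per query among hits passing the e-value and coverage filters."""
    best: dict[str, tuple[str, float]] = {}
    for q, s, evalue, score, aln_len in hits:
        shorter = min(len_q[q], len_s[s])
        if evalue > max_evalue or aln_len / shorter < min_cov:
            continue
        if q not in best or score > best[q][1]:
            best[q] = (s, score)
    return {q: s for q, (s, _) in best.items()}

def bidirectional_best_hits(hits_ab: list[Hit], hits_ba: list[Hit],
                            len_a: dict[str, int], len_b: dict[str, int]) -> list[tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    ab = best_hits(hits_ab, len_a, len_b)
    ba = best_hits(hits_ba, len_b, len_a)
    return sorted((a, b) for a, b in ab.items() if ba.get(b) == a)

# Invented proteins and hits for illustration
len_a = {"a1": 100, "a2": 200}
len_b = {"b1": 110, "b2": 190}
hits_ab = [("a1", "b1", 1e-30, 200.0, 90), ("a1", "b2", 1e-10, 80.0, 60),
           ("a2", "b2", 1e-50, 300.0, 180)]
hits_ba = [("b1", "a1", 1e-30, 200.0, 90), ("b2", "a2", 1e-50, 300.0, 180),
           ("b2", "a1", 1e-10, 80.0, 60)]
pairs = bidirectional_best_hits(hits_ab, hits_ba, len_a, len_b)
```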

Nucleotide sequence accession number

The nucleotide sequence reported here has been deposited in GenBank under the accession number U80928.

Additional data files

The most relevant features of the functional annotation of p42d can be found in Additional data file 1, available with the online version of this paper. It contains the name, the predicted protein size, the best nr-matching homolog, and the percentage of similarity/identity. Lists of predicted binding sites are shown in Additional data file 2 (nod boxes), Additional data file 3 (RpoN promoters and NifA UAS) and Additional data file 4 (anaeroboxes). The number of BDBHs between several complete genomes and p42d is given in Additional data file 5. The topological representation of the 20 common genes in the six SGCs is detailed in Additional data file 6: A, p42d; B, M. loti MAFF303099; C, pNGR234a; D, M. loti R7A SGC; E, B. japonicum SGC; F, S. meliloti pSymA.

Acknowledgements

We dedicate this paper to Rafael Palacios and Jaime Mora in gratitude for their support and stimulating critical discussions. We are grateful for the skillful technical support and advice given by J.A. Gama, R.E. Gómez, R.I. Santamaría, S. Caro, J. Espíritu, D. García, F. Sánchez, E. Díaz, E. Pérez-Rueda, V. del Moral, K.D. Noel, J. Sanjuan, M. Rosenblueth, P. Gaytán, E. López, P. Rabinowicz, and P.M. Reddy. This work was partially supported by a public grant from CONACyT (México) under the Program for Emerging Areas (N-028).


References1. Garrity GM, Johnson KL, Bell JA, Searles DB: Taxonomic outline

of the Procaryotes. Release 3.0. In Bergey's Manual of SystematicBacteriology. New York: Springer-Verlag 2002.

2. Kaneko T, Nakamura Y, Sato S, Asamizu E, Kato T, Sasamoto S, Wa-tanabe A, Idesawa K, Ishikawa A, Kawashima K, et al.: Complete ge-nome structure of the nitrogen-fixing symbiotic bacteriumMesorhizobium loti. DNA Res 2000, 7:331-338.

3. Barnett MJ, Fisher RF, Jones T, Komp C, Abola AP, Barloy-Hubler F,Bowser L, Capela D, Galibert F, Gouzy J, et al.: Nucleotide se-quence and predicted functions of the entire Sinorhizobiummeliloti pSymA megaplasmid. Proc Natl Acad Sci USA 2001,98:9883-9888.

4. Capela D, Barloy-Hubler F, Gouzy J, Bothe G, Ampe F, Batut J, Bois-tard P, Becker A, Boutry M, Cadieu E, et al.: Analysis of the chro-mosome sequence of the legume symbiont Sinorhizobiummeliloti strain 1021. Proc Natl Acad Sci USA 2001, 98:9877-9882.

5. Finan TM, Weidner S, Wong K, Buhrmester J, Chain P, Vorholter FJ,Hernández-Lucas I, Becker A, Cowie A, Gouzy J, et al.: The com-plete sequence of the 1,683-kb pSymB megaplasmid fromthe N2-fixing endosymbiont Sinorhizobium meliloti. Proc NatlAcad Sci USA 2001, 98:9889-9894.

6. Galibert F, Finan TM, Long SR, Pühler A, Abola P, Ampe F, Barloy-Hubler F, Barnett MJ, Becker A, Boistard P, et al.: The compositegenome of the legume symbiont Sinorhizobium meliloti. Sci-ence 2001, 293:668-672.

7. Kaneko T, Nakamura Y, Sato S, Minamisawa K, Uchiumi T, SasamotoS, Watanabe A, Idesawa K, Iriguchi M, Kawashima K, et al.: Com-plete genomic sequence of nitrogen-fixing symbiotic bacteri-um Bradyrhizobium japonicum USDA110. DNA Res 2002, 9:189-197.

8. Goodner B, Hinkle G, Gattung S, Miller N, Blanchard M, Qurollo B,Goldman BS, Cao Y, Askenazi M, Halling C, et al.: Genome se-quence of the plant pathogen and biotechnology agent Agro-bacterium tumefaciens C58. Science 2001, 294:2323-2328.

9. Wood DW, Setubal JC, Kaul R, Monks DE, Kitajima JP, Okura VK,Zhou Y, Chen L, Wood GE, Almeida NF Jr, et al.: The genome ofthe natural genetic engineer Agrobacterium tumefaciens C58.Science 2001, 294:2317-2323.

10. Freiberg C, Fellay R, Bairoch A, Broughton WJ, Rosenthal A, PerretX: Molecular basis of symbiosis between Rhizobium andlegumes. Nature 1997, 387:394-401.

11. Gottfert M, Rothlisberger S, Kundig C, Beck C, Marty R, Hennecke H:Potential symbiosis-specific genes uncovered by sequencinga 410-kilobase DNA region of the Bradyrhizobium japonicumchromosome. J Bacteriol 2001, 183:1405-1412.

12. Sullivan JT, Trzebiatowski JR, Cruickshank RW, Gouzy J, Brown SD,Elliot RM, Fleetwood DJ, McCallum NG, Rossbach U, Stuart GS, et al.:Comparative sequence analysis of the symbiosis island ofMesorhizobium loti strain R7A. J Bacteriol 2002, 184:3086-3095.

13. Encarnación S, Dunn M, Willms K, Mora J: Fermentative and aer-obic metabolism in Rhizobium etli. J Bacteriol 1995, 177:3058-3066.

14. Romero D, Brom S, Martínez-Salazar J, Girard ML, Palacios R, DávilaG: Amplification and deletion of a nod-nif region in the sym-biotic plasmid of Rhizobium phaseoli. J Bacteriol 1991, 173:2435-2441.

15. Brom S, García de los Santos A, Stepkowsky T, Flores M, Dávila G,Romero D, Palacios R: Different plasmids of Rhizobiumleguminosarum bv. phaseoli are required for optimal symbiot-ic performance. J Bacteriol 1992, 174:5183-5189.

16. Girard ML, Flores M, Brom S, Romero D, Palacios R, Dávila G: Struc-tural complexity of the symbiotic plasmid of Rhizobium legu-minosarum bv. phaseoli. J Bacteriol 1991, 173:2411-2419.

17. Ramírez-Romero MA, Bustos P, Girard L, Rodríguez O, Cevallos MA, Dávila G: Sequence, localization and characteristics of the replicator region of the symbiotic plasmid of Rhizobium etli. Microbiology 1997, 143:2825-2831.

18. Poupot R, Martínez-Romero E, Gautier N, Prome JC: Wild type Rhizobium etli, a bean symbiont, produces acetyl-fucosylated, N-methylated, and carbamoylated nodulation factors. J Biol Chem 1995, 270:6050-6055.

19. Cárdenas L, Domínguez J, Santana O, Quinto C: The role of the nodI and nodJ genes in the transport of Nod metabolites in Rhizobium etli. Gene 1996, 173:183-187.

20. Schlaman HR, Phillips D, Kondorosi E: Genetic organization and transcriptional regulation of rhizobial nodulation genes. In The Rhizobiaceae. Edited by: Spaink HP, Kondorosi A, Hooykaas PJJ. Dordrecht: Kluwer 1998, 361-386.

21. Rostas K, Kondorosi E, Horvath B, Simoncsits A, Kondorosi A: Conservation of extended promoter regions of nodulation genes in Rhizobium. Proc Natl Acad Sci USA 1986, 83:1757-1761.

22. Quinto C, de la Vega H, Flores M, Fernández L, Ballado T, Soberón G, Palacios R: Reiteration of nitrogen fixation gene sequences in Rhizobium phaseoli. Nature 1982, 299:724-726.

23. Rodríguez C, Romero D: Multiple recombination events maintain sequence identity among members of the nitrogenase multigene family in Rhizobium etli. Genetics 1998, 149:785-794.

24. Earl CD, Ronson CW, Ausubel FM: Genetic and structural analysis of the Rhizobium meliloti fixA, fixB, fixC, and fixX genes. J Bacteriol 1987, 169:1127-1136.

25. Kaminski PA, Batut J, Boistard P: A survey of symbiotic nitrogen fixation by Rhizobia. In The Rhizobiaceae. Edited by: Spaink HP, Kondorosi A, Hooykaas PJJ. Dordrecht: Kluwer 1998, 431-460.

26. Kim S, Burgess BK: Evidence for the direct interaction of the nifW gene product with the MoFe protein. J Biol Chem 1996, 271:9764-9770.

27. Lee HS, Berger DK, Kustu S: Activity of purified NIFA, a transcriptional activator of nitrogen fixation genes. Proc Natl Acad Sci USA 1993, 90:2266-2270.

28. Dombrecht B, Tesfay MZ, Verreth C, Heusdens C, Napoles MC, Vanderleyden J, Michiels J: The Rhizobium etli gene iscN is highly expressed in bacteroids and required for nitrogen fixation. Mol Genet Genomics 2002, 267:820-828.

29. Klipp W, Reilander H, Schluter A, Krey R, Pühler A: The Rhizobium meliloti fdxN gene encoding a ferredoxin-like protein is necessary for nitrogen fixation and is cotranscribed with nifA and nifB. Mol Gen Genet 1989, 216:293-302.

30. Hoover TR, Robertson AD, Cerny RL, Hayes RN, Imperial J, Shah VK, Ludden PW: Identification of the V factor needed for synthesis of the iron-molybdenum cofactor of nitrogenase as homocitrate. Nature 1987, 329:855-857.

31. Imperial J, Ugalde RA, Shah VK, Brill WJ: Role of the nifQ gene product in the incorporation of molybdenum into nitrogenase in Klebsiella pneumoniae. J Bacteriol 1984, 158:187-194.

32. Fischer HM: Genetic regulation of nitrogen fixation in rhizobia. Microbiol Rev 1994, 58:352-386.

33. Michiels J, Van Soom T, D'Hooghe I, Dombrecht B, Benhassine T, de Wilde P, Vanderleyden J: The Rhizobium etli rpoN locus: DNA sequence analysis and phenotypical characterization of rpoN, ptsN, and ptsA mutants. J Bacteriol 1998, 180:1729-1740.

34. Valderrama B, Dávalos A, Girard L, Morett E, Mora J: Regulatory proteins and cis-acting elements involved in the transcriptional control of Rhizobium etli reiterated nifH genes. J Bacteriol 1996, 178:3119-3126.

35. Jahn OJ, Dávila G, Romero D, Noel KD: BacS: An abundant bacteroid protein in Rhizobium etli whose expression ex planta requires NifA. Mol Plant-Microbe Interact 2003, 16:65-73.

36. Tully RE, Keister DL: Cloning and mutagenesis of a cytochrome p-450 locus from Bradyrhizobium japonicum that is expressed anaerobically and symbiotically. Appl Environ Microbiol 1993, 59:4136-4142.

37. Dombrecht B, Marchal K, Vanderleyden J, Michiels J: Prediction and overview of the RpoN-regulon in closely related species of the Rhizobiales. Genome Biol 2002, 3:research0076.1-0076.11.

38. Barrios H, Valderrama B, Morett E: Compilation and analysis of sigma(54)-dependent promoter sequences. Nucleic Acids Res 1999, 27:4305-4313.

39. Girard L, Brom S, Dávalos A, López O, Soberón M, Romero D: Differential regulation of fixN-reiterated genes in Rhizobium etli by a novel fixL-fixK cascade. Mol Plant Microbe Interact 2000, 13:1283-1292.

40. Batut J, Daveran-Mingot ML, David M, Jacobs J, Garnerone AM, Kahn D: fixK, a gene homologous with fnr and crp from Escherichia coli, regulates nitrogen fixation genes both positively and negatively in Rhizobium meliloti. EMBO J 1989, 8:1279-1286.

41. López O, Morera C, Miranda-Ríos J, Girard L, Romero D, Soberón M: Regulation of gene expression in response to oxygen in Rhizobium etli: role of FnrN in fixNOQP expression and in symbiotic nitrogen fixation. J Bacteriol 2001, 183:6999-7006.

42. Lynch AS, Lin ECC: Responses to molecular oxygen. In Escherichia coli and Salmonella: Cellular and Molecular Biology. Edited by: Neidhardt FC. Washington, DC: American Society for Microbiology 1996, 1526-1538.

43. Girard ML, Valderrama B, Palacios R, Romero D, Dávila G: Transcriptional activity of the symbiotic plasmid of Rhizobium etli is affected by different environmental conditions. Microbiology 1996, 142:2847-2856.

44. Cornelis GR, Van Gijsegem F: Assembly and function of type III secretory systems. Annu Rev Microbiol 2000, 54:735-774.

45. Viprey V, Del Greco A, Golinowski W, Broughton WJ, Perret X: Symbiotic implications of type III protein secretion machinery in Rhizobium. Mol Microbiol 1998, 28:1381-1389.

46. Christie PJ: Type IV secretion: intercellular transfer of macromolecules by systems ancestrally related to conjugation machines. Mol Microbiol 2001, 40:294-305.

47. Brom S, García-de los Santos A, Cervantes L, Palacios R, Romero D: In Rhizobium etli symbiotic plasmid transfer, nodulation competitivity and cellular growth require interaction among different replicons. Plasmid 2000, 44:34-43.

48. Kitabatake M, Ali K, Demain A, Sakamoto K, Yokoyama S, Soll D: Indolmycin resistance of Streptomyces coelicolor A3(2) by induced expression of one of its two tryptophanyl-tRNA synthetases. J Biol Chem 2002, 277:23882-23887.

49. Flores M, Brom S, Stepkowski T, Girard ML, Dávila G, Romero D, Palacios R: Gene amplification in Rhizobium: identification and in vivo cloning of discrete amplifiable DNA regions (amplicons) from Rhizobium leguminosarum biovar phaseoli. Proc Natl Acad Sci USA 1993, 90:4932-4936.

50. Romero D, Martínez-Salazar J, Girard L, Brom S, Dávila G, Palacios R, Flores M, Rodríguez C: Discrete amplifiable regions (amplicons) in the symbiotic plasmid of Rhizobium etli CFN42. J Bacteriol 1995, 177:973-980.

51. Flores M, Mavingui P, Perret X, Broughton WJ, Romero D, Hernández G, Dávila G, Palacios R: Prediction, identification, and artificial selection of DNA rearrangements in Rhizobium: toward a natural genomic design. Proc Natl Acad Sci USA 2000, 97:9138-9143.

52. GenBank [http://www.ncbi.nlm.nih.gov/Genbank]

53. Vázquez M, Santana O, Quinto C: The NodI and NodJ proteins from Rhizobium and Bradyrhizobium strains are similar to capsular polysaccharide secretion proteins from gram-negative bacteria. Mol Microbiol 1993, 8:369-377.

54. Viprey V, Rosenthal A, Broughton WJ, Perret X: Genetic snapshots of the Rhizobium species NGR234 genome. Genome Biol 2000, 1:research0014.1-0014.17.

55. Buchet A, Nasser W, Eichler K, Mandrand-Berthelot MA: Positive co-regulation of the Escherichia coli carnitine pathway cai and fix operons by CRP and the CaiF activator. Mol Microbiol 1999, 34:562-575.

56. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998, 8:186-194.

57. Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 1998, 8:175-185.

58. Gordon D, Abajian C, Green P: Consed: a graphical tool for sequence finishing. Genome Res 1998, 8:195-202.

59. Delcher AL, Harmon D, Kasif S, White O, Salzberg SL: Improved microbial gene identification with GLIMMER. Nucleic Acids Res 1999, 27:4636-4641.

60. Salzberg SL, Delcher AL, Kasif S, White O: Microbial gene identification using interpolated Markov models. Nucleic Acids Res 1998, 26:544-548.

61. Hayes WS, Borodovsky M: How to interpret an anonymous bacterial genome: machine learning approach to gene identification. Genome Res 1998, 8:1154-1171.

62. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.

63. Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 1999, 15:563-577.

64. IS database [http://www-is.biotoul.fr]

65. Oligo 6.0 primer analysis software [http://www.oligo.net]

66. InterPro database [http://www.ebi.ac.uk/interpro/scan.html]

67. Psort Software [http://psort.nibb.ac.jp/index.html]

68. R. etli database [http://www.cifn.unam.mx/retlidb]

69. Moreno-Hagelsieb G, Collado-Vides J: A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics 2002, 18(Suppl 1):S329-S336.

70. Barrios H, Grande R, Olvera L, Morett E: In vivo genomic footprinting analysis reveals that the complex Bradyrhizobium japonicum fixRnifA promoter region is differently occupied by two distinct RNA polymerase holoenzymes. Proc Natl Acad Sci USA 1998, 95:1014-1019.

71. Charlton W, Cannon W, Buck M: The Klebsiella pneumoniae nifJ promoter: analysis of promoter elements regulating activation by the NifA promoter. Mol Microbiol 1993, 7:1007-1021.

72. Morett E, Buck M: NifA-dependent in vivo protection demonstrates that the upstream activator sequence of nif promoters is a protein binding site. Proc Natl Acad Sci USA 1988, 85:9401-9405.

73. Preker P, Hubner P, Schmehl M, Klipp W, Bickle TA: Mapping and characterization of the promoter elements of the regulatory nif genes rpoN, nifA1 and nifA2 in Rhodobacter capsulatus. Mol Microbiol 1992, 6:1035-1047.

74. Benítez-Bellón E, Moreno-Hagelsieb G, Collado-Vides J: Evaluation of thresholds for the detection of binding sites for regulatory proteins in Escherichia coli K12 DNA. Genome Biol 2002, 3:research0013.1-0013.16.

75. Baev N, Endre G, Petrovics G, Banfalvi Z, Kondorosi A: Six nodulation genes of nod box locus 4 in Rhizobium meliloti are involved in nodulation signal production: nodM codes for D-glucosamine synthetase. Mol Gen Genet 1991, 228:113-124.

76. Fisher RF, Long SR: DNA footprint analysis of the transcriptional activator proteins NodD1 and NodD3 on inducible nod gene promoters. J Bacteriol 1989, 171:5492-5502.

77. Suominen L, Paulin L, Saano A, Saren AM, Tas E, Lindstrom K: Identification of nodulation promoter (nod-box) regions of Rhizobium galegae. FEMS Microbiol Lett 1999, 177:217-223.

78. Wang SP, Stacey G: Studies of the Bradyrhizobium japonicum nodD1 promoter: a repeated structure for the nod box. J Bacteriol 1991, 173:3356-3365.

79. Salgado H, Santos-Zavaleta A, Gama-Castro S, Millán-Zárate D, Díaz-Peredo E, Sánchez-Solano F, Pérez-Rueda E, Bonavídes-Martínez C, Collado-Vides J: RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res 2001, 29:72-74.
