Apéro RubyBdx - MongoDB - 8-11-2011

Pierre-Louis Gottfrois, Bastien Murzeau

Apéro Ruby Bordeaux, 8 November 2011

• Brief introduction

• Hands-on example

• Map / Reduce

What is MongoDB?

MongoDB is a NoSQL database that is schemaless and document-oriented.

Schemaless

• Very useful in ‘agile’ development (iterations, quick changes, flexibility for developers)

• Supports features that, in relational databases, would be:

  • nearly impossible (storing open-ended sets of elements, e.g. tags)

  • too complex for what they are (migrations)

Document-oriented

• MongoDB stores documents, not rows

• documents are stored as BSON (binary JSON)

• the query syntax is as rich as SQL's

• the ‘embedded’ documents mechanism solves many commonly encountered problems

Document-oriented

• documents are stored in a collection (the equivalent of a model in Ruby on Rails)

• part of this data is indexed to optimize performance

• a document is not a dumping ground!
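To make the document model concrete, here is a minimal sketch in plain JavaScript (the language of the mongo shell) of two documents that could live in the same collection. The collection name and all field names (`title`, `tags`, `messages`, `author`, `body`) are illustrative assumptions, not taken from the talk.

```javascript
// Two documents of a hypothetical "conversations" collection.
// All field names here are assumptions made for illustration.
const conversations = [
  {
    _id: 1,
    title: "Apéro Ruby",
    tags: ["ruby", "mongodb"], // open-ended list, no join table needed
    messages: [
      // embedded documents stored inside their parent
      { author: "alice", body: "Hello" },
      { author: "bob", body: "Hi!" },
    ],
  },
  {
    _id: 2,
    title: "Untitled", // no tags, no messages: still a valid document
  },
];

// Embedded data travels with its parent, so reading it needs no join.
const authors = conversations.flatMap((c) =>
  (c.messages || []).map((m) => m.author)
);
console.log(authors); // [ 'alice', 'bob' ]
```

The second document omits fields the first one has, which is what "schemaless" means in practice; the application code has to tolerate the missing fields.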

Storing large volumes of data

• MongoDB (like other NoSQL databases) is better suited to horizontal scaling

• servers are added to increase storage capacity (“sharding”)

• which also provides better availability

• optimized load balancing between nodes

• scaling up is transparent to the application
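The idea of spreading documents over several servers can be pictured with a toy router. This is only an illustration under loud assumptions: the `shardFor` function and the modulo rule are hypothetical, and MongoDB's real sharding is managed by the cluster itself rather than by application code like this.

```javascript
// Toy router: picks a shard from a numeric shard key.
// Assumption: this is NOT MongoDB's algorithm, only a picture of the idea.
function shardFor(shardKey, shardCount) {
  return shardKey % shardCount;
}

// Hypothetical documents to distribute.
const tickets = [
  { _id: 1, checkout: 100 },
  { _id: 2, checkout: 42 },
  { _id: 3, checkout: 215 },
  { _id: 4, checkout: 73 },
];

// Distribute the documents over two hypothetical shards.
const shards = [[], []];
for (const ticket of tickets) {
  shards[shardFor(ticket._id, shards.length)].push(ticket);
}

console.log(shards.map((s) => s.map((t) => t._id))); // [ [ 2, 4 ], [ 1, 3 ] ]
```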

Hands-on example

• the ORM becomes an ODM; the reference gem is mongoid

• alternatives: MongoMapper, DataMapper

• Creating an application backed by MongoDB

• rails new nosql

• edit the Gemfile

• gem 'mongoid'

• gem 'bson_ext'

• bundle install

• rails generate mongoid:config

Hands-on example

• edit config/application.rb: comment out rails/all and require only the railties needed, leaving out Active Record

• #require 'rails/all'

• require "action_controller/railtie"

• require "action_mailer/railtie"

• require "active_resource/railtie"

• require "rails/test_unit/railtie"

Hands-on example

class Conversation
  include Mongoid::Document
  include Mongoid::Timestamps

  field :public, :type => Boolean, :default => false

  has_many :scores, :as => :scorable, :dependent => :delete
  has_and_belongs_to_many :subjects
  belongs_to :timeline
  embeds_many :messages
end

class Subject
  include Mongoid::Document
  include Mongoid::Timestamps

  has_many :scores, :as => :scorable, :dependent => :delete, :autosave => true
  has_many :requests, :dependent => :delete
  belongs_to :author, :class_name => 'User'
end

Map Reduce

Example

{ "id" : 1, "day" : 20111017, "checkout" : 100 }
{ "id" : 2, "day" : 20111017, "checkout" : 42 }
{ "id" : 3, "day" : 20111017, "checkout" : 215 }
{ "id" : 4, "day" : 20111017, "checkout" : 73 }

A “tickets” collection

Problem statement

• We want to

  • compute the sum of the ‘checkout’ field over all objects in our tickets collection

  • be able to distribute this operation over the network

  • be fast!

• We don’t want to

  • go over all objects again when an update is made

Map : emit(checkout)

{ "id" : 1, "day" : 20111017, "checkout" : 100 }
{ "id" : 2, "day" : 20111017, "checkout" : 42 }
{ "id" : 3, "day" : 20111017, "checkout" : 215 }
{ "id" : 4, "day" : 20111017, "checkout" : 73 }

100 42 215 73

The ‘map’ function emits (selects) the checkout value of each object in our collection.
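The map step described above can be tried outside MongoDB with a small stand-in. In this sketch the `emit` function and the `emitted` array are assumptions that imitate what the server provides; only the shape of `map` (reading the current document through `this`) mirrors the mongo shell.

```javascript
// The four tickets from the slide.
const tickets = [
  { id: 1, day: 20111017, checkout: 100 },
  { id: 2, day: 20111017, checkout: 42 },
  { id: 3, day: 20111017, checkout: 215 },
  { id: 4, day: 20111017, checkout: 73 },
];

// Stand-in for MongoDB's emit(): collects (key, value) pairs in memory.
const emitted = [];
function emit(key, value) {
  emitted.push({ key, value });
}

// Same shape as a mongo shell map function: `this` is the current document.
function map() {
  emit(null, this.checkout);
}

// Call map once per document, binding `this` to that document.
tickets.forEach((ticket) => map.call(ticket));

console.log(emitted.map((e) => e.value)); // [ 100, 42, 215, 73 ]
```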

Reduce : sum(checkout)

{ "id" : 1, "day" : 20111017, "checkout" : 100 }
{ "id" : 2, "day" : 20111017, "checkout" : 42 }
{ "id" : 3, "day" : 20111017, "checkout" : 215 }
{ "id" : 4, "day" : 20111017, "checkout" : 73 }

100 42 215 73

142 288

430

Reduce function

The ‘reduce’ function applies the aggregation logic to the values received for each key from the ‘map’ function.

This function has to be commutative and ‘idempotent’ so that it can be re-applied to its own output or run in a distributed system:

reduce(k, [A, B]) == reduce(k, [B, A])

reduce(k, [A, B]) == reduce(k, [reduce(k, [A, B])])
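These requirements can be checked mechanically for a sum. The sketch below assumes that `reduce` receives its values as an array, as in the mongo shell, and also tests associativity, which goes hand in hand with the two properties above; the sample values A, B and C are taken from the tickets.

```javascript
// A sum-based reduce, with the values passed as an array.
function reduce(key, values) {
  let sum = 0;
  for (const value of values) sum += value;
  return sum;
}

const A = 100;
const B = 42;
const C = 215;

// Order of the values does not matter.
const commutative = reduce(null, [A, B]) === reduce(null, [B, A]);

// A partial result can be fed back in alongside raw values.
const associative =
  reduce(null, [C, reduce(null, [A, B])]) === reduce(null, [C, A, B]);

// Reducing an already reduced value changes nothing.
const idempotent =
  reduce(null, [reduce(null, [A, B])]) === reduce(null, [A, B]);

console.log(commutative, associative, idempotent); // true true true
```

An average computed directly inside `reduce` would fail the second check, which is why an average is usually obtained by reducing a sum and a count and dividing at the end.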

Inherently Distributed

{ "id" : 1, "day" : 20111017, "checkout" : 100 }
{ "id" : 2, "day" : 20111017, "checkout" : 42 }
{ "id" : 3, "day" : 20111017, "checkout" : 215 }
{ "id" : 4, "day" : 20111017, "checkout" : 73 }

100 42 215 73

142 288

430

Distributed

Since the ‘map’ function emits objects to be reduced, and the ‘reduce’ function processes each group of emitted objects independently, the work can be distributed across multiple workers.
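The 142 / 288 / 430 tree from the slide can be reproduced by treating each half of the emitted values as the share of one worker. The two "workers" here are hypothetical and run in the same process; the sketch only shows that partial reductions combine into the same total.

```javascript
// Values emitted by the map step.
const checkouts = [100, 42, 215, 73];

function reduce(key, values) {
  return values.reduce((sum, value) => sum + value, 0);
}

// Two hypothetical workers each reduce half of the emitted values...
const workerA = reduce(null, checkouts.slice(0, 2));
const workerB = reduce(null, checkouts.slice(2));

// ...and a final reduce combines the partial results.
const total = reduce(null, [workerA, workerB]);

console.log(workerA, workerB, total); // 142 288 430
```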

map reduce

Logarithmic Update

For the same reason, when an object is updated, we don’t have to reprocess every object.

We can call the ‘map’ function on the updated object only, then re-run ‘reduce’ on the partial results that depend on it.

Logarithmic Update

{ "id" : 1, "day" : 20111017, "checkout" : 100 }
{ "id" : 2, "day" : 20111017, "checkout" : 42 }
{ "id" : 3, "day" : 20111017, "checkout" : 210 }
{ "id" : 4, "day" : 20111017, "checkout" : 73 }

100 42 215 73

142 288

430

Logarithmic Update

{ "id" : 1, "day" : 20111017, "checkout" : 100 }
{ "id" : 2, "day" : 20111017, "checkout" : 42 }
{ "id" : 3, "day" : 20111017, "checkout" : 210 }
{ "id" : 4, "day" : 20111017, "checkout" : 73 }

100 42 210 73

142 288

430

Logarithmic Update

{ "id" : 1, "day" : 20111017, "checkout" : 100 }
{ "id" : 2, "day" : 20111017, "checkout" : 42 }
{ "id" : 3, "day" : 20111017, "checkout" : 210 }
{ "id" : 4, "day" : 20111017, "checkout" : 73 }

100 42 210 73

142 283

430

Logarithmic Update

{ "id" : 1, "day" : 20111017, "checkout" : 100 }
{ "id" : 2, "day" : 20111017, "checkout" : 42 }
{ "id" : 3, "day" : 20111017, "checkout" : 210 }
{ "id" : 4, "day" : 20111017, "checkout" : 73 }

100 42 210 73

142 283

425
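The update walk-through above (215 becomes 210, then 288 becomes 283, then 430 becomes 425) can be sketched as a small reduction tree. This is a conceptual model of the slides' diagram, not a description of MongoDB's internals; `buildTree` and `update` are hypothetical helpers written for this example.

```javascript
function reduce(key, values) {
  return values.reduce((sum, value) => sum + value, 0);
}

// Build the reduction tree bottom-up: levels[0] holds the emitted values,
// the last level holds the single final result.
function buildTree(values) {
  const levels = [values.slice()];
  while (levels[levels.length - 1].length > 1) {
    const below = levels[levels.length - 1];
    const level = [];
    for (let i = 0; i < below.length; i += 2) {
      level.push(reduce(null, below.slice(i, i + 2)));
    }
    levels.push(level);
  }
  return levels;
}

// Re-reduce only the ancestors of the changed leaf and return how many
// reduce calls that took.
function update(levels, index, newValue) {
  levels[0][index] = newValue;
  let calls = 0;
  for (let depth = 1; depth < levels.length; depth++) {
    index = Math.floor(index / 2);
    const below = levels[depth - 1];
    levels[depth][index] = reduce(null, below.slice(2 * index, 2 * index + 2));
    calls++;
  }
  return calls;
}

const levels = buildTree([100, 42, 215, 73]);
console.log(levels); // [ [ 100, 42, 215, 73 ], [ 142, 288 ], [ 430 ] ]

// Ticket 3 changes from 215 to 210: only its two ancestors are recomputed.
const calls = update(levels, 2, 210);
console.log(levels, calls); // [ [ 100, 42, 210, 73 ], [ 142, 283 ], [ 425 ] ] 2
```

With n emitted values the tree has about log2(n) levels above the leaves, so one update costs about log2(n) reduce calls instead of a pass over all n values.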

Let’s do some code!

$> mongo

> db.tickets.save({ "_id": 1, "day": 20111017, "checkout": 100 })
> db.tickets.save({ "_id": 2, "day": 20111017, "checkout": 42 })
> db.tickets.save({ "_id": 3, "day": 20111017, "checkout": 215 })
> db.tickets.save({ "_id": 4, "day": 20111017, "checkout": 73 })

> db.tickets.count()
4

> db.tickets.find()
{ "_id" : 1, "day" : 20111017, "checkout" : 100 }
...

> db.tickets.find({ "_id": 1 })
{ "_id" : 1, "day" : 20111017, "checkout" : 100 }

> var map = function() {
... emit(null, this.checkout)
... }

> var reduce = function(key, values) {
... var sum = 0
... for (var index in values) sum += values[index]
... return sum
... }

Temporary Collection

> sumOfCheckouts = db.tickets.mapReduce(map, reduce)
{
  "result" : "tmp.mr.mapreduce_123456789_4",
  "timeMillis" : 8,
  "counts" : { "input" : 4, "emit" : 4, "output" : 1 },
  "ok" : 1
}

> db.getCollectionNames()
[ "tickets", "tmp.mr.mapreduce_123456789_4" ]

> db[sumOfCheckouts.result].find()
{ "_id" : null, "value" : 430 }

Persistent Collection

> db.tickets.mapReduce(map, reduce, { "out" : "sumOfCheckouts" })

> db.getCollectionNames()
[ "sumOfCheckouts", "tickets", "tmp.mr.mapreduce_123456789_4" ]

> db.sumOfCheckouts.find()
{ "_id" : null, "value" : 430 }

> db.sumOfCheckouts.findOne().value
430

Reduce by Date

> var map = function() {
... emit(this.day, this.checkout)
... }

> var reduce = function(key, values) {
... var sum = 0
... for (var index in values) sum += values[index]
... return sum
... }

> db.tickets.mapReduce(map, reduce, { "out" : "sumOfCheckouts" })

> db.sumOfCheckouts.find()
{ "_id" : 20111017, "value" : 430 }
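Grouping by key is easier to see with more than one day. The sketch below runs the same kind of map and reduce functions through a minimal in-memory stand-in for mapReduce. The `mapReduce` helper and the fifth ticket (day 20111018) are assumptions added for illustration; the real command runs inside the MongoDB server.

```javascript
let emit; // assigned by mapReduce below, imitating the shell's global emit()

// Minimal in-memory stand-in for db.collection.mapReduce (hypothetical helper).
function mapReduce(docs, map, reduce) {
  const groups = new Map();
  emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach((doc) => map.call(doc));
  // As in MongoDB, reduce is skipped for keys that received a single value.
  return [...groups].map(([key, values]) => ({
    _id: key,
    value: values.length === 1 ? values[0] : reduce(key, values),
  }));
}

const tickets = [
  { _id: 1, day: 20111017, checkout: 100 },
  { _id: 2, day: 20111017, checkout: 42 },
  { _id: 3, day: 20111017, checkout: 215 },
  { _id: 4, day: 20111017, checkout: 73 },
  { _id: 5, day: 20111018, checkout: 10 }, // hypothetical extra ticket
];

const map = function () {
  emit(this.day, this.checkout);
};

const reduce = function (key, values) {
  let sum = 0;
  for (const value of values) sum += value;
  return sum;
};

const sumOfCheckouts = mapReduce(tickets, map, reduce);
console.log(sumOfCheckouts);
// [ { _id: 20111017, value: 430 }, { _id: 20111018, value: 10 } ]
```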

What we can do

Scored Subjects per User

Subject  User  Score
1        1     2
1        1     2
1        2     2
2        1     2
2        2     10
2        2     5

Scored Subjects per User (reduced)

Subject  User  Score
1        1     4
1        2     2
2        1     2
2        2     15

$> mongo

> db.scores.save({ "_id": 1, "subject_id": 1, "user_id": 1, "score": 2 })
> db.scores.save({ "_id": 2, "subject_id": 1, "user_id": 1, "score": 2 })
> db.scores.save({ "_id": 3, "subject_id": 1, "user_id": 2, "score": 2 })
> db.scores.save({ "_id": 4, "subject_id": 2, "user_id": 1, "score": 2 })
> db.scores.save({ "_id": 5, "subject_id": 2, "user_id": 2, "score": 10 })
> db.scores.save({ "_id": 6, "subject_id": 2, "user_id": 2, "score": 5 })

> db.scores.count()
6

> db.scores.find()
{ "_id": 1, "subject_id": 1, "user_id": 1, "score": 2 }
...

> db.scores.find({ "_id": 1 })
{ "_id": 1, "subject_id": 1, "user_id": 1, "score": 2 }

> var map = function() {
... emit([this.user_id, this.subject_id].join("-"),
...      { subject_id: this.subject_id, user_id: this.user_id, score: this.score });
... }

> var reduce = function(key, values) {
... var result = { user_id: "", subject_id: "", score: 0 };
... values.forEach(function (value) {
...   result.score += value.score;
...   result.user_id = value.user_id;
...   result.subject_id = value.subject_id;
... });
... return result
... }

ReducedScores Collection

> db.scores.mapReduce(map, reduce, { "out" : "reduced_scores" })

> db.getCollectionNames()
[ "reduced_scores", "scores" ]

> db.reduced_scores.find()
{ "_id" : "1-1", "value" : { "user_id" : 1, "subject_id" : 1, "score" : 4 } }
{ "_id" : "1-2", "value" : { "user_id" : 1, "subject_id" : 2, "score" : 2 } }
{ "_id" : "2-1", "value" : { "user_id" : 2, "subject_id" : 1, "score" : 2 } }
{ "_id" : "2-2", "value" : { "user_id" : 2, "subject_id" : 2, "score" : 15 } }

> db.reduced_scores.findOne().value.score
4
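One detail of the score example is worth checking: `reduce` returns an object of the same shape as the values emitted by `map`, so a partial result can be fed back into `reduce`. The sketch below tests this for the key "2-2" in plain JavaScript; the in-memory `scores` array and the manual filtering stand in for the collection and the map step.

```javascript
const scores = [
  { _id: 1, subject_id: 1, user_id: 1, score: 2 },
  { _id: 2, subject_id: 1, user_id: 1, score: 2 },
  { _id: 3, subject_id: 1, user_id: 2, score: 2 },
  { _id: 4, subject_id: 2, user_id: 1, score: 2 },
  { _id: 5, subject_id: 2, user_id: 2, score: 10 },
  { _id: 6, subject_id: 2, user_id: 2, score: 5 },
];

// Same logic as the reduce function used in the shell session.
function reduce(key, values) {
  const result = { user_id: "", subject_id: "", score: 0 };
  values.forEach((value) => {
    result.score += value.score;
    result.user_id = value.user_id;
    result.subject_id = value.subject_id;
  });
  return result;
}

// Values that map would emit for key "2-2" (user 2, subject 2).
const values = scores
  .filter((s) => s.user_id === 2 && s.subject_id === 2)
  .map((s) => ({ subject_id: s.subject_id, user_id: s.user_id, score: s.score }));

// Reducing everything at once...
const direct = reduce("2-2", values);

// ...must match reducing a partial result together with the remaining value.
const partial = reduce("2-2", [values[0]]);
const rereduced = reduce("2-2", [partial, values[1]]);

console.log(direct.score, rereduced.score); // 15 15
```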

Querying from Rails

ruby-1.9.2-p180 :007 > ReducedScores.first
 => #<ReducedScores _id: 1-1, _type: nil, value: {"user_id"=>BSON::ObjectId('...'), "subject_id"=>BSON::ObjectId('...'), "score"=>4.0}>

ruby-1.9.2-p180 :008 > ReducedScores.where("value.user_id" => u1.id).count
 => 2

ruby-1.9.2-p180 :009 > ReducedScores.where("value.user_id" => u1.id).first.value['score']
 => 4.0

ruby-1.9.2-p180 :010 > ReducedScores.where("value.user_id" => u1.id).last.value['score']
 => 2.0

Questions ?