Oxalide MorningTech #1 - BigData

MorningTech #1 – BigData, 15 December 2016 – Ludovic Piot

Oxalide events

Apérotech

• Goal: presentation of a business or technical topic
• Open to all: 80 to 100 people
• Format: one evening per quarter, 6 pm to 9 pm
• Introduction of the topic by a partner
• Round table with clients and non-clients
• Informal discussion over a buffet reception

Workshop

• Goal: presentation of a technology
• Clients only: technical audience with laptops, 30 people
• Format: one morning per quarter, 9 am to 1 pm
• Presentation of the technology
• Tutorial on command-line configuration

Morning Tech

• Goal: presentation of a business or technical topic
• Clients only: 30 people
• Format: one morning per quarter, 9 am to 12 pm
• Big picture
• Demonstration and lessons learned

Speakers

Ludovic Piot, Consulting / Architecture / DevOps @ Oxalide

@lpiot

Oxalide is hiring! Contact us at [email protected]

Challenges & trends

SoLoMo and IoT – the data explosion

SOcial

LOcal

MObile

IoT – the data explosion

Copyright © 2014, Hortonworks, Inc. All rights reserved.

Enterprise Data Trends @ Scale

The volume of data that is available for analysis is transforming organizations, as well as the entire IT industry. Everyone is seeing data external to an organization as becoming just as strategic as internal data. Semi-structured and unstructured data volume is beginning to dwarf the traditional data in relational databases and data warehouses.

• Facebook has around 50 PB warehouse and it’s constantly growing.
• Twitter messages are 140 bytes each generating 8TB data per day.
• Data is more than doubling every year.
• Almost 80% of data will be unstructured data.
• Netflix: 75% of streaming video results from recommendations.
• Amazon: 35% of product sales come from product recommendations.

Organizations are redefining data strategies due to the requirements of the evolving Enterprise Data Warehouse (EDW).

Data sources shown: Enterprise Data, VoIP, Machine Data, Social Media

The 3 Vs: Gartner's dimensions

• Volume: the volume of data created and managed keeps growing (+59% per year in 2011).
• Variety: the types of data collected are highly varied (text, sound, images, logs…). Processing tools must handle this diversity.
• Velocity: data must be usable as it is collected. It has to be used quickly, or it has no value.

Two emerging Vs:

• Veracity: adds a notion of data quality for the business.
• Visibility: stresses that data must be accessible to the business to enable fast decision-making.

Big Data trends over time

Figure: batch (reports) → real time (alerts) → predictive (forecasts)

Principles

Big Data vs. traditional data management


Traditional Systems vs. Hadoop

Hadoop is not designed to replace existing relational databases or data warehouses. Relational databases are designed to manage transactions. They contain a lot of features and functionality designed around managing transactions. They are based upon schema-on-write. Organizations have spent years building Enterprise Data Warehouses (EDW) and reporting systems for their traditional data. The traditional EDWs are not going anywhere either. EDWs are also based on schema-on-write.

Hadoop is not:

• Relational
• NoSQL
• Real-time
• A database

Hadoop is a data platform that complements existing data systems. Hadoop is designed for schema-on-read and can handle the large data volumes coming from semi-structured and unstructured data. With the low cost of storage on Hadoop, organizations are looking at using Hadoop more for archiving.
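The schema-on-read idea can be illustrated with a minimal Python sketch. It is not Hadoop code: the sample records, the field names and the `read_with_schema` helper are assumptions made up for the example. Raw lines are stored untouched, and a schema is applied only when the data is read.

```python
import json

# Raw events are stored exactly as produced (hypothetical sample data).
RAW_LINES = [
    '{"sensor": "s1", "temp": "21.5", "ts": 1481792400}',
    '{"sensor": "s2", "temp": "19.0", "ts": 1481792460, "battery": 0.8}',
    'not even JSON',
]

def read_with_schema(lines, schema):
    """Apply a schema while reading: keep known fields, cast them, skip bad rows."""
    for line in lines:
        try:
            record = json.loads(line)
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (ValueError, KeyError, TypeError):
            continue  # malformed or incomplete row: skipped at read time, still stored

# The schema is chosen by the reader, not enforced when the data was written.
TEMPERATURE_SCHEMA = {"sensor": str, "temp": float}
rows = list(read_with_schema(RAW_LINES, TEMPERATURE_SCHEMA))
print(rows)
```

A different reader could apply another schema (for example one that includes `battery`) to the same raw lines, which is the flexibility schema-on-write gives up in exchange for validated, fast reads.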

Traditional Systems vs. Hadoop (comparison)

Criterion | Traditional Database | Hadoop Distribution
Schema | Required on write | Required on read
Speed | Reads are fast | Writes are fast
Governance | Standards and structured | Loosely structured
Processing | Limited, no data processing | Processing coupled with data
Data types | Structured | Multi and unstructured
Best fit use | Interactive OLAP Analytics, Complex ACID Transactions, Operational Data Store | Data Discovery, Processing unstructured data, Massive Storage/Processing

Systems placed along the scale (storage & processing) axis: Traditional Database, EDW, MPP Analytics, NoSQL, Hadoop Distribution

Distributed storage


Data Integrity – Writing Data

High performing applications stream data to files. HDFS does this as well; the HDFS client caches packets of data in memory. Once that data reaches the HDFS block size, the client will notify the NameNode. The NameNode will provide the DataNode information about, and the locations, for the block replicas. The client will then stream the packet of data to the first targeted DataNode. Replication is performed in a pipeline fashion; the first DataNode will start writing the block and will then transfer that data to the second DataNode. The second DataNode will start sending the data to the third DataNode and so on.

When the blocks in a directory reach a defined limit, which is controlled via dfs.datanode.numblocks, the DataNode will define a new subdirectory. After defining the subdirectory it will start placing new data blocks and the corresponding metadata in that subdirectory. This is performed using a fan-out structure ensuring no single directory is overloaded with files or becomes too deep.

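Figure: Data Integrity – Writing Data (data pipeline)

1. Client to NameNode: "I want to write a block of data."
2. NameNode to Client: "OK, please use DataNodes 1, 4, 12."
3. Client to DataNode 1: data + checksum.
4. Each DataNode verifies the checksum; data and checksum are passed along the pipeline (DataNode 1, DataNode 4, DataNode 12).
5. Success is acknowledged back along the pipeline.
6. Success is reported to the Client.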
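The write pipeline can be mimicked with a toy Python model. This is a sketch under assumptions, not HDFS code: the `DataNode`, `NameNode` and `client_write` names are invented for the example, everything runs in one process, and `zlib.crc32` stands in for the real HDFS checksum.

```python
import zlib

class DataNode:
    """Toy DataNode: verifies the checksum, stores the block, forwards it downstream."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, checksum, pipeline):
        if zlib.crc32(data) != checksum:
            return False  # corrupted data: report failure upstream
        self.blocks[block_id] = data
        if pipeline:  # forward to the next replica in the pipeline
            next_node, rest = pipeline[0], pipeline[1:]
            return next_node.write(block_id, data, checksum, rest)
        return True  # last replica written: success flows back up the chain

class NameNode:
    """Toy NameNode: only decides which DataNodes receive the replicas."""
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication

    def allocate(self):
        return self.datanodes[: self.replication]

def client_write(namenode, block_id, data):
    targets = namenode.allocate()   # ask the NameNode where to write
    checksum = zlib.crc32(data)     # the client sends data + checksum
    first, rest = targets[0], targets[1:]
    return first.write(block_id, data, checksum, rest)  # success or failure comes back

nodes = [DataNode("DataNode 1"), DataNode("DataNode 4"), DataNode("DataNode 12")]
ok = client_write(NameNode(nodes), "blk_0001", b"hello big data")
print(ok, [n.name for n in nodes if "blk_0001" in n.blocks])
```

The client only talks to the first DataNode; replication to the others happens node to node, which is the point of the pipeline design.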

The CAP theorem

Map/Reduce


MapReduce

The original use-case for Hadoop was distributed batch processing. MapReduce is a powerful application paradigm for processing massive amounts of data. Core features of MapReduce are:

• Co-locating processing with data blocks: Take the computing to where the data lives, rather than querying or reading data into a remote application. Would you rather move hundreds of GB/TB of data around your network, or would you rather move an application that processes the same data to where the data actually lives?

• Map Phase: This is the initial phase of all MapReduce jobs. This is where raw data can be read, extracted, transformed, and results written out to HDFS or moved on to Reducers for aggregate processing, such as a final count, sum, min, max, etc. The Map phase can also be thought of as the ETL or projection step for MapReduce.

• Reduce Phase: This is the final phase where data is sorted on a user-defined key and grouped by that same key. The Reducer has the option to perform an

Figure: MapReduce. Map Phase (Mappers running on NM + DN nodes) → Shuffle/Sort (data is shuffled across the network and sorted) → Reduce Phase (Reducers running on NM + DN nodes)
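The three phases can be sketched in plain Python with the classic word count. This is a single-process illustration, not Hadoop code; the function names are chosen for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit one (key, value) pair per word."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_sort(pairs):
    """Shuffle/sort: group values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reducer: aggregate all the values of one key (here, a sum)."""
    return key, sum(values)

def map_reduce(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(key, values) for key, values in shuffle_sort(mapped)]

counts = map_reduce(["big data is big", "data is data"])
print(counts)
```

On a real cluster each mapper would run next to the data block it reads, and only the (key, value) pairs would cross the network during the shuffle.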

The latency table

The Big Data pipeline

data → ingest / collect → store → process → analyse → answers

Trade-offs: time to answer (latency), throughput, cost

The Lambda Architecture


Defining Data Layers

There are multiple ways of organizing data in an Enterprise Data Warehouse and the same goes for Hadoop.

One way is the Lambda Architecture, which defines different data layers. A Hadoop cluster can work by itself or be integrated with HBase and other EDWs and ODSs to build different data layers that meet the data needs of an organization.

The process of building different data layers is a familiar concept within data warehousing and analytics. The data layers are built in a Hadoop cluster for the same reason they have been built in data warehouses for the last 30 years: to facilitate speed. There are 3 data layers:

• Batch Layer: Immutable master data set (source of truth). Used to create views for the batch layer.
• Serving Layer: Contains pre-computed views.
• Speed Layer: Contains additional levels of pre-computed views, structures and indexes to reduce the latency that exists in the serving layer.

Figure: Defining Data Layers

• Batch Layer: Extract & Load
• Serving Layer: Standardize, Cleanse, Integrate, Filter, Transform
• Speed Layer: Conform, Summarize, Access

•  Organize data based on source/derived relationships
•  Allows for fault and rebuild process
•  There are lots of different ways of organizing data in an enterprise data platform that includes Hadoop.
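How the layers cooperate can be sketched with a toy page-view counter. This is an illustrative assumption rather than the course's implementation: the `LambdaCounter` class is invented for the example and follows the common formulation in which the speed layer covers only the events that arrived since the last batch run.

```python
from collections import Counter

class LambdaCounter:
    """Toy page-view counter with a batch view and a speed view."""
    def __init__(self):
        self.master = []             # batch layer: immutable, append-only events
        self.batch_view = Counter()  # serving layer: view precomputed from master
        self.speed_view = Counter()  # speed layer: events since the last batch run

    def ingest(self, page):
        self.master.append(page)     # every event reaches the master data set
        self.speed_view[page] += 1   # and is visible at once through the speed view

    def run_batch(self):
        self.batch_view = Counter(self.master)  # full recompute from the source of truth
        self.speed_view.clear()                 # the speed view only covers the gap

    def query(self, page):
        return self.batch_view[page] + self.speed_view[page]

lam = LambdaCounter()
for page in ["/home", "/home", "/pricing"]:
    lam.ingest(page)
lam.run_batch()
lam.ingest("/home")          # arrives after the batch run
print(lam.query("/home"))
```

Because the batch view is always rebuilt from the immutable master data set, a bug in the speed layer is corrected by the next batch run.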

Evolution of Big Data processing

http://www.slideshare.net/1Strategy/2016-utah-cloud-summit-big-data-architectural-patterns-and-best-practices-on-aws

Figure: pipeline stages Collect, Store, Analyse and Consume, with an ETL step; the stages are annotated with data temperature and speed labels (Hot, Warm, Cold, Slow)

Ecosystem



Dataflow

Dataproc

BigQuery

BigTable

CloudSQL

Cloud Pub/Sub

Demo Time

Amazon S3

http://bit.ly/2grJMMf

Shard 0

Amazon Kinesis

Amazon Cognito

Amazon EC2

R Shiny-Server

https://github.com/lpiot/amazon-kinesis-IoT-sensor-demo

Machine learning & deep learning

The data science approach

Machine Learning: supervised learning

• Data set: labelled (with the answers)
• Learning objective:
  • Regression (forecasting)
  • Classification

Hypothesis and cost function

Goal: find a function h that faithfully represents the data.

Linear regression: $h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$
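A minimal Python sketch of fitting such a hypothesis follows. The slide does not state the cost function; the halved mean squared error and the gradient descent loop used here are the usual choices for linear regression and are assumptions, as are the learning rate, the step count and the sample data.

```python
def h(theta, x):
    """Hypothesis: theta_0 + theta_1 * x_1 + ... + theta_n * x_n."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

def cost(theta, xs, ys):
    """Assumed cost function: halved mean squared error."""
    m = len(xs)
    return sum((h(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_descent(xs, ys, alpha=0.05, steps=5000):
    """Adjust the coefficients step by step to reduce the cost."""
    n, m = len(xs[0]), len(xs)
    theta = [0.0] * (n + 1)
    for _ in range(steps):
        errors = [h(theta, x) - y for x, y in zip(xs, ys)]
        grad = [sum(errors) / m]
        grad += [sum(e * x[j] for e, x in zip(errors, xs)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]  # labelled data generated by y = 1 + 2x
theta = gradient_descent(xs, ys)
print(theta, cost(theta, xs, ys))
```

With these labelled points the coefficients converge towards 1 and 2, the values used to generate the data.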

Machine Learning: unsupervised learning

• Data set: unlabelled (no answers)
• Learning objective:
  • Identify / detect structures in the data

Clustering (unsupervised classification) algorithms

Goal: find the algorithm that best separates the structures in the data.
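As one example of such an algorithm, here is a minimal k-means sketch in plain Python. The slide does not name a specific algorithm, so the choice of k-means is an assumption, as are the sample points and the simplistic initialization (the first k points serve as starting centroids).

```python
def kmeans(points, k, steps=10):
    """Plain k-means on 2-D points; the first k points are the initial centroids."""
    centroids = list(points[:k])
    for _ in range(steps):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(clusters)
```

No labels are supplied: the two groups emerge from the distances between points alone.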

Neural networks

• Modelled on how a brain works
• Non-linear hypothesis!
• Multi-class classification
• As before, we try to minimize the cost function by gradually adjusting the coefficients Θ(i)
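The "non-linear hypothesis" point can be shown with a tiny network computing XNOR, a function no linear hypothesis can represent. This sketch only performs the forward pass: the coefficients in `THETA_1` and `THETA_2` are hand-picked for the illustration (an assumption), whereas in practice they would be learned by minimizing the cost function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(theta, inputs):
    """One layer: each row of theta is [bias, w_1, ..., w_n] for one unit."""
    return [sigmoid(row[0] + sum(w * a for w, a in zip(row[1:], inputs))) for row in theta]

# Hand-picked coefficients (illustrative, not learned).
THETA_1 = [[-30.0, 20.0, 20.0],    # hidden unit 1 ~ x1 AND x2
           [10.0, -20.0, -20.0]]   # hidden unit 2 ~ (NOT x1) AND (NOT x2)
THETA_2 = [[-10.0, 20.0, 20.0]]    # output ~ unit 1 OR unit 2

def h(x1, x2):
    """Network hypothesis: two inputs, one hidden layer, one output."""
    hidden = layer(THETA_1, [x1, x2])
    return layer(THETA_2, hidden)[0]

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(h(x1, x2)))
```

The hidden layer builds intermediate features (AND, NOR) that the output unit combines, which is what lets the network draw a non-linear decision boundary.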

Questions?

Sources

• [6, 10]: Hortonworks, Operations Management with HDP

• [8, 11, 12] : http://www.slideshare.net/1Strategy/2016-utah-cloud-summit-big-data-architectural-patterns-and-best-practices-on-aws

Big Data: application areas

Datalake: collect and store data

Objectives:

• Collect data as soon as it is produced (in real time)
• Keep all of the data, with no loss of information
• Allow later use for new purposes and/or through new technologies

Implementation:

• Collection and cleansing of data via Flume, Storm, Spark, Logstash, Kafka, Kinesis, etc.
• Storage of denormalized data in Cassandra, HDFS, HBase, Hive, AWS S3, Redshift

Need identified at: EasyBourse, L’Etudiant…

Technologies shown: AWS S3, Hadoop, Cassandra, Redshift, Hive, HBase, Kafka



Lambda architecture – Speed layer: process data immediately and query it in real time

Objectives:

• Collect data as soon as it is produced (in real time)
• Process data on the fly
• Allow processed data to be used and queried immediately in real-time query tools

Implementation:

• Collection, cleansing and processing of data via Flume, Storm, Spark, Logstash, Kafka, Kinesis, etc.
• Storage of processed data in Cassandra, Redshift, ElasticSearch

Need identified at: EasyBourse, L’Etudiant…

Technologies shown: Spark, Flume, Storm, ElasticSearch, Cassandra, Redshift, Kinesis
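Processing data on the fly and making it queryable at once can be sketched with a per-window running average. This is a single-process illustration, not code for any of the tools listed: the `TumblingWindowAverage` class, the 60-second window and the sensor readings are assumptions made up for the example.

```python
from collections import defaultdict

class TumblingWindowAverage:
    """Keeps a running average per (sensor, time window), queryable at any moment."""
    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def process(self, sensor, timestamp, value):
        """Called for each event as it arrives: update the aggregate in place."""
        key = (sensor, timestamp // self.window_seconds)
        self.sums[key] += value
        self.counts[key] += 1

    def query(self, sensor, timestamp):
        """Return the current average for the window containing timestamp, or None."""
        key = (sensor, timestamp // self.window_seconds)
        count = self.counts.get(key, 0)
        return self.sums[key] / count if count else None

stream = TumblingWindowAverage()
for sensor, ts, value in [("s1", 0, 20.0), ("s1", 30, 22.0), ("s1", 65, 30.0)]:
    stream.process(sensor, ts, value)
print(stream.query("s1", 10), stream.query("s1", 70), stream.query("s2", 0))
```

Each event updates a small aggregate rather than being stored for a later batch job, so a query never has to rescan the raw data.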


DMP (Data Management Platform): qualify your audience

Objectives:

• Personalization of content and of the user experience

Implementation:

• TBC

http://www.journaldunet.com/ebusiness/expert/58869/la-data-management-platform--dmp----fonctionnalites-et-benefices-de-l-exploitation-des-donnees.shtml

Need identified at: L’Express, Kwanko, Le Parisien, 20 min, …



Machine Learning: a step towards AI

Objectives:

• Explore limited data sets to identify characteristics
• Classify data according to automatically detected features
• Automatically identify groups of similar data
• Make predictions based on existing data

Implementation:

• Exploration tools for data scientists: Jupyter, Zeppelin, Spark Notebook, RStudio
• A data pipeline: Kafka, YARN, scikit-learn, Spark ML, R, H2O, GraphLab,…

Need identified at: Fjord, Qivivo

Technologies shown: scikit-learn, Zeppelin, Jupyter, R, YARN, Kafka, Spark, H2O
