WWW.CRIM.CA

Scalable Density Clustering for Spark

THOMAS TRIPLET, PH.D., ENG.

MARCH 9TH 2016

BIG-DATA TECHNOLOGIES

• Hadoop Core
  – HDFS: distributed file system
  – YARN: CPU resource management and scheduling
  – MapReduce: large-scale batch data processing

• Hadoop ecosystem
  – NoSQL: HBase, Cassandra, Accumulo, etc.
  – SQL: Hive, Stinger (Hortonworks), Impala (Cloudera), Presto (FB), Tajo, Drill (MapR)
  – Data transfer: Sqoop, Flume
  – Compute/ML: Spark, Storm, Giraph, Mahout
  – Scripting: Pig, Cascading
  – Administration: Hue, ZooKeeper, Knox
  – Search: Solr, ElasticSearch

APACHE SPARK

• Popular distributed in-memory computing framework
• 10-100x faster than Hadoop MapReduce, with low latency
• Linear horizontal scalability
• Fault tolerant (RDDs)
• Applications range from long-running batch jobs to stream processing
• High-level Scala, Java, Python and R APIs

AGENDA

• Clustering algorithms (unsupervised learning)
  – Distance-based (k-means)
  – Density-based (DBSCAN)

• PatchWork
  – Algorithm
  – Results
  – Performance

• Conclusion

• Future Work

INTRODUCTION: MACHINE LEARNING

Supervised Learning

• Class labels are known and pre-defined

• Training and testing datasets are (manually) labeled with the same classes

• Goal is to learn a function/rule that can classify new data points

• Examples: SVMs, neural nets, Bayesian classifiers, decision trees…

Unsupervised Learning (clustering)

• Class labels of the data are unknown

• Group/cluster similar data points without prior knowledge

• Goal is to discover structure or patterns in the data

• Examples: k-means, EM, DBSCAN, HCA…

➔ PatchWork belongs to unsupervised learning (clustering)

INTRODUCTION: CLUSTERING

Distance-based

• Popular algorithm: k-means (implemented in MLlib)

• Relies on a distance function between data points

• Easy to implement

• Linear complexity (suitable for big data)

• Easy to distribute

• Discovers spherical clusters of similar sizes only

• Sensitive to noise and local optima

• Requires prior knowledge of k

Density-based

• Popular algorithm: DBSCAN (not in MLlib)

• Relies on the density of data points in feature space

• Natural protection against noise and outliers

• Discovers clusters of arbitrary shape and size

• No prior knowledge of k

• Discovers clusters of similar densities only

• Quadratic complexity: not scalable
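
To see where the quadratic cost of DBSCAN comes from, here is a minimal sketch of its neighbourhood (region) query without any spatial index. The function names, the radius `eps` and the sample points are illustrative assumptions, not taken from the slides. Each query scans every point, so issuing one query per point costs on the order of n² distance computations.

```python
import math

def region_query(points, center, eps):
    """Return the indices of all points within distance eps of center (linear scan)."""
    return [i for i, p in enumerate(points) if math.dist(p, center) <= eps]

def neighbour_counts(points, eps):
    """One region query per point: n scans over n points, i.e. O(n^2) overall."""
    return [len(region_query(points, p, eps)) for p in points]

# Hypothetical sample: a dense group of three points and one isolated outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
counts = neighbour_counts(points, eps=0.5)
```

The outlier ends up with a single neighbour (itself), which is how density-based methods tell noise from cluster members. The price is the full scan behind every query.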

➔ PatchWork is a density-based clustering algorithm

PATCHWORK ALGORITHM

2 main steps:

1. createCells( dataPoints ) → cells → RDD[(string, int)]

2. createClusters( cells ) → clusters

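STEP 1: CELL CREATION

[Figure: data points overlaid on a grid. Each point is first mapped to a pair (cell ID ; 1), for example ( -1,2 ; 1 ), ( -2,2 ; 1 ), ( -1,2 ; 1 ), ( 3,4 ; 1 ). Pairs with the same cell ID are then summed, giving one (cell ID ; point count) pair per non-empty cell:]

( -1,2 ; 4 )
( -1,3 ; 4 )
( -2,2 ; 4 )
( -3,4 ; 1 )
( 2,3 ; 4 )
( 2,4 ; 3 )
( 3,3 ; 3 )
( 3,4 ; 3 )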

val setOfCells = dataPoints.map(P => (cellID(P), 1)).reduceByKey(_ + _)
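
The map and reduce-by-key pattern above can be tried locally without a Spark cluster. The following is a plain-Python sketch under stated assumptions: the slides do not say how `cellID` is computed, so the hypothetical `cell_id` helper simply floors each coordinate divided by an assumed fixed `cell_size`. It also uses a tuple as the key, where the slides suggest a string key. The sample points are made up for illustration.

```python
import math
from collections import Counter

def cell_id(point, cell_size):
    """Assumed cell ID: floor of each coordinate divided by the cell size."""
    return tuple(math.floor(x / cell_size) for x in point)

def create_cells(data_points, cell_size):
    """Local equivalent of map(P => (cellID(P), 1)) followed by reduceByKey(_ + _)."""
    pairs = ((cell_id(p, cell_size), 1) for p in data_points)  # map step
    cells = Counter()
    for key, one in pairs:  # reduce-by-key step: sum the ones per cell ID
        cells[key] += one
    return dict(cells)

# Hypothetical points, chosen only for illustration.
data_points = [(-0.5, 2.1), (-0.2, 2.9), (-1.5, 2.5), (3.3, 4.8), (-0.9, 2.4)]
cells = create_cells(data_points, cell_size=1.0)
```

Because the result holds one entry per non-empty cell rather than one per point, its size depends on how the data is spread over the grid, not on the number of points.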

STEP 2: CLUSTER CREATION
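
The slides give only the title for this step, so the following is a sketch of one common approach in grid-based density clustering, not necessarily the PatchWork method itself. It assumes that a cell is "dense" when it holds at least `min_points` points (a hypothetical parameter) and that dense cells touching each other, diagonals included, belong to the same cluster. The cell counts loosely follow the step 1 example.

```python
from collections import deque
from itertools import product

def create_clusters(cells, min_points):
    """Group adjacent dense cells (count >= min_points) using breadth-first search."""
    dense = {cid for cid, count in cells.items() if count >= min_points}
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        cluster, queue = [], deque([start])
        while queue:
            cell = queue.popleft()
            cluster.append(cell)
            # Visit the cells that touch this one, diagonals included.
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbour = tuple(c + o for c, o in zip(cell, offset))
                if neighbour in dense and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(sorted(cluster))
    return clusters

# Illustrative cell counts (cell ID -> number of points).
cells = {(-1, 2): 4, (-1, 3): 4, (-2, 2): 4, (-3, 4): 1,
         (2, 3): 4, (2, 4): 3, (3, 3): 3, (3, 4): 3}
clusters = create_clusters(cells, min_points=3)
```

With these assumptions the sparse cell (-3, 4) is treated as noise, and the remaining cells fall into two groups. This step works on cells rather than on individual points, so its cost depends on the number of non-empty cells.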

EXPERIMENTAL SETUP

• 6 servers, each with:
  – Intel Xeon E5-2650, 8 cores @ 2.6 GHz
  – 192 GB memory
  – 30 TB storage

• Cloudera CDH 5.4.0

• Apache Spark 1.3

DATASETS

[Figure: the four benchmark datasets: Aggregation, Compound, Jain, Spiral]

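RESULTS (JAIN DATASET)

[Figure: clusters found by k-means, DBSCAN and PatchWork on the Jain dataset]

RESULTS (SPIRAL DATASET)

[Figure: clusters found by k-means, DBSCAN and PatchWork on the Spiral dataset]

RESULTS (AGGREGATION DATASET)

[Figure: clusters found by k-means, DBSCAN and PatchWork on the Aggregation dataset]

RESULTS (COMPOUND DATASET)

[Figure: clusters found by k-means, DBSCAN and PatchWork on the Compound dataset]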

PERFORMANCE

[Figure: performance of DBSCAN, PatchWork and MLlib k-means||; x-axis: millions of data points]

PERFORMANCE: SCALABILITY

[Figure: MLlib k-means|| versus PatchWork; x-axis: number of servers (1 to 5)]

CONCLUSION

FUTURE WORK

• Tests against new clustering algorithms available in Spark 1.6

• Better distribution of step 2

• Indexing for region query using R-trees

• Streaming version

Contact: thomas.triplet@crim.ca

Availability: https://github.com/crim-ca/patchwork (MIT Licence)

Reference: Frank Gouineau, Tom Landry, Thomas Triplet (2016) PatchWork, a Scalable Density-Grid Clustering Algorithm. In Proc. 31st ACM Symposium on Applied Computing, Data-Mining track

CRIM is an applied research centre in IT that develops, in collaboration with its clients and partners, innovative technologies and leading-edge know-how, and transfers them to Quebec companies and organizations to make them more productive and more competitive locally and globally. CRIM has four world-class IT research teams. It works mainly in the areas of human-system interactions and interfaces, advanced analytics, and advanced architectures and technologies for development and testing. Holder of an ISO 9001:2008 certification, CRIM aligns its work with the policies and strategies led by the ministère de l'Économie, de l'Innovation et des Exportations (MEIE), its main funding partner.
