Author
others
View
3
Download
0
Embed Size (px)
W W W . C R I M . C A
Principal partenaire financier
Scalable Density Clustering for Spark
THOMAS TRIPLET, PH.D., ENG.
MARCH 9TH 2016
2
TECHNOLOGIES BIG-DATA
• Hadoop Core – HDFS: Système de fichiers distribué – YARN: Gestion des ressources CPU et planification – MapReduce: Traitement en lot (batch) des données à grande échelle
• Écosystème Hadoop – NoSQL: HBase, Cassandra, Accumulo, etc… – SQL: Hive, Stinger (Hortonworks), Impala (Cloudera), Presto (FB), Tajo, Drill (MapR) – Transfert: Sqoop, Flume – Calcul/ML: Spark, Storm, Giraph, Mahout – Scripts: Pig, Cascading – Administration: Hue, ZooKeeper, Knox – Recherche: Solr, ElasticSearch
3
APACHE
• Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear horizontal scalability • Fault tolerant (RDDs) • Applications range from long-running batch jobs to stream processing • High-level Scala, Java, Python and R APIs
4
AGENDA
• Clustering algorithms (unsupervised learning) – Distance-based (k-means) – Density-based (DBSCAN)
• PatchWork – Algorithm – Results – Performance
• Conclusion
• Future Work
5
• Class labels are known and pre-defined
• Training and testing datasets are (manually) labeled with same classes
• Goal is to learn function/rule that can classify new data points
• Examples: SVMs, Neural nets, Bayesian classifiers, Decision trees…
• Class labels of the data are unknown
• Group/cluster similar data points without prior knowledge
• Goal is to discover structure or pattern in the data
• Examples: k-means, EM, DBScan, HCA…
INTRODUCTION: MACHINE LEARNING
Supervised Learning Unsupervised Learning (clustering)
5
• Class labels are known and pre-defined
• Training and testing datasets are (manually) labeled with same classes
• Goal is to learn function/rule that can classify new data points
• Examples: SVMs, Neural nets, Bayesian classifiers, Decision trees…
• Class labels of the data are unknown
• Group/cluster similar data points without prior knowledge
• Goal is to discover structure or pattern in the data
• Examples: k-means, EM, DBScan, HCA…
INTRODUCTION: MACHINE LEARNING
Supervised Learning Unsupervised Learning (clustering)
PatchWork ➔
6
• Popular algorithm: k-means (implemented in MLLib)
• Relies on distance function between data points
• Easy to implement
• Linear complexity (big-data)
• Easy to distribute
• Discovers spherical clusters of similar sizes only
• Sensitive to noise and local optima
• Prior knowledge of k.
• Popular algorithm: DBScan(not in MLLib)
• Relies on the density of data points in feature space
• Natural protection against noise and outliers
• Discovers clusters of arbitrary shape and size
• No prior knowledge of k
• Discovers clusters of similar densities only
• Quadratic complexity: not scalable
INTRODUCTION: CLUSTERING
Distance-based Density-based
6
• Popular algorithm: k-means (implemented in MLLib)
• Relies on distance function between data points
• Easy to implement
• Linear complexity (big-data)
• Easy to distribute
• Discovers spherical clusters of similar sizes only
• Sensitive to noise and local optima
• Prior knowledge of k.
• Popular algorithm: DBScan(not in MLLib)
• Relies on the density of data points in feature space
• Natural protection against noise and outliers
• Discovers clusters of arbitrary shape and size
• No prior knowledge of k
• Discovers clusters of similar densities only
• Quadratic complexity: not scalable
INTRODUCTION: CLUSTERING
Distance-based Density-based
PatchWork ➔
7
PATCHWORK ALGORITHM
2 main steps:
1. createCells( dataPoints ) à cells à RDD[(string, int)]
2. createClusters( cells) à clusters
8
STEP 1: CELL CREATION
9
STEP 1: CELL CREATION
10
STEP 1: CELL CREATION
4( -1,2 ; )
4( -1,3 ; )
4( -2,2 ; )
1( -3,4 ; )
4( 2,3 ; )
3( 2,4 ; )
3( 3,3 ; )
3( 3,4 ; )
11
STEP 1: CELL CREATION
1( -1,2 ; )
1( -2,2 ; )
1( -1,2 ; )
1( -1,2 ; )
.
.
.
1( 3,4 ; )
1( 3,4 ; )
1( 3,4 ; )
)
)
)
)
)
)
)
)
4( -1,2 ;
4( -1,3 ;
4( -2,2 ;
1( -3,4 ;
4( 2,3 ;
3( 2,4 ;
3( 3,3 ;
3( 3,4 ;
setOfCells = dataPoints.map(Pà(cellID(P),1)) .reduceByKey(_ + _)
12
STEP 2: CLUSTER CREATION
•
13
EXPERIMENTAL SETUP
• 6 servers, each with: – Intel Xeon E5-2650 8 cores @2.6GHz – 192GB memory – 30TB storage
• Cloudera CDH 5.4.0 • Apache Spark 1.3
14
DATASETSAggregation Compound
Jain Spiral
15
RESULTS (JAIN DATASET)K-means
DBScan PatchWork
16
RESULTS (SPIRAL DATASET)K-means
DBScan PatchWork
17
RESULTS (AGGREGATION DATASET)K-means
DBScan PatchWork
18
RESULTS (COMPOUND DATASET)K-means
DBScan PatchWork
19
PERFORMANCER
unni
ng T
ime
(sec
onds
)
1
10
100
1,000
10,000
100,000
Millions of data points10,000.0 100,000.0 1,000,000.0 10,000,000.0 100,000,000.0 1,000,000,000.0 10,000,000,000.0
DBSCAN PatchWork MLLib k-means ||
20
PERFORMANCE: SCALABILITYN
orm
aliz
ed e
xecu
tion-
time
0
0.25
0.5
0.75
1
Number of servers1 2 3 4 5
MLLib k-means|| PatchWork
21
CONCLUSION
22
FUTURE WORK
• Tests against new clustering algorithms available in Spark 1.6
• Better distribution of step 2
• Indexing for region query using R-trees
• Streaming version
Q & A
Contact: [email protected]
Availability: https://github.com/crim-ca/patchwork (MIT Licence)
Reference: Frank Gouineau, Tom Landry, Thomas Triplet (2016) PatchWork, a Scalable Density-Grid Clustering Algorithm. In Proc. 31th ACM Symposium On Applied Computing, Data-Mining track
WWW.CRIM.CA
Suivez-nous Dialoguez avec nous
Suivez-nous #CRIM_ca wwwCRIMca
Tous droits réservés © 2016 CRIM. 405, avenue Ogilvy, bureau 101, Montréal (Québec) H3N 1M3/514 840-1234/1 877 840-2746
Thomas Triplet, Ph.D., Eng. [email protected]
Principal partenaire financierLe CRIM est un centre de recherche appliquée en TI qui développe, en mode collaboratif avec ses clients et partenaires, des technologies innovatrices et du savoir-faire de pointe, et les transfère aux entreprises et aux organismes québécois afin de les rendre plus productifs et plus compétitifs localement et mondialement. Le CRIM dispose de quatre équipes de recherche en TI de calibre mondial. Le CRIM œuvre principalement dans les domaines des interactions et interfaces personne-système, de l’analytique avancée et des architectures et technologies avancées de développement et tests. Détenteur d’une certification ISO 9001:2008, son action s’inscrit dans les politiques et stratégies pilotées par le ministère de l'Économie, de l'Innovation et des Exportations (MEIE), son principal partenaire financier.