26
WWW.CRIM.CA Principal partenaire financier Scalable Density Clustering for Spark THOMAS TRIPLET, PH.D., ENG. MARCH 9 TH 2016

Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

W W W . C R I M . C A

Principal partenaire financier

Scalable Density Clustering for Spark

THOMAS TRIPLET, PH.D., ENG.

MARCH 9TH 2016

Page 2: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

2

TECHNOLOGIES BIG-DATA

• Hadoop Core – HDFS: Système de fichiers distribué – YARN: Gestion des ressources CPU et planification – MapReduce: Traitement en lot (batch) des données à grande échelle

• Écosystème Hadoop – NoSQL: HBase, Cassandra, Accumulo, etc… – SQL: Hive, Stinger (Hortonworks), Impala (Cloudera), Presto (FB), Tajo, Drill (MapR) – Transfert: Sqoop, Flume – Calcul/ML: Spark, Storm, Giraph, Mahout – Scripts: Pig, Cascading – Administration: Hue, ZooKeeper, Knox – Recherche: Solr, ElasticSearch

Page 3: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

3

APACHE

• Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear horizontal scalability • Fault tolerant (RDDs) • Applications range from long-running batch jobs to stream processing • High-level Scala, Java, Python and R APIs

Page 4: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

4

AGENDA

• Clustering algorithms (unsupervised learning) – Distance-based (k-means) – Density-based (DBSCAN)

• PatchWork – Algorithm – Results – Performance

• Conclusion

• Future Work

Page 5: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

5

• Class labels are known and pre-defined

• Training and testing datasets are (manually) labeled with same classes

• Goal is to learn function/rule that can classify new data points

• Examples: SVMs, Neural nets, Bayesian classifiers, Decision trees…

• Class labels of the data are unknown

• Group/cluster similar data points without prior knowledge

• Goal is to discover structure or pattern in the data

• Examples: k-means, EM, DBScan, HCA…

INTRODUCTION: MACHINE LEARNING

Supervised Learning Unsupervised Learning (clustering)

Page 6: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

5

• Class labels are known and pre-defined

• Training and testing datasets are (manually) labeled with same classes

• Goal is to learn function/rule that can classify new data points

• Examples: SVMs, Neural nets, Bayesian classifiers, Decision trees…

• Class labels of the data are unknown

• Group/cluster similar data points without prior knowledge

• Goal is to discover structure or pattern in the data

• Examples: k-means, EM, DBScan, HCA…

INTRODUCTION: MACHINE LEARNING

Supervised Learning Unsupervised Learning (clustering)

PatchWork ➔

Page 7: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

6

• Popular algorithm: k-means (implemented in MLLib)

• Relies on distance function between data points

• Easy to implement

• Linear complexity (big-data)

• Easy to distribute

• Discovers spherical clusters of similar sizes only

• Sensitive to noise and local optima

• Prior knowledge of k.

• Popular algorithm: DBScan(not in MLLib)

• Relies on the density of data points in feature space

• Natural protection against noise and outliers

• Discovers clusters of arbitrary shape and size

• No prior knowledge of k

• Discovers clusters of similar densities only

• Quadratic complexity: not scalable

INTRODUCTION: CLUSTERING

Distance-based Density-based

Page 8: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

6

• Popular algorithm: k-means (implemented in MLLib)

• Relies on distance function between data points

• Easy to implement

• Linear complexity (big-data)

• Easy to distribute

• Discovers spherical clusters of similar sizes only

• Sensitive to noise and local optima

• Prior knowledge of k.

• Popular algorithm: DBScan(not in MLLib)

• Relies on the density of data points in feature space

• Natural protection against noise and outliers

• Discovers clusters of arbitrary shape and size

• No prior knowledge of k

• Discovers clusters of similar densities only

• Quadratic complexity: not scalable

INTRODUCTION: CLUSTERING

Distance-based Density-based

PatchWork ➔

Page 9: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

7

PATCHWORK ALGORITHM

2 main steps:

1. createCells( dataPoints ) à cells à RDD[(string, int)]

2. createClusters( cells) à clusters

Page 10: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

8

STEP 1: CELL CREATION

Page 11: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

9

STEP 1: CELL CREATION

Page 12: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

10

STEP 1: CELL CREATION

4( -1,2 ; )

4( -1,3 ; )

4( -2,2 ; )

1( -3,4 ; )

4( 2,3 ; )

3( 2,4 ; )

3( 3,3 ; )

3( 3,4 ; )

Page 13: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

11

STEP 1: CELL CREATION

1( -1,2 ; )

1( -2,2 ; )

1( -1,2 ; )

1( -1,2 ; )

.

.

.

1( 3,4 ; )

1( 3,4 ; )

1( 3,4 ; )

)

)

)

)

)

)

)

)

4( -1,2 ;

4( -1,3 ;

4( -2,2 ;

1( -3,4 ;

4( 2,3 ;

3( 2,4 ;

3( 3,3 ;

3( 3,4 ;

setOfCells = dataPoints.map(Pà(cellID(P),1)) .reduceByKey(_ + _)

Page 14: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

12

STEP 2: CLUSTER CREATION

Page 15: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

13

EXPERIMENTAL SETUP

• 6 servers, each with: – Intel Xeon E5-2650 8 cores @2.6GHz – 192GB memory – 30TB storage

• Cloudera CDH 5.4.0 • Apache Spark 1.3

Page 16: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

14

DATASETSAggregation Compound

Jain Spiral

Page 17: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

15

RESULTS (JAIN DATASET)K-means

DBScan PatchWork

Page 18: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

16

RESULTS (SPIRAL DATASET)K-means

DBScan PatchWork

Page 19: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

17

RESULTS (AGGREGATION DATASET)K-means

DBScan PatchWork

Page 20: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

18

RESULTS (COMPOUND DATASET)K-means

DBScan PatchWork

Page 21: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

19

PERFORMANCER

unni

ng T

ime

(sec

onds

)

1

10

100

1,000

10,000

100,000

Millions of data points10,000.0 100,000.0 1,000,000.0 10,000,000.0 100,000,000.0 1,000,000,000.0 10,000,000,000.0

DBSCAN PatchWork MLLib k-means ||

Page 22: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

20

PERFORMANCE: SCALABILITYN

orm

aliz

ed e

xecu

tion-

time

0

0.25

0.5

0.75

1

Number of servers1 2 3 4 5

MLLib k-means|| PatchWork

Page 23: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

21

CONCLUSION

Page 24: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

22

FUTURE WORK

• Tests against new clustering algorithms available in Spark 1.6

• Better distribution of step 2

• Indexing for region query using R-trees

• Streaming version

Page 25: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

Q & A

Contact: [email protected]

Availability: https://github.com/crim-ca/patchwork (MIT Licence)

Reference: Frank Gouineau, Tom Landry, Thomas Triplet (2016) PatchWork, a Scalable Density-Grid Clustering Algorithm. In Proc. 31th ACM Symposium On Applied Computing, Data-Mining track

Page 26: Scalable Density Clustering for Spark - crim.ca · 3 APACHE • Popular distributed in-memory computing framework • 10-100x faster than Hadoop MapReduce and low latency • Linear

WWW.CRIM.CA

Suivez-nous Dialoguez avec nous

Suivez-nous #CRIM_ca wwwCRIMca

Tous droits réservés © 2016 CRIM. 405, avenue Ogilvy, bureau 101, Montréal (Québec) H3N 1M3/514 840-1234/1 877 840-2746

Thomas Triplet, Ph.D., Eng. [email protected]

Principal partenaire financierLe CRIM est un centre de recherche appliquée en TI qui développe, en mode collaboratif avec ses clients et partenaires, des technologies innovatrices et du savoir-faire de pointe, et les transfère aux entreprises et aux organismes québécois afin de les rendre plus productifs et plus compétitifs localement et mondialement. Le CRIM dispose de quatre équipes de recherche en TI de calibre mondial. Le CRIM œuvre principalement dans les domaines des interactions et interfaces personne-système, de l’analytique avancée et des architectures et technologies avancées de développement et tests. Détenteur d’une certification ISO 9001:2008, son action s’inscrit dans les politiques et stratégies pilotées par le ministère de l'Économie, de l'Innovation et des Exportations (MEIE), son principal partenaire financier.