MINISTERE DE L’ENSEIGNEMENT SUPERIEUR ET DE LA … · UNIVERSITE FARHAT ABBAS ... v. TABLE OF CONTENTS 5.4.4.1 Scalability results for the smaller size problem set CSP.. 76 5.4.4.2

MINISTERE DE L’ENSEIGNEMENT SUPERIEUR ET DE LA RECHERCHE

SCIENTIFIQUE

UNIVERSITE FARHAT ABBAS – SETIF-1

(UFAS-1). (ALGERIE)

THESE

Présenté à la Faculté des Sciences

Département d’Informatique

Pour l’Obtention du Diplôme de

DOCTORAT en SCIENCES

Ecole Doctorale Sciences et Technologies de l’Information et de la Communication

Option : Ingénierie des systèmes informatiques

Par

M. : Lyazid TOUMI

Thème

Optimisation des performances par administration et

tuning d’entrepôt de données

Soutenu le : devant la commission d’examen :

Pr. TOUAHRIA Mohamed Prof. à l’UFA- Setif-1 Président

Pr. MOUSSAOUI Abdelouahab Prof. à l’UFA- Setif-1 Rapporteur

Pr. ATHMANI Baghdad Prof. à l’Univ- Oran-1 Examinateur

Pr. HIDOUSSI Khaled-Walid Prof. à ESI -Alger Examinateur

Dr. BOUKHALFA Kamel MC-A à l’USTHB –Alger Examinateur

Dr. ALTI Adel MC-A à l’UFA Setif-1 Examinateur

DEDICATION

First of all, I would like to express my deep gratitude to my advisor, Professor Abdelouhab

Moussaoui for giving me the opportunity to work on the challenging data warehouse

physical design project and for his continuous encouragement and support since 2005, from

my first graduation project to this Ph.D. thesis.

I would like to thank my co-advisor, Professor Ahmet Ugur for all of his help on this

work and during my visit to Central Michigan University, Mt. Pleasant, U.S.A.

I would like to thank and present my grateful respect to the president and all members

of the thesis committee for carefully reviewing the dissertation.

Finally, I would like to express my gratitude to my parents for their unlimited patience,

prayer and encouragement. I would also like to thank my wife for the help and support

she has provided during my doctorate period (I would not miss mentioning my daughter

Maria-Lyna and my sisters for their psychological support). I would like thank all friends

that have contributed to preparation of this thesis.

Nothing is done without all resources provided by our beloved country, Algeria.

Lyazid Toumi

July 1, 2015

i

TABLE OF CONTENTS

Page

List of Tables ix

List of Figures xi

1 Introduction 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Thesis goals and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Academic publication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

I Backgrounds and state of art 7

2 Data warehouses architecture and design 92.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Data warehouses architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3 The Data Cube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4 Data warehouse physical design . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.5 Data warehouse tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.6 DBMS tuning tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Indexes selection problem in data warehouses 173.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2 Indexation techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2.1 B-tree index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2.2 Projection index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2.3 Hash index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

iii

TABLE OF CONTENTS

3.2.4 Bitmap Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2.5 Join index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.2.6 Star join index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2.7 Bitmap join index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3 Bitmap join indexes selection problem in data warehouses . . . . . . . . . . 22

3.3.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.3.2 Cost models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.3.2.1 Data access cost in presence of useful SBJIs (IndexCost) . 24

3.3.2.2 Data access cost in absence of useful SBJIs(JoinCost) . . . 24

3.3.2.3 Size of SBJI . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.4 Background of indexes selection problem . . . . . . . . . . . . . . . . . . . . . 25

3.4.1 Frank et al. approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.4.2 Choenni’s approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.4.3 Gundem’s approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.4.4 Golfarelli’s approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.4.5 Data mining based approach . . . . . . . . . . . . . . . . . . . . . . . . 29

3.4.5.1 Mining closed frequent pattern . . . . . . . . . . . . . . . . 29

3.4.5.2 General schema of the data mining approach . . . . . . . . 30

3.4.6 Extended data mining based approach . . . . . . . . . . . . . . . . . . 31

3.4.7 Genetic algorithm based approach . . . . . . . . . . . . . . . . . . . . 32

3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4 Horizontal partitioning in data warehouses 374.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.2 Horizontal partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.3 Horizontal partitioning example . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.4 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.5 Horizontal partitioning advantages . . . . . . . . . . . . . . . . . . . . . . . . 40

4.5.1 Horizontal Partitioning for databases administration . . . . . . . . . 40

4.5.2 Partitioning for performance optimization . . . . . . . . . . . . . . . 40

4.5.3 Partitioning for the availability . . . . . . . . . . . . . . . . . . . . . . 40

4.6 Horizontal partitioning modes . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.6.1 Range partitioning mode . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.6.2 Hach partitioning mode . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.6.3 List partitioning mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

iv

TABLE OF CONTENTS

4.6.4 Composite partitioning mode . . . . . . . . . . . . . . . . . . . . . . . . 44

4.6.5 Multicolumn partitioning mode . . . . . . . . . . . . . . . . . . . . . . 46

4.6.6 Reference partitioning mode . . . . . . . . . . . . . . . . . . . . . . . . 47

4.6.7 Virtual column partitioning . . . . . . . . . . . . . . . . . . . . . . . . 48

4.7 Horizontal partitioning problem in data warehouses . . . . . . . . . . . . . . 49

4.7.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.7.2 Cost model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.8 Background of approaches for the horizontal partitioning Problem . . . . . 50

4.8.1 Workload-based approach . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.8.2 Attribute affinity based approach . . . . . . . . . . . . . . . . . . . . . 51

4.8.3 Cost based approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.8.4 Data mining based approach . . . . . . . . . . . . . . . . . . . . . . . . 52

4.8.5 Constrained cost based approach . . . . . . . . . . . . . . . . . . . . . 53

4.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

II Contributions 57

5 Particle swarm optimization for solving SBJISP 595.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

5.2 Particle Swarm Optimization (PSO) . . . . . . . . . . . . . . . . . . . . . . . . 59

5.2.1 The PSO Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.2.2 Binary particle swarm optimization (BPSO) . . . . . . . . . . . . . . . 61

5.2.3 Inertia weight considerations for BPSO . . . . . . . . . . . . . . . . . 62

5.3 BPSO for the SBJI selection problem . . . . . . . . . . . . . . . . . . . . . . . 62

5.3.1 Query workload parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.3.2 Solution coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.3.3 Fitness function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.4.1 Problem instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.4.2 Preview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.4.3 Performance Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5.4.3.1 The smaller size problem set CSP results . . . . . . . . . . 69

5.4.3.2 The moderate size problem set CMP results . . . . . . . . 72

5.4.3.3 The larger size problem set CLP results . . . . . . . . . . . 74

5.4.4 Performance Scalability Study . . . . . . . . . . . . . . . . . . . . . . . 76

v

TABLE OF CONTENTS

5.4.4.1 Scalability results for the smaller size problem set CSP. . 76

5.4.4.2 Scalability results for the moderate size problem set CMP. 79

5.4.4.3 Scalability results for the larger size problem set CLP. . . 82

5.4.5 Performance results for the classes CSP, CMP and CLP using Oracle

DBMS cost models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

6 Mixed-Integer linear programming for SBJISP 916.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.2 Linear programming definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.2.1 Linear program solution time . . . . . . . . . . . . . . . . . . . . . . . 92

6.2.2 Mixed Integer Programming . . . . . . . . . . . . . . . . . . . . . . . . 92

6.3 Mathematical formulation of SBJI selection problem . . . . . . . . . . . . . . 92

6.3.1 SBJISP constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

6.3.2 SBJISP objective Function . . . . . . . . . . . . . . . . . . . . . . . . . 95

6.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95






6.4.3 Performance Scalability Study . . . . . . . . . . . . . . . . . . . . . . . 101

6.4.3.1 Scalability results for the smaller size problem set CSP. . 101

6.4.3.2 Scalability results for the moderate size problem set CMP. 102

6.4.3.3 Scalability results for the larger size problem set CLP. . . 105

6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

7 Efficient methodology for Reference Horizontal partitioning in data ware-houses 1097.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

7.2 The proposed methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

7.2.1 Predicates attraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

7.2.2 Clustering of Predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

7.2.3 Determining the number of clusters . . . . . . . . . . . . . . . . . . . 113

7.2.3.1 Solution coding . . . . . . . . . . . . . . . . . . . . . . . . . . 113

vi

TABLE OF CONTENTS

7.2.4 Discrete particle swarm optimization for selecting horizontal parti-

tioning schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

7.2.4.1 Fitness function . . . . . . . . . . . . . . . . . . . . . . . . . . 115

7.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115


7.3.2 Preview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116





7.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

8 Conclusions 1258.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

8.2 Future research directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

Bibliography 127

vii

LIST OF TABLES

TABLE Page

2.1 The main differences between OLTP and OLAP system design . . . . . . . . . . 11

2.2 Comparison between different design to implement OLAP system . . . . . . . . 13

3.1 Summary of work done on index selection . . . . . . . . . . . . . . . . . . . . . . . 34

4.1 Summary of work done on partitioning selection . . . . . . . . . . . . . . . . . . . 54

5.1 Optimized minimum support values on the three problem classes for different

storage sizes. The fact table size is 24,786,000 tuples. . . . . . . . . . . . . . . . . 67

5.2 Optimized minimum support values on the three problem classes for different

storage and fact table sizes. The fact table sizes are in millions. . . . . . . . . . 67

5.3 Querying performance results for the smaller size problem set CSP. . . . . . . . 71

5.4 Optimization rates for the CSP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.5 Querying performance results for the moderate size problem set CMP. . . . . . 73

5.6 Optimization rates for the CMP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.7 Querying performance results for the larger size problem set CLP. . . . . . . . 75

5.8 Optimization rates for the CLP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.9 Performance results for the smallest size problem set CSP in scalability. . . . . 78

5.10 Optimization rates for the CSP in scalability. . . . . . . . . . . . . . . . . . . . . . 79

5.11 Performance results for the moderate size problem set CMP in scalability. . . . 81

5.12 Optimization rates for the CMP in scalability. . . . . . . . . . . . . . . . . . . . . 82

5.13 Performance results for the largest size problem set CLP in scalability. . . . . . 84

5.14 Optimization rates for the CLP in scalability. . . . . . . . . . . . . . . . . . . . . . 85

5.15 Querying performance results for the CSP, CMP and CLP using Oracle DBMS

’Explain Plan’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

6.1 Model notations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93


ix

LIST OF TABLES



6.5 Performance results for the smallest size problem set CSP in scalability. . . . . 103

6.6 Performance results for the smallest size problem set CMP in scalability. . . . . 104

6.7 Performance results for the smallest size problem set CLP in scalability. . . . . 106




7.4 Optimization rates for the CMP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121



x

LIST OF FIGURES

FIGURE Page

2.1 Data warehouses building process. . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2 Building of the Data Cube. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.1 Example of B-Tree index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.2 Example of projection index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.3 Example of hash index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.4 Example of bitmap index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.5 Example of join index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.8 Example of OLAP query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.6 Statement used to build BJI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.7 Example of bitmap join index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.9 Chaudhuri approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.10 Golfereli approach. [37] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.11 Transnational database example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.12 Aouich’s approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.1 Horizontal partitioning example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.2 Range Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.3 Hach Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.4 List mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.5 Composite partitioning mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.6 Example of composite partitioning mode . . . . . . . . . . . . . . . . . . . . . . . 45

4.7 Example of reference partitioning mode . . . . . . . . . . . . . . . . . . . . . . . . 47

5.1 BPSO based approach for SBJISP . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.2 Example of Attributes-Queries matrix AUM . . . . . . . . . . . . . . . . . . . . . 64

5.3 Solution coding example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

xi

LIST OF FIGURES

5.4 The star schema for the APB-I data warehouse (primary and foreign keys are

underlined, foreign keys are recognized with # symbol) . . . . . . . . . . . . . . . 66

5.5 OLAP query (Query 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.6 OLAP query with a SBJI hint statement (Query 2). . . . . . . . . . . . . . . . . . 86

5.7 OLAP query with ’Explain Plan’ statement (Query 3). . . . . . . . . . . . . . . . 86

6.1 Algorithm to generate the bitmap table λ. . . . . . . . . . . . . . . . . . . . . . . 95

6.2 An example of bitmap table λ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

7.1 EMeD-Part process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

7.2 Example of predicates usage matrix PUM . . . . . . . . . . . . . . . . . . . . . . . 111

7.3 Example of attraction matrix (AM) computed using Jaccard Index . . . . . . . . 111

7.4 Result of the hierarchical clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 112

7.5 Solution coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

xii

CH

AP

TE

R

1INTRODUCTION

“We must beat the iron while it is hot, but we may polish it at leisure.”

-John Dryden (1631-1700)

1.1 Introduction

The relational databases become the norm for querying and manipulating the structured

data since the relational model developed by E.F. CODD in 1970 at IBM. Data warehousing

has appeared in IT industry since mid 90’s. Researchers have been trying to find new ways

and improve the existing solutions for the design and development of data warehouses per-

sistently. But unlike traditional transactional systems used for running the daily business

tasks, data warehouses need more maintenance and support. The intensity of the work

on a data warehouses depends largely on how it is managed. Building data warehousing

projects in the reality is very expensive in general. Therefore, a data warehouse should

function effectively. The users queries executed on the data warehouse environment should

be written carefully and efficiently by considering the size of data warehouses tables. Some

tables in a data warehouse can be very large and queries executed on these tables could

take days or sometime weeks to obtain the results.

Nowadays, data warehouses with a Petabyte size become ordinary. This awesome

growing in size increases the data warehouse administration cost. The performance and

maintenance of a data warehouse depends clearly on the physical design. The data ware-

house physical design is a hard problem, and still being investigated by researchers.

1

CHAPTER 1. INTRODUCTION

Optimization structures in data warehouses environment aims at minimizing the query

execution cost. Several optimization structures exist since the transactional database, like

indexes, materialized views, partitioning and parallel processing. These structures are

classified in two main categories, redundant and non-redundant structures based on the

storage space utilized by the underlying structure.

The data warehouse tuning is a primordial task in the data warehouse life cycle (physi-

cal and logical design). For minimizing the queries cost, the data warehouse administrator

(DWA) should furnish necessaries recommendations to choose adequate optimization struc-

tures. Data warehouses undergo change daily, that make the choice decided by DWA hard

when he chooses appropriate optimization structures to be created.

Several tasks should be done by DWA in the data warehouse physical design:

1. DWA should select the adequate optimization(s) structure(s) using his/her prior

knowledge on the data warehouse.

2. After choosing the appropriate structure(s), the DWA should ask questions about the

selected structures. For example, “which columns to be chosen to create indexes?”,

“Which view to be materialized?”, “How many partitions to be created from tables?”

3. All previous questions above are well-known as NP-hard problems in the data

warehouse physical design. For solving these problems the DWA should use existing

approaches or should introduce a new approaches for solving these problems. The

approaches used by DWA should be efficient and consistent with the data warehouse

nature (if you deploy a wrong solutions, you will provoke the catastrophe and the

execution cost can take days or weeks to reach the results).

4. Finally the DWA should deploy the results obtained by the approaches utilized.

1.2 Thesis goals and results

This thesis makes several contributions to solve the Bitmap join indexes selection problem

(BJISP) and the horizontal partitioning problem (HPP) in data warehouses.

1. First motivation of this work is to develop a new approach to solve the BJISP which

is more effective. The proposed binary particle swarm optimization outperforms the

best well-known approaches in several aspects (convergence, accuracy and time).

2. The linear programming has proved their efficiency to solve a large class of optimiza-

tion problems. The second motivation of this dissertation is to use mathematical

2

1.3. THESIS OUTLINE

modelling for the BJISP in Linear programming context, and solving the proposed

mathematical model using CPLEX solver. The proposed linear programming model

surprisingly outperforms the well-known approaches in all aspects (time and accu-

racy).

3. We conclude our work by proposing a new methodology to partition data warehouse

efficiently, mainly using a relatively large size of predicate set as constraints. The

motivation is inspired from nature, in particular, utilizing the attraction between

individuals. We have introduced a new methodology based on attraction, data mining

and meta-heuristics to solve the HPP and compared against the best well-known

proposed approach in the literature. The proposed approach outperforms the best

well-known proposed approach in several aspects (convergence, accuracy and time).

1.3 Thesis outline

The outline of this dissertation is as follows: Chapter 2 provides the basics and the back-

ground of data warehouses. It presents a short review about the data warehouse objectives

and requirements, and clarifies the difference between classical databases environment

and data warehouse. The well-known systems used to implement the multidimensional

data, like the Relational Online Analytical Process (ROLAP) and multidimensional online

analytical process (MOLAP) are presented. The remaining part of the chapter discusses the

data warehouse physical design and tuning. It briefly discusses the optimization structures

and the tuning tools found in commercial DBMSs.

In Chapter 3, indexation techniques used in both classical databases and data ware-

houses are reviewed. The background work on indexes selection problem are discussed

in detail, the remaining part of this chapter introduces the bitmap join indexes selection

problem in detail. Chapter 4 first discusses the well-known horizontal partitioning model

present in commercial DBMSs and then presents a comparative study of the current

approaches to solve the horizontal partitioning problem in both transactional and data

warehouse environments.

Chapter 5 is devoted to our first contribution. First, we present the basics of the

particle swarm optimization and an adaptation of this meta-heuristic to solve the bitmap

join indexes selection problem in data warehouse (BJISP). We have also improved the

performance of the genetic algorithm proposed by BOUCHAKRI AND BELLATRECHE

for solving the BJISP [16]. The rest of this chapter presents the experimental results

using three classes of problem sets (smaller, moderate, and larger in size) to prove the

3

CHAPTER 1. INTRODUCTION

effectiveness of our approach against the improved genetic algorithm, the genetic algorithm

and the data mining based approaches. The scalable tests have also been performed to

observe the behavior of the proposed approach using a larger size fact table.

Chapter 6 presents a mixed-integer linear programming model of the BJISP where

each parameter is discussed in detail (i.e., a discussion of decision variables, constants,

constraints and objective function). The rest of this chapter presents extensive set of tests

performed to prove the efficiency of this approach against the well-known approaches

presented in Chapter 5.

Chapter 7 presents a new methodology based on statistics, data mining and meta-

heuristic for data warehouse partitioning. First, we find the attraction between the predi-

cates using a well-known statistical method. Then a clustering of the attraction between

predicates using a data mining algorithm is applied to reduce the complexity of the problem.

Finally, a meta-heuristic method is used to select the optimal partitioning schema. A set of

experimentations have been performed to compare the proposed approach against the best

well-known approach, the genetic algorithm based approach, to prove the effectiveness of

the proposed approach.

Chapter 8 presents a conclusion of the thesis including suggestions on how to extend

the proposed work for future research.

1.4 Academic publication

1. Journal: Lyazid TOUMI, Abdelouahab Moussaoui, Ahmet UGUR, Particle swarm

optimization for bitmap join indexes selection problem in data warehouses, J Super-

comput (2014) 68:672-708, DOI 10.1007/s11227-013-1058-9 [72]

2. Edited proceeding: Lyazid TOUMI, Abdelouahab Moussaoui, Ahmet UGUR, A

linear programming approach for indexes selection in data warehouses, J Procedia

Computer Science, Elsevier (2015). [73]

3. Conference: Lyazid TOUMI, Abdelouahab Moussaoui, Ahmet UGUR, A linear

programming approach for bitmap join indexes selection in data warehouses, The 6th

International Conference on Ambient Systems, Networks and Technologies June 2-5,

2015, London, United Kingdom. [73]

4. Conference: Lyazid TOUMI, Abdelouahab Moussaoui, Ahmet UGUR, EMeD-Part:

Efficient Methodology for Horizontal Partitioning in Data Warehouses, International

4

1.4. ACADEMIC PUBLICATION

conference on Intelligent Information Processing, Security and Advanced Communi-

cation, November 23-25, 2015, Batna, Algeria, Edited by ACM. (Accepted paper)

5

Part I

Backgrounds and state of art

7

CH

AP

TE

R

2DATA WAREHOUSES ARCHITECTURE AND DESIGN

“Man’s mind, once stretched by a new idea, never regains its original dimensions.”

-Oliver Wendell Holmes (1809-1894)

2.1 Introduction

Data warehouses manipulate huge amount of data where the size increase significantly in

case of scientific databases. Data warehouses are designed with star schema or snowflake

schema [49]. Queries executed on a data warehouse schema can take enormous amount

of time due to join operation between one or several dimension tables and the fact table.

In commercial DBMSs, several optimization structures are proposed to minimize the

query execution cost. These structures are divided on two main categories: redundant and

non-redundant structures [8].

Indexes, materialized views and vertical partitioning are redundant structures. These

structures are characterized by a high storage and maintenance cost. Horizontal par-

titioning, referential horizontal partitioning and parallel treatment are non-redundant

structures. These structures avoid the duplication of data and permit reduction of the

storage cost and minimize enormously the maintenance cost. The present work is in the

area of the data warehouse design for which essentials will be presented next.

9

CHAPTER 2. DATA WAREHOUSES ARCHITECTURE AND DESIGN

2.2 Data warehouses architecture

Bill Inmon, known as the father of data warehouses, say: "A data warehouse is a subject-

oriented, integrated, time-variant, non-volatile collection of data in support of management’s

decision-making process."[40]

• Subject oriented: The idea is to build a data warehouses that help to analyze

data. For example, to analyze the calls of a Telecom company, you can build a data

warehouse that focuses on calls. Using the data warehouse, you can answer question

like "What is the best time used by young customers to make calls ?" This ability

to define a data warehouse by a subject matter, in this case "calls", makes the data

warehouse subject oriented.

• Integrated: Subject orientation of a data warehouse needs the integration of data.

Focus on subject requires that data warehouse stores data from different sources

into a persistent format. The process of integration avoids the problems like naming

conflicts and inconsistencies.

• Time Variant: In business world, huge amount of data need to be analyzed to project

trends, this is very much in contrast to operational systems, where performance

requirements demand that historical data be moved to an archive. A data warehouse’s

focus on change over time is what is meant by the term time variant.

• Nonvolatile: The data should not be changed after being stored in a data warehouse.

The objective is the ability to analyze what has occurred.

Data warehouses use dimensional design model as opposed to relational model used in

the operational databases. The dimensional design is not useful for operational systems

due to redundancy and loss of referential integrity of the data. OLAP (On Line Analytical

Processing) uses queries on data warehouses and OLTP (Online transaction processing)

uses queries on the operational databases. Table. 2.1 summarizes the main difference

between operational databases and data warehouses.

10

2.3. THE DATA CUBE

Operational databases Data Warehouse(OLTP System) (OLAP System)

Objective application-oriented subject-oriented

Source of data Operational data Consolidation data

Purpose of data Fundamental business tasks Decision support

What the data Snapshot of ongoing data Multi-dimensional views

Inserts and Updates Short and fast inserts and updates Periodic long-running refresh

Queries Relatively standardized queries Ad-hoc complex queries

Processing Speed Typically very fast Depends on the amount of data

Database Design Highly normalized Typically de-normalized

Tables Many tables Fewer tables

Table 2.1: The main differences between OLTP and OLAP system design

The process to build a data warehouse is presented in the Figure. 2.1.

Data smart

Data smart

Data warehouseOLAP

Data smart

Databases

Extraction Integration Analysis

Reporting

Data warehouse Admin

Files

Reports

Statistics analysis

ETL

Data mining

Figure 2.1: Data warehouses building process.

2.3 The Data Cube

Online Analytical Process (OLAP) is supported by a multidimensional data model called

the data cube (DC). The DC is a data abstraction that provides an aggregated view of

data from some perspective. A DC is composed of dimensions and measures. Dimensions

11


(a) MOLAP

Product key Customer Key Location Key Measure attribute(s)1 1 1 141 2 1 91 3 1 971 4 1 453 1 1 6114 1 1 72

(b) ROLAP

Figure 2.2: Building of the Data Cube.

are represented by attributes, each attribute is either a feature or measure. The feature

attribute represents an entity, like Customer, Time and Product. The measure attribute is

an aggregate of feature attributes.

For a n-dimensional space {A1, A2, ..., An}, generate 2n views (cuboids or group-bys)

.Each view represents a combinations of feature attributes, and can be seen as aggregation

of measure attributes. The DC has full and partial descriptions. The full description

contains all the possible views (2n), the partial description contains only a subset of views.

The aggregations can be pre-computed and materialized to improve the query cost. If the

data is materialized as a multidimensional array, the obtained result is a MOLAP design.

The MOLAP performance can decline, when the space is sparse (high dimensionality/high

cardinality). The relational OLAP, or ROLAP, stores aggregation (view) as an additional

table. Fig. 2.2 illustrates the building of the cuboid of a three dimensional cube (Product,

Customer, Location). Fig. 2.2.(a) presents the MOLAP model, while Fig. 2.2.(b) shows the

ROLAP case. HOLAP (Hybrid Online Analytical Processing) is a combination of ROLAP

and MOLAP which are other possible implementations of OLAP.

12

2.4. DATA WAREHOUSE PHYSICAL DESIGN

ROLAP MOLAP HOLAP

Data Storage Relational database

Multidimensional

database

(Hypercube) Relational database

Aggregation storage Relational database

Multidimensional

database

(Hypercube)

Multidimensional

database

(Hypercube)

Database structure

Star, snowflake

and fact

constellation

schemes Ad-hoc

Between

ROLAP/MOLAP

Performance Low performance High performance Moderate performance

Table 2.2: Comparison between different design to implement OLAP system

2.4 Data warehouse physical design

McFadden and Hoffer [53] state that “Physical database design is concerned with trans-

forming the logical database structures into an internal model consisting of stored records,

files, indexes and other physical structures.”. In other words, physical database design is

the process to make description of the implementation of database on a secondary storage

medium. It describes the base relations, file organizations, and indexes used to achieve

efficient access to the data, and any associated integrity constraints and algorithms to

select optimization structures [23]. During the physical database design, a design decision

directly affects the speed of the database. Nowadays, the queries are more complex and

involved in huge amount of data.

During the physical database design, the decision affect directly the speed of the

database. In the first generation of databases the physical database design has not the

same impotence like today. Nowadays, the queries are more complex and manage a huge

amount of data. The database physical design is the process to make description of the

implementation of the database on secondary storage; it describes the base relations, file

organizations, and indexes used to achieve efficient access to the data, and any associated

integrity constraints and algorithm to select optimization structures [23]. During the

physical database design, a design decision directly affects the speed of the database.

Nowadays, the queries are more complex and involved in huge amount of data.

Several optimization physical design techniques are proposed and supported by the

most commercial DBMS. Bellatreche et al. classify the optimization physical design struc-

13


tures in two main categories: redundant and non-redundant techniques [8]. The redundant

structures need an extra storage space and a high maintenance cost, this category regroup

the materialized view [7, 38, 80], indexes [39, 42, 64], vertical partitioning [3, 33, 56]

and buffer management [26, 57, 74]. The non-redundant structures do not need an extra

storage space and with moderate maintenance cost, this category regroups horizontal

partitioning [3, 9, 20, 51], parallel processing [36, 69], Queries scheduling [4, 50].

2.5 Data warehouse tuning

Tuning is important process in DBMSs. The objective of tuning is providing independence

via physical data [66]. The solution for getting a best optimization structure is to improve

the performance by avoiding surcharges in system resources. The first (lowest) level of data

warehouse performance tuning is at the hardware level [66]. In the hardware level, the

surcharge problem can be solved by increasing resources; for example, using a fast CPU,

increasing the memory capacity, increasing the number of disks, using RAID (Redundant

Arrays of Independent Disks) technology, using a high level of parallelism. The second

administration level is situated in the DBMS (software). The DBMS performance can be

improved by increasing the buffer size and modifying the restoration points. The third

(highest) level of data warehouse performance tuning is at the logical level (schemas). The

queries can be written carefully for better consideration by the DBMS optimizer. Views to

be materialized, indexes to be created and relations to be partitioned should be selected

carefully to obtain better execution cost for the query workload.

Performance tuning of Data warehouse has an important impact on the implemen-

tation of the business intelligence applications (BIAs). Poor performances of BIAs and

system crashes have a catastrophic effect on users‚Äô attitudes about the system. High

performance data warehouse servers supporting BIAs need system configuration features

that are well-tuned to the physical properties of the data warehouse and well-organized

for query and data management processing, and effective processing of client transactions.

Performance tuning is at all times a challenging job for data warehouse administrators

(DWA). The researchers have indicated that it is extremely hard for DWAs and other

IT operators to tune a complex system in gravity states [18, 35]. DWA is susceptible to

make errors when tuning a complex system. The automatic performance tuning of data

warehouse servers is considered as an important solution when minimizing the query

execution cost [78].

14

2.6. DBMS TUNING TOOLS

2.6 DBMS tuning tools

Work on the databases tools have started in early 80‚Äôs. Finkelstein et al. have proposed

an optimizer to evaluate indexes in 80‚Äôs [31]. Several contributions have been made

by commercial DBMS vendors, such as ORACLE [1], IBM [28] and MICROSOFT [82]

for designing database tuning tools. All of such tools use an optimizer for evaluating the

proposed physical design.

We start by the Database Tuning Advisor (DTA) introduced by MICROSOFT. This

tool is the result of the AutoAdmin project1. The DTA is used under SQLSERVER and

integrated several recommendations for the database physical design, such as indexes,

materialized view, horizontal and vertical partitioning [1]. The DTA uses the SQLSERVER

optimizer and DWA constraints for providing recommendations to the physical design.

Similar tool exists in IBM Database Software. IBM has introduced the DB2 Index

Advisor first. Then an extended version called DB2 Design Advisor [82] has introduced.

DB2 Design Advisor contains materialized views and horizontal partitioning. IBM has

later introduced powerful tool called Query Patroller2 for a better management of the

query workload. This tool works under the DB2 Design Advisor and classifies the queries

in two categories: priority and unstable queries [82].

Oracle has proposed the Oracle Tuning Advisor that integrated indexes, material-

ized views and horizontal partitioning. This tool recommends the adequate optimization

structure for minimizing the cost of query workload tracked by the Automatic Workload

Repository [28].

2.7 Conclusions

In this chapter, we have presented a preliminaries of the data warehousing and the

data warehousing physical design. We have started by presenting the objective and the

architecture of data warehouses, followed by presentation of the data cube largely used

in the data warehousing. Secondly, we have presented the physical design in the data

warehouse, where we have presented the optimization structures used to optimize queries.

Finally, we have proposed the tuning and the tuning tools used in the commercial DBMS.

1http://research.microsoft.com/dmx/AutoAdmin2http://www.ibm.com/software/data/db2/querypatroller/

15

CH

AP

TE

R

3INDEXES SELECTION PROBLEM IN DATA WAREHOUSES

“If you don’t find it in the index, look very carefully through the entire catalogue.”

-Sears, Roebuck, and Co., Consumer’s Guide, 1897

3.1 Introduction

The indexation concept has been widely used in dictionaries, encyclopedias, manuscripts,

catalogs and books. In computing, searching data stored in a file is similar to searching

information in a dictionary. Usually, the entire data need to be examined sequentially. If

an index set is used, the entire data do not need to be examined sequentially. Database

transactions are accelerated using appropriate indexation methods, These methods allow

faster access to data. At the end of 70’s, several methods have been used to index relational

and hierarchical database storage structures such as. sequential, sequential indexed, hash,

binary search trees and B-trees. These techniques provide large options for database

designers, but choosing the best indexation method is a complex decision for designers.

For other types of databases such as object-oriented, spatial and temporal, a variety of

indexation methods exist. The B-tree index is still the most commonly used method in

commercial relational DBMSs.

We present an overview of indexation methods used in both classical database and data

warehouse environment. We discuss the main types of indexation methods used in data

warehouses such as Bitmap Join indexes in more detail.

17

CHAPTER 3. INDEXES SELECTION PROBLEM IN DATA WAREHOUSES

3.2 Indexation techniques

3.2.1 B-tree index

B-tree index is the most supported indexing method in the commercial DBMS [47]. B-tree

index is organized like an upside-down tree. The bottom level of the index holds the actual

data values and pointers to the corresponding rows. The tree-based indexes have nearly

the same complexity for searching and updating data. This characteristic gives the B-tree

index more useful for the OLTP environment characterized by the same frequency of

both searching and updating operations [25]. However, this index is not useful for OLAP

environment characterized by a high frequency of searching operations against a low

frequency of update operations [22]. In the DW environment, B-tree indexes should be

used only for unique columns or for the columns with a very high cardinality. For example,

B-Tree index built on Gender column with a very low cardinality is not useful for OLAP

queries since there reduces a very few numbers of Input/Output. The B-tree index is built

on both single and multiple column(s). The Fig. 3.1 present an example of a B-Tree index

built on the column CATEGORY in the PRODUCT table.

Figure 3.1: Example of B-Tree index.

3.2.2 Projection index

Projection index created on attribute A in the table R, allows to store all values of A in a

sorted sequence in the same order as they appear in R [59]. Fig. 3.2 shows a projection

index built on the column CATEGORY in the table PRODUCT. In the DW environment,

the OLAP query retrieve a small of relation columns; so having a projection index built

on these columns reduces tremendously the query cost. SAP Sybase DBMS allows the

creation of the projection index so-called FastProjection Index.

18

3.2. INDEXATION TECHNIQUES

Projection index created on Category

Book Book

Journal Journal

Encyclopedia

Products Table

PID Name Category 101 JAVA Book

1O2 C++ Book 103 VLDB Journal

104 DOLAP Journal 106 DATA WAREHOUSE Encyclopedia

Figure 3.2: Example of projection index.

3.2.3 Hash index

The hash index is created using a hash function provided by the DBMS and returns the

physical location of the records from the primary key values. Fig. 3.3 shows an example of

an hash index built on the PID column in the table PRODUCT. The main limitation of the

hash index is a bad choice of the hash function because the choice influences the search in

great deal, especially when using a hash function that returns the same values for a large

number of keys.

Hash function storage location

Products table

PID Name Category

101 JAVA Book 192 C++ Book

113 VLDB Journal 104 DOLAP Journal

154 DW & DM Book 126 KDD Book

117 ICDM Journal

192,126,117

154

101,113,104 Modulos 3 PID

Figure 3.3: Example of hash index.

3.2.4 Bitmap Index

Pure Bitmap Index was first proposed in the Model 204 DBMS [60]. The bitmap index [79]

is created on a table of bitmap vectors, each bit of this bitmap vector represents a distinct

value from the indexed column. The bit i in the bitmap vector represents the value v, the

bit is setted to 1, if the record i in the indexed table contains the value v. Fig. 3.4 shows an

example of the Pure Bitmap Index on the column CATEGORY in the table PRODUCT.

19


The Bitmap Index is a simple method for representing rowids. The Bitmap Index is

efficient for both storage and CPU usage, especially when the number of distinct values

in the indexed column is low. Using boolean operations such as OR, AND, and NOT in

the restriction predicates allow a better optimization for the complex queries. In a data

warehouse environment bitmap index can be most effective for the non-unique column,

because the B-tree indexes are most effective for high-cardinality columns, e.g. Name or

phoneNumber.

To answer a query in the presence of bitmap index, first useful bitmap vectors are

loaded in the memory. Then boolean operations will be performed on the loaded bitmap

vectors. However, the main problem with the bitmap indexes is the cardinality of the

columns, which bitmap index requires more space and query processing time for columns

with a high cardinality. Most DBMS such as Oracle, Sybase, Informix and Red Brick allow

the implementation of bitmap index.

Products Table

PID Name Category

101 JAVA Book

1O2 C++ Book

103 VLDB Journal

104 J. Supercomp Journal

106 DATA WAREHOUSE Encyclopedia

Bitmap index created on Category

Book Journal Encyclopedia

1 0 0

1 0 0

0 1 0

0 1 0

0 0 1

Figure 3.4: Example of bitmap index.

3.2.5 Join index

In the DW environment, the join operations have a high cost. VALDURIEZ et al. [76] have

introduced the Join Index that improves enormously the processing time of OLAP query

(see Fig. 3.5). The joint index is used to pre-calculate join operations. To execute query

using the join index, the DBMS uses the following steps [76]:

1. Read the Join Index JI

2. Perform Rn JI.

3. Internally sort the join index JIk on s

4. Perform Sn JI.

20

3.2. INDEXATION TECHNIQUES

Customers table

RIDC CID Name Genre Ville

1 223 Lyazid M Setif 2 152 Ahmet M Michigan

3 063 Abdel M Setif 4 051 Moncef M Michigan 5 121 Maria F Alger

Products table

RIDp PID Name Category 1 101 JAVA Book

2 1O2 C++ Book 3 103 VLDB Journal 4 104 DOLAP Journal

Join index

RIDA RIDp RIDC

1 1 1

2 1 2 3 1 4 4 3 1

Actvars table

RIDA AID CID PID Cost 1 1 223 101 1000

2 2 223 102 123 3 3 223 104 1233 4 4 063 101 2334

Figure 3.5: Example of join index.

The join index size dependants from the selectivity factor of the join operation; if the

selectivity factor is lower (near to 0) the join index size is smaller and if the selectivity

factor is higher (near to 1 implies the join became Cartesian product) the join index size is

larger.

3.2.6 Star join index

The Join Index is useful for OLTP environments, used to join two relations. In a data

warehouse environment, OLAP queries have a several joins between dimension table(s)

and fact table (at least one join). Redbrick et al. [70] have introduced a Join Index adapted

for DW designed with star schema, the index is so-called Star Join Index (SJI).

The SJI allows to join all dimension tables with the fact table (Complete SJI). We note

that the complete SJI is useful for all queries, but the complete SJI has voluminous size

and a higher maintenance cost. The SJI is not adapted for the others DW designs, like

Snowflake scheme.

3.2.7 Bitmap join index

By combining the Join Index (JI) and the Bitmap Index (BI), O’Neil introduce the Bitmap

Join Index (BJI) [58, 59] useful for optimizing the performance of OLAP queries. The

BJI allows to pre-calculate join operation between two or more tables. BJI can be built

using a single or multiple attributes. For each value in the attribute, the bitmap join index

stores the rowids of corresponding rows in one or more other tables. In a data warehouse

environment, the join condition is an equi-inner join between the primary key in the

dimension tables and the foreign key in the fact table.

21


SELECT SUM(dollarAmount)FROM sales, customerWHERE sales.cid = customer.cidAND customer.city = ’Setif’;

Figure 3.8: Example of OLAP query

CREATE BITMAP INDEX cust_sales_bji

ON sales(customers.city)

FROM sales, customers

WHERE sales.cid = customers.cid;

Figure 3.6: Statement used to build BJI

Fig. 3.7 presents a BJI built on the attribute CITY in table CUSTOMER and the

second table SALES. Fig. 3.6 illustrates an ORACLE statement used to create BJI on the

attribute CITY.

Customers Table

CID Name Gender City 223 Lyazid M Setif 152 Ahmet M Michigan

063 Moncef M Michigan

051 Abel M Setif

121 Maria F Alger

Actvars Table

AID CID TID PID Cost 1 223 106 101 1000

2 223 103 102 123 3 223 102 104 1233 4 063 106 101 2334

5 051 102 102 3433

6 152 102 101 4454 7 152 103 103 533

8 152 106 101 2332 9 121 106 101 332

10 121 103 102 2232

Bitmap join index created on City

RID Setif Michigan Alger 1 1 0 0

2 1 0 0 3 1 0 0

4 0 1 0

5 1 0 0 6 1 0 0

7 1 0 0 8 1 0 0 9 0 0 1

10 0 0 1

Figure 3.7: Example of bitmap join index.

Since the CUSTOMERS.CITY attribute is referenced in the clause ON of the index.

The queries that access to the table SALES joined with the table CUSTOMERS using the

attribute CITY, need to read the BJI instead to do join operation.

3.3 Bitmap join indexes selection problem in datawarehouses

Within the ISP, the bitmap join index selection problem (BJISP) is more difficult and

known to be NP-hard [5]. There are two variants of BJISP, the first one is based on one

22

3.3. BITMAP JOIN INDEXES SELECTION PROBLEM IN DATA WAREHOUSES

non-key attribute (BJIOSP) and the second one is based on multiple non-key attributes

(BJIMSP). The BJIOSP deals with 2n −1 possibilities to select best configuration and the

(BJIMSP eals with 22n−1 possibilities to select the best solution where n is the number

of non-key attributes. Clearly both are hard problems with the second one being more

complex than the first one. As an example, when n, the number of non-key attributes

equals to 20, there are more than 1 million (1,048,576 exactly) possibilities for BJIOSP and

more than 2 to the power of one million possibilities for BJIMSP, which is an extremely

large number.

3.3.1 Problem statement

The single bitmap join indexes selection problem (SBJISP) is formalized as follows [5, 13]:

• DW with a set of dimension tables D = {D1,D2, ...,Dm} and a fact table F.

• Query workload Q = {Q1,Q2, ...,Qr} defined on the DW schema.

• The set of candidate non-key dimension attributes A = {A1, A2, ..., Ak} extracted from

Q.

• The storage space constraint S.

The problem is to identify a configuration of indexes C = {SBJI1,SBJI2, ...,SBJIn}

defined on non-key attributes in A such that the global cost of the query workload

GlobalCost(Q,C) is minimized and the storage constraint S is satisfied.

3.3.2 Cost models

The size of candidate indexes set is important when the query workload is large. Creating

all of the candidate indexes is not feasible due to potential limitation on the indexes storage

size allowed by the DBMS. Several component cost models, which are mathematical in

nature, are used to estimate the number of input/output (I/O) operations needed for the

execution of queries in the query workload. The global execution cost GlobalCost(Q,C) of

queries in the query workload Q for the selected SBJIs configuration set C is computed

using as follow:

∑r∈Q

∑k∈C

IndexCost(Qr,SBJIk)+ ∑r∈Q

JoinCost(Qr,φr) (3.1)

23


3.3.2.1 Data access cost in presence of useful SBJIs (IndexCost)

CostIndx(Qr,SBJIk) is the execution cost of the query Qr using an index SBJIk built on

attribute Ak from the configuration C and zero otherwise, The function IndexCost(Qr,SBJIk)

is defined by Eq.(3.2).

CostIndex(Qr,SBJIk)= AccessCost+ReadCost (3.2)

The AccessCost is the cost for accessing to bitmap index using the B-tree and is defined

by Eq.(3.3).

AccessCost = |F|(1− e−R|F| ) (3.3)

where R = B|F||Ak|

is the number of read tuples for a given query using SBJIk, |F| is the

number of tuples in the fact table F and |Ak| is the cardinality of the domain of attribute

Ak (We assume that has the standard uniform distribution of data). B is the number of

bitmaps used to evaluate a given query, for example B=2 for the following clause: A=2 or

A=10, A in (5,10).

The ReadCost is the total number of the read tuples using B bitmaps and is defined

by Eq.(3.4).

ReadCost = logn |Ak|−1+ |Ak|n−1

+B|F|8P

(3.4)

where P is the size of disk pages measured in bytes and n is the order of the B-tree.

3.3.2.2 Data access cost in absence of useful SBJIs(JoinCost)

CostJoin(Qr,φr) is the execution cost of the query Qr in absence of useful SBJIs in the

configuration C. First, we identify dimensions tables that contain attributes in Qr without

SBJIs in C. All identified dimension tables are stored in the set φr. The joins operations

between dimension tables in φr and the fact table F are implemented using the hash-join

method. The number of I/O operations needed to join two tables T1 and T2 using hash-join

method is given by (see [55]):

3× (‖T1‖+‖T2‖) (3.5)

where ‖T‖ is a number of pages needed to store table T.

The order of joins is important when joining dimension tables in φr with the fact table

F. We have assumed that the join order is performed with the minimum selectivity method

[68].

24

3.4. BACKGROUND OF INDEXES SELECTION PROBLEM

3.3.2.3 Size of SBJI

The Size(SBJIk) is the size required to store a SBJIk, where the size depends on the

domain cardinality of Ak and the number of tuples in the fact table [5], and given by:

Size(SBJIk)= (|Ak|

8+16)|F| (3.6)

We note that the SBJI building time depends from two parameters: the cardinality of

attribute used to build the SBJI and the number of tuple in the fact table.

3.4 Background of indexes selection problem

The index selection problem (ISP) is the most crucial problem in the physical design [48].

The main objective here is to choose a subset of indexes to be created from available

attributes so that the cost for database query workload is minimized. Several approaches

have been proposed for solving the ISP in traditional or distributed databases [2, 21, 25, 48,

62]. The following subsection present a review of the proposed approaches in the literature

for solving the ISP.

3.4.1 Frank et al. approach

Frank et al have introduced an administration tool allows to select best configuration of

indexes in transactional database [32]. The proposed tool starts from an initial solution

and using a query workload, and after an interaction between this tool and the DBMS

optimizer provide the final configuration of indexes to be created. The gain of each index is

computed using the difference between the probable execution cost without index and the

probable execution cost using index.

The approach is summarized as follow:

1. Query from the query workload and a set of potential indexes are submitted to the

DBMS optimizer.

2. The gains obtained by using the set of submitted indexes are stored.

3. A new submission with a new set of potential indexes is submitted to the DBMS

optimizer, and the step 2 is iterated.

4. All best gains are added.

5. Finally all indexes with positive gains are proposed to the DBA.

25


Candidate indexes Selector

Indexes inumerator

Multiple attributes indexesGenerator

SQL SERVER DBMS

CostEvaluator

What-IFIndexesCreator

Query'sWorkload

FinalConfiguration

of indexes

Figure 3.9: Chaudhuri approach.

3.4.2 Choenni’s approach

Choenni et al. have proposed an approach based on two process, down-up and top-down

[24]. The down-up process starts from an initial configuration containing all the potential

indexes, after each iteration of the process a selected index or a set of indexes that increase

the the query workload cost are dropped using the DROP function. The top-down process

starts from an empty configuration, after each iteration a selected index minimizing the

query workload cost is added using the ADD function. The process stops when all indexes

are created or no cost reduction is possible

26


3.4.3 Gundem’s approach

Gundem has proposed an approach to select indexes based on the type of indexes [34].

Gundem indicate that for each attribute candidate to indexation, different type of indexes

can be used, but each index generates a different storage space and cost. The proposed

approach classifies the indexation attributes into equivalence classes, each class containing

the proposed indexes on the same attribute.

For lunching the selection process, the following features are needed:

1. A set of the potential indexes.

2. The available storage space.

3. For each attribute,choosing the possible multiple indexes.

4. The access frequency and the update frequency.

5. The error threshold tolerance.

This approach uses two main process, the global and the local optimization which will be

described below.

• The local optimization process: allows to select a set of indexes for each equiva-

lence class. Then, for each attribute, a cost function is used to evaluate the gain of

indexes to be created. Finally, a set of indexes I for each equivalence class is created.

• The global optimization process: allows to select the final configuration based

on the selected indexes for each equivalence class. The selection process is based on

the objective function computed by the difference between the cost before and after

creation of the set of indexes.

3.4.4 Golfarelli’s approach

Golfarelli et al. [37] have proposed an heuristic approach to select the best configuration of

indexes. The approach uses several features: system constraints, query workload, statistics

about tables and logical schema. The objective is to find an optimal physical scheme that

minimizing the query workload cost and respect the storage constraint. Golfarelli et al.

have introduced a Rule Based Optimizer (RBO) that generates the execution plane for

each query , and a cost model is used to compare different solutions.

27


Figure 3.10: Golfereli approach. [37]

The proposed approach needs the following component as input:

• logical scheme: contain the fact tables (base and aggregate) and dimension tables.

• Workload: extracted from the LOG file of the DBMS with the frequency of each

query.

• Data volume: contain the statistical information about the data warehouse tables,

attributes, attributes domain.

• System constraints: contain the space constraint authorized by DBMS to store

indexes and the size of the buffer used for the join operations.

• Candidate indexes contain the potential attributes to indexation

The processing component to accomplish the index selection are presented as follow:

• Aggregate navigator is used to select the best materialized views for solving query

using logical scheme and the query workload without access to the indexes.

• Indexable attribute selector is used to choose the useful attributes to indexation

from the dimension tables by analyzing the queries structures.

28


• Candidate indexes selector is used to choose the best index type for each index-

able attribute.

• Optimal indexes selector is used to choose the optimal indexes to be created from

the candidate indexes.

• Cost evaluator used to evaluate both the cost of the indexes and the cost of the

execution plan.

• Plan generator used to choose the best execution plane for executing a query

3.4.5 Data mining based approach

Aouiche et al. [5] proposed the first work dealing with Bitmap Join Indexes (BJI) in a

DW environment. The proposed approach is based on two steps. First the close algorithm

(mining closed frequent pattern) is used for pruning the research space constituted from

the potential candidate BJIs. In the second step, a greedy algorithm is used to select the

best configuration of BJIs to be created. The general scheme of the approach is presented

in the Fig. 3.12.

3.4.5.1 Mining closed frequent pattern

tid items

t1 a1,a3,a4

t2 a2,a3,a5

t3 a1,a2,a3,a5

t4 a2,a5

t5 a1,a2,a3,a5

t6 a2,a3,a5

Figure 3.11: Transnational database example

Let an items set I = {a1,a2, ...an}, and a transaction database DB = {t1, t2, ...tm} which each

Ti has a unique identifier (tid) and contains a set of items in I. The support (or occurrence

frequency) of a pattern X (support(X)), which is a set of items, is the number of transactions

containing X in DB.

support(X )= |ti ∈ DB, X ⊆ ti||ti ∈ DB| (3.7)

29


X, is a frequent pattern if X’s support is no less than a predefined minimum support

threshold, ξ.

Using the Eq. 3.7, the supports of each item in I (Fig. 3.11) are respectively 36 , 5

6 , 56 , 1

6

and 56 . if ξ is equals to 4

6 , the frequent patterns are a2,a3 and a5.

The closed pattern is a maximal set of patterns share a set of items. A pattern X ∈ I

such that support(x) ≤ ξ is called closed frequent pattern. For the example in the Table in

Fig. 3.11 the closed frequent pattern is {a2,a3,a5}

3.4.5.2 General schema of the data mining approach

The data mining based approach for BJI selection in DW is as follows:

1. Extraction the query workload from the DBMS LOG file.

2. Analyzing of query workload and extracting the candidate attributes for indexation.

3. Build the context used by the close algorithm.

4. Using the close Algorithm for mining the closed frequent patterns from the context

built in the previous step .

5. Using the greedy algorithm (see Algorithm 1) to select the final configuration of BJI

to be created

Algorithm 1 Greedy algorithm pseudo-code to select the final BJI configRequire: A, Q, S;

1: A = SORT(A); /* Candidate attributes A*/2: Config = BJImin;3: C ← C ∪ BJIA0

4: S ← S - Size(BJIA0)5: A ← A - A0 /* Remove A0 from A */6: while Size(C) ≤ S do7: for each attribute A i ∈ A do8: if Cost(Q,(C ∪ BJIA i )) < Cost(Q,C) and SIZE(C ∪ BJIA i )≥ S then9: C ← C ∪ BJIA i ;

10: A ← A - A i11: end if12: end for13: end while14: return C

30


Query's workload

A set of frequent patterns

Attribute usage matrice

Candidate indexes

Final configuration of indexes

Indexation Attributes extraction

Close algorithm

Indexes generation

Greedy Algorithm

Cost Model

Data Warehouse

Figure 3.12: Aouich’s approach.

3.4.6 Extended data mining based approach

Bellatreche et al. [13, 14] have proposed an extended version to the approach proposed by

Aouiche et al. .The proposed approach uses a cost model for evaluate the effectiveness of the

closed frequent pattern generated by both close and charm algorithms called DynaClose

and DynaCharm algorithms based approaches. The proposed approach uses a fitness

function to evaluate the effectiveness of the closed frequent pattern generated. For each

closed frequent pattern mi the fitness function is defined as follow:

Fitness(mi)= 1n× (

n∑j=1

α jsup j) (3.8)

31


n represents the number of non-key attributes A j in the closed motif mi. sup j represents

the support of A j, α j represent penalty parameter defined by α j =‖D j‖‖F‖ ,‖D j‖ and ‖F‖ are

respectively the number of page needed to store the dimension table ‖D j‖ and the fact

table ‖F‖. the lower bound value can be reached by the fitness function is called minfit is

computed as follow:

minf it = minsup‖F‖ ×

⌈d∑

j=1

‖D j‖d

⌉(3.9)

The rest of approach is similar to the Aouiche’s approach, the only difference is by replacing

the pruning algorithm Close by Dynaclose and Dynacharm.

3.4.7 Genetic algorithm based approach

Bouchakri et al. [15] proposes genetic algorithm based approach for solving the bitmap

join indexes selection problem. the authors coding the problem using a binary array, each

cell of the array represent a potential attribute, i.e. if the cell value is 1 the BJI on the

attribute represented by this cell is created and 0 otherwise. The general scheme of this

approach is presented in the Algorithm 2.

Algorithm 2 Genetic algorithm pseudo-code to select the final SBJI configurationRequire: A, Q, S;

1: coding the solution using the following binary array:A1 A2 A3 A4 A51 0 0 0 1

;

2: Generate random initial population ;3: Perform selection step;4: while t ≤ MAXITERATION do5: Perform crossover step;6: Perform mutation step;7: Perform selection step ;8: compute the best fitness9: t ← t+1;

10: end while11: return C

3.5 Summary

The indexation is an important method in the databases/data warehouses physical design.

The indexation is largely used to optimize OLAP queries in the data warehouse envi-

ronment. We have presented previously the well-known works to solve indexes selection

32

3.5. SUMMARY

problem on both databases and data warehouse. The proposed approaches start by common

step that is analyzing the query workload using automatically using a parser or manually

by the DBA. To select the optimal/sub-optimal configuration a greedy algorithm or a ge-

netic algorithm based on a mathematical cost model or on DBMS optimizer are used. To

reduce the size of the problem, Aouiche et al. [5] proposed a data mining technique(mining

frequent patterns) to prune the research space, this approach is improved by Bellatreche

et al [13, 14] to take in consideration the cost features. The Table. 3.1 summarize the

well-known approaches to solve the indexes selection problem and the bitmap join indexes

selection problem in both databases and data warehouses.

33


Tabl

e3.

1:Su

mm

ary

ofw

ork

done

onin

dex

sele

ctio

n

App

roac

hIn

dex

type

Heu

rist

icC

ost

mod

elA

ttri

bute

set

size

Que

ryw

orkl

oad

size

Fra

nket

al.(

1992

)[32

]P

rim

ary

Gre

edy

DB

MS

Opt

imiz

er8

10/2

0

Cho

enni

etal

.(19

93)[

24]

Pri

mar

y/Se

cond

ary

Gre

edy

Mat

hem

atic

aln/

an/

a

Gun

dem

etal

.(19

99)[

34]

Pri

mar

yG

reed

yM

athe

mat

ical

n/a

n/a

Gol

fare

lliet

al.(

2002

)[37

]P

rim

ary

Gre

edy

DB

MS

Opt

imiz

ern/

an/

a

Aou

iche

etal

.(20

05)[

5]Se

cond

ary

Gre

edy/

Dat

am

inin

gM

athe

mat

ical

n/a

61

Bel

latr

eche

etal

.(20

07)[

13]

Seco

ndar

yG

reed

y/D

ata

min

ing

Mat

hem

atic

aln/

a61

Bou

chak

riet

al.(

2010

)[15

]Se

cond

ary

Gen

etic

Mat

hem

atic

al18

70

34

3.6. CONCLUSIONS

3.6 Conclusions

In this chapter, we have presented the indexation techniques used in both classical

databases and data warehouses in detail (for each indexation type, we have presented

general information, the objectives and usage. We have also presented the background on

how to select indexes in classical databases (ISP). Finally, we have presented the bitmap

join indexes selection problem in data warehouses in detail and reviewed the existing

approaches to solve the problem.

35

CH

AP

TE

R

4HORIZONTAL PARTITIONING IN DATA WAREHOUSES

“Nothing is particularly hard if you divide it into small jobs.”

-Henry Ford (1863-1947)

4.1 Introduction

Horizontal partitioning (HP) is an important optimization structure in the databases

physical design. The HP has a direct impact on the performances of DBMS [3]. Horizontal

partitioning provides methods to split tables, views and index into partitions. This chapter

present a complete state-of-art of the portioning methods used in the commercial DBMS.

We present the proposed approach used to select the best partitioning schema into either

databases and data warehouse environment .

4.2 Horizontal partitioning

Horizontal partitioning of relation R is based on domains of its attributes. Each horizontal

partition has a subset from R that has the same property [30]. The horizontal partitioning

allows flexibility for database administrator and designer to manipulate smaller units

of data [40]. Horizontal partitioning is divided into two main categories: primary and

referential horizontal partitioning [11]

• Primary horizontal partitioning (PHP) of relation R is performed using a set of

restriction predicates defined on R. PHP helps to reduce the cost of queries on R

37

CHAPTER 4. HORIZONTAL PARTITIONING IN DATA WAREHOUSES

by reducing the access to impertinent data and allows parallel execution of queries

(high level of parallelism). For example, if query Q have restriction predicate in

the WHERE clause, DBMS optimizer prune impertinent partitions and load valid

partitions useful to execute Q. Formally, given a databases D with a set of relations

R = {R1, ...,Rn}, each relation Ri(A1, .., Am) contains a set of attributes A. Each

attribute A j,1≤ j ≤ M has a domain Dom(A j)= (d j1, ...,d jN j ). Partitions Ri1, ...,Rik

are horizontal partitions of Ri resulted by the PHP process using a set of restriction

predicates. The union operation are needed to rebuild a the relation R =⋃ni=1 Ri.

• Referential horizontal partitioning (RHP) is performed using a set of restriction

predicates defined on another relation [6]. RHP is a complex process compare to PHP.

Formally, suppose that R and S are two relations and R has a foreign key referencing

S, R has a foreign key to S. First, R is horizontally partitioned into a set of partitions

{R1, ...,Rk}(1 ≤ k ≤ M) using a set of restriction predicates. S is partitioned with

referential horizontal partitioning into a set of partitions {S1, ...,Sk},1≤ k ≤ M each

partition Sk is created by Sk = SnRk(1 ≤ k ≤ M). The main objective of RHP is to

minimize the cost of join operations.

4.3 Horizontal partitioning example

Given two relations Customer and Sale linked by Custlevel, tuples of both relations are

presented in the Fig. 4.1. First Customers relation are partitioned into three horizontal

partitions using PHP. Each of these partitions are obtained with the following restriction

predicates:

• Customers1 = σCity=′Seti f ′(Customer)

• Customers2 = σCity=′Be jaia′(Customer)

• Customers3 = σCity=′Al giers′(Customer)

Fig. 4.1.c presents a referential horizontal partitioning of SALE based on partitions

obtained after the PHP process on Customers relation. Each Sale partition obtained by a

semi-join between relation SALE and Customer partition and presented as follow:

• Sale1= Sales n Customer1

• Sale2 = Sales n Customer2

• Sale3= Sales n Customer3

38

4.4. COMPLEXITY

(a) Relations: Customer and Sale before partitioning

Customer

CID Name Gender City

223 Guessoum M Sétif

152 Semchedine M Sétif

063 Imlouli M Bejaia

051 Zebar M Sétif

121 Maîza F Algiers

Sale

AID CID TID PID Amount

1 223 106 101 1000

2 223 103 102 123

3 223 102 104 1233

4 063 106 101 2334

5 051 102 102 3433

6 152 102 101 4454

7 152 103 103 533

8 152 106 101 2332

9 121 106 101 332

10 121 103 102 2232

Customer1 = σ(Ville=’Sétif’) (Customer)

CID Nom Gender City

223 Guessoum M Sétif

152 Semchedine M Sétif

051 Zebar M Sétif

Sale1=Sale ⋉ Customer1


1 223 106 101 1000

2 223 103 102 123

3 223 102 104 1233

5 051 102 102 3433

6 152 102 101 4454

7 152 103 103 533

8 152 106 101 2332



4 063 106 101 2334

Customer2 = σ(Ville=’Béjaia’) (Customer)


063 Imlouli M Bejaia

Customer3 = σ(Ville=’Alger’) (Customer)


121 Maîza F Alger



9 121 106 101 332

10 121 103 102 2232

(b) Horizontal partitioning of Customer (c) Referential partitioning of Sale

Figure 4.1: Horizontal partitioning example.

4.4 Complexity

Bellatreche et al. studied the complexity of the horizontal partitioning problem in data

warehouse (HPPDW) proved that the HPPDW is NP-complete [10]. They first reduce the

39


HPPDW to the 3-Partitions problem, then decide whether a multi-set of integers can be

partitioned into triples that all have the same sum. The 3-Partitions problem is shown to

be NP-complete in the strong sense.

4.5 Horizontal partitioning advantages

Horizontal partitioning is an essential functionality for databases applications and used

for improving administrative costs, performance and availability

4.5.1 Horizontal Partitioning for databases administration

Horizontal partitioning allows partitioning of relation into smaller units that are easier

to manipulate. This possibility provides to the databases administrator "Divide and rule"

strategy in data management. With the horizontal partitioning, some partitions are kept

offline, whereas the other partition are kept online. For example, a DBA that loads the

daily Sales, he/she should use horizontal partitioning of Sales such that each partition

contains single day of Sales.

4.5.2 Partitioning for performance optimization

The problem encountered in the very large databases is the increase in the data volume,

which degrades the performance of the DBMS due to the amount of data examined in

each new upload. The horizontal partitioning can solve the problem by minimizing the

amount of data examined. Horizontal partitioning has several advantages for optimizing

the performance of DBMS. The first advantage is the partition pruning that allows to

select only the useful partition(s) to answer queries, e.g., the relation Sales is partitioned

into 48 partitions, each partition contains sales of each department (Wilaya). A query that

the Sales involved by people living in Setif needs access to the partition containing Sales

involved by Setifian citizen only. In this case, the DBMS can prune the irrelevant partitions

and choose only the pertinent partitions to answer query. Horizontal partitioning can

improve the performance of join operations by pre-calculating the join and partitioning the

voluminous join into smaller joins.

4.5.3 Partitioning for the availability

Partitioned relation ensures independence between partitions, i.e. if each partition is stored

into isolated Tablaspace. The breakdown of a Tablespace implies the breakdown of all

partitions stored in this Tablespace, but the other partitions remain online.

40

4.6. HORIZONTAL PARTITIONING MODES

4.6 Horizontal partitioning modes

4.6.1 Range partitioning mode

The range mode is the first partitioning mode integrated in ORACLE 8. This mode uses

the domain Dk of the attribute Ak used as partitioning key of R. Each range has lower and

upper bounds (see the example in Fig. 4.2 below) Example

Customer

Customer 1

Customer 2

0 18 19 45 46 100

Customer 3

Attribut F

Attribut F

Customer 3

Age <18

Age

Age ≥45

18≤ Age <45

Figure 4.2: Range Mode.

The Fig. 4.2 illustrates a range partitioning of the Customers on Age as partitioning

key. The following ORACLE statement allows rang partitioning of Customers:

CREATE TABLE Customers

(CID number(9), Name varchar(25), City varchar(25),

Gender char(1), Age number(3)

PARTITION BY RANGE(Gender)

(PARTITION C-Childs VALUES LESS THAN (18) TABLESPACE TBS-Childs,

PARTITION C-Adults VALUES LESS THAN (45) TABLESPACE TBS-Adults,

PARTITION C-Olds VALUES LESS THAN (MAXVALUE) TABLESPACE TBS-Olds) ;

• The PARTITION BY RANGE clause indicate that the range mode is used. Each

partition is named, e.g. C− Inf ants is the name of partition that contains persons

with Age < 18.

41


• The TABLESPACE clause allows to store the partition into a predefined physical

space.

Each inserted tuple in the relation R is automatically inserted into one of the three

partitions using the value of the Age column. For example, if you insert a tuple with

Age = 40, the DBMS starts by comparing the value of Age with the upper bound of the

partition containing smaller values, the system check then 40 > 18 and passes to the

following partition. Finally the DBMS checks that 40 < 45 and inserts the tuple to the

appropriate partition. This mode is useful for queries contains restriction predicate in the

form of range, for example:

SELECT Name FROM Customers

WHERE Age>45;

The DBMS load TBS−Olds for answering the query above.

4.6.2 Hach partitioning mode

Customer

F(CID)=a

CID Customer 1

Attribut F

Attribut F

Customer 2

F(CID)=b

F(CID)

Figure 4.3: Hach Mode.

This mode uses a hashing algorithm furnished by DBMS. The user should furnish the

partitioning key and the number of partitions needed. The hashing algorithm furnished by

system distributes tuples among partitions evenly yielding approximately the same size

partitions(see Fig. 4.4).

42


• Example: The following statement provides the partitioning of Customers table into

four partitions using CID attribute as partitioning key, each partition is stored in

separate TABLESPACE (TBS1, TBS2, TBS4 et TBS4).

CREATE TABLE CUSTOMER (CID number(9), Name varchar(25),

City varchar(25),Gender char(1), Age number(3))

PARTITION BY HASH (CID)

PARTITION 4 STORE IN (TBS1, TBS2, TBS4, TBS4) ;

The partitions names are automatically assigned by DBMS during the partitioning

process.

4.6.3 List partitioning mode

Algiers Sétif Béjaia

City

City=’Algiers’

Customer 2

Customer 1

Attribut F

Attribut F

City=’Sétif’

Customer 3

City=’Béjaia’ Customer

Figure 4.4: List mode.

List partitioning provides partitions specified by a list of discrete values for the partitioning

key. With list partitioning, you can group and organize unordered and unrelated sets of

data in a natural way.

• Example The following statement shows the partitioning of the relation Customer

into four partitions with the list mode using City attribute as partitioning key, the

fours partitions contain the Customers of Setif, Bejaia, Algiers and others cities

respectively. (see Fig. 4.4).

CREATE TABLE CUSTOMER (CID number(9), Name varchar(25), City varchar(25),

43


Gender char(1), Age number(3))

PARTITION BY LIST (City)

(PARTITION C-Setif VALUES (’Setif’),

PARTITION C-Bejaia VALUES (’Bejaia’),

PARTITION C-Algiers VALUES (’Algiers’),

PARTITION C-Otherwise VALUES (DEFAULT)) ;

4.6.4 Composite partitioning mode

Partition 3 Partition 2 Partition 1

Attribut A Attribut B

SPM 2

SPM 1

SPM 2 SPM 2

Partition 3-2 Partition 3-1 Partition 2-1 Partition 1-2 Partition 1-1 Partition 2-2

Second Level

First Level

Before Partitioning

Figure 4.5: Composite partitioning mode.

Composite partitioning mode (CPO) is a combination of two single partitioning modes

SPM1 and SPM2 (see Fig. 4.5). First the relation is partitioned with a SPM1 and then

each partition is further subdivided into sub-partitions using a SPM2. [61]. Several

composite partitioning modes are obtained by combining single partitioning modes.

44


Customer 3 Customer 2 Customer 1

Age City

List Mode

Range Mode

Customer 3-2 Customer 3-1 Customer 2-1 Customer 1-2 Customer 1-1 Customer 2-2

Customer

City=’Algiers’ City=’Sétif’ City=’Bejaia’

Range Mode Range Mode

Figure 4.6: Example of composite partitioning mode.

The relation Customer is partitioned first using Gender as partitioning key and then

each partition is subdivided into sub-partitions using Age as partitioning key (see Fig. 4.6)

using the following statement :

CREATE TABLE CUSTOMER

(CID number(9), Name varchar(25), City varchar(25),

Gender char(1), Age number(3)

PARTITION BY LIST (Gender)

SUBPARTITION BY RANGE (Age)

SUBPARTITION TEMPLATE

(SUBPARTITION C-Childs VALUES LESS THAN (16) TABLESPACE TBS-Childs,

SUBPARTITION C-Adults VALUES LESS THAN (MAXVALUE) TABLESPACE TBS-Adults))

45


(PARTITION C-Setif VALUES (’Setif’),

PARTITION C-Bejaia VALUES (’Bejaia’),

PARTITION C-Algiers VALUES (’Algiers’)

PARTITION C-Otherwise VALUES (DEFAULT));

4.6.5 Multicolumn partitioning mode

The multicolumn partitioning mode is applied with the range partitioning and the hash

partitioning modes, and at most 16 partitioning key columns can be used. In the multi-

column partitioning, the key used which is composed of several columns defines a higher

granularity than the preceding ones. The most common example is a decomposed DATE

column consisting of separate columns, year, month, and day. In DBMS, the nth partition-

ing key is investigated only when all previous (n−1) partitioning key values exactly match

the (n−1) bounds of a partition.

The following example illustrates the range partitioning of the relation Sales using two

key partitioning Year and Month:

CREATE TABLE sales (

Year NUMBER,

Month NUMBER,

Day NUMBER,

Amount NUMBER)

PARTITION BY RANGE (Year,Month)

(PARTITION before2014 VALUES LESS THAN (2014,1),

PARTITION q1_2014 VALUES LESS THAN (2014,4),




PARTITION future VALUES LESS THAN (MAXVALUE,0));

46


4.6.6 Reference partitioning mode

Customer1 Customer 2 Customer 3

CID City

List Mode

Customer

City=’Algiers’ City=’Sétif’ City=’Bejaia’

Ville=’Alger’

Et age≥18

Ville=’Sétif’

Et age<18

Ville=’Sétif’

Et age≥18

CID

Sale

Reference Mode

Sale1=Sale⋉Customer1 Sale2=Sale⋉Customer2 Sale3=Sale⋉Customer3

Figure 4.7: Example of reference partitioning mode.

Previously, we have presented the single and the composite partitioning used for single

relation partitioning. In this section, we present the reference partitioning mode integrated

in the Oracle 11g environment. The reference mode allows partitioning of two relations

R and S related to each other by referential constraints. The partitioning key is resolved

through an existing parent-child relationship, enforced by enabled and active primary key

47


and foreign key constraints [61].

First the relation R is partitioned using a single or composite partitioning mode. If the

mode applied to R is a single partitioning mode, the number of partition of R is the same

of S. In the case of the composite partitioning applied to R, the number of partition of R

equals to the number of sub-partitions of S.

• ExampleThe relation Customer is partitioned into three partitions Customer1, Customer2,

Customer3 (see Fig. 4.7) using List partitioning mode, and then three Sales parti-

tions are generated, where each partition is related with a Customer partition. The

following statement provides partitioning of the relation Sales into three partitions

using reference partitioning mode:

CREATE TABLE SALES

(CID number(9), Date DATE , Amount Number(10,2)

CONSTRAINT Customer_Cs FOREIGN KEY (CID) REFERENCES Customer(CID))

PARTITION BY REFERENCE(Customer_Cs);

4.6.7 Virtual column partitioning

This partitioning mode uses a virtual column as any regular column. All partitioning

modes are supported when using virtual columns, including range partitioning mode and

all different combinations of composite partitioning modes.

CREATE TABLE sales(

Pid NUMBER(6) NOT NULL

, Cid NUMBER NOT NULL

, Tid DATE NOT NULL

, CHid CHAR(1) NOT NULL

, PROMOid NUMBER(6) NOT NULL

, quantitySold NUMBER(3) NOT NULL

, amountSold NUMBER(10,2) NOT NULL

, totalAmount AS (quantitySold * amountSold)

)

PARTITION BY RANGE (Tid) INTERVAL (NUMTOYMINTERVAL(1,’MONTH’))

SUBPARTITION BY RANGE(totalAmount)

SUBPARTITION TEMPLATE

( SUBPARTITION Psmall VALUES LESS THAN (1000)

48

4.7. HORIZONTAL PARTITIONING PROBLEM IN DATA WAREHOUSES

, SUBPARTITION Pmedium VALUES LESS THAN (5000)

, SUBPARTITION Plarge VALUES LESS THAN (10000)

, SUBPARTITION Pextreme VALUES LESS THAN (MAXVALUE)

)

(PARTITION sales_before_2007 VALUES LESS THAN

(TO_DATE(’01-JAN-2007’,’dd-MON-yyyy’))

)

4.7 Horizontal partitioning problem in data warehouses

4.7.1 Problem statement

The data warehouse horizontal partitioning is formalized as follows [8]:

• DW with a set of dimension tables D = {D2,D1, ...,Ds} and a fact table F.

• Query workload Q = {Q1,Q2, ...,Qr} defined on the DW schema.

• Dimension predicates P = {Ps1,Ps

2, ...,Psk} from Q, where s is a dimension table in D.

• The maintenance constraint B fixed by administrator.

The problem is to identify partition schema PS that generates n sub-star schemes

{S1,S2, ...,Sn}. Each sub-star schema Si is defined by the join between the fact partition and

dimension partitions using hash join method [16]. The dimension partition dsi of the dimen-

sion table Ds is identified by the set of dimension predicates Psi = {Ps

i1,Psi2, ...,Ps

ik}. The fact

partition f i of the fact table F is identified by the set of fact predicates P fi = {P f

i1,P fi2, ...,P f

i j}

such that the global cost of the query workload GlobalCost(Q,PS) is minimized and the

maintenance constraint Bis satisfied.

4.7.2 Cost model

The global execution cost GlobalCost(Q,S) of queries in the query workload Q for the

partitioning scheme S is computed using as follow:

∑r∈Q

∑i∈S

V alid(Qr,SSk)×Cost(SSi) (4.1)

V alid(Qr,SSi) is binary function that return 1 if the sub-scheme SSi is needed to answer

the query Qr and 0 otherwise.

49


The number of disk page accesses (I/O) for loading fact partition P f i is:

| f i |∏j=1

Sel(P fi j)×‖F‖ (4.2)

Psi j is predicate used to specify the dimension partition ds

i belong to the dimension table

Ds. The Sel(P) are the number of pages needed to store table T and the selectivity of the

predicate P respectively.

The cost for loading the dimension partition Pdsi is:

|P si |∏

j=1Sel(Ps

i j)×‖Ds‖ (4.3)

where P fi j is predicate used to specify the fact partition f i, ‖T‖.

The number of disk page needed to join fact partition f i and dimension partition dsi

using hash-join method is given by (see [55]):

3× (‖ f i‖+‖dsi‖) (4.4)

Each sub-schema SSi is specified by a set of dimension partitions and fact partition. The

Cost(SSi) of the sub-schema SSi is computed by joining all dimension tables with the fact

table using the hash-join method. The order of joins is important when joining dimension

partitions with the fact partition in the sub-star schema. We have assumed that the join

order is performed with the minimum selectivity method [68].

4.8 Background of approaches for the horizontalpartitioning Problem

Several work have been proposed for solving HPP problem [9, 10, 12, 20, 43, 52, 56, 56, 62,

81] in the classical databases environment, parallel databases and data warehouses. These

work are classified into four main categories [10] which will be described subsequently.

4.8.1 Workload-based approach

Let a table T and a set of predicates P = {P1,P2, ...,Pn} defined on attribute in T. The

approach are summarized in the following steps [20, 62]:

1. Specification of a complete and minimal set of predicates using the COM-MIN

algorithm [62].

50

4.8. BACKGROUND OF APPROACHES FOR THE HORIZONTAL PARTITIONINGPROBLEM

2. Generation of a set M of minterm predicates M = {mi|mi = ∧(1≤k≤n) p∗

k,1 ≤ j ≤ 2n},

p∗k is p or p.

3. Simplification the set M by eliminating of non useful predicates.

4. Generation of horizontal partitions, each minterm mi generates a set of partitions

using the selection operation σm(R).

This approach is simple, but computationally expensive. For example, with n predicate

this approach generates 2n minterme predicates.

4.8.2 Attribute affinity based approach

The concept of affinity is used for the vertical partitioning approach [56], and Zhang et

al. and Karlapalem et al. have adapted the attribute based affinity for the horizontal

partitioning [43, 81]. The approach can be summarized in the following steps:

1. Enumerate the predicates set from the query workload.

2. Create predicate Usage Matrix PUM[m×n], m is the number of predicates and n is

the number of queries. each cell PUM[i,j] can take 1 if the predicate Pi is used by the

query q j and 0 otherwise.

3. Build Attribute Affinity Matrix AAM[m×m], m is the number of predicates. AAM

is a matrix in which for each pair of attributes, the sum total of frequencies of

transactions accessing that pair of attributes together is stored. Each cell AAM[i,j]

can take numeric or alphanumeric value (⇐,⇒,∗):

• numeric value which is the sum of the frequencies of queries that contain the

predicates pi and P j

• ⇐ value, if the predicate pi implies p j

• ⇒ value, if the predicate p j implies pi

• ∗ value, if pi and p j are defined on the same attribute or

4. Grouping the predicates using the graphical algorithm [56] for the vertical partition-

ing.

5. Generate the horizontal partitions using the groups obtained by the previous step.

51


4.8.3 Cost based approach

BELLATRECHE et al. [12] have proposed an approach based on a mathematical cost model.

The cost model is used to measure (Input/Output) necessary to answer query without using

the DBMS optimizer. The approach can be summarized in the following steps:

1. Extract the set of predicates from the query workload. from the set of predicates

the minterms set M is created. Each mi ∈ M is used to generate one partition,

combination of minterms provide a new minterms and automatically a new partitions.

All combinations between minterms in M allow to generate all possible partitions.

2. Evaluate the generated schemas using a mathematical cost model. The cost model

allows to DWA to measure the number of (Input/Output) operation for executing the

query workload on the partitioned scheme.

3. Use exhaustive and greedy algorithms to select optimal/suboptimal solutions that

generates the minimum cost. The exhaustive algorithm has a high complexity, due

to the time needed to evaluate all generated schemas, this algorithm can be used

with a smaller size set of predicates. The greedy algorithm is used to select the opti-

mal/suboptimal partitioning schema. First the initial solution is generated using the

affinity attribute based approach, followed by an application of two basic operators,

fusion and merge, respectively. The fusion operator is used for merging two partitions,

the split operator is used for splitting a partition into new partitions.

4.8.4 Data mining based approach

Mahboubi et al. [52] have proposed a new approach that allows to select K partitions

using a data mining algorithm (K-means). The K-means algorithm is used to classify

the predicates into set of clusters (partitions). The general schema of this approach is

summarized as follows:

• Extraction of predicates from the query workload.

• Construction of predicate usage matrix (PUM).

• Classification of predicates into clusters using K-Means algorithm.

• Building horizontal partitions using the set of predicates grouped in each cluster.

52

4.9. SUMMARY

4.8.5 Constrained cost based approach

Bellatreche et al. [9, 10] extend the cost based approach by adding of threshold B deter-

mined by DWA. The threshold B is the number of sub-schemas allowed by the DWA (B is

so called maintenance threshold). For example, if the partitioning process generates 10,000

sub-schemas, the DWA cannot maintain that many sub-schemas. In this approach, the

only solutions taken into consideration are those satisfy the maintenance threshold B. The

general schema of this approach is summarized as follows:

1. Extraction of the set of predicates from the query workload.

2. Coding the partitioning problem using the following multidimensional array.

3. Choosing a meta-heuristic algorithm, such as genetic algorithm, hill climbing or

simulated Annealing to select the optimal/suboptimal partitioning schema. All the

proposed meta-heuristics integrate a cost model to evaluate the solutions.

4.9 Summary

The horizontal partitioning (HP) is an important method in the databases/data warehouses

physical design. The HP is largely used to optimize OLAP queries in the data warehouse

environment. The works on the HP are starting soon, whereas Ceri et al. [20] proposed

approach based on the query workload (predicates) to partition databases, the proposed

approach has a high complexity (2n, n is the number of predicate). Zang et. al [81], following

by Karlapalem et al. [43] have used the graphical algorithm based on the queries frequen-

cies to partition databases. The proposed algorithm has a lower complexity comparing to

the workload based approach but using the query frequency is not enough for efficiency

partitioning. Bellatreche et al. [12] have introduced a cost model to measure partition’s cost,

and the problem is modelling as an optimization problem. The authors used a Hill climbing

algorithm to select the optimal/sub-optimal solution. Mahboubi et al. [52] proposed a data

mining algorithm (K-Means) used to classify the predicates into a set of K clusters using

K-Means algorithm. This approach allows to control the number of partition but based on

the queries frequencies. Bellatreche et al. have proposed approach based on the cost model

and allows to control the number of partitions, the authors have used genetic algorithm and

simulated annealing to select the optimal/sub-optimal solution that respect the number of

generated partitions. The Table. 4.1 Summarize the main well-known proposed approaches

to solve horizontal partitioning problem both in databases and data warehouses.

53


Tabl

e4.

1:Su

mm

ary

ofw

ork

done

onpa

rtit

ioni

ngse

lect

ion

App

roac

hP

arti

tion

ing

type

Sele

ctio

nal

gori

thm

Cos

tm

odel

Pre

dica

tese

tsi

ze

Que

ryw

orkl

oad

size

Wor

kloa

dba

sed

(198

2)[2

0]P

rim

ary

n/a

n/a

n/a

n/a

Att

ribu

teaf

finit

yba

sed

(199

5)[8

1]P

rim

ary

Gra

phic

aln/

an/

a61

Cos

tba

sed

(200

0)[1

2]P

rim

ary/

Ref

eren

ceH

illcl

imbi

ngM

athe

mat

ical

n/a

6

Dat

am

inin

gba

sed

(200

8)[5

2]P

rim

ary

K-m

eans

Mat

hem

atic

al20

-30

20

Con

stra

ined

cost

base

d(2

009)

[10]

Pri

mar

y/R

efer

ence

Gen

etic

/Sim

ulat

edan

neal

ing

Mat

hem

atic

al45

60

54

4.10. CONCLUSIONS

4.10 Conclusions

In this chapter, we have presented up-to-date partitioning modes used in commercial

DBMSs in detail. In addition, we have reviewed the existing partitioning approaches

in both classical databases and the data warehouses. Finally, we have compared and

summarized the reviewed approaches.

55

Part II

Contributions

57

CH

AP

TE

R

5PARTICLE SWARM OPTIMIZATION FOR SOLVING SBJISP

“It is a capital mistake to theorize before one has the data. Insensibly one begins to twist

facts to suit theories, instead of theories to suit facts.”

-Sir Arthur Conan Doyle (1859-1930)

5.1 Introduction

In this chapter, We propose a new approach based on binary particle swarm optimization

(BPSO) using a mathematical cost models (as opposed to a DBMS based cost models) for

solving the BJISP. Several experiments were performed to demonstrate the effectiveness of

the BPSO approach and compared to the two well-known approaches, the genetic algorithm

(GA) based approach and the data mining (DM) based approach. We have also improved

the GA approach by using an improved penalty function and then tested the effect of

the penalty function on the GA approach. This is called the improved version of the GA

approach (GAI). The BPSO approach was also compared against the GAI approach.

5.2 Particle Swarm Optimization (PSO)

Particle swarm optimization (PSO) is a bio-inspired algorithm used for global optimization

problem. The global optimization problem is to minimize or to maximize an objective

function f : S−> R. In this chapter, we interest to minimization problem, which means

59

CHAPTER 5. PARTICLE SWARM OPTIMIZATION FOR SOLVING SBJISP

that the goal is to find a solution x∗ ∈ S in such that

∀x ∈ S f (x∗)≤ f (x) (5.1)

The solution x∗ that satisfies the condition in Eq. 5.1 is called a global minimum. If there

exists an ε> 0 such that

∀x with ‖x− x∗‖ < ε : f (x∗)≤ f (x) (5.2)

The solution x∗ is called local minimum. In this section we introduce particle swarm

optimization methods. First the PSO algorithm for continuous problems is presented,

following by the presentation of the PSO used for the binary optimization problems.

5.2.1 The PSO Algorithm

Swarm intelligence based methods are inspired from natural phenomenons and species

behavior for solving problems. Particle swarm optimization (PSO) is population based meta-

heuristic based on swarm intelligence introduced in 1995 by Eberhart [44]. The particle

swarm algorithm has been found to be robust in solving problems featuring non-linearity

and non-differentiability, multiple optima, and high dimensionality through adaptation

which is derived from social-psychological theory. In PSO, the potential solutions, called

particles, fly through the problem space by following the current sub-optimum particles.

PSO for Continuous Problems The PSO algorithm [29, 44, 46, 63] uses the cognitive and

the social component for the learning processes. The cognitive component is based on the

previous experiences of each particles. The social component is based on the imitation of

better group members.

Transforming the previous components to an iteration-based optimization algorithm,

a population of m particles, explores a n-dimensional search space S of an optimization

problem with objective function f : .

Formally each particle i of the swarm is described by its position xi and its velocity

vi, pi is the best position traveled by particle i. The best value of the entire individual

pi values is called g. At each generation, a particle’s position and velocity are updated

according to its own pi and pg values, this behavior is described by Eq.5.3 and Eq. 5.4:

vt+1i j = wvt

i j + c1r1(pi j − xti j)︸︷︷︸

cognitive component

+ c2r2(pg j − xti j)︸︷︷︸

social component

(5.3)

xt+1i j = xt

i j +vt+1i j (5.4)

60

5.2. PARTICLE SWARM OPTIMIZATION (PSO)

Here, vti j is the current velocity at time t, vt+1

i j is the new speed of particle i, w is the inertia

weight, c1 and c2 are two positive constants, r1 and r1 are the uniformly distributed

random numbers in [0,1], xti j is the current position of particle i at time t and xt+1

i j is new

position of particle i.

The velocity vi j is bounded by the range of velocities [−Vmax,Vmax] to prevent the

particle from flying out of the solution space. The pseudo-code of the PSO algorithm is

shown in Fig. 5.2.1.

Algorithm 3 Particle swarm optimization pseudo-code1: S ← InitializeSwarm(x0

i );2: t ← 1;3: while t ≤ MAXITERATION do4: wt ← ChangeInertia(t) using Eq. 5.7;5: for each particle xt−1

i ∈ S do6: Evaluate particle xt−1

i ;7: if fitness (xt−1

i ) is better than fitness (pt−1i ) then

8: pi ← xt−1i ;

9: end if10: if fitness (pt−1

i ) is better than fitness (g) then11: pg ← pt−1

i ;12: end if13: vt

i ← ChangeV elocity(xt−1i ,vt−1

i ) using Eq. 5.3;14: if |vt

i| >Vmax then15: clamp it to |vt

i|←Vmax16: end if17: xt

i ←U pdatePosition(vti) using Eq. 5.4;

18: end for19: t ← t+1;20: end while

Particle swarm optimization was originally provided for solving the continuous op-

timization problems. However, there exist PSO variants for binary problems [45], are

described in the following section.

5.2.2 Binary particle swarm optimization (BPSO)

The continuous-valued PSO introduced in [44] works in a continuous number space. There

are many discrete optimization problems which require the ordering or arranging of

discrete elements, like scheduling and routing problems. Besides these pure combinatorial

problems, researchers frequently cast floating-point problems in binary terms, and solve

them in a discrete number space. As any problem, discrete or continuous, can be expressed

61


in a binary notation, it is seen that an optimization method which works on binary functions

might be advantageous. [45]

Kennedy and Eberhart [45] have adapted the continuous-valued PSO version to search

in binary space. In this version of PSO, the particle moves in a state space restricted to

zero and one, where each velocity vi j represents the probability of bit xi j taking the value

1. In other words, if vi j = 0.60, then there is a 60 % chance that xi j will be a one, and

40 % chance it will be a zero. A sigmoid transformation is applied to the velocity vi j to

compute the probability of the jth component value of the particle position taking the value

1. Velocities are updated as in Eq. 5.3, but positions are updated using the following rule:

xt+1i j =

1, if Random < S(vt+1i j ))

0, otherwise(5.5)

S(vt+1i j )= 1

1+ e−vt+1i j

(5.6)

The binary PSO also limited vi j by a value vmax like the continuous-valued PSO. In the

binary PSO vmax limits the probability that bit xi j will take on a zero or one value. For

example, if vmax = 6.0, then probabilities are between 0.9975 and 0.0025. We note that

the smaller vmax in binary PSO allows a higher mutation rate against high vmax in the

continuous-valued PSO increases the range explored by a particle [45].

5.2.3 Inertia weight considerations for BPSO

The time-varying inertia weight was proposed in order to improve the performance of

the BPSO [75]. This inertia weight linearly decreases with respect to time. Generally for

initial stages of the search process, larger inertia weight to enhance the global exploration

(searching new area) is recommended while the inertia weight is reduced for local explo-

ration in final stages (fine tuning the current search area). This can be expressed in the

following equation:

w(t)= wmax − (wmax −wmin)K

× t (5.7)

where wmax is initial value of the inertia weight, wmin final values of the inertia weight, t

is current iteration and K is maximum number of allowable iterations.

5.3 BPSO for the SBJI selection problem

The general schema of the bitmap join indexes selection process is based on BPSO and

illustrated in Fig. 5.1. First, the queries workload Q is parsed and a set of candidate

62

5.3. BPSO FOR THE SBJI SELECTION PROBLEM

attributes A is generated. The BPSO algorithm then takes the set A along with the

storage constraint S and the cost models as input and undergoes the selection process. The

process continues until a feasible solution is found, or the generation limit is exceeded (if

the storage constraint is violated, then the solution is marked as infeasible). A feasible

solution, if found, is the final configuration of SBJI indexes.

Data warehouse

Selection Strategy ModuleBased on BPSO

Queries workload

(Q)

Query parser Module

Selectedconfiguration

of SBJI

Storage constraint

(S)

Knowledge(Metadata, schemas,

statistics,...)

Cost modelsModule

Non-key attributesto indexation (A)

Figure 5.1: BPSO based approach for SBJISP

63


5.3.1 Query workload parsing

The OLAP queries present in the query workload Q are parsing for extracting candidate

attributes for indexation A. Extracted attributes are those presented in the WHEREclause of the OLAP queries in Q. The OLAP systems use SELECT queries (read-only)

and the update queries are preceded using batch update [5]. After parsing step, we build

Attributes-usage matrix (AUM). For illustrate the building of AUM, we have a set of

restriction predicates with one of the following forms

• Table.Attribute BETWEEN V alue1 AND V alue2

• Table.Attribute θ V alue, where θ is an operator in ( =, 6=,<, >,≤,≥)

• Table.Attribute LIKE V alue

• Table.Attribute IN (V alue1,V alue2)

A1 A2 A3 A4 A5 A6 A7 A8 A9 A10Q1 1 0 0 0 1 0 1 0 0 0Q2 0 1 1 0 0 0 0 1 1 0Q3 0 0 0 1 0 1 0 0 0 1Q4 0 1 0 0 0 0 1 1 0 0Q5 1 1 1 0 1 0 1 1 1 0Q6 1 0 0 0 1 0 0 0 0 0Q7 0 0 1 0 0 0 0 0 1 0Q8 0 0 1 1 0 1 0 0 1 1

Figure 5.2: Example of Attributes-Queries matrix AUM

5.3.2 Solution coding

Year City Gender Family Monthxi1 xi2 xi3 xi4 xi50 1 1 1 0

Figure 5.3: Solution coding example

One of the most important issues when designing a BPSO algorithm is the coding

of a solution. The search space of the proposed BPSO algorithm is composed of a set of

indexable attributes A for building a set of bitmap join indexes. A binary vector of particle

64

5.4. EXPERIMENTAL RESULTS

positions xi is used for the coding of a solution, where xi j is set to 1 if the attribute A j is

used to build SBJI, 0 otherwise. Fig. 5.3 illustrates solution coding of A being equal to

the set {Year, City, Gender, Family, Month}, for example xi2 indicates that a SBJI on City

attribute is built and xi5 indicates that a SBJI on Month attribute is not built.

5.3.3 Fitness function

Both the BPSO and GA generate infeasible solutions that violate storage constraint S

when solving the SBJISP. We deal with the violation of constraint (S) by using a penalty

function. The penalty function decreases the fitness of a solution candidate that violates a

constraint. For the GA the following penalty function has been proposed [15] - here, Q and

C are as defined in Section 2.1:

f itness(C)=OvCost(Q,C)×Pen(C) , i f Pen(C)> 1

OvCost(Q,C) , otherwise(5.8)

Pen(C)=∑

k∈C Size(BJIk)S

(5.9)

The principal limit of the GAI described in [15] is the infeasible solutions generated. To

find any feasible solution in many cases is itself NP-hard [54]. In order for the BPSO

generate more feasible solutions, we present a new fitness function based on an exponential

penalty. The fitness function is described as follows:

f itness(C)=OvCost(Q,C)×2Pen(C) , i f Pen(C)> 1

OvCost(Q,C) , otherwise(5.10)

The fitness function in the genetic algorithm (GA) is replaced with a new fitness function

based on an exponential penalty. As stated earlier, this improved genetic algorithm is

called GAI. Furthermore, the performance of GAI was tested and compared against the

BPSO, GA and DM approaches.

5.4 Experimental Results

5.4.1 Problem instances

In the experiments to follow, the benchmark APB-I [27] is used to generate the data

warehouse and ORACLE 11g DBMS environment is used to implement the data warehouse.

In this benchmark, the star schema contains four dimension tables: CHANLEVEL (9

tuples), CUSTLEVEL (900 tuples), PRODLEVEL (9,000 tuples), TIMELEVEL (24 tuples)

65


Customers

CustID CHAR(12)

RETAILER CHAR(12)

CITY CHAR(13)

STATE CHAR(15)

GENDER CHAR(1)

TYPE CHAR(12)

EDUCATIONAL CHAR(10)

MARITAL CHAR(1)

Channels

ChanID CHAR(12)

ALL CHAR(12)

CATEGORY CHAR(12)

Times

TimesID CHAR(12)

YEAR CHAR(4)

QUARTER CHAR(2)

MONTH CHAR(2)

DAY CHAR(10)

Products

ProdID CHAR(12)

GROUP CHAR(12)

FAMILY CHAR(12)

LINE CHAR(12)

DIVISION CHAR(12)

SUPPLIER CHAR(12)

STATUS CHAR(12)

Actvars

CustID # CHAR(12)

ChanID # CHAR(12)

TimesID # CHAR(12)

ProdID # CHAR(12)

UNITSSOLD FLOAT(126)

DOLLARSALES FLOAT(126)

DOLLARCOST FLOAT(126)

Figure 5.4: The star schema for the APB-I data warehouse (primary and foreign keys areunderlined, foreign keys are recognized with # symbol)

and the fact table ACTVARS (24,786,000 tuples). The data warehouse schema is presented

in Fig. 3. Three classes of experiments are performed:

1. the class of smaller size problem set (CSP): this class contains 100 OLAP queries

and 12 non-key attributes from dimension tables: {all, year, retailer, quarter, month,

line, group, family, division, class, gender, city} with cardinalities: 9, 2, 99, 4, 12, 15,

300, 75, 4, 605, 2 and 255 respectively.

2. the class of moderate size problem set (CMP): this class consists of 250 OLAP

queries and 16 non-key attributes from dimension tables: {division, line, family, group,

class, status, year, quarter, month, day, state, city, retailer, type, gender, all} with

cardinalities: 4, 15, 75, 300, 605, 5, 2, 4, 12, 5, 45, 255, 99, 10, 2 and 9 respectively.

3. the class of larger size problem set (CLP): this class consists of 500 OLAP queries

and 20 non-key attributes from dimension tables: {all, year, retailer, quarter, month,

day, line, group, family, division, class, gender, city, state, type, educational, marital,

supplier, status, category} with cardinalities: 9, 2, 99, 4, 12, 5, 15, 300, 75, 4, 605, 2,

255, 45, 10, 6, 4, 15, 5 and 3 respectively.

66


5.4.2 Preview

All tests are performed under Intel i7 (4 cores) processor with 8 GB RAM. We have

implemented the BPSO algorithm with Java Development Kit. The Genetic algorithm

is implemented with Jenetics: Java Genetic Algorithm framework1 and the data mining

algorithm with SPMF framework2 for BJIOSP (see [5, 15] for the details about DM and

GA implementations). For the cost models, the disk page size P was set to 65,536 bytes

and the B-tree order was set to 2. In the stochastic algorithms the GA, GAI, and BPSO, the

Table 5.1: Optimized minimum support val-ues on the three problem classes for dif-ferent storage sizes. The fact table size is24,786,000 tuples.

S CSP CMP CLP(MB) MinSup MinSup MinSup

500 0.12 0.20 0.12600 0.12 0.14 0.12700 0.12 0.14 0.12800 0.12 0.20 0.12900 0.12 0.14 0.121,000 0.12 0.14 0.121,100 0.12 0.14 0.121,200 0.12 0.14 0.121,300 0.12 0.14 0.121,400 0.22 0.14 0.121,500 0.12 0.14 0.121,600 0.12 0.14 0.121,700 0.12 0.14 0.121,800 0.12 0.14 0.121,900 0.12 0.14 0.122,000 0.12 0.14 0.12

Table 5.2: Optimized minimum support val-ues on the three problem classes for differentstorage and fact table sizes. The fact tablesizes are in millions.

S |F| CSP CMP CLP(MB) (M) MinSup MinSup MinSup

500 30 0.20 0.18 0.1260 0.20 0.20 0.1290 0.22 0.18 0.12120 0.22 0.18 0.14150 0.22 0.18 0.14

1,000 30 0.20 0.16 0.1260 0.20 0.18 0.1290 0.20 0.18 0.12120 0.20 0.20 0.12150 0.22 0.18 0.12

1,500 30 0.20 0.16 0.1260 0.20 0.18 0.1290 0.20 0.18 0.12120 0.20 0.18 0.12150 0.22 0.18 0.12

2,000 30 0.20 0.16 0.1260 0.20 0.16 0.1290 0.20 0.18 0.12120 0.20 0.18 0.12150 0.20 0.20 0.14

maximum number of evaluations is equal to the number of iterations × the population size.

For the GA, the parameter setup proposed in Bouchakri and Bellatreche [15] was used.

The population size, crossover probability, mutation rate and number of iterations were set

to 70, 0.95, 0.01 and 200 respectively yielding 14,000 maximum evaluations (200×70). For

the GAI, the same parameter setup as the GA was used except the crossover probability

was set to 0.80. For the BPSO, the parameters c1, c2, Vmax, wmax, wmin were set to 2.0, 2.0,

1http://jenetics.sourceforge.net/2http://www.philippe-fournier-viger.com/spmf/

67


6.0, 0.95 and 0.5 respectively [44, 45, 65]. The population size and the number of iterations

were set to 30 and 200 respectively yielding 6000 maximum evaluations (200×30). For

the DM, empirical tests were performed to find the best minimum support on the three

problem sets CSP, CMP and CLP (the minimum support was systematically increased

from 0.08 to 0.50 in 0.02 increments). Table 5.1 shows the best minimum support found for

the problem classes CSP, CMP and CLP as the storage size was increased systematically

from 500 MB to 2,000 MB (here, the fact table size is 24,786,000 - see Section 4.1 for

details). Also, for the storage sizes 500, 1000, 1500, and 2000 (all in MB), the fact table

size was systematically increased from 30 million to 150 million in 30 million increments

and the best minimum support found was recorded (shown in Table 5.2).

5.4.3 Performance Study

A set of experiments using the cost models described in Section 2.2 was used to analyze

the efficiency of the BPSO approach against the three other approaches: GA, DM and GAI

using the three problem sets CSP, CMP and CLP. For the stochastic algorithms BPSO,

GA and GAI, the average cost of feasible solutions over 5 independent runs are reported.

The DM method runs on a single problem set, therefore is not considered to be stochastic

in that sense. The storage size S was systematically increased from 500 MB to 2000 MB

(in 100 MB increments, yielding a total of 16 different cases) to determine the cost for

each of the algorithms. Therefore, a total of 80 (16×5) runs is under consideration for

the algorithms BPSO, GA and GAI, and 16 (16×1) runs is for the DM. Tables 5.3, 5.5

and Table 5.7 show the number of disk page accesses (I/Os costs) using the cost models

described in Section 2.2 when executing the set of queries in the problem sets CSP, CMP

and CLP respectively for 16 different storage sizes. For the BPSO, GA and GAI methods,

a solution can be either feasible or infeasible with respect to the storage size constraint

S. The column Avg represents the average cost of solutions found over the 5 independent

runs. The column Best represents the value of the best minimum cost solution found. The

column Evals represents the average of the number of candidate evaluations performed for

reaching the best solution for the 5 independent runs (i.e., a measure about how fast an

algorithm finds an optimal/sub-optimal solution or converges). The column Err represents

the rate of infeasible solutions (i.e., solutions that do not satisfy the storage size constraint).

In the case of an infeasible solution, the cost was computed using the hash-join method

without using a BJIO configuration. The column Time shows the average computation time

in seconds. The last column represents the following: for each storage size S, the Kruskal

Wallis test was applied to the results of the BPSO, GA and GAI for the 5 independent

68

runs to see if results are significantly different or not (using a standard significance level

of 0.05). Symbol⊙

indicates that there is no statistically significant difference between

the results of the three algorithms BPSO, GA and GAI. Symbol⊕

indicates that the

results of the BPSO are statistically different than those of the GA and GAI. The last row

provides an overall average for the runs. In each table row the best (i.e., the minimum)

and the average querying performance result were presented in bold font for each of the

algorithms considered. In Tables 5.4, 5.6 and Table 5.8 the columns %Best represents the

best performance rates and %Avg represents the average performance rates computed

using Eq. 7.14 shown below.

%Best = 100×(1− Best

CostWoBJI

); %Avg = 100×

(1− Avg

CostWoBJI

)(5.11)

CostWoBJI represents the workload cost without using the SBJI configuration (i.e., using

the hash-join method).

5.4.3.1 The smaller size problem set CSP results

The querying performance for the smaller size problem set CSP presented in Table 5.3

indicates that the BPSO algorithm has generated better results in general. The BPSO

has outperformed the methods GAI, GA and DM for both the best solutions found and

average solution quality obtained. In terms of the best solutions found, the BPSO algorithm

has generated the best solutions in the 74 out of 80 cases (92.5%) while the GAI method

has generated the best solutions in 58 out of 80 cases (72.50%). The GA method has

only generated the best solutions in 53 out of 80 cases (66.25%). The DM method did

not generate any best solution in all 16 cases (0%). In terms of the average number of

evaluations performed, the BPSO needed about 10 times (10.39 exactly) and about 14

times (13.57 exactly) less evaluations than the GAI and GA respectively, yet achieving

better solution quality. In terms of the infeasible solutions generated, both the BPSO and

the GAI have always generated feasible solutions (i.e., 0% infeasible or 100% feasible)

while the GA has generated infeasible solutions in 21 out of 80 cases, which is 59 feasible

ones - i.e., 26.25% infeasible or 73.75% feasible (all generated solutions were infeasible

in the case when the storage size was equal to 1100, 1300, or 1500 MB. In terms of the

computation time, the BPSO was about 4 times (3.61 exactly) faster than the GAI and 7

times (7.03 exactly) faster than the GA. In fact, the DM computation times were smallest

on average (about 20 seconds), but the approach did not generate any best solution in all

16 cases. The last column of Table 5.3 indicates that there was a statistically significant

difference between the results of the BPSO, GAI and GA in 6 out of 16 cases. Table 5.4

69

shows the optimization rates of the algorithms BPSO, GAI, GA and DM for the smaller

problem set CSP. The best (%Best) and the average (%Avg) performance rates obtained

by the BPSO (81.06% and 80.95 % respectively) were higher than those for the GAI, GA

and DM (the average performance rates for the BPSO were slightly higher than those for

the GAI and DM). In summary, as also indicated by the last row of Table 5.3, the BPSO

algorithm has shown better performance than the GAI, GA and DM methods in all aspects.

70

Tabl

e5.

3:Q

uery

ing

perf

orm

ance

resu

lts

for

the

smal

ler

size

prob

lem

set

CS

P.

SB

PSO

GA

IG

AD

MSt

at.

sign

.(M

B)

Bes

tA

vgE

rrE

vals

Tim

eB

est

Avg

Err

Eva

lsT

ime

Bes

tA

vgE

rrE

vals

Tim

eB

est

Tim

e

500

20,2

32,2

38.3

20,2

32,2

38.3

0%28

869

20,2

32,2

38.3

20,2

40,7

14.8

0%5,

670

801

20,2

32,2

38.3

20,2

40,7

14.8

0%4,

438

960

20,2

74,6

20.5

19⊙

600

20,2

32,2

38.3

20,2

32,2

38.3

0%25

860

20,2

32,2

38.3

20,2

32,2

38.3

0%4,

396

600

20,2

32,2

38.3

33,7

74,7

69.1

20%

3,64

054

220

,274

,620

.519

⊙70

019

,366

,563

.419

,539

,698

.30%

648

436

20,2

32,2

38.3

20,2

32,2

38.3

0%2,

534

1,98

419

,366

,563

.419

,996

,117

.20%

3,51

41,

918

19,9

17,3

07.6

21⊕

800

19,3

66,5

63.4

19,3

66,5

63.4

0%31

873

19,3

66,5

63.4

19,5

39,6

98.3

0%5,

502

1,00

619

,366

,563

.419

,712

,833

.30%

4,41

01,

273

19,9

17,3

07.6

20⊙

900

19,3

66,5

63.4

19,3

66,5

63.4

0%33

678

19,3

66,5

63.4

19,3

66,5

63.4

0%4,

928

700

19,3

66,5

63.4

19,3

66,5

63.4

0%6,

622

979

19,9

17,3

07.6

20⊙

1,00

019

,366

,563

.419

,366

,563

.40%

324

7519

,366

,563

.419

,539

,698

.30%

5,60

094

719

,366

,563

.419

,366

,563

.40%

4,73

273

119

,917

,307

.620

⊙1,

100

19,3

66,5

63.4

19,3

66,5

63.4

0%30

61,

452

19,3

66,5

63.4

19,3

66,5

63.4

0%3,

556

2,00

287

,944

,892

.087

,944

,892

.010

0%14

,000

6,00

019

,917

,307

.621

⊕1,

200

18,4

18,7

79.2

18,7

97,8

92.9

0%95

477

518

,418

,779

.218

,797

,892

.90%

4,78

81,

268

87,9

44,8

92.0

87,9

44,8

92.0

100%

14,0

006,

000

19,9

17,3

07.6

20⊕

1,30

018

,418

,779

.218

,418

,779

.20%

366

8618

,418

,779

.218

,608

,336

.00%

7,01

41,

331

18,4

18,7

79.2

74,0

39,6

69.4

80%

7,05

64,

854

19,9

17,3

07.6

21⊕

1,40

016

,726

,453

.017

,741

,848

.70%

1,23

01,

123

18,4

18,7

79.2

18,6

08,3

36.0

0%6,

020

2,03

787

,944

,892

.087

,944

,892

.010

0%14

,000

6,00

019

,917

,307

.620

⊕1,

500

12,5

96,5

74.7

12,5

96,5

74.7

0%1,

152

276

14,2

91,1

33.4

16,9

12,5

57.2

0%8,

834

1,95

312

,596

,574

.727

,666

,238

.220

%7,

602

1,95

114

,291

,133

.420

⊕1,

600

12,5

96,5

74.7

12,5

96,5

74.7

0%35

497

12,5

96,5

74.7

12,9

35,4

86.4

0%5,

222

879

12,5

96,5

74.7

12,5

96,5

74.7

0%3,

066

550

14,2

91,1

33.4

20⊙

1,70

012

,596

,574

.712

,596

,574

.70%

336

9012

,596

,574

.712

,596

,574

.70%

4,18

656

312

,596

,574

.712

,596

,574

.70%

3,29

046

414

,291

,133

.420

⊙1,

800

12,5

96,5

74.7

12,5

96,5

74.7

0%28

263

12,5

96,5

74.7

12,5

96,5

74.7

0%4,

102

564

12,5

96,5

74.7

12,5

96,5

74.7

0%3,

864

575

14,2

91,1

33.4

20⊙

1,90

012

,596

,574

.712

,596

,574

.70%

174

3912

,596

,574

.712

,596

,574

.70%

3,47

246

912

,596

,574

.712

,596

,574

.70%

3,93

458

314

,291

,133

.420

⊙2,

000

12,5

96,5

74.7

12,5

96,5

74.7

0%20

446

12,5

96,5

74.7

12,5

96,5

74.7

0%2,

534

343

12,5

96,5

74.7

12,5

96,5

74.7

0%4,

102

604

14,2

91,1

33.4

20⊙

Avg

16,6

52,5

47.1

16,7

50,5

24.8

0%47

130

216

,918

,332

.117

,172

,913

.90%

4,89

71,

090

29,7

35,2

27.1

35,0

61,3

13.6

26.2

5%6,

392

2,12

417

,852

,156

.420

Tabl

e5.

4:O

ptim

izat

ion

rate

sfo

rth

eC

SP

.

SB

PSO

GA

IG

AD

M(M

B)

%B

est

%A

vgE

rr%

Bes

t%

Avg

Err

%B

est

%A

vgE

rr%

Bes

t

500

76.9

9%76

.99%

0%76

.99%

76.9

8%0%

76.9

9%76

.98%

0%76

.95%

600

76.9

9%76

.99%

0%76

.99%

76.9

9%0%

76.9

9%61

.60%

20%

76.9

5%70

077

.98%

77.7

8%0%

76.9

9%76

.99%

0%77

.98%

77.2

6%0%

77.3

5%80

077

.98%

77.9

8%0%

77.9

8%77

.78%

0%77

.98%

77.5

9%0%

77.3

5%90

077

.98%

77.9

8%0%

77.9

8%77

.98%

0%77

.98%

77.9

8%0%

77.3

5%1,

000

77.9

8%77

.98%

0%77

.98%

77.7

8%0%

77.9

8%77

.98%

0%77

.35%

1,10

077

.98%

77.9

8%0%

77.9

8%77

.98%

0%0.

00%

0.00

%10

0%77

.35%

1,20

079

.06%

78.6

3%0%

79.0

6%78

.63%

0%0.

00%

0.00

%10

0%77

.35%

1,30

079

.06%

79.0

6%0%

79.0

6%78

.84%

0%79

.06%

15.8

1%80

%77

.35%

1,40

080

.98%

79.8

3%0%

79.0

6%78

.84%

0%0.

00%

0.00

%10

0%77

.35%

1,50

085

.68%

85.6

8%0%

83.7

5%80

.77%

0%85

.68%

68.5

4%20

%83

.75%

1,60

085

.68%

85.6

8%0%

85.6

8%85

.29%

0%85

.68%

85.6

8%0%

83.7

5%1,

700

85.6

8%85

.68%

0%85

.68%

85.6

8%0%

85.6

8%85

.68%

0%83

.75%

1,80

085

.68%

85.6

8%0%

85.6

8%85

.68%

0%85

.68%

85.6

8%0%

83.7

5%1,

900

85.6

8%85

.68%

0%85

.68%

85.6

8%0%

85.6

8%85

.68%

0%83

.75%

2,00

085

.68%

85.6

8%0%

85.6

8%85

.68%

0%85

.68%

85.6

8%0%

83.7

5%

Avg

81.0

6%80

.95%

0%80

.76%

80.4

7%0%

66.1

9%60

.13%

26.2

5%79

.70%

71

5.4.3.2 The moderate size problem set CMP results

The querying performance for the moderate size problem set CMP presented in Table 5.5

indicates that the BPSO algorithm has generated better results in general. The BPSO has

outperformed the methods GAI, GA and DM for both the best solutions found and average

solution quality obtained. In terms of the best solutions found, the BPSO algorithm has

generated the best solutions in 78 out of 80 cases (97.5%) while the GAI method has

generated the best solutions in 60 out of 80 cases (75%). The GA generated the best

solutions in 55 cases (68.75%), the DM algorithm did not generate any best solution in all

16 cases (i.e., 0 out of 80, or 0%). In terms of the average number of evaluations performed,

the BPSO needed about 8 times (8.46 exactly) less evaluations than the GAI and about 11

times (10.56 exactly) less evaluations than the GA, yet achieving better solution quality. In

terms of the infeasible solutions generated, both the BPSO and GAI have always generated

feasible solutions (i.e., 0% infeasible or 100% feasible) while GAI has generated a significant

number of infeasible solutions, 19 runs out of 80 cases, which means 61 feasible ones - i.e.,

23.75% infeasible or 76.25% feasible (all generated solutions were infeasible in the cases

when the storage size was equal to 1600 or 1700 MB). The last table column indicates that

there was a statistically significant difference between the results of the BPSO and both

the GAI and GA in 5 out of 16 cases. In terms of the computation time, the BPSO was

about 6 times (6.19 exactly) times faster than the GAI and about 7 times (7.33 exactly)

faster than the GA. Table 5.6 shows the optimization rates of the algorithms BPSO, GAI,

GA and DM for the moderate problem set CMP. The best performance rates (%Best) for

the BPSO were similar to those for the GAI and slightly higher than those for the DM and

significantly higher than those for the GA (due to a high the number of infeasible solutions

generated by the GA). The average performance rates (%Avg) for the BPSO were also

slightly higher than those for the GAI and DM, and significantly higher than those for the

GA. In summary, as also indicated by the last row of Table 5.5, the BPSO algorithm has

shown considerably better performance than the GAI, GA and DM methods in all aspects.

72

Tabl

e5.

5:Q

uery

ing

perf

orm

ance

resu

lts

for

the

mod

erat

esi

zepr

oble

mse

tC

MP

.

SB

PSO

GA

IG

AD

MSt

at.

sign

.(M

B)

Bes

tA

vgE

rrE

vals

Tim

eB

est

Avg

Err

Eva

lsT

ime

Bes

tA

vgE

rrE

vals

Tim

eB

est

Tim

e

500

57,2

21,5

45.8

57,2

21,5

45.8

0%34

225

957

,221

,545

.857

,221

,545

.80%

6,25

82,

502

57,2

21,5

45.8

57,2

21,5

45.8

0%6,

986

3,02

883

,189

,106

.432

⊙60

057

,221

,545

.857

,221

,545

.80%

366

260

57,2

21,5

45.8

57,2

21,5

45.8

0%3,

066

1,24

657

,221

,545

.857

,221

,545

.80%

6,44

02,

909

62,9

83,1

69.8

37⊙

700

57,2

21,5

45.8

57,2

21,5

45.8

0%39

027

857

,221

,545

.857

,353

,505

.80%

3,68

22,

622

57,2

21,5

45.8

57,2

21,5

45.8

0%4,

858

2,15

562

,983

,169

.837

⊙80

057

,221

,545

.857

,221

,545

.80%

558

403

57,2

21,5

45.8

57,3

53,5

05.8

0%6,

230

3,51

757

,221

,545

.857

,221

,545

.80%

5,68

42,

592

62,5

40,9

98.0

41⊙

900

57,2

21,5

45.8

57,2

21,5

45.8

0%34

224

557

,221

,545

.870

,276

,181

.00%

7,42

04,

381

57,2

21,5

45.8

96,2

10,3

91.7

20%

7,42

02,

246

62,5

40,9

98.0

41⊙

1,00

052

,571

,745

.553

,444

,328

.50%

1,70

41,

405

52,5

71,7

45.5

54,9

95,1

12.6

0%4,

760

4,55

952

,571

,745

.555

,481

,739

.90%

5,55

85,

394

62,5

40,9

98.0

41⊙

1,10

052

,571

,745

.552

,571

,745

.50%

936

684

52,5

71,7

45.5

55,2

46,8

71.6

0%9,

002

6,03

552

,571

,745

.553

,501

,705

.60%

4,55

03,

200

62,5

40,9

98.0

41⊙

1,20

052

,571

,745

.552

,571

,745

.50%

684

513

52,5

71,7

45.5

54,4

31,6

65.6

0%7,

546

4,29

552

,571

,745

.554

,431

,665

.60%

6,98

64,

717

57,5

49,1

35.1

41⊙

1,30

052

,571

,745

.552

,571

,745

.50%

480

368

52,5

71,7

45.5

53,5

01,7

05.6

0%4,

760

3,28

452

,571

,745

.592

,358

,591

.420

%6,

888

2,96

357

,549

,135

.142

⊙1,

400

52,5

71,7

45.5

52,5

71,7

45.5

0%94

271

552

,571

,745

.553

,598

,060

.10%

6,94

44,

269

52,5

71,7

45.5

211,

719,

129.

180

%11

,886

7,69

757

,549

,135

.142

⊕1,

500

52,5

71,7

45.5

52,5

71,7

45.5

0%85

265

452

,571

,745

.552

,827

,584

.90%

6,83

23,

649

52,5

71,7

45.5

172,

862,

243.

360

%9,

730

6,15

457

,549

,135

.142

⊕1,

600

52,5

71,7

45.5

52,5

71,7

45.5

0%41

433

652

,571

,745

.553

,489

,226

.20%

5,23

63,

767

251,

505,

975.

025

1,50

5,97

5.0

100%

14,0

0010

,000

57,5

49,1

35.1

42⊕

1,70

051

,631

,175

.451

,819

,289

.40%

1,60

22,

123

51,6

31,1

75.4

53,6

92,1

02.3

0%7,

672

4,90

525

1,50

5,97

5.0

251,

505,

975.

010

0%14

,000

10,0

0057

,549

,135

.142

⊕1,

800

39,1

36,6

45.4

39,1

36,6

45.4

0%1,

236

969

39,1

36,6

45.4

50,3

35,9

91.1

0%7,

098

5,72

439

,136

,645

.439

,136

,645

.40%

6,98

63,

588

57,5

49,1

35.1

42⊕

1,90

039

,136

,645

.439

,136

,645

.40%

498

386

39,1

36,6

45.4

41,6

35,5

51.4

0%7,

420

3,98

139

,136

,645

.442

,741

,146

.10%

7,01

43,

450

57,5

49,1

35.1

42⊙

2,00

039

,136

,645

.439

,136

,645

.40%

360

283

39,1

36,6

45.4

39,1

36,6

45.4

0%5,

264

2,39

639

,136

,645

.439

,136

,645

.40%

4,71

82,

352

44,6

22,4

15.0

42⊙

Avg

51,4

46,9

41.2

51,5

13,2

34.8

0%73

261

751

,446

,941

.253

,894

,800

.10%

6,19

93,

821

76,3

72,5

05.5

99,3

42,3

77.3

23.7

%7,

732

4,52

860

,270

,933

.440

Tabl

e5.

6:O

ptim

izat

ion

rate

sfo

rth

eC

MP

.

SB

PSO

GA

IG

AD

M(M

B)

%B

est

%A

vgE

rr%

Bes

t%

Avg

Err

%B

est

%A

vgE

rr%

Bes

t

500

77.2

5%77

.25%

0%77

.25%

77.2

5%0%

77.2

5%77

.25%

0%66

.92%

600

77.2

5%77

.25%

0%77

.25%

77.2

5%0%

77.2

5%77

.25%

0%74

.96%

700

77.2

5%77

.25%

0%77

.25%

77.2

0%0%

77.2

5%77

.25%

0%74

.96%

800

77.2

5%77

.25%

0%77

.25%

77.2

0%0%

77.2

5%77

.25%

0%75

.13%

900

77.2

5%77

.25%

0%77

.25%

72.0

6%0%

77.2

5%61

.75%

20%

75.1

3%1,

000

79.1

0%78

.75%

0%79

.10%

78.1

3%0%

79.1

0%77

.94%

0%75

.13%

1,10

079

.10%

79.1

0%0%

79.1

0%78

.03%

0%79

.10%

78.7

3%0%

75.1

3%1,

200

79.1

0%79

.10%

0%79

.10%

78.3

6%0%

79.1

0%78

.36%

0%77

.12%

1,30

079

.10%

79.1

0%0%

79.1

0%78

.73%

0%79

.10%

63.2

8%20

%77

.12%

1,40

079

.10%

79.1

0%0%

79.1

0%78

.69%

0%79

.10%

15.8

2%80

%77

.12%

1,50

079

.10%

79.1

0%0%

79.1

0%79

.00%

0%79

.10%

31.2

7%60

%77

.12%

1,60

079

.10%

79.1

0%0%

79.1

0%78

.73%

0%0.

00%

0.00

%10

0%77

.12%

1,70

079

.47%

79.4

0%0%

79.4

7%78

.65%

0%0.

00%

0.00

%10

0%77

.12%

1,80

084

.44%

84.4

4%0%

84.4

4%79

.99%

0%84

.44%

84.4

4%0%

77.1

2%1,

900

84.4

4%84

.44%

0%84

.44%

83.4

5%0%

84.4

4%83

.01%

0%77

.12%

2,00

084

.44%

84.4

4%0%

84.4

4%84

.44%

0%84

.44%

84.4

4%0%

82.2

6%

Avg

79.5

4%79

.52%

0%79

.54%

78.5

7%0%

69.6

3%60

.50%

23.7

5%76

.04%

73

5.4.3.3 The larger size problem set CLP results

Table 5.7 shows the querying performance for the BPSO,GAI, GA and DM approaches

for the larger size problem set CLP. Note that this class is the hardest one. The BPSO

algorithm has again generated better results in general. The BPSO has outperformed the

methods GAI, GA and DM for both the best solutions found and average solution quality

obtained. In terms of the best solutions found, the BPSO algorithm has generated the best

solutions in 68 out of 80 cases (85%) while the GAI method has generated the best solutions

in 43 out of 80 cases (53.75%). The GA generated the best solutions in 39 out of 80 runs

(48.75%, the DM algorithm has not generated any best solution in all 16 cases (i.e., 0 out of

80, or 0%). In terms of the average number of evaluations performed, the BPSO needed

about 9 times (9.47 exactly) less evaluations than the GAI and about 11 times (11.11

exactly) less evaluations than the GA, yet achieving better solution quality. In terms of the

infeasible solutions generated, the BPSO and GAI have both always generated feasible

solutions (i.e., 0% infeasible or 100% feasible) while the GA has generated infeasible

solutions in 9 runs out of 80 cases, which means 71 feasible ones (i.e., 11.25% infeasible

or 88.75% feasible). In terms of the computation time, the BPSO was about 4 times (3.50

exactly) faster than the GAI and about 4 times (3.74 exactly) faster than the GA. The

last table column indicates that there was a statistically significant difference between

the results of BPSO and both the GAI and GA in 4 out of 16 cases. Table 5.8 shows the

optimization rates of the BPSO, GAI, GA and DM for the larger problem set CMP. The best

performance rates (%Best) for the BPSO were slightly higher than those for the GAI and

GA and reasonably higher than those for the DM. The average performance rates (%Avg)

for the BPSO were also slightly higher than those for the GAI and reasonably higher than

those for the GA. In summary, as also indicated by the last row of Table 5.7, the BPSO

algorithm has again shown considerably better performance than the GAI, GA and DM

methods in all aspects.

74

Tabl

e5.

7:Q

uery

ing

perf

orm

ance

resu

lts

for

the

larg

ersi

zepr

oble

mse

tC

LP

.

SB

PSO

GA

IG

AD

MSt

at.

sign

.(M

B)

Bes

tA

vgE

rrE

vals

Tim

eB

est

Avg

Err

Eva

lsT

ime

Bes

tA

vgE

rrE

vals

Tim

eB

est

Tim

e

500

120,

162,

008.

112

0,16

2,00

8.1

0%50

463

312

0,16

2,00

8.1

120,

162,

008.

10%

6,45

44,

407

120,

162,

008.

112

0,16

2,00

8.1

0%7,

252

5,30

022

4,98

3,17

3.4

123

⊙60

012

0,16

2,00

8.1

120,

162,

008.

10%

444

558

120,

162,

008.

112

1,43

7,24

1.8

0%6,

454

6,17

212

0,16

2,00

8.1

120,

162,

008.

10%

8,34

46,

331

164,

811,

723.

913

5⊙

700

120,

162,

008.

112

0,16

2,00

8.1

0%49

260

612

0,16

2,00

8.1

120,

162,

008.

10%

5,72

34,

008

120,

162,

008.

112

3,56

4,30

3.8

0%6,

608

4,34

016

4,81

1,72

3.9

128

⊕80

012

0,16

2,00

8.1

120,

162,

008.

10%

408

514

120,

162,

008.

112

0,16

2,00

8.1

0%8,

330

5,93

212

0,16

2,00

8.1

120,

162,

008.

10%

7,09

86,

423

161,

576,

842.

412

5⊙

900

120,

162,

008.

112

0,16

2,00

8.1

0%51

667

412

0,16

2,00

8.1

122,

314,

119.

80%

9,67

410

,276

120,

162,

008.

112

0,41

1,93

4.8

0%6,

188

5,69

716

1,57

6,84

2.4

126

⊙1,

000

120,

162,

008.

112

0,16

2,00

8.1

0%49

265

312

0,16

2,00

8.1

121,

596,

749.

20%

5,34

86,

279

120,

162,

008.

112

0,87

9,37

8.7

0%7,

420

7,50

716

1,57

6,84

2.4

127

⊙1,

100

120,

162,

008.

112

1,03

3,81

0.5

0%49

83,

646

120,

162,

008.

112

2,08

7,01

7.3

0%5,

782

6,81

212

0,16

2,00

8.1

221,

182,

491.

520

%6,

762

5,67

214

9,38

8,38

4.0

124

⊙1,

200

112,

513,

169.

411

2,51

3,16

9.4

0%1,

566

2,02

111

6,47

4,24

6.7

119,

424,

455.

80%

6,70

610

,139

112,

513,

169.

411

8,63

2,24

0.4

0%6,

580

10,4

9514

9,38

8,38

4.0

126

⊕1,

300

112,

513,

169.

411

8,63

2,24

0.4

0%57

66,

211

112,

513,

169.

411

8,02

1,79

6.9

0%8,

050

10,1

5411

2,51

3,16

9.4

118,

632,

240.

40%

6,45

49,

355

149,

388,

384.

012

7⊙

1,40

011

2,51

3,16

9.4

114,

042,

937.

20%

462

2,12

811

2,51

3,16

9.4

118,

632,

240.

40%

7,70

09,

019

112,

513,

169.

411

7,10

2,47

2.6

0%7,

112

9,06

114

9,38

8,38

4.0

123

⊙1,

500

112,

513,

169.

411

4,04

2,93

7.2

0%52

82,

141

112,

513,

169.

411

7,10

2,47

2.6

0%6,

552

8,61

311

2,51

3,16

9.4

118,

632,

240.

40%

5,75

49,

389

149,

388,

384.

012

3⊙

1,60

011

2,51

3,16

9.4

115,

572,

704.

90%

1,15

24,

363

112,

513,

169.

411

8,63

2,24

0.4

0%8,

372

9,77

712

0,16

2,00

8.1

120,

162,

008.

10%

5,27

810

,369

149,

388,

384.

012

2⊙

1,70

011

2,51

3,16

9.4

115,

572,

704.

90%

918

3,99

811

2,51

3,16

9.4

116,

609,

606.

10%

6,80

49,

616

112,

513,

169.

432

0,67

3,20

7.1

40%

9,92

611

,379

149,

388,

384.

013

1⊙

1,80

011

2,51

3,16

9.4

114,

042,

937.

20%

948

2,71

711

2,51

3,16

9.4

114,

042,

937.

20%

7,16

86,

721

120,

162,

008.

132

2,20

2,97

4.9

40%

8,42

811

,131

149,

388,

384.

012

4⊕

1,90

011

1,01

3,98

4.0

111,

013,

984.

00%

1,27

21,

799

111,

013,

984.

011

3,02

9,24

1.8

0%9,

660

11,0

4512

0,73

1,10

0.0

524,

357,

760.

080

%12

,810

14,6

1512

9,69

4,92

9.8

123

⊕2,

000

87,9

26,2

65.2

96,9

01,7

73.1

0%1,

530

3,88

387

,926

,265

.210

7,26

7,71

0.9

0%7,

826

9,13

087

,926

,265

.210

0,93

4,38

0.7

0%7,

966

9,68

012

9,69

4,92

9.8

123

⊙A

vg11

4,22

9,15

5.8

115,

896,

328.

00%

769

2,28

411

4,47

6,72

3.1

118,

167,

740.

90%

7,28

88,

006

115,

792,

580.

317

5,49

0,85

3.6

11.2

%7,

499

8,54

615

5,86

4,63

0.0

126

Tabl

e5.

8:O

ptim

izat

ion

rate

sfo

rth

eC

LP

.

SB

PSO

GA

IG

AD

M(M

B)

%B

est

%A

vgE

rr%

Bes

t%

Avg

Err

%B

est

%A

vgE

rr%

Bes

t

500

80.7

8%80

.78%

0%80

.78%

80.7

8%0%

80.7

8%80

.78%

0%64

.02%

600

80.7

8%80

.78%

0%80

.78%

80.5

8%0%

80.7

8%80

.78%

0%73

.64%

700

80.7

8%80

.78%

0%80

.78%

80.7

8%0%

80.7

8%80

.24%

0%73

.64%

800

80.7

8%80

.78%

0%80

.78%

80.7

8%0%

80.7

8%80

.78%

0%74

.16%

900

80.7

8%80

.78%

0%80

.78%

80.4

4%0%

80.7

8%80

.74%

0%74

.16%

1,00

080

.78%

80.7

8%0%

80.7

8%80

.55%

0%80

.78%

80.6

7%0%

74.1

6%1,

100

80.7

8%80

.64%

0%80

.78%

80.4

7%0%

80.7

8%64

.63%

20%

76.1

1%1,

200

82.0

1%82

.01%

0%81

.37%

80.9

0%0%

82.0

1%81

.03%

0%76

.11%

1,30

082

.01%

81.0

3%0%

82.0

1%81

.12%

0%82

.01%

81.0

3%0%

76.1

1%1,

400

82.0

1%81

.76%

0%82

.01%

81.0

3%0%

82.0

1%81

.27%

0%76

.11%

1,50

082

.01%

81.7

6%0%

82.0

1%81

.27%

0%82

.01%

81.0

3%0%

76.1

1%1,

600

82.0

1%81

.52%

0%82

.01%

81.0

3%0%

80.7

8%80

.78%

0%76

.11%

1,70

082

.01%

81.5

2%0%

82.0

1%81

.35%

0%82

.01%

48.7

1%40

%76

.11%

1,80

082

.01%

81.7

6%0%

82.0

1%81

.76%

0%80

.78%

48.4

7%40

%76

.11%

1,90

082

.25%

82.2

5%0%

82.2

5%81

.92%

0%80

.69%

16.1

4%80

%79

.26%

2,00

085

.94%

84.5

0%0%

85.9

4%82

.84%

0%85

.94%

83.8

6%0%

79.2

6%

Avg

81.7

3%81

.46%

0%81

.69%

81.1

0%0%

81.4

8%71

.93%

11.2

5%75

.07%

75

5.4.4 Performance Scalability Study

Experiments were extended (i.e., scaled up) to further analyze the effectiveness of the

BPSO algorithm against the GAI, GA and DM algorithms. The cost models was the same

as the one used in the previous experiments. In the scalability study, the fact table size has

been increased 30 million to 150 million tuples (in 30 millions tuple increments, yielding a

total of 5 different cases) and for each increment of the fact table size |F|, the storage size

S was systematically increased from 500 MB to 2000 MB (in 500 MB increments, yielding

a total of 4 different cases). Again, the average cost of solutions over 5 independent runs

was reported for the stochastic algorithms BPSO, GAI and GA. Therefore, a total of 100

runs (20×5) was under consideration for the algorithms BPSO, GAI and GA, and 20 runs

(20×1) was for the DM. Other parameters were kept the same (see Section 4.3 for details).

The querying performance of scalability experiments for the three problem sets CSP, CMP

and CLP are presented in Table 5.9, Table 5.11 and Table 5.13 (see Section 4.3 for table

details). The additional column |F| in the tables represents the fact table size in millions.

5.4.4.1 Scalability results for the smaller size problem set CSP.

The performance of scalability experiments for the smaller size problem set (CSP) presented

in Table 5.9 indicates that the BPSO algorithm was again superior to the competitor

algorithms GAI, GA and DM. The BPSO algorithm has generated the best solutions in 96

of 100 cases (96% - the overall best and the best average of five independent runs, except

the case when the storage size was equal to 1500 MB and the fact table size, |F|, was equal

to 60 M or 150 M) while the GAI approach has generated the best solutions in 55 out of 100

cases (55%). The GA approach has only generated the best solutions in 67 out of 100 cases

(67%). The DM approach has generated the best solutions in 7 out of 20 cases only (i.e., 35

out of 100, or 35%). In terms of the average number of evaluations performed, the BPSO

needed about 6 times (5.64 exactly) less evaluations than the GAI and about 7 times (6.95

exactly) less evaluations than the GA, yet achieving better solution quality. In terms of the

infeasible solutions generated, the BPSO and GAI have always generated feasible solutions

(i.e., 0% infeasible or 100% feasible) while the GA has generated infeasible solutions in 22

runs out of 100 cases (i.e., 22% infeasible or 68% feasible). In terms of the computation

time, the BPSO was about 10 times (10.25 exactly) and about 12 times (12.22 exactly)

faster than the GAI and GA respectively. The DM approach was about 10 times faster

(9.98 exactly) than the BPSO, but it only generated 30% of the best solutions. There was

a statistically significant difference between the results of the BPSO, GAI and GA in 5

out of 20 cases. Table 5.10 shows the optimization rates of the algorithms BPSO, GAI, GA

76

and DM for the smaller problem set CSP. The best performance rates for the BPSO were

slightly higher than those for GAI. The average performance rates for BPSO were also

higher than those for the GAI, GA and DM. In summary, as also indicated by the last row

of Table 5.9, the BPSO algorithm has shown considerably better performance than the

GAI,GA and DM methods in almost all aspects.

77

Tabl

e5.

9:Pe

rfor

man

cere

sult

sfo

rth

esm

alle

stsi

zepr

oble

mse

tC

SP

insc

alab

ility

.

S|F

|B

PSO

GA

IG

AD

MSt

at.

sign

.(M

B)

(M)

Bes

tA

vgE

rrT

ime

Eva

lsB

est

Avg

Err

Tim

eE

vals

Bes

tA

vgE

rrT

ime

Eva

lsB

est

Tim

e

500

3024

,487

,250

.724

,487

,250

.70%

354

7624

,487

,250

.724

,507

,767

.90%

5,76

866

124

,487

,250

.724

,487

,250

.70%

6,53

875

324

,538

,543

.758

⊙60

80,6

02,5

13.0

80,6

02,5

13.0

0%62

412

780

,602

,513

.092

,404

,997

.00%

7,05

676

980

,602

,513

.015

9,96

9,39

4.0

60%

10,0

101,

013

80,6

02,5

13.0

55⊙

9014

3,97

1,54

3.4

143,

971,

543.

40%

750

137

143,

971,

543.

418

5,83

8,65

7.1

0%5,

880

573

143,

971,

543.

418

4,28

5,16

5.7

20%

4,42

445

714

3,97

1,54

3.4

57⊙

120

226,

926,

383.

722

6,92

6,38

3.7

0%36

069

226,

926,

383.

728

0,53

4,35

5.1

0%2,

842

283

226,

926,

383.

722

6,92

6,38

3.7

0%2,

408

263

360,

946,

312.

358

⊙15

028

3,65

5,55

4.5

283,

655,

554.

50%

288

5428

3,65

5,55

4.5

362,

790,

543.

70%

4,49

447

728

3,65

5,55

4.5

283,

655,

554.

50%

2,03

020

245

1,17

9,76

4.2

54⊙

1000

3023

,438

,905

.123

,438

,905

.10%

528

109

23,4

38,9

05.1

23,8

58,2

43.4

0%5,

278

605

23,4

38,9

05.1

23,6

48,5

74.2

0%6,

006

843

24,1

05,4

92.0

54⊙

6048

,968

,732

.548

,968

,732

.50%

612

121

48,9

68,7

32.5

55,1

92,9

51.1

0%3,

332

340

48,9

68,7

32.5

48,9

89,2

44.8

0%4,

620

534

49,0

71,2

93.9

56⊙

9073

,603

,565

.773

,603

,565

.70%

354

6473

,603

,565

.794

,321

,748

.60%

5,29

253

673

,603

,565

.773

,603

,565

.70%

5,86

663

673

,603

,565

.752

⊙12

016

1,19

8,24

0.6

161,

198,

240.

60%

624

112

161,

198,

240.

617

3,50

3,06

3.3

0%9,

548

955

425,

751,

612.

042

5,75

1,61

2.0

100%

14,0

001,

400

161,

198,

240.

652

⊕15

023

4,05

2,27

3.8

234,

052,

273.

80%

486

8723

4,05

2,27

3.8

236,

410,

610.

50%

6,10

458

023

4,05

2,27

3.8

472,

559,

299.

680

%11

,214

1,12

123

9,94

8,11

5.6

54⊕

1500

3022

,290

,604

.222

,290

,604

.20%

708

142

22,9

57,1

91.1

23,3

42,5

62.3

0%3,

080

365

106,

444,

128.

010

6,44

4,12

8.0

100%

14,0

001,

400

24,1

05,4

92.0

54⊕

6048

,202

,475

.448

,662

,229

.70%

1,42

227

748

,968

,732

.548

,989

,244

.80%

6,69

272

248

,968

,732

.548

,968

,732

.50%

3,12

236

848

,202

,475

.453

⊙90

73,4

49,7

36.8

73,4

49,7

36.8

0%43

887

73,4

49,7

36.8

73,4

80,5

02.6

0%4,

718

518

73,4

49,7

36.8

73,4

80,5

02.6

0%6,

230

741

73,6

03,5

65.7

53⊙

120

98,1

36,3

15.9

98,1

36,3

15.9

0%57

610

598

,136

,315

.913

5,11

2,12

2.6

0%7,

224

741

98,1

36,3

15.9

98,1

36,3

15.9

0%6,

594

749

98,1

36,3

15.9

57⊕

150

170,

349,

574.

017

6,21

9,90

9.1

0%66

612

117

0,34

9,57

4.0

190,

114,

021.

70%

6,49

664

517

0,34

9,57

4.0

459,

818,

759.

680

%11

,914

1,19

917

0,34

9,57

4.0

53⊕

2000

3015

,243

,142

.015

,243

,142

.00%

408

8415

,243

,142

.017

,652

,374

.00%

5,93

670

415

,243

,142

.015

,243

,142

.00%

4,22

852

717

,294

,140

.555

⊙60

46,8

69,3

80.0

46,8

69,3

80.0

0%52

210

246

,869

,380

.047

,709

,121

.00%

4,80

253

146

,869

,380

.046

,869

,380

.00%

4,66

254

648

,202

,475

.453

⊙90

73,4

49,7

36.8

73,4

49,7

36.8

0%28

253

73,4

49,7

36.8

73,4

49,7

36.8

0%6,

426

691

73,4

49,7

36.8

73,4

49,7

36.8

0%4,

466

506

73,6

03,5

65.7

54⊙

120

97,9

31,2

18.7

97,9

31,2

18.7

0%33

663

97,9

31,2

18.7

97,9

31,2

18.7

0%5,

810

593

97,9

31,2

18.7

97,9

72,2

38.1

0%4,

914

548

98,1

36,3

15.9

55⊙

150

122,

412,

223.

012

2,41

2,22

3.0

0%64

211

812

2,41

2,22

3.0

147,

098,

606.

20%

5,81

058

212

2,41

2,22

3.0

122,

463,

495.

90%

6,97

278

712

2,66

8,58

7.7

63⊙

Avg

103,

461,

968.

510

3,77

8,47

3.0

0%54

910

510

3,53

3,61

0.7

119,

212,

122.

40%

5,62

959

312

0,93

5,62

6.1

153,

336,

123.

822

%6,

711

730

119,

173,

394.

655

78

Table 5.10: Optimization rates for the CSP in scalability.

S |F| BPSO GAI GA DM(MB) (M) %Best %Avg Err %Best %Avg Err %Best %Avg Err %Best

500 30 77.00% 77.00% 0% 77.00% 76.98% 0% 77.00% 77.00% 0% 76.95%60 62.14% 62.14% 0% 62.14% 56.59% 0% 62.14% 24.85% 60% 62.14%90 54.91% 54.91% 0% 54.91% 41.80% 0% 54.91% 42.29% 20% 54.91%120 46.70% 46.70% 0% 46.70% 34.11% 0% 46.70% 46.70% 0% 15.22%150 46.70% 46.70% 0% 46.70% 31.83% 0% 46.70% 46.70% 0% 15.22%

1000 30 77.98% 77.98% 0% 77.98% 77.59% 0% 77.98% 77.78% 0% 77.35%60 77.00% 77.00% 0% 77.00% 74.07% 0% 77.00% 76.99% 0% 76.95%90 76.95% 76.95% 0% 76.95% 70.46% 0% 76.95% 76.95% 0% 76.95%120 62.14% 62.14% 0% 62.14% 59.25% 0% 0.00% 0.00% 100% 62.14%150 56.02% 56.02% 0% 56.02% 55.58% 0% 56.02% 11.20% 80% 54.91%

1500 30 79.06% 79.06% 0% 78.43% 78.07% 0% 0.00% 0.00% 100% 77.35%60 77.36% 77.14% 0% 77.00% 76.99% 0% 77.00% 77.00% 0% 77.36%90 77.00% 77.00% 0% 77.00% 76.99% 0% 77.00% 76.99% 0% 76.95%120 76.95% 76.95% 0% 76.95% 68.27% 0% 76.95% 76.95% 0% 76.95%150 67.99% 66.89% 0% 67.99% 64.28% 0% 67.99% 13.60% 80% 67.99%

2000 30 85.68% 85.68% 0% 85.68% 83.42% 0% 85.68% 85.68% 0% 83.75%60 77.98% 77.98% 0% 77.98% 77.59% 0% 77.98% 77.98% 0% 77.36%90 77.00% 77.00% 0% 77.00% 77.00% 0% 77.00% 77.00% 0% 76.95%120 77.00% 77.00% 0% 77.00% 77.00% 0% 77.00% 76.99% 0% 76.95%150 77.00% 77.00% 0% 77.00% 72.36% 0% 77.00% 76.99% 0% 76.95%

Avg 70.53% 70.46% 0% 70.48% 66.51% 0% 63.45% 55.98% 22% 67.07%

5.4.4.2 Scalability results for the moderate size problem set CMP.

The performance of scalability experiments for the moderate size problem set presented in

Table 5.11 indicates that the BPSO algorithm has again generated better results than the

algorithms GAI, GA and DM. The BPSO has outperformed the algorithms GAI, GA and

DM for both the best solutions found and the average solution quality obtained. The BPSO

has always generated the best solutions in all runs except in the cases when the storage

size, S, was equal to 2000 and the fact table size, |F|, was equal to 30M or 150M where

the BPSO has not generated the best solutions in 3 cases (97%). The GAI approach has

generated the best solutions in 37 out of 100 runs (37%) and the GA has generated best

solutions in 33 out of 100 runs (33%) only. In terms of the infeasible solutions generated,

the BPSO and GAI has always generated feasible solutions (i.e., 0% infeasible or 100%

feasible) while the GA has generated feasible solutions in 77 out of 100 cases - i.e., 23%

infeasible or 77% feasible (all generated solutions were infeasible in the case when S is

equal to 2000 MB and |F| is equal to 30M). In terms of the average number of evaluations

performed, the BPSO needed about 3 times (3.08 exactly) and about 6 times (5.59 exactly)

79

less evaluations than the GAI and GA respectively, yet achieving better solution quality. In

terms of the computation time, the BPSO was about 11 times (10.57 exactly) and about 13

times (12.64 exactly) faster than the GAI and GA respectively, except the DM approach

was about 13 times faster (13.06 exactly), but it has not generated any best solution in all

cases. There was a statistically significant difference between the results of the BPSO, GAI

and GA in 4 out of 20 cases. Table 5.12 shows the optimization rates of BPSO, GAI, GA and

DM for the moderate problem set CSP. The best performance rates (%Best) for the BPSO

were slightly higher than those for the GAI and GA. The average performance rates (%Avg)

for the BPSO were also higher than those for the GAI, GA and DM. In summary, as also

indicated by the last row of Table 5.11, the BPSO algorithm has again shown considerably

better performance than the GAI, GA and DM approaches in all aspects.

80

Tabl

e5.

11:P

erfo

rman

cere

sult

sfo

rth

em

oder

ate

size

prob

lem

set

CM

Pin

scal

abili

ty.

S|F

|B

PSO

GA

IG

AD

MSt

at.

sign

.(M

B)

(M)

Bes

tA

vgE

rrT

ime

Eva

lsB

est

Avg

Err

Tim

eE

vals

Bes

tA

vgE

rrT

ime

Eva

lsB

est

Tim

e

500

3069

,255

,881

.669

,255

,881

.60%

498

313

69,2

55,8

81.6

69,2

55,8

81.6

0%7,

252

2,71

969

,255

,881

.669

,255

,881

.60%

4,76

01,

913

137,

574,

898.

639

⊙60

271,

645,

392.

927

1,64

5,39

2.9

0%54

632

327

1,64

5,39

2.9

274,

126,

815.

40%

5,53

01,

871

271,

645,

392.

935

5,93

4,18

7.7

20%

6,79

01,

149

501,

491,

611.

537

⊙90

443,

986,

821.

144

3,98

6,82

1.1

0%45

646

344

3,98

6,82

1.1

516,

062,

183.

30%

6,13

21,

101

443,

986,

821.

146

2,02

8,66

0.2

0%8,

330

1,64

983

0,97

3,20

5.7

41⊙

120

642,

016,

981.

264

2,01

6,98

1.2

0%39

639

364

2,01

6,98

1.2

842,

175,

326.

70%

8,19

01,

489

642,

016,

981.

264

2,01

6,98

1.2

0%4,

452

812

1,18

0,04

4,92

5.8

40⊙

150

802,

514,

641.

680

2,51

4,64

1.6

0%25

225

880

2,51

4,64

1.6

1,07

7,47

4,95

0.1

0%3,

010

528

802,

514,

641.

680

2,51

4,64

1.6

0%4,

326

766

1,47

5,04

6,75

8.3

40⊙

1000

3069

,255

,881

.669

,255

,881

.60%

420

431

69,2

55,8

81.6

69,2

55,8

81.6

0%6,

272

3,20

969

,255

,881

.669

,255

,881

.60%

4,24

21,

899

76,2

29,4

20.9

45⊙

6013

8,49

6,19

1.0

138,

496,

191.

00%

384

356

138,

496,

191.

014

9,26

3,04

2.6

0%7,

210

2,70

113

8,49

6,19

1.0

138,

496,

191.

00%

7,61

63,

150

246,

317,

728.

444

⊙90

343,

011,

227.

534

3,01

1,22

7.5

0%1,

266

1,25

034

3,01

1,22

7.5

366,

666,

984.

30%

8,20

43,

029

343,

011,

227.

579

9,14

9,94

5.5

80%

11,7

466,

608

652,

553,

573.

143

⊕12

054

3,27

1,42

4.9

543,

271,

424.

90%

366

333

543,

271,

424.

957

7,80

6,86

9.7

0%8,

036

2,94

054

3,27

1,42

4.9

815,

474,

127.

440

%4,

921

9,11

41,

002,

958,

761.

537

⊙15

073

9,96

5,81

9.0

739,

965,

819.

00%

510

466

739,

965,

819.

076

6,49

8,12

9.9

0%9,

268

3,08

673

9,96

5,81

9.0

753,

321,

671.

30%

3,66

81,

366

1,38

4,94

0,21

2.6

41⊙

1500

3063

,625

,903

.363

,625

,903

.30%

534

664

63,6

25,9

03.3

63,6

25,9

03.3

0%6,

132

2,09

863

,625

,903

.365

,877

,894

.70%

5,51

62,

021

69,6

50,3

46.3

53⊙

6013

8,49

6,19

1.0

138,

496,

191.

00%

390

464

138,

496,

191.

013

8,49

6,19

1.0

0%4,

844

1,49

713

8,49

6,19

1.0

138,

496,

191.

00%

5,12

41,

613

152,

442,

561.

145

⊙90

207,

735,

149.

920

7,73

5,14

9.9

0%34

238

320

7,73

5,14

9.9

207,

735,

149.

90%

5,09

61,

462

207,

735,

149.

920

7,73

5,14

9.9

0%6,

202

1,78

236

9,46

6,04

5.4

44⊙

120

384,

642,

679.

438

4,64

2,67

9.4

0%87

694

845

7,34

3,47

8.6

523,

215,

167.

60%

4,63

41,

046

384,

642,

679.

41,

050,

988,

227.

980

%11

,746

7,37

286

8,83

6,15

2.3

44⊕

150

644,

733,

058.

464

4,73

3,05

8.4

0%53

456

364

4,73

3,05

8.4

682,

749,

224.

80%

6,77

61,

435

735,

095,

994.

51,

364,

586,

133.

380

%11

,942

8,19

21,

087,

575,

464.

143

⊕20

0030

62,4

84,6

06.9

62,9

41,1

25.5

0%1,

440

1,91

563

,625

,903

.364

,961

,554

.50%

4,07

41,

376

304,

410,

582.

030

4,41

0,58

2.0

100%

14,0

005,

600

69,6

50,3

46.3

53⊕

6013

8,49

6,19

1.0

138,

496,

191.

00%

318

390

138,

496,

191.

013

8,49

6,19

1.0

0%5,

334

1,63

713

8,49

6,19

1.0

138,

814,

738.

40%

3,80

81,

288

151,

368,

563.

053

⊙90

207,

735,

149.

920

7,73

5,14

9.9

0%30

035

420

7,73

5,14

9.9

207,

735,

149.

90%

4,34

01,

188

207,

735,

149.

920

7,73

5,14

9.9

0%5,

628

1,67

122

8,65

4,21

4.9

45⊙

120

276,

975,

459.

227

6,97

5,45

9.2

0%49

256

427

6,97

5,45

9.2

276,

975,

459.

20%

5,20

81,

447

276,

975,

459.

227

6,97

5,45

9.2

0%7,

504

2,15

449

2,61

6,76

4.5

44⊙

150

480,

797,

856.

849

8,97

2,98

5.4

0%91

21,

041

480,

797,

856.

856

5,41

8,60

0.7

0%3,

318

733

480,

797,

856.

81,

105,

494,

343.

560

%9,

800

6,31

969

2,66

9,88

2.7

39⊙

Avg

333,

457,

125.

433

4,38

8,70

7.8

0%56

259

433

7,14

9,23

0.2

378,

899,

732.

90%

5,94

31,

830

350,

071,

571.

048

8,42

8,10

1.9

23%

7,10

63,

322

583,

553,

071.

843

81

Table 5.12: Optimization rates for the CMP in scalability.


500 30 77.25% 77.25% 0% 77.25% 77.25% 0% 77.25% 77.25% 0% 59.54%60 55.38% 55.38% 0% 55.38% 54.97% 0% 55.38% 41.54% 20% 17.63%90 51.38% 51.38% 0% 51.38% 43.49% 0% 51.38% 49.40% 0% 9.00%120 47.27% 47.27% 0% 47.27% 30.83% 0% 47.27% 47.27% 0% 3.08%150 47.27% 47.27% 0% 47.27% 29.20% 0% 47.27% 47.27% 0% 3.08%

1,000 30 77.25% 77.25% 0% 77.25% 77.25% 0% 77.25% 77.25% 0% 74.96%60 77.25% 77.25% 0% 77.25% 75.48% 0% 77.25% 77.25% 0% 59.54%90 62.44% 62.44% 0% 62.44% 59.85% 0% 62.44% 12.49% 80% 28.54%120 55.38% 55.38% 0% 55.38% 52.54% 0% 55.38% 33.02% 40% 17.63%150 51.38% 51.38% 0% 51.38% 49.64% 0% 51.38% 50.50% 0% 9.00%

1,500 30 79.10% 79.10% 0% 79.10% 79.10% 0% 79.10% 78.36% 0% 77.12%60 77.25% 77.25% 0% 77.25% 77.25% 0% 77.25% 77.25% 0% 74.96%90 77.25% 77.25% 0% 77.25% 77.25% 0% 77.25% 77.25% 0% 59.54%120 68.41% 68.41% 0% 62.44% 57.03% 0% 68.41% 13.68% 80% 28.64%150 57.64% 57.64% 0% 57.64% 55.14% 0% 51.70% 10.34% 80% 28.54%

2,000 30 79.47% 79.32% 0% 79.10% 78.66% 0% 0.00% 0.00% 100% 77.12%60 77.25% 77.25% 0% 77.25% 77.25% 0% 77.25% 77.20% 0% 75.14%90 77.25% 77.25% 0% 77.25% 77.25% 0% 77.25% 77.25% 0% 74.96%120 77.25% 77.25% 0% 77.25% 77.25% 0% 77.25% 77.25% 0% 59.54%150 68.41% 67.22% 0% 68.41% 62.85% 0% 68.41% 27.36% 60% 54.49%

Avg 67.08% 67.01% 0% 66.76% 63.48% 0% 62.81% 51.46% 23% 44.60%

5.4.4.3 Scalability results for the larger size problem set CLP.

Table 5.13 shows the performance for the BPSO,GAI, GA and DM approaches for the larger

size problem set CLP in scalability. Note that this class is the hardest one (it gets even

harder when a parameter size is increased). The BPSO algorithm has again generated

better results than the algorithms GAI, GA and DM. The BPSO has outperformed the

algorithms GAI, GA and DM for both the best solutions found and the average solution

quality obtained. The BPSO has generated the best solutions in 92 out of 100 runs (92%),

both the GAI and GA have reached the best solutions in 60 out of 100 runs (60%). The DM

approach has not generated any best solution in all runs (0%). In terms of the infeasible

solutions generated, the BPSO and GAI have always generated feasible solutions (i.e., 0%

infeasible or 100% feasible) while the GA has generated feasible solutions in 65 runs out of

100 cases -i.e., 35% infeasible or 65% feasible (all generated solutions were infeasible in

the case when the storage size was equal to 1500 MB and the fact table size, |F|, was equal

to 20M or 150M). In terms of the average number of evaluations performed, the BPSO

needed about 7 times (7.44 exactly) and about 23 times (22.75 exactly) less evaluations

82

than the GAI and GA respectively, yet achieving better solution quality. In terms of the

computation time, the BPSO was about 15 times (15.20 exactly) and about 20 times (20.06

exactly) faster than the GAI and GA respectively, except the DM approach was about 13

times faster (13.06 exactly) but it did not generate any best solution in all cases. There

was a statistically significant difference between the results of the BPSO, GAI and GA in 5

out of 20 cases. Table 5.14 shows the optimization rates of the BPSO, GAI, GA and DM

for the largest problem set CLP. The best performance rates (%Best) for the BPSO were

slightly higher than those for the GAI and GA. The average performance rates (%Avg)

for the BPSO were also higher than those for the GAI, GA and DM. In summary, as also

indicated by the last row of Table 5.13, the BPSO algorithm has again shown considerably

better performance than the GAI, GA and DM methods in all aspects.

83

Tabl

e5.

13:P

erfo

rman

cere

sult

sfo

rth

ela

rges

tsi

zepr

oble

mse

tC

LP

insc

alab

ility

.

S|F

|B

PSO

GA

IG

AD

MSt

at.

sign

.(M

B)

(M)

Bes

tA

vgE

rrT

ime

Eva

lsB

est

Avg

Err

Tim

eE

vals

Bes

tA

vgE

rrT

ime

Eva

lsB

est

Tim

e

500

3014

5,43

3,66

9.2

145,

433,

669.

20%

420

454

145,

433,

669.

214

5,43

3,66

9.2

0%5,

978

3,16

814

5,43

3,66

9.2

267,

705,

092.

420

%8,

932

5,67

627

4,73

9,32

9.3

156

⊙60

615,

890,

014.

061

5,89

0,01

4.0

0%51

662

861

5,89

0,01

4.0

875,

514,

914.

10%

5,67

03,

374

618,

607,

278.

81,

155,

564,

502.

560

%10

,486

7,59

61,

186,

121,

710.

915

6⊕

9098

9,25

7,48

7.4

989,

257,

487.

40%

516

674

989,

257,

487.

41,

355,

480,

263.

60%

7,79

85,

244

989,

257,

487.

41,

245,

459,

274.

920

%6,

048

3,79

71,

914,

887,

550.

614

5⊙

120

2,19

7,56

5,20

2.2

2,19

7,56

5,20

2.2

0%49

271

92,

197,

565,

202.

22,

310,

318,

283.

60%

7,81

23,

343

2,19

7,56

5,20

2.2

2,86

1,12

2,34

0.4

80%

11,9

1410

,707

2,80

9,11

4,17

6.9

110

⊕15

02,

746,

938,

290.

42,

746,

938,

290.

40%

348

342

2,74

6,93

8,29

0.4

3,01

1,87

4,62

7.7

0%7,

924

4,11

62,

746,

938,

290.

43,

161,

659,

800.

240

%7,

224

6,20

53,

511,

371,

353.

510

9⊙

1000

3014

5,43

3,66

9.2

145,

433,

669.

20%

636

679

145,

433,

669.

214

5,43

3,66

9.2

0%7,

700

4,82

314

5,43

3,66

9.2

145,

433,

669.

20%

6,39

84,

106

195,

559,

057.

615

7⊙

6029

0,83

5,76

3.7

290,

835,

763.

70%

480

581

290,

835,

763.

729

0,83

5,76

3.7

0%6,

510

2,53

429

0,83

5,76

3.7

290,

835,

763.

70%

4,06

01,

747

549,

444,

314.

415

8⊙

9079

9,45

7,92

2.1

799,

457,

922.

10%

402

545

799,

457,

922.

180

9,30

1,45

3.5

0%8,

834

4,10

779

9,45

7,92

2.1

1,68

1,94

3,02

3.8

60%

11,1

448,

671

1,50

3,29

1,89

9.9

151

⊙12

01,

231,

739,

658.

11,

231,

739,

658.

10%

276

271

1,23

1,73

9,65

8.1

1,43

8,35

8,30

1.5

0%9,

324

5,18

41,

231,

739,

658.

12,

308,

902,

838.

260

%12

,404

9,79

82,

372,

190,

492.

714

9⊙

150

1,64

8,73

7,35

8.9

1,64

8,73

7,35

8.9

0%54

054

41,

648,

737,

358.

92,

156,

181,

169.

10%

4,46

61,

574

1,64

8,73

7,35

8.9

2,50

2,73

9,24

1.3

20%

9,50

67,

489

3,19

1,44

6,90

1.7

146

⊙15

0030

136,

172,

020.

113

9,87

6,67

9.7

0%52

256

413

6,17

2,02

0.1

139,

876,

679.

70%

6,42

62,

844

136,

172,

020.

114

1,72

9,00

9.6

0%10

,010

4,99

518

0,80

4,25

2.3

156

⊙60

290,

835,

763.

729

0,83

5,76

3.7

0%40

249

929

0,83

5,76

3.7

293,

061,

055.

80%

6,39

83,

156

290,

835,

763.

729

0,83

5,76

3.7

0%2,

707

1,00

939

8,91

6,44

7.2

161

⊙90

436,

235,

022.

343

6,23

5,02

2.3

0%19

820

243

6,23

5,02

2.3

512,

509,

261.

20%

532

167

436,

235,

022.

343

6,23

5,02

2.3

0%6,

510

3,14

882

4,14

3,94

1.6

155

⊙12

099

2,30

8,83

0.6

1,03

6,48

3,96

9.0

0%45

043

799

2,30

8,83

0.6

1,24

0,72

2,84

4.3

0%5,

250

2,10

33,

027,

011,

625.

03,

027,

011,

625.

010

0%14

,000

13,0

001,

997,

202,

397.

515

0⊕

150

1,47

9,67

5,77

8.4

1,47

9,67

5,77

8.4

0%52

253

71,

553,

247,

802.

41,

891,

091,

760.

30%

5,93

62,

736

3,78

3,74

2,06

5.0

3,78

3,74

2,06

5.0

100%

14,0

0013

,000

2,50

5,45

7,10

2.4

149

⊕20

0030

136,

172,

020.

114

1,72

9,00

9.6

0%50

359

113

6,17

2,02

0.1

142,

552,

572.

60%

10,1

928,

405

145,

433,

669.

251

2,24

7,93

8.7

60%

12,9

789,

087

180,

804,

252.

316

2⊕

6029

0,83

5,76

3.7

290,

835,

763.

70%

228

229

290,

835,

763.

729

4,30

4,78

3.2

0%7,

350

5,48

229

0,83

5,76

3.7

290,

835,

763.

70%

6,04

84,

258

391,

077,

873.

815

7⊙

9043

6,23

5,02

2.3

436,

235,

022.

30%

372

359

436,

235,

022.

343

8,65

0,15

0.5

0%7,

126

4,50

843

6,23

5,02

2.3

438,

342,

296.

60%

6,60

84,

484

916,

741,

135.

512

0⊙

120

581,

637,

116.

858

1,63

7,11

6.8

0%46

245

658

1,63

7,11

6.8

581,

637,

116.

80%

6,76

23,

701

581,

637,

116.

858

1,63

7,11

6.8

0%6,

356

4,12

41,

098,

848,

926.

715

3⊙

150

1,20

0,89

7,79

3.9

1,20

0,89

7,79

3.9

0%64

573

61,

208,

331,

731.

11,

230,

428,

362.

00%

7,67

24,

165

1,20

0,89

7,79

3.9

3,26

7,17

3,21

0.8

80%

11,6

3411

,421

2,22

6,42

3,40

4.8

117

⊙A

vg83

9,61

4,70

8.4

842,

286,

547.

70%

446

502

843,

665,

006.

496

5,17

8,33

5.1

0%6,

783

3,73

71,

057,

152,

108.

11,

419,

557,

768.

035

%8,

948

6,71

61,

411,

429,

326.

114

6

84

Table 5.14: Optimization rates for the CLP in scalability.


500 30 80.78% 80.78% 0% 80.78% 80.78% 0% 80.78% 64.63% 20% 63.70%60 59.31% 59.31% 0% 59.31% 42.15% 0% 59.13% 23.65% 60% 21.63%90 56.43% 56.43% 0% 56.43% 40.29% 0% 56.43% 45.14% 20% 15.65%120 27.40% 27.40% 0% 27.40% 23.68% 0% 27.40% 5.48% 80% 7.20%150 27.40% 27.40% 0% 27.40% 20.40% 0% 27.40% 16.44% 40% 7.20%

1,000 30 80.78% 80.78% 0% 80.78% 80.78% 0% 80.78% 80.78% 0% 74.16%60 80.78% 80.78% 0% 80.78% 80.78% 0% 80.78% 80.78% 0% 63.70%90 64.79% 64.79% 0% 64.79% 64.35% 0% 64.79% 25.91% 60% 33.78%120 59.31% 59.31% 0% 59.31% 52.48% 0% 59.31% 23.72% 60% 21.63%150 56.43% 56.43% 0% 56.43% 43.01% 0% 56.43% 33.86% 20% 15.65%

1,500 30 82.01% 81.52% 0% 82.01% 81.52% 0% 82.01% 81.27% 0% 76.11%60 80.78% 80.78% 0% 80.78% 80.64% 0% 80.78% 80.78% 0% 73.64%90 80.78% 80.78% 0% 80.78% 77.43% 0% 80.78% 80.78% 0% 63.70%120 67.22% 65.76% 0% 67.22% 59.01% 0% 0.00% 0.00% 100%34.02%150 60.89% 60.89% 0% 58.95% 50.02% 0% 0.00% 0.00% 100%33.78%

2,000 30 82.01% 81.27% 0% 82.01% 81.16% 0% 80.78% 32.31% 60% 76.11%60 80.78% 80.78% 0% 80.78% 80.56% 0% 80.78% 80.78% 0% 74.16%90 80.78% 80.78% 0% 80.78% 80.68% 0% 80.78% 80.69% 0% 59.62%120 80.79% 80.79% 0% 80.79% 80.79% 0% 80.79% 80.79% 0% 63.70%150 68.26% 68.26% 0% 68.07% 67.48% 0% 68.26% 13.65% 80% 41.16%

Avg 67.89% 67.75% 0% 67.78% 63.40% 0% 61.41% 46.65% 35% 46.02%

5.4.5 Performance results for the classes CSP, CMP and CLP usingOracle DBMS cost models

A set of experiments using the Oracle DBMS Optimizer (Cost Based Optimizer, CBO) was

performed to analyze the efficiency of the BPSO approach against the three competitor

approaches: the GA, its improved version, GAI, and the DM approaches using the three

problem sets CSP, CMP and CLP. This was for validating the results obtained previously

(recall that these results were generated using a mathematical cost models - see Section

2.2 for details). First, all the final SBJI indexes generated in the experiments described

in Sections 5.4.3.1, 5.4.3.2 and 5.4.3.3 (i.e., for the problem sets CSP, CMP and CLP

respectively) were implemented in Oracle DBMS for the algorithms BPSO, GAI, GA and

DM. This was done by inserting an appropriate bitmap join index hint statement followed

by an insertion of the ’Explain Plan’ statement into every query (e.g. Fig. 5.5) in the

workload (see (Figs. 5.4.5 and 5.4.5 for details). Then, the I/O costs were calculated by the

Oracle DBMS Optimizer CBO by means of the ’Explain Plan’ utility. The storage size S

is systematically increased from 1200 MB to 2000 MB (in 200 MB increments, yielding a

85

SELECT SUM(Actvars.DOLLARSALES)FROM Actvars, CustomersWHERE Actvars.custId = Customers.custIdAND Customers.state = ’Michigan’;

Figure 5.5: OLAP query (Query 1).

SELECT /*+ INDEX(Actvars Cust_Actvars_bji) */SUM(Actvars.DOLLARSALES)FROM Actvars, CustomersWHERE Actvars.custId = Customers.custIdAND Customers.state = ’Michigan’;

Figure 5.6: OLAP query with a SBJI hint statement (Query 2).

EXPLAIN PLAN SET STATEMENT_ID =’Q4’ INTO plan_tableFOR SELECT /*+ INDEX(Actvars Cust_Actvars_bji) */SUM(Actvars.DOLLARSALES)FROM Actvars, CustomersWHERE Actvars.custId = Customers.custIdAND Customers.state = ’Michigan’;

Figure 5.7: OLAP query with ’Explain Plan’ statement (Query 3).

total of 5 different cases. Therefore, a total of 25 (5×5) runs was under consideration for

the algorithms BPSO, GA and GAI, and 5 (5×1) runs was for the DM. Table 5.15 shows the

number of disk page accesses (I/O costs) using the ’Explain Plan’ utility of Oracle DBMS

when executing the corresponding query workload in the problem sets CSP, CMP and CLP

respectively for the five different storage sizes. The DBMS based cost results presented

in Table 5.15 has confirmed the results obtained in the previous experiments (shown in

in Table 5.3, Table 5.5 and Table 5.7 about the effectiveness of the BPSO algorithm. We

can note that the mathematical cost models made a good estimation of cost compare to the

estimation made by Oracle DBMS (near values were obtained between the mathematical

cost models and the Oracle DBMS cost models). In summary, as also indicated by Table

5.15, the BPSO algorithm has again shown considerably better performance than the GAI,

GA and DM methods in all aspects.

86

Tabl

e5.

15:Q

uery

ing

perf

orm

ance

resu

lts

for

the

CS

P,C

MP

and

CL

Pus

ing

Ora

cle

DB

MS

’Exp

lain

Pla

n’.

Cla

sses

SB

PSO

GA

IG

AD

M(M

B)

Bes

tA

vgE

rrB

est

Avg

Err

Bes

tA

vgE

rr

CSP

1,20

020

,602

,490

.023

,227

,456

.00%

20,6

02,4

90.0

23,2

27,4

56.0

0%32

,000

,771

.032

,000

,771

.010

0%27

,682

,977

.01,

400

15,3

96,6

96.0

18,5

20,1

72.4

0%20

,602

,490

.021

,914

,973

.00%

32,0

00,7

71.0

32,0

00,7

71.0

100%

27,6

82,9

77.0

1,60

013

,940

,766

.013

,940

,766

.00%

13,9

40,7

66.0

14,0

41,3

30.2

0%13

,940

,766

.013

,940

,766

.00%

14,4

43,5

87.0

1,80

013

,940

,766

.013

,940

,766

.00%

13,9

40,7

66.0

13,9

40,7

66.0

0%13

,940

,766

.013

,940

,766

.00%

14,4

43,5

87.0

2,00

013

,940

,766

.013

,940

,766

.00%

13,9

40,7

66.0

13,9

40,7

66.0

0%13

,940

,766

.013

,940

,766

.00%

14,4

43,5

87.0

Avg

15,5

64,2

96.8

16,7

13,9

85.3

0%16

,605

,455

.617

,413

,058

.20%

21,1

64,7

68.0

21,1

64,7

68.0

40%

19,7

39,3

43.0

CM

P1,

200

48,1

84,2

85.0

48,1

84,2

85.0

0%48

,184

,285

.060

,774

,551

.80%

48,1

84,2

85.0

60,7

74,5

51.8

0%56

,399

,910

.0

1,40

048

,184

,285

.048

,184

,285

.00%

48,1

84,2

85.0

48,7

59,8

41.4

0%48

,184

,285

.074

,050

,309

.880

%56

,399

,910

.01,

600

48,1

84,2

85.0

48,1

84,2

85.0

0%48

,184

,285

.046

,989

,529

.60%

80,5

16,8

16.0

80,5

16,8

16.0

100%

56,3

99,9

10.0

1,80

029

,560

,250

.029

,560

,250

.00%

29,5

60,2

50.0

40,9

92,4

59.4

0%29

,560

,250

.029

,560

,250

.00%

56,3

99,9

10.0

2,00

029

,560

,250

.029

,560

,250

.00%

29,5

60,2

50.0

29,5

60,2

50.0

0%29

,560

,250

.029

,560

,250

.00%

38,7

14,0

62.0

Avg

40,7

34,6

71.0

40,7

34,6

71.0

0%40

,734

,671

.045

,415

,326

.40%

47,2

01,1

77.2

54,8

92,4

35.5

36%

52,8

62,7

40.4

CL

P1,

200

86,3

12,2

43.0

86,3

12,2

43.0

0%88

,513

,446

.014

8,47

8,04

2.0

0%86

,312

,243

.014

8,03

7,80

1.4

0%96

,386

,617

.0

1,40

086

,312

,243

.010

1,74

3,63

2.6

0%86

,312

,243

.014

8,03

7,80

1.4

0%86

,312

,243

.013

2,60

6,41

1.8

0%96

,386

,617

.01,

600

86,3

12,2

43.0

117,

175,

022.

20%

86,3

12,2

43.0

148,

037,

801.

40%

163,

469,

191.

016

3,46

9,19

1.0

0%96

,386

,617

.01,

800

86,3

12,2

43.0

101,

743,

632.

60%

86,3

12,2

43.0

101,

743,

632.

60%

163,

469,

191.

016

3,46

9,19

1.0

40%

96,3

86,6

17.0

2,00

054

,750

,718

.056

,639

,656

.20%

54,7

50,7

18.0

119,

981,

801.

80%

54,7

50,7

18.0

77,7

34,3

01.6

0%62

,394

,839

.0

Avg

79,9

99,9

38.0

92,7

22,8

37.3

0%80

,440

,178

.613

3,25

5,81

5.8

0%11

0,86

2,71

7.2

137,

063,

379.

48%

89,5

88,2

61.4

87


5.5 Conclusions

We have presented a new approach based on binary particle swarm optimization BPSO to

solve the BJIO selection problem in data warehouses. The approach is stochastic in nature

and has some similarities to genetic algorithm based approach in the coding of solution

set as well as some differences, such as the penalty function. We have used three class

of problem sets, the smaller size set, the moderate size set and the larger size set to test

the effectiveness of the BPSO approach against two well-known approaches, the genetic

algorithm based approach, GA, and the data mining based approach, DM, on a fairly large

benchmark data warehouse (APB-I benchmark).

We have improved the genetic algorithm approach GA to avoid the infeasible solutions

and optimize the number of evaluations and in turn, the average execution times. This

was called the GAI approach. Then, we have also tested the effectiveness of the BPSO

approach against the GAI approach. We have also performed a scalability study to further

analyze the effectiveness of the BPSO algorithm against the GA, the GAI and DM methods

by systematically increasing the query workload size.

Both the general and scalability results have shown that the BPSO algorithm outper-

forms the two currently best known algorithms (the GA and its improved version, GAI, and

the DM methods) in many aspects, especially when the problem size grows. Furthermore,

the BPSO was able to obtain optimal/sub-optimal solutions for the problem sets CSP, CMP

and CLP using much smaller number of evaluations compare to those for the GAI and GA

- i.e., the BPSO algorithm has converged much faster. The Experiments have confirmed

that using the exponential penalty function can greatly avoid the generation of infeasible

solutions in genetic algorithms (this can be observed in the results generated by the GAI).

However, further investigation is necessary to see the effects of other types of penalty

functions. We have also performed a set of experiments using a DBMS specific cost function

(using the ’Explain Plan’ utility of Oracle DBMS) to analyze the effectiveness of the BPSO

approach against the three competitor approaches. These experiments have confirmed the

effectiveness of the our proposition for all of the three class of problems sets considered

when an actual implementation specific cost measure is used. We need to note that all

tests were performed on a query workload containing 100 to 500 queries with 12 to 20

non-key attributes. The workload size used in the experiments was considerably higher

than those for the relevant work found in the literature (varies between 5 and 70 - the

number of non-key attributes varies between 10 and 18, see Table 1). The BPSO approach

is a general heuristic and could also be used to solve the BJIMSP and other problems

in data warehouses, such as horizontal partitioning. In the future, we plan to solve the

88

5.5. CONCLUSIONS

horizontal partitioning problem using the BPSO approach.

89

CH

AP

TE

R

6MIXED-INTEGER LINEAR PROGRAMMING FOR SBJISP

“Performance is your reality. Forget everything else.”

-Harold Geneen

6.1 Introduction

We have seen in the previous chapters, that the proposed approaches are either statistic or

meta-heuristic. In the present chapter, we present a mixed-integer linear programming

(MILP) formulation to obtain an optimal solution for the SBJISP. The MILP can be

solved using a commercial MILP solver (such as CPLEX1 10.1). Several experiments were

performed to demonstrate the effectiveness and the advantages of the MILP approach and

compared to the two well-known best approaches in the literature, the improved genetic

algorithm (GAI) based approach and the binary particle swarm optimization (BPSO) based

approach [72].

6.2 Linear programming definition

Linear Programming allows to solve a linear program, by find the values for the decision

variables xi and minimizing a linear objective function. Linear programming allows the

specification of linear constraints on the variables xi over which the objective function is

1http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/

91

CHAPTER 6. MIXED-INTEGER LINEAR PROGRAMMING FOR SBJISP

valid. These constraints are linear functions of the decision variables. For example, the

following describes a simple linear program:

min cTx (6.1)

u.c. Ax≥b (6.2)

x≥ 0 (6.3)

6.2.1 Linear program solution time

Let a given matrix with m rows and n columns. The improved simplex method has O(mn)

computations per iteration in the worst case and O(m2) computations per iteration in the

best case. With this complexity linear programming seems to be slow, really the linear

program solvers can solve problems with tens of thousands of constraints and variables in

a few seconds.

For solving some problems using LP, the number of iterations required can be exponen-

tial. However, for most problems, the number of iterations is O(n). The improved simplex

method has polynomial smoothed complexity [67].

6.2.2 Mixed Integer Programming

Sometimes a problem is modelled with some additional integer variables or constraints

involving integer variables. If we force a linear program to take some integer variables,

then the resulting program is called mixed integer program. In general, the mixed integer

program is described as the following:

min cTx+dTy (6.4)

u.c. Ax+By≥b (6.5)

x≥ 0 (6.6)

y ∈Z (6.7)

6.3 Mathematical formulation of SBJI selection problem

We describe a mixed-integer linear (MILP) formulation to find an optimal single bitmap

join index (SBJI) configuration. Notations used in the model are presented in Table 6.1

followed by the constraints and objective function.

92

6.3. MATHEMATICAL FORMULATION OF SBJI SELECTION PROBLEMTa

ble

6.1:

Mod

elno

tati

ons.

Not

atio

nD

efini

tion

Sets

AC

andi

date

attr

ibut

esfo

rin

dexa

tion

DD

imen

sion

tabl

esQ

Que

ries

inth

equ

ery

wor

kloa

dW

Join

com

bina

tion

,whi

chis

equa

lto

2|D| −

2In

dice

sk

k∈{

1,2,

...,|A

|}i

i∈{1

,2,.

..,|D

|}r

r∈{

1,2,

...,|Q

|}w

w∈{

1,2,

...,|W

|}C

onst

ants

a i,k∈ {

0,1 }

Equ

als

to1

ifat

trib

ute

kis

part

ofdi

men

sion

tabl

ei

b r,k∈ {

0,1 }

Equ

als

to1

ifat

trib

ute

kit

self

isac

cess

edby

quer

yr

λw

,i∈ {

0,1 }

Equ

als

to1

ifdi

men

sion

tabl

eii

sin

the

set

w(s

eeA

lgor

ithm

inF

ig).

c r,k

The

exec

utio

nco

stof

quer

yr

inpr

esen

ceof

inde

xbu

ilton

attr

ibut

ek

cow

The

exec

utio

njo

inco

stbe

twee

ndi

men

sion

tabl

esin

wan

dth

efa

ctta

ble

Fs k

The

stor

age

cost

ofth

eS

BJI

built

onth

eno

n-ke

yat

trib

ute

x kS

Max

imum

stor

age

spac

eal

low

edto

stor

ein

dex

confi

gura

tion

Dec

isio

nVa

riab

les

x k∈ {

0,1 }

Equ

als

to1

ifS

BJI

iscr

eate

don

attr

ibut

ek.

αr,

i∈ {

0,1 }

Equ

als

to1

ifta

ble

iis

load

edto

answ

erth

equ

ery

r.β

r,w∈ {

0,1 }

Equ

als

to1

ifth

ejo

inop

erat

ions

wis

need

edto

answ

erth

equ

ery

r.γ

rD

enot

eth

eex

ecut

ion

cost

ofqu

ery

rin

abse

nce

ofus

eful

SB

JI.

93


6.3.1 SBJISP constraints

The constraints for the model is given below:

αr,i ≥ br,k(1− xk)ai,k ∀r ∈Q,∀i ∈ D,∀k ∈ A (6.8)

βr,w ≥ 1+|D|∑i=1

λw,i(αr,i −1) ∀r ∈Q,∀w ∈W (6.9)

γr ≥βr,wcow ∀r ∈Q, ∀w ∈W (6.10)|A|∑k=1

skxk ≤ S (6.11)

xk ∈ {0,1} ∀k ∈ A (6.12)

αr,i ∈ {0,1} ∀r ∈Q,∀i ∈ D (6.13)

βr,w ∈ {0,1} ∀r ∈Q,∀w ∈W (6.14)

γr ≥ 0 ∀r ∈Q (6.15)

The cost to execute the query Qr that uses attribute Ak is cr,kxk if the SBJIk exist.

Otherwise, in the absence of the SBJIk, the Eq.(7.3) is used to identify dimension table

D i containing attribute Ak to be loaded for answering Qr. In order to answer Qr, all

identified dimension tables by Eq.(7.4) are joined with the fact table using hash-join

method. The problem here is how to formulate the joining operations between identified

dimension tables and the fact table. All possible join combinations of dimension tables

need to be considered. To address this problem, a bitmap table called λ is generated using

the algorithm presented in Fig. 3. The λ bitmap table simply represents the power set

of dimension tables excluding the empty set and contains 2|D|−1 rows and |D| columns,

where each row (a bitmap) indicates dimension tables to be joined with the fact table. The

dimension tables are listed in the minimum selectivity order. For example, in Fig. 4, D2,

D1, D3 is the minimum selectivity order assumed and the row 2 indicates a join between

dimension table D1 and the fact table F, the row 3 indicates a join between dimension

tables D1, D3 and the fact table F. The cost of join(s) corresponding to a row in the bitmap

table λ, cow, is computed by Eq.(4.2). The Eq.(7.5) is used for computing the cost of Qr in

the absence of useful indexes in the index configuration to be selected. The Eq.(7.6) is the

knapsack constraint that controls the size of the index configuration.

94


Require: D, the set of dimension tables1: P ← 2|D|−12: for w ← 1 to P do3: t ← w4: i ← 15: repeat6: λw,i ← t mod 27: t ← t28: i ← i+19: until i > |D|

10: end for

Figure 6.1: Algorithm to generate thebitmap table λ.

w D2 D1 D31 0 0 12 0 1 03 0 1 14 1 0 05 1 0 16 1 1 07 1 1 1

Figure 6.2: An example of bitmap table λ.

6.3.2 SBJISP objective Function

The objective is to minimize the query workload cost, the objective function presented in

Eq.7.7 is similar to the cost model represented in Section 2.1.

min :|Q|∑r=1

(γr +|A|∑k=1

Cr,kxk) (6.16)



We use the same benchmark and instances presented in the section 2.2


A set of experiments were used to analyze the efficiency of the MILP approach against

the two well known algorithms that are best so far: the improved genetic algorithm (GAI)

and particle swarm optimization (BPSO) [72] using the three problem sets CSP, CMP and

CLP mentioned above. All tests are performed under Intel i7 (4 cores) processor with 8

GB RAM. The IBM CPLEX2 10.1 solver under Java Development Kit is used for solving

the proposed model, the MILP model of SBJISP. The CPLEX parameters remain at their

default settings. The parameters for the GAI and BPSO are presented in Toumi et. al. and

the same experimental set up was used [72].

The MILP method runs on a single problem set, therefore is not considered to be

stochastic in that sense. For stochastic algorithms BPSO and GAI, the average cost of2http://www-01.ibm.com/software/info/ilog/

95


solutions over five independent runs are reported. The storage size S was systematically

increased from 500 MB to 2000 MB in 100 MB increments, yielding a total of 16 different

cases. Therefore, a total of 80 (16×5) runs is under consideration for algorithms BPSO

and GAI, and 16 (16×1) runs is for MILP.

Tables 6.2, 6.3, and 6.4 show the number of disk page accesses needed (I/O costs) using

the cost model presented in Section 2.1 in order to execute the query workload for the

problem sets CSP, CMP and CLP respectively for sixteen different storage sizes. The

column Best represents the value of the best minimum cost solution found. The column

Avg represents the average cost of solutions found over five independent runs for the

BPSO and GAI approaches [72]. The column Evals represents the average of the number

of candidate evaluations performed for reaching the best solution for five independent

runs (i.e., a measure about how fast an algorithm finds an optimal/sub-optimal solution or

converges for BPSO and GAI). The column Time shows the average computation time in

seconds. The last row provides an overall average for the runs. In each table row, the best

(i.e., the minimum) and the average querying performance result were presented in bold

font for each of the algorithms considered. The last entry WOI represents the cost of the

query workload without using any index configuration (in this case, the hash-join method

was used - see Section 2.1)

96

6.4. EXPERIMENTAL RESULTSTa

ble

6.2:

Que

ryin

gpe

rfor

man

cere

sult

sfo

rth

esm

alle

rsi

zepr

oble

mse

tC

SP

.

SB

PSO

GA

IM

ILP

(MB

)B

est

Avg

Eva

lsT

ime

Bes

tA

vgE

vals

Tim

eB

est

Tim

e

500

20,2

32,2

38.3

20,2

32,2

38.3

288

6920

,232

,238

.320

,240

,714

.85,

670

801

20,2

32,2

38.3

260

020

,232

,238

.320

,232

,238

.325

860

20,2

32,2

38.3

20,2

32,2

38.3

4,39

660

020

,232

,238

.31

700

19,3

66,5

63.4

19,5

39,6

98.3

648

436

20,2

32,2

38.3

20,2

32,2

38.3

2,53

41,

984

19,3

66,5

63.4

180

019

,366

,563

.419

,366

,563

.431

873

19,3

66,5

63.4

19,5

39,6

98.3

5,50

21,

006

19,3

66,5

63.4

190

019

,366

,563

.419

,366

,563

.433

678

19,3

66,5

63.4

19,3

66,5

63.4

4,92

870

019

,366

,563

.42

1,00

019

,366

,563

.419

,366

,563

.432

475

19,3

66,5

63.4

19,5

39,6

98.3

5,60

094

719

,366

,563

.41

1,10

019

,366

,563

.419

,366

,563

.430

61,

452

19,3

66,5

63.4

19,3

66,5

63.4

3,55

62,

002

18,9

69,5

23.4

21,

200

18,4

18,7

79.2

18,7

97,8

92.9

954

775

18,4

18,7

79.2

18,7

97,8

92.9

4,78

81,

268

18,4

18,7

79.2

31,

300

18,4

18,7

79.2

18,4

18,7

79.2

366

8618

,418

,779

.218

,608

,336

.07,

014

1,33

118

,418

,779

.22

1,40

016

,726

,453

.017

,741

,848

.71,

230

1,12

318

,418

,779

.218

,608

,336

.06,

020

2,03

716

,726

,453

.02

1,50

012

,596

,574

.712

,596

,574

.71,

152

276

14,2

91,1

33.4

16,9

12,5

57.2

8,83

41,

953

12,5

96,5

74.7

11,

600

12,5

96,5

74.7

12,5

96,5

74.7

354

9712

,596

,574

.712

,935

,486

.45,

222

879

12,5

96,5

74.7

11,

700

12,5

96,5

74.7

12,5

96,5

74.7

336

9012

,596

,574

.712

,596

,574

.74,

186

563

12,5

96,5

74.7

11,

800

12,5

96,5

74.7

12,5

96,5

74.7

282

6312

,596

,574

.712

,596

,574

.74,

102

564

12,5

96,5

74.7

11,

900

12,5

96,5

74.7

12,5

96,5

74.7

174

3912

,596

,574

.712

,596

,574

.73,

472

469

12,5

96,5

74.7

22,

000

12,5

96,5

74.7

12,5

96,5

74.7

204

4612

,596

,574

.712

,596

,574

.72,

534

343

12,5

96,5

74.7

2

Avg

16,6

52,5

47.1

16,7

50,5

24.8

470.

630

2.5

16,9

18,3

32.1

17,1

72,9

13.9

4,89

7.4

1,09

0.4

16,6

27,7

32.1

1.7

WO

I18

7,92

8,02

3.9

97



The performance for the CSP presented in Table 6.2 indicates that the MLIP approach has

generated better results in general. The MILP has outperformed both approaches BPSO

and GAI for the best solutions found. In terms of the best solutions found, the MILP has

generated the best solution in all 16 cases (100%), the BPSO algorithm has generated the

best solutions in the 69 out of 80 cases (86.25%) while the GAI method has generated the

best solutions in 58 out of 80 cases (72.50%). In terms of the average number of evaluations

performed, the BPSO needed about 10 times (10.4 exactly) less evaluations than the GAI,

yet achieving better solution quality. In terms of the average computation time, the MILP

was about 178 times (177.94 exactly) faster than the BPSO and 641 times (641.41 exactly)

faster than the GAI. In summary, as also indicated by the last row of Table 6.2, the MILP

approach has shown better performance than both the BPSO and GAI methods in all

aspects.


The querying performance for the moderate size problem set CMP presented in Table 6.3

indicates that the MILP algorithm has again generated better results in general. The

MLIP has outperformed both methods BPSO and GAI for both the best solutions found and

the computation time utilized. In terms of the best solutions found, the MILP approach has

generated the best solutions in all 16 cases (100 %), the BPSO algorithm has generated

the best solutions in 78 out of 80 cases (97.5%) while the GAI method has generated the

best solutions in 60 out of 80 cases (75%). In terms of the average number of evaluations

performed, the BPSO needed about 8 times (8.46 exactly) less evaluations than the GAI,

yet achieving better solution quality. In terms of the average computation time, the MILP

was about 4 times (3.85 exactly) times faster than the BPSO and about 24 times (23.87

exactly) faster than the GAI. In summary, as also indicated by the last row of Table 6.3,

the MILP algorithm has shown considerably better performance than both the BPSO and

GAI approaches in all aspects.

98


Tabl

e6.

3:Q

uery

ing

perf

orm

ance

resu

lts

for

the

mod

erat

esi

zepr

oble

mse

tC

MP

.

SB

PSO

GA

IM

ILP

(MB

)B

est

Avg

Eva

lsT

ime

Bes

tA

vgE

vals

Tim

eB

est

Tim

e

500

57,2

21,5

45.8

57,2

21,5

45.8

342

259

57,2

21,5

45.8

57,2

21,5

45.8

6,25

82,

502

57,2

21,5

45.8

560

057

,221

,545

.857

,221

,545

.836

626

057

,221

,545

.857

,221

,545

.83,

066

1,24

657

,221

,545

.85

700

57,2

21,5

45.8

57,2

21,5

45.8

390

278

57,2

21,5

45.8

57,3

53,5

05.8

3,68

22,

622

57,2

21,5

45.8

580

057

,221

,545

.857

,221

,545

.855

840

357

,221

,545

.857

,353

,505

.86,

230

3,51

757

,221

,545

.85

900

57,2

21,5

45.8

57,2

21,5

45.8

342

245

57,2

21,5

45.8

70,2

76,1

81.0

7,42

04,

381

57,2

21,5

45.8

81,

000

52,5

71,7

45.5

53,4

44,3

28.5

1,70

41,

405

52,5

71,7

45.5

54,9

95,1

12.6

4,76

04,

559

52,5

71,7

45.5

51,

100

52,5

71,7

45.5

52,5

71,7

45.5

936

684

52,5

71,7

45.5

55,2

46,8

71.6

9,00

26,

035

52,5

71,7

45.5

91,

200

52,5

71,7

45.5

52,5

71,7

45.5

684

513

52,5

71,7

45.5

54,4

31,6

65.6

7,54

64,

295

52,5

71,7

45.5

111,

300

52,5

71,7

45.5

52,5

71,7

45.5

480

368

52,5

71,7

45.5

53,5

01,7

05.6

4,76

03,

284

52,5

71,7

45.5

191,

400

52,5

71,7

45.5

52,5

71,7

45.5

942

715

52,5

71,7

45.5

53,5

98,0

60.1

6,94

44,

269

52,5

71,7

45.5

281,

500

52,5

71,7

45.5

52,5

71,7

45.5

852

654

52,5

71,7

45.5

52,8

27,5

84.9

6,83

23,

649

52,5

71,7

45.5

569

1,60

052

,571

,745

.552

,571

,745

.541

433

652

,571

,745

.553

,489

,226

.25,

236

3,76

752

,571

,745

.563

1,70

051

,631

,175

.451

,819

,289

.41,

602

2,12

351

,631

,175

.453

,692

,102

.37,

672

4,90

551

,631

,175

.41,

804

1,80

039

,136

,645

.439

,136

,645

.41,

236

969

39,1

36,6

45.4

50,3

35,9

91.1

7,09

85,

724

39,1

36,6

45.4

51,

900

39,1

36,6

45.4

39,1

36,6

45.4

498

386

39,1

36,6

45.4

41,6

35,5

51.4

7,42

03,

981

39,1

36,6

45.4

62,

000

39,1

36,6

45.4

39,1

36,6

45.4

360

283

39,1

36,6

45.4

39,1

36,6

45.4

5,26

42,

396

39,1

36,6

45.4

12

Avg

51,4

46,9

41.2

51,5

13,2

34.8

731.

661

7.4

51,4

46,9

41.2

53,8

94,8

00.1

6,19

9.4

3,82

0.7

51,4

46,9

41.2

160

WO

I25

1,52

3,27

8.2

99


Tabl

e6.

4:Q

uery

ing

perf

orm

ance

resu

lts

for

the

larg

ersi

zepr

oble

mse

tC

LP

.

SB

PSO

GA

IM

ILP

(MB

)B

est

Avg

Eva

lsT

ime

Bes

tA

vgE

vals

Tim

eB

est

Tim

e

500

120,

162,

008.

112

0,16

2,00

8.1

504

633

120,

162,

008.

112

0,16

2,00

8.1

6,45

44,

407

120,

162,

008.

19

600

120,

162,

008.

112

0,16

2,00

8.1

444

558

120,

162,

008.

112

1,43

7,24

1.8

6,45

46,

172

120,

162,

008.

18

700

120,

162,

008.

112

0,16

2,00

8.1

492

606

120,

162,

008.

112

0,16

2,00

8.1

5,72

34,

008

120,

162,

008.

18

800

120,

162,

008.

112

0,16

2,00

8.1

408

514

120,

162,

008.

112

0,16

2,00

8.1

8,33

05,

932

120,

162,

008.

110

900

120,

162,

008.

112

0,16

2,00

8.1

516

674

120,

162,

008.

112

2,31

4,11

9.8

9,67

410

,276

120,

162,

008.

120

1,00

012

0,16

2,00

8.1

120,

162,

008.

149

265

312

0,16

2,00

8.1

121,

596,

749.

25,

348

6,27

912

0,16

2,00

8.1

247

1,10

012

0,16

2,00

8.1

121,

033,

810.

549

83,

646

120,

162,

008.

112

2,08

7,01

7.3

5,78

26,

812

120,

162,

008.

121

51,

200

112,

513,

169.

411

2,51

3,16

9.4

1,56

62,

021

116,

474,

246.

711

9,42

4,45

5.8

6,70

610

,139

112,

513,

169.

427

1,30

011

2,51

3,16

9.4

118,

632,

240.

457

66,

211

112,

513,

169.

411

8,02

1,79

6.9

8,05

010

,154

112,

513,

169.

472

1,40

011

2,51

3,16

9.4

114,

042,

937.

246

22,

128

112,

513,

169.

411

8,63

2,24

0.4

7,70

09,

019

112,

513,

169.

487

21,

500

112,

513,

169.

411

4,04

2,93

7.2

528

2,14

111

2,51

3,16

9.4

117,

102,

472.

66,

552

8,61

311

2,51

3,16

9.4

1,33

11,

600

112,

513,

169.

411

5,57

2,70

4.9

1,15

24,

363

112,

513,

169.

411

8,63

2,24

0.4

8,37

29,

777

112,

513,

169.

41,

619

1,70

011

2,51

3,16

9.4

115,

572,

704.

991

83,

998

112,

513,

169.

411

6,60

9,60

6.1

6,80

49,

616

112,

513,

169.

41,

811

1,80

011

2,51

3,16

9.4

114,

042,

937.

294

82,

717

112,

513,

169.

411

4,04

2,93

7.2

7,16

86,

721

112,

513,

169.

440

41,

900

111,

013,

984.

011

1,01

3,98

4.0

1,27

21,

799

111,

013,

984.

011

3,02

9,24

1.8

9,66

011

,045

111,

013,

984.

067

92,

000

87,9

26,2

65.2

96,9

01,7

73.1

1,53

03,

883

87,9

26,2

65.2

107,

267,

710.

97,

826

9,13

087

,926

,265

.270

Avg

114,

229,

155.

811

5,89

6,32

8.0

769.

12,

284.

011

4,47

6,72

3.1

118,

167,

740.

97,

287.

78,

006.

211

4,22

9,15

5.8

462.

6

WO

I62

5,19

2,54

9.9

100



Table 6.4 shows the querying performance for the MILP,BPSO and GAI approaches for the

larger size problem set CLP. Note that this class is the hardest one. The MILP approach

has again generated better results in general. The MILP has outperformed both approaches

BPSO and GAI for both the best solutions found and computation time utilized. In terms

of the best solutions found, the MILP approach has generated the best solutions in all 16

cases (100%), while the BPSO algorithm has generated the best solutions in 68 out of 80

cases (85%) and the GAI method has generated the best solutions in 43 out of 80 cases

(53.75%). In terms of the average number of evaluations performed, the BPSO needed

about 9 times (9.47 exactly) less evaluations than the GAI, yet achieving better solution

quality. In terms of the average computation time, the MILP was about 5 times (4.85

exactly) faster than the BPSO and about 17 times (17.30 exactly) faster than the GAI. In

summary, as also indicated by the last row of Table 6.4, the MILP algorithm has again

shown considerably better performance than both the BPSO and GAI approaches in all

aspects.

6.4.3 Performance Scalability Study

Experiments were extended (i.e., scaled up) to further analyze the effectiveness of the

MILP approach against the BPSO and GAI approaches. The cost model was the same

as the one used in the previous experiments. In the scalability study, the fact table size

has been increased 30 million to 150 million tuples in 30 millions tuple increments (five

different cases) and for each of the the fact table size increments, the storage size S was

systematically increased from 500 MB to 2000 MB in 500 MB increments (four different

cases), yielding a total of 20 different cases. Again, the average cost of solutions over five

independent runs was reported for the stochastic algorithms BPSO and GAI. Therefore, a

total of 100 runs (20×5) was under consideration for the algorithms BPSO and GAI. The

querying performance of scalability experiments for the three problem sets CSP, CMP

and CLP are presented in Table 6.5, Table 6.6 and Table 6.7 respectively (see Section 4.3

for table details). The additional column |F| represents the size of the fact table used in

millions. The WOI entries represents the cost of the query workload without using any

index configuration for different fact table sizes.

6.4.3.1 Scalability results for the smaller size problem set CSP.

The performance of scalability experiments for the smaller size problem set (CSP) presented

in Table 6.5 indicates that the MILP approach was again superior to the competitor

101


algorithms BPSO and GAI. The MILP approach has generated the best solution in all 20

cases (100%), for the BPSO approach has generated the best solutions in 96 of 100 cases

(96%) while the GAI approach has generated the best solutions in 55 out of 100 cases (55%).

In terms of the average number of evaluations performed, the BPSO needed about 6 times

(5.64 exactly) less evaluations than the GAI. In terms of the average computation time, the

MILP was about 88 times (87.91 exactly) and about 495 times (494.58 exactly) faster than

the BPSO and GAI respectively. In summary, as also indicated by the last row of Table 6.5,

the MILP algorithm has shown considerably better performance than both the BPSO and

GAI approaches in all aspects.

6.4.3.2 Scalability results for the moderate size problem set CMP.

The performance of scalability experiments for the moderate size problem set presented in

Table 6.6 indicates that the MILP algorithm has again generated better results than the

algorithms BPSO and GAI. The MILP has outperformed the algorithms BPSO and GAI

for both the best solutions found and computation time utilized. The MILP has always

generated the best solutions in all runs (100%). The BPSO has generated the best solution

in 97 out of 100 cases (97 %). The GAI approach has generated the best solutions in 37

out of 100 runs (37%). In terms of the average number of evaluations performed, the

BPSO needed about times (3.08 exactly) less evaluations than the GAI, yet achieving

better solution quality. In terms of the average computation time, the MILP was about 30

times (30.44 exactly) and about 94 times (93.83 exactly) faster than the BPSO and GAI

respectively. In summary, as also indicated by the last row of Table 6.6, the MILP approach

has again shown considerably better performance than both the GAI and GA approaches

in all aspects.

102


ble

6.5:

Perf

orm

ance

resu

lts

for

the

smal

lest

size

prob

lem

set

CS

Pin

scal

abili

ty.

S|F

|B

PSO

GA

IM

ILP

(MB

)B

est

Avg

Eva

lsT

ime

Bes

tA

vgE

vals

Tim

eB

est

Tim

e

500

3024

,487

,250

.724

,487

,250

.735

476

24,4

87,2

50.7

24,5

07,7

67.9

5,76

866

124

,487

,250

.72

6080

,602

,513

.080

,602

,513

.062

412

780

,602

,513

.092

,404

,997

.07,

056

769

80,6

02,5

13.0

190

143,

971,

543.

414

3,97

1,54

3.4

750

137

143,

971,

543.

418

5,83

8,65

7.1

5,88

057

314

3,97

1,54

3.4

112

022

6,92

6,38

3.7

226,

926,

383.

736

069

226,

926,

383.

728

0,53

4,35

5.1

2,84

228

322

6,92

6,38

3.7

115

028

3,65

5,55

4.5

283,

655,

554.

528

854

283,

655,

554.

536

2,79

0,54

3.7

4,49

447

728

3,65

5,55

4.5

1

1,00

030

23,4

38,9

05.1

23,4

38,9

05.1

528

109

23,4

38,9

05.1

23,8

58,2

43.4

5,27

860

523

,438

,905

.11

6048

,968

,732

.548

,968

,732

.561

212

148

,968

,732

.555

,192

,951

.13,

332

340

48,9

68,7

32.5

190

73,6

03,5

65.7

73,6

03,5

65.7

354

6473

,603

,565

.794

,321

,748

.65,

292

536

73,6

03,5

65.7

112

161,

198,

240.

616

1,19

8,24

0.6

624

112

161,

198,

240.

617

3,50

3,06

3.3

9,54

895

516

1,19

8,24

0.6

115

234,

052,

273.

823

4,05

2,27

3.8

486

8723

4,05

2,27

3.8

236,

410,

610.

56,

104

580

234,

052,

273.

81

1,50

030

22,2

90,6

04.2

22,2

90,6

04.2

708

142

22,9

57,1

91.1

23,3

42,5

62.3

3,08

036

522

,290

,604

.22

6048

,202

,475

.448

,662

,229

.71,

422

277

48,9

68,7

32.5

48,9

89,2

44.8

6,69

272

248

,202

,475

.41

9073

,449

,736

.873

,449

,736

.843

887

73,4

49,7

36.8

73,4

80,5

02.6

4,71

851

873

,449

,736

.81

120

98,1

36,3

15.9

98,1

36,3

15.9

576

105

98,1

36,3

15.9

135,

112,

122.

67,

224

741

98,1

36,3

15.9

115

017

0,34

9,57

4.0

176,

219,

909.

166

612

117

0,34

9,57

4.0

190,

114,

021.

76,

496

645

170,

349,

574.

01

2,00

030

15,2

43,1

42.0

15,2

43,1

42.0

408

8415

,243

,142

.017

,652

,374

.05,

936

704

15,2

43,1

42.0

160

46,8

69,3

80.0

46,8

69,3

80.0

522

102

46,8

69,3

80.0

47,7

09,1

21.0

4,80

253

146

,869

,380

.01

9073

,449

,736

.873

,449

,736

.828

253

73,4

49,7

36.8

73,4

49,7

36.8

6,42

669

173

,449

,736

.81

120

97,9

31,2

18.7

97,9

31,2

18.7

336

6397

,931

,218

.797

,931

,218

.75,

810

593

97,9

31,2

18.7

115

012

2,41

2,22

3.0

122,

412,

223.

064

211

812

2,41

2,22

3.0

147,

098,

606.

25,

810

582

122,

412,

223.

01

Avg

103,

461,

968.

510

3,77

8,47

3.0

549.

010

5.5

103,

533,

610.

711

9,21

2,12

2.4

5,62

9.4

593.

510

3,46

1,96

8.5

1.2

WO

I30

106,

466,

307.

460

212,

896,

230.

990

319,

298,

166.

812

042

5,75

3,06

5.1

150

532,

186,

781.

4

103


Tabl

e6.

6:Pe

rfor

man

cere

sult

sfo

rth

esm

alle

stsi

zepr

oble

mse

tC

MP

insc

alab

ility

.

S|F

|B

PSO

GA

IM

ILP

(MB

)B

est

Avg

Eva

lsT

ime

Bes

tA

vgE

vals

Tim

eB

est

Tim

e

500

3069

,255

,881

.669

,255

,881

.649

831

369

,255

,881

.669

,255

,881

.67,

252

2,71

969

,255

,881

.66

6027

1,64

5,39

2.9

271,

645,

392.

954

632

327

1,64

5,39

2.9

274,

126,

815.

45,

530

1,87

127

1,64

5,39

2.9

790

443,

986,

821.

144

3,98

6,82

1.1

456

463

443,

986,

821.

151

6,06

2,18

3.3

6,13

21,

101

443,

986,

821.

15

120

642,

016,

981.

264

2,01

6,98

1.2

396

393

642,

016,

981.

284

2,17

5,32

6.7

8,19

01,

489

642,

016,

981.

25

150

802,

514,

641.

680

2,51

4,64

1.6

252

258

802,

514,

641.

61,

077,

474,

950.

13,

010

528

802,

514,

641.

66

1,00

030

69,2

55,8

81.6

69,2

55,8

81.6

420

431

69,2

55,8

81.6

69,2

55,8

81.6

6,27

23,

209

69,2

55,8

81.6

560

138,

496,

191.

013

8,49

6,19

1.0

384

356

138,

496,

191.

014

9,26

3,04

2.6

7,21

02,

701

138,

496,

191.

03

9034

3,01

1,22

7.5

343,

011,

227.

51,

266

1,25

034

3,01

1,22

7.5

366,

666,

984.

38,

204

3,02

934

3,01

1,22

7.5

512

054

3,27

1,42

4.9

543,

271,

424.

936

633

354

3,27

1,42

4.9

577,

806,

869.

78,

036

2,94

054

3,27

1,42

4.9

615

073

9,96

5,81

9.0

739,

965,

819.

051

046

673

9,96

5,81

9.0

766,

498,

129.

99,

268

3,08

673

9,96

5,81

9.0

6

1,50

030

63,6

25,9

03.3

63,6

25,9

03.3

534

664

63,6

25,9

03.3

63,6

25,9

03.3

6,13

22,

098

63,6

25,9

03.3

1260

138,

496,

191.

013

8,49

6,19

1.0

390

464

138,

496,

191.

013

8,49

6,19

1.0

4,84

41,

497

138,

496,

191.

06

9020

7,73

5,14

9.9

207,

735,

149.

934

238

320

7,73

5,14

9.9

207,

735,

149.

95,

096

1,46

220

7,73

5,14

9.9

512

038

4,64

2,67

9.4

384,

642,

679.

487

694

845

7,34

3,47

8.6

523,

215,

167.

64,

634

1,04

638

4,64

2,67

9.4

615

064

4,73

3,05

8.4

644,

733,

058.

453

456

364

4,73

3,05

8.4

682,

749,

224.

86,

776

1,43

564

4,73

3,05

8.4

7

2,00

030

62,4

84,6

06.9

62,9

41,1

25.5

1,44

01,

915

63,6

25,9

03.3

64,9

61,5

54.5

4,07

41,

376

62,4

84,6

06.9

277

6013

8,49

6,19

1.0

138,

496,

191.

031

839

013

8,49

6,19

1.0

138,

496,

191.

05,

334

1,63

713

8,49

6,19

1.0

590

207,

735,

149.

920

7,73

5,14

9.9

300

354

207,

735,

149.

920

7,73

5,14

9.9

4,34

01,

188

207,

735,

149.

95

120

276,

975,

459.

227

6,97

5,45

9.2

492

564

276,

975,

459.

227

6,97

5,45

9.2

5,20

81,

447

276,

975,

459.

25

150

480,

797,

856.

849

8,97

2,98

5.4

912

1,04

148

0,79

7,85

6.8

565,

418,

600.

73,

318

733

480,

797,

856.

86

Avg

333,

457,

125.

433

4,38

8,70

7.8

561.

659

3.7

337,

149,

230.

237

8,89

9,73

2.9

5,94

3.0

1,82

9.7

333,

457,

125.

419

.5

WO

I30

304,

421,

457.

660

608,

797,

384.

490

913,

177,

336.

712

01,

217,

555,

436.

015

01,

521,

931,

807.

0

104


6.4.3.3 Scalability results for the larger size problem set CLP.

Table 6.7 shows the performance for the MILP,BPSO and GAI approaches for the larger

size problem set CLP in scalability. Note that this class is the hardest one (it gets even

harder when a parameter size is increased). The MILP approach has again generated better

results than the algorithms BPSO and GAI. The MILP has outperformed the algorithms

BPSO and GAI for both the best solutions found and computation time utilized. The MILP

has generated the best solution in all runs. For the BPSO has generated the best solutions

in 92 out of 100 runs (92%), the GAI has reached the best solutions in 60 out of 100

runs (60%). In terms of the average number of evaluations performed, the BPSO needed

about 7 times (7.44 exactly) less evaluations than the GAI, yet achieving better solution

quality. In terms of the average computation time, the MILP was about 1.5 times (1.26

exactly) and about 20 times (9.43 exactly) faster than the BPSO and GAI respectively.

We have observed the time required to obtain an optimal solution by MILP in the case

when S=2000 and |F|=30 millions is higher compare to the one for the BPSO. In this case

Cplex solver utilized more time to obtain an exact solution. In summary, as also indicated

by the last row of Table 6.7, the MILP algorithm has again shown considerably better

performance than both the BPSO and GAI approaches in almost all aspects. We have

presented a mathematical formulation based on MILP to solve the single bitmap join

indexes selection problem. The approach is different from the stochastic and statistical

approaches that exist to solve the problem. The formulation also utilizes an internal bitmap

called λ for accurately incorporating the cost of joins involved into the model (see Secton

3.1). We have used three classes of problem sets, the smaller size, the moderate size and

the larger size to test the effectiveness of the MILP approach against two well-known best

approaches, the improved genetic algorithm based approach, GAI, and the binary particle

swarm optimization approach BPSO on a fairly large data warehouse benchmark (APB-I

benchmark). We have performed a scalability study to further analyze the effectiveness of

the MILP approach against the BPSO and the GAI approaches by systematically increasing

the fact table size.

Both the general and scalability results have shown that the MILP approach outper-

forms the two currently best known algorithms, the BPSO and the GAI) in many aspects.

The MILP was able to obtain an optimal solution for all problem sets considered (CSP,

CMP and CLP) in considerably smaller amount of time than both the BPSO and GAI.

105


Tabl

e6.

7:Pe

rfor

man

cere

sult

sfo

rth

esm

alle

stsi

zepr

oble

mse

tC

LP

insc

alab

ility

.

S|F

|B

PSO

GA

IM

ILP

(MB

)B

est

Avg

Eva

lsT

ime

Bes

tA

vgE

vals

Tim

eB

est

Tim

e

500

3014

5,43

3,66

9.2

145,

433,

669.

242

045

414

5,43

3,66

9.2

145,

433,

669.

25,

978

3,16

814

5,43

3,66

9.2

860

615,

890,

014.

061

5,89

0,01

4.0

516

628

615,

890,

014.

087

5,51

4,91

4.1

5,67

03,

374

615,

890,

014.

041

9098

9,25

7,48

7.4

989,

257,

487.

451

667

498

9,25

7,48

7.4

1,35

5,48

0,26

3.6

7,79

85,

244

989,

257,

487.

48

120

2,19

7,56

5,20

2.2

2,19

7,56

5,20

2.2

492

719

2,19

7,56

5,20

2.2

2,31

0,31

8,28

3.6

7,81

23,

343

2,19

7,56

5,20

2.2

815

02,

746,

938,

290.

42,

746,

938,

290.

434

834

22,

746,

938,

290.

43,

011,

874,

627.

77,

924

4,11

62,

746,

938,

290.

47

1,00

030

145,

433,

669.

214

5,43

3,66

9.2

636

679

145,

433,

669.

214

5,43

3,66

9.2

7,70

04,

823

145,

433,

669.

221

6029

0,83

5,76

3.7

290,

835,

763.

748

058

129

0,83

5,76

3.7

290,

835,

763.

76,

510

2,53

429

0,83

5,76

3.7

890

799,

457,

922.

179

9,45

7,92

2.1

402

545

799,

457,

922.

180

9,30

1,45

3.5

8,83

44,

107

799,

457,

922.

138

120

1,23

1,73

9,65

8.1

1,23

1,73

9,65

8.1

276

271

1,23

1,73

9,65

8.1

1,43

8,35

8,30

1.5

9,32

45,

184

1,23

1,73

9,65

8.1

2915

01,

648,

737,

358.

91,

648,

737,

358.

954

054

41,

648,

737,

358.

92,

156,

181,

169.

14,

466

1,57

41,

648,

737,

358.

99

1,50

030

136,

172,

020.

113

9,87

6,67

9.7

522

564

136,

172,

020.

113

9,87

6,67

9.7

6,42

62,

844

136,

172,

020.

187

6029

0,83

5,76

3.7

290,

835,

763.

740

249

929

0,83

5,76

3.7

293,

061,

055.

86,

398

3,15

629

0,83

5,76

3.7

890

436,

235,

022.

343

6,23

5,02

2.3

198

202

436,

235,

022.

351

2,50

9,26

1.2

532

167

436,

235,

022.

38

120

992,

308,

830.

61,

036,

483,

969.

045

043

799

2,30

8,83

0.6

1,24

0,72

2,84

4.3

5,25

02,

103

992,

308,

830.

631

150

1,47

9,67

5,77

8.4

1,47

9,67

5,77

8.4

522

537

1,55

3,24

7,80

2.4

1,89

1,09

1,76

0.3

5,93

62,

736

1,47

9,67

5,77

8.4

17

2,00

030

136,

172,

020.

114

1,72

9,00

9.6

503

591

136,

172,

020.

114

2,55

2,57

2.6

10,1

928,

405

136,

172,

020.

17,

547

6029

0,83

5,76

3.7

290,

835,

763.

722

822

929

0,83

5,76

3.7

294,

304,

783.

27,

350

5,48

229

0,83

5,76

3.7

2190

436,

235,

022.

343

6,23

5,02

2.3

372

359

436,

235,

022.

343

8,65

0,15

0.5

7,12

64,

508

436,

235,

022.

38

120

581,

637,

116.

858

1,63

7,11

6.8

462

456

581,

637,

116.

858

1,63

7,11

6.8

6,76

23,

701

581,

637,

116.

88

150

1,20

0,89

7,79

3.9

1,20

0,89

7,79

3.9

645

736

1,20

8,33

1,73

1.1

1,23

0,42

8,36

2.0

7,67

24,

165

1,20

0,89

7,79

3.9

14

Avg

839,

614,

708.

484

2,28

6,54

7.7

446.

550

2.2

843,

665,

006.

496

5,17

8,33

5.1

6,78

3.0

3,73

6.6

839,

614,

708.

439

6.2

WO

I30

7756

6788

20.0

601,

513,

615,

173.

390

2,27

0,50

1,46

2.9

120

3,02

6,94

9,31

4.3

150

3,78

3,66

1,55

7.0

106

6.5. CONCLUSIONS

6.5 Conclusions

We have presented a mathematical formulation based on MILP to solve the single bitmap

join indexes selection problem. The approach is different from the stochastic and statistical

approaches that exist to solve the problem. The formulation also utilizes an internal

bitmap called λ for accurately incorporating the cost of joins involved into the model.

We have used three classes of problem sets, the smaller size, the moderate size and the

larger size to test the effectiveness of the MILP approach against two well-known best

approaches, the improved genetic algorithm based approach, GAI, and the binary particle

swarm optimization approach BPSO on a fairly large data warehouse benchmark (APB-I

benchmark). We have performed a scalability study to further analyze the effectiveness of

the MILP approach against the BPSO and the GAI approaches by systematically increasing

the fact table size.

Both the general and scalability results have shown that the MILP approach outper-

forms the two currently best known algorithms, the BPSO and the GAI) in many aspects.

The MILP was able to obtain an optimal solution for all problem sets considered (CSP,

CMP and CLP) in considerably smaller amount of time than both the BPSO and GAI.

107

CH

AP

TE

R

7EFFICIENT METHODOLOGY FOR REFERENCE HORIZONTAL

PARTITIONING IN DATA WAREHOUSES

“Any intelligent fool can make things bigger and more complex . . . It takes a touch of

genius-and a lot of courage to move in the opposite direction.”

-Albert Einstein

7.1 Introduction

This chapter proposes a new methodology for the reference horizontal partitioning in

data warehouses. The present methodology is based on statistics (Jaccard index), data

mining (hierarchical clustering) and meta-heuristic (particle swarm optimization). First,

we compute attraction coefficient between predicates using Jaccard index [71], then we

apply the Ward algorithm [77] (hierarchical clustering) for clustering the set of predicates.

Finally, we use a discrete particle swarm optimization (DPSO) [41] for selecting the best

partitioning scheme. Several experiments are performed to demonstrate the effectiveness

of the proposed methodology using relatively large query workload and the results are

compared to the best well known method, the genetic algorithm based approach. The

proposed methodology is found to be faster and more effective than the genetic algorithm

based approach for solving the horizontal partitioning problem in data warehouses.

109

CHAPTER 7. EFFICIENT METHODOLOGY FOR REFERENCE HORIZONTALPARTITIONING IN DATA WAREHOUSES

7.2 The proposed methodology

Data warehouse

Selection Strategy Module(EMeD-Part, GA)

Queries workload

(Q)

Query parser Module

Maintenance constraint

(B)

Knowledge(Metadata, schemas,

statistics,...)

Cost modelsModule

Set of Predicates

(P)

Attraction Module

AttractionBetween

Predicates

Hierarchical Clustering Module

Clusteringresults

Clusters Validator Module

Clustersof

predicates(C)

Figure 7.1: EMeD-Part process

The general schema of the EMeD-Part methodology is illustrated in Fig. 7.1.

7.2.1 Predicates attraction

First, the query workload Q is parsed and the set of predicates Pis generated to build the

predicate usage matrix (PUM |Q|× |P|)(see Fig. 7.2). We compute the attraction between

each pair of predicates Pi and P j in the set P. The Jaccard coefficient [71] is a useful

measure of the overlap that Pi and P j share with their queries. Each row of Pi and P j can

110

7.2. THE PROPOSED METHODOLOGY

P1 P2 P3 P4 P5 P6 P7 P8 P9 P10Q1 1 0 0 0 1 0 1 0 0 0Q2 0 1 1 0 0 0 0 1 1 0Q3 0 0 0 1 0 1 0 0 0 1Q4 0 1 0 0 0 0 1 1 0 0Q5 1 1 1 0 1 0 1 1 1 0Q6 1 0 0 0 1 0 0 0 0 0Q7 0 0 1 0 0 0 0 0 1 0Q8 0 0 1 1 0 1 0 0 1 1

Figure 7.2: Example of predicates usage matrix PUM

either be 0 or 1. The total number for each combination of rows for the pair Pi and P j are

specified as follows:

• b : The total number of rows where Pi and P j both have a value of 1.

• r : The total number of rows where the row of Pi is 0 and the row of P j is 1.

• l : The total number of rows where the row of Pi is 1 and the row of P j is 0.

The Jaccard similarity coefficient, J, is given as [71]:

J = br+ l+b

(7.1)

Matrix in Fig. 7.3. summarizes attraction coefficients between each pair of predicates in P.

P1 P2 P3 P4 P5 P6 P7 P8 P9 P10P1 1 0.2 0.17 0 1 0 0.5 0.2 0.17 0P2 0.2 1 0.4 0 0.2 0 0.5 1 0.4 0P3 0.17 0.4 1 0.2 0.17 0.2 0.17 0.4 1 0.2P4 0 0 0.2 1 0 1 0 0 0.2 1P5 1 0.2 0.17 0 1 0 0.5 0.2 0.17 0P6 0 0 0.2 1 0 1 0 0 0.2 1P7 0.5 0.5 0.17 0 0.5 0 1 0.5 0.17 0P8 0.2 1 0.4 0 0.2 0 0.5 1 0.4 0P9 0.17 0.4 1 0.2 0.17 0.2 0.17 0.4 1 0.2P10 0 0 0.2 1 0 1 0 0 0.2 1

Figure 7.3: Example of attraction matrix (AM) computed using Jaccard Index

7.2.2 Clustering of Predicates

The principal dissimilarity between partitional clustering and hierarchical clustering

resides in the fact the hierarchical clustering is not limited to group the data objects in

a flat partition, but it used to organize the data into a tree structure called dendrogram.

Each data object are represented by a leaf of the dendrogram, while internal nodes of the

111


#clusters clusters10 {10}{4}{6}{1}{5}{3}{9}{7}{2}{8}9 {10}{4}{6}{1,5}{3}{9}{7}{2}{8}8 {10}{4}{6}{1,5}{3}{9}{7}{2,8}7 {10}{4}{6}{1,5}{3,9}{7}{2,8}6 {10}{4,6}{1,5}{3,9}{7}{2,8}5 {4,6,10}{1,5}{3,9}{7}{2,8}4 {4,6,10}{1,5}{3,9}{2,7,8}3 {4,6,10}{1,5}{2,3,7,8,9}2 {4,6,10}{1,2,3,5,7,8,9}1 {1,2,3,4,5,6,7,8,9,10}

Figure 7.4: Result of the hierarchical clustering

dendrogram represent groups of objects such that for each pair of elements in such group,

their distance is within a fixed threshold. A flat clustering can be easily obtained by cutting

the dendrogram at a certain level.

The hierarchical clustering needs a high computation time, at least quadratic in the

number of data objects. The complexity of hierarchical clustering is due to the squared

matrix of the distances between all the pairs of points in the data set. Several improved ver-

sions of hierarchical clustering allow the optimization of the performances when clustering

a large scale of data.

Two variants of the hierarchical clustering exist:

• Divisive or top-down variant: the dendrogram of this variant is built from the root

to the leaves. First, The n data are in the same cluster. A series of split operations are

performed until reaching n clusters, each cluster has a single element. Between all

pairs of objects in the same cluster all distances are computed, and after a selection

of two diametral points as kernels, then all the points in the group are assigned to

the closest kernels.

• Agglomerative or bottom-up variant: the dendrogram of this variant is built

from the leaves to the root. At the starting each of the n objects is assigned into

a cluster, after a series of merge operations is performed until all the points are

merging into the same cluster.

Based on the attraction matrix (AM) that contains attraction coefficients between each

pair of predicates in P. we use a hierarchical clustering algorithm, precisely the Ward

algorithm [77], which is based on the minimum variance criterion, for clustering the set of

predicates P. In order to decide which clusters should be combined, we use the euclidean

112

7.2. THE PROPOSED METHODOLOGY

distance for measuring the similarity between each pair of predicates. An example of the

clustering process using the Ward algorithm is presented in the Fig. 7.4.

7.2.3 Determining the number of clusters

In the previous section we have presented the hierarchical clustering and discussed the

advantage and limits of the clustering techniques. In the clustering algorithms the number

of clusters is generally unknown which needs to be either defined by the users based on

their prior knowledge or estimated. Several methods exist in the literature for estimate

the number of cluster. We using in this work the Calinski and Harabasz method [19].

This approach determines the number of clusters by maximizing the index CH(g) over the

number of clusters g, where CH(g) is given by Eq. 7.2:

CH(g)= B(g)/(g−1)W(g)/(n− g)

(7.2)

B(g) and W(g) are the between and within-cluster sum of squared errors, calculated as the

trace of matrix B (Eq. 7.3) and W (Eq. 7.4). CH(g) is only defined when g is greater than 1,

B(g) is not defined when g = 1.

B =g∑

m=1

nm∑l=1

(xml − xm)(xml − xm)′ (7.3)

W =g∑

m=1nm(xm − x)(xm − x)′, x = 1

n

n∑i=1

xi (7.4)

7.2.3.1 Solution coding

P1 P2 P3 P4 P5

P6 P7

P8 P9 P10

P1,P5 P2,P3 P4

P6 P7

P8,P9 P10

After

Clustering

Figure 7.5: Solution coding

Solution coding is an important issues when designing a meta-heuristics. The search

space of the proposed DPSO algorithm is composed of a set of predicate clusters generating

by the clustering module. An integer vector of particle positions xi is used for the coding of

a solution.

113


7.2.4 Discrete particle swarm optimization for selecting horizontalpartitioning schema

Particle Swarm Optimization (PSO) is a meta-heuristic technique inspired from swarm

intelligence developed by Eberhart [44]. Formally, each particle i of the swarm is described

by its position xi and its velocity vi, pi is the best position travelled by the particle i. The

best value of all individual pi values is called g. At each generation, a particle’s position

and velocity are updated according to its own pi and pg values, this behavior is described

by Eq.(7.5) and Eq.(7.6):

vt+1i j = wvt

i j + c1r1(pi j − xti j)+ c2r2(pg j − xt

i j) (7.5)

xt+1i j = xt

i j +vt+1i j (7.6)

Here, vti j is the current velocity at time t, vt+1

i j is the new speed of particle i, w is the

inertia weight, c1 and c2 are two positive constants, r1 and r1 are the uniformly distributed

random numbers in [0,1], xti j is the current position of particle i at time t and xt+1

i j is the

new position of particle i.

Jarboui et al. [41] have adapted the original version of the PSO algorithm proposed by

Eberhart [44] to be used for solving the discrete problems by applying transformation to

the velocity using transition from the discrete state to the continuous state and vice versa

using a dummy variable called Y ti = {yt

i1, yti1, ..., yt

in}

yti j =

1, if xti j = pg j

−1, if xti j = pi j

1 or −1, if xti j = pi j = pg j

0, otherwise

(7.7)

The reformulation of Eq.(7.5) is given by:

vt+1i j = wvt

i j + c1r1(−1− yt−1i j )+ c2r2(1− yt−1

i j ) (7.8)

The solution update is given by:

λti j = yt−1

i j +vti j (7.9)

where

yti j =

1, if λt

i j >α

−1, if λti j <α

0, otherwise

(7.10)

114


Finally, the rewriting of Eq.(7.6) are given by:

xti j =

pt−1

g j , if yti j = 1

pt−1i j , if yt

i j =−1

a random number, otherwise

(7.11)

7.2.4.1 Fitness function

The main problem for solving the constrained problem is defining the penalty function.

We have shown in [72] that the principal limit of the GA with linear penalty function is

the unfeasible solutions generated. We have proposed an efficient exponential penalty for

solving the bitmap join indexes selection problem. In order that the DPSO generate more

feasible solutions, we describe the fitness function as follow:

f itness(PS)=GlobalCost(Q,PS)×2|PS| , i f P(PS)> 1

GlobalCost(Q,PS) , otherwise(7.12)

where:

P(PS)= |PS|B

(7.13)

|PS| are the number of the sub-star schemes generated by PS and B is the maintenance

constraint fixed by DWA. The GlobalCost is defined in the chapter 3.



In the experiments to follow, the benchmark APB-I [27] is used to generate the data ware-

house and ORACLE 11g DBMS environment is used to implement the data warehouse. In

this benchmark, the star schema contains four dimension tables: CHANLEVEL (9 tuples),

CUSTLEVEL (900 tuples), PRODLEVEL (9,000 tuples), TIMELEVEL (24 tuples) and the

fact table ACTVARS (24,786,000 tuples). Three classes of experiments are performed:

• the class of smaller size problem set (CSP): this class contains 100 OLAP queries

and 12 non-key attributes (280 selection predicates) from dimension tables: all,

year, retailer, quarter,month, line, group, family, division, class, gender, city with

cardinalities: 4, 15, 75, 300, 605, 5, 2, 4, 12, 5, 45, 255 respectively.

• the class of moderate size problem set (CMP): this class consists of 250 OLAP

queries and 16 non-key attributes (415 selection predicates) from dimension tables:

115


{division, line, family, group, class, status, year, quarter, month, day, state, city,

retailer, type, gender, all} with cardinalities: 4, 15, 75, 300, 605, 5, 2, 4, 12, 5, 45, 255,

99, 10, 2 and 9 respectively.

• the class of larger size problem set (CLP): this class consists of 500 OLAP queries

and 20 non-key attributes (497 selection predicates) from dimension tables: {all, year,

retailer, quarter, month, day, line, group, family, division, class, gender, city, state,

type, educational, marital, supplier, status, category} with cardinalities: 9, 2, 99, 4,

12, 5, 15, 300, 75, 4, 605, 2, 255, 45, 10, 6, 4, 15, 5 and 3 respectively.

7.3.2 Preview

All tests are performed under Intel i5 processor with 8 GB RAM. We have implemented

the DPSO algorithm with Java Development Kit. The Genetic algorithm is implemented

with Jenetics: Java Genetic Algorithm framework1 (see [17] for the details about GA

implementations). For the cost models, the disk page size P was set to 65,536 bytes and

the B-tree order was set to 2.

In the GA and DPSO, the maximum number of evaluations is equal to the number of

iterations × the population size. For the GA, the parameter setup proposed in [17] was used.

The population size, crossover probability, mutation rate and number of iterations were set

to 30, 0.95, 0.01 and 200 respectively yielding 6,000 maximum evaluations (200×30). For

the DPSO, the parameters c1, c2, Vmax, wmax, wmin and α were set to 1.7, 0.3, 6.0, 0.95,

0.4, 0.5 respectively [41, 44]. The population size and the number of iterations were set to

30 and 200 respectively yielding 6,000 maximum evaluations (200×30).


A set of experiments was used to analyze the efficiency of the EMeD-Part approach against

the GA using the three problem sets CSP, CMP and CLP.

For the stochastic algorithms DPSO in the EMeD-Part approach and GA, the average

cost of solutions over 5 independent runs are reported. The maintenance constraint B

was systematically increased from 400 partitions to 1000 partitions (in 200 partitions

increments, yielding a total of 4 different cases) to determine the cost for each of the

approaches. Therefore, a total of 20 (4×5) runs is under consideration for the algorithms

DPSO and GA. Tables 5.3, 5.5 and Table 5.7 show the number of disk page accesses (I/Os

costs) using the cost models described in Section 2.2 when executing the set of queries in

1http://jenetics.sourceforge.net/

116


the problem sets CSP, CMP and CLP respectively for 4 different maintenance constraint

sizes. For the DPSO and GA approaches, a solution can be either feasible or infeasible

with respect to the maintenance constraint B. The column Avg represents the average

cost of solutions found over the 5 independent runs. The column Best represents the value

of the best minimum cost solution found. The column Evals represents the average of

the number of candidate evaluations performed for reaching the best solution for the 5

independent runs (i.e., a measure about how fast an algorithm finds an optimal/sub-optimal

solution or converges). The column Err represents the rate of infeasible solutions (i.e.,

solutions that do not satisfy the maintenance constraint B). The column Time shows the

average computation time in seconds. The last column represents the following: for each

maintenance cost B, the Mann-Whitney test was applied to the results of the EMeD-

Part and GA for the 5 independent runs to see if results are significantly different or

not (using a standard significance level of 0.05). Symbol⊙

indicates that there is no

statistically significant difference between the results of the two approaches EMeD-Part

and GA. Symbol⊕

indicates that the results of the EMeD-Part are statistically different

than the GA approach. The last row provides an overall average for the runs. In each

table row the best (i.e., the minimum) and the average querying performance result were

presented in bold font for each of the approaches considered. In Tables 5.4, 5.6 and Table

5.8 the columns %Best represents the best performance rates and %Avg represents the

average performance rates computed using Eq. 7.14 shown below.

%Best = 100×(1− Best

CostWoP

); %Avg = 100×

(1− Avg

CostWoP

)(7.14)

CostWoP represents the workload cost without partitioning (i.e., using the hash-join

method).


The querying performance for the smaller size problem set CSP presented in Table. 7.1

indicates that the EMeD-Part approach has generated better results in general. the number

of clusters detected after the hierarchical clustering is 8 clusters. The EMeD-Part approach

has outperformed the GA approach for both best solutions found and average solution

quality obtained. In terms of the best solutions found, the EMeD-Part approach has

generated the best solutions in the 5 out of 20 cases (25%) while the GA method has

generated the best solutions in 0 out of 20 cases (0%). In terms of the average number of

evaluations performed, the EMeD-Part approach needed about 2 times (1.99 exactly) less

evaluations than the GA approach, yet achieving better solution quality. In terms of the

117


unfeasible solutions generated, the EMeD-Part approach has always generated feasible

solutions (i.e., 0% infeasible or 100% feasible) while the GA has generated unfeasible

solutions in 18 out of 20 cases, which is 2 feasible ones - i.e., 90% unfeasible or 10%

feasible. In terms of the computation time, the EMeD-Part approach was about 4 times

(4.49 exactly) faster than the GA approach. The last table column indicates that there was

a statistically significant difference between the results of the EMeD-Part and the GA in 4

out of 4 cases. In summary, as also indicated by the Table 7.1, the EMeD-Part approach

has shown better performance than the GA method in all aspects. Table 5.4 shows the

optimization rates of the approaches EMeD-Part and GA for the smaller size problem set

CSP. The best performance rates (%Best) for the EMeD-Part were slightly higher than the

GA approach (the unfeasible solutions generated by the GA are significantly higher). The

average performance rates (%Avg) for the EMeD-Part were also slightly higher than the

GA.

118


ble

7.1:

Que

ryin

gpe

rfor

man

cere

sult

sfo

rth

esm

alle

rsi

zepr

oble

mse

tC

SP

.

BE

med

-Pa

rtG

ASt

at.

sign

.B

est

Avg

Eva

lsT

ime

Err

Bes

tA

vgE

vals

Err

400

9,71

1,02

1.0

9,78

9,22

3.8

294.

63,

084.

00%

24,3

00,0

00.0

29,8

20,0

00.0

913.

010

0%⊕

600

9,70

9,04

7.0

9,73

3,43

7.6

233.

22,

832.

00%

25,3

00,0

00.0

29,8

00,0

00.0

1,18

6.3

80%

⊕80

09,

719,

742.

09,

736,

885.

836

5.0

3,40

8.0

0%26

,100

,000

.029

,480

,000

.01,

275.

480

%⊕

1,00

09,

719,

742.

09,

742,

590.

625

5.2

2,39

4.0

0%26

,000

,000

.027

,600

,000

.01,

785.

710

0%⊕

Avg

9,71

4,88

8.0

9,75

0,53

4.5

287.

02,

929.

50%

25,4

25,0

00.0

29,1

75,0

00.0

1,29

0.1

90%

Tabl

e7.

2:Q

uery

ing

perf

orm

ance

resu

lts

for

the

smal

ler

size

prob

lem

set

CS

P.

BE

med

-Pa

rtG

A

Bes

tA

vgE

rrB

est

Avg

Err

400

94.8

3%94

.79%

0%87

.07%

84.1

3%10

0%60

094

.83%

94.8

2%0%

86.5

4%84

.14%

80%

800

94.8

3%94

.82%

0%86

.11%

84.3

1%80

%1,

000

94.8

3%94

.82%

0%86

.16%

85.3

1%10

0%

Avg

94.8

3%94

.81%

0%86

.47%

84.4

8%90

%

119



The querying performance for the moderate size problem set CMP presented in Table.

7.3 indicates that the EMeD-Part approach has generated better results in general. the

number of clusters detected after the hierarchical clustering is 18 clusters. The EMeD-Part

approach has outperformed theGA approach for both the best solutions found and average

solution quality obtained. In terms of the best solutions found, the EMeD-Part approach

has generated the best solutions in the 4 out of 20 cases (20%) while the GA method has

generated the best solutions in 0 out of 20 cases (0%). In terms of the average number of

evaluations performed, the EMeD-Part approach needed about 1.5 times (1.26 exactly) less

evaluations than the GA, yet achieving better solution quality. In terms of the unfeasible

solutions generated, the EMeD-Part approach has always generated feasible solutions (i.e.,

0% unfeasible or 100% feasible) while the GA has generated unfeasible solutions in 20 out

of 20 cases, which is 0 feasible ones - i.e., 100% infeasible or 0% feasible. In terms of the

computation time, the EMeD-Part approach was about 6 times (5.54 exactly) faster than

the GA. The last table column indicates that there was a statistically significant difference

between the results of the EMeD-Part and the GA in 4 out of 4 cases. Table 5.6 shows the

optimization rates of the approaches EMeD-Part and GA for the moderate size problem

set CMP. The best performance rates (%Best) for the EMeD-Part were significantly higher

than the GA approach. In summary, as also indicated by the Table. 7.3, the EMeD-Part

approach has shown considerably better performance than the GA method in all aspects.

120


ble

7.3:

Que

ryin

gpe

rfor

man

cere

sult

sfo

rth

em

oder

ate

size

prob

lem

set

CM

P.

BE

med

-Pa

rtG

ASt

at.

sign

.B

est

Avg

Eva

lsT

ime

Err

Bes

tA

vgE

vals

Err

400

31,2

43,1

85.0

31,7

87,7

27.6

4,29

62,

104.

20%

195,

518,

928.

024

3,30

9,07

8.0

5,88

010

0%⊕

600

29,4

21,6

66.0

30,3

10,5

64.2

5,16

63,

491.

60%

129,

575,

730.

019

7,77

7,28

4.8

5,87

410

0%⊕

800

29,9

27,4

33.0

30,6

37,7

20.2

4,50

02,

843.

00%

188,

786,

259.

025

6,01

4,77

2.2

5,75

410

0%⊕

1,00

030

,184

,227

.030

,721

,934

.44,

350

3,46

3.0

0%15

3,25

7,58

0.0

198,

972,

144.

05,

748

100%

⊕A

vg30

,194

,127

.830

,864

,486

.64,

578

2,97

5.5

0%16

6,78

4,62

4.3

224,

018,

319.

85,

814

100%

⊕Ta

ble

7.4:

Opt

imiz

atio

nra

tes

for

the

CM

P.

BE

med

-Pa

rtG

A

Bes

tA

vgE

rrB

est

Avg

Err

400

87.5

8%87

.36%

0%22

.27%

3.27

%10

0%60

088

.30%

87.9

5%0%

48.4

8%21

.37%

100%

800

88.1

0%87

.82%

0%24

.94%

-1.7

9%10

0%1,

000

88.0

0%87

.79%

0%39

.07%

20.8

9%10

0%

Avg

88.0

0%87

.73%

0%33

.69%

10.9

4%10

0%

121



The querying performance for the larger size problem set CLP presented in Table. 7.5

indicates that the EMeD-Part approach has generated better results in general. For this

problem set, the number of clusters detected after the hierarchical clustering is 5 clusters.

The EMeD-Part approach has outperformed the method GA for both the best solutions

found and average solution quality obtained. In terms of the best solutions found, the

EMeD-Part approach has generated the best solutions in the 11 out of 20 cases (55%) while

the GA method has generated the best solutions in 0 out of 20 cases (0%). In terms of the

average number of evaluations performed, the EMeD-Part approach needed about 5 times

(4.65 exactly) less evaluations than the GA, yet achieving better solution quality. In terms of

the unfeasible solutions generated, the EMeD-Part approach has always generated feasible

solutions (i.e., 0% unfeasible or 100% feasible) while the GA has generated unfeasible

solutions in 20 out of 20 cases, which is 0 feasible ones - i.e., 100% unfeasible or 0% feasible.

In terms of the computation time, the EMeD-Part approach was about 17 times 16.99

exactly) faster than the GA. The last table column indicates that there was a statistically

significant difference between the results of the EMeD-Part and the GA in 4 out of 4

cases. Table 5.6 shows the optimization rates of the approaches EMeD-Part and GA for the

moderate size problem set CMP. The best performance rates (%Best) for the EMeD-Part

were slightly higher than the GA approach, we mention that the unfeasible solutions

generated by the GA are significantly higher(100% of infeasible solutions). In summary,

as also indicated by the Fig. 7.5, the EMeD-Part approach has again shown considerably

better performance than the GA method in all aspects.

122


ble

7.5:

Que

ryin

gpe

rfor

man

cere

sult

sfo

rth

ela

rger

size

prob

lem

set

CL

P.

BE

med

-Pa

rtG

ASt

at.

sign

.B

est

Avg

Eva

lsT

ime

Err

Bes

tA

vgE

vals

Err

400

91,2

15,1

89.0

91,2

22,6

18.2

1,42

887

1.0

0%53

1,00

0,00

0.0

697,

200,

000.

05,

904

100%

⊕60

086

,826

,786

.086

,957

,203

.21,

356

1,17

5.8

0%37

6,00

0,00

0.0

761,

600,

000.

05,

940

100%

⊕80

086

,826

,786

.087

,053

,276

.41,

104

1,25

8.4

0%41

0,00

0,00

0.0

707,

000,

000.

05,

904

100%

⊕1,

000

86,8

26,7

86.0

86,9

35,6

38.0

1,20

01,

504.

40%

635,

000,

000.

090

8,40

0,00

0.0

5,94

010

0%⊕

Avg

87,9

23,8

86.8

88,0

42,1

84.0

1,27

21,

202.

40%

488,

000,

000.

076

8,55

0,00

0.0

5,92

210

0%⊕

Tabl

e7.

6:Q

uery

ing

perf

orm

ance

resu

lts

for

the

larg

ersi

zepr

oble

mse

tC

LP

.

BE

med

-Pa

rtG

A

Bes

tA

vgE

rrB

est

Avg

Err

400

85.4

1%85

.41%

0%15

.07%

-11.

52%

100%

600

86.1

1%86

.09%

0%39

.86%

-21.

82%

100%

800

86.1

1%86

.08%

0%34

.42%

-13.

09%

100%

1,00

086

.11%

86.0

9%0%

-1.5

7%-4

5.30

%10

0%

Avg

85.9

4%85

.92%

0%21

.94%

-22.

93%

100%

123


7.4 Conclusions

We have presented a new methodology called EMeD-Part based on Jaccard index, data

mining and discrete particle swarm optimization to solve the horizontal partitioning in data

warehouses. We have used three class of problem sets, the smaller size set, the moderate

size set and the larger size set to test the effectiveness of the c against the well-known

approach, the genetic algorithm based approach (GA) on a fairly large benchmark data

warehouse (APB-I benchmark). The EMeD-Part method is found to be much more effective

than the genetic algorithm based approaches in all cases. Furthermore, the EMeD-Part was

able to obtain optimal/sub-optimal solutions for the problem sets CSP, CMP and CLP using

much smaller number of evaluations compare to those for the GA - i.e., the EMeD-Part has

converged much faster. The experiments have confirmed that using the exponential penalty

function can greatly avoid the generation of unfeasible solutions in genetic algorithms.

The query workload size used in the experiments was considerably higher than those for

the relevant work found in the literature. The EMeD-Part is an efficient methodology and

could also be used to solve other problems in data warehouses.

124

CH

AP

TE

R

8CONCLUSIONS

“The only real valuable thing is intuition.”

-Albert Einstein

8.1 Results

This thesis presents a contribution to the administration and tuning problems in a data

warehouse. All selection of optimization structures in a data warehouse are NP-complete.

In the last decade, a few work have been introduced for solving the bitmap join indexes

selection problem using statistics such as the Data mining based approach [5, 13], and/or

using meta-heuristics such as the genetic algorithm based approach [15].

Our first motivation was to propose a new approach for tackling the bitmap join index

selection problem for the single attribute case yet to be efficient [72]. In this aspect, we

have proposed a new approach based particle swarm optimization that has solved the

BJISP efficiently. Several tests have proved the accuracy of the proposed approach against

other existing approaches, We have also used several classes of problem sets, the smaller,

moderate and larger size to show the effectiveness of the proposed approach. It should be

noted here that the sizes of problem sets used are much larger than those used by other

approaches used in our comparisons.

The second motivation was to propose a mathematical formulation for the BJISP and

solve the problem effectively with an existing solver. In this aspect, we have propose a

mixed-integer linear programming formulation for modelling the BJISP [73]. We have

125

CHAPTER 8. CONCLUSIONS

tested the accuracy and the effectiveness of this approach against particle swarm opti-

mization and improved genetic algorithm, using the CPLEX solver and a data warehouse

benchmark. The experimental results have been very promising in all terms, the accuracy,

effectiveness and execution time.

Finally, we had a motivation to solve another complex problem in the data warehouse

tuning. In this aspect, we have proposed a new methodology called Emed-part to effectively

solve the horizontal partitioning problem in a data warehouse. This method was inspired

from the attraction between predicates using Jaccard index. Then we have used the data

mining, precisely the hierarchical clustering for grouping of predicates with the same

attraction into the same cluster. This can reduce the problem complexity enormously. In

the last step of this methodology we have used the discrete particle swarm optimization

to select the best partitioning schema. Again, several tests have been performed to test

the effectiveness and the accuracy of the proposed methodology against other existing

approach, and extremely promising results have been obtained.

8.2 Future research directions

In the future, several issues could be investigated to solve other problems in the data

warehouse tuning such as:

• Improve our linear model to take the update cost of index into consideration.

• Improve the proposed solution to take the incremental aspect into consideration,

when a change in data warehouse or query workload is made.

• Exploit potential parallelism of GPUs to solve the tuning problems effectively.

• Use the mixed-integer linear programming to solve the reference horizontal parti-

tioning problem in data warehouses.

126

BIBLIOGRAPHY

[1] S. AGRAWAL, S. CHAUDHURI, L. KOLLAR, A. MARATHE, V. NARASAYYA, AND

M. SYAMALA, Database tuning advisor for microsoft sql server 2005: Demo, in Pro-

ceedings of the 2005 ACM SIGMOD International Conference on Management of

Data, SIGMOD ’05, New York, NY, USA, 2005, ACM, pp. 930–932.

[2] S. AGRAWAL, S. CHAUDHURI, AND V. R. NARASAYYA, Automated selection of materi-

alized views and indexes in sql databases.

[3] S. AGRAWAL, V. NARASAYYA, AND B. YANG, Integrating vertical and horizontal

partitioning into automated physical database design, in Proceedings of the 2004

ACM SIGMOD International Conference on Management of Data, SIGMOD ’04, New

York, NY, USA, 2004, ACM, pp. 359–370.

[4] M. AHMAD, A. ABOULNAGA, S. BABU, AND K. MUNAGALA, Interaction-aware

scheduling of report-generation workloads, The VLDB Journal, 20 (2011), pp. 589–615.

[5] K. AOUICHE, J. DARMONT, O. BOUSSAÏD, AND F. BENTAYEB, Automatic selection

of bitmap join indexes in data warehouses, in Data Warehousing and Knowledge

Discovery, Springer, 2005, pp. 64–73.

[6] F. BAIÃO, M. MATTOSO, AND G. ZAVERUCHA, Horizontal fragmentation in object

dbms: New issues and performance evaluation, in Performance, Computing, and

Communications Conference, 2000. IPCCC’00. Conference Proceeding of the IEEE

International, IEEE, 2000, pp. 108–114.

[7] E. BARALIS, S. PARABOSCHI, AND E. TENIENTE, Materialized views selection in a

multidimensional database, in Proceedings of the 23rd International Conference on

Very Large Data Bases, VLDB ’97, San Francisco, CA, USA, 1997, Morgan Kaufmann

Publishers Inc., pp. 156–165.

127

BIBLIOGRAPHY

[8] L. BELLATRECHE, Selection of redundant and non redundant optimization struc-

tures in vldbs, in Database and Expert Systems Applications, 2007. DEXA’07. 18th

International Workshop on, IEEE, 2007, pp. 819–824.

[9] L. BELLATRECHE, K. BOUKHALFA, AND H. I. ABDALLA, Saga: A combination of

genetic and simulated annealing algorithms for physical data warehouse design, in

Proceedings of the 23rd British National Conference on Databases, Conference on

Flexible and Efficient Information Handling, BNCOD’06, Berlin, Heidelberg, 2006,

Springer-Verlag, pp. 212–219.

[10] L. BELLATRECHE, K. BOUKHALFA, P. RICHARD, AND K. Y. WOAMENO, Referential

horizontal partitioning selection problem in data warehouses: Hardness study and se-

lection algorithms, International Journal of Data Warehousing and Mining (IJDWM),

5 (2009), pp. 1–23.

[11] L. BELLATRECHE, K. KARLAPALEM, AND M. MOHANIA, Olap query processing for

partitioned data warehouses, in Database Applications in Non-Traditional Environ-

ments, 1999.(DANTE’99) Proceedings. 1999 International Symposium on, IEEE, 1999,

pp. 35–42.

[12] L. BELLATRECHE, K. KARLAPALEM, M. MOHANIA, AND M. SCHNEIDER, What can

partitioning do for your data warehouses and data marts?, in Database Engineering

and Applications Symposium, 2000 International, IEEE, 2000, pp. 437–445.

[13] L. BELLATRECHE, R. MISSAOUI, H. NECIR, AND H. DRIAS, A data mining approach

for selecting bitmap join indices., JCSE, 1 (2007), pp. 177–194.

[14] , Selection and pruning algorithms for bitmap index selection problem using

data mining, in Data Warehousing and Knowledge Discovery, I. Song, J. Eder, and

T. Nguyen, eds., vol. 4654 of Lecture Notes in Computer Science, Springer Berlin

Heidelberg, 2007, pp. 221–230.

[15] R. BOUCHAKRI AND L. BELLATRECHE, On simplifying integrated physical database

design, in Advances in Databases and Information Systems, Springer, 2011, pp. 333–

346.

[16] R. BOUCHAKRI, L. BELLATRECHE, Z. FAGET, AND S. BRESS, A coding template for

handling static and incremental horizontal partitioning in data warehouses, Journal

of Decision Systems, 23 (2014), pp. 481–498.

128

BIBLIOGRAPHY

[17] K. BOUKHALFA, De la conception physique aux outils d’administration et de tuning

des entrepots de donnees, PhD thesis, ISAE-ENSMA Ecole Nationale Superieure de

Mecanique et d’Aerotechique-Poitiers, 2009.

[18] A. BROWN AND D. PATTERSON, To Err is Human, in First Workshop on Evaluating

and Architecting System dependabilitY (EASY ’01), Göteborg, Sweden, July 2001.

[19] T. CALINSKI AND J. HARABASZ, A dendrite method for cluster analysis, Communica-

tions in Statistics, 3 (1974), pp. 1–27.

[20] S. CERI, M. NEGRI, AND G. PELAGATTI, Horizontal data partitioning in database

design, in Proceedings of the 1982 ACM SIGMOD International Conference on Man-

agement of Data, SIGMOD ’82, New York, NY, USA, 1982, ACM, pp. 128–136.

[21] S. CHAUDHURI, M. DATAR, AND V. NARASAYYA, Index selection for databases: A

hardness study and a principled heuristic solution, IEEE Trans. on Knowl. and Data

Eng., 16 (2004), pp. 1313–1323.

[22] S. CHAUDHURI AND U. DAYAL, An overview of data warehousing and olap technology,

SIGMOD Rec., 26 (1997), pp. 65–74.

[23] S. CHAUDHURI AND V. NARASAYYA, Self-tuning database systems: A decade of

progress, in Proceedings of the 33rd International Conference on Very Large Data

Bases, VLDB ’07, VLDB Endowment, 2007, pp. 3–14.

[24] S. CHOENNI, H. BLANKEN, AND T. CHANG, Index selection in relational databases,

in Proc. International Conference on Computing and Information, 1993, pp. 491–496.

[25] D. COMER, Ubiquitous b-tree, ACM Comput. Surv., 11 (1979), pp. 121–137.

[26] D. W. CORNELL AND P. YU, Integration of buffer management and query optimization

in relational database environment, in Proceedings of the 15th International Confer-

ence on Very Large Data Bases, VLDB ’89, San Francisco, CA, USA, 1989, Morgan

Kaufmann Publishers Inc., pp. 247–255.

[27] O. COUNCIL, Apb-i, olap benchmark, release ii,http:// www.olapcouncil.org/, 1998.

[28] B. DAGEVILLE, D. DAS, K. DIAS, K. YAGOUB, M. ZAIT, AND M. ZIAUDDIN, Automatic

sql tuning in oracle 10g, in Proceedings of the Thirtieth International Conference on

Very Large Data Bases - Volume 30, VLDB ’04, VLDB Endowment, 2004, pp. 1098–

1109.

129

BIBLIOGRAPHY

[29] R. C. EBERHART AND J. KENNEDY, A new optimizer using particle swarm theory.

[30] C. I. EZEIFE, Selecting and materializing horizontally partitioned warehouse views,

2001.

[31] S. FINKELSTEIN, M. SCHKOLNICK, AND P. TIBERIO, Physical database design for

relational databases, ACM Trans. Database Syst., 13 (1988), pp. 91–128.

[32] M. R. FRANK, E. R. OMIECINSKI, AND S. B. NAVATHE, Adaptive and automated

index selection in rdbms, in In Proceedings of International Conference on Extending

Database Technology, 1992, pp. 277–292.

[33] C.-W. FUNG, K. KARLAPALEM, AND Q. LI, Cost-driven evaluation of vertical class

partitioning in object-oriented databases, in Proceedings of the Fifth International

Conference on Database Systems for Advanced Applications (DASFAA), World Scien-

tific Press, 1997, pp. 11–20.

[34] T. GANDEM, Near optimal multiple choice index selection for relational databases,

Computers Mathematics with Applications, 37 (1999), pp. 111 – 120.

[35] A. G. GANEK AND T. A. CORBI, The dawning of the autonomic computing era, IBM

Syst. J., 42 (2003), pp. 5–18.

[36] H. GARCIA-MOLINA, W. J. LABIO, J. L. WIENER, AND Y. ZHUGE, Distributed and

parallel computing issues in data warehousing (abstract), in Proceedings of the Tenth

Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’98, New

York, NY, USA, 1998, ACM, pp. 77–.

[37] M. GOLFARELLI, S. RIZZI, AND E. SALTARELLI, Index selection techniques in data

warehouse systems, 2002.

[38] H. GUPTA, Selection of views to materialize in a data warehouse, in Proceedings of

the 6th International Conference on Database Theory, ICDT ’97, London, UK, UK,

1997, Springer-Verlag, pp. 98–112.

[39] H. GUPTA, V. HARINARAYAN, A. RAJARAMAN, AND J. D. ULLMAN, Index selection for

olap, in Proceedings of the Thirteenth International Conference on Data Engineering,

ICDE ’97, Washington, DC, USA, 1997, IEEE Computer Society, pp. 208–219.

[40] W. H. INMON, Building the data warehouse, John wiley & sons, 2005.

130

BIBLIOGRAPHY

[41] B. JARBOUI, M. CHEIKH, P. SIARRY, AND A. REBAI, Combinatorial particle swarm

optimization (cpso) for partitional clustering problem, Applied Mathematics and

Computation, 192 (2007), pp. 337 – 345.

[42] T. JOHNSON AND D. SHASHA, Some approaches to index design for cube forest, IEEE

Data Eng. Bull., 20 (1997), pp. 27–35.

[43] K. KARLAPALEM AND Q. LI, Partitioning schemes for object oriented databases,

in in Proceeding of the Fifth International Workshop on Research Issues in Data

Engineering- Distributed Object Management, RIDE-DOM’95, 1995, pp. 42–49.

[44] J. KENNEDY AND R. EBERHART, Particle swarm optimization, 1995.

[45] J. KENNEDY AND R. C. EBERHART, A discrete binary version of the particle swarm

algorithm, in Systems, Man, and Cybernetics, 1997. Computational Cybernetics and

Simulation., 1997 IEEE International Conference on, vol. 5, IEEE, 1997, pp. 4104–

4108.

[46] J. KENNEDY AND R. C. EBERHART, Swarm Intelligence, Morgan Kaufmann Publish-

ers Inc., San Francisco, CA, USA, 2001.

[47] R. KIMBALL AND M. ROSS, The Data Warehouse Toolkit: The Complete Guide to

Dimensional Modeling, John Wiley & Sons, Inc., New York, NY, USA, 2nd ed., 2002.

[48] J. KRATICA, I. LJUBIC, AND Y. P. TOSIC, DUVSAN, A genetic algorithm for the index

selection problem.

[49] B. LADJEL, Utilisation des vues matérialisées, des index et de la fragmentation dans

la conception logique et physique d’un entrepp

etdedonnées,PhDthesis,dec2000.

[50] C. MACDONALD, N. TONELLOTTO, AND I. OUNIS, Learning to predict response times

for online query scheduling, in Proceedings of the 35th International ACM SIGIR

Conference on Research and Development in Information Retrieval, SIGIR ’12, New

York, NY, USA, 2012, ACM, pp. 621–630.

[51] H. MAHBOUBI AND J. DARMONT, Data mining-based fragmentation of xml data ware-

houses, in Proceedings of the ACM 11th International Workshop on Data Warehousing

and OLAP, DOLAP ’08, New York, NY, USA, 2008, ACM, pp. 9–16.

[52] H. MAHBOUBI AND J. DARMONT, Data mining-based fragmentation of xml data ware-

houses, in Proceedings of the ACM 11th international workshop on Data warehousing

and OLAP, ACM, 2008, pp. 9–16.

131

BIBLIOGRAPHY

[53] F. R. MCFADDEN AND J. A. HOFFER, Modern Database Management (4th Ed.),

Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1994.

[54] R. G. MICHAEL AND S. J. DAVID, Computers and intractability: a guide to the theory

of np-completeness, WH Freeman & Co., San Francisco, (1979).

[55] P. MISHRA AND M. H. EICH, Join processing in relational databases, ACM Computing

Surveys (CSUR), 24 (1992), pp. 63–113.

[56] S. B. NAVATHE AND M. RA, Vertical partitioning for database design: A graphical

algorithm, SIGMOD Rec., 18 (1989), pp. 440–450.

[57] R. NG, C. FALOUTSOS, AND T. SELLIS, Flexible and adaptable buffer management

techniques for database management systems, IEEE Trans. Comput., 44 (1995),

pp. 546–560.

[58] P. O’NEIL AND G. GRAEFE, Multi-table joins through bitmapped join indices, SIG-

MOD Rec., 24 (1995), pp. 8–11.

[59] P. O’NEIL AND D. QUASS, Improved query performance with variant indexes, SIG-

MOD Rec., 26 (1997), pp. 38–49.

[60] P. E. O’NEIL, Model 204 architecture and performance, in Proceedings of the 2Nd

International Workshop on High Performance Transaction Systems, London, UK, UK,

1989, Springer-Verlag, pp. 40–59.

[61] ORACLE, Partitioning in oracle database 11g, 2007.

[62] M. T. ÖZSU AND P. VALDURIEZ, Principles of Distributed Database Systems (2Nd

Ed.), Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1999.

[63] R. POLI AND D. BROOMHEAD, Exact analysis of the sampling distribution for the

canonical particle swarm optimiser and its convergence during stagnation, in Pro-

ceedings of the 9th annual conference on Genetic and evolutionary computation, ACM,

2007, pp. 134–141.

[64] S. SARAWAGI, Indexing olap data, Data Engineering Bulletin, 20 (1996), pp. 36–43.

[65] Y. SHI AND R. EBERHART, A modified particle swarm optimizer, in Evolutionary

Computation Proceedings, 1998. IEEE World Congress on Computational Intelligence.,

The 1998 IEEE International Conference on, IEEE, 1998, pp. 69–73.

132

BIBLIOGRAPHY

[66] A. SILBERSCHATZ, H. KORTH, AND S. SUDARSHAN, Database Systems Concepts,

McGraw-Hill, Inc., New York, NY, USA, 5 ed., 2006.

[67] D. SPIELMAN AND S.-H. TENG, Smoothed analysis of algorithms: Why the simplex

algorithm usually takes polynomial time, in Proceedings of the thirty-third annual

ACM symposium on Theory of computing, ACM, 2001, pp. 296–305.

[68] M. STEINBRUNN, G. MOERKOTTE, AND A. KEMPER, Heuristic and randomized

optimization for the join ordering problem, The VLDB Journal‚ÄîThe International

Journal on Very Large Data Bases, 6 (1997), pp. 191–208.

[69] T. STÖHR, H. MÄRTENS, AND E. RAHM, Multi-dimensional database allocation for

parallel data warehouses, in Proceedings of the 26th International Conference on Very

Large Data Bases, VLDB ’00, San Francisco, CA, USA, 2000, Morgan Kaufmann


[70] T. STOHR, H. MARTENS, AND E. RAHM, Multi-dimensional database allocation for

parallel data warehouses, in Proc. 26th VLDB Conference, 2000, pp. 273–284.

[71] P.-N. TAN, M. STEINBACH, AND V. KUMAR, Introduction to Data Mining, (First

Edition), Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.

[72] L. TOUMI, A. MOUSSAOUI, AND A. UGUR, Particle swarm optimization for bitmap

join indexes selection problem in data warehouses, The Journal of Supercomputing,

68 (2014), pp. 672–708.

[73] L. TOUMI, A. MOUSSAOUI, AND A. UGUR, A linear programming approach for

bitmap join indexes selection in data warehouses, Procedia Computer Science, 52

(2015), pp. 169 – 177. The 6th International Conference on Ambient Systems, Networks

and Technologies (ANT-2015), the 5th International Conference on Sustainable Energy

Information Technology (SEIT-2015).

[74] D. N. TRAN, P. C. HUYNH, Y. C. TAY, AND A. K. H. TUNG, A new approach to

dynamic self-tuning of database buffers, Trans. Storage, 4 (2008), pp. 3:1–3:25.

[75] P. UMAPATHY, C. VENKATASESHAIAH, AND M. S. ARUMUGAM, Particle swarm

optimization with various inertia weight variants for optimal power flow solution,

Discrete Dynamics in Nature and Society, 2010 (2010).

[76] P. VALDURIEZ, Join indices, ACM Trans. Database Syst., 12 (1987), pp. 218–246.

133

BIBLIOGRAPHY

[77] J. H. WARD, Hierarchical grouping to optimize an objective function, Journal of the

American Statistical Association, 58 (1963), pp. 236–244.

[78] G. WEIKUM, A. MOENKEBERG, C. HASSE, AND P. ZABBACK, Self-tuning database

technology and information services: From wishful thinking to viable engineering, in

Proceedings of the 28th International Conference on Very Large Data Bases, VLDB ’02,

VLDB Endowment, 2002, pp. 20–31.

[79] K. WU, E. J. OTOO, AND A. SHOSHANI, Optimizing bitmap indices with efficient

compression, ACM Trans. Database Syst., 31 (2006), pp. 1–38.

[80] J. YANG, K. KARLAPALEM, AND Q. LI, Algorithms for materialized view design in

data warehousing environment, in Proceedings of the 23rd International Conference on

Very Large Data Bases, VLDB ’97, San Francisco, CA, USA, 1997, Morgan Kaufmann


[81] Y. ZHANG AND M. E. ORLOWSKA, On fragmentation approaches for distributed

database design, Information Sciences-Applications, 1 (1994), pp. 117–132.

[82] D. C. ZILIO, J. RAO, S. LIGHTSTONE, G. LOHMAN, A. STORM, C. GARCIA-ARELLANO,

AND S. FADDEN, Db2 design advisor: Integrated automatic physical database design,

in Proceedings of the Thirtieth International Conference on Very Large Data Bases -

Volume 30, VLDB ’04, VLDB Endowment, 2004, pp. 1087–1097.

134

:ملخص

في هذه األطروحة ستعرض تدخل في إطار التصميم الفيزيائي لمستودعات المعطيات. ن األطروحةهذه

مختلف قمنا بتقديمطرق التجزئة األفقية و الفهرسة. ,بالتفصيل .المعطيات مخازنل اإلدارة األلية

األفقية.للتجزئة تصميم نأحس الفهارس و نالختيار أحسالمستعملة تالخوارزميا

بمنهجية األطروحة تحوي ثالثة بحوث أصلية, األول و الثاني يتعلقان بالفهرسة األلية أما الثالث يتعلق

جديدة لتجزئة مخازن المعطيات أفقيا.

المقترحة قعلى إجراء تجارب نظرية و تطبيقية على الطر قمناإلثبات النتائج المحصل عليها, ,في النهاية

و لقد أثبت الطرق المقترحة ,باستعمال نموذج نظري لقياس كلفة البحث مع الطرق الموجودة من قبل

فعليتها من حيث الزمن و الجودة مقارنة بالطرق الموجودة.

, لمستودع المعطيات يالتصميم الفيزيائ, لمستودع المعطيات ذاتية: مستودع المعطيات, اإلدارة المفاتيح

, البرمجة الخطية.ةاألفقي التجزئةات, يالطلب ماثليةت

Résumé : Cette thèse présente un travail sur l’administration et tuning d’entrepôts de données, précisément

aborde le problème de sélection d’indexes binaires de jointure (PSIJB) et le problème de la

fragmentation horizontale dérivé (PFHD). La thèse propose trois contributions à la recherche, la

première contribution utilise les essaims de particule pour résoudre le PSIJB. La deuxième permet de

modéliser le PSIJB en utilisant la programmation linéaire en nombre entier. Dans la troisième

contribution, nous avons proposé une nouvelle méthodologie pour résoudre le PFHD. Toutes les

approches proposées ont été validées par des tests sous un entrepôt de données utilisant des classes de

problèmes. Ces tests ont montré l’efficacité des approches proposées en terme de temps, de qualité

ainsi que d’exactitudes.

Mots clé : Entrepôts de données, auto-administration et tuning, conception physique des entrepôts

de données, optimisation, programmation linéaire, fouille de données, fragmentation

horizontale, sélection d’indexes binaires de jointures.

Abstract:

The present theses presents a work on the administration and tuning of data warehouses, precisely on

the bitmap join indexes selection problem (BJISP) and the reference horizontal partitioning problem

(RHPP). The dissertation proposes three contributions; first, we propose a meta-heuristic approach

for solving the BJISP. The second contribution proposes a mathematical model to solve the BJISP.

The third contribution is a new methodology to solve the RHPP. All contribution are tested on a

benchmark data warehouse using three classes of problems, have showed more accuracy and effective

compared to the well-known existed approaches.

Key words: Data warehouse, auto-administration and tuning, physical design of data warehouse,

Bitmap join index selection, horizontal partitioning, linear programming, optimization, Data mining

Documents

MINISTERE DE L’ENSEIGNEMENT SUPERIEUR ET DE LA … · UNIVERSITE FARHAT ABBAS ... v. TABLE OF CONTENTS 5.4.4.1 Scalability results for the smaller size problem set CSP.. 76 5.4.4.2