Ph.D. Student: TA Minh Thuy, USTH 2010. Director of thesis: Prof. LE Thi Hoai An. Co-director: Dr. Lydia Boudjeloud-Assala. LITA, EA3097 - UFR MIM, University Paul Verlaine - Metz. Optimization and operations research techniques for mining evolving and temporal data


Ph.D. Student: TA Minh Thuy, USTH 2010

Director of thesis: Prof. LE Thi Hoai An
Co-director: Dr. Lydia Boudjeloud-Assala

LITA, EA3097 - UFR MIM, University Paul Verlaine - Metz, France

Optimization and operations research techniques for mining evolving and temporal data


About me

Objective:
- Develop new models
- Develop new optimization methods

Problems: unsupervised classification (clustering) and variable selection for mining evolving and temporal data (data streams).

Start date: 1 Dec 2010
Team: Algorithms and Optimization
Category: Information Technology
Fields of research: Data Mining, Data Stream, Clustering, Classification, Feature Selection


Context

For many recent applications, the concept of a data stream is more appropriate than that of a data set.

The volume of such data is so large that it may be impossible to store the data on disk. Furthermore, even when the data can be stored, the volume of incoming data may be so large that no record can be processed more than once.

The data in a stream show temporal correlations. Such correlations can help detect important characteristics of the data's evolution, and can be used to develop efficient mining algorithms.
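The one-pass constraint can be illustrated with a minimal sketch, assuming a stream of plain numbers. It uses Welford's running update, chosen here only as a familiar example of single-pass processing and not taken from the slides; the names `RunningStats` and `summarize` are hypothetical.

```python
class RunningStats:
    """Mean and variance of a stream, updated one record at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        # Welford's update: each record is seen once and then discarded.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 until at least one record has arrived.
        return self._m2 / self.count if self.count else 0.0


def summarize(stream):
    """Consume an iterable once and return its running statistics."""
    stats = RunningStats()
    for value in stream:  # also works for generators that cannot be replayed
        stats.update(value)
    return stats
```

Memory use stays constant regardless of the stream length, which is the property the slide asks of stream mining algorithms.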


Context

The stream model is motivated by emerging applications involving massive data sets.
Examples: telephone records, customer click streams, multimedia data, financial transactions, ...

In these cases, the data evolve continuously. The evolution may come from:
- the dynamism of the services: content, structure, promotions, ...
- changes in user behavior, client interest, ...
- time: time of day, day of the week, ...
- events: summer vacations, New Year, ...

Therefore, data streams pose special challenges for data mining algorithms. It is necessary to design mining algorithms that account for changes in the underlying structure of the data stream.


Problems:

Problem 1: Clustering data streams.

Existing methods for mining data streams focus on the whole period of the data. Consequently, they detect only the behaviors that are predominant over the entire period of analysis; behaviors occurring in short periods of time are not detected.

Model for the data stream clustering problem: fixed windows.

The analyzed time period is divided into more significant sub-periods, with the aim of detecting the evolution of old patterns or the emergence of new ones, which would not have been revealed by a global analysis over the whole time period.
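The fixed-window idea can be sketched as follows, assuming records are `(timestamp, value)` pairs with numeric timestamps. The function names are hypothetical, and the tiny 1-D k-means is only a stand-in clusterer, not the DC programming approach used in the thesis.

```python
from collections import defaultdict


def fixed_windows(records, width):
    """Group (timestamp, value) records into consecutive windows of equal width.

    Window i covers timestamps in [i * width, (i + 1) * width).
    """
    windows = defaultdict(list)
    for timestamp, value in records:
        windows[int(timestamp // width)].append(value)
    return dict(windows)


def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means (Lloyd's algorithm), a stand-in clusterer."""
    ordered = sorted(values)
    # Deterministic initialisation: spread the k centres over the sorted values.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        # Keep the old centre if a group ends up empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return sorted(centres)


def cluster_per_window(records, width, k):
    """Cluster each fixed window separately so short-lived patterns stay visible."""
    windows = fixed_windows(records, width)
    return {i: kmeans_1d(values, k) for i, values in sorted(windows.items())}
```

Clustering each window on its own lets a pattern that exists only in one sub-period show up as its own centre, instead of being averaged away over the whole period.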


Problems:

Problem 2: Detecting changes in data streams.

In a data stream, the data patterns may evolve over time. How do the data change over time?
- Disappearance of a cluster of behavior
- Appearance of a cluster of behavior
- Splitting of a cluster of behavior
- Merging of two or more clusters of behavior
- No change

Model for the data stream change detection problem: sliding windows.
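The change types listed above can be illustrated with a rough sketch, assuming each window has already been summarized by 1-D cluster centres. The matching rule (two centres are related when they lie within `radius` of each other) and all names are assumptions made for illustration, not the method of the thesis.

```python
from collections import deque


def sliding_windows(stream, size, step=1):
    """Yield overlapping windows of `size` items, advancing by `step` items."""
    window = deque(maxlen=size)  # the oldest item drops out automatically
    for n, item in enumerate(stream, start=1):
        window.append(item)
        if n >= size and (n - size) % step == 0:
            yield list(window)


def classify_changes(old_centres, new_centres, radius):
    """Label how clusters changed between two consecutive windows."""
    def related(a, b):
        # Assumed matching rule: centres closer than `radius` are the same behavior.
        return abs(a - b) <= radius

    events = []
    for i, old in enumerate(old_centres):
        matches = [j for j, new in enumerate(new_centres) if related(old, new)]
        if not matches:
            events.append(("disappearance", i))
        elif len(matches) > 1:
            events.append(("split", i))
    for j, new in enumerate(new_centres):
        matches = [i for i, old in enumerate(old_centres) if related(old, new)]
        if not matches:
            events.append(("appearance", j))
        elif len(matches) > 1:
            events.append(("merge", j))
    return events or [("no change", None)]
```

In use, each window produced by `sliding_windows` would be clustered, and `classify_changes` would compare the centres of one window with those of the next.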


Problems:

Problem 3: Feature selection-based clustering.

An object can be represented by variables of different types (quantitative, qualitative or structured). The nature of the variables influences the definition of similarity between objects, so their choice is very important.

The question is how to choose the relevant variables and eliminate the redundant ones.

Applications include:
- medical diagnosis (cancer risk assessment, detection of cardiac arrhythmia, ...)
- text categorization (classification of email as spam or not, classification of web pages, ...)
- pattern recognition (face recognition, handwritten digits, ...)
- ...
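One simple way to picture the elimination of redundant variables is a correlation filter, sketched below for quantitative variables only. This is a generic technique swapped in for illustration, not the selection method studied in the thesis; the function names and the 0.95 threshold are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (norm_x * norm_y)


def drop_redundant(columns, threshold=0.95):
    """Keep a variable only if it is not strongly correlated with one already kept.

    `columns` maps a variable name to its list of values.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

For example, a variable that is an exact multiple of an earlier one has correlation 1 with it and is dropped, while a weakly correlated variable is kept.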


Methodology

We use mathematical techniques, including optimization techniques, to address data mining problems. Many real-world optimization problems are nonconvex.

To solve nonconvex optimization problems, we study DC programming and DCA (DC Algorithms), where DC stands for "difference of convex functions".

DC programming and DCA were introduced in 1985 by Pham Dinh Tao and have been developed by Le Thi Hoai An and Pham Dinh Tao since 1994. They have become classic and are now increasingly popular.
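As a toy illustration of DCA, consider minimizing f(x) = x^4 - 2x^2, written as g(x) - h(x) with g(x) = x^4 and h(x) = 2x^2, both convex. Each DCA step replaces h by its linearization at the current point and minimizes the resulting convex function. The example function and the name `dca_quartic` are chosen here for illustration and do not come from the slides.

```python
import math


def dca_quartic(x0, iterations=50):
    """DCA for f(x) = x**4 - 2*x**2 with g(x) = x**4 and h(x) = 2*x**2."""
    x = x0
    for _ in range(iterations):
        y = 4.0 * x  # gradient of h at the current iterate
        # Minimise the convex surrogate g(x) - y*x: 4*x**3 = y, so x = cbrt(y / 4).
        x = math.copysign(abs(y / 4.0) ** (1.0 / 3.0), y)
    return x
```

Starting from a positive point the iterates approach x = 1, and from a negative point x = -1, the two global minimizers of f. Starting exactly at 0 the iteration stays at 0, a critical point that is not a minimizer, which shows that DCA is a local method whose result depends on the starting point.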


Results:

• TA Minh Thuy, LE-THI Hoai An, Lydia Boudjeloud-Assala: Clustering Data Stream Based on Sub-Windows: A DC Programming Approach. 15th Austrian-French-German Conference on Optimization (AFG11), Toulouse, France, 19-23 September 2011, pp. 135-136.
