33
1 DR / CNRS - UPSaclay LAL & LRI BALÁZS KÉGL MdC / Telecom ParisTech LTCI ALEXANDRE GRAMFORT MdC / Mines ParisTech CGS AKIN KAZAKÇI RAPID ANALYTICS AND MODEL PROTOTYPING Center for Data Science Paris-Saclay Postdoc / UPMC LIP6 DJALEL BENBOUZID

DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

1

DR / CNRS - UPSaclay LAL & LRI

BALÁZS KÉGL

MdC / Telecom ParisTech LTCI

ALEXANDRE GRAMFORT

MdC / Mines ParisTech CGS

AKIN KAZAKÇI

RAPID ANALYTICS AND MODEL PROTOTYPING

Center for Data ScienceParis-Saclay

Postdoc / UPMC LIP6

DJALEL BENBOUZID

Page 2: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

• Paris-Saclay Center for Data Science

• the data science ecosystem

• Analytics tools

• data challenges

• rapid analytics and model prototyping

2

OUTLINE

Page 3: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

DATA SCIENCE

3

Design of automated methods

to analyze massive and complex data

to extract useful information

Page 4: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay4

DATA SCIENCE=

BIG DATA

We are focusing on inference:

data knowledge

Interfacing with infrastructure, security, production

Page 5: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

5

UNIVERSITÉ PARIS-SACLAY

19 founding partners

Page 6: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

UNIVERSITÉ PARIS-SACLAY

6

+ horizontal multi-disciplinary and multi-partner initiatives (“lidex”) to create cohesion

Page 7: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay7

Center for Data ScienceParis-Saclay

A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay

http://www.datascience-paris-saclay.fr/

Biology & bioinformaticsIBISC/UEvry LRI/UPSudHepatinovCESP/UPSud-UVSQ-Inserm IGM-I2BC/UPSud MIA/AgroMIAj-MIG/INRALMAS/Centrale

ChemistryEA4041/UPSud

Earth sciencesLATMOS/UVSQ GEOPS/UPSudIPSL/UVSQLSCE/UVSQLMD/Polytechnique

EconomyLM/ENSAE RITM/UPSudLFA/ENSAE

NeuroscienceUNICOG/InsermU1000/InsermNeuroSpin/CEA

Particle physics astrophysics & cosmologyLPP/Polytechnique DMPH/ONERACosmoStat/CEAIAS/UPSudAIM/CEALAL/UPSud

The Paris-Saclay Center for Data ScienceData Science for scientific Data

250 researchers in 35 laboratories

Machine learningLRI/UPSud LTCI/TelecomCMLA/Cachan LS/ENSAELIX/PolytechniqueMIA/AgroCMA/PolytechniqueLSS/SupélecCVN/Centrale LMAS/CentraleDTIM/ONERAIBISC/UEvry

VisualizationINRIALIMSI

Signal processingLTCI/TelecomCMA/PolytechniqueCVN/CentraleLSS/SupélecCMLA/CachanLIMSIDTIM/ONERA

StatisticsLMO/UPSud LS/ENSAELSS/SupélecCMA/PolytechniqueLMAS/CentraleMIA/AgroParisTech

Data sciencestatistics

machine learninginformation retrieval

signal processingdata visualization

databases

Domain sciencehuman society

life brain earth

universe

Tool buildingsoftware engineering

clouds/gridshigh-performance

computingoptimization

Data scientist

Applied scientist

Domain scientist

Data engineer

Software engineer

Center for Data ScienceParis-Saclay

datascience-paris-saclay.fr

@SaclayCDS

LIST/CEA

Page 8: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay8

THE DATA SCIENCE ECOSYSTEM

Data domainsenergy and physical sciences

health and life sciences Earth and environment

economy and society brain

Data scientist

Data trainer

Applied scientist

Domain scientistSoftware engineer

Data engineer

Data sciencestatistics

machine learning information retrieval

signal processing data visualization

databases

Tool building software engineering

clouds/grids high-performance

computing optimization

Page 9: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

TOOLS

9

We are designing and learning to manage

tools

to accompany data science projects

with different needs

Page 10: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

TOOLS: LANDSCAPE TO ECOSYSTEM

10

Data scientist

Data trainer

Applied scientist

Domain expertSoftware engineer

Data engineer

Tool building Data domains

Data sciencestatistics

machine learning information retrieval

signal processing data visualization

databases

software engineeringclouds/grids

high-performancecomputing

optimization

energy and physical sciences health and life sciences Earth and environment

economy and society brain

• interdisciplinary projects • matchmaking tool • design and innovation strategy workshops • data challenges

• coding sprints • Open Software Initiative • code consolidator and engineering projects

• bootcamps / hackathons • IT platform for linked data • annotation tool • SaaS data science platform

Page 11: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

TWO ANALYTICS TOOLS

11

RAPID ANALYTICS AND MODEL PROTOTYPING

DATA CHALLENGES

Page 12: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

• A data challenge is a recently developed unconventional dissemination and communication tool

• a scientific or industrial data producer arrives with a well-defined problem and a corresponding annotated data set

• defines a quantitative goal

• makes the problem and part of the data set (the training set) public on a dedicated site

• data science experts then take the public training data and submit solutions (predictions) for a test set with hidden annotations

• submissions are evaluated numerically using the quantitative measure

• contestants are listed on a leaderboard

• after a predefined time, typically a couple of months, the final results are revealed and the winners are awarded

12

DATA CHALLENGES

Page 13: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

• The HiggsML challenge on Kaggle

• https://www.kaggle.com/c/higgs-boson

13

DATA CHALLENGES

Page 14: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

• Official ATLAS GEANT4 simulations

• 30 features (variables)

• 250K training: input, label, weight

• 100K public test (AMS displayed real-time), only input

• 450K private test (to determine the winner after the closing of the challenge), only input

• public and private tests are shuffled, participants submit a vector of 550K labels

14

HUGE PUBLICITY B. Kégl / AppStat@LAL Learning to discover

CLASSIFICATION FOR DISCOVERY

14

Page 15: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

SIGNIFICANT IMPROVEMENT OVER THE BASELINE

15

B. Kégl / AppStat@LAL Learning to discover

CLASSIFICATION FOR DISCOVERY

15

Page 16: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

16

HUGE PUBLICITY

SIGNIFICANT IMPROVEMENT OVER THE BASELINE

yet partially missing the objectives

Page 17: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

• Challenges are useful for

• generating visibility in the data science community about novel application domains

• benchmarking in a fair way state-of-the-art techniques on well-defined problems

• finding talented data scientists

• Limitations

• not necessary adapted to solving complex and open-ended data science problems in realistic environments

• no direct access to solutions and data scientist

• emphasizes competition

17

DATA CHALLENGES

Page 18: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

18

We decided to design something better

Page 19: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay19

• Single-day coding sessions

• 20-30 participants

• preparation is similar to challenges

• Goals

• focusing and motivating top talents

• promoting collaboration, speed, and efficiency

• solving (prototyping) real problems

RAPID ANALYTICS AND MODEL PROTOTYPING

Page 20: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

20

RAPID ANALYTICS AND MODEL PROTOTYPING

Page 21: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay21

RAPID ANALYTICS AND MODEL PROTOTYPING

Page 22: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

22

RAPID ANALYTICS AND MODEL PROTOTYPING

Page 23: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

23

RAPID ANALYTICS AND MODEL PROTOTYPING

Page 24: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

24

ANALYTICS TOOLS TO PROMOTE COLLABORATION AND INNOVATION

Page 25: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

25

ANALYTICS TOOLS TO MONITOR PROGRESS

Page 26: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

• Algorithm selection and hyperparameter optimization

• studying human problem-solving

• combining human solutions with automatic tools

• comparing and tuning hyperparameter optimizers

• meta-learning: embedding data sets and models, collaborative optimization

26

RESEARCH (BEYOND SOLVING PROBLEMS)

Page 27: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

RAPID ANALYTICS AND MODEL PROTOTYPING

27

2015 Jan 15 replaying the

HiggsML challenge

Page 28: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

2015 Feb 9 Mortality prediction in septic patients

RAPID ANALYTICS AND MODEL PROTOTYPING

28

Page 29: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

RAPID ANALYTICS AND MODEL PROTOTYPING

29

2015 Apr 10 Classifying variable stars

Page 30: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

RAPID ANALYTICS AND MODEL PROTOTYPING

30

2015 May Drug identification from spectra

Page 31: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

RAPID ANALYTICS AND MODEL PROTOTYPING

31

2015 June Insect classification

Page 32: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay

• Short-term, ad-hoc teams assembled for a given task

• Low-engagement consulting job

• Efficient use of scarce data scientists (i.e., your time)

• Developing and practicing marketable skills

• fast-feedback experimentation is also useful in research

• Networking

• Management meta-tools to track your performance, to guide your training

32

IMPLEMENTING RAMPS IN AN INDUSTRIAL CONTEXT

Page 33: DJALEL ENBOUZID ALEXANDRE GRAMFORTLM/ENSAE RITM/UPSud LFA/ENSAE Neuroscience UNICOG/Inserm U1000/Inserm NeuroSpin/CEA Particle physics astrophysics & cosmology LPP/Polytechnique DMPH/ONERA

Center for Data ScienceParis-Saclay33

THANK YOU!