CURARE: CURATING AND MANAGING BIG DATA COLLECTIONS ON THE CLOUD
(CURARE: CURATION ET GESTION DE GRANDES COLLECTIONS DE DONNÉES SUR LE NUAGE)

Gavin Robert KEMP, LIRIS, [email protected]

SUPERVISORS
PARISA GHODOUS, PR. U. LYON I, LIRIS
CATARINA FERREIRA DA SILVA, MCF. U. LYON I, LIRIS
GENOVEVA VARGAS-SOLAR, CR. CNRS, LIG-LAFMIA

CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data




ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
  – Data Lakes
  – Data Curation: Workflow and Approaches
  – Curation Architectures and Platforms
§ CURARE: Service Oriented Architecture for Curating Data Collections
  – Approach for Curating Data Collections
  – Data Collection and View Model
  – Services for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives

28/09/2018


CONTEXT & MOTIVATION

“Data is everything and everything is data”, Pythian

Turning real-world phenomena into data thanks to the Big Data trend


BIG DATA DEFINITION

§ Data collections with characteristics that make them difficult to process on single machines or with traditional databases

§ A new generation of tools, methods and technologies to collect, process and analyse massive data collections

→ Tools imposing the use of parallel processing and distributed storage

Context & Problem Statement | State of the Art | CURARE | Implementation & Experimentation | Conclusion & Perspectives


BIG DATA PROPERTIES
- Volume (size)
- Velocity (production rate)
- Variety (data types & format)
- Variability (inconsistencies caused by constantly changing meaning)
- Veracity (truth and consistency)
- Value (how much information)

V's models: 3V, 4V, 5V, ..., 10V [Jagadish 2014]

“Big Data can really be very small and not all large datasets are big!” - Mike 2.0 [Hillard 2012]


BIG DATA LIFE CYCLE

Stages: Harvesting & Cleaning, Preserving, Processing & Analysis, Exploiting

Data Processing: data analytics [Feldman et al. 2013], data aggregation [Jayram et al. 2007]

Data Collections Storage: NoSQL, NewSQL (Hive), data sharding (MongoDB, CouchDB)


CLOUD COMPUTING AND BIG DATA

Ready-to-use environments for storing Big Data & running greedy processing tasks

- Software as a Service (SaaS): fully functional software
- Infrastructure as a Service (IaaS): CPU, RAM, disk
- Platform as a Service (PaaS): database systems, frameworks


THE RIGHT DATA FOR THE RIGHT ANALYTICS

Different sizes, evolving structure, completeness, production conditions & content, changing access policies …

What information is available in these collections?

Is the data clean? If not, what has to be done to make it clean?

Are there relations between data collections which could be exploited for better prediction?

What are the update rates and in what way do they affect the collection?


DATA CURATION

Data management through its lifecycle of interest & usefulness

§ Enable data discovery & retrieval
§ Maintain data quality
§ Add value
§ Provide for re-use over time


PROBLEM STATEMENT

Define a meta-data extraction model for data collections, to support decision making related to their exploitation


RESEARCH QUESTIONS
§ How to model structural, semantic & quantitative meta-data in an integrated manner?

§ How to design a cloud service-oriented architecture that enables data curation while considering variety and variability?

§ Can meta-data and a service-oriented data curation architecture support decision-making regarding data exploration?


ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
  – Data Lakes
  – Data Curation: Workflow and Approaches
  – Curation Architectures and Platforms
§ CURARE: Service Oriented Architecture for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives


DATA LAKE

Centralized repository containing virtually inexhaustible amounts of raw data to be analysed

Example sources (figure): CRM, sensor logs

Data wrangling [Terrizzano et al. 2015]

Constance [Hai et al. 2016]


DATA SWAMP

Repositories grow ever bigger and more complex, to the point that a lake becomes a swamp

Example sources (figure): CRM, sensor logs


DATA CURATION WORKFLOW

Step 1: Harvesting
Step 2: Extracting meta-data
Step 3: Exploitation

Activities (figure labels): Collecting, Vetting, Selecting, Grooming, Preserving, Describing, Provisioning

Data wrangling [Terrizzano et al. 2015]


EXTRACTING META-DATA

Workflow: Harvesting; Extracting meta-data (Preserving, Describing); Exploring

Level 1, no meta-data: sensing & collecting raw data
Level 2, partial meta-data: size, frequency, freshness, type
Level 3, complete meta-data: quantitative & qualitative data + semantic information

[Stonebraker et al. 2015]


APPROACHES FOR EXTRACTING META-DATA

Approach: Curation at Source [Curry et al. 2010]
Principle: Simple tagging & information extraction directly from the sensor
Pros: Pre-processes data according to samples
Cons: No awareness of the whole data collection

Approach: Master Data Management [Weatherspoon et al. 2013]
Principle: Provide a standard vocabulary at the company scale
Pros: Standardizes the language used in data collections
Cons: Does not improve data structures

Approach: Semantic linking, Constance [Hai et al. 2016]
Principle: Identify attributes referring to similar topics in different data sets
Pros: Explores data collections
Cons: Does not improve data structures

Approach: Data set clustering, Goods [Halevy 2016]
Principle: Discover similar data sets using clustering
Pros: Creates a synthesized representation of several data collections
Cons: Difficult to define a similarity criterion to cluster data with low quality (missing & null values, types)

Approach: Crowdsourcing and Collaboration spaces [Doan et al. 2005]
Principle: Communities produce, maintain & tag data (crowdsourcing)
Pros: Improves the quality of raw data
Cons: Manual & requires a lot of human resources

Level 1: simple tags as data is harvested
Level 2: pivot vocabulary & semantics
Level 3: manual and collaborative


DATA CURATION PLATFORMS

Technique: Big Data Stacks (Hadoop, Berkeley, Asterix, Spark)
Idea: Provide data processing tools using parallel programming models; support query languages & visualisation tools
Pros: Versatile, general purpose
Cons: Low level of adaptability; centred on analysis, not on data collections exploration

Technique: Service based architectures [CoreKG Beheshti 2018, Lord 2004]
Idea: Provide independent services implementing the steps of the data curation workflow
Pros: Services make it possible to adapt the tools to the data collections & give good maintainability
Cons: Services assembly is up to the programmer

Technique: Data curation network [Alfred P. Sloan Foundation 2017]
Idea: Provide a Web portal that integrates tools from libraries willing to curate data collections
Pros: General view of tools and data collections; runs on the Web; collaborative curation
Cons: No integration of tools; the curation process is not explicit; no automatic control on who does what


LIMITATIONS OF THE STATE OF THE ART (1/2)

§ Model in an integrated manner structural, semantic & quantitative meta-data?
  – No integrated data curation model to explore data collections in search of well-adapted analytics input

§ Design a cloud service oriented architecture for enabling data curation considering variety and variability?
  – No elastic architecture
  – No target solution for data curation


LIMITATIONS OF THE STATE OF THE ART (2/2)

§ Can meta-data and a service-oriented data curation architecture support decision-making regarding data exploration?
  – Not enough quantitative meta-data maintenance to make technical decisions
  – Domain dependent curation
    • no explicit data curation workflow
    • exploration based on visualization


ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
  – Approach for Curating Data Collections
  – Data Collection and View Model
  – Services for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives


TAILORING BIG DATA CURATION

Propose a service-based data curation system providing data analysts with

– Abstract information about the content of data collections with several associated releases
– Tools for curating data collections
– Strategies for managing metadata related to data collections


CONTRIBUTIONS AND RESULTS

- Data collection & View model; data curation formalisation
- CURARE: Cloud Service Oriented Architecture for storing & curating data collections
- Prototype on MongoDB & OpenStack; experiments on Grand Lyon and Twitter urban transport data collections


DATA CURATION APPROACH

Extracting as much meta-data as possible to
§ Make technical decisions
§ Choose data exploitation & analysis strategies

Figure (approach overview):
- Data Collections: Harvesting
- DATA COLLECTION MODEL: structural meta-data
- Extraction
- VIEW MODEL: quantitative meta-data
- Release & View: curated data collection
- Exploring views assists decision making


DATA CURATION APPROACH

Services supporting the approach (figure):
- Data harvesting services
- Data processing services (meta-data extraction)
- Storage services
- Curated data collections


CURARE: Cloud Service Oriented Architecture for Curating Data Collections


ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
  – Approach for Curating Data Collections
  – Data Collection and View Model
  – Services for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives


DATA COLLECTION: STRUCTURAL META DATA

Legend (origin of attribute values): retrieved from metadata & the provider; manual; extracted

DataCollection
- id: URL
- name: String
- provider: String
- licence: [public, restricted]
- size: Num
- author: String
- description: String

Release
- id: URL
- release: Num
- publicationDate: Date
- size: Num

DataItem
- id: URL
- name: String
- attributs: List()

Associations
- releases: 1 DataCollection to 1..n Release
- items: 1 Release to 1..n DataItem

Example (figure): data produced at t+1 (size: 2) and at t+2 (size: 3), with items such as
- Id::doc1, Bus: 2, Loc: univ
- Id::doc2, N°car: 5, Red: true
- Id::doc2, temp: 21, humidi: 61
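The structural meta-data model above can be sketched as plain Python classes. This is a minimal illustration, not the prototype's code: the class and attribute names follow the slide (including the spelling `attributs`), while the choice to derive `size` from the number of items, the licence check and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DataItem:
    id: str                       # a URL in the slide's model
    name: str
    attributs: List[str] = field(default_factory=list)  # spelling kept from the slide


@dataclass
class Release:
    id: str
    release: int                  # release number
    publicationDate: date
    items: List[DataItem] = field(default_factory=list)

    @property
    def size(self) -> int:
        # Assumption: size counts the data items of the release.
        return len(self.items)


@dataclass
class DataCollection:
    id: str
    name: str
    provider: str
    licence: str                  # "public" or "restricted"
    author: str
    description: str
    releases: List[Release] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.licence not in ("public", "restricted"):
            raise ValueError(f"unknown licence: {self.licence!r}")

    @property
    def size(self) -> int:
        # Assumption: collection size is the sum of its release sizes.
        return sum(release.size for release in self.releases)


# Hypothetical sample mirroring the figure: 2 items at t+1, 3 items at t+2.
r1 = Release("r1", 1, date(2016, 6, 7), [
    DataItem("doc1", "doc1", ["Bus", "Loc"]),
    DataItem("doc2", "doc2", ["temp", "humidi"]),
])
r2 = Release("r2", 2, date(2016, 6, 8), [
    DataItem("doc3", "doc3", ["Bus", "Loc"]),
    DataItem("doc4", "doc4", ["N°car", "Red"]),
    DataItem("doc5", "doc5", ["temp", "humidi"]),
])
collection = DataCollection("c1", "demo", "demo provider", "public",
                            "demo author", "illustrative collection", [r1, r2])
```

Keeping `size` as a derived property avoids storing a value that can drift from the actual releases; a real store may instead persist it, as the instances shown later in the deck do.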


THE GRAND LYON TRAFFIC DATA COLLECTION

Data collection: releases R1 ... Rn, one release per day (Release/day)

Items of a release: JSON documents

Item in R1:

{
  "geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
  "full_location": "Autoroute du Soleil, 69005 Lyon",
  "_id": "Criter11185353",
  "properties": {
    "confidentiality": "noRestriction", "probability": "certain", "mobility": "",
    "creationtime": "2016-06-07 19:40:00", "publiceventtype": "", "networkmanagementtype": "",
    "observationtime": "2016-06-07 19:40:00", "last_update": "2016-06-07 19:43:30",
    "numberoflanesrestricted": "0", "effectonroadlayout": "", "creator": "CRITER", "id": "Criter11185353",
    "firstsupplierversiontime": "2016-06-07 19:40:00", "version": "1", "linkname": "",
    "type": "VehicleObstruction", "status": "active", "direction": "bothWays",
    "locationtype": "nonLinkedPoint", "disturbanceactivitytype": "", "last_update_fme": "2016-06-07 19:44:29",
    "endtime": "", "creationreference": "", "informationstatus": "real",
    "townname": "Voie Rapide Urbaine de Lyon", "publiccomment": "Bouchon, km 455|Voie Rapide Urbaine de Lyon",
    "roadmaintenancetype": "", "versiontime": "2016-06-07 19:43:26", "starttime": "2016-06-07 19:40:00",
    "gid": "39258", "abnormaltraffictype": ""
  }
}

Parts of an item:
- Geographic coordinates
- Road logic name
- Observation data
- Traffic event description
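The "Release/day" organisation can be illustrated with a short sketch that groups traffic-event documents into one release per calendar day. It assumes that `properties.creationtime` decides which day an event belongs to, which the slides do not state; the function name and the two extra events (`example-2`, `example-3`) are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def group_into_daily_releases(items, time_key="creationtime"):
    """Group traffic-event documents into one release per calendar day."""
    by_day = defaultdict(list)
    for item in items:
        # Timestamps follow the "YYYY-MM-DD HH:MM:SS" layout of the sample item.
        stamp = datetime.strptime(item["properties"][time_key], "%Y-%m-%d %H:%M:%S")
        by_day[stamp.date()].append(item)

    releases = []
    for number, day in enumerate(sorted(by_day), start=1):
        releases.append({
            "release": number,                    # R1, R2, ... in day order
            "publicationDate": day.isoformat(),
            "size": len(by_day[day]),             # assumption: size = item count
            "dataItems": by_day[day],
        })
    return releases


events = [
    {"_id": "Criter11185353", "properties": {"creationtime": "2016-06-07 19:40:00"}},
    {"_id": "example-2", "properties": {"creationtime": "2016-06-07 21:05:00"}},  # hypothetical
    {"_id": "example-3", "properties": {"creationtime": "2016-06-08 08:15:00"}},  # hypothetical
]
releases = group_into_daily_releases(events)
```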


DATA RELEASE DATA ITEM: TRAFFIC EVENT

JSON document (the traffic event with "_id": "Criter11185353" from THE GRAND LYON TRAFFIC DATA COLLECTION), made of:
- Geographic coordinates
- Road logic name
- Observation data
- Traffic event description

Structure description:

DataItem
- id: URL
- name: String
- attributs: List()
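A structure description of this kind can be derived from a JSON document by walking its keys. The sketch below is one possible way to do it, assuming nested objects are flattened into dotted attribute paths; the helper name `describe_structure` and the shortened sample document are illustrative, not taken from the prototype.

```python
import json


def describe_structure(document, prefix=""):
    """Return sorted (attribute path, type name) pairs for a JSON object."""
    pairs = []
    for key, value in document.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested objects contribute dotted paths such as "geometry.type".
            pairs.extend(describe_structure(value, prefix=path + "."))
        else:
            pairs.append((path, type(value).__name__))
    return sorted(pairs)


# Shortened version of the traffic event shown on the slide.
raw = """
{"geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
 "full_location": "Autoroute du Soleil, 69005 Lyon",
 "_id": "Criter11185353",
 "properties": {"creator": "CRITER", "type": "VehicleObstruction", "status": "active"}}
"""
event = json.loads(raw)
structure = describe_structure(event)
attributs = [path for path, _ in structure]  # candidate value for DataItem.attributs
```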


DATA COLLECTION RELEASE: TRAFFIC EVENTS IN A PERIOD OF TIME

A release groups the data items (JSON documents such as the traffic event "Criter11185353") produced in a period of time.

DataItem
- id: URL
- name: String
- attributs: List()

Release
- id: URL
- release: Num
- publicationDate: Date
- size: Num

Association
- items: 1 Release to 1..n DataItem

Release R1 (instance):
"_id" : "tweets_736241_collection_24_17021",
"id" : 17021,
"name" : "tweets_736241_collection_24_17021",
"publicationDate" : "Mon Aug 08 2016 00:00:41 GMT+0000 (UTC)",
"size" : 4386,
"url" : "localhosttweets_736241_collection_2417021",
"dataItems": [ List(dataItem) ]


DATA COLLECTION: GRAND LYON TRAFFIC EVENTS

DataItem
- id: URL
- name: String
- attributs: List()

Release
- id: URL
- release: Num
- publicationDate: Date
- size: Num

DataCollection
- id: URL
- name: String
- provider: String
- licence: [public, restricted]
- size: Num
- author: String
- description: String

Associations: a DataCollection (1) has releases (1..n Release); a Release (1) has items (1..n DataItem)

DataCollection <instance>
"_id": "CollectionInfo",
"author": "twitter",
"description": "this is a Collection",
"id": "...",
"licence": "public",
"name": "tweets",
"provider": "twitter",
"releases": [ List(Release) ],
"size": 736241

Release <instance> (R1 ... Rn)
"_id": "tweets_736241_collection_24_17021",
"id": 17021,
"name": "tweets_736241_collection_24_17021",
"publicationDate": "Mon Aug 08 2016 00:00:41 GMT+0000 (UTC)",
"size": 4386,
"url": "localhosttweets_736241_collection_2417021",
"dataItems": [ List(dataItem) ]

DataItem <instance>
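The data collection model on this slide can be sketched as Python dataclasses. This is a minimal illustration, not the CURARE implementation: the field names follow the slide, while the Python types, the sample values and the `total_items` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class DataItem:
    id: str                       # URL identifying the item
    name: str
    attributs: List[Any] = field(default_factory=list)


@dataclass
class Release:
    id: str
    release: int                  # release number
    publicationDate: str
    size: int
    items: List[DataItem] = field(default_factory=list)       # 1..n items


@dataclass
class DataCollection:
    id: str
    name: str
    provider: str
    licence: str                  # "public" or "restricted"
    size: int
    author: str
    description: str
    releases: List[Release] = field(default_factory=list)     # 1..n releases


def total_items(collection: DataCollection) -> int:
    """Hypothetical helper: count the items held by all releases."""
    return sum(len(release.items) for release in collection.releases)


item = DataItem(id="doc1", name="doc1", attributs=["Bus", "Loc"])
r1 = Release(id="r1", release=1, publicationDate="2016-08-08", size=1, items=[item])
tweets = DataCollection(id="c1", name="tweets", provider="twitter", licence="public",
                        size=1, author="twitter", description="this is a Collection",
                        releases=[r1])
print(total_items(tweets))
```

Nesting releases inside the collection mirrors the `"releases": [ List(Release) ]` field of the JSON instance.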


VIEW MODEL

Data structure providing an aggregated perspective of the content of a data collection and of its associated releases

§ Describes the content of a data collection as a series of tuples <attribute, type>
§ For every attribute, the view describes
– basic statistics over the values of the attribute: min, max, mean, median, histogram, standard deviation
– the representation of null and missing values for every couple <attribute, type> in a document
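The per-attribute statistics listed above can be illustrated with a short Python sketch. It is not the MapReduce code used in CURARE: the function name `describe_attribute`, the sample documents and the choice of the population standard deviation are assumptions.

```python
from collections import Counter
from statistics import mean, median, pstdev


def describe_attribute(documents, attribute):
    """Summarise one attribute over a list of dict documents."""
    present = [d[attribute] for d in documents if attribute in d]
    values = [v for v in present if v is not None]
    distribution = Counter(values)
    descriptor = {
        "count": len(values),
        "missing": len(documents) - len(present),   # attribute absent from the document
        "nulls": len(present) - len(values),        # attribute present but null
        "valueDistribution": dict(distribution),
        "mode": distribution.most_common(1)[0][0] if values else None,
    }
    numeric = [v for v in values
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if numeric and len(numeric) == len(values):
        # Numeric measures only make sense when every value is a number.
        descriptor.update(min=min(numeric), max=max(numeric), mean=mean(numeric),
                          median=median(numeric), std=pstdev(numeric))
    return descriptor


docs = [{"temp": 21, "humidi": 61}, {"temp": 23}, {"temp": None}, {"Bus": 2}]
summary = describe_attribute(docs, "temp")
print(summary)
```

Separating absent attributes from null values keeps the two counts distinct, as the view model requires.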


VIEW: QUANTITATIVE META DATA

Legend: Extracted, Computed

View
- id: URL
- name: String
- provider: String
- author: String
- description: String
- code: Code
- releaseSelectionRules: Function (Source: String, default: version.id)

ReleaseView
- id: URL
- version: num
- publicationDate: date
- size: num

AttributeDescriptor
- id: URL
- name: String
- type: String
- valueDistribution: Hist
- nullValue: num
- absentValue: num
- minValue, maxValue: type
- mean, median, mode: type
- std: type
- count: num

Associations: a View (1) has releaseViews (0..n ReleaseView); a ReleaseView (1) has attributDescriptors (0..n AttributeDescriptor)

Example (extension of a Data Collection, its Releases and DataItems)
- Release: data produced at t+1, Size: 2
- Release: data produced at t+2, Size: 3
- DataItem: Id: doc1, Bus: 2, Loc: univ
- DataItem: Id: doc2, N°car: 5, Red: true
- DataItem: Id: doc2, temp: 21, humidi: 61


TRAFFIC EVENTS ATTRIBUTES STATISTICS: ATTRIBUTE DESCRIPTOR

Release R1

For each attribute across all items in a release Ri

Quantitative meta-data: statistical measures of each attribute

AttributDescriptor
{
"_id": "tweets_736241_viewMapReduce_24_17021.entities.hashtags.0.text.number",
"value": {
"data": [9,9,9,9,9,9],
"median": 9,
"mean": 9,
"count": 6,
"valueDistribution": {"9": 6},
"type": "number",
"mode": {"value": "9", "count": 6},
"missing": 4380,
"nulls": null
}
}

DataItem <instance>: at11, v11 ...
DataItem <instance>: at12, v12 ...
DataItem <instance>: at1n, v1n ...


TRAFFIC EVENTS STATISTICS: RELEASE VIEW

Release Ri

Quantitative meta-data: aggregated measures of the attribute statistics over all its data items

ReleaseView
"_id": "tweets_736241_viewMapReduce_24_17021",
"version": "17021",
"publicationDate": ISODate("2016-08-08T00:00:00Z"),
"size": 765,
"attributDescriptors": [ List(AttributDescriptor) ]

Aggregate: a ReleaseView (1) has attributDescriptors (0..n AttributDescriptor), holding the statistics of attributes a1 ... an


GRAND LYON TRAFFIC EVENT STATISTICS: VIEW

Data collection

View
...
"code": {
"model-framework": "MongoMapReduceFinalize",
"mapper": Mapper(),
"reducer": Reducer(),
"finalize": Finalizer()
},
"rules": "function rule(version){\r\n\treturn version\r\n}"

Quantitative meta-data: aggregated measures of release statistics

Aggregate: a View (1) has ReleaseViews (0..n), one per release R1 ... Rn; a ReleaseView (1) has attributDescriptors (0..n AttributDescriptor), holding the statistics of attributes a1 ... an


ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
– Approach for Curating Data Collections
– Data Collection and View Model
– Services for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives


SERVICES FOR CURATING DATA COLLECTIONS

Integrated Data Curation Platform
§ Data Harvesting Services: REST, FLUME
§ Data Cleaning & Processing Services: PIG, HADOOP
§ Data Storage Services: MongoDB, Neo4J, CouchDB
§ Tagged Data Collections
§ Data Analytics Services
§ Decision Making Support Services

Towards Cloud big data services for intelligent transport systems; G. Kemp, G. Vargas-Solar, C. Ferreira da Silva, P. Ghodous, C. Collet, P.Lopez. Concurrent Engineering 2015, Delft, Netherlands


CLOUD DEPLOYMENT ENVIRONMENT

Image modified from the Windows data science virtual machine: https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/


CURARE CLOUD SERVICES

Integrated Data Curation Platform
§ Service layers: SaaS, PaaS
§ CURARE DATA HARVESTING SERVICE
§ CURARE DATA CLEANING SERVICE
§ CURARE DATA STORAGE SERVICES
§ CURARE DATA EXPLORATION SERVICES
§ Data analytics services
§ Decision making services
§ Stored content: Tagged Data Collections, Meta-data (dataCollection, release), Views (quantitative & structural description)


DATA HARVESTING: BATCH

CURARE DATA HARVESTING SERVICE (PaaS)
§ Data source (on demand producer)
§ Method call instance: connect(), get(), close()
§ Service instance (continuous): iterator
§ Buffer & caching service

Workflow banner: Harvesting, Preserving, Describing, Extracting meta-data, Exploring
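The batch harvesting interaction (connect, get, close, with an iterator feeding a buffer) might look like the following Python sketch. The class `BatchHarvester`, its `buffer_size` parameter and the in-memory source are hypothetical; the slide only names the three method calls.

```python
from typing import Iterable, Iterator, List, Optional


class BatchHarvester:
    """Hypothetical on-demand harvester exposing connect(), get() and close()."""

    def __init__(self, source: Iterable[dict], buffer_size: int = 2):
        self._source = source
        self._buffer_size = buffer_size
        self._iterator: Optional[Iterator[dict]] = None

    def connect(self) -> None:
        # Open an iterator over the data source.
        self._iterator = iter(self._source)

    def get(self) -> List[dict]:
        """Return the next buffered batch (empty once the source is exhausted)."""
        if self._iterator is None:
            raise RuntimeError("connect() must be called before get()")
        batch = []
        for document in self._iterator:
            batch.append(document)
            if len(batch) == self._buffer_size:
                break
        return batch

    def close(self) -> None:
        self._iterator = None


harvester = BatchHarvester([{"id": 1}, {"id": 2}, {"id": 3}])
harvester.connect()
batches = []
while True:
    batch = harvester.get()
    if not batch:
        break
    batches.append(batch)
harvester.close()
print(batches)
```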


DATA HARVESTING: STREAM

CURARE DATA HARVESTING SERVICE (PaaS)
§ Method call instance: subscribe(), receive(), stop()
§ Service instance (continuous)
§ Message queue service: push/pull

Workflow banner: Harvesting, Preserving, Describing, Extracting meta-data, Exploring


DATA CLEANING & STORAGE

§ CURARE DATA HARVESTING SERVICE: post()
§ CURARE DATA CLEANING SERVICE: post()
§ CURARE DISTRIBUTED DATA STORAGE SERVICES (PaaS)

Workflow banner: Harvesting, Preserving, Describing, Extracting meta-data, Exploring


DATA PROCESSING & EXPLORATION

§ CURARE DISTRIBUTED DATA STORAGE SERVICES: Raw data, Curated data, Analysed data
§ CURARE DATA EXPLORATION SERVICES
§ Data analytics services
§ Decision making services

Interactions (steps 1 to 4 in the figure)
– get(analysedData)
– sendInstructions(makecollectionViews)
– View dataCollections
– sendInstructions(analyseData)

Workflow banner: Harvesting, Preserving, Describing, Extracting meta-data, Exploring


DATA VARIETY & VARIABILITY

§ Data Collection and View Models
– Data collections track structural meta-data over time through releases
– Views track quantitative meta-data over time through release views
– Attribute descriptors track every existing value
– Release views track the evolution over time of an attribute's values and their distribution
§ Service-based Data Curation Architecture
– Thanks to SOA, services can be swapped and added, so that data processing tools can explore new data formats & meanings


ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
§ Implementation and Experimentation
– Current Prototype
– Experiment 1: Evaluating the Cost of Computing Views
– Experiment 2: Data Sharding
– Comparison and Lessons Learned
§ Conclusion and Perspectives


CURRENT PROTOTYPE IMPLEMENTATION

§ Deployed on OpenStack (https://www.openstack.org)
§ Distributed Data and Access Service: MongoDB
– 4 routers, 3 config servers, 3 shards with 3 replica sets
§ Scripts and statistical operators
– process data collections
– generate objects instantiating our model
§ Machines used
– 4 vCPU at 2.5 GHz, 8 GB RAM, 80 GB disk


EXPERIMENTS SETTING
§ Profile the cost of extracting meta-data
– Compute the statistical measures that make up the quantitative meta-data
– Measure execution time in centralized & parallel execution environments
§ Use case: test the value of using views in a data sharding decision-making process
§ Input data collections
– Grand Lyon data set: 2095 documents, 86 releases, 2.5 MB
– Twitter data set: 736242 documents, 125 releases, 2.5 GB
– Samples ranging from 4000 to 20000 documents


ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
§ Implementation and Experimentation
– Current Prototype
– Experiment 1: Evaluating the Cost of Computing Views
– Experiment 2: Data Sharding
– Comparison and Lessons Learned
§ Conclusion and Perspectives


How much does it cost to extract meta-data & compute Views?


EXPERIMENT 1: CREATING DATA COLLECTIONS & VIEWS (1/2)

§ The cost varies with the size of the collection and the number of attributes
§ By far the biggest time consumer is the attribute collection, accounting for 90 to 95% of the time
– 2 min for the Grand Lyon event data set of 2000 documents
– 36 hours for the Twitter data set of 700000 documents

[Figures: Grand Lyon, Twitter]


EXPERIMENT 1: CREATING DATA COLLECTIONS & VIEWS (2/2)

Speed gain relative to the size of the releases
– 6-7 sec for the Grand Lyon event data set of 2000 documents
– 50 min for the Twitter data set of 700000 documents

[Figures: Grand Lyon, Twitter]


ROADMAP

§ Context & Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
§ Implementation and Experimentation
– Current Prototype
– Experiment 1: Evaluating the Cost of Computing Views
– Experiment 2: Data Sharding
– Comparison and Lessons Learned
§ Conclusion and Perspectives


SHARDING DATA ACROSS DIFFERENT STORES

Traffic status
{
"geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
"full_location": "Autoroute du Soleil, 69005 Lyon",
"_id": "Criter11185353",
"properties": {"confidentiality": "noRestriction", "probability": "certain", "mobility": "", "creationtime": "2016-06-07 19:40:00", "publiceventtype": "", "networkmanagementtype": "", "observationtime": "2016-06-07 19:40:00", "last_update": "2016-06-07 19:43:30", "numberoflanesrestricted": "0", "effectonroadlayout": "", "creator": "CRITER", "id": "Criter11185353", "firstsupplierversiontime": "2016-06-07 19:40:00", "version": "1", "linkname": "", "type": "VehicleObstruction", "status": "active", "direction": "bothWays", "locationtype": "nonLinkedPoint", "disturbanceactivitytype": "", "last_update_fme": "2016-06-07 19:44:29", "endtime": "", "creationreference": "", "informationstatus": "real", "townname": "Voie Rapide Urbaine de Lyon", "publiccomment": "Bouchon, km 455|Voie Rapide Urbaine de Lyon", "roadmaintenancetype": "", "versiontime": "2016-06-07 19:43:26", "starttime": "2016-06-07 19:40:00", "gid": "39258", "abnormaltraffictype": ""}
}

Balanced and smooth fragmentation (size, location, availability)

Optimum distribution across shards providing storage spaces (chunks)

§ Shard 0: chunk, chunk; key range 0 ... 20
§ Shard 1: chunk, chunk; key range 21 ... 40
§ Shard 2: chunk, chunk; key range 41 ... 60
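The difference between range-based and hash-based placement over the three shards can be sketched in Python. The key ranges come from the slide; the sample keys are invented, and MD5 is only a stand-in for whatever hash function the data store applies.

```python
import hashlib
from collections import Counter

RANGES = [(0, 20), (21, 40), (41, 60)]   # key ranges of Shard 0, 1 and 2


def range_shard(key: int) -> int:
    """Place a key on the shard whose range contains it."""
    for shard, (low, high) in enumerate(RANGES):
        if low <= key <= high:
            return shard
    raise ValueError(f"key {key} is outside every range")


def hash_shard(key: int, n_shards: int = 3) -> int:
    """Place a key according to a hash of its value (MD5 is an assumption)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


keys = [1, 2, 3, 4, 5, 6, 7, 8, 25, 50]   # invented keys, skewed towards low values
by_range = Counter(range_shard(k) for k in keys)
by_hash = Counter(hash_shard(k) for k in keys)
print("range:", dict(by_range))
print("hash: ", dict(by_hash))
```

With skewed keys, ranged placement piles most documents on the first shard, whereas hashed placement depends only on the hash values, not on where the keys cluster.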


EXPERIMENT 2: SHARDING WITHOUT VIEWS

Raw data collection, handled by the data scientist
§ Extract a sample
§ Explore the data collection manually
§ Identify attributes manually
§ Choose sharding key candidates
§ Choose sharding strategy (revisited if query evaluation time is poor)
§ Associate key candidates with sharding strategies
§ Shard collection with strategies
§ Evaluate data distribution across fragments


EXPERIMENT 2: EVALUATION OF SHARDING KEYS DECISIONS WITHOUT VIEWS

The number of documents and the amount of data vary
§ greatly for ranged location
§ to a lesser extent for hashed location


EXPERIMENT 2: DECISION MAKING QUESTIONS
§ Which attribute can be used to shard the collection?
§ What is the value distribution of each attribute?
§ Is there critical data with particular availability requirements?
§ Should some fragments be collocated?

This information must be extracted and computed


EXPERIMENT 2: SHARDING USING VIEWS

Raw data collection (data provider)

Get meta-data
§ Extract meta-data
– Create Data Collection meta-data (Data collection, Release, Item)
§ Compute meta-data
– Compute quantitative measures
– Discover domain value properties (null, absent values)
– Create Views (View, ReleaseView, AttributeDescriptor)

Data scientist
§ Identify attributes
§ Choose sharding key candidates
§ Associate key candidates with sharding strategies
§ Shard collection with strategies
§ Evaluate data distribution across fragments


Identify the best candidate key attributes


EXPERIMENT 2: IDENTIFYING ATTRIBUTES VALUES DISTRIBUTION

ReleaseView: attributes related to location

attribute | nb values | missing | max count | min count
tweets.quoted_status.user.location.string | 8436 | 3008392 | 16820 | 4
tweets.user.location.string | 11041 | 571962 | 269506 | 4
tweets.user.time_zone.string | 149 | 860539 | 870263 | 4
tweets.place.bounding_box.coordinates.0.0.0.number | 3 | 0 | 1983882 | 19950
tweets.place.bounding_box.coordinates.0.0.1.number | 3 | 0 | 1968043 | 35784
tweets.place.bounding_box.coordinates.0.1.0.number | 3 | 0 | 1983882 | 19950
tweets.place.bounding_box.coordinates.0.1.1.number | 3 | 0 | 1969602 | 34231
tweets.place.bounding_box.coordinates.0.2.0.number | 5 | 0 | 1898817 | 19950
tweets.place.bounding_box.coordinates.0.2.1.number | 3 | 0 | 1969602 | 34231
tweets.place.bounding_box.coordinates.0.3.0.number | 3 | 0 | 1898817 | 19950
tweets.place.bounding_box.coordinates.0.3.1.number | 3 | 0 | 1968043 | 25784
tweets.quoted_status.user.geo_enabled.string | 2 | 2938908 | 155678 | 95665
tweets.user.geo_enabled.string | 1 | 0 | 3190337 | 3190337

Annotations on the table
§ Too many missing values
§ Good distribution of values, few missing values → candidate sharding key
§ Lead to unbalanced chunks
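The three verdicts annotated on the table could be derived from the ReleaseView figures with a simple heuristic, sketched here in Python. The thresholds, the `classify` function and the total document count are assumptions made for illustration; the slides do not state how the verdicts were reached.

```python
from dataclasses import dataclass


@dataclass
class AttributeStats:
    name: str
    nb_values: int   # number of distinct values
    missing: int     # documents in which the attribute is missing
    max_count: int   # occurrences of the most frequent value
    min_count: int   # occurrences of the least frequent value


def classify(stats, total_docs, n_shards=3, max_missing_ratio=0.5):
    """Assumed heuristic: both thresholds are illustrative, not from the slides."""
    if stats.missing / total_docs > max_missing_ratio:
        return "too many missing values"
    if stats.nb_values < 2 * n_shards:
        return "too few distinct values"
    return "candidate"


# Assumption: the total is the count of the single-valued, never-missing attribute.
TOTAL_DOCS = 3190337

rows = [
    AttributeStats("tweets.quoted_status.user.location.string", 8436, 3008392, 16820, 4),
    AttributeStats("tweets.user.location.string", 11041, 571962, 269506, 4),
    AttributeStats("tweets.user.time_zone.string", 149, 860539, 870263, 4),
    AttributeStats("tweets.place.bounding_box.coordinates.0.0.0.number", 3, 0, 1983882, 19950),
    AttributeStats("tweets.user.geo_enabled.string", 1, 0, 3190337, 3190337),
]

verdicts = {row.name: classify(row, TOTAL_DOCS) for row in rows}
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

The sketch only uses the distinct-value and missing counts; a finer test of balance would also compare max count and min count, which is what the histograms of the attribute descriptors make visible.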


With which sharding strategies (range, hash) can candidate keys cope?


EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY TIME ZONE (1/2)

[Figure: Attribute Descriptor histogram of the time zone values. Vertical axis: count, from 0 to 1000000. Horizontal axis: time zone values in alphabetical order, from Abu Dhabi to Wellington. Labelled bars: Paris, US & Canada, Ljubljana, Greenland, Athens, Amsterdam, Missing.]


EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY TIME ZONE (2/2)

[Figure: Attribute Descriptor histogram of the time zone values, divided into Shard1, Shard2 and Shard3. Vertical axis: count, from 0 to 1000000. Horizontal axis: time zone values in alphabetical order, from Abu Dhabi to Wellington. Labelled bars: Paris, US & Canada, Ljubljana, Greenland, Athens, Amsterdam, Missing.]


EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY LOCATION (1/3)

[Figure: Attribute Descriptor histogram of the location values. Vertical axis: count, from 0 to 700000. Horizontal axis: free-text location values, for example "Arcachon, France", "Gilbert, AZ", "Trémolat, France".]


EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY LOCATION (2/3)

[Figure: Attribute Descriptor histogram of the location values. Vertical axis: count, from 0 to 3500000. Horizontal axis: free-text location values, for example "Bay Area", "Cannes, France", "Le Portel". Labelled bars: location, France, Lyon, missing.]


EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY LOCATION (3/3)

[Figure: Attribute Descriptor histogram of the location values, divided into Shard1, Shard2 and Shard3. Vertical axis: count, from 0 to 3500000. Horizontal axis: free-text location values, for example "Croix rousse", "Guadalajara, Jalisco", "Sevilla, España". Labelled bars: location, France, Lyon, missing.]


ROADMAP

§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
§ Implementation and Experimentation
– Current Prototype
– Experiment 1: Evaluating the Cost of Computing Views
– Experiment 2: Data Sharding
– Comparison and Lessons Learned
§ Conclusion and Perspectives


Is decision making for sharding better done with or without views?


COMPARISON

Without views: iterative process
– Guess sharding keys according to the data scientist's knowledge
– Trial-and-error approach based on samples
– Steps: extract a sample, explore the data collection manually, identify attributes manually, choose sharding key candidates, choose sharding strategy (revisited if query evaluation time is poor), associate key candidates with sharding strategies, shard collection with strategies, evaluate data distribution across fragments

With views: one-shot process
– Full structural & quantitative knowledge of the data collection
– Decision based on the whole data collection
– Steps: get meta-data (extract meta-data, create Data Collection meta-data, compute quantitative measures, discover domain value properties such as null and absent values, create Views), identify attributes distribution, choose sharding key candidates, associate key candidates with sharding strategies, shard collection with strategies, evaluate data distribution across fragments

Criterion | Without Views | With Views
Query attributes | No | Yes
Explorable attributes | Sample | All
Decision support | Educated guess | Measurable graphs

Evaluation based on
– Query execution / communication time: determines the degree of data fragmentation and colocation
– Balance of shard sizes


LESSONS LEARNED

Criterion | Sharding without Views | Sharding with Views
Exploration | Several days | Few hours
Methods | Using scripts to find divisions in the data | Querying the meta-data to find related attributes, analysing the histogram for sharding
Quality | Mediocre, distribution remains fairly poor | Almost perfect distribution of documents across all the shards
Processing | Fairly cheap and quick scripts | Expensive data processing

→ Sharding with views is quicker, more intuitive and more effective


ROADMAP

§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives


CONCLUSION

28/09/2018

§ Model in an integrated manner structural, semantic & quantitative meta-data?

§ Data collection & View models: structural and quantitative meta-data

§ Design a cloud service oriented architecture for enabling data curation considering variety and variability?

§ CURARE: Cloud service oriented architecture for storing & curating data collections

§ Can meta-data and a service-oriented data curation architecture support decision-making regarding data exploration?

§ Use case: choosing the best criterion for sharding data using Views


SHORT & MID TERM PERSPECTIVES

§ Compare releases & data collections
– visualisation
– query languages enhancing current ad-hoc and imperative proposals

§ Evaluate our data curation approach with user feedback
– explore newspaper collections & political campaigns
– investigate the urban data from the pole CARA

§ Extend the view model & CURARE
– discover semantic metadata and relationships among attributes
– discover functional, temporal and causal dependencies


LONG TERM PERSPECTIVES

§ Compose big data curation services according to QoS criteria
– data properties and target applications

§ Data curation traceability
– distributed and collaborative data curation based on blockchain

§ Big data collections market
– curation guided by cost and business models
– QoS criteria like privacy, economic cost, provenance, reputation, trust


CURARE: CURATING AND MANAGING BIG DATA COLLECTIONS ON THE CLOUD

Gavin Robert KEMP, LIRIS, [email protected]


PUBLICATIONS

§ Journals
1. Cloud big data application for transport; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous, C. Collet, P. López Amaya; International Journal of Agile Systems and Management, Inderscience, 9 (3), 2016
2. Big data collections and services for building intelligent transport applications; G. Kemp, P. López Amaya, C. Ferreira Da Silva, G. Vargas-Solar, P. Ghodous, C. Collet; International Journal of Electronic Business Management, Vol. 14 No. 1, pp. 1-10, 2016
3. CURARE: Maintaining and Managing Data Collections Using Views; G. Kemp, C. Ferreira Da Silva, G. Vargas-Solar, P. Ghodous; IEEE Transactions on Big Data (submitted)

§ Conferences
4. Towards Cloud big data services for intelligent transport systems; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous, C. Collet, P. Lopez; Concurrent Engineering, Jul. 2015, Delft, Netherlands
5. Aggregating and Managing Big rEaltime Data in the Cloud: Application to intelligent transport for Smart Cities; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous; Proceedings of the 1st International Conference on Vehicle Technology and Intelligent Transport Systems, May 2015, Lisbon, Portugal, pp. 107-112

§ Book Chapter
6. Service Oriented Big Data Management for Transport; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous, C. Collet; Smart Cities, Green Technologies, and Intelligent Transport Systems, series Communications in Computer and Information Science, Springer, 579, pp. 267-281, 2016


PARALLEL DBMS & PROCESSING ENVIRONMENTS FOR CURATING DATA

[Figure: curation phases (Harvesting, Preserving, Describing / Extracting meta-data, Exploring) and the parallel environments that support them]

- Parallel data collection (Flink, Stream, Flume)
- ETL
- NoSQL & NewSQL (parallel), structured data provision
- Parallel data processing platforms: Spark (RDD – Tables/Graphs), Hadoop ecosystem tools (e.g., Pig)
- Parallel data processing platforms: Spark (descriptive statistics functions), Hadoop ecosystem tools (e.g., Hive)
- Parallel data querying & analytics: parallel RDBMS, Big Data Analytics Stacks (Asterix, BDAS), parallel analytics (Matlab, R)


EXPLORING TRANSPORT EVENTS IN LYON USING VIEWS

[Figure: view model for View<Events> and View<Bicycles>: Releases (R1 … Rn), ReleaseView, AttributDescriptor (attributDescriptors, 0..n) and per-attribute statistics (a1 … an); annotation: knn]

Views exploration assists decision making:
- Search for datasets about traffic events, smaller than 30 MB, reporting events during vacations
- Which releases about bicycle distribution have less than 20% missing values?
- Search for releases reporting traffic events close to bicycle stations
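The question about missing values can be sketched in Python as a filter over release metadata. The `AttributeStats` and `ReleaseView` classes below are simplified, hypothetical stand-ins for the view model in the figure (releases described by attribute descriptors carrying statistics); the field names and sample values are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AttributeStats:
    name: str
    count: int    # number of documents in the release
    missing: int  # documents where the attribute is absent or empty


@dataclass
class ReleaseView:
    release_id: str
    topic: str
    size_bytes: int
    attributes: List[AttributeStats] = field(default_factory=list)

    def missing_ratio(self) -> float:
        """Share of missing values over all attributes of the release."""
        total = sum(a.count for a in self.attributes)
        missing = sum(a.missing for a in self.attributes)
        return missing / total if total else 0.0


def releases_below_missing(views: List[ReleaseView], topic: str, max_ratio: float) -> List[str]:
    """Answer the question using metadata only, without scanning the raw data."""
    return [v.release_id for v in views
            if v.topic == topic and v.missing_ratio() < max_ratio]


# Invented sample metadata.
views = [
    ReleaseView("R1", "bicycles", 12_000_000,
                [AttributeStats("station", 1000, 50), AttributeStats("available_bikes", 1000, 100)]),
    ReleaseView("R2", "bicycles", 15_000_000,
                [AttributeStats("station", 1000, 300), AttributeStats("available_bikes", 1000, 400)]),
    ReleaseView("R3", "events", 25_000_000, [AttributeStats("type", 2000, 100)]),
]
good = releases_below_missing(views, "bicycles", 0.20)
print(good)
```

The same pattern would cover the size constraint of the first question by adding a test on `size_bytes`.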


DATA CURATION ON THE LAKES

Approach: Meta data: Constance [Hai et al 2016], [Stonebraker et al 2013], Data wrangling [Terrizzano et al 2015]
Principle: Extraction of semantic meta-data or mapping items with concepts. Query using SPARQL, regular expression
Pros: Possibility to express declarative queries; difficult to automate completely
Cons: No statistics, iterative data querying for exploring data

Approach: Data content description: descriptive statistics, processing according to data types, schema extraction
Principle: Explore data structure for extracting the schema. Compute descriptive statistics functions for every element
Pros: Aggregated view of the content despite data types. Simple to visualize and scale if important volume
Cons: Adapted for (semi)structured data, manual tagging for multimedia content

Approach: Curation: [CoreKG, Curry 2016, QoSMOS2018, Tacit knowledge management 2017]
Principle: API with methods for preserving data collections. Querying and data fusion operations
Pros: Exploit Compass tool from MongoDB for providing a quantitative vision of data content
Cons: Data transformation from CSV to JSON. No semantic knowledge (terms, functional dependencies)
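The "data content description" approach (schema extraction plus descriptive statistics for every element) can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not any of the cited tools: documents are flat dictionaries, an empty string counts as missing, and `describe_collection` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, Iterable


def describe_collection(documents: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Extract the observed schema and basic statistics for each attribute."""
    docs = list(documents)
    values = defaultdict(list)
    for doc in docs:
        for key, value in doc.items():
            values[key].append(value)

    description = {}
    for key, observed in values.items():
        # Treat None and empty strings as missing values.
        present = [v for v in observed if v not in (None, "")]
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        stats = {
            "types": sorted({type(v).__name__ for v in present}),
            "count": len(present),
            "missing": len(docs) - len(present),
        }
        if numeric:
            stats.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
        description[key] = stats
    return description


# Invented sample documents.
docs = [
    {"gid": 1, "type": "VehicleObstruction", "endtime": ""},
    {"gid": 3, "type": "Accident"},
    {"gid": 5, "type": "", "endtime": "2016-06-07 19:40:00"},
]
desc = describe_collection(docs)
print(desc)
```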


PROCESSING DATA COLLECTIONS

Description / Prediction techniques:
- Clustering (Bayesian, hierarchical, k-means, CLARA, PAM)
- Classification (Neural network, PLSDA, KNN, decision trees)
- Trend prediction
- Regression (PLSR, PCR)
- Association
- Modelling (LU, QR, PCA=SVD, PARAFAC)

Example questions:
- How many accidents are reported per day?
- What percentage of the available bicycles is used downtown?
- Which are the traffic bottleneck regions in the city?
- Is the number of car accidents related to seasons?
- How will the use of bicycles evolve downtown during the summers of the next 5 years?
- What types of cars have the most accidents on high-speed roads?
- Will increasing the parking cost reduce car traffic in the city and increase the number of people using public transport?


PROCESSING DATA COLLECTIONS

Spatial analysis of dynamic movements of Vélo’v, Lyon’s shared bicycle program [ECCS 2009]

Contribution: PCA and k-means for predicting the usage trend of Vélo’v in Lyon

Description > Clustering > PCA & k-means

PCA: an orthogonal linear transformation maps the data to a new coordinate system such that
- the greatest variance by some projection of the data comes to lie on the first coordinate (first principal component),
- the second greatest variance on the second coordinate, and so on.
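The variance-maximising property of the first principal component can be checked on a toy data set. The sketch below is plain Python and is not the cited study's code: it builds the covariance matrix of 2-D points and uses power iteration (a simple technique swapped in here instead of a full SVD) to approximate the first principal component. The sample points are invented.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def covariance_matrix(points: List[Point]) -> List[List[float]]:
    """Sample covariance matrix of 2-D points (data are centred on their mean)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def first_component(cov: List[List[float]], iterations: int = 200) -> Point:
    """Power iteration: repeatedly apply the covariance matrix and normalise."""
    v = (1.0, 0.5)  # arbitrary starting direction
    for _ in range(iterations):
        w = (cov[0][0] * v[0] + cov[0][1] * v[1],
             cov[1][0] * v[0] + cov[1][1] * v[1])
        norm = math.hypot(w[0], w[1])
        v = (w[0] / norm, w[1] / norm)
    return v


def projected_variance(cov: List[List[float]], direction: Point) -> float:
    """Variance of the data along a unit vector d, computed as d^T C d."""
    dx, dy = direction
    return dx * (cov[0][0] * dx + cov[0][1] * dy) + dy * (cov[1][0] * dx + cov[1][1] * dy)


# Invented points lying roughly along the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
cov = covariance_matrix(points)
pc1 = first_component(cov)
pc2 = (-pc1[1], pc1[0])  # in 2-D the second component is the orthogonal direction
var1 = projected_variance(cov, pc1)
var2 = projected_variance(cov, pc2)
print(pc1, var1, var2)
```

In this example nearly all the variance lies along the first component, which is why projecting on the leading components before clustering with k-means loses little information.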


PROCESSING DATA COLLECTIONS

Inferring the Root Cause in Road Traffic Anomalies [IEEE Data Mining 2012]

Contribution: PCA for identifying traffic anomalies and thereby detecting problems in roads

Description > Modelling > Principal Component Analysis (PCA)


DATA COLLECTIONS STORAGE: NO/NEWSQL SYSTEMS

[Figure: CAP triangle with vertices C, A, P and edges C-A, A-P, C-P; systems are placed along the edges and tagged by data model; annotation: "Can be sharded"]

CAP properties:
- Consistency: all clients always have the same view of the data
- Availability: each client can always read & write
- Partition tolerance: the system works well despite physical network partitions

Data models:
- Relational
- Key-Value
- Column oriented
- Document oriented

Systems shown in the figure:
- Dynamo, Voldemort, Tokyo Cabinet, KAI
- Cassandra, SimpleDB, CouchDB, Riak
- BigTable, HyperTable, HBase
- MongoDB, TerraStore, Scalaris
- BerkeleyDB, MemcacheDB, Redis
- RDBMSs (MySQL, Postgres, etc.)
- Aster Data, GreenPlum, Vertica, Neo4j


DATA COLLECTIONS STORAGE: NO/NEWSQL SYSTEMS

{
  "geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
  "full_location": "Autoroute du Soleil, 69005 Lyon",
  "_id": "Criter11185353",
  "properties": {
    "confidentiality": "noRestriction",
    "probability": "certain",
    "mobility": "",
    "creationtime": "2016-06-07 19:40:00",
    "publiceventtype": "",
    "networkmanagementtype": "",
    "observationtime": "2016-06-07 19:40:00",
    "last_update": "2016-06-07 19:43:30",
    "numberoflanesrestricted": "0",
    "effectonroadlayout": "",
    "creator": "CRITER",
    "id": "Criter11185353",
    "firstsupplierversiontime": "2016-06-07 19:40:00",
    "version": "1",
    "linkname": "",
    "type": "VehicleObstruction",
    "status": "active",
    "direction": "bothWays",
    "locationtype": "nonLinkedPoint",
    "disturbanceactivitytype": "",
    "last_update_fme": "2016-06-07 19:44:29",
    "endtime": "",
    "creationreference": "",
    "informationstatus": "real",
    "townname": "Voie Rapide Urbaine de Lyon",
    "publiccomment": "Bouchon, km 455|Voie Rapide Urbaine de Lyon",
    "roadmaintenancetype": "",
    "versiontime": "2016-06-07 19:43:26",
    "starttime": "2016-06-07 19:40:00",
    "gid": "39258",
    "abnormaltraffictype": ""
  }
}

Raw data collections:
- Tabular (CSV, Excel)
- Media (XML, JSON, BLOB)
- Graph

Data models:
- Relational
- Key-Value
- Column oriented / tabular
- Document oriented

- How to transform data collections?
- Which is the best adapted model?
→ Polyglot persistence

Approaches dealing with transformation rules inspired by the relational case
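One possible transformation rule from the document model to a tabular one is to flatten nested objects into dotted column names. The Python sketch below is an assumption about what such a rule could look like, not the approach evaluated in this work; `flatten` is a hypothetical helper, and the sample is a shortened version of the event document shown on this slide.

```python
import csv
import io
from typing import Any, Dict


def flatten(document: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested objects into one row; list items are assumed to be scalars."""
    row: Dict[str, Any] = {}
    for key, value in document.items():
        column = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            row.update(flatten(value, column))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                row[f"{column}[{i}]"] = item
        else:
            row[column] = value
    return row


# Shortened event document.
event = {
    "_id": "Criter11185353",
    "geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
    "properties": {"type": "VehicleObstruction", "status": "active"},
}
row = flatten(event)

# Serialise the single row as CSV, using the flattened keys as the header.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

Flattening is lossy in one respect: the nesting that grouped attributes together survives only in the column names, which is one reason the choice of target model matters.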


DATA COLLECTIONS STORAGE: NO/NEWSQL SYSTEMS

Persistence
- Which part of the document must persist?
- Explicit vs. implicit persistence
- In memory / hard disk

Fragmentation/Sharding & replication
- Vertical or horizontal fragmentation
- Strategies: range, hash, tagged
- Distribution & location

Availability & Fault tolerance
- Replication & distribution
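To make the range and hash strategies concrete, here is a small Python sketch of horizontal fragmentation. It is illustrative only: the shard count, the range boundaries and the use of a numeric `gid`-like key are assumptions, and a stable hash from `hashlib` is used instead of Python's built-in `hash`, which is randomised for strings between runs.

```python
import bisect
import hashlib
from collections import Counter
from typing import List


def hash_shard(key: str, n_shards: int) -> int:
    """Hash strategy: a stable digest of the key, taken modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


def range_shard(key: int, upper_bounds: List[int]) -> int:
    """Range strategy: shard i holds keys <= upper_bounds[i]; larger keys go to the last shard."""
    return bisect.bisect_left(upper_bounds, key)


# Invented, skewed key distribution: most keys fall in the first range.
keys = list(range(1, 901)) + list(range(2001, 2101))

range_counts = Counter(range_shard(k, [1000, 2000]) for k in keys)
hash_counts = Counter(hash_shard(str(k), 3) for k in keys)
print(dict(range_counts), dict(hash_counts))
```

With skewed keys the range strategy keeps neighbouring keys together but produces uneven shards, while the hash strategy spreads documents roughly evenly at the cost of scattering range queries over all shards.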

[Figure: raw data collections, illustrated by the same JSON traffic event document as on the previous slide, and the memory/cache tier]


EXPERIMENT 1: CREATING DATA COLLECTIONS & VIEWS (1/2)

§ Time varies with the size of the collection
– 2 min for the Grand Lyon event data set of 2,000 documents
– 36 hours for the Twitter data set of 700,000 documents

§ By far the biggest time consumer is the attribute collection, accounting for 90 to 95% of the time


[Figures: Grand Lyon, Twitter]


EXPERIMENT 1: CREATING DATA COLLECTIONS & VIEWS (2/2)

§ Time varies with the size of the collection
– 6-7 sec for the Grand Lyon event data set of 2,000 documents
– 50 min for the Twitter data set of 700,000 documents

[Figures: Grand Lyon, Twitter]