CURARE: CURATING AND MANAGING BIG DATA COLLECTIONS ON THE CLOUD
SUPERVISORS
PARISA GHODOUS, PR. U. LYON I, LIRIS
CATARINA FERREIRA DA SILVA, MCF. U. LYON I, LIRIS
GENOVEVA VARGAS-SOLAR, CR. CNRS, LIG-LAFMIA
Gavin Robert KEMP, LIRIS, [email protected]
ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
– Data Lakes
– Data Curation: Workflow and Approaches
– Curation Architectures and Platforms
§ CURARE: Service Oriented Architecture for Curating Data Collections
– Approach for Curating Data Collections
– Data Collection and View Model
– Services for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives
28/09/2018
CONTEXT & MOTIVATION
“Data is everything and everything is data”, Pythian
Turning real-world phenomena into data thanks to the Big Data trend
BIG DATA DEFINITION
§ Data collections with characteristics that make them difficult to process on single machines or traditional databases
§ A new generation of tools, methods and technologies to collect, process and analyse massive data collections
→ Tools that require parallel processing and distributed storage
Context & Problem Statement | State of the Art | CURARE | Implementation & Experimentation | Conclusion & Perspectives
BIG DATA PROPERTIES
- Volume (size)
- Velocity (production rate)
- Variety (data types & format)
- Variability (inconsistencies caused by constantly changing meaning)
- Veracity (truth and consistency)
- Value (how much information)
V's models: 3V, 4V, 5V, ..., 10V [Jagadish 2014]
“Big Data can really be very small and not all large datasets are big!”- Mike 2.0 [Hillard 2012]
BIG DATA LIFE CYCLE
Life cycle stages:
- Harvesting & Cleaning
- Preserving
- Processing & Analysis
- Exploiting
Data Collections Storage: NoSQL, NewSQL (Hive), Data sharding (MongoDB, CouchDB)
Data Processing: Data analytics [Feldman et al. 2013], Data Aggregation [Jayram et al. 2007]
CLOUD COMPUTING AND BIG DATA
Ready-to-use environments for storing Big Data & running greedy processing tasks
Software as a Service (SaaS): fully functional software
Infrastructure as a Service (IaaS): CPU, RAM, disk
Platform as a Service (PaaS): database systems, frameworks
Different sizes, evolution in structure, completeness, production conditions & content, access policies modification …
What information is available in these collections?
Is the data clean? If not what has to be done to make it clean?
Are there relations between data collections which could be exploited for better prediction?
What are the update rates and in what way does this affect the collection?
THE RIGHT DATA FOR THE RIGHT ANALYTICS
DATA CURATION
Data management through its lifecycle of interest & usefulness
§ Enable data discovery & retrieval
§ Maintain data quality
§ Add value
§ Provide for re-use over time
PROBLEM STATEMENT
Define a meta-data extraction model for data that supports decision making related to its exploitation
RESEARCH QUESTIONS
§ How to model structural, semantic & quantitative meta-data in an integrated manner?
§ How to design a cloud service-oriented architecture that enables data curation while considering variety and variability?
§ Can meta-data and a service-oriented data curation architecture support decision-making regarding data exploration?
ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
– Data Lakes
– Data Curation: Workflow and Approaches
– Curation Architectures and Platforms
§ CURARE: Service Oriented Architecture for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives
DATA LAKE
Centralized repository containing virtually inexhaustible amounts of raw data to be analysed
Sources (figure): CRM, sensor logs
Data wrangling [Terrizzano et al. 2015]
Constance [Hai et al. 2016]
DATA SWAMP
Repositories grow ever bigger and more complex, to the point that a lake becomes a swamp
Sources (figure): CRM, sensor logs
DATA CURATION WORKFLOW
Step 1: Harvesting (Selecting, Vetting, Collecting)
Step 2: Extracting meta-data (Preserving, Describing)
Step 3: Exploitation (Grooming, Provisioning)
Data wrangling [Terrizzano et al. 2015]
EXTRACTING META-DATA
Workflow steps: Harvesting; Extracting meta-data (Preserving, Describing); Exploring
Level 1 (no meta-data): sensing & collecting raw data
Level 2 (partial meta-data): size, frequency, freshness, type
Level 3 (complete meta-data): quantitative & qualitative data + semantics information
[Stonebraker et al. 2015]
APPROACHES FOR EXTRACTING META-DATA
Approach | Principle | Pros | Cons
Curation at Source [Curry et al. 2010] | Simple tagging & information extraction directly from the sensor | Pre-processes data according to samples | No awareness of the whole data collection
Master Data Management [Weatherspoon et al. 2013] | Provide a standard vocabulary at the company scale | Standardizes the language used in data collections | Does not improve data structures
Semantic linking, Constance [Hai et al. 2016] | Identify attributes referring to similar topics in different data sets | Explores data collections | Does not improve data structures
Data set clustering, Goods [Halevy 2016] | Discover similar data sets using clustering | Creates a synthesized representation of several data collections | Difficult to define a similarity criterion to cluster data with low quality (missing & null values, types)
Crowdsourcing and Collaboration spaces [Doan et al. 2005] | Communities produce, maintain & tag data (crowdsourcing) | Improves the quality of raw data | Manual & requires a lot of human resources
Level 1: simple tags as data is harvested
Level 2: pivot vocabulary & semantics
Level 3: manual and collaborative
DATA CURATION PLATFORMS
Technique | Idea | Pros | Cons
Big Data Stacks (Hadoop, Berkeley, Asterix, Spark) | Provide data processing tools using parallel programming models; support query languages & visualisation tools | Versatile, general purpose | Low level of adaptability; centred on analysis, not on data collections exploration
Service based architectures [CoreKG Beheshti 2018, Lord 2004] | Provide independent services implementing the steps of the data curation workflow | Services make it possible to adapt the tools to the data collections; good maintainability | Services assembly is up to the programmer
Data curation network [Alfred P. Sloan Foundation 2017] | Provide a Web portal that integrates tools from libraries willing to curate data collections | General view of tools and data collections; runs on the Web; collaborative curation | No integration of tools; the curation process is not explicit; no automatic control on who does what
LIMITATIONS OF THE STATE OF THE ART (1/2)
§ Model in an integrated manner structural, semantic & quantitative meta-data?
– No integrated data curation model
• to explore data collections in search of well-adapted analytics input
§ Design a cloud service oriented architecture for enabling data curation considering variety and variability?
– No elastic architecture
– No target solution for data curation
LIMITATIONS OF THE STATE OF THE ART (2/2)
§ Can meta-data and a service-oriented data curation architecture support decision-making regarding data exploration?
– Not enough quantitative meta-data maintenance to make technical decisions
– Domain dependent curation
• no explicit data curation workflow
• exploration based on visualization
ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
– Approach for Curating Data Collections
– Data Collection and View Model
– Services for Curating Data Collections
§ Implementation and Experimentation
§ Conclusion and Perspectives
TAILORING BIG DATA CURATION
Propose a service-based data curation system providing data analysts with
– Abstract information about data collections content with several associated releases
– Tools for curating data collections
– Strategies for managing metadata related to data collections
CONTRIBUTIONS AND RESULTS
Prototype: MongoDB & OpenStack. Experiments on Grand Lyon and Twitter urban transport data collections
CURARE: Cloud Service Oriented Architecture for storing & curating data collections
Data collection & View model; data curation formalisation
DATA CURATION APPROACH
Extracting as much meta-data as possible to
§ Make technical decisions
§ Choose data exploitation & analysis strategies
Figure elements: Harvesting; Data Collections; Extraction; Structural meta-data (DATA COLLECTION MODEL); Quantitative meta-data (VIEW MODEL); Release & View: curated data collection; Views exploration assists decision making
Data harvesting services
Data processing services (meta-data extraction)
Storage services
Curated data collections
CURARE: Cloud Service Oriented Architecture for Curating Data Collections
DATA COLLECTION: STRUCTURAL META DATA
Figure example: data produced at t+1 (Size: 2) and at t+2 (Size: 3), with items {Id: doc1, Bus: 2, Loc: univ}, {Id: doc2, N°car: 5, Red: true}, {Id: doc2, temp: 21, humidi: 61}
Meta-data origin: retrieved from metadata & the provider; manual; extracted
DataCollection
- id: URL
- name: String
- provider: String
- licence: [public, restricted]
- size: Num
- author: String
- description: String
Release
- id: URL
- release: Num
- publicationDate: Date
- size: Num
DataItem
- id: URL
- name: String
- attributs: List()
Associations: one DataCollection has 1..n Release (releases); one Release has 1..n DataItem (items)
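The structural model above can be sketched as plain Python data classes. This is only an illustration under stated assumptions: the field names follow the slide (with `attributes` standing in for `attributs`), `size` is assumed to be the number of items, and `item_from_document` together with the example URLs is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DataItem:
    id: str                       # URL identifying the item
    name: str
    attributes: List[str] = field(default_factory=list)  # "attributs" on the slide


@dataclass
class Release:
    id: str                       # URL identifying the release
    release: int                  # release number
    publication_date: str
    items: List[DataItem] = field(default_factory=list)

    @property
    def size(self) -> int:
        # Assumption: size is the number of items in the release.
        return len(self.items)


@dataclass
class DataCollection:
    id: str
    name: str
    provider: str
    licence: str                  # "public" or "restricted"
    author: str
    description: str
    releases: List[Release] = field(default_factory=list)

    @property
    def size(self) -> int:
        # Assumption: collection size is the sum of its release sizes.
        return sum(r.size for r in self.releases)


def item_from_document(doc: Dict[str, Any], base_url: str) -> DataItem:
    """Hypothetical helper: describe a JSON-like document by its top-level attribute names."""
    doc_id = str(doc.get("_id", ""))
    return DataItem(id=f"{base_url}/{doc_id}", name=doc_id, attributes=sorted(doc.keys()))
```

A release is then built by harvesting documents and wrapping each one with `item_from_document`; the collection only aggregates releases, so structural changes between releases remain visible.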
{"geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
 "full_location": "Autoroute du Soleil, 69005 Lyon",
 "_id": "Criter11185353",
 "properties": {
  "confidentiality": "noRestriction", "probability": "certain", "mobility": "",
  "creationtime": "2016-06-07 19:40:00", "publiceventtype": "", "networkmanagementtype": "",
  "observationtime": "2016-06-07 19:40:00", "last_update": "2016-06-07 19:43:30",
  "numberoflanesrestricted": "0", "effectonroadlayout": "", "creator": "CRITER", "id": "Criter11185353",
  "firstsupplierversiontime": "2016-06-07 19:40:00", "version": "1", "linkname": "",
  "type": "VehicleObstruction", "status": "active", "direction": "bothWays",
  "locationtype": "nonLinkedPoint", "disturbanceactivitytype": "",
  "last_update_fme": "2016-06-07 19:44:29", "endtime": "", "creationreference": "",
  "informationstatus": "real", "townname": "Voie Rapide Urbaine de Lyon",
  "publiccomment": "Bouchon, km 455|Voie Rapide Urbaine de Lyon", "roadmaintenancetype": "",
  "versiontime": "2016-06-07 19:43:26", "starttime": "2016-06-07 19:40:00", "gid": "39258",
  "abnormaltraffictype": ""}
}
THE GRAND LYON TRAFFIC DATA COLLECTION
Data collection: one release per day (R1 ... Rn)
Items of a release: JSON documents
Item in R1 (annotated parts): geographic coordinates, road logic name, observation data, traffic event description
DATA RELEASE DATA ITEM: TRAFFIC EVENT
JSON document:
{"geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
 "full_location": "Autoroute du Soleil, 69005 Lyon",
 "_id": "Criter11185353",
 "properties": {
  "confidentiality": "noRestriction", "probability": "certain", "mobility": "",
  "creationtime": "2016-06-07 19:40:00", "publiceventtype": "", "networkmanagementtype": "",
  "observationtime": "2016-06-07 19:40:00", "last_update": "2016-06-07 19:43:30",
  "numberoflanesrestricted": "0", "effectonroadlayout": "", "creator": "CRITER", "id": "Criter11185353",
  "firstsupplierversiontime": "2016-06-07 19:40:00", "version": "1", "linkname": "",
  "type": "VehicleObstruction", "status": "active", "direction": "bothWays",
  "locationtype": "nonLinkedPoint", "disturbanceactivitytype": "",
  "last_update_fme": "2016-06-07 19:44:29", "endtime": "", "creationreference": "",
  "informationstatus": "real", "townname": "Voie Rapide Urbaine de Lyon",
  "publiccomment": "Bouchon, km 455|Voie Rapide Urbaine de Lyon", "roadmaintenancetype": "",
  "versiontime": "2016-06-07 19:43:26", "starttime": "2016-06-07 19:40:00", "gid": "39258",
  "abnormaltraffictype": ""}
}
Annotated parts: geographic coordinates, road logic name, observation data, traffic event description
Structure description:
DataItem
- id: URL
- name: String
- attributs: List()
DATA COLLECTION RELEASE: TRAFFIC EVENTS IN A PERIOD OF TIME
{"geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
 "full_location": "Autoroute du Soleil, 69005 Lyon",
 "_id": "Criter11185353",
 "properties": {
  "confidentiality": "noRestriction", "probability": "certain", "mobility": "",
  "creationtime": "2016-06-07 19:40:00", "publiceventtype": "", "networkmanagementtype": "",
  "observationtime": "2016-06-07 19:40:00", "last_update": "2016-06-07 19:43:30",
  "numberoflanesrestricted": "0", "effectonroadlayout": "", "creator": "CRITER", "id": "Criter11185353",
  "firstsupplierversiontime": "2016-06-07 19:40:00", "version": "1", "linkname": "",
  "type": "VehicleObstruction", "status": "active", "direction": "bothWays",
  "locationtype": "nonLinkedPoint", "disturbanceactivitytype": "",
  "last_update_fme": "2016-06-07 19:44:29", "endtime": "", "creationreference": "",
  "informationstatus": "real", "townname": "Voie Rapide Urbaine de Lyon",
  "publiccomment": "Bouchon, km 455|Voie Rapide Urbaine de Lyon", "roadmaintenancetype": "",
  "versiontime": "2016-06-07 19:43:26", "starttime": "2016-06-07 19:40:00", "gid": "39258",
  "abnormaltraffictype": ""}
}
DataItem
- id: URL
- name: String
- attributs: List()
Release
- id: URL
- release: Num
- publicationDate: Date
- size: Num
Association: one Release has 1..n DataItem (items)
Release R1:
"_id" : "tweets_736241_collection_24_17021",
"id" : 17021,
"name" : "tweets_736241_collection_24_17021",
"publicationDate" : "Mon Aug 08 2016 00:00:41 GMT+0000 (UTC)",
"size" : 4386,
"url" : "localhosttweets_736241_collection_2417021",
"dataItems": [ List(dataItem) ]
DATA COLLECTION: GRAND LYON TRAFFIC EVENTS
Model: one DataCollection has 1..n Release (releases); one Release has 1..n DataItem (items)
DataCollection <instance>:
"_id" : "CollectionInfo",
"author" : "twitter",
"description" : "this is a Collection",
"id" : "...",
"licence" : "public",
"name" : "tweets",
"provider" : "twitter",
"releases" : [ List(Release) ],
"size" : 736241
Release <instance> (R1 ... Rn):
"_id" : "tweets_736241_collection_24_17021",
"id" : 17021,
"name" : "tweets_736241_collection_24_17021",
"publicationDate" : "Mon Aug 08 2016 00:00:41 GMT+0000 (UTC)",
"size" : 4386,
"url" : "localhosttweets_736241_collection_2417021",
"dataItems": [ List(dataItem) ]
DataItem <instance>: items of a release
VIEW MODEL
Data structure providing an aggregated perspective of a data collection's content and its associated releases
§ Describes the content of a data collection with a series of tuples <attribute, type>
§ For every attribute the view describes
– basic statistical data associated with the values of every attribute: min, max, mean, median, histograms, standard deviations
– null and missing values representation for every couple <attribute, type> for every attribute in a document
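The per-attribute statistics listed above could be computed along the following lines. This is a minimal sketch, not the prototype's code: `describe_attribute` is a hypothetical helper, items are assumed to be flat dictionaries, the type is taken from the first non-null value, and the population standard deviation is an arbitrary choice.

```python
import statistics
from collections import Counter
from typing import Any, Dict, List


def describe_attribute(items: List[Dict[str, Any]], name: str) -> Dict[str, Any]:
    """Summarise one attribute across all items of a release."""
    present = [it[name] for it in items if name in it]      # attribute exists in the item
    values = [v for v in present if v is not None]          # attribute exists and is not null
    numbers = [v for v in values
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    descriptor: Dict[str, Any] = {
        "name": name,
        "type": type(values[0]).__name__ if values else "null",
        "count": len(values),
        "missing": len(items) - len(present),                # items without the attribute
        "nulls": len(present) - len(values),                 # items where it is null
        "valueDistribution": dict(Counter(str(v) for v in values)),
    }
    if numbers:
        descriptor.update({
            "min": min(numbers),
            "max": max(numbers),
            "mean": statistics.mean(numbers),
            "median": statistics.median(numbers),
            "std": statistics.pstdev(numbers),
        })
    return descriptor
```

Keeping `missing` and `nulls` as separate counters mirrors the distinction the view model makes between absent attributes and null values.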
VIEW: QUANTITATIVE META DATA
Meta-data origin: extracted; computed
View
- id: URL
- name: String
- provider: String
- author: String
- description: String
- code: Code
- releaseSelectionRules: Function (Source: String, default: version.id)
ReleaseView
- id: URL
- version: num
- publicationDate: date
- size: num
AttributeDescriptor
- id: URL
- name: String
- type: String
- valueDistribution: Hist
- nullValue: num
- absentValue: num
- minValue, maxValue: type
- mean, median, mode: type
- std: type
- count: num
Associations: one View has 0..n ReleaseView (releaseViews); one ReleaseView has 0..n AttributeDescriptor (attributDescriptors)
Extension links (figure): to Data Collection, Release and DataItem
Figure example: data produced at t+1 (Size: 2) and at t+2 (Size: 3), with items {Id: doc1, Bus: 2, Loc: univ}, {Id: doc2, N°car: 5, Red: true}, {Id: doc2, temp: 21, humidi: 61}
TRAFFIC EVENTS ATTRIBUTES STATISTICS: ATTRIBUTE DESCRIPTOR
Release R1: DataItem instances holding attribute-value pairs <at11, v11>, <at12, v12>, ..., <at1n, v1n>
For each attribute across all items in a release Ri
Quantitative meta-data: statistical measures of each attribute
AttributDescriptor:
{"_id" : "tweets_736241_viewMapReduce_24_17021.entities.hashtags.0.text.number",
 "value" : {"data": [9,9,9,9,9,9], "median": 9, "mean": 9, "count": 6,
  "valueDistribution": {"9" : 6}, "type": "number",
  "mode": {"value": "9", "count": 6}, "missing": 4380, "nulls": null }}
TRAFFIC EVENTS STATISTICS: RELEASE VIEW
Release Ri
Quantitative meta-data: aggregated measures of attributes statistics in all its data items
ReleaseView:
"_id": "tweets_736241_viewMapReduce_24_17021",
"version": "17021",
"publicationDate": ISODate("2016-08-08T00:00:00Z"),
"size": 765,
"attributDescriptors": [List(AttributDescriptor)]
Aggregate: one ReleaseView has 0..n AttributDescriptor (attributDescriptors), holding the statistics of attributes a1 ... an
GRAND LYON TRAFFIC EVENT STATISTICS: VIEW
Data collection
View:
...
"code" : {"model-framework": "MongoMapReduceFinalize",
 "mapper": Mapper(),
 "reducer": Reducer(),
 "finalize": Finalizer() },
"rules": "function rule(version){\r\n\treturn version\r\n}"
Quantitative meta-data: aggregated measures of releases statistics
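The mapper / reducer / finalize triple named in the view's `code` field can be illustrated in plain Python. This is a sketch of the general map-reduce-finalize pattern only, under the assumption that documents are flat dictionaries; it is not the prototype's MongoDB code, and `run_view` is a hypothetical driver.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Partial = Dict[str, float]


def mapper(doc: Dict[str, Any]) -> Iterator[Tuple[str, Partial]]:
    """Emit one partial summary per numeric attribute of a document."""
    for key, value in doc.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            yield key, {"count": 1, "sum": value, "min": value, "max": value}


def reducer(a: Partial, b: Partial) -> Partial:
    """Merge two partial summaries; the merge is associative and commutative."""
    return {
        "count": a["count"] + b["count"],
        "sum": a["sum"] + b["sum"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }


def finalize(partial: Partial) -> Dict[str, float]:
    """Derive the measures that cannot be merged incrementally, here the mean."""
    result = dict(partial)
    result["mean"] = partial["sum"] / partial["count"]
    return result


def run_view(docs: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Hypothetical sequential driver standing in for the distributed engine."""
    acc: Dict[str, Partial] = {}
    for doc in docs:
        for key, partial in mapper(doc):
            acc[key] = reducer(acc[key], partial) if key in acc else partial
    return {key: finalize(partial) for key, partial in acc.items()}
```

Because the reducer only merges partial summaries, the same three functions can run per shard and then be merged again, which is what makes the pattern suitable for parallel execution.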
Aggregate: one View has 0..n ReleaseView, one per release R1 ... Rn
Each ReleaseView has 0..n AttributDescriptor (attributDescriptors), holding the statistics of attributes a1 ... an
SERVICES FOR CURATING DATA COLLECTIONS
Integrated Data Curation Platform
Data Harvesting Services: REST, FLUME
Data Cleaning & Processing Services: PIG, HADOOP
Data Storage Services: MongoDB, Neo4J, CouchDB (Tagged Data Collections)
Decision Making Support Services
Data Analytics Services
Towards Cloud big data services for intelligent transport systems; G. Kemp, G. Vargas-Solar, C. Ferreira da Silva, P. Ghodous, C. Collet, P. Lopez. Concurrent Engineering 2015, Delft, Netherlands
CLOUD DEPLOYMENT ENVIRONMENT
Image modified from Windows data science virtual machine: https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/
CURARE CLOUD SERVICES
Integrated Data Curation Platform, deployed on SaaS and PaaS layers
– CURARE DATA HARVESTING SERVICE
– CURARE DATA CLEANING SERVICE
– CURARE DATA STORAGE SERVICES: Views (quantitative & structural description); Meta-data (dataCollection, release); Tagged Data Collections
– CURARE DATA EXPLORATION SERVICES
– Decision making services
– Data analytics services
DATA HARVESTING: BATCH
CURARE DATA HARVESTING SERVICE (PaaS)
Data source (on demand producer)
Service instance (continuous); method call instances: connect(), get(), close()
Buffer & caching service
Iterators over the harvested data
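The connect() / get() / close() cycle with an iterator can be sketched as follows. Only the three method names come from the slide; the `BatchHarvester` class, the `fetch_page` callback and the paging scheme are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Iterator, List

Document = Dict[str, Any]


class BatchHarvester:
    """Pulls documents from an on-demand source, one page at a time."""

    def __init__(self, fetch_page: Callable[[int, int], List[Document]], batch_size: int = 100):
        # fetch_page(offset, limit) is a hypothetical stand-in for the data source API.
        self._fetch_page = fetch_page
        self._batch_size = batch_size
        self._connected = False

    def connect(self) -> None:
        self._connected = True

    def get(self) -> Iterator[Document]:
        """Return an iterator over all documents currently offered by the source."""
        if not self._connected:
            raise RuntimeError("call connect() before get()")
        return self._iterate()

    def _iterate(self) -> Iterator[Document]:
        offset = 0
        while True:
            page = self._fetch_page(offset, self._batch_size)
            if not page:          # an empty page marks the end of the batch
                return
            yield from page
            offset += len(page)

    def close(self) -> None:
        self._connected = False
```

Returning an iterator keeps only one page in memory at a time, which is the role the buffer & caching service plays in the figure.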
DATA HARVESTING: STREAM
CURARE DATA HARVESTING SERVICE (PaaS)
Service instance (continuous); method call instances: subscribe(), receive(), stop()
Message queue service (push/pull)
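For the streaming case, the subscribe() / receive() / stop() cycle can be sketched with an in-process queue standing in for the message queue service. Only the three method names come from the slide; `StreamHarvester`, the handler callback and the timeout parameter are assumptions.

```python
import queue
import threading
from typing import Any, Callable, Optional


class StreamHarvester:
    """Pulls messages from a queue and hands them to a subscribed handler."""

    def __init__(self, message_queue: "queue.Queue[Any]"):
        # A queue.Queue stands in for the external message queue service.
        self._queue = message_queue
        self._handler: Optional[Callable[[Any], None]] = None
        self._stopped = threading.Event()

    def subscribe(self, handler: Callable[[Any], None]) -> None:
        self._handler = handler

    def receive(self, timeout: float = 0.1) -> bool:
        """Pull one message; return False if stopped, unsubscribed, or nothing arrived in time."""
        if self._stopped.is_set() or self._handler is None:
            return False
        try:
            message = self._queue.get(timeout=timeout)
        except queue.Empty:
            return False
        self._handler(message)
        return True

    def stop(self) -> None:
        self._stopped.set()
```

This sketch shows the pull side only; a push variant would have the queue service invoke the handler directly.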
DATA CLEANING & STORAGE
Services: CURARE DATA HARVESTING SERVICE; CURARE DATA CLEANING SERVICE; CURARE DISTRIBUTED DATA STORAGE SERVICES (PaaS)
Calls between services: post(), post()
DATA PROCESSING & EXPLORATION
CURARE DISTRIBUTED DATA STORAGE SERVICES: raw data, curated data, analysed data
CURARE DATA EXPLORATION SERVICES; Data analytics services; Decision making services
Interactions (numbered 1 to 4 in the figure): get(analysedData); sendInstructions(makecollectionViews); View dataCollections; sendInstructions(analyseData)
DATA VARIETY & VARIABILITY
§ Data Collection and View Models
– Data collections track structural meta-data over time through releases
– Views track quantitative meta-data over time through release views
– Attribute descriptor tracks every existing value
– Release Views track the evolution over time of an attribute’s values and their distribution
§ Data Curation Service based Architecture
– Service swapping and adding thanks to SOA, enabling data processing tools to explore new data formats & meaning
Context & Problem Statement | State of the Art | CURARE | Implementation & Experimentation | Conclusion & Perspectives
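One possible reading of the data collection and view models, written as Python dataclasses. The entity names follow the slides (Data collection, Release, View, ReleaseView, AttributeDescriptor); the fields and the `describe` helper are assumptions made for illustration, and the sketch assumes flat documents with hashable values.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Release:
    """One version of a data collection: a set of items (documents)."""
    release_id: str
    items: List[dict]


@dataclass
class DataCollection:
    """Structural meta-data: the collection and its successive releases."""
    name: str
    releases: List[Release] = field(default_factory=list)


@dataclass
class AttributeDescriptor:
    """Tracks every existing value of one attribute within a release."""
    attribute: str
    value_counts: Counter = field(default_factory=Counter)
    missing: int = 0


@dataclass
class ReleaseView:
    """Quantitative meta-data of one release, one descriptor per attribute."""
    release_id: str
    descriptors: Dict[str, AttributeDescriptor] = field(default_factory=dict)


@dataclass
class View:
    """Quantitative meta-data of a collection over time."""
    collection: str
    release_views: List[ReleaseView] = field(default_factory=list)


def describe(release: Release) -> ReleaseView:
    """Build the release view by counting values and missing entries."""
    attributes = sorted({key for item in release.items for key in item})
    view = ReleaseView(release.release_id)
    for name in attributes:
        descriptor = AttributeDescriptor(name)
        for item in release.items:
            value = item.get(name)
            if value in (None, ""):
                descriptor.missing += 1
            else:
                descriptor.value_counts[value] += 1
        view.descriptors[name] = descriptor
    return view


collection = DataCollection("traffic", [
    Release("r1", [{"type": "VehicleObstruction", "status": "active"},
                   {"type": "VehicleObstruction", "status": ""}]),
    Release("r2", [{"type": "Accident"}]),
])
view = View(collection.name, [describe(r) for r in collection.releases])
```

In this toy collection the second release no longer carries the `status` attribute, which is the kind of structural change between releases that the model is meant to expose.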
47
ROADMAP
§ Context and Problem Statement: Exploring and Managing Big Data Collections
§ State of the Art
§ CURARE: Service Oriented Architecture for Curating Data Collections
§ Implementation and Experimentation
– Current Prototype
– Experiment 1: Evaluating the Cost of Computing Views
– Experiment 2: Data Sharding
– Comparison and Lessons Learned
§ Conclusion and Perspectives
48
CURRENT PROTOTYPE IMPLEMENTATION
§ Deployed on OpenStack (https://www.openstack.org)
§ Distributed Data and Access Service: MongoDB
– 4 routers, 3 config servers, 3 shards with 3 replica sets
§ Scripts and statistical operators
– process data collections
– generate objects instantiating our model
§ Machines used
– 4 vCPUs at 2.5 GHz, 8 GB RAM, 80 GB disk
49
EXPERIMENTS SETTING
§ Profile the cost of extracting meta-data
– Compute statistical measures to derive quantitative meta-data
– Measure execution time in centralized and parallel execution environments
§ Use case: test the value of using views in a data sharding decision-making process
§ Input data collections
– Grand Lyon data set: 2095 documents, 86 releases, 2.5 MB
– Twitter data set: 736242 documents, 125 releases, 2.5 GB
– Samples ranging from 4000 to 20000 documents
51
How much does it cost to extract meta-data & compute Views?
52
EXPERIMENT 1: CREATING DATA COLLECTIONS & VIEWS (1/2)
§ The cost varies with the size of the collection and the number of attributes
§ By far the biggest time consumer is the attribute collection, accounting for 90 to 95% of the time
– 2 min for the Grand Lyon event data set of 2000 documents
– 36 hours for the Twitter data set of 700000 documents
[Chart: Grand Lyon data set]
53
EXPERIMENT 1: CREATING DATA COLLECTIONS & VIEWS (2/2)
§ Speed gain relative to the size of the releases
– 6-7 sec for the Grand Lyon event data set of 2000 documents
– 50 min for the Twitter data set of 700000 documents
[Chart: Grand Lyon data set]
55
SHARDING DATA ACROSS DIFFERENT STORES
Traffic status
{
  "geometry": {"type": "Point", "coordinates": [4.821773, 45.7513]},
  "full_location": "Autoroute du Soleil, 69005 Lyon",
  "_id": "Criter11185353",
  "properties": {
    "confidentiality": "noRestriction", "probability": "certain", "mobility": "",
    "creationtime": "2016-06-07 19:40:00", "publiceventtype": "", "networkmanagementtype": "",
    "observationtime": "2016-06-07 19:40:00", "last_update": "2016-06-07 19:43:30",
    "numberoflanesrestricted": "0", "effectonroadlayout": "", "creator": "CRITER", "id": "Criter11185353",
    "firstsupplierversiontime": "2016-06-07 19:40:00", "version": "1", "linkname": "",
    "type": "VehicleObstruction", "status": "active", "direction": "bothWays",
    "locationtype": "nonLinkedPoint", "disturbanceactivitytype": "",
    "last_update_fme": "2016-06-07 19:44:29", "endtime": "", "creationreference": "",
    "informationstatus": "real", "townname": "Voie Rapide Urbaine de Lyon",
    "publiccomment": "Bouchon, km 455|Voie Rapide Urbaine de Lyon", "roadmaintenancetype": "",
    "versiontime": "2016-06-07 19:43:26", "starttime": "2016-06-07 19:40:00",
    "gid": "39258", "abnormaltraffictype": ""
  }
}
Balanced and smooth fragmentation (size, location, availability)
Optimum distribution across shards providing storage spaces (chunks)
Shard 0: chunks, key range 0 ... 20
Shard 1: chunks, key range 21 ... 40
Shard 2: chunks, key range 41 ... 60
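To make the two sharding strategies concrete, here is a small Python sketch. The key ranges are those shown on the slide; the functions `range_shard` and `hash_shard`, the MD5-based hash and the sample keys are assumptions for illustration and do not reproduce MongoDB's own hashing.

```python
import hashlib
from collections import Counter
from typing import List, Tuple

# Key ranges taken from the slide: one (low, high) interval per shard.
KEY_RANGES: List[Tuple[int, int]] = [(0, 20), (21, 40), (41, 60)]


def range_shard(key: int) -> int:
    """Return the index of the shard whose key range contains `key`."""
    for shard, (low, high) in enumerate(KEY_RANGES):
        if low <= key <= high:
            return shard
    raise ValueError(f"key {key} is outside every range")


def hash_shard(key: int, shards: int = 3) -> int:
    """Return a shard index derived from a stable hash of the key."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % shards


# Skewed keys: most documents fall in the first range.
keys = [1, 2, 3, 4, 5, 6, 7, 8, 25, 50]
range_load = Counter(range_shard(k) for k in keys)
hash_load = Counter(hash_shard(k) for k in keys)
```

With skewed keys, range sharding sends most documents to a single shard, whereas hashing tends to spread them more evenly at the price of losing key locality.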
56
EXPERIMENT 2: SHARDING WITHOUT VIEWS
Workflow carried out by the data scientist on the raw data collection:
– Extract a sample
– Explore the data collection manually
– Identify attributes manually
– Choose sharding key candidates
– Choose sharding strategy: associate key candidates with sharding strategies
– Shard collection with strategies
– Evaluate data distribution across fragments
– If query evaluation time is poor, repeat
57
EXPERIMENT 2: EVALUATION OF SHARDING KEYS DECISIONS WITHOUT VIEWS
The number of documents and the amount of data per shard vary
§ greatly for ranged sharding on location
§ to a lesser extent for hashed sharding on location
58
EXPERIMENT 2: DECISION MAKING QUESTIONS
§ Which attribute can be used to shard the collection?
§ What is the distribution of values of each attribute?
§ Is there critical data with particular availability requirements?
§ Should some fragments be collocated?
This information must be extracted and computed
59
EXPERIMENT 2: SHARDING USING VIEWS
Workflow starting from the raw data collection:
– Get meta-data: extract meta-data, compute meta-data
– Create Data Collection meta-data
– Compute quantitative measures
– Discover domain value properties (null, absent values)
– Create Views
– Identify attributes
– Choose sharding key candidates
– Associate key candidates with sharding strategies
– Shard collection with strategies
– Evaluate data distribution across fragments
Model entities: Data collection, Release, Item, View, ReleaseView, AttributeDescriptor
Actors: Data scientist, Data provider
60
Identify the best candidate key attributes
61
EXPERIMENT 2: IDENTIFYING ATTRIBUTES VALUES DISTRIBUTION
ReleaseView: attributes related to location

Attribute | nb values | missing | max count | min count
tweets.quoted_status.user.location.string | 8436 | 3008392 | 16820 | 4
tweets.user.location.string | 11041 | 571962 | 269506 | 4
tweets.user.time_zone.string | 149 | 860539 | 870263 | 4
tweets.place.bounding_box.coordinates.0.0.0.number | 3 | 0 | 1983882 | 19950
tweets.place.bounding_box.coordinates.0.0.1.number | 3 | 0 | 1968043 | 35784
tweets.place.bounding_box.coordinates.0.1.0.number | 3 | 0 | 1983882 | 19950
tweets.place.bounding_box.coordinates.0.1.1.number | 3 | 0 | 1969602 | 34231
tweets.place.bounding_box.coordinates.0.2.0.number | 5 | 0 | 1898817 | 19950
tweets.place.bounding_box.coordinates.0.2.1.number | 3 | 0 | 1969602 | 34231
tweets.place.bounding_box.coordinates.0.3.0.number | 3 | 0 | 1898817 | 19950
tweets.place.bounding_box.coordinates.0.3.1.number | 3 | 0 | 1968043 | 25784
tweets.quoted_status.user.geo_enabled.string | 2 | 2938908 | 155678 | 95665
tweets.user.geo_enabled.string | 1 | 0 | 3190337 | 3190337

Annotations on the table:
– Too many missing values
– Good distribution of values, few missing values → candidate sharding key
– Lead to unbalanced chunks
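The four columns of this table can be computed with a few lines of Python. The sketch below is an assumption about how such measures may be derived, not the scripts used in the experiment: it reads "nb values" as the number of distinct values and "max count" / "min count" as the frequencies of the most and least frequent values. The helper names `flatten` and `summarise` are hypothetical, and leaf values are assumed to be scalars.

```python
from collections import Counter
from typing import Dict, Iterator, List, Tuple


def flatten(doc: dict, prefix: str = "") -> Iterator[Tuple[str, object]]:
    """Yield (dotted path, value) pairs for every leaf of a nested document."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def summarise(docs: List[dict]) -> Dict[str, Dict[str, int]]:
    """Compute nb values, missing, max count and min count per attribute."""
    flat_docs = [dict(flatten(d)) for d in docs]
    paths = sorted({p for flat in flat_docs for p in flat})
    summary = {}
    for path in paths:
        counts = Counter()
        missing = 0
        for flat in flat_docs:
            value = flat.get(path)
            if value in (None, ""):
                missing += 1
            else:
                counts[value] += 1
        summary[path] = {
            "nb_values": len(counts),
            "missing": missing,
            "max_count": max(counts.values(), default=0),
            "min_count": min(counts.values(), default=0),
        }
    return summary


# Toy documents, not taken from the Twitter data set.
tweets = [
    {"user": {"time_zone": "Paris", "geo_enabled": "true"}},
    {"user": {"time_zone": "Paris", "geo_enabled": "true"}},
    {"user": {"time_zone": "Athens", "geo_enabled": "true"}},
    {"user": {"time_zone": None, "geo_enabled": "true"}},
]
stats = summarise(tweets)
```

An attribute such as `user.geo_enabled` here, with a single value and identical max and min counts, cannot split the collection, while an attribute with many missing values leaves a large share of documents without a usable key.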
62
With which sharding strategies (range, hash) can candidate keys cope?
63
EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY TIME ZONE (1/2)
[Figure: Attribute Descriptor histogram of the time zone attribute. Y axis: number of documents (0 to 1000000). X axis: time zone values, from Abu Dhabi to Wellington. Labelled bars: Paris, US & Canada, Ljubljana, Greenland, Athens, Amsterdam, Missing.]
64
EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY TIME ZONE (2/2)
[Figure: Attribute Descriptor histogram of the time zone attribute, split into three key ranges assigned to Shard1, Shard2 and Shard3. Y axis: number of documents (0 to 1000000). X axis: time zone values, from Abu Dhabi to Wellington. Labelled bars: Paris, US & Canada, Ljubljana, Greenland, Athens, Amsterdam, Missing.]
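Cutting a histogram into key ranges of similar weight can be sketched as follows. This is a simple greedy heuristic written for illustration, not the procedure used in the experiment; the function names and the counts are hypothetical.

```python
from typing import Dict, List


def split_ranges(histogram: Dict[str, int], shards: int = 3) -> List[List[str]]:
    """Greedily cut the sorted key values into contiguous ranges of similar weight."""
    target = sum(histogram.values()) / shards
    ranges: List[List[str]] = [[]]
    load = 0
    for value in sorted(histogram):
        # Open a new range once the current one has reached its share.
        if load >= target and len(ranges) < shards:
            ranges.append([])
            load = 0
        ranges[-1].append(value)
        load += histogram[value]
    return ranges


def loads(histogram: Dict[str, int], ranges: List[List[str]]) -> List[int]:
    """Number of documents that each range would send to its shard."""
    return [sum(histogram[v] for v in values) for values in ranges]


# Illustrative counts only, not the values measured in the experiment.
time_zones = {"Amsterdam": 30, "Athens": 25, "Greenland": 20,
              "Ljubljana": 15, "Paris": 60, "US & Canada": 30}
ranges = split_ranges(time_zones)
shard_loads = loads(time_zones, ranges)
```

Because ranges must stay contiguous in key order, a single dominant value limits how balanced the shards can be, which is why the distribution of values needs to be known before choosing the key.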
65
EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY LOCATION (1/3)
[Figure: Attribute Descriptor histogram of the location attribute. Y axis: number of documents (0 to 700000). X axis: free-text location values (for example "Arcachon, France", "Gilbert, AZ", "Hogwart", "Somewhere in a crowd").]
66
EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY LOCATION (2/3)
[Figure: Attribute Descriptor histogram of the location attribute. Y axis: number of documents (0 to 3500000). X axis: location (free-text values). Labelled bars: France, Lyon, missing.]
67
EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY LOCATION (3/3)
[Figure: Attribute Descriptor histogram of the location attribute, split into three key ranges assigned to Shard1, Shard2 and Shard3. Y axis: number of documents (0 to 3500000). X axis: location (free-text values). Labelled bars: France, Lyon, missing.]
69
Is decision making for sharding better done with or without views?
70
COMPARISON
[Diagram: the workflows for sharding with views and for sharding without views, shown side by side.]

Without views: iterative process
- Guess sharding keys according to the data scientist's knowledge
- Trial-and-error approach based on samples
With views: one-shot process
- Full structural & quantitative knowledge of the data collection
- Decision based on the whole data collection

Criterion | Without Views | With Views
Query attributes | No | Yes
Explorable attributes | Sample | All
Decision support | Educated guess | Measurable graphs

Evaluation based on
- Query execution / communication time: determines the degree of data fragmentation and colocation
- Balance of shard sizes
71
LESSONS LEARNED
Criterion | Sharding without Views | Sharding with Views
Exploration | Several days | A few hours
Methods | Using scripts to find divisions in the data | Querying the meta-data to find related attributes, analysing the histogram for sharding
Quality | Mediocre, distribution remains fairly poor | Almost perfect distribution of documents across all the shards
Processing | Fairly cheap and quick scripts | Expensive data processing
→ Overall, sharding with views is quicker, more intuitive and more effective
73
CONCLUSION
§ How to model structural, semantic & quantitative meta-data in an integrated manner?
– Data collection & View models: structural and quantitative meta-data
§ How to design a cloud service-oriented architecture that enables data curation while considering variety and variability?
– CURARE: cloud service-oriented architecture for storing & curating data collections
§ Can meta-data and a service-oriented data curation architecture support decision making regarding data exploration?
– Use case: choosing the best criterion for sharding data using Views
74
SHORT & MID TERM PERSPECTIVES
§ Compare releases & data collections
– visualisation
– query languages enhancing current ad-hoc and imperative proposals
§ Evaluate our data curation approach with user feedback
– explore newspaper collections & political campaigns
– investigate the urban data from the pole CARA
§ Extend the view model & CURARE
– discover semantic meta-data and relationships among attributes
– discover functional, temporal and causal dependencies
75
LONG TERM PERSPECTIVES
§ Compose big data curation services according to QoS criteria
– data properties and target applications
§ Data curation traceability
– distributed and collaborative data curation based on blockchain
§ Big data collections market
– curation guided by cost and business models
– QoS criteria such as privacy, economic cost, provenance, reputation, trust
76
CURARE: CURATING AND MANAGING BIG DATA COLLECTIONS ON THE CLOUD
Gavin Robert KEMP, LIRIS, [email protected]
77
PUBLICATIONS
§ Journals
1. Cloud big data application for transport; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous, C. Collet, P. López Amaya; International Journal of Agile Systems and Management, Inderscience, 9 (3), 2016
2. Big data collections and services for building intelligent transport applications; G. Kemp, P. López Amaya, C. Ferreira Da Silva, G. Vargas-Solar, P. Ghodous, C. Collet; International Journal of Electronic Business Management, Vol. 14 No. 1, pp. 1-10, 2016
3. CURARE: Maintaining and Managing Data Collections Using Views; G. Kemp, C. Ferreira Da Silva, G. Vargas-Solar, P. Ghodous; IEEE Transactions on Big Data (submitted)
§ Conferences
4. Towards Cloud big data services for intelligent transport systems; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous, C. Collet, P. Lopez; Concurrent Engineering, Jul. 2015, Delft, Netherlands
5. Aggregating and Managing Big rEaltime Data in the Cloud: Application to intelligent transport for Smart Cities; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous; Proceedings of the 1st International Conference on Vehicle Technology and Intelligent Transport Systems, May 2015, Lisbon, Portugal, pp. 107-112
§ Book Chapter
6. Service Oriented Big Data Management for Transport; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous, C. Collet; Smart Cities, Green Technologies, and Intelligent Transport Systems, series Communications in Computer and Information Science, Springer, 579, pp. 267-281, 2016
78
79
PARALLEL DBMS & PROCESSING ENVIRONMENTS FOR CURATING DATA
Curation phases: Harvesting | Preserving | Describing (extracting meta-data) | Exploring
ETL
Parallel data collection (Flink, Stream, Flume)
Parallel Data Processing Platforms: Spark (RDD, Tables/Graphs), Hadoop ecosystem tools (e.g., Pig)
Parallel Data Processing Platforms: Spark (descriptive statistics functions), Hadoop ecosystem tools (e.g., Hive)
NoSQL & NewSQL (parallel)
Structured data provision
Parallel Data Querying & Analytics: parallel RDBMS, Big Data analytics stacks (Asterix, BDAS), parallel analytics (Matlab, R)
80
EXPLORING TRANSPORT EVENTS IN LYON USING VIEWS
[Diagram: two views, View<Events> and View<Bicycles>. Each view covers releases R1 ... Rn; each release has a ReleaseView holding AttributeDescriptors with statistics for attributes a1 ... an.]
Views exploration assists decision making:
– Search data sets about traffic events smaller than 30 Mbytes reporting events during vacations
– Which releases about bicycle distribution have less than 20% missing values?
– Search releases reporting traffic events close to bicycle stations (kNN)
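The second question can be answered directly from quantitative meta-data, without scanning the releases themselves. The Python sketch below shows the idea under stated assumptions: `ReleaseSummary`, its fields and the sample figures are hypothetical and only stand in for what a release view might store.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReleaseSummary:
    """Hypothetical quantitative summary stored in a release view."""
    release_id: str
    size_mb: float
    documents: int
    missing_per_attribute: Dict[str, int]

    def missing_ratio(self) -> float:
        """Share of missing values over all attribute slots of the release."""
        slots = self.documents * len(self.missing_per_attribute)
        return sum(self.missing_per_attribute.values()) / slots if slots else 0.0


def releases_below(view: List[ReleaseSummary], max_missing: float) -> List[str]:
    """Return the releases whose missing-value ratio is under the threshold."""
    return [r.release_id for r in view if r.missing_ratio() < max_missing]


# Made-up summaries for a bicycle view.
bicycles_view = [
    ReleaseSummary("R1", 12.0, 100, {"station": 0, "available": 10}),
    ReleaseSummary("R2", 14.5, 100, {"station": 5, "available": 60}),
]
selected = releases_below(bicycles_view, 0.20)
```

The same summaries could serve the size filter of the first question by testing `size_mb` instead of the missing ratio.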
81
DATA CURATION ON THE LAKES
Approach | Principle | Pros | Cons
Meta-data: Constance [Hai et al 2016], [Stonebraker et al 2013]; data wrangling [Terrizzano et al 2015] | Extraction of semantic meta-data or mapping of items to concepts. Querying with SPARQL, regular expressions | Possibility to express declarative queries; difficult to automate completely | No statistics, iterative data querying for exploring data
Data content description: descriptive statistics, processing according to data types, schema extraction | Explore the data structure to extract the schema. Compute descriptive statistics functions for every element | Aggregated view of the content regardless of data types. Simple to visualize and scales to large volumes | Adapted for (semi)structured data, manual tagging for multimedia content
Curation: [CoreKG, Curry 2016, QoSMOS2018, Tacit knowledge management 2017] | API with methods for preserving data collections. Querying and data fusion operations | Exploit the Compass tool from MongoDB to provide a quantitative vision of data content | Data transformation from CSV to JSON. No semantic knowledge (terms, functional dependencies)
82
83
PROCESSING DATA COLLECTIONS
Analysis goals: Description, Prediction
Techniques:
– Clustering (Bayesian, hierarchical, k-means, CLARA, PAM)
– Classification (neural network, PLSDA, KNN, decision trees)
– Trend prediction
– Regression (PLSR, PCR)
– Association
– Modelling (LU, QR, PCA=SVD, PARAFAC)
Example questions:
– How many accidents are reported per day?
– What percentage of the available bicycles is used downtown?
– Which are the traffic bottleneck regions in the city?
– Is the number of car accidents related to seasons?
– How will the use of bicycles downtown evolve during the summers of the next 5 years?
– Which types of cars have the most accidents on high-speed roads?
– Will increasing the parking cost reduce car traffic in the city and increase the number of people using public transport?
84
PROCESSING DATA COLLECTIONS
Spatial analysis of dynamic movements of Vélo'v, Lyon's shared bicycle program [ECCS 2009]
Contribution: PCA and k-means for predicting the usage trend of Vélo'v in Lyon
Description, Clustering: PCA & k-means
PCA is an orthogonal linear transformation that maps the data to a new coordinate system such that
- the greatest variance by some projection of the data comes to lie on the first coordinate (first principal component),
- the second greatest variance on the second coordinate, and so on.
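A small worked example of this definition for two-dimensional data, in plain Python. It is an illustration rather than the method of the cited study: the function `pca_2d` is a hypothetical helper that uses the closed-form eigen-decomposition of the 2x2 sample covariance matrix, and the points are made up.

```python
import math
from typing import List, Tuple


def pca_2d(points: List[Tuple[float, float]]):
    """Return (variances, first principal direction) for 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    centre = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    variances = (centre + radius, centre - radius)
    # Angle of the eigenvector associated with the largest eigenvalue.
    theta = 0.5 * math.atan2(2 * b, a - c)
    direction = (math.cos(theta), math.sin(theta))
    return variances, direction


# Points lying exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
variances, direction = pca_2d(points)
```

For points on a straight line, all the variance falls on the first principal component, whose direction follows the line, and the variance on the second coordinate is zero.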
85
PROCESSING DATA COLLECTIONS
Inferring the Root Cause in Road Traffic Anomalies [IEEE Data Mining 2012]
Contribution: PCA for identifying traffic anomalies and thereby detecting problems in roads
Description, Modelling: Principal Component Analysis (PCA)
86
DATA COLLECTIONS STORAGE: NO/NEWSQL SYSTEMS
CAP properties
- Consistency (C): all clients always have the same view of the data
- Availability (A): each client can always read & write
- Partition tolerance (P): the system works well despite physical network partitions
Combinations: C-A, A-P, C-P
Data models: relational, key-value, column oriented, document oriented
Systems shown in the diagram:
- Dynamo, Voldemort, Tokyo Cabinet, KAI
- Cassandra, SimpleDB, CouchDB, Riak
- BigTable, HyperTable, HBase
- MongoDB, TerraStore, Scalaris
- BerkeleyDB, MemcacheDB, Redis
- RDBMSs: MySQL, Postgres, etc.
- Aster Data, GreenPlum, Vertica, Neo4j
Annotation: can be sharded
87
DATA COLLECTIONS STORAGE: NO/NEWSQL SYSTEMS
[Example document: the Grand Lyon traffic status JSON document, identified by "_id": "Criter11185353"]
Raw data collections: tabular (CSV, Excel), media (XML, JSON, BLOB), graph
Data models: relational, key-value, column oriented (tabular), document oriented
- How to transform data collections?
- Which is the best adapted model?
→ Polyglot persistence
Approaches dealing with transformation rules inspired by the relational case
88
DATA COLLECTIONS STORAGE: NO/NEWSQL SYSTEMS
Persistence
- Which part of the document must persist?
- Explicit vs. implicit persistence
- In memory / hard disk
Fragmentation/sharding & replication
- Vertical or horizontal fragmentation
- Strategies: range, hash, tagged
- Distribution & location
Availability & fault tolerance
- Replication & distribution
[Diagram labels: Memory/Cache, Raw data collections; example document: the Grand Lyon traffic status JSON document, identified by "_id": "Criter11185353"]