CURARE CURATIONETGESTIONDEGRANDES …vargas-solar.com/wp-content/uploads/2018/09/soutenance... · 2018-09-28 · 5 BIG DATA PROPERTIES-Volume(size) -Velocity(production rate)-Variety(data

1

CURARE: CURATION ET GESTION DE GRANDESCOLLECTIONS DE DONNÉES SUR LE NUAGE

DIRECTRICESPARISA GHODOUS, PR. U. LYON I, LIRIS

CATARINA FERREIRA DA SILVA, MCF. U. LYON I, LIRISGENOVEVA VARGAS-SOLAR, CR. CNRS, LIG-LAFMIA

Gavin Robert KEMP, LIRIS, [email protected]

CURARE: curating and managing big data collections on the cloud

2

ROADMAP§ Context and Problem Statement : Exploring and Managing Big Data Collections§ State of the Art

– Data Lakes– Data Curation: Workflow and Approaches– Curation Architectures and Platforms

§ CURARE: Service Oriented Architecture for Curating Data Collections– Approach for Curating Data Collections– Data Collection and View Model– Services for Curating Data Collections

§ Implementation and Experimentation§ Conclusion and Perspectives

28/09/2018

3

CONTEXT & MOTIVATION

28/09/2018

“Data is everything and everything is data”, PythianTurning reality phenomena into data thanks to the Big Data trend

4

BIG DATA DEFINITION

28/09/2018

§Data collections with characteristics difficult to process on single machines or traditional databases

§A new generation of tools, methods and technologies to collect, process and analyse massive data collections

à Tools imposing the use of parallel processing and distributed storage

Context & Problem Statement | State of the Art | CURARE | Implementation & Experimentation | Conclusion & Perspectives

5

BIG DATA PROPERTIES- Volume (size) - Velocity (production rate)- Variety (data types & format)

- Variability (inconsistencies by constant meaning changes)

- Veracity (truth and consistency)

- Value (how much information)

3V

4V

5V

...

10V V’s models [Jagadish 2014]

28/09/2018

“Big Data can really be very small and not all large datasets are big!”- Mike 2.0 [Hillard 2012]


6

BIG DATA LIFE CYCLE

Data Processing Data analytics [Feldman et al. 2013], Data Aggregation [Jayram et al. 2007]

Data Collections StorageNoSQL, NewSQL (Hive), Data sharding(MongoDB, CouchDB)

Harvesting &Cleaning

Preserving

Processing &Analysis

Exploiting

28/09/2018


7

CLOUD COMPUTING AND BIG DATA

Ready to use environments forStoring Big Data& Running greedy processing tasks

Software as a Service (SaaS)Full functional software

Infrastructure as a Service (IaaS)CPU, RAM, Disk

Platform as a Service (PaaS)Database systems, frameworks

28/09/2018


8

Different sizes, evolution in structure, completeness, production conditions & content, access policies modification …

What information is available in these collections?

Is the data clean? If not what has to be done to make it clean?

Are there relations between data collections which could be exploited for better prediction?

What are the update rates and in what way does this affect the collection?

THE RIGHT DATA FOR THE RIGHT ANALYTICS

28/09/2018


9

DATA CURATION

Data management through its lifecycle of interest & usefulness

§ Enable data discovery & retrieval§ Maintain data quality§ Add value§ Provide for re-use over time

28/09/2018


10

Define a data meta-data extraction modelfor supporting

decision making related to its exploitation

PROBLEM STATEMENT

28/09/2018


11

RESEARCH QUESTIONS§ Model in an integrated manner structural, semantic & quantitative meta-data?

§ Design a cloud service oriented architecture for enabling data curationconsidering variety and variability?

§ Can meta-data and a service-oriented data curation architecture support decision-making regarding data exploration?

28/09/2018


12

ROADMAP§ Context and Problem Statement: Exploring and Managing Big Data Collections

§ State of the Art – Data Lakes– Data Curation: Workflow and Approaches– Curation Architectures and Platforms

§ CURARE: Service Oriented Architecture for Curating Data Collections§ Implementation and Experimentation§ Conclusion and Perspectives

28/09/2018

13

DATA LAKE

Centralized repository containing virtually inexhaustible amounts of raw data to be analysed

CRM Sensor logs

Data wrangling [Terrizzano et al. 2015]

Constance [Hai et al. 2016]

28/09/2018


14

DATA SWAMP

Repositories grow ever bigger and complex to the point that a lake becomes a swamp

CRM Sensor logs

28/09/2018


15

DATA CURATION WORKFLOW

Preserving Describing

Step 2: Extracting meta-data

Grooming

Provisioning

Step 3: Exploitation

Selecting

Vetting

Collecting

Step 1: Harvesting

Data wrangling [Terrizzano et al. 2015]28/09/2018


16

EXTRACTING META-DATA

28/09/2018

Preserving DescribingExtracting meta-data

ExploringHarvesting

Level INo meta-data Partial meta-data Complete meta-data

Level 2 Level 3

Sensing &collectingraw data

Size, frequencyfreshness, type

Quantitative &qualitative data+ semantics

[Stonebraker et al. 2015]

information


17

Approach Principle Pros ConsCuration at Source [Curry et al. 2010]

Simple tagging & informationextraction directly from the sensor

Pre-processes data according to samples

No awareness of the whole data collection

Master Data Management[Weatherspoon et al. 2013]

Provide a standard vocabulary at the company scale

Standardizes the language used in data collections

Does not improve data structures

Semantic linking Constance [Hai et al. 2016]

Identify attributes referring to similar topics in different data sets

Explores data collections

Does not improve data structures

Data set clusteringGoods [Halevy 2016]

Discover similar data sets using clustering

Creates an synthesizedrepresentation of several data collections

Difficult to define a similarity criterion to cluster data with low quality (missing & null values, types)

Crowdsourcing and Collaboration spaces [Doan et al. 2005]

Communities produce, maintain & tag data (crowdsourcing)

Improves the qualityof raw data

Manual & a lot of human resources

APPROACHES FOR EXTRACTING META-DATA

28/09/2018

Level 3: manual and collaborative

Level I: simple tags as data is harvested

Level 2: pivot

vocabulary & semantics


18

Technique Idea Pros ConsBig Data Stacks (Hadoop, Berkeley, Asterix, Spark)

Provide data processingtools using parallel programming modelsSupport query languages& visualisation tools

Versatile, general purpose Low level of adaptability Centred on analysis not on data collections exploration

Service based architectures [CoreKG Beheshti 2018, Lord2004]

Provide independentservices implementing the steps of the data curation workflow

Services enable to adapt the tools to the data collections & good maintainability

Services assembly is up to the programmer

Data curation network [Alfred P. Sloan Foundation 2017]

Provide a Web portal that integrates tools from libraries willing to curate data collections

General view of tools and data collectionsRuns on the WebCollaborative curation

No integration of toolsThe curation process is not explicitNo automatic control on who does what

DATA CURATION PLATFORMS

28/09/2018


19

§ Model in an integrated manner structural, semantic & quantitative meta-data?

– No integrated data curation model • to explore data collections in search • for well adapted analytics input

§ Design a cloud service oriented architecture for enabling data curationconsidering variety and variability?

– No elastic architecture – No target solution for data curation

LIMITATIONS OF THE STATE OF THE ART (1/2)

28/09/2018


20

Can meta-data and a service-oriented data curation architecture support decision-making regarding data exploration?

– Not enough quantitative meta-data maintenance to make technical decisions

– Domain dependent curation • no explicit data curation workflow • exploration based on visualization

LIMITATIONS OF THE STATE OF THE ART (2/2)

28/09/2018


21

ROADMAP§ Context and Problem Statement: Exploring and Managing Big Data Collections§ State of the Art

§ CURARE: Service Oriented Architecture for Curating Data Collections–Approach for Curating Data Collections–Data Collection and View Model–Services for Curating Data Collections


28/09/2018

22

TAILORING BIG DATA CURATION

Propose a data curation, service based system providing data analysts

– Abstract information about data collections content with several associated releases

– Tools for curating data collections– Strategies for managing metadata related to data collections

28/09/2018


23

CONTRIBUTIONS AND RESULTS

28/09/2018

Prototype MongoDB & OpenStack Experiments on Grand Lyon and Twitter urban transport data collections

CURARE: Cloud Service Oriented Architecture for storing & curating data collections

Data collection & View modelData curation formalisation


24

DATA CURATION APPROACH

Extracting as much meta data as possible to§ Make technical decisions§ Choose data exploitation & analysis strategies

DATA COLLECTION MODEL

VIEW MODEL

Structural meta-data

Quantitative meta-data

Extraction

Data Collections

Harvesting

Release & View: curated data collection

Views explorationassist decision making

28/09/2018


25


28/09/2018


VIEW MODEL



Extraction

Data Collections

Harvesting



Data harvesting services

Data processing services

(meta-data extraction)

Storage services Curated data collections


26


28/09/2018


VIEW MODEL



Extraction

Data Collections

Harvesting



Data harvesting services

Data processing services

(meta-data extraction)

Storage services Curated data collections

CURARE: Cloud Service Oriented Architecture for Curating Data Collections


27




28/09/2018

28

DATA COLLECTION: STRUCTURAL META DATA

28/09/2018

Data producedat t+1Size:2


Id::doc2temp:21humidi:61

Id::doc2N°car:5Red:true

Id::doc1Bus:2Loc:univ

Retrieved frommetadata & the

provider

Manual

Extracted

DataCollection

- id: URL- name: String- provider: String- licence: [public, restricted]- size: Num- author: String- description: String

Release- id: URL- release: Num- publicationDate : Date- size: Num

11..n

DataItem

- id : URL- name: String- attributs: List()

1

1..n

releases

items


29

{"geometry": {"type": "Point", "coordinates": [

4.821773, 45.7513

]}, "full_location": "Autoroute du Soleil,

69005 Lyon", "_id": "Criter11185353", "properties": {"confidentiality": "noRestriction", "probability": "certain", "mobility": "", "creationtime": "2016-06-07 19:40:00", "publiceventtype": "", "networkmanagementtype": "", "observationtime": "2016-06-07

19:40:00", "last_update": "2016-06-07 19:43:30", "numberoflanesrestricted": "0", "effectonroadlayout": "", "creator": "CRITER", "id": "Criter11185353",

"firstsupplierversiontime": "2016-06-07 19:40:00",

"version": "1", "linkname": "", "type": "VehicleObstruction", "status": "active", "direction": "bothWays", "locationtype": "nonLinkedPoint", "disturbanceactivitytype": "", "last_update_fme": "2016-06-07

19:44:29", "endtime": "", "creationreference": "", "informationstatus": "real", "townname": "Voie Rapide Urbaine de

Lyon", "publiccomment": "Bouchon, km 455|Voie

Rapide Urbaine de Lyon", "roadmaintenancetype": "", "versiontime": "2016-06-07 19:43:26", "starttime": "2016-06-07 19:40:00", "gid": "39258", "abnormaltraffictype": ""}

}


4.821773, 45.7513









}


4.821773, 45.7513









}

JSON Documents

GeographiccoordinatesRoad logicname

Observationdata

Traffic eventdescription

Item in R1


4.821773, 45.7513









}

Data collection

Release/day

R1

Rn

...

ReleasesTHE GRAND LYON TRAFFIC DATA COLLECTION

28/09/2018

Items of a release

Geographiccoordinates

Road logicname

Observationdata


Item in R1


30

DATA RELEASE DATA ITEM: TRAFFIC EVENT

28/09/2018


4.821773, 45.7513









}

Geographiccoordinates

Road logicname

Observationdata


DataItem


JSON documentStructure description


31

DATA COLLECTION RELEASE: TRAFFIC EVENTS IN A PERIOD OF TIME

28/09/2018


4.821773, 45.7513









}


4.821773, 45.7513









}


4.821773, 45.7513









}


4.821773, 45.7513









}

DataItem


Release- id: URL- release: Num- publicationDate: Date- size: Num

1

1..nitems

1items

"_id" : "tweets_736241_collection_24_17021","id" : 17021,"name" : "tweets_736241_collection_24_17021","publicationDate" : "Mon Aug 08 2016 00:00:41 GMT+0000 (UTC)","size" : 4386,"url" : "localhosttweets_736241_collection_2417021""dataItems": [ List(dataItem) ]

Release R1


32

DATA COLLECTION: GRAND LYON TRAFFIC EVENTS

28/09/2018

DataItem


Release- id: URL- release: Num- publicationDate: Date- size: Num

11..n

items

DataCollection

- id:URL- name: String- provider: String- licence: [public, restricted]- size: Num- author : String- description: String

1

1..n

releases"_id" : "CollectionInfo","author" : “twitter","description" : "this is a Collection","id" : ”...","licence" : "public","name" : ”tweets","provider" : ”twitter","releases" : [ List(Release)],"size" : 736241

"_id" : "tweets_736241_collection_24_17021","id" : 17021,"name" : "tweets_736241_collection_24_17021","publicationDate" : "Mon Aug 08 2016 00:00:41 GMT+0000 (UTC)","size" : 4386,"url" : "localhosttweets_736241_collection_2417021""dataItems": [ List(dataItem) ]

Release<instance>

DataCollection<instance>

11..nreleases

1..n1

items

DataItem<instance>

R1

Rn


33

VIEW MODELData structure providing an aggregated perspective of a data collection content and its associated releases§ Describes the content of a data collection with series of tuples <attribute, type> § For every attribute the view describes

– basic statistical data associated to the values of every attribute: min, max, mean, median, histograms, standard deviations

– null and missing values representation for every couple <attribute, type> for every attribute in a document

28/09/2018


34

VIEW: QUANTITATIVE META DATA

28/09/2018

Extracted

Computed

View- id: URL- name: String - provider: String - author: String - description: String - code: Code - releaseSelectionRules:

Function- Source: String - default: version.id

AttributeDescriptor

- id: URL - name: String - type: String - valueDistribution: Hist- nullValue: num- absentValue: num- minValue, maxValue: type - mean, median, mode: type - std: type - count: num

0..n

1ReleaseView

- id: URL- version: num- publicationDate: date - size: num

0..n

1

attributDescriptors

releaseViews

extension

Data Collection

DataItem

ReleaseData producedat t+1Size:2


Id::doc2temp:21humidi:61

Id::doc2N°car:5Red:true

Id::doc1Bus:2Loc:univ

extension

extension


35

TRAFFIC EVENTS ATTRIBUTES STATISTICS: ATTRIBUTE DESCRIPTOR

28/09/2018

Release R1

For each attribute across all items in a release Ri

Quantitative meta-data: statistical measures of each attribute

AttributDescriptor{"_id" : "tweets_736241_viewMapReduce_24_17021.entities.hashtags.0.text.number","value" : {

"data": [9,9,9,9,9,9],"median": 9,“mean”: 9"count": 6,"valueDistribution": {"9" : 6},"type": "number","mode": {"value": "9","count": 6},"missing": 4380,"nulls": null }}

at12, v12...

DataItem<instance>

at11, v11...

DataItem<instance>

at1n, v1n...

DataItem<instance>


36

TRAFFIC EVENTS STATISTICS: RELEASE VIEW

28/09/2018

Release Ri

Quantitative meta-data: aggregated measures of attributes statisticsin all its data items

ReleaseView"_id": "tweets_736241_viewMapReduce_24_17021","version": "17021","publicationDate": ISODate("2016-08-08T00:00:00Z"),"size": 765,"attributDescriptors": [List(AttributDescriptor)]

AttributDescriptor

Statistiques an

Statistiques a1

0..n

1

attributDescriptors

Aggregate


37

GRAND LYON TRAFFIC EVENT STATISTICS: VIEW

28/09/2018

Data collection

View..."code" : {"model-framework": "MongoMapReduceFinalize",

"mapper": Mapper()"reducer": Reducer()"finalize": Finalizer() },

"rules": "function rule(version){\r\n\treturn version\r\n}"

Quantitative meta-data: aggregated measures of releases statistics

Aggregate0..n1

Releases R1ReleaseView

AttributDescriptor

Statistiques an

Statistiques a1

0..n

1

attributDescriptors

Rn


38




28/09/2018

39

SERVICES FOR CURATING DATA COLLECTIONS

28/09/2018

Integrated Data Curation Platform

Data Cleaning & Processing Services

PIG HADOOP

Data Harvesting Services

REST FLUME

Data storage Services

MongoDB Neo4J CouchDBTagged Data Collections

Decision Making Support Services

Data Analytics Services

Towards Cloud big data services for intelligent transport systems; G. Kemp, G. Vargas-Solar, C. Ferreira da Silva, P. Ghodous, C. Collet, P.Lopez. Concurrent Engineering 2015, Delft, Netherlands


40

CLOUD DEPLOYMENT ENVIRONMENT

28/09/2018

Image modified from Windows data science virtual machinehttps://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/


41

CURARE DATA

HARVESTINGSERVICE

SaaS

CURARE CLOUD SERVICES

28/09/2018

Views(quantitative &

structural description)

CURARE DATA

STORAGESERVICES

PaaS

Meta-data (dataCollection, release)

CURAREDATA

CLEANINGSERVICE

Decision making services

Data analytics services


Tagged Data Collections

Integrated Data Curation PlatformCURARE

DATAEXPLORATION

SERVICES

42

DATA HARVESTING: BATCH

28/09/2018

CURARE DATA

HARVESTINGSERVICE

Data source(on demand producer)

connect()

get()

close()

Buffer & caching service

Service Instance

(continuous)

Method call instance

iteratoriterator

iterator

PaaS


ExploringHarvesting


43

Service Instance

(continuous)

DATA HARVESTING: STREAM

28/09/2018

subscribe()

receive()

stop()

Message queue service

push/pull

Method call instanceCURARE

DATAHARVESTING

SERVICE

PaaS


ExploringHarvesting


44

CURAREDATA

CLEANINGSERVICE

CURAREDISTRIBUTED

DATASTORAGESERVICES

post()

PaaS

CURARE DATA

HARVESTINGSERVICE

post()

DATA CLEANING & STORAGE

28/09/2018 Preserving DescribingExtracting meta-data

ExploringHarvesting


45

DATA PROCESSING & EXPLORATION

28/09/2018

CURARE DISTRIBUTED

DATASTORAGESERVICES

Raw data

Curated data

Analysed data

CURAREDATA

EXPLORATIONSERVICES

Data analytics services

Decision making services

get(analysedData)

sendInstructions(makecollectionViews)

ViewdataCollections

sendInstructions(analyseData)

1

23

4


ExploringHarvesting


46

DATA VARIETY & VARIABILITY

28/09/2018

§ Data Collection and View Models– Data collections track structural meta-data over time through releases– Views track quantitative meta-data over time through release views– Attribute descriptor tracks every existing value– Release Views track the evolution over time of an attribute’s values and their

distribution§ Data Curation Service based Architecture

– Service swapping and adding thanks to SOA for enabling data processing tools for exploring with new data formats & meaning


47

ROADMAP§ Context and Problem Statement : Exploring and Managing Big Data Collections§ State of the Art§ CURARE: Service Oriented Architecture for Curating Data Collections

§ Implementation and Experimentation–Current Prototype– Experiment 1: Evaluating the Cost of Computing Views– Experiment 2: Data Sharding–Comparison and Lessons Learned

§ Conclusion and Perspectives

28/09/2018

48

CURRENT PROTOTYPE IMPLEMENTATION

§ Deployed on Openstack1

§ Distributed Data and Access Service: MongoDB– 4 router, 3 config servers, 3 shards with 3 replica sets

§ Scripts and statistical operators– process data collections – generate objects-instantiating our model

§ Machines used– 4 VCPU 2.5 GHz, 8 Go RAM, 80 Go Disk

28/09/2018


1 https://www.openstack.org

49

EXPERIMENTS SETTING§ Profile the cost of extracting meta-data

– Computing statistical measures to compute quantitative meta-data– Measure execution time on centralized & parallel execution environments

§ Use case: test the interest of using views for a data sharding decision making process

§ Input data collections– Data set Grand Lyon with 2095 documents, 86 releases, 2.5Mb– Data set Twitter with 736242 documents, 125 releases, 2.5 Gb– Samples ranging from 4000 to 20000 documents

28/09/2018


50

ROADMAP§ Context and Problem Statement : Exploring and Managing Big Data Collections§ State of the Art§ CURARE: Service Oriented Architecture for Curating Data Collections

§ Implementation and Experimentation–Current Prototype–Experiment 1: Evaluating the Cost of Computing Views– Experiment 2: Data Sharding–Comparison and Lessons Learned


28/09/2018

51

How much does it cost to extract meta-data & compute Views?

28/09/2018


52

EXPERIMENT 1: CREATING DATA COLLECTIONS & VIEWS (1/2)

28/09/2018

Context and Problem Statement | State of the Art | CURARE | Implementation & Experimentation | Conclusion & Perspectives

Varies with the size of the collection and the number of attributesBy far the biggest time consumer is the attribute collection

contributing for 90 to 95% of the time

– 2 min for the Grand Lyon event data set of 2000 documents – 36 hours for the Twitter data set of 700000 documents

Grand Lyon

Twitter

53


28/09/2018

Context and Problem Statement | State of the Art | CURARE | Implementation & Experimentation | Conclusion & Perspectives

Speed gain relative to the size of the releases

– 6-7 sec for the Grand Lyon event data set of 2000 documents – 50 min for the Twitter data set of 700000 documents

Grand Lyon

Twitter

54

ROADMAP

§ Context & Problem Statement: Exploring and Managing Big Data Collections§ State of the Art§ CURARE: Service Oriented Architecture for Curating Data Collections

§ Implementation and Experimentation–Current Prototype– Experiment 1: Evaluating the Cost of Computing Views–Experiment 2: Data Sharding–Comparison and Lessons Learned


28/09/2018

55

SHARDING DATA ACROSS DIFFERENT STORES

28/09/2018

Traffic status{

"geometry": {"type": "Point", "coordinates": [

4.821773, 45.7513









}

Balanced and smooth fragmentation(size, location, availability)

Shard 0

chunk chunk

Key range0 ... 20

Optimum distribution across shards providing storage spaces (chunks)

Shard 1

chunk chunk

Key range21 ... 40

Shard 2

chunk chunk

Key range41 ... 60


56

EXPERIMENT 2: SHARDING WITHOUT VIEWSChoose shardingkey candidates

Associate key candidates with

shardingstrategies

Shard collection with strategies

Evaluate data distribution across

fragments

Data scientistIdentify attributes

manually

Choose ShardingStrategy

If poor queriesevaluation time

Explore the data collectionmanually

Extract a sample

Data scientist

Raw data collection

28/09/2018


57

EXPERIMENT 2: EVALUATION OF SHARDING KEYS DECISIONS WITHOUT VIEWS

Documents number & data varies § greatly for ranged location§ lesser extent for hashed

location

28/09/2018


58

EXPERIMENT 2: DECISION MAKING QUESTIONS§ Which attribute can be used to shard the collection?§ Which is the values distribution of each attribute?§ Is there critical data with particular availability requirements?§ Should some fragments be collocated?

28/09/2018

This information must be extracted, computed


59

EXPERIMENT 2: SHARDING USING VIEWS

28/09/2018

Raw data collection

Getmeta-data

Extract meta-data

Compute meta-data

Create Data Collection Meta-data

Computequantitative measures

Discover domain value

properties (null, absent values)

Create Views

Identify attributes Choose shardingkeys candidates


sharding strategies



fragments

Data collection

Release

Item

View

ReleaseView

AttributeDescriptor

Data scientist

Data provider


60

Identify the best candidate key attributes

28/09/2018


61

EXPERIMENT 2: IDENTIFYING ATTRIBUTES VALUES DISTRIBUTION

28/09/2018

nb values missing max count min counttweets.quoted_status.user.location.string 8436 3008392 16820 4tweets.user.location.string 11041 571962 269506 4tweets.user.time_zone.string 149 860539 870263 4tweets.place.bounding_box.coordinates.0.0.0.number 3 0 1983882 19950

tweets.place.bounding_box.coordinates.0.0.1.number 3 0 1968043 35784







tweets.quoted_status.user.geo_enabled.string 2 2938908 155678 95665

tweets.user.geo_enabled.string 1 0 3190337 3190337

ReleaseView

Too many missing values

Good distribution of valuesfew missing values àcandidate sharding key

Lead to unbalanced chunks

Attributes related to location


62

With which sharding strategies (range, hash) can candidate keys cope?

28/09/2018


63

EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY TIME ZONE (1/2)

28/09/2018

0

100000

200000

300000

400000

500000

600000

700000

800000

900000

1000000

Abu D

habi

Amer

ica>L

os_Ang

eles

Asia>J

erus

alem

Auckla

nd

Belgra

de

Brisba

neCST

Centra

l Am

erica

Dhaka

Europ

e>Am

ster

dam

Europ

e>Mad

rid

Geo

rgetow

n

Hong K

ong

Jaka

rta

Kuala Lum

pur

Ljub

ljana

Mex

ico C

ity

Mos

cow

New D

elhi

Paris

Riyadh

Singap

ore

Taipe

iUTC

Wellin

gton

Paris

US & Canada

LjubljanaGreenland

Athens

Amsterdam

Missing

Attribute Descriptor


64

0

100000

200000

300000

400000

500000

600000

700000

800000

900000

1000000

Abu D

habi

Amer

ica>L

os_Ang

eles

Asia>J

erus

alem

Auckla

nd

Belgra

de

Brisba

neCST

Centra

l Am

erica

Dhaka

Europ

e>Am

ster

dam

Europ

e>Mad

rid

Geo

rgetow

n

Hong K

ong

Jaka

rta

Kuala Lum

pur

Ljub

ljana

Mex

ico C

ity

Mos

cow

New D

elhi

Paris

Riyadh

Singap

ore

Taipe

iUTC

Wellin

gton

28/09/2018

EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY TIME ZONE (2/2)

Paris

US & Canada

LjubljanaGreenland

Athens

Amsterdam

Missing

Attribute Descriptor Shard1 Shard2 Shard3


65

EXPERIMENT 2: ANALYSING CANDIDATE SHARDING KEY LOCATION (1/3)

28/09/2018

0

100000

200000

300000

400000

500000

600000

700000

AM

Arcach

on, F

ranceBarne

t

Bourg Pale

tte

Calvi>Mon

za

ChÃªne-B

ourg, S

uisse

Dans v

otre e

ntrepri

se

En casa

de @Ana

Nehrhoff

France

> Breizh

Gilbert

, AZ

Hogwart

Jedd

ah - T

aif

La rÃ

©alitÃ©

Lisieux

, Norm

andie

Lyon

- Ban

gui

Lyon

>saint

-etien

ne

Metz>>

Nancy

MÃ¢con-D

ijon

Nord, N

ord-P

as-de

-Cala

is

PVRIS

Paris, F

rance >

Pak

istan

Pom Pom G

alli - Tou

rs

Rillieux

-la-Pape

(69)

Saint-R

apha

Ã«l, Prove

nce-A

lpes-C

Ã´te d'

Azur

Somewhe

re in

a crow

d

Ta gueu

le

TrÃ©molat, F

rance

Villeuba

nneZion

clerm

ont-fer

rand > sq

uamish he

ll

listen

ing to

Dan

gerous W

oman

p-a-p>

paris

studle

y uk

Ã©tusson



66


28/09/2018

0

500000

1000000

1500000

2000000

2500000

3000000

3500000

ANZIO(ro

ma)

Arena

s de S

an P

edro

Bay Area

Bow, Engla

nd

Canne

s, Fra

nce

Clichy

, Ile-de

-Fran

ce

DibÃ«r

, â€œ

F;Y;R

; of …

Estadio

Vicente C

alderon

France

, Lille

Greno

ble - L

yon IN

Kings L

andin

g, W

esteros

Le P

ortel

Lond

on;

Lyon

Sou

th!

Manch

ester

>Burnley

Montig

ny-lÃ

¨s-Metz

, Fra

nce

Netherla

nds, L

imbu

rg, V

enra

y

Orage

en Mars

Paris >

Tours

Pembs >

Manch

ester

> M;U

;F;C

Rebord

osa

Saint Q

uenti

n

Sheffiel

d (so

metimes

)

Sur Mars

; (Y' fa

is ch

aud ic

i)

Toulou

se, F

rancia

Versaille

s, Ile

-de-F

ranc

e

Yango

n(Mya

nmar

)â†’T

okyo

â†’P

aris

chap

uller P

hysic

ist*,

CMS-CERN gil

l

leo w

on | el

la

on th

e road

;;;

stars

hollow

Ã©cully

location

France

Lyon

missingAttribute Descriptor


67


0

500000

1000000

1500000

2000000

2500000

3000000

3500000

A l'Oue

st !

Annonay

, Lyo

n

Banbrid

ge, N

orthern

Irelan

d

Booba>K

endri

ck; U

sain B

olt;

CHAMONIX - G

ENEVA

Chatel -

Frenc

h>Swiss

Alps

Croix r

ousse

Dublin

| May

o

FamÃlia

Bala

deira

âˆž

Frankfu

rt am M

ain, H

essen

Guadala

jara, Ja

lisco

Ici su

r Twitte

r

Kilkis

Ind; A

rea,

Greece

Le D

oulieu,

Franc

e

Lond

on - K

hoba

r - R

iyadh

Lyon

- Ville

urbann

eMED

Mexim

ieux, F

ranc

e

MÃ©dine, R

oyau

me d'Ara

bie S

aoud

ite

Noisy-l

e-Sec,

France

PARIS >

Ù…Ù‡Ø

¯ÙŠ

Paris | F

R

Pilton,

Englan

d

Renne

s > B

rest,

Fran

ce

Saint-A

madou,

Midi-PyrÃ

©nÃ©es

Sevilla,

Esp

aÃ±a

Strasb

ourg,

Fran

ce; î”�Tog

o

Valence

, RhÃ

´ne-A

lpes

Wes

t Ran

ch

avec

une f

leur qu

i est

moche

demi x

@cv

rpen

trz

in ba

iley's

arms

lyon4Ã

¨

pauli

ne - ne

lley -

cam

the la

nd of

tea,

rain

& queu

es

Ð‘ÐµÐ

¾Ð³Ñ€Ð°Ð́

> Lon

don >

Cae

rdydd

location

France

Lyon

missing


28/09/2018

Shard1 Shard2 Shard3


68

ROADMAP

§ Context and Problem statement: Exploring and Managing Big Data Collections§ State of the Art§ CURARE: Service Oriented Architecture for Curating Data Collections

§ Implementation and Experimentation–Current Prototype– Experiment 1: Evaluating the Cost of Computing Views– Experiment 2: Data Sharding–Comparison and Lessons Learned


28/09/2018

69

Is decision making for sharding better done with or without views?

28/09/2018


70

COMPARISON

28/09/2018

Raw data collection

Getmeta-data

Extract meta-data

Compute meta-data

Create Data Collection Meta-data

Computequantitative

measures

Discover domain value properties (null, absent values)

Create Views

Identify attributes distribution

Choose shardingkeys candidates


sharding strategies



fragments

Data collection

Release

Item

View

ReleaseView

AttributeDescriptor

Data scientist

Data providerChoose shardingkey candidates


sharding strategies



fragments

Data scientistIdentify attributes

manually

Choose ShardingStrategy

If poor queriesevaluation time

Explore the data collectionmanually

Extract a sample

Data scientist

Raw data collection

Without views With views

Iterative process

- Guess sharding keys according to the data scientist knowledge

- Test – error approach based on samples

One shot process

- Full structural & quantitative knowledge of data collection

- Decision based on the whole data collection

Without Views With ViewsQuery attributes No YesExplorable attributes Sample AllDecision support Educated guess Measurable graphs

Evaluation based on- Queries execution / communication time: determine the degree of data

fragmentation and colocation- Balance of shard sizes


71

LESSONS LEARNED

Sharding without Views Sharding with Views

Exploration Several days Few hours

Methods Using scripts to find division in the data

Querying the meta data to find related attributes, analysing the histogram for sharding

Quality Mediocre, distribution remains fairly poor

Almost perfect distribution of document across all the shards

Processing Fairly cheap and quick scripts

Expensive data processing

à Sharding with views is all quicker, more intuitive and effective

28/09/2018


72

ROADMAP

§ Context and Problem Statement: Exploring and Managing Big Data Collections§ State of the Art§ CURARE: Service Oriented Architecture for Curating Data Collections§ Implementation and Experimentation§ Conclusion and Perspectives

28/09/2018

73

CONCLUSION

28/09/2018

§ Model in an integrated manner structural, semantic & quantitative meta-data?

§ Data collection & View models: structural and quantitative meta-data

§ Design a cloud service oriented architecture for enabling data curation consideringvariety and variability?

§ CURARE: Cloud service oriented architecture for storing & curating datacollections

§ Can meta-data and a service-oriented data curation architecture support decision-making regarding data exploration?

§ Use case: choosing the best criterion for sharding data using Views


74

SHORT & MID TERM PERSPECTIVES

§ Compare releases & data collections– visualisation– query languages enhancing current ad-hoc and imperative proposals

§ Evaluate our data curation approach with users feedback– explore newspapers collections & political campaigns– investigate the urban data from the pole CARA

§ Extend the view model & CURARE– discover semantic metadata and relationships among attributes– discover functional, temporal and causal dependencies

28/09/2018


75

LONG TERM PERSPECTIVES

§ Compose big data curation services according to QoS criteria– data properties and target applications

§ Data curation traceability– distributed and collaborative data curation based on block chain

§ Big data collections market– curation guided by cost and business models– QoS criteria like privacy, economic cost, provenance, reputation, trust

28/09/2018


7628/09/2018

CURARE: CURATING AND MANAGING BIGDATA COLLECTIONS ON THE CLOUD

Gavin Robert KEMP, LIRIS, [email protected]

77

PUBLICATIONS§ Journals

1. Cloud big data application for transport; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous, C. Collet ,P. López Amaya; International Journal of Agile Systems and Management, Inderscience, 9 (3), 2016

2. Big data collections and services for building intelligent transport applications; G. Kemp, P. López Amaya, C.Ferreira Da Silva, G. Vargas-Solar, P. Ghodous, C. Collet; International Journal of Electronic BusinessManagement, Vol. 14 No. 1, pp. 1-10, 2016

3. CURARE: Maintaining and Managing Data Col-lections Using Views. IEEE Transaction on Big Data; GavinKemp, Catarina Ferreira Da Silva, Genoveva Vargas Solar, Parisa Ghodous (submitted)

§ Conferences4. Towards Cloud big data services for intelligent transport systems; G. Kemp, G. Vargas-Solar, C. Ferreira Da

Silva, P. Ghodous, C. Collet, P. Lopez. Concurrent Engineering, Jul. 2015, Delft, Netherlands5. Aggregating and Managing Big rEaltime Data in the Cloud : Application to intelligent transport for Smart

Cities; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P. Ghodous; Proceedings of the 1st InternationalConference on Vehicle Technology and Intelligent Transport Systems, May 2015, Lisbon, Portugal. pp.107-112

§ Book Chapter6. Service Oriented Big Data Management for Transport; G. Kemp, G. Vargas-Solar, C. Ferreira Da Silva, P.

Ghodous, C. Collet; Smart Cities, Green Technologies, and Intelligent Transport Systems / series Communicationsin Computer and Information Science, Springer, 579, pp. 267-281, 2016

28/09/2018

7828/09/2018

79

PARALLEL DBMS & PROCESSING ENVIRONMENTS FORCURATING DATA

28/09/2018


ExploringHarvesting

ETL

Parallel Data Processing Platforms

Spark (RDD – Tables/Graphs)Hadoop ecosystem tools (e.g., Pig)

Parallel Data Processing Platforms

NoSQL & NewSQL(Parallel)

ParallelData Querying

& Analytics

Structured Data provision

Parallel data collection

(Flink, Stream, Flume)

Spark (descriptive statistics functions)Hadoop ecosystem tools (e.g., Hive)

Parallel RDBMS, Big Data Analytics Stacks (Asterix, BDAS)Parallel analytics (Matlab, R)

80

EXPLORING TRANSPORT EVENTS IN LYON USING VIEWS

28/09/2018

View<Events>

View<Bycicles>

ReleasesR1 ReleaseView

AttributDescriptor

Statistiques an

Statistiques a1

0..n

1

attributDescriptors

Rn

0..n1


Search datasets about traffic events smaller < 30 Mbytes reporting eventsduring vacations

Which releases about bicycles distribution have less than 20% of missing values

ReleasesR1 ReleaseView

AttributDescriptor

Statistiques an

Statistiques a1

0..n

1

attributDescriptors

Rn

0..n1

Search releases reporting traffic eventsclose to bicycle stations

s %

knn

81

DATA CURATION ON THE LAKES

28/09/2018

Approach Principle Pros ConsMeta data : Constance [Hai et al 2016], [Stonebraker et al 2013] Data wrangling [Terrizzano et al 2015]

Extraction of semantic meta-data or mapping items with conceptsQuery using SPARQL, regular expression

Possibility to express declarative queries; difficult to automate completely

No statistics, iterative data querying for exploring data

Data content description: descriptive statistics, processing according to data types, schema extraction

Explore data structure for extracting the schema

Compute descriptive statistics functions for every element

Aggregated view of the content despite data types. Simple to visualize and scale if important volume.

Adapted for (semi)structured data, manual tagging for multimedia content

Curation: [CoreKG, Curry 2016, QoSMOS2018, Tacit knowledge management 2017]

API with methods for preserving data collections. Querying and data fusion operations

Exploit Compass tool from MongoDB for providing a quantitative vision of data content

Data transformation from CSV to JSON.No semantic knowledge (terms, functional dependencies)

8228/09/2018

83

PROCESSING DATA COLLECTIONS

28/09/2018

Description

Prediction

Clustering (Bayesian, hierarchical, k-means, CLARA, PAM)

Classification(Neural network, PLSDA,

KNN, decision trees)

Trend prediction

Regression(PLSR, PCR)

Association

Modelling (LU, QR, PCA=SVD,

PARAFAC)How many accidents are reported per day?

Percentage of use of available bicycles in downtown?

Which are the traffic bottle-neck regions in the city?

Is the number of car accidents related to seasons?

How will the use of bicycles will evolve in downtown during the summer of the next 5 years?

What type of cars are those that have more accidents in the highspeed roads?

Will increasing the parking cost reduce car traffic in the city and increase the amount of people using public transport?

84


28/09/2018

Spatial analysis of dynamic movements of Vélo’v, Lyon’s shared bicycle program [ECCS 2009]

Contribution: PCA et K-means for predicting the use tend of Velo’V in Lyon

Description Clustering PCA & k-means

Orthogonal linear transformation transforms the data to a new coordinate system such that - the greatest variance by some projection of the data

comes to lie on the first coordinate (first principal component),

- the second greatest variance on the second coordinate, and so on.

85


28/09/2018

Inferring the Root Cause in Road Traffic Anomalies [IEEE Data Mining 2012]

Contribution: PCA for identifying traffic anomalies and therebydetecting problems in roads

DescriptionModelling

Principal Component Analysis (PCA)

86

Can be sharded

DATA COLLECTIONS STORAGE: NO/NEWSQL SYSTEMS

C

A

P

C - A A - P

C - P

Data models

- Relational- Key-Value- Column oriented- Document oriented

- Dynamo- Voldemort- Tokyo Cabinet- KAI

- Cassandra- SimpleDB- CouchDB- Riak

- BigTable- HyperTable- Hbase

- MongoDB- TerraStore- Scalaris

- BerkeleyDB- MemcacheDB- Redis

- RDBM’s- MySQL- Postgres- etc

- Aster Data- GreenPlum- Vertica- Neo4j

Availability: each client can

always read & write

Partition tolerance: The system works well despite physical network partitions

Consistency: all clients always have

the same view of de data

28/09/2018

87


28/09/2018


4.821773, 45.7513









}

- Relational- Key-Value- Column oriented Tabular- Document oriented

Raw data collections

- How to transform data collections ?- Which is the best adapted model?

à Polyglot persistence

Approaches dealing with transformation rulesinspired in the relational case

tabular (csv, excel)

Media (XML, JSON, BLOB)Graph

88


28/09/2018

Persistence- Which part of the document must persist?- Explicit vs. implicit persistence- In memory / hard disk

Fragmentation/Sharding & replication: - Vertical or horizontal fragmentation- Strategies: range, hash, tagged- Distribution & location

Availability & Fault tolerance- Replication & distribution

September 28, 2018


4.821773, 45.7513









}

Memory/Cache

Raw data collections

89


28/09/2018

§ Varies with the size of the collection– 2 min for the Grand Lyon event data set of 2000 documents – 36 hours for the Twitter data set of 700000 documents

§ By far the biggest time consumer is the attribute collection contributing for 90 to 95% of the time


Grand Lyon

Twitter

90

§ Varies with the size of the collection– 6-7 sec for the Grand Lyon event data set of 2000 documents – 50 min for the Twitter data set of 700000 documents


28/09/2018


Grand Lyon

Twitter