Centre de Calcul de l’Institut National de Physique Nucléaire et de Physique des Particules
XLDB 2017 10/11/2017
} Very large amounts of data
} Very volatile data
} Data as streams
} Schemaless data
} Data quality
Modern data management
} XLDB from 2011
} XLDB 2017 ◦ 2.5 days ◦ 140 attendees ◦ 14 talks ◦ 17 lightning talks ◦ 2 poster sessions ◦ 2 demonstrations
} Speakers ◦ LIRIS ◦ Databricks (SPARK) ◦ CERN ◦ SAP Big Data ◦ Imperial College ◦ …
XLDB
} SAP Hub ◦ Platform for connecting to Hadoop, …
} Liris ◦ Evaluation of Hive and HadoopDB with LSST data ◦ Performance ▪ Loading: Hive is better ▪ Query response time: HadoopDB is better at high data volumes ◦ Scalability (25 to 50 machines): both scale well
Talks
} Liris proposal
Talks
} Flink: similar to SPARK ◦ Batch, streaming, …
} SPARK: new streaming features ◦ SQL can be run on streams (Bullet) ◦ Checkpoints improve fault tolerance ◦ Aggregation by window
} LeanXcale ◦ New database vendor ◦ Scalable transaction management: scales to many millions of transactions per second ◦ OLTP and OLAP
Talks
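The "aggregation by window" item listed for SPARK streaming can be illustrated with a minimal sketch in plain Python. This is not the Spark API: the function name `tumbling_window_counts`, the `(timestamp, key)` event layout, and the 10-second window are assumptions made for illustration. It shows only the core idea of a tumbling (fixed, non-overlapping) window: each event is assigned to the window containing its timestamp, and an aggregate is kept per window and key.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_in_seconds, key) pairs (assumed layout).
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Integer division maps every timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical stream: six events with integer timestamps in seconds.
events = [(0, "a"), (3, "a"), (9, "b"), (10, "a"), (14, "b"), (21, "a")]
result = tumbling_window_counts(events, window_seconds=10)
```

A real streaming engine additionally has to handle late or out-of-order events and to persist the partial counts (the checkpoints mentioned above) so that a restart does not lose them; the sketch ignores both.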
} MonetDB by ◦ Column storage ◦ Interesting product but no map and not very well documented
} CEPH presentation (CERN) ◦ Open source product ◦ CEPH does not use replicated block ◦ Storage virtualisation
Talks
} CloudMdsQL ◦ Provides integrated access to multiple heterogeneous cloud data stores such as NoSQL, HDFS and RDBMS ◦ Other polystores: SPARKSQL, Polybase … ◦ Issue: executing joins across RDBMS, HDFS and NoSQL ◦ Not open source (LeanXcale)
Talks
[Figure: QueryProcessor with an RDBMSWrapper and an HDFSWrapper; query fragments shown: SELECT id, x FROM A and SCAN(…).MAP(…).REDUCE(…).FILTER(KEY IN (1,3)).PROJECT(…)]
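The polystore idea on this slide (a query processor that delegates sub-queries to per-store wrappers and joins the results) can be sketched as follows. This is not CloudMdsQL code. The function names, the contents of table `A`, the record layout of the file-like store, and the choice of a sum as the reduce step are all assumptions; an in-memory SQLite database stands in for the RDBMS and a Python list stands in for HDFS.

```python
import sqlite3
from collections import defaultdict


def rdbms_wrapper(conn):
    """Sub-query pushed down to the relational store."""
    return conn.execute("SELECT id, x FROM A").fetchall()


def hdfs_wrapper(records, keys):
    """SCAN -> MAP -> REDUCE -> FILTER over file-like records (stand-in for HDFS)."""
    mapped = ((r["key"], r["value"]) for r in records)  # SCAN + MAP to (key, value)
    sums = defaultdict(int)
    for key, value in mapped:                           # REDUCE: sum values per key (assumed)
        sums[key] += value
    return [(k, total) for k, total in sums.items() if k in keys]  # FILTER(KEY IN ...)


def query_processor(conn, records, keys):
    """Hash join of the two sub-results on A.id = key, done in the processor."""
    right = dict(hdfs_wrapper(records, keys))
    return sorted((i, x, right[i]) for i, x in rdbms_wrapper(conn) if i in right)


# Hypothetical data for both stores.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (id INTEGER, x TEXT)")
conn.executemany("INSERT INTO A VALUES (?, ?)", [(1, "u"), (2, "v"), (3, "w")])
records = [
    {"key": 1, "value": 10},
    {"key": 3, "value": 5},
    {"key": 1, "value": 7},
    {"key": 2, "value": 4},
]
joined = query_processor(conn, records, keys={1, 3})
```

The join itself runs in the processor because neither store can see the other's data. Deciding how much work to push down to each wrapper before that join is the hard part that the slide lists as an issue.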
} Knowledge Preservation in HEP (Notre Dame) ◦ Huge investment in producing data for science ◦ Data can be wasted or not reused ◦ Data preservation: “backing up your hard drive” ◦ Harder problem: software + “knowledge” ◦ Data And Software Preservation for Open Science
▪ CERN – DESY – CNRS ▪ Container portability = preservation! ▪ CERN Open Data Portal
▪ CERN Analysis Preservation ◦ How can new analysis tools be preserved?
Talks
} European Bioinformatics Institute: genomics based on Elasticsearch
▪ 8 data nodes (2 cores / 32 GB RAM / 200 GB disk) ▪ 3.2 billion documents / 782 GB ▪ Complex query on >100 million genes: ~500 ms
} Bullet ▪ A real-time query engine that lets you run queries on very large data streams ▪ Open source ▪ Components: Storm, Kafka
} Kafka Kloner, similar to MirrorMaker ◦ A dynamic high-speed inter-cluster Kafka replicator ◦ Developed by Yahoo for Yahoo ◦ 150 billion events per day with an average latency of around 2 s
Lightning talks
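To put the Kafka Kloner figures in perspective, the daily volume can be converted to a per-second rate. The sketch below assumes the load is spread evenly over the day, which real traffic is not, so the result is an average and not a peak. The in-flight estimate applies Little's law (items in the system = arrival rate × time in the system) to the quoted 2 s average latency.

```python
EVENTS_PER_DAY = 150e9          # figure quoted on the slide
AVERAGE_LATENCY_SECONDS = 2.0   # figure quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Average replication rate, assuming a uniform load over the day.
events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY

# Little's law: average number of events in transit between clusters.
events_in_flight = events_per_second * AVERAGE_LATENCY_SECONDS

print(f"{events_per_second:,.0f} events/s on average")
print(f"{events_in_flight:,.0f} events in flight on average")
```

Under these assumptions the replicator sustains roughly 1.7 million events per second, with about 3.5 million events in transit at any moment.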
} CERN Openlab project with SPARK ◦ Physics Data Analytics and Data Reduction with Apache Spark ◦ CMS Experiment
} Oracle Database In-Memory ◦ Significant improvement for data warehouse appliances
Lightning talks
} AstroSpark ◦ SPARK for astronomical data: Cone-Search, Cross-Match … ◦ Data partitioning and indexing with HEALPix ◦ Query optimizer for astronomical queries ◦ Astronomical Data Query Language support
Lightning talks
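A cone search, one of the query types named for AstroSpark, returns every source within a given angular radius of a sky position. The sketch below is a brute-force version in plain Python, not AstroSpark's implementation: the catalog layout `(name, ra, dec)` in degrees, the function names, and the sample coordinates are assumptions. It uses the haversine formula for the great-circle separation.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return catalog entries (name, ra, dec) within radius_deg of (ra, dec)."""
    return [src for src in catalog
            if angular_separation_deg(ra, dec, src[1], src[2]) <= radius_deg]


# Hypothetical catalog: (name, right ascension, declination), all in degrees.
catalog = [
    ("s1", 10.0, 20.0),
    ("s2", 10.05, 20.02),
    ("s3", 12.0, 20.0),
    ("s4", 10.0, 21.5),
]
hits = cone_search(catalog, ra=10.0, dec=20.0, radius_deg=0.1)
```

This version scans the whole catalog. The point of partitioning and indexing with HEALPix, as listed on the slide, is to restrict the scan to the sky pixels that overlap the cone, so that only a small fraction of the partitions is read.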
} QSERV ◦ Ran 2 queries before losing the SSH connection at CC ◦ Shared-nothing architecture ◦ Big challenge, with much still to do: ▪ Fault tolerance ▪ Data distribution ▪ Big queries
} Wikidata
Demonstration