34
6. Lecture Introduction to Databases Advances in databases: NoSQL databases Cheng Fu Dept. of Geography, University of Zürich Rolf Meile Eidg. Forschungsanstalt für Wald, Schnee und Landschaft Zhiyong Zhou, Tutor Dept. of Geography, University of Zürich kudos to Dr. Haosheng Huang for providing this document Geo874 | HS20 Universität Zürich

6. Lecture Introduction to Databases Advances in databases: … · 2020. 10. 22. · 6. Lecture Introduction to Databases Advances in databases: NoSQL databases Cheng Fu Dept. of

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

  • 6. Lecture Introduction to Databases

    Advances in databases: NoSQL

    databases

    Cheng Fu

    Dept. of Geography, University of Zürich

    Rolf Meile

    Eidg. Forschungsanstalt für Wald, Schnee und Landschaft

    Zhiyong Zhou, Tutor

    Dept. of Geography, University of Zürich

    kudos to Dr. Haosheng Huang for providing this document

    Geo874 | HS20

    Universität Zürich

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    2

    Summary 5.1

    |

    ER model Step Relational model

    Entity type 1,2 Relation

    Binary 1:1 relationship type 3Add PK on the total participation

    side to the othe side

    Binary 1:N relationship type 4Add PK on the 1-side to the N-

    side

    Binary M:N relationship type 5 Relation and two FKs

    n-ary relationship type 7 Relation with n FKs

    Simple attribute 1, ... Attribute

    Composite attribute 1, ... Set of simple attributes

    Multivalued Attribute 6 Relation and FK

    Correspondence between ER- and relational models

    Elmasri & Navathe (2014): Ch 9, pp. 287-296.

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    3

    So far we covered...

    Relational databases– DB Design: Requirements -> conceptual -> logical ->

    physical

    – Entities and relationships

    – Relational model

    – SQL – the language of relational databases

    | Intro

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    4

    So far we covered...

    Relational databases– DB design process: requirements -> conceptual ->

    logical -> physical design

    – Basic assumptions:

    • Requirements: Use-cases can be defined and data can be structured at start

    • Conceptual: modelling entity types through fixed number of attributes, and their relationships

    • Logical: relational model, dense, row-based (tuple) data structures, row-first access, and SQL as access language

    • Physical: designed for a single server, ACID (atomicity, consistency, isolation, durability) transactions.

    So – what if some of above do not hold...

    | Intro

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    5

    Learning objectives

    You will understand current developments in data collection,

    storage and processing – Why NoSQL?

    You will understand the basic characteristics of NoSQL

    databases

    You will be aware of different NoSQL database types: key-

    value, document, column, graph

    | Intro

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Contents – NoSQL Databases

    1. Introduction

    2. Characteristics

    3. NoSQL database types: key-value, document, column,

    graph

    6

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    The shift to digital economy

    • Economy powered by the Internet and other 21st century technologies – the cloud, mobile, social media and big data

    • Key to every digital economy business: Web/mobile-based applications need to– Support large numbers of concurrent users (tens of thousands, millions)

    – Deliver highly responsive experiences to a globally distributed base of users

    – Be always available – no downtime

    – Handle semi- and unstructured data

    – Rapidly adapt to changing requirements with new features and frequent updates

    • Relational databases: structures and data types are fixed in advance

    • NoSQL: Applications can add new fields on the fly.

    Relational databases are unable to meet these new

    requirements!

    7 | Intro

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Big Data – V V V

    8 | Intro

    Buzzword, but...– Volume: large amounts of data.

    • Large: beyond what can reasonably fit into a singlememory/ hard drive

    – Velocity: data arriving at a high rate and need to bestored and possibly processed fast

    • E.g., streaming data from sensors, from social networkfeeds, at the stock exchange, created on the Web...

    • Delays are costly

    – Variety/Variability: data change their schema, formats, are inconsistent, and are of different types...

    • Need approaches that allow for easy adaptation in whatis stored and how it is processed

    – ... Some talk also of Veracity (how trustful the data can be, data quality)

    Early talk on Big Data (coining the term): http://www.quora.com/Who-coined-the-

    term-big-data (Roger Magoulas, 2009)

    http://www.quora.com/Who-coined-the-term-big-data

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Scalability

    9 | Intro

    – To solve Volume and Velocity we need scalable solutions: systems that can grow at least proportionally to the needs.

    – Purpose-built computers are $$$ (scale up, vertical scalability)

    – What if the data outgrow it?

    – Ability to add more small/cheap resources as requirements grow (scale out, horizontal scalability);

    – Allowing partial failure by providing redundancy

    – Consequence: databases are partitioned between multiple physical systems (computers).

    Relational databases are poor in supporting these!

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    • IBM mainframe: Z server

    10

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    NoSQL

    • Definition from http://www.nosql-database.org/– Next Generation Databases mostly addressing some of the points: being

    non-relational, distributed, open-source and horizontally scalable.

    – “not only SQL”

    – Typically no good at joins

    – Triggered by the needs of web 2.0 companies: Facebook, Google, Amazon.com

    – Increasingly used in big data and real-time web applications

    – More than 225 NoSQL databases• CouchDB

    • MongoDB

    • Hadoop & Hbase

    • Neo4J

    • Cassandra

    • …

    11 | Intro

    http://www.nosql-database.org/

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Contents – NoSQL Databases

    1. Introduction

    2. Characteristics

    3. NoSQL database types: key-value, document, column,

    graph

    12

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Characteristics of NoSQL databases

    • Simplicity of design: schema-less– SQL databases: structures and data types are fixed in advance

    – NoSQL databases: dynamic schemas; Applications can add new fields/attributes on the fly; Able to store large volumes of rapidly changing structured, semi-structured (e.g., JSON, XML), and unstructured data (e.g., video, audio, image, books, documents, journals, …)

    • Horizontal scaling (distribution) to clusters of machines– SQL databases: vertical scaling (add/remove resources to/from a computer server)

    – NoSQL databases: potentially thousands of machines, potentially distributed around the world

    • Be always available, highly responsive queries for global users– NoSQL databases: DB servers distributed globally

    13 | Characteristics

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Transaction properties in SQL databases: ACID

    – Atomicity• The set of operations on the database is either executed in its entirety, or not at all.

    • Entire transaction succeeds or fails (all-or-nothing)

    • Example: transfer of funds from account A to account B: either both, the debit on A and the deposit into B are executed, or none of the two.

    – Consistency• After a transaction has been executed, the integrity constraints have to be satisfied.

    • Valid state of DB before AND after transaction

    • Example: During the execution, there may be violations, but if they remain until the end, the transaction has to be undone (“aborted”).

    – Isolation• Enables transactions to operate independently of and transparently to each other.

    • Other transactions cannot access data that has been modified during a transaction that has not yet completed.

    – Durability (persistence)• After the successful completion of a transaction, the DBMS commits to make the outcome of

    the transaction permanent, even in the presence of concurrency and/or breakdowns

    – These are problematic with distributed database systems (NoSQL)!• Partitioning data into different servers

    | Transaction

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Partitioning consequences

    • Partitioning introduces communication overhead

    to assure database consistency after transactions –

    during synchronisation or replication process.

    • In physically distributed DBs network latency

    matters (10km = 67μs = 67x10-6s vs. cache write =10-20μs!)

    • Simultaneous updates may occur

    • Locking – concurrency control mechanism that

    assures exclusive access to a resource (value, row,

    table…)

    • BUT! Locking impacts on availability (the resource

    not available for other users when updated [locked])

    15 | Transactions

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Brewer’s CAP Theorem

    • CAP theorem: A distributed computer system can provide only two of the following guarantees at the same time.– Consistency (all nodes see the same data at the same time)

    – Availability (a guarantee that every request receives a response about whether it succeeded or failed)

    – Partition tolerance (the system continues to operate despite partitioning due to network failures)

    16 | Transactions

    Gilbert, S., & Lynch, N. (2002). Brewer's conjecture and the feasibility of consistent, available,

    partition-tolerant web services. ACM SIGACT News, 33(2), 51-59.

    C

    A

    P

    X

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    CAP illustration – Consistency violation

    17 | Transactions

    P1 P2

    U1 U2

    • T1 V: 1000

    • T2 Sync

    • T3 V: 1000 1000

    • T4 U1: V+500 = 1500

    • T5 U2: Read V = 1000

    • T6 Sync

    No locking:

    Accessible

    Can be partitioned

    Reads can be stale

    (old)

    AP system:Ensuring Availability and Partition tolerance

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    CAP illustration – Availability violation

    18 | Transactions

    P1 P2

    U1 U2

    • T1 V: 1000

    • T2 Sync

    • T3 V: 1000 1000

    • T4 U1: V+500 = 1500 Lock

    • T5 Sync

    • T6 U2: Read V = 1500

    With locking:

    Consistent

    Partitionable

    Temporarily inaccessible

    CP system:Ensuring Consistency and Partition tolerance

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    CAP illustration – Partitioning violation

    19 | Transactions

    P1 P2

    U1 U2

    • Putting in a mechanism similar to ACID transaction checks in a single

    server database and not accounting for failure of the communication

    network and its latency

    Not resilient

    to network

    failure

    CA system:Ensuring Consistency and Availability

    Relational DBs are CA systems

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Transaction in NoSQL

    • NoSQL: BASE instead of ACID

    – Basically Available: The system does guarantee the availability of the data; there will be a response to any request. But, that response could still be ‘failure’ to obtain the requested data or the data may be in an inconsistent or changing state.

    20 | Transactions

    – Soft state: The state of the system could change over time, so even during times without input there may be changes going on due to ‘eventual consistency,’ thus the state of the system is always ‘soft.’

    – Eventual consistency: The system will eventually become consistent once it stops receiving input. The data will propagate to everywhere sooner or later, but the system will continue to receive input and is not checking the consistency of every transaction before it moves onto the next one.

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Contents – NoSQL Databases

    1. Introduction

    2. Characteristics

    3. NoSQL database types: key-value, document, column,

    graph

    21

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    NoSQL databases types

    • Key-value

    • Document-based

    • Column-based

    • Graph-based

    22| Database Types

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Key-value database• Key-value database: like a files system where the path acts as the key

    and the file acts as the value.

    • It pairs keys to values– The key is unique, and can be auto-generated.

    – The value can be String, JSON, BLOB (Binary Large OBject), etc.• Key-value DBs just store these values, without caring or knowing what's inside; it's

    the responsibility of the application to understand what was stored.

    • Key-value DBs often use hash tables to map keys to values

    • Example databases: Riak, Redis, Memcached, …

    Key Value

    “India” {“B-25, Sector-58, Noida, India – 201301”}

    “Romania”

    {“IMPS Moara Business Center, Buftea No. 1, Cluj-Napoca, 400606”,

    “City Business Center, Coriolan Brediceanu No. 10, Building B,

    Timisoara, 300011”}

    “US” {“3975 Fair Ridge Drive. Suite 200 South, Fairfax, VA 22033”}

    23| Database Types

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Document-based database

    • Document-based DBs store records as “documents” where a document can generally be thought as a collection of key-value pairs.– Format (encoding) of documents: XML, JSON, BSON (Binary JSON), …

    – The documents provide some structure and encoding of the managed data.

    • Compared to Key-value DBs, document-based DBs embed attribute metadata associated with stored content, which essentially provides a way to query the data based on the contents.

    • In a document DB, each document carries its own schema — unlike a relational database, in which every row in a given table must have the same columns.

    • Examples: – MongoDB: used by LinkedIn, Foursquare, eBay, …

    – CouchDB

    – …

    24| Database Types

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    CouchDB JSON example: a ‘customer’ document

    {

    "_id": "1189802380023",

    "_rev": "314159",

    "customer": "peter",

    "gender": "male",

    "likes": ["Biking", "Photography"],

    "address": {

    "Country": ”Switzerland",

    "City": ”Zurich"

    }

    }

    Global Unique Identifier, Passed in or

    generated by CouchDB

    Revision number, versioning mechanism

    Arbitrary tags, schema-less,

    could be validated by the

    programmers

    • Each customer is a “document”.

    – A database might have tens of thousands or even millions of “documents”.

    • Different documents do not need to have the same structure.

    – e.g., another customer document might have a new tag like “telephone”.

    – schema-less: different from relational databases

    25| Database Types

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Column-based database

    • Relational databases store all the data in a particular table’s rows together on-disk, making retrieval of a particular row fast.

    • Column-based databases generally store all the values of a particular column together on-disk, which makes retrieval of a large amount of a specific attribute fast.

    • This approach is very suitable for aggregate queries and analytics scenarios where you might run range queries over a specific field/attribute.

    • Example database: Google’s BigTable, Hadoop’s HBase, …

    Id Tweet … …

    1 “...”

    2 “…”

    3 “…”

    26| Database Types

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Graph-based database

    • Graph-based databases are good at dealing with interconnected data.

    • Graph databases consist of connections, or edges, between nodes. Both nodes and

    their edges can store additional properties such as key-value pairs.

    – Node: an instance of an object

    • The strength of a graph database is in traversing the connections between the nodes.

    • Graph databases are very suited to problem spaces where we have connected data, such as social networks, routing information for goods and money, recommendation engines

    • Example databases: Neo4J, Infinite Graph…

    27| Database Types

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    28

    https://www.youtube.com/watch?v=jyx8iP5tfCI

    Seven Databases in Song

    https://www.youtube.com/watch?v=jyx8iP5tfCI

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Summary

    • NoSQL databases

    – schema-less: no predefined schema

    – horizontal scaling (distribution) to clusters of machines

    – ensure availability

    – Transaction properties: BASE instead of ACID

    • SQL database: Atomicity, Consistency, Isolation, Durability

    • NoSQL database: Basically Available, Soft state, Eventual consistency

    • Different NoSQL database types

    – key-value pairs, documents, graphs, ...

    – difficult to switch to another NoSQL database (Disadvantage)

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Recap

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    What we covered in this semester

    • Why we need databases

    • Basics of relational DB

    • Relational DB design: requirement analysis conceptual DB design logical DB design physical DB design

    • Structured Query Language (SQL), PostgreSQL

    • NoSQL

    We hope you will find these useful in further studies and career.

    Hands-on in Geo 875

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Next lecture session

    • Intro to big data processing

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    Later, in lab

    • 10:15-12:00

    • Y25-J-09, Y25-J-10

    • Physical DB design and realization

  • Geo874 | Intro to Databases | HS20

    C. Fu, R. Meile, Uni Zürich

    L6 | Advances in Databases

    34

    Next week – exam!

    • About 60 mins

    • Stuff covered in lectures and practicals

    • Not allowed to take lecture notes

    • Pencil and paper

    • Friday 30.10.2020 8:40 am - 9:40 am in Y15-G-40

    • Be here at 8:30 am!

    • No lab on 01.11.2019

    • Good Luck!

    | Summary