58
Distributed Architectures Soware Architecture VO (706.706) Roman Kern Institute for Interactive Systems and Data Science, TU Graz 2020-01-15 Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 1 / 58

Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed ArchitecturesSo�ware Architecture VO (706.706)

Roman Kern

Institute for Interactive Systems and Data Science, TU Graz

2020-01-15

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 1 / 58

Page 2: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

IntroductionWhat is a distributed architecture?

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 2 / 58

Page 3: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Goals

Main goal: ScalabilityIn the optimal case a system scales linearly with the objective

E.g. number of transactions, size of the data, users

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 3 / 58

Page 4: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Solutions

Main solutions1 More e�icient algorithms2 Use faster hardware (e.g. more powerful machines)

I Scale vertically3 Add more machines

I Scale horizontally (scale out)I Parallel computing, distributed computing

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 4 / 58

Page 5: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures

Parallel computing vs. distributed computingIn parallel computing all component share a common memory, typically threads within asingle programIn distributed computing each component has it own memory

I Typically in distributed computing the individual components are connected over a networkI Dedicated programming languages (or extensions) for parallel computing

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 5 / 58

Page 6: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures

Distributed architecturesMost complex solution

I Due to the added parallelism of data and processingI And therefore an increased risk of errors

Overall latency will be the the one of the slowest machineI I.e. latency cannot be decreased via distributed solutions

Therefore the architecture needs to be sound

… focus on abstraction and composition

Note: In practice people are o�en biased towards technologies they know (e.g. SQL)

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 6 / 58

Page 7: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures

http://nighthacks.com/roller/jag/resource/Fallacies.html

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 7 / 58

Page 8: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures

Di�erent levels of complexityLowest complexity for operations, which can easily be distributed

I If they are independent and short enough be to executed independent from each otherI And if the data can be partitioned into independent parts

Higher degree of complexity for operations, which compute a single result on multiplenodes

I Synchronisation of data access also raises the complexity

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 8 / 58

Page 9: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures

Additional aspects of complexityComplexity = intrinsic complexity + accidental complexity

I Intrinsic complexity of the problem itself (i.e. how hard is the problem)I Accidental complexity arises from the implementationI Low accidental complexity good for maintenanceI If high accidental complexity, you can never be sure it has been correctly implementedI The risk of errors rises with the complexity (i.e. it scales)

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 9 / 58

Page 10: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures - Practical Advise

General advice to deal with complexityDesign for failure

I E.g. hardware will fail, human errors (bugs)I Limit their consequences

Push complexity into a single placeI E�ect of bugs will be minimisedI Also called “complexity isolation”

Avoid tricky operationsAvoid to store aggregates

I Prefer to deal with raw dataI Treat the data as immutable

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 10 / 58

Page 11: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures - Theory

Locality of ReferenceGranularity

I CPU caches to storage position within the data centre

In a distributed context: bring the code to the dataBest practice: position the data in a way it is accessed

I Partition the data according to some criteriaI e.g. Rack awareness of the computing infrastructure

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 11 / 58

Page 12: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Architecture for Simple Applications

Share Nothing ArchitectureNo centralised data storage

Can scale almost infinitely

Used since the beginning of the 80ies, popularised by Google

Only a few systems allow for such an architecture

ImplicationsIf a system requires some sort of shared resources or orchestrated processing, the complexityrises.

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 12 / 58

Page 13: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures BasicsBuilding blocks of distributed systems

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 13 / 58

Page 14: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics

Number of issues to address1 Serialisation2 Group membership3 Leader election4 Distributed locks5 Barriers6 Shared resources7 Configuration

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 14 / 58

Page 15: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Serialisation

Group serialisationTransform an object into a byte array (and back)

I Needed to transfer objects between nodes in a distributed environmentI Used to store objects (e.g. in databases)

Should work across programming languages

Therefore serialisation frameworks provide a Schema Definition Language

Examples: Thri�, Protocol Bu�ers, Avro

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 15 / 58

Page 16: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Group Membership

Group membershipWhen a single node comes online…

How does it know where to connect to?

How do the other members know of an added node?

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 16 / 58

Page 17: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Group Membership

⇒ Peer-to-peer architectural styleEach node is client, as well as server

Parts of the bootstrapping mechanism

Dynamic vs. static

Fully dynamic via broadcast/multicast within local area networks (UDP)

Centralised P2P - e.g. central login components/servers

Static lists of group members (needs to be configurable)

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 17 / 58

Page 18: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Leader Election

Leader electionNot all nodes are equal, e.g. centralised components in P2P networks

Single node acts asmaster, others are workersSome nodes have additional responsibilities (supernodes)

Having centralised components makes some functionality easier to implement

E.g. assign work-load

Disadvantage: might lead to a single point of failure

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 18 / 58

Page 19: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Leader Election

⇒ Client-server architectural styleOnce the leader has been elected, it takes over the role of the server

All other group members then act as clients

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 19 / 58

Page 20: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Leader Election

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 20 / 58

Page 21: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Leader Election

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 21 / 58

Page 22: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Leader Election

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 22 / 58

Page 23: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Locks

Distributed locksRestrict access to shared resources to only a single node at a time

E.g. allow only a single node to write to a file

May yield many non-trivial problems, for example deadlocks or race conditions

Distributed locks without central component are very complex to realise

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 23 / 58

Page 24: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Locks

Distributed locks example⇒ Blackboard architectural style

The shared repository is responsible to orchestrate the access to a locks

Notifies waiting nodes once the lock has been li�ed

This functionality is o�en coupled with the elected leader

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 24 / 58

Page 25: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Barriers

BarriersSpecific type of distributed lock

Sychronise multiple nodes

E.g. multiple nodes should wait until a certain state has been reached

Used when a part of the processing can be done in parallel and some parts cannot bedistributed

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 25 / 58

Page 26: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Shared Resources

Shared ResourcesIf all nodes need to be able to access a common data-structure

Read-only vs. read-write

If read-write, the complexity rises due to synchronisation issues

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 26 / 58

Page 27: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Zookeeper

Example: Apache ZookeeperZookeeper is a framework/library

Used by Yahoo!, LinkedIn, Facebook

Initially developed by Yahoo!, Now managed by ApacheFeatures

I Coordination kernelI File-system like APII Synchronisation, Watches, LocksI ConfigurationI Shared data

Alternative approaches: Google Chubby, Microso� Centrifuge

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 27 / 58

Page 28: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - DFS

Distributed File SystemsVirtual file system distributed over multiple machines

I Based on a local file system

Same semantics like traditional file systemsI Folders & files

Files are internally split into smaller blocks (e.g. 64MB)

Blocks are redundantly stored on multiple machines

Logic to record, which block is stored on which machine

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 28 / 58

Page 29: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - DFS

Examples of Distributed File SystemsGoogle File System (GFS)

I Based on a local file system

Hadoop Distributed File System (HDFS)I Portable, open-source solutionI Consists of (a single) name node and (many) data nodes

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 29 / 58

Page 30: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Sharding

ShardingSplit the data horizontally

Each node in a network may manage a separate chunk of the data

For example in web search engines

Each node is responsible for a number of web-pages

Returns search results from the local collection

All results from all shards are then combined into a single result

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 30 / 58

Page 31: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Sharding

Sharding example

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 31 / 58

Page 32: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - Sharding

Sharding - PropertiesNeed redundancy, in case a node goes down

Level of redundancy depends on the data

E.g. if a node with low-tra�ic web-pages goes down, it might not even have an impact onthe quality of the search results (at least on the first page)

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 32 / 58

Page 33: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Distributed Architectures Basics - �ality A�ribute

Anarchic ScalabilityDesign for a large distributed system

Parts of the system are developed independently from each other

Therefore the system need to be designed for malfunctioning or even maliciouslycomponentsThe web is an example where anarchic scalability is one of the most important aspects

I Thus clients are not expected to know all serversI … and servers are not supposed to know all clientsI Another consequence is the link integrity → only one-directional

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 33 / 58

Page 34: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous ArchitecturesThe default choice for scalable solutions

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 34 / 58

Page 35: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous Architectures - Motivation

Synchronous architecturesEach call terminates a�er the request has been completely processed

E.g. traditional data-centric architectures (databases)

Pro: Easy to use and predictive behaviour

Con: Does not deal well with load (need to plan with the worst case)

Asynchronous architecturesThe call returns before the request has been processed

The processing happens in the background

Con: Non predictive behaviour

Pro: Load can be distributed over time, thus be�er scalability

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 35 / 58

Page 36: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous Architectures - Worker

Asynchronous worker architecture

The client issues a call without waiting for the result

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 36 / 58

Page 37: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous Architectures - Worker

Asynchronous worker architectureThe client does not wait for the end of the processing

I.e. does not track the result

If the worker fails, all currently processed request will fail

�eue MotivationIntroduce a new component to track the processing of the requests, e.g. a queueing system

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 37 / 58

Page 38: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous Architectures - �eue & Worker

Asynchronous queue & worker architecture

The client puts the request into the queue, the workers poll the queue for requests

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 38 / 58

Page 39: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous Architectures - �eue & Worker

�eue and worker architectureThe client is decoupled from the worked via a queue component

The queue component is (usually) responsible to track the status of the requests

The size of the queue depends on the current load

If a worker fails, the request will be put back into the queue

Default architecture choice for many distributed systems

Note: O�en, the queue itself might be distributed and might store the requests (e.g. in adatabase)

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 39 / 58

Page 40: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous Architectures - �eue & Worker

�eue and worker architecture aspectsThe number of workers/applications may vary

I Single consumer vs. multi consumer queuesI Multiple independent workers that execute the same code vs.I … each worker has a di�erent task

Typically queues are FIFO (first in, first out)I Some queue support items with higher priority

In some configurations the application is responsible to track the statusA typical application may consist of multiple layers of queues and workers

I I.e. the output of a worker is fed as input to another queue

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 40 / 58

Page 41: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous Architectures - �eue & Worker

�eue & worker propertiesThe architecture is straightforward, but not simple

Race conditions may occurPartitioning might be hard to implement

I Bad for fault tolerance

Tedious to buildI Much code necessary for serialisation/routing logic/monitoring of queues/etc.

Complex deployment for all workers and queues

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 41 / 58

Page 42: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous Architectures - Stream processing

Stream processingThe queue & worker architecture can be used for stream processing

I Not suitable for real-time applications, due to delay from the queue

Continuous stream of incoming requestsI I.e. the queue is perpetually filledI O�en these reflect events in such se�ings

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 42 / 58

Page 43: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous Architectures - Stream Processing

Two basic stream processing typesOne-at-a-time

I Each event is processed individually

Micro-batchI Multiple events are combined into a single batch

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 43 / 58

Page 44: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous Architectures - Stream Processing

Basic execution semanticsAt-least-once

I Each event is guaranteed to be executedI But might be processed more o�en (thus the same result might be reported multiple times)I Might introduce inaccuracies

At-most-onceI Each event is optionally executed, but no more than once

Exactly-onceI Each event is guaranteed to be executed only once

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 44 / 58

Page 45: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous Architectures - Stream Processing

Trade-o� between the stream processing types

One-at-a-time Micro-Batch

Lower Latency XHigher Throughput XAt-least-once semantic X XExactly-once semantic (sometimes) XSimpler programming model X

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 45 / 58

Page 46: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous Architectures - Stream Processing

ImplicationsExactly-once can be achieved via strictly ordered processing

I Fully process an event before continuing

More e�icient for micro-batchesI Multiple batches in parallelI Need to store the batch-id of the last successfully processed batch

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 46 / 58

Page 47: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Asynchronous Architectures - Stream Processing

Strength Examples

Batch High throughput Hadoop, SparkOne-at-a-time Low latency StormMicro-batch Tradeo� Trident, Spark

Table: Comparison of the main distributed architectures

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 47 / 58

Page 48: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Lambda ArchitectureBig Data Architecture

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 48 / 58

Page 49: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Lambda Architecture - Motivation

Target scenarioLarge amount of data

I Too big for a single machine→ distributed system

Data is continuously updatedI Mostly just additions, i.e. new data

Majority of operations are read-onlyI E�ectively, queries on the data

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 49 / 58

Page 50: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Lambda Architecture - Motivation

Typical solutionThe typical solution would be a data-centric architecture

Data is stored in a distributed traditional RDBMS or noSQL databases

Updates are wri�en into the database (via transactions)

The queries are computed in a distributed manner, over all the data

As some queries are too slow, they need to be precomputed

… and the precomputed results need to be updated (as soon a new data comes in)

Therefore this can also be called a incremental architecture

The incremental architecture is too complex, mainly because of i) the (distributed) transactionsupport and ii) the complex algorithms to merge the new data.

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 50 / 58

Page 51: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Lambda Architecture - Motivation

Target propertiesRobustness & fault tolerance

Scalability

Generalisation

Extensibility

Ad hoc queries

Minimal maintainance

Debuggability

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 51 / 58

Page 52: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Lambda Architecture - Overview

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 52 / 58

Page 53: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Lambda Architecture - Overview

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 53 / 58

Page 54: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Kappa ArchitectureSimpler Alternative to Lambda Architecture

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 54 / 58

Page 55: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Kappa Architecture - Introduction

8 Rules of Stream Processing (Stonebraker et at., 2005)1 Keep the Data Moving2 �ery using SQL on Streams (StreamSQL)3 Handle Stream Imperfections (Delayed, Missing and Out-of-Order Data)4 Generate Predictable Outcomes5 Integrate Stored and Streaming Data6 Guarantee Data Safety and Availability7 Partition and Scale Applications Automatically8 Process and Respond Instantaneously

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 55 / 58

Page 56: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Kappa Architecture - Motivation

Basic idea behind Kappa ArchitectureThe Lambda Architecture is relatively complex

Most of the problems can be solved via a streaming approach→ treat the batch processing as if it were a stream of data

I Have the framework optimise for the two cases (stream vs. batch)

http://www.kappa-architecture.com/

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 56 / 58

Page 57: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

Kappa Architecture - Example

Example of streaming framework: Apache KafkaMessage queueing system

Stores all data for a given timespan, e.g. 2 daysTopics

I Partitions within topicsI Partitions are chosen by the application, i.e. both producers and consumers need to agree on

the partitions

Brocker

Each message within a partition has an o�set

Clients have a unique id and remember their o�set

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 57 / 58

Page 58: Distributed Architectureskti.tugraz.at/staff/rkern/courses/sa/slides_da.pdf · Distributed Architectures Di•erent levels of complexity Lowest complexity for operations, which can

The End

Roman Kern (ISDS, TU Graz) Distributed Architectures 2020-01-15 58 / 58