21
Apache Apex Meetup Introduction to Apache Apex Real time streaming.. Really!!! Chinmay Kolhatkar [email protected] February 13, 2016

Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

  • Upload
    others

  • View
    14

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

Introduction to Apache ApexReal time streaming.. Really!!!

Chinmay [email protected]

February 13, 2016

Page 2: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

Agenda➔ Project History

➔ What is Apache Apex?

➔ Directed Acyclic Graph (DAG)

➔ Components of DAG

➔ Windowing

➔ Operator Lifecycle

➔ Apache Apex Architecture

➔ Other features

Page 3: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

Project History➔ Started development at DataTorrent in 2012

➔ Open-sourced under ASF in 2015

➔ Currently Have 50+ committers

➔ Free to use Streaming Application platform

Page 4: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

What is Apache Apex?➔ Apex project is under Apache Software Foundation

➔ Apex is a Streaming Application platform

➔ YARN-native application

➔ Complete implementation is done in Java

➔ Consist of 2 primary components

◆ Apex Core - Engine which facilitates Real time processing

◆ Apex Malhar - Out-of-the-box operators that can be used with Apex

Core

Page 5: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

➔ Defines compute stages

➔ Defined how tuple flow over compute stages over stream

Directed Acyclic Graph (DAG)

Filtered

Stream

Output StreamTuple Tuple

Filtered Stream

Enriched Stream

Enriched

Stream

er

Operator

er

Operator

er

Operator

er

Operator

Page 6: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

➔ Smallest atomic data that flows over a

stream

➔ Emitted by Operators after processing

➔ Received by next Operator for

processing

➔ Java objects which are serializable

➔ Types:

◆ Data Tuple

◆ Control Tuple

Components of DAG - Tuple

Page 7: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

➔ Logical compute unit

➔ Java code which processes a tuple

➔ Runs inside a JVM

➔ Types

◆ Input Adapter

◆ Generic Operator

◆ Output Adapter

Components of DAG - Operator

Page 8: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

➔ Connect operators

➔ Channel that carries the tuples from

one operator to another

Components of DAG - Stream

Page 9: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

➔ Ends of a stream

➔ Part of operator

➔ Types of ports

◆ Input Port

◆ Output Port

Components of DAG - Ports

Page 10: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

Windowing

➔ Tuples divided into time slices

➔ Windows are given ids (type:long)

➔ Also called as Streaming Window

● Default 500ms

Page 11: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

➔ Input Operator inserts control tuple

➔ Control tuple marks window boundary

➔ Different operator may be processing

different windows

➔ All management activities of data

happens at the boundary of window

Windowing (contd…)BeginWindowControl Tuple

EndWindowControl Tuple

Data Tuples

Window nWindow n+1 OutputAdapter

InputAdapter

GenericOperator

Page 12: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

➔ Called by Apex Platform

➔ Simple unit test like lifecycle

➔ Governed by control tuples

➔ All operators in DAG go through

this life-cycle

Operator Lifecycle

Page 13: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

➔ Setup

◆ Start of operator lifecycle

◆ Do any initialization here

➔ beginWindow

◆ Marks starting of window

➔ endWindow

◆ Marks end of window

➔ teardown

◆ Do any finalization here

◆ End of operator lifecycle

Operator Lifecycle (contd...)

Page 14: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

➔ emitTuples

◆ Called for Input Adapters

◆ Called in an infinite while

loop by platform

➔ process

◆ Called for Generic Operators

and Output Adapters

◆ Associated to to a port

◆ Called for every incoming

tuple

Operator Lifecycle (contd...)

Page 15: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

➔ OutputPort::emit

◆ Special method not part of

operator lifecycle

◆ To be called by operator

code

◆ Emits the tuples to next

operator

◆ Bound by Window

Operator Lifecycle (contd...)

Page 16: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

Apache Apex Architecture

Machine nodes (Physical or Virtual)

Hadoop (YARN) Distributed File System (e.g. HDFS)

Apache Apex Core (Streaming Engine)

Streaming Application Streaming ApplicationR

EST API

External Data

SourcesApache Apex Malhar

(Reusable Operators, Connectors)Custom Operators

Page 17: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

➔ Ease of Use

➔ Locality

➔ Fault Tolerance

➔ Scalability

◆ Partitioning

◆ Auto-scaling

Other features of platform

Page 19: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

Apex in Distributed Environment

Hadoop Edge Node

dtManage (Web UI)

Hadoop Node

YARN Container

App Master

Hadoop Node

YARN ContainerYARN Container

YARN Container

Thread1

Op2

Op1

Thread-N

Op3

Worker Container

Hadoop Node

YARN ContainerYARN Container

YARN Container

Thread1

Op2

Op1

Thread-N

Op3

Worker Container

CLI

dtGateway (REST API)

Part of DataTorrent RTS

dtGateway (REST API)

dtManage (Web UI)

Web Browser

Page 20: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

➔ AT_LEAST_ONCE (default)

◆ Windows are processed at least once

➔ AT_MOST_ONCE

◆ Windows are processed at most once

➔ EXACTLY_ONCE

◆ Windows are processed exactly once

Processing Modes

Page 21: Introduction to Apache Apex - files.meetup.com · Apache Apex Meetup Windowing Tuples divided into time slices Windows are given ids (type:long) Also called as Streaming Window Default

Apache Apex Meetup

➔ Saves operator state on HDFS

➔ Each operator undergoes checkpointing

➔ Done by platform

➔ Happens every 60 streaming windows by default i.e. 30 sec.

➔ Checkpoint is named by the windowId at which it happens

➔ If all operators gets checkpointed at same window, that checkpointed state

becomes “committed” state of application

➔ Committed state is used for recovery in case of failure

Checkpointing