Loden_302 BO UC-bods.pdf

Embed Size (px)

Citation preview

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    1/44

    ]Session Code: 302

    Using BusinessObjects Data Services to combat ETL

    problems arising from bad source Data

    Don Loden

    SARAH MERTZ[ASUG INSTALLATION MEMBER

    MEMBER SINCE: 1998

    DAVID SWIERENGA[ASUG INSTALLATION MEMBER

    MEMBER SINCE: 2005

    GREG REISCHLEIN[ASUG INSTALLATION MEMBER

    MEMBER SINCE: 2007

    Business Intelligence ConsultantDecision First Technologies

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    2/44

    [

    Discovering and demonstrating best practices in data flow andETL design

    The water bucket brigade

    Comment to enhance transparency and documentation

    Approach job development with an object-oriented focus

    Learning Points

    Real Experience. Real Advantage. 2

    ange a a ap ure Got good delta process? Recovery, performance, and overlap.

    Using source- and target-based CDC

    Wrapping complex database functions seamlessly

    A multi-user approach to keep the project rolling while waitingfor complex code

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    3/44

    [

    Discovering and Demonstrating Best Practices in Data Flow and ETL Design

    Examining Change Data Capture (CDC)

    Keeping the Project Rolling: A Multi-User Approach Wrap-Up

    What Well Cover

    Real Experience. Real Advantage. 3

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    4/44

    [

    Discovering best practices in Data Flow and ETL design

    The water bucket brigade

    Comment to enhance transparency and documentation Approach job development with an object-oriented focus

    In This Section

    Real Experience. Real Advantage. 4

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    5/44

    [

    SAP BusinessObjects Data Integrator is an object-orientedtool: Exploit the nature of the tool

    Try to approach job design with the idea of reusability When you have two tasks that are primarily the same, try to

    control the flow with variables and conditionals

    Dont be afraid of sub- obs

    Best Practices in ETL: Think Reusability

    Real Experience. Real Advantage. 5

    Sometimes sub-jobs are the only way to accomplish a task such asswitching system configurations

    If you perform an operation more than two or three times,consider writing a function

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    6/44

    [

    Benefits of reusability in a team environment: Using functions and variable-driven conditional transforms

    Demo

    Real Experience. Real Advantage. 6

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    7/44

    [ Reusability Through Conditionals

    Use conditionals to reuse

    logic. This makes code

    flexible, but takes

    Real Experience. Real Advantage. 7

    -

    oriented nature of Data

    Services.

    All objects have comments, but note the workflow. This is a multi-use

    object so an annotation was used. This way the description could be

    specific to the current task.

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    8/44

    [ Reusability: Whats Inside the Conditional?

    If the variable test from the

    last workflow does not

    contain a value, then print to

    the log.

    Real Experience. Real Advantage. 8

    If the variable test from the

    last workflow does contain a

    value, then run the data flowto update the INVDETL table.

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    9/44

    [

    Group your workflows into subject areas

    Approach job design with the idea of organization

    Place logical components together and order them in the way theywill need to execute according to database constraints

    Utilize a group-based numbering system

    Number our sub ect areas in the order of execution

    Best Practices in ETL: Think Organization

    Real Experience. Real Advantage. 9

    Group 1 Lookup tables Group 2 Address

    Group 3 Product

    Group 4 Customer

    Group 5 Sales

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    10/44

    [ Organizational Work flow Example for a Job

    Always use a fully commented

    header for the job.

    Real Experience. Real Advantage. 10

    Notice the comments in the object

    descriptions. These expand on the

    numeric groupings to share the

    subject area name.

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    11/44

    [

    Avoid SQL Transforms

    Dont lose your metadata by using an SQL Transform when a

    query transform will suffice The Metadata Reports are powerful tools; do not short-circuit

    them by removing metadata

    Best Practices in ETL: Maintain Metadata

    Real Experience. Real Advantage. 11

    Avoid using database views for data stores

    You lose all aspects of the table and are constrained only to thecolumns in the view for metadata reporting

    Often these situations present performance problems

    Preserve metadata by using built-in transforms when at all possible.

    Avoid SQL Transforms!

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    12/44

    [

    Divide your data flows into meaningful work sections:

    Driver Stage

    Limit your data set with your first query transform Push down all possible operations to the database

    The database will be much more efficient withmost operations

    The Water Bucket Brigade

    Real Experience. Real Advantage. 12

    Data Services is designed with the intent ofstratifying the work load between the databaseserver and the job engine

    Parsing Stage Parsing will mostly occur in the job server

    Make this happen after driving down the amount of data Lookup Stage Lookups are expensive and powerful

    Perform lookups after data is limited and parsed

    The key is to always limit the result set before evoking expensive

    operations in the job engine

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    13/44

    [

    Divide your data flows into meaningful work sections: (cont.)

    Business Rules Stage Validate your data using transforms such

    as validation, case logic, and error trapping Always strive to provide metrics around your data

    Validation transforms provide great ways to gather statistics

    Validation provides statistics of errors that are automatically collected

    The Water Bucket Brigade (cont.)

    Real Experience. Real Advantage. 13

    n e e a a a epor s

    To access the Metadata Reports:

    Data Services Management Console MetadataReports

    To ensure that Validation efforts are not lost in the Metadata

    Reports, always give your validations meaningful names while

    setting up the transforms.

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    14/44

    [ Driver Stage Example

    Real Experience. Real Advantage. 14

    Use the first query transform as a driver stage to limit the data set.

    Push this query down to the database whenever possible.

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    15/44

    [ Parsing Stage Example

    Real Experience. Real Advantage. 15

    After limiting the data set, perform the parsing that will most likely occur

    in the job server engine.

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    16/44

    [ Lookup Stage Example

    Real Experience. Real Advantage. 16

    Lookups, by nature, almost always happen in the engine and are very

    expensive iterative operations.

    Lookups need to be downstream from the driver logic.

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    17/44

    [ Business Rule Enforcement Stage Example

    Real Experience. Real Advantage. 17

    This example shows case logic as well as validation.

    Business rules are enforced last. These may take multiple forms.

    Though both could have been hidden in query transforms, strive to show

    logic graphically using the built-in transforms.

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    18/44

    [

    Sprinkle comments throughout your job to enrichunderstanding as well as provide documentation

    Use comments: At the job level

    Describe variables and the purpose of the job

    Use Comments Generously

    Real Experience. Real Advantage. 18

    At the workflow level

    Describe grouping, parallelism, and logical flow of the job

    At the data flow level

    Describe the transformation logic and target table options

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    19/44

    [

    Always use object-level descriptions on single and reusableobjects

    Single-use objects are the most important and should alwayspossess an object description

    Be careful with object descriptions on reusable objects

    Make the descri tion eneric to the basic function of the ob ect

    Use Annotations and Object Descriptions

    Real Experience. Real Advantage. 19

    Generally as a rule, make object descriptions match anannotation used as a header

    Use annotations when you want to call attention to something

    Annotations are also great to use for target table options

    Use an annotation for the specifics on a reusable object

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    20/44

    [Job-Level Comments

    Job-level comments should show variable values,

    schedule, execution time, and dependencies.

    Real Experience. Real Advantage. 20

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    21/44

    [Work Flow-Level Comments

    Real Experience. Real Advantage. 21

    Use an annotation instead of an object description when

    either the object can change or the comment only

    applies to this particular situation.

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    22/44

    [ Data Flow-Level Comments

    Real Experience. Real Advantage. 22

    Always explicitly state unique source or target table options inannotations. This adds meaningful documentation but also avoids

    unnecessary clicking through the job.

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    23/44

    [

    Always approach job development with an object-orientedfocus

    Think of logic in self-contained pieces If you have to perform an operation multiple times, then use the

    same code and drive the behavior with a variable

    Turn sin le-use ob ects into multi-use ob ects

    Object-Oriented Job Development

    Real Experience. Real Advantage. 23

    Envelop single-use objects like conditionals in workflows

    Dont wait on complex logic that holds up development

    Use placeholders in your code while waiting for the logic

    Dont hold up an entire flow for one or two columns

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    24/44

    [ Demo: Using Placeholder Logic

    Real Experience. Real Advantage. 24

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    25/44

    [ ETL Best Practices Summary

    Think Reusability Object-Oriented development

    Think Organization Organize code into groups

    Maintain Metadata Avoid SQL transforms

    Real Experience. Real Advantage. 25

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    26/44

    [What Well Cover

    Discovering and Demonstrating Best Practices in Data Flowand ETL Design

    Examining Change Data Capture (CDC)

    Keeping the Project Rolling: A Multi-User Approach

    Wra -U

    Real Experience. Real Advantage. 26

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    27/44

    [

    Change Data Capture (CDC)

    Got good delta process? Recovery, performance, and overlap.

    Using source- and target-based CDC Wrapping complex database functions seamlessly

    In This Section

    Real Experience. Real Advantage. 27

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    28/44

    [

    Why bother with change data capture?

    Delta loads only load-changed records where initial loads load all

    covered data Performance is always better when loading only changed records

    Recover a job when there is not time to run the full load

    Change Data Capture (CDC)

    Real Experience. Real Advantage. 28

    - -

    Strive for date- or process-ID-based CDC

    Use target when no other option presents itself

    Use table compares or Auto Correct on lesser data volumes

    Wrap database functionality such as hash or checksums with largevolumes

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    29/44

    [

    Recovery

    What is a job execution function, and why do I want one?

    The job execution function determines your delta strategy and isgenerally based upon source-based date stamp or process-IDcomparisons

    Performance:

    Got Good Delta Process

    Real Experience. Real Advantage. 29

    Use the job execution function to drive source-based CDC This limits the data set up front to maximize performance

    Overlap

    What is your overlap strategy?

    Can you avoid overlap, must you reconcile, or can you pre-sample?

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    30/44

    [Job Execution Function

    The Job Execution Function and Table extend the recoverycapabilities of Data Services:

    Note the Job_Execution table below. The job execution functionis used to create the records in this table.

    Real Experience. Real Advantage. 30

    CDC is handled with the extractlowand extracthigh dates

    Data Services Recovery is extended with the status value:

    Done creates a new record

    Started reuses the dates and process ID to recover from a failure

    Min and max dates are derived and attached to a process ID

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    31/44

    [ Demo: Job Execution Function and CDC

    Here well step through source- and target-based CDC andhow the job execution table and job execution function play

    into this scenario

    Real Experience. Real Advantage. 31

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    32/44

    [What Type of Overlap Strategy Suits You?

    Overlap strategies: avoidance, reconciliation, and pre-sampling

    Which strategy suits your needs?

    Overlap avoidance: If the job is small enough, you may avoidperforming overlap at all

    If your load window is one day and the initial runs in a day

    Delta loads would begin after this point and there is no need to overlap

    Real Experience. Real Advantage. 32

    Requires a process to apply changes that occurred during your initialload

    Load the initial with a termination value, such as amax date

    Run the reconciliation job to cover the range fromthe max date to the current date

    Delta load starts from the current date

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    33/44

    [ Overlap Strategy (cont.)

    Overlap strategies: avoidance, reconciliation, and pre-sampling

    Which strategy suits your needs? (cont.)

    Pre-sampling: Preferable if possible First sample the max timestamp and record this value, then load data

    until this value. Your first delta load then becomes the min value.

    Real Experience. Real Advantage. 33

    ,

    and the job

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    34/44

    [ Source- or Target-Based CDC?

    Source-based is optimal for performance and clean execution

    Strive for date- or process-ID-based CDC

    Date is best for jobs from operational source systems

    Process ID is useful for date warehousing applications that need tobe removed from artificial date constructs

    Real Experience. Real Advantage. 34

    , -

    Use table compares or Auto Correct on smaller data sets

    If data sets are larger, then look to integrate databasefunctionality such as a hash or check-sum value

    These are used only when no other option is available

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    35/44

    [

    Bring complex functionality into a Data Services job using DataServices functions as wrappers

    Take advantage of the strengths of your team by dividingdevelopment

    Have one developer create the database function, then import thatfunction into Data Services for the rest of the team

    Wrapping Complex Data Functions

    Real Experience. Real Advantage. 35

    Now the function is fully reusable by any member of the teamwithout special knowledge of the native database, coding, or

    store procedure/function creation

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    36/44

    [ Demo: Wrap Database Functions

    To demonstrate multi-use and ease of sharing, Ill wrapdatabase check-sum functions for complex CDC into a job

    Real Experience. Real Advantage. 36

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    37/44

    [What Well Cover

    Discovering and Demonstrating Best Practices in Data Flowand ETL Design

    Examining Change Data Capture (CDC) Keeping the Project Rolling: A Multi-User Approach

    Wrap-Up

    Real Experience. Real Advantage. 37

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    38/44

    [ Multi-User Approach

    Try a multi-user approach to keep the project rolling whilewaiting for complex code

    Dont make your developers wait on complex logic Split tasks

    Example If complex database logic is needed, have the functionsdevelo ed b one team member then wra ed into functions so that all

    Real Experience. Real Advantage. 38

    can use in Data Services

    Use empty tables as placeholders if code is not finished to populatethem

    If you have the physical schema for the table, then that is enough toimport into Data Services to construct your code

    [

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    39/44

    [ Multi-User Approach (cont.)

    Try a multi-user approach to keep the project rolling whilewaiting for complex code (cont.)

    Divide up the project by the group-based subject areas todistribute the workload among the team

    Group 4 (Customer) development should be going on in tandemwith Grou 5 Sales

    Real Experience. Real Advantage. 39

    Both groups will need to be run in order; however for development,there is nothing stopping tandem development

    Have the ETL Architect integrate the project and maintain

    the holistic design for tandem development

    [

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    40/44

    [What Well Cover

    Discovering and Demonstrating Best Practices in Data Flowand ETL Design

    Examining Change Data Capture (CDC)

    Keeping the Project Rolling: A Multi-User Approach

    Wrap-Up

    Real Experience. Real Advantage. 40

    [

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    41/44

    [ Key Learnings

    Document your flows with comments to maintain transparencyand improve documentation

    Remember Auto Documentation will capture these comments toprovide useful additions to your documentation reports

    Simplify complex flows for other developers on the team

    Real Experience. Real Advantage. 41

    n uc e r ga e es gn ows n o sma c un s o

    logic Data Services performs best with well-divided flows

    Divide your flows for more chances of reusability

    Try to maintain metadata by using built-in transformswhenever possible

    Avoid SQL Transforms

    [

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    42/44

    [ Key Learnings (cont.)

    Design jobs from an object-oriented point of view

    Use source-based CDC where possible

    Performance is best with source-based CDC Wrap database functionality into functions when there is no

    other CDC option or function equivalent

    Real Experience. Real Advantage. 42

    Functions provide excellent wrappers to reuse complex databaselogic

    Reuse functions without team members having to re-writecomplex stored procedures and database functions

    While waiting in a multi-user environment, use placeholders inyour data flows

    Use the object-oriented nature of Data Services to keep theproject moving

    [

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    43/44

    [ Resources

    SAP Community Network DI Performance Blog:

    https://wiki.sdn.sap.com/wiki/display/BOBJ/Performance

    SAP Community Network DI Tips and Tricks: https://wiki.sdn.sap.com/wiki/display/BOBJ/BusinessObjects%20D

    ata%20Services%20Tips%20and%20Tricks

    Real Experience. Real Advantage. 43

    BusinessObjects Board forum (not affiliated with SAP)

    www.forumtopics.com/busobj/index.php

    Useful Data Integrator forum with many categories such asadministration, design, performance, and real time

    [

  • 7/27/2019 Loden_302 BO UC-bods.pdf

    44/44

    [

    ]

    Thank you for participating.

    SESSION CODE: 302

    Please remember to complete and return your

    evaluation form following this session.

    For ongoing education on this area of focus, visit the Year-Round

    Community page at www.asug.com/yrc

    Real Experience. Real Advantage. 44

    Services to combat ETL problems

    arising from bad source data

    How to contact me:Don [email protected]