Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
ChristinaKHNAISSER etLucLAVOIEDépartementd’informatiqueFacultédessciences
[email protected]@usherbrooke.cahttp://info.usherbrooke.ca/llavoie
Architectureetdimensionnalité
2020-11-13
BASES DE DONNÉESENTREPÔTS DE DONNÉES
BD301v110b
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
2
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
PLAN
¢Architecture¢Dimensionnalité¢Techniquesdimensionnelles¢Conseilsd’Adamson¢Questionsouvertes¢Références
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
3
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
ARCHITECTUREPROPOSITIONS
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
4
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
ARCHITECTUREMODÈLE DATAMART « AD HOC »
!!!DMVentes
DMFabrication
DMExpédition
!!!
!!!
Source 1
Source 2
…
Source n
ETL
ETL
ETL
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
5
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
ARCHITECTUREMODÈLE « INFORMATION FACTORY »(INMON)
!!!
Bases de données spécialisées(e.a.: DM)
Source 1
Source 2
…
Source n
ETL
ETL
ETL
Entrepôt(de(données(normalisé
Applications(e.a.: SAD)
Exploitation et forage de données
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
6
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
ARCHITECTUREMODÈLE « DIMENSIONAL DW »(KIMBALL)
!!!
Source'1
Source'2
…'
Source'n
ETL
ETL
ETL
Entrepôt'de'données
VDMVDMDM DM
VDM
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
7
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
ARCHITECTUREMODÈLE « DIMENSIONNEL »CONTEMPORAIN
Source 1
Source 2
Source …
Source n
ELT 1
ELT 2
ELT n
ELT …
Zone de chargement
Zone de traitement
Zone de stockage
Zone des vues
DMk
DM 1ED
EA
Applications analytiques
App 1
App 2
App …
App m
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
8
ARCHITECTURESYNTHÈSEArchitecture Description CaractéristiquesAdhoc*DataMart*Silo*Stovepipe*Island
Schémaspécifiquesanscontexteorganisationnel
Souventunschémadimensionnel(étoilesoufloconsdeneige)
Factory*Atomic DW*EnterpriseDW
Unschémaorganisationneletplusieursschémasspécifiques.
Leschémaorganisationnelestnormalisé.LLesschémasspécifiquessontdimensionnels.
DDW*Enterprise DW*BusDW*Architected DM*Virtual DM
Unschémaorganisationnelcomprenantdesvuesspécifiques.
Leschémaorganisationnelest« purement »dimensionnel.Lesvuesspécifiquessontsouventmatérialisées.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
9
ARCHITECTURE INTERNEQUESTIONS ET PROPOSITIONS¢ Quellestructurepourleschéma?¢ Quellemodélisationtemporellepourleschéma?¢ Commentconstruireleschéma?¢ Commentalimenter(ELTouETL)?¢ ...
¢ Quelquespropositions� Inmon� Kimball� Jiang� Snodgrass� Johnston&Weis� Date,Darwen&Lorentzos� ...
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
Chaque« Region »résidesurun« Cluster »deserveursquiluiestpropre.La« Staging Region »estreprésentéesousformedetablesmuniechacuned’unecléartificielle,sansaucunecontrainte.La« ProcessingRegion »estune« constellation ».La« StoringRegion »enestdéduiteenfonctiondesbesoinsd’analyse.
2020-11-13
10
ARCHITECTURE INTERNEAPERÇU GLOBAL COMMUN
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
Jiang. 2015.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
11
VUE DÉTAILLÉE DE LA« STAGING REGION »
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
Jiang. 2015.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
¢ Labase¢ Principepremier¢ Tablesdimensionnelles¢ Tablesfactuelles
¢ Évolutivité¢ Exemples
2020-11-13
15
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
DIMENSIONALITÉ
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
� 8]VeiZg�+� ��BdgZ�dc�9^bZch^dc�IVWaZh� � �''+
FWhj�???
Less familiar are implicit relationships, which occur when two attributes are located in the same table. Implicit relationships imply a natural affinity between attributes, rather than a relationship that can take many contexts. These relationships tend to be more consistent, and they are browsable.
;nfb_Y_j�H[bWj_edi^_fi�:[iYh_X[�9edj[njAs you have already learned, every fact table bears foreign key references to dimension tables. These references provide the dimensional context for the facts. These joins can also be thought of as providing information about relationships between the dimension tables. The business process measured by the fact table is the context for this relationship.
Once again, we will turn to the orders process for an example. Figure 6-1 reprises the orders star that was introduced in Chapter 1 and embellished in Chapter 3. The grain of the fact table is the individual order line. Dimension tables represent the day of an order,
<_]kh[�,#'� 6�[VXi�iVWaZ�Zmea^X^ian�gZaViZh�Y^bZch^dc�iVWaZh
full_dateday_key
DAY
day_of_week_nameday_of_week_abbrday_of_monthholiday_flagweekday_flagweekend_flagmonth_numbermonth_namemonth_abbr
quarter_month
year_monthyear_quarterfiscal_periodfiscal_yearfiscal_year_period
year
quarter
day_of_week_number
CUSTOMER
customer_keycustomerheadquarters_statebilling addressbilling_citybilling_statebilling_zipsic_codeindustry_name
SALESPERSON
salesperson_keysalespersonsalesperson_idterritoryterritory_codeterritory_managerregionregion_coderegion_vp
ORDER_INFO
order_info_keyorder_type_codeorder_type_descriptioncredit_flagreorder_flagsolicited_flaginitial_order_flag
ORDER_FACTS
product_keysalesperson_keyday_keycustomer_keyorder_info_keyquantity_orderedorder_dollarscost_dollarsmargin_dollarsorder_idorder_line_id
PRODUCT
product_keyproduct
brandbrand_codebrand_manager
category_codecategory
skuunit_of_measure
product_description
2020-11-13
16
DIMENSIONNALITÉEXEMPLE D’ADAMSON
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
17
DIMENSIONNALITÉLA BASE (COMMUNE À L’ÉTOILE ET AU FLOCON)¢Unévènementestunique,seproduitàunmomentprécis,metencausedesentités(appeléesdimensions)etestcaractérisépardesmesures(appelésfaits).
¢Lesfaitssontminimalementagrégeables(souvent« additifs »).
¢Lesdimensionssontdéterminantes(ellesdéterminentdesrestrictions).
¢Principedelacléprimaireartificielle.¢Principedel’universalitédesdimensions(UD)etdesfaits(UF)enrapportauxévènementsdécritsparunemême(valeurde)relation« étoilée ».
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
18
DIMENSIONNALITÉUN CONSTAT
¢1.Corolaire:lesfaitsnesontpasannulables.¢2.Rappel:lesclésnesontpasannulables.¢3.Conclusion(1+2):
� aucun attributd’unetabledefaits(TF)n’estannulable.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
¢ Relationévènementielle
¢ Tablefactuelle
¢ Tabledefaits
¢ Relationdimensionnelle
¢ Tabledimensionnelle
¢ Dimension
Évènement Dimension
2020-11-13
19
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
DIMENSIONNALITÉQUELQUES SYNONYMES
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
20
DIMENSIONNALITÉPRINCIPE PREMIER¢ Principesimpliste :« Unetablefactuelleparprocessus ».¢ Dangersduprincipesimpliste
� incompatibilitédespériodes;� indépendancedesréférentiels.
¢ Principeefficace :« Deuxfaitssontdanslamêmetable,si,etseulements’ilsontmêmegranularitéetmêmesynchronicité. »
¢ Remarque� Granularitéetsynchronicité sontindépendants.
¢ Ilendécouletoutdemêmeunlienétroitentretablesfactuelles(TF)etprocessus:� uneTFesttoutentièredéfinieparunseulprocessus;� unprocessuspeutdéterminerplusieursTF.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
¢ Générationobligée denouvelles clésartificielles� pourquoi?
¢ Regroupementdesdimensions� paraffinité� leslaissés-pour-compte(junk tables)� pourquoiminimiserlenombredetables?� tablesassociatives,auxiliairesetautres?
¢ Richessereprésentative� codeetlibellé� permutations� variantes
¢ Joies(etpeines)delaredondance� pourquoilimiterlesrallonges(outriggers)?� pourquoipasdesvues?
2020-11-13
21
TABLESDIMENSIONNELLES
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
Unetablefactuellespécialisée
Lescatégorisationsetlesreprésentationssontpartagéesautantquepossibleentrelesdifférentetablestemporelles,maislesclés(granularitédebase)sontindépendantes(postulatinitial).
¢Rôle:� fixerlagranularitédelatabledefaits
� établirlesdifférentescatégorisationsadmises
� réunirlesdifférentesreprésentationsutilisées
¢D’autressolutionsexistent,fondéessurdesmodèlesunificateurs:� BCDM� DDLM� AV� ...
2020-11-13
22
TABLE TEMPORELLEBD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—
LucLavoie
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
¢Clé:� base:{dim1,dim2,...,dimn,dimT}
� discriminationsupplémentaire?� générationd’unecléartificiellesynthétique?
¢GranularitéreprésentéepardimT¢Synchronisme(co-occurence temporelle)
¢Densité(sparsity)¢Dimensionsdégénérées¢Additivitéetagrégeabilité
2020-11-13
23
TABLES FACTUELLESBD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—
LucLavoie
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
Unemodificationdetype1estoccasionneuneperteirrémédiabled’information.Àmoyenetlongterme,elleestunesourced’erreursetunecausededistorsiondumodèle.Elledoitdoncêtreévitée...àtoutprix.
¢Évolutiondesdimensions:Permise� Type1(récriture) – fortementdéconseillé
¢ maintiendutuple,modificationdelavaleurdel’attribut.
� Type2(historique) – recommandée¢ insertiond’unnouveautupleaveclesnouvellesvaleursd’attribut.
¢Évolutiondesfaits:Interdite� ...saufsielleestmodéliséecommeunetablebitemporelle!
2020-11-13
24
ÉVOLUTIVITÉBD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—
LucLavoie
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
� 8]VeiZg�(� ��HiVgh�VcY�8jWZh� � �))
FWhj�?
9ecced�9ecX_dWj_ediIn operational systems, it is common practice to break data elements down to constituent parts whenever possible. From these components, it is possible to construct any combinations that may be needed. For example, customer name may be broken down and stored as a first name, middle initial, and last name. These attributes can be combined to produce a full name, if needed.
These components have analytic value and, of course, will be included in a dimensional design. Unlike the operational schema, however, the dimensional schema should also include dimensions that represent common combinations of these elements. For example, if a large number of reports group information by full name, or in a last-name-first format, then these common usages should also appear as discrete dimension columns. This principle can be seen at work in Figure 3-2.
The customer dimension in Figure 3-2 shows how the three components of a customer name, such as John P. Smith, may be used to construct five attributes in the customer dimension table. Because each of these attributes is made available in the dimensional
<_]kh[�)#(� 8dchigjXi^c\�V�g^X]�hZi�d[�Y^bZch^dc�Viig^WjiZh
cust_name_firstcust_name_middlecust_name_last
ORDER_HEADER
Source table Dimension table
attributeattribute
CUSTOMER
customer_keyattributeattribute
attributeattributeattribute
attributeattributeattributeattributeattributeattribute
attributeattributeattribute
attributeattribute. . .
region_codecountry_codeterritory_code
territory_namecountry_name
credit_order_flag
first_namemiddle_initial
customer_type_codecustomer_type_name
last_namefull_namefull_name_last_firsttype_code
attributeattribute
JohnP.Smith
002
01-701
N
JohnP.SmithJohn P. SmithSmith, John P.
002indirect
07-70107701United StatesEast
Not Credit Order
attributeattributeattribute
attributeattribute
. . .region_code
credit_order_flag
2020-11-13
25
EXEMPLE DE« RICHESSE »
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
S’agit-ildedimensionsdégénérées...oudediscriminantssupplémentaires?
ORDERvsORDER_INFO
**� � �E6GI�>� ���;jcYVbZciVah
Ibembo�9^Wd]_d]�:_c[di_ediThe data in dimension tables originates in operational systems. In a dimensional data warehouse or stand-alone data mart, it comes directly from the operational system. In a Corporate Information Factory, it is first moved to the enterprise data warehouse, and then to a dimensional data mart. Once information makes its way into a dimension table, it may change in the operational source. For example, a customer’s date of birth is updated to correct an error, or a customer’s address is updated when they move.
Because the downstream star schema uses a surrogate key as the primary key of each dimension table, it does not have to handle changes the same way the source does. The operational system may track the change history of each data element, or it may simply overwrite with the changed values. Regardless, the star schema can respond to each change in whatever manner makes most sense for measuring the overall business process.
In every dimensional design, it is crucial to identify how changes in source data will be represented in dimension tables. This phenomenon is referred to as slowly changing dimensions. This term gets its name from the relatively slow rate at which dimensions accumulate changes,
<_]kh[�)#+� 9Z\ZcZgViZ�Y^bZch^dch�YZÒcZ�i]Z�\gV^c�d[�i]^h�[VXi�iVWaZ
PRODUCT
product_keyproductproduct_descriptionskuunit_of_measurebrandbrand_codebrand_managercategorycategory_code
DAY
day_keyfull_dateday_of_week_numberday_of_week_nameday_of_week_abbrday_of_monthholiday_flagweekday_flagweekend_flagmonth_numbermonth_namemonth_abbrquarterquarter_monthyearyear_monthyear_quarterfiscal_periodfiscal_yearfiscal_year_period
ORDER_INFO
order_info_keyorder_type_codeorder_type_descriptioncredit_flagreorder_flagsolicited_flaginitial_order_flag
SALESPERSON
salesperson_keysalespersonsalesperson_idterritoryterritory_codeterritory_managerregionregion_coderegion_vpORDER_FACTS
product_keysalesperson_keyday_keycustomer_keyorder_info_key
quantity_orderedorder_dollars cost_dollars margin_dollars
order_idorder_line id
CUSTOMER
customer_keycustomercustomer_idheadquarters_statebilling addressbilling_citybilling_statebilling_zipsic_codeindustry_name
Degeneratedimensions
2020-11-13
26
EXEMPLE DEDIMENSIONSDÉGÉNÉRÉES BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—
LucLavoie
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec'+.� � �E6GI�>>>� ���9^bZch^dc�9Zh^\c�
7le_Z_d]�j^[�Idem\bWa[Snowflaking a dimension is similar to a process called normalization, which guides the design of operational systems. This technique was developed to ensure referential integrity of data in operational systems, which support a wide variety of simultaneous transactions that are highly granular. An analytic database does not share this usage pattern, and referential integrity can be enforced by the ETL process. Normalization is therefore not necessary. In fact, modeling the relationships between dimension attributes detracts from usability, complicates ETL, and may even disrupt performance. That said, there are some good reasons you may wish to model a snowflake, as you will see later in this chapter.
DehcWb_pWj_ed�?i�Ki[\kb�_d�Wd�Ef[hWj_edWb�I[jj_d]Entity-relationship (ER) modeling is often used to design databases that support operational systems, or OLTP (online transaction processing) systems. This form of modeling places a heavy emphasis on capturing the relationships between attributes in much the same way a snowflake does. Through the process of normalization, redundancy is systematically driven out of the data model. Repeating groups are moved to their own tables, and designers ensure
<_]kh[�-#+� 6�hcdlÓV`Z�hX]ZbV
order_line_number
extended_costmargin_dollarsquantity_ordered
order_dollars
day_keysalesrep_keycustomer_key
product_key
salesrep_key
hire_dateoffice_locationsalesperson_codesalesperson_nameterritory_key
order_number
CATEGORY
category_key
category_codecategory_name
FISCAL_YEAR
fiscal_year_keyfiscal_year
YEAR
year_keyyear
FISCAL_PERIOD
fiscal_period_keyfiscal_year_keyperiodfy_and_period
QUARTER
quarter_keyyear_keyquarteryear_and_quarter
MONTH
month_keyfiscal_period_keyquarter_keymonth_numbermonth_namemonth_and_year
DAY
day_keymonth_keydateday_of_week_numday_of_week_name
holiday_flagday_of_month
BRAND
brand_key
brand_namebrand_codebrand_manager
category_key PRODUCT
ORDER_FACTS
SALESREP territory_key
territory_codeterritory_nameregion_key
TERRITORY
region_key
region_vpregion_coderegion_name
REGION
customer_key
primary_industrycustomer_idcustomer_name
billing_address
billing_zipphone
billing_statebilling_city
CUSTOMER
product_key
product_namesku
brand_key
2020-11-13
27
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
EXEMPLE DE RALLONGES (OUTRIGGERS)
Lagénéralisationdesrallongesconduitàunevariantedel’étoile:leflocon.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
� 8]VeiZg�(� ��HiVgh�VcY�8jWZh� � �*'
FWhj�?
The margin rate is the percentage of the sale that represents profit, and it is closely monitored by management. In this table, each individual transaction has a margin rate of 3.02 percent. It is not possible to summarize the margin rate for these transactions by adding them together. This would produce a margin rate of over 27 percent, which is clearly incorrect.
Luckily, there is a solution. Ratios can be broken down into underlying components that are additive. In this case, the margin rate is the ratio of the margin dollars to order dollars. These components are fully additive. They can be stored in a fact table and safely aggregated to any level of detail within a query or report. An example report is shown in Figure 3-4. The nonadditive fact margin_rate is not stored in the fact table; it is computed as the ratio of the sums of margin dollars and order dollars. This computation may be done in a query or by additional processing logic in the reporting tool. Care must be taken with subtotals and grand totals in the report; the margin rate in these rows must be computed as the ratio of the subtotals for margin dollars and order dollars.
While nonadditive facts are not stored in fact tables, it is important not to lose track of them. For many processes, ratios are critical measurements without which a solution would leave much to be desired. Nonadditive facts should be documented as part of the schema design, as described in Chapter 18, “How to Design and Document a Dimensional Model.”
<_]kh[�)#*� CdcVYY^i^kZ�[VXih�VgZ�XdbejiZY�Vh�i]Z�gVi^d�d[�VYY^i^kZ�[VXih
Margin Report
Salesperson Customer
Jones
Baldwin
Sebenik
Sgamma
Balter Inc.Raytech
Venerable HoldingseMart LLCShatter & Lose
Subtotal :
Subtotal :
Subtotal :
Subtotal :
Comstock RealtyRizSpaceStarcomp
Implosion TownDemiSpace
MarginDollars
192.7439.05
121.50253.44
8.74
231.79
383.68
160.52
327.80
1,103.80
12.0658.1090.36
213.88113.92
MarginRate
3.02%3.02%
3.02%3.02%3.02%
3.02%
3.02%
3.02%
3.02%
3.02%
3.02%3.02%3.02%
3.02%3.02%
OrderDollars
Margin Rate is anonadditive fact.
Summary row iscomputed as a ratioof the subtotals,not by summingmargin rates forthe salesperson.
6,382.211,293.11
4,023.228,392.00
289.54
7,675.32
12,704.76
5,315,33
10,854.33
36,549.74
399.291,923.932,992.11
7,082.223,772.11
Grand Total:
Date: January 1, 2009Product: Gel Pen Red
2020-11-13
28
EXEMPLE NON ADDITIFBD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—
LucLavoie
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
SELECTproduct_key,SUM(quantity_shipped)
FROMsales_facts
GROUPBYproduct_key
HAVINGSUM(quantity_shipped)>0
Cecientraine unecomplexitéinsidieuseetcroissante(boiling thefrog).
Fusionnerlesfaitsdansunseulfaitgénéralenajoutantunedimensiondiscriminanten’estpasunesolution.
Lameilleuresolutiondemeuredeconstruiredeuxtablesdefaits.
� 8]VeiZg�)� ��6�;VXi�IVWaZ�[dg�:VX]�EgdXZhh�� � �,)
FWhj�??
To understand how this happens, it is useful to study an example. Presume that business requirements for the sales department have been identified and are expressed as follows:
Analyze r� Quantity Ordered by Date, Customer, and Product
Analyze s� Quantity Shipped by Date, Customer, and Product
These statements identify measurements (facts) in bold text, and context (dimensions) in bold italic text. Each expresses the lowest level of detail at which the measurement will be studied.
Although they share the same dimensionality, the two measurements do not occur at the same time. For example, on days when a customer orders a product, there may not be shipments. Quantity Ordered and Quantity Shipped do not describe the same process.
7�I_d]b[�<WYj�JWXb[�9Wki[i�:_\\_Ykbj_[iThe star schema at the top of Figure 4-1 attempts to address the requirements using a single fact table: sales_facts. It contains the facts quantity_ordered and quantity_shipped. Attributes of the associated dimension tables have been omitted from the illustration. The grain of this fact table requires that orders, shipments, or both be recorded by day, product, and customer. This might also be stated as “shipments and/or orders.” The presence of “and/or” in a statement of grain is usually a sign of problems to come, as you will see shortly.
<_]kh[�*#'� ;VXih�l^i]�Y^[[ZgZci�i^b^c\�^c�V�h^c\aZ�iVWaZ
DAY
SALES_FACTSCUSTOMER
PRODUCT
day_keycustomer_keyproduct_keyquantity_orderedquantity_shipped
SALES_FACTS
day_keycustomer_
keyproduct_
keyquantity_ordered
quantity_shipped
123
123
456
456
789
123
777
777
777
777
777
777
222
333
111
222
222
111 100
200
50
0
0
0
0
0
0
100
75
125
These zeros will cause trouble
2020-11-13
29
EXEMPLE NONSYNCHRONE
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
¢ Drillacross¢ Hiérarchisationet« conformeddimensions »¢ Drillthrough¢ Skipdrill
2020-11-13
31
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
TECHNIQUES DIMENSIONNELLES
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
32
« DRILL ACROSS »LE PROBLÈME :JOINTURE DES TABLES DE FAITS
� 8]VeiZg�)� ��6�;VXi�IVWaZ�[dg�:VX]�EgdXZhh�� � �-)
FWhj�??
J?F� Never attempt to join to two fact tables, either directly or through a common dimension. This can produce inaccurate results.
The result of this Cartesian effect is evident in the report at the bottom portion of Figure 4-10. The query attempts to select total orders and total shipments and group them by product. Since product 222 has one order and two shipments within the scope of aggregation, the order is double counted. The resulting report incorrectly shows 400 units ordered.
Also notice that product 333 does not show up in the report at all. Although it was ordered, there were no corresponding shipments. The RDBMS was therefore not able to join the order to corresponding shipments. The SQL-literate reader may suggest substituting outer joins to the dimension tables; however, this will not solve the problem as the query has been qualified within the day dimension.
:h_bb_d]�7YheiiThe proper way to compare two processes is called drilling across. This term has a tendency to cause confusion. Although the word “drill” is used, this process is unrelated to the drill-up, drill-down, or drill-through capabilities of many query and reporting tools. Instead, the term is meant to describe crossing multiple processes. While it is common to speak of a
<_]kh[�*#'&� ?d^c^c\�ild�[VXi�iVWaZh�aZVYh�id�igdjWaZ
123
123
123
777 222 200
50777 333
777 111 100
ORDER_FACTS
456
456
789
777 222 75
125777 222
777 111 100
SHIPMENT_FACTS
day_keycustomer_
keyproduct_
keyquantity_
key day_keycustomer_
keyproduct_
keyquantity_shipped
SELECT
FROM
WHERE
product,day,order_facts,shipment_facts
order_facts.product_key = product.product_key ANDorder_facts.day_key = day.day_key ANDshipment_facts.product_key = product.product_key ANDshipment_facts.day_key = day.day_key AND...additional qualifications on date...
product.productGROUP BY
product.product,SUM( order_facts.quantity ordered ),SUM( shipment_facts.quantity_shipped )
The orderfor product 222 isdouble counted product
Product 111
Product 222
100
400
100
200
----------- ---------------- -----------------
sum(quantity_ordered)
sum(quantity_shipped)
The orderfor product 333 does
not appear
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
Lesrésultatsobtenussontcorrects,maisincomplets
� 8]VeiZg�)� ��6�;VXi�IVWaZ�[dg�:VX]�EgdXZhh�� � �-+
FWhj�??
<_]kh[�*#''� 9g^aa^c\�VXgdhh�dgYZgh�VcY�h]^ebZcih
123
123
123
777
777
777
222 200
50333
111 100
ORDER_FACTS
day_keycustomer_
keyproduct_
keyquantity_ordered
456
456
789
777
777
777
222 75
125222
111 100
SHIPMENT_FACTS
day_keycustomer_
keyproduct_
keyquantity_shipped
Orders Query Shipments Query
product
Product 111
Product 222
Product 333
===========quantity ordered
100
200
50
================ product
Product 111
Product 222
===========quantity shipped
100
200
================
Merge on commondimensional attribute
(product),and compute ratio
product===========
Product 111
Product 222
Product 333
quantity ordered================
100
200
50
quantity shipped================
100
200
ratio======
100%
100%
0%
2020-11-13
33
« DRILL ACROSS »L’EXEMPLE
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
Étapes1. Traiterséparément2. Fusionner3. Compléter4. Épurer
Lafusionpeutprendredifférentesformes(jointureinterneouexterne)selonqu’ilyadescalculsounonsurlesattributsagrégés.
Neregrettez-vouspasl’absencedesopérateursdesemi-jointure(matching)etsemi-différence(notmatching)enSQL?
-,� � �E6GI�>>� ���Bjai^eaZ�HiVgh
J?F� Cross-process ratios are powerful measurements. Because they do not correspond to a single column, or even a single table, they are often lost in the metadata collected by design tools and reporting software. These ratios should be documented at design time, highlighting the interaction between business processes that can be supported by the star schemas.
In principle, there is no reason that each of the fact tables participating in a drill-across report must reside in the same database. The queries in Phase 1 can take place on different database instances, or even databases from different vendors. In Figure 4-13, each of the four stars resides in a different database, from a different vendor. The intermediate result sets are then combined to produce the final report.
For all of this to work, it is important that the common dimensions be the same in each database, both in terms of structure and content. In terms of structure, their presence in each star allows the common dimensions to be retrieved by each phase 1 query. In terms of content, the identical representation of dimension values enables merging of the intermediate results during Phase 2.
For example, in Figure 4-13, the period and region dimensions are present in all stars. This allows each Phase 1 query to aggregate information by period and region. The content of these dimensions is represented identically as well. In all databases, periods are specified as “Q1” through “Q4” and regions as “North,” “South,” and so forth. Because the content values are identical, they can be used to join the result sets together in Phase 2. If one or more of the databases had different region designations, or specified periods as “First Quarter” through “Fourth Quarter,” it would not be possible to complete the merge.
<_]kh[�*#'(� 9g^aa^c\�VXgdhh
Fact A Fact B
Dim 1
Dim 2
Dim 3
Dim 1
Dim 2
Dim 3
Query AResults
Query BResults
FinalResults
Issue a separate queryfor each fact table
Phase 1:
Qualify each query asneeded
–
Get same dimensions ineach query
–
Summarize facts by chosendimensions
–
Combine the result setsPhase 2:
Perform a full outer joinbased on commondimensions
–
Compute comparisons orratios of facts if desired
–
2020-11-13
34
« DRILL ACROSS »LA RECETTE
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
35
« DRILL ACROSS »LE CODE (POUR DEUX « ÉTOILES »AET B)SELECTCOALESCE(A.clé,B.clé,...)ASclé,A.quantité1,...,A.quantiténA,B.quantité1,...,B.quantiténB,f1(A.quantité1,...,A.quantiténA,B.quantité1,...,B.quantiténB)ASrésultat1, ...fnR(A.quantité1,...,A.quantiténA,B.quantité1,...,B.quantiténB)ASrésultatnR
FROM(SELECTA.clé,agg1(att1)ASquantité1,...,aggnA (attnA)ASquantiténA,...
FROMFAJOINDA1 ON(FA.dim1=DA1.clé)...JOINDAnA (FA.dimnA=DAnA.clé)
WHERE...autresrestrictions,e.a.surlesdates...
)ASAFULLOUTERJOIN(...)ASB
ONA.clé =B.clé
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec
¢La« recette »nepeutêtreappliquéeaveuglément,elledoitaussirespectercertainescontraintes� conformitédesentitésmisesencause;
� silerésultatdudrillacross doitêtreutilisécomme« tabledefaitsynthétique »,deuxautrescontraintes sontapplicables¢ granularitéidentique¢ synchronicité
¢Pourlesexprimer,nousavonsbesoindesconceptsdehiérarchisationdesentitésetdecelledesprocessus
2020-11-13
37
« DRILL ACROSS »LES CONTRAINTES
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
38
TECHNIQUESHIÉRARCHISATION PAR ENTITÉS
'+*� � �E6GI�>>>� ���9^bZch^dc�9Zh^\c�
?d\ehcWj_ed�je�=Wj^[hAttribute hierarchies are best documented graphically. The format introduced in Figure 7-2 is a good starting point. It includes important information about the product attribute hierarchies, such as names for each level of the attribute hierarchy, attributes present at each level, and the one-to-many relationships between instances of each level. One key piece of information is missing: what attribute should be shown when someone drills into a new level in the hierarchy? For example, when users drill from category to brand, should they see brand codes or brand names? This information may be important if the hierarchy will be used to configure a drilling tool. A target attribute for each level in an attribute hierarchy can be indicated using bold text.
The diagram in Figure 7-3 illustrates attribute hierarchies in several dimensions. Boxes denote the levels of each hierarchy. They are arranged along a vertical axis from most highly summarized to most detailed. Crows-feet are used to indicate the one-to-many relationships between levels. The box for each level has a name and includes a list of attributes for the level. The target attribute for drilling is shown in bold.
Notice this diagram incorporates multiple hierarchies within the day dimension; both hierarchies begin and end with days and years but have differing levels in between. It is also possible to diagram each alternative hierarchy separately. This may be necessary if the same attribute appears at different levels in different hierarchies.
<_]kh[�-#)� 9dXjbZci^c\�Viig^WjiZ�]^ZgVgX]^Zh
DayDimension
ALL DAYS
ProductDimension
ALL PRODUCTS
CATEGORY
category_codecategory_name
BRAND
brand_codebrand_namebrand_manager
PRODUCT
skuproduct_name
SalespersonDimension
ALL SALESREPS
REGION
region_coderegion_vpregion_name
TERRITORY
territory_codeterritory_name
SALESREP
salesperson_codesalesperson_nameoffice_locationhire_date
DAY
full_dateday_of_week_numberday_of_week_nameday_of_monthholiday_flag
CustomerDimension
ALL CUSTOMERS
CUSTOMER
customer_idcustomer_nameprimary_industrybilling_address
MONTHmonth_and_year
QUARTER
year_and_quarterquarter
FISCAL_PERIOD
fy_and_periodperiod
month_namemonth_number
YEAR
year
FISCAL_YEAR
fiscal_year
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
39
TECHNIQUESCONSTRUCTION DE CUBES
'+,� � �E6GI�>>>� ���9^bZch^dc�9Zh^\c�
6\\gZ\ViZ�9Zh^\c�VcY�BVcV\ZbZci� Documentation of hierarchies among dimension attributes can also help with the design of aggregate stars. Discussed in Chapter 15, “Aggregates,” aggregates summarize information in a base star in order to improve query performance. For example, a base star may contain orders by order line, with dimensions for day, product, salesperson, and customer. An aggregate of this star may summarize orders by brand and month. When the business question does not require order-line-level detail, the aggregate star may be used to answer the question. Since it has significantly less data, it will provide a faster answer.
An aggregate star can be planned by identifying the level of summarization it will provide in each dimension. This definition can be documented by graphically indicating the level of summarization along hierarchies in each dimension. This is similar to planning a cube and can also be accomplished using a diagram like the one in Figure 7-4.
Some database products are able to generate and maintain aggregate structures automatically, in much the same way cubes are generated by some multidimensional database tools. Dimensional hierarchies are defined as part of the table definitions, or metadata. This information is then referred to in the definition of aggregate structures and leveraged by the tool to generate and maintain them.
9Wl[Wj�M^[d�:eYkc[dj_d]�>_[hWhY^_[iIf you choose to document attribute hierarchies, remember that they represent some, but not all, possibilities for drilling, conformance, cube design, and aggregate design. You have
<_]kh[�-#*� Jh^c\�V�]^ZgVgX]n�Y^V\gVb�id�YZh^\c�V�XjWZ�dg�V\\gZ\ViZ
Order Facts by:
ALL PRODUCTS
CATEGORY
BRAND
PRODUCT
ALL SALESREPS
REGION
TERRITORY
SALESREP
ALL CUSTOMERS
CUSTOMER
ALL DAYS
DAY
MONTH
QUARTER FISCAL_PERIOD
YEAR FISCAL_YEAR
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
40
TECHNIQUESHIÉRARCHISATION PAR PROCESSUS
¢Lesprocessuspouvantêtrehiérarchisés,ilesttentantdefairedemêmepourlestablesdefaitspourobtenirdes« tablesdefaitssynthétiques ».
¢Certainesprécautionsdoiventtoutefoisêtreprises� lamiseenrelationdedeuxtablesdefaits,mêmesiellessontenrelationdesous-processus,doitpasserparundrillacross;
� unteldrillacross doitdoncrépondreàdescontraintessupplémentairesdegranularitéetdesynchronicité enplusdecelledeconformitédesdimensions.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
41
TECHNIQUESBRIDGING
¢Parcequelesrelationsnxn...çaexiste!¢Parexemple
� multidignostic¢ Ilfautdoncadapter(généraliser)le« drillacross »àntablesdefait.
¢VoirexempledansAdamson.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
44
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
LES CONSEILS D’ADAMSON
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
45
DIMENSIONNALITÉLES CONSEILS D’ADAMSON (1)¢ TIP-031
Assigneachdimensiontableasurrogatekey.Thissingle
columnwillbeusedtouniquelyidentifyeachrowinthe
table.
¢ TIP-032
Providearichandcomprehensivesetofdimension
attributes.Eachnewattributedramaticallyincreasesthe
numberofanalyticpossibilities.
¢ TIP-035
Itisnotalwaysclearwhetheranumericdataelementis
afactoradimension.Whenindoubt,paycloseattention
tohowitwillbeused.Iftheelementvaluesareusedto
filterqueries,orderdata,controlaggregation,ordrive
master–detailrelationships,itismostlikelyadimension.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
46
DIMENSIONNALITÉLES CONSEILS D’ADAMSON (2)
¢TIP-037Donotusetheprinciplesofnormalizationtoguidedimensiontabledesign.Analyticdatabasesdonotbenefitfromthesetechniques.Situationsthatcallforsnowflakesandoutriggersaretheexceptionratherthantherule.
¢TIP-043Avoidoverusingdegeneratedimensions.Ifanattributeisnotatransactionidentifier,considerplacingitinajunkdimensioninstead.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
47
DIMENSIONNALITÉLES CONSEILS D’ADAMSON (3)
¢TIP-048Usetype1changescarefully.Theyrestatethecontextforassociatedfacts.Confusioncanbeminimizedbyeducatingsystemsanalystsandbusinessusers.
¢TIP-052Foreachdimensionattribute,chooseanddocumenttheappropriateslowchangeresponse.Ifyouareuncertain,thetype2responseissafest.Whenasourcesystemcapturesthereasonforachange,asingleattributemaydriveeithertypeofresponse.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
48
DIMENSIONNALITÉLES CONSEILS D’ADAMSON (4)
¢TIP-073Neverattempttojointwofacttables,eitherdirectlyorthroughacommondimension.Thiscanproduceinaccurateresults.
¢TIP-081Whenavailabletoolscannotdrillacross,orwhendrill-acrossreportssufferfrompoorperformance,designandbuildamergedfacttablethatsummarizestheprocessesatacommonlevelofdetail.Thisderivedtableperformsthedrill-acrossoperationwhenthewarehousetablesareloaded,insteadofperformingitatquerytime.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
49
DIMENSIONNALITÉLES CONSEILS D’ADAMSON (5)
¢TIP-118Whentwodimensionattributesshareanaturalaffinity,andareonlyrelatedinonecontext,theybelonginthesamedimensiontable.Whentheirrelationshipsaredeterminedbytransactionsoractivities,andtheycanoccurinmultiplecontexts,theyshouldbeplacedinseparatedimensiontables.
¢REC-122RelocateFree-FormTextFieldstoanoutrigger.Excessiverowlengthisoftenaresultoftheinclusionofseveralfree-formtextfieldsinthedimensiontable.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
50
DIMENSIONNALITÉLES CONSEILS D’ADAMSON (6)
¢TIP-130Whenafacttableanddimensiontablehavemultiplerelationships,itisnotnecessarytobuildmultiplecopiesofthedimension.Eachrolecanbeaccessedbyjoiningviewsoraliasesofthedimensiontotheappropriateforeignkeysinthefacttable.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
51
DIMENSIONNALITÉLES CONSEILS D’ADAMSON (7)¢ TIP-134
DonotallowthestorageofNULLsindimensioncolumns.
Instead,chooseavaluethatwillbeusedwheneverdataisnotavailable.
¢ TIP-136AvoidallowingNULLvaluesinforeignkeycolumns.They
requirealternativejoinsyntaxandcreateNULLinstance
valuesfordimensioncolumnsevenwhenNULLsarenotstored.
¢ TIP-140
Special-caserowscanbeaddedtodimensionstodeal
withincorrectormissinginformation.Thisavoidsthe
needtoexcludefactsfromthewarehouse.Thestarshouldrecordsufficienttransactionidentifierstoallow
theanomalousrecordtobeidentifiedandcorrectedin
thefuture.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
52
QUESTIONS OUVERTESTEMPORALISATION
¢Laconceptiondestablesdefaitssembleconduirenaturellementàunetemporalisation« partuple »excluantletype2,d’oùlarecommandation(forte)denepastemporaliser,carl’utilisationtype1invalideraittouslesrapportsantérieurs.
¢Enconclusion,silapossibilitédecorrigerlesfaits(doncleurtemporalisation)estnécessaire,ladimensionnalitén’apportepasdesolution.
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
53
QUESTIONS OUVERTESLE RETOUR DES JOINTURES¢ Ilendécoulequeplusieursrequêtesdevrontfaireappelàdesjointures:� pourlierdesfaitsdeTFdifférentes« drilling across »;
� pourlierdesdimensionsdifférentes« (dimension)merging »;
� pourréduirelepoidsdedimensionsobèses,lahiérarchisationetenparticulierlesflocons(« snowflake »);
� engénéral,pourtraiterlesrequêtesnonanticipées.¢Pourquoidoncavoirtoutfaitpourlesexclureaudépart?
Départementd’informatique,Facultédessciences,UniversitédeSherbrooke,Québec2020-11-13
54
BD301:Entrepôtsdedonnées-Architectureetdimensionnalité(v110b)—LucLavoie
PRINCIPALES RÉFÉRENCES
¢ ADAMSON,C. 2010.Thecomplete reference starschema.McGraw-Hill,NewYork,NY,USA.
¢ ADAMSON,C. 2008-2015.http://blog.oaktonsoftware.com
¢ INMON,W.H. 2005.Buildingthedatawarehouse.JohnWiley,Indianapolis,IN,USA.
¢ JIANG,B.2015.Constructing DataWharehouses with Metadriven GenericOperators,andmore.2nded.,Createspace.
¢ KIMBALL,R. 2013.Thedatawarehouse toolkit:thedefinitive guidetodimensional modeling.JohnWiley,Indianapolis,IN,USA.