Business Intelligence & Data Warehousing Tom A. Fürstenberg Business Intelligence Consultant Cap Gemini Ernst & Young 1 Leerdoelen college • Wat is BI & DWH? (Conceptueel en Technisch) • Toepassing van BI & DWH • De praktijk van een consultant iha en bij Cap Gemini Ernst & Young ihb 2 Inhoud College • • • • • • • • • • Performance Management Business Intelligence (Performance Measurement) OLAP Extranets Architectuur Data Warehouse ETL Multidimensioneel Modelleren CGE&Y Aanpak 3 Data Mining Performance Management Doelgericht meten en bijsturen van bedrijfsdoelstellingen 4 In control of a company 5 Overview Strategie & Missie Conceptueel Besturingsmodel Verantwoordelijkheden & Bevoegdheden Doelen Middelen Randvoorwaarden Operationeel Besturingsmodel Besturingssystematiek Key Performance Indicators Critical Succes Indicators InformatieExternal Indicators voorziening Informatiemodel Informatiesysteem Datawarehouse 6 data data data Besturings visie: Bouwstenen voor besturing van organisaties Wie? Organisatie Methoden Systemen Hoe? Doelstellingen & Prestatie indicatoren Wat? Strategie &Missie Stake holders Waarden & normen 7 Waarom ? Methoden Diverse Financiële modellen Balanced Scorecard Beleid & strategie Middelenmngt. 8 Processen Leiderschap INK managementmodel waardering door klanten waardering door maatschappij Ondernemings resultaten waardering door personeel Personeelsbeleid Naar een operationeel Besturingsmodel CSI CSI CSI CSI KPI OI OI Na het vaststellen van de Doelen en KPI’s Worden de Critische Succes Indicatoren bepaald Gevolgd door het vaststellen van de Omgevings Indicatoren 9 OI OI CSI CSI CSI Naar een operationeel Besturingsmodel KPI KPI KPI KPI Tijd Product Markt Afdeling 10 Regio en Bevoegdheden Multi-dimensionale OI OI CSI CSI Verantwoordelijkheden CSI Van Model naar Gedragsverandering Gegevensstructuur KPI KPI KPI KPI Operationeel Besturingsmodel Management Charter Informatievoorziening Planning en Commitment Beoordeling en Sturing 11 Some Typical Mgt. Questions PRODUCT • How much have we sold? • Which product gives the best profit? • Which product has the largest sales volume this quarter? • Which product best meets market needs? • How much to produce of each product? CUSTOMER • Who is the most profitable customer? • What is the satisfaction level? • Which are the best segments? • Which service to improve? • How many customers have we lost last year? • Who are our biggest accounts? CHANNEL MARKETING • Which retailer yields most by volume and which by profit? • What promotions will yield most profit? • What effect will discounts have on the turnover? • What are the area coverage levels? • How many contacted people became a customer? • Promotions’ results? • What is the competition doing? 12 Key Performance Indicators Top 10 Purchasing 1 2 3 4 5 6 7 8 9 10 # hits development of purchase prices stocklevel by item or product line reliability of suppliers; quality (zero-defects) financial position of suppliers reliability of suppliers; time to delivery reliability of suppliers; completeness purchasing department costs number of goods returned cash discounts accounts payable as a % of purchasing value Sales 1 2 3 4 5 6 7 8 9 10 Production 86 84 81 81 79 74 72 48 47 39 1 2 3 4 5 6 7 8 9 10 # hits sales volume or sales volume growth depreciation of accounts receivable number of buyer complaints quality of deliveries marketing expenses freight costs value of new orders accounts receivable as a % of turnover reliability of deliveries; time to delivery market growth Source: overhead costs direct costs of materials direct costs of labour factory overhead expenses quality finished product sick days overtime stock of raw materials production volume maintenance costs Finance 95 92 92 88 87 86 85 85 84 83 1 2 3 4 5 6 7 8 9 10 Earnings Before Interest & Taxes (EBIT) Gross margin Profit Before Tax (PBT) Gross investments working capital cash flow liquidity position Return on Capital Employed Return on Sales turnover time of goods in stock Results FIND! The Best benchmarkstudy conducted in 1997/1998 by Ernst & Young Consulting and VU. 103 industrial companies participated in the study. 13 # hits 94 93 92 91 91 91 90 90 89 89 # hits 104 103 102 102 101 94 92 77 75 75 En nu alleen nog even meten… Business Intelligence (performance measurement) 14 The Answers The information is there, but spread 15 everywhere! De praktijk... 16 Problemen • • • • • • • • • • • • (Over)belasting IT-afdeling (queries) Lange doorlooptijd rapport-’fabricage’ Hoge kosten aan manuren Databronnen moeilijk integreerbaar Niet-gestandaardiseerde rapporten Geen eenduidige definities Foutgevoelig Manipuleerbaar Afhankelijkheid van ‘schakels’ Discussies over verschillen in cijfers Beperkte analyse-mogelijkheden Verkeerde en te late interpretaties, 17 conclusies, beslissingen Een druk op de knop... 18 Van chaos... Naar structuur 19 Why now? Hype? Developments: Market Pull • Globalisation of markets • Individualisation of customers • Shorter life cycle of products • Information overload • Mergers Technology Push • Faster hardware • Cheaper disk capacity • Modern OLAP-tools • Any access: c/s, web, mobile 20 OnLine Analytical Processing • Gebaseerd op de syntax van managementinformatie vragen: <meetwaarde> per <dim1> per <dim2> per ... • KPI’s, CSI’s en OI’s zijn meetwaarden • Produkt, Regio, Klant, Tijd, etc. zijn dimensies (slice & dice) • Dimensies kennen hierachiën (drill down) 21 OLAP Product Manager’s View Regional Manager’s View Financial Manager’s View Ad Hoc View 22 Introduction to Cubes Location Atlanta Denver Detroit Grapes Cherries Melons Sales Sales Apples Pears Q1 Q2 Q3 Time 23 Q4 Product Demo • eFashion Case • BusinessObjects Demo 24 BusinessObjects: Semantic Layer 25 Any Access 26 Info- & Analysis-need at 3rd parties 27 e/m-Business Intelligence: Extranets SUPPLIERS extr anet Data Warehouse extra net extr anet CUSTOMERS 28 PARTNERS Extranet demo’s 29 Business Intelligence Theory 30 BI Definition Business Intelligence is the process of collection, cleansing, combining, consolidation, analysis, interpretation and communication of all internal and available external data, relevant for the decision making process in the organisation 31 BI Concept Business Value Decisions Action Analysis Integration Collection Knowledge Information Data 32 Feedback BI Systems Reporting & Query DSS, MIS and EIS OLAP Data Mining 33 The Five Functional Levels ‘bunch of reports’, ‘cube’ i.e. finding variables Complexity of the question Number of users reporting analysis querying standard reports Static unique ‘report’ or question exploring mining 34 i.e stat. analysis, testing a hypothesis Dynamic The Five Functional Levels Complexity of the question Number of users reporting 80 % analysis of all users Static/ Dynamic querying exploring mining 35 interactief Corporate Information Factory Any Source Applications External data Any Data L O A D M A N A G E M E N T Any Access Q U E R Y Data Marts Data Warehouse Operational Data Store 36 M A N A G E M E N T LAN/ WAN WWW Components of the CIF • • • • Data Warehouse Data Mart Operational Data Store ETL 37 Data Warehouse 38 Definition Bill Inmon Characteristics of a data warehouse: • Subject-oriented • Integrated • Time-variant • Non-volatile • Both summary and detailed data 39 Data Warehouse • Contains data that can be used to meet the information of (part of) the organisation • Contains integrated data extracted from one or more sources • Mostly contains large amounts of data • Contains data that is clean and consistent • May contain aggregated data • Optimised for its use 40 Data Warehouse Data Base Data Warehouse Actual Historical Internal Internal and External Isolated Integrated Transactions Analysis Normalised Dimensional Dirty Clean and Consistent Detailed Detailed and Summary 41 Data Warehouse Advantages • One point of contact • Time savings • No loss of historical data • OLTP’s not hampered by BI activities • Better consistency and quality of data • Improvement of Business Intelligence 42 Data Warehouse Disadvantages • Never quite up-to-date • Requires a lot of storage space • Requires a lot of communication, coordination and cooperation • Large impact on the organisation • A data warehouse is only the beginning 43 Data Mart • DW design does not optimise query performance • Data is not stored in an optimal fashion for any given department in the DW • Competition to get the resources required to get inside the DW • Costs for DSS computing facilities are high because of the large volume in DW 44 Data Mart Characteristics: • Customised for a specific department • Limited amount of history • Summarised • Very flexible • Elegant presentation • Processor dedicated to the department 45 Data Mart Divided by: • Business • Geography • Security • Political (budget) • Structure (data mining) 46 Data Mart Three different kinds of data marts: • Subset/summary • MOLAP • ROLAP 47 Operational Data Store Characteristics: • Subject-oriented • Integrated • Current-valued • Volatile • Detailed data 48 ETL: Extraction Source selection: • Data model is starting point: determine data elements that are needed • For each data element, determine available data sources • If more han 1 source available, select on: – – – – Quality, reliability and integrity Scope of data Location and availability of data Location and availability of expertise 49 ETL: Transformation Processing: • Aggregate records • Encoding structures • Simple reformatting • Mathematical conversion • Resequencing of data • Default values • Key conversion • Cleansing 50 ETL: Transformation Key transformation Key structure A Key structure B Key structure A Key structure C Key structure A Key structure B New key structure Key structure C 51 ETL: Cleansing Data quality is critical for: • Marketing communications • Targeted marketing • Customer matching • Retail- and commercial householding • Combining information • Tracking retail sales 52 ETL: Cleansing Common excuses for not cleaning: • The data in the operational systems seem to work just fine • Data can be joined most of the time • Cleansing will take place after population of the data warehouse • Data entry will be improved • The users will never agree to change their data 53 Multi Dimensional Data Modeling 54 MD Modeling: Contents • • • • • • • • E/R Modeling (Ex.) MD Modeling (Ex.) Star Schema Slowly Changing Dimensions (Ex.) Surrogate Keys Aggregation (Ex.) Measures & Dimensions reviewed Other important MDM aspects 55 Exercise: E/R Modeling How could the sales transaction database of the eFashion retailer look like? Loyalty Card Products Products Sold Ticket Card_nr Ticket_nr Ticket_nr Product_nr Cust_name Store_nr Product_nr Bar_Code Adress Card-nr #_products Prod_Desc Zip_code Employee_nr price Actual_price City Time_Stamp dicount Weight ... ... Store Store_nr Employee Employee_nr Store_name Emp_name Adress ... Zip_code City State Manager 56 ... Management Questions • Give me the annual revenue of all my product lines divided over all the sales regions over the last 3 years • Give me the top 10 of most profitable products this year • Give me the top 10 of most sold products of last year • Give me the top 10 of most profitable customers • Compare the YTD revenue with the one in the same period last year and the target 57 Why not E/R Modeling? • End users cannot understand, remember, navigate an E/R model (not even with a GUI) • Software cannot usefully query an E/R model • Use of E/R modeling doesn’t meet the DW purpose: intuitive and high performance querying 58 Exercise: Model the Efashion DM • • • • Sales Revenue Time hierarchy (Year-Quarter-Month) Store hierarchy (Region, State, City, Store) Product hierarchy (Line, Category, SKU) 59 eFashion Data Mart Time Facts Month_nr Month_nr Store_nr Month_desc SKU_nr SKU_nr Quarter SKU_desc Sales_revenue Year Category ... Product Line Geography Store_nr Store_name City State Region 60 DW Modeling Components Dimension Tables Geographic Dimension Fact Table Geographic Product Product Time Measures Units $ Facts Time 61 Using a Star Schema Employee_Dim EmployeeKey EmployeeID . . . Dimension Table Time_Dim TimeKey TheDate Fact Table Sales_Fact Product_Dim ProductKey ProductID TimeKey EmployeeKey ProductKey CustomerKey ShipperKey . . . . . . $ Shipper_Dim . . . Customer_Dim ShipperKey ShipperID . . . CustomerKey CustomerID 62 . . . Components of a Star Schema Employee_Dim EmployeeKey EmployeeID . . . Time_Dim Sales_Fact TimeKey TheDate . . . Dimensional Keys Shipper_Dim TimeKey TimeKey EmployeeKey ProductKey CustomerKey ShipperKey $ . . . ProductKey ProductID . . . Multipart Key Measures Customer_Dim ShipperKey ShipperID . . . Product_Dim CustomerKey CustomerID 63 . . . Exercise: Slowly Changing Dimensions Suppose the product categories change from time to time. Model the Data Mart when the manager wants to see historical reports against: 1. The present categories 2. The categories at the time of the sale 3. Both against the present categories and the immediate previous categories 4. The categories at any specified time 64 SCD Exercise 1 Time Facts Month_nr Month_nr Store_nr Month_desc SKU_nr SKU_nr Quarter SKU_desc Sales_revenue Year Category ... Product Line Geography Store_nr Store_name City State Region 65 SCD Exercise 2 Time Facts Month_nr Month_nr Store_nr Month_desc Product_key Product_key Quarter SKU_nr Sales_revenue Year SKU_desc ... Product Category Line Geography Store_nr Store_name City State Region Most Recent Product Key Map Product_key SKU_nr 66 SCD Exercise 3 Time Facts Month_nr Month_nr Store_nr Month_desc Product_key Product_key Quarter SKU_nr Sales_revenue Year SKU_desc ... Product Category Category_old Geography Line Store_nr Store_name City State Region 67 SCD Exercise 4 Time Facts Month_nr Month_nr Store_nr Month_desc SKU_nr SKU_nr Quarter SKU_desc Sales_revenue Year Category ... Product Line Valid_from Geography Valid_until Store_nr Store_name City State Region 68 Slowly Changing Dimensions • Type 1: Overwrite the dimension record • Type 2: Create new dimension record • Type 3: Create an ‘old’ field in the dimension record • Type 4: Add a valid_from and valid_until field in the dimension record Ad. Type 2: requires surrogate keys, but in general, one should always use these because of performance and flexibility Ad. Type 4: Kimball only recognizes 3 types SCD’s 69 Always Use Surrogate Keys • Allows DWH to assign new key versions for SCD’s (type 2) • Higher performance with numeric keys than with long, alphanumeric keys 70 Exercise: Aggregation Suppose the manager queries frequently on product line level and finds the performance too low. Question: How to model the data mart when we want to add aggregated measures on product line level? 71 Exercise: Aggregation Time Facts Month_nr Month_nr Store_nr Month_desc Product_key Product_key Quarter SKU_nr Sales_revenue Year SKU_desc ... Product Category Line Geography Aggregated Facts Store_nr Store_name Week_nr City Product_Line Store_nr Line_key Line_key Line Sales_revenue State ... 72 Region Exercise: Measures Add the following measures to the eFashion Data Mart: • Stock Quantity • Product Price • Promotion Costs (product-specific, storeindependent) 73 Exercise: Measures Q_Stock Facts Month Quarter Quarter Store_nr Year SKU_nr Stock_qty (av, eom) Time Facts Month_nr Month_nr Store_nr Month_desc Product_key Product_key Quarter SKU_nr Sales_revenue Year SKU_desc Stock_qty Product Price Category Line (Valid_from Valid_until) Promotion Facts Geography Store_nr Month_nr Store_name SKU_nr City Promotion_cost State Duration Region Promotion_type ... 74 Measures & Dimensions reviewed The most useful measures are • Numeric • Additive Dimensions are: • The natural entry points of the facts • I.e., used for constraints and report breaks • Independent of each other, not hierarchically related 75 Other Important MDM-Aspects • • • • • • Cardinality Grain Referential Integrity Conformed Dimensions Drill Across Traps 76 How to make the CIF? Any Source Applications External data Any Data L O A D M A N A G E M E N T Any Access Q U E R Y Data Marts Data Warehouse Operational Data Store 77 M A N A G E M E N T LAN/ WAN WWW CGE&Y BI-Approach Overview Metamodel Extraction, Transformation Load Development Data Warehouse Architecture II Data Warehouse Architecture I Evolutionary Strategy Project Management 78 Implementation Definition Increments Source data Incremental Delivery DW blueprint Strategy & objectives Awareness Communication 79 Data Mining 80 Data Mining Definition: The process of digging intelligently into large volumes of data to discover and analyse previously unknown relationships or to validate hypotheses. 81 Data Mining Versus OLAP OLAP/Query Data Mining Are there some customers from large accounts with a high decrease in international calls? Are there any common characteristics among these customers? Information Data 82 Applications • Risk Analysis (grant credit, investment) • Fraud Detection (telephone charge, bank withdrawals) • Trouble Shooting and Diagnosis • Process Controls (wafer fabrication) • Promotion Analysis • Bankruptcy Prediction (mortgage lending, business partners) • Customer Churn (telco) • CRM (next slides) 83 Maximizing Customer Value • • • • Getting more prospects in Turning prospects into customers Selling more products to existing customers Getting less customers out 84 Which ones in and which ones out? Growth Yield per individual customer Keep Migration Yield per customer Costs per customer Highest Lowest Customer profitability Out Example: One to One Marketing • Treat different customers differently – differentiate message – differentiate product offer – differentiate channel • Need for usable information => predict customer behavior out of databases 86 89 90 91 92 Example: clickstream analysis • What parts of our Web site get the most visitors? • What parts of the Web site do we associate most frequently with actual sales? • What parts of the Web site are superfluous or visited infrequently? • Which pages on our Web site seem to be "session killers," where the remote user stops the session and leaves? • What is the new-visitor click profile on our site? • What is the click profile of an existing customer? A profitable customer? A complaining customer that all too frequently returns our product? • What is the click profile of a customer about to cancel our service, complain, or sue us? • How can we induce the customer to register with our site so we learn some useful information about that customer? • How many visits do unregistered customers typically make with us before they are willing to register? Before they buy a product or service? 93 Customized Customer Service Tele-sales 94 Service desk Example: Contact Strategy Good Tele-sales Data mining Customer Data Direct Mail Bad 95 Channel optimisation Sales visit The customer choses the channel Operational systems Service question CC App. Complaint handling Integration Analysis Leaflet request Contact Leaflet receipt Order Status Order service question Organisation! Status order Email Complaint 96 Data Sources for Data Mining Collecting & Cleansing DATA • Transactions (loyalty cards) • Behaviour of existing customers • Logfiles & cookies • Market research • Data suppliers • Public data 97 Example: Affinity Grouping • Market Basket: what items are sold together? • Market Basket: what categories are sold with what items? • Market Basket: what is not sold with certain items? • Event Correlations: what other services are brought in the first month after signing up for a satellite TV subscription? 98 Data Mining Techniques • Decision Trees, Classification Trees, Rule Induction • Neural Nets • Visualisation • Fuzzy Logic; Nearest Neighbour; Memory Based Reasoning; Case Based Reasoning • Proprietary Logic • Classical Statistics 99 Predictive Power Data Mining Techniques Statistical analysis Neural networks Genetic algorithms Decision trees Intuïtion Simplicity 100 Critical Success Factors • Data availability (large amounts of a wide variety of data) • Data consistency • Data quality • Domain expertise • Data used/needed is allowed by privacy laws 101 Benefits • Improved customer relationships • More revenue from existing customers • Market segmentation • Differentiated products and services • Differentiated sales channels • More effective marketing programs • Improved fraud detection • Improved investments •… 102 Demo Decision Tree with BusinessMiner from BusinessObjects 103 Contact information Tom A. Fürstenberg Business Intelligence Consultant Cap Gemini Ernst & Young Sector Energy, Products & Transport Tel +31 6 21 878 915 email: Tom.Furstenberg@CapGemini.nl 104