Data Warehousing Articles - Loyer TCG - Technology Consulting Group

EXCESS RELIANCE ON A STAR SCHEMA FOR A MAIN REPOSITORY
by William Laurent

Financial-oriented reporting warehouses may suffer from an inherent and pervasive lack of flexibility, due to excess reliance on a star schema design in strategic repository areas. Localized reporting systems will not accomplish stated objectives if one attempts to design and implement too many diverse types of reports (across business lines, market segments, organizational units, etc.) against a star schema-centric reporting database. Different groups of data analysts will always view an enterprise's data from dissimilar perspectives and disparate conventions. They will find it cumbersome to share star schema data structures that attempt to satisfy too wide a reporting audience simultaneously. It is many times best to load semi-normalized data into "a black box" data warehouse firstly, and then feed star or OLAP structures accordingly as data marts to propitiate sectarian wishes.

A given collection of data users will view information in a certain sequence and aggregation level; other groups will require different sequences, access paths, and sets of fundamental data. Requirement gaps between user groups will probably be large. Even though analytic knowledge workers toil away with data from the same transactional source systems of record--semantics names, and definitions of same/similar data elements vary from one department or business unit to another, making a "one size fits all" dimensional reporting model difficult to navigate, especially without a unified and associated meta data repository robustly in place. In addition, star schema tables sometimes share dimensions; this results in the spreading out of fact tables, and thus the warehouse tends to capture only one or a handful of effective data measures (such as unearned premium or specific notionals) per fact table, making the main warehouse repository more complex and resulting in custom fact tables based on line of business, instead of serving multiple organizational sections. Enterprise data audiences should be defined with as broad a brush as possible when constructing a main repository; it will be impossible to satisfy many different business units if the main repository is too customized.

Before embarking on a dimensional design, the data modeler must thoroughly understand what issues, problems, and questions the model will solve (or facilitate) for each business constituency that will be using the repository, now or in the future. Data users seldom have a thorough grasp of their own reporting needs or analytical requirements; therefore, it is best to capture as much as the source system universe as possible in the main warehouse and let individual user groups set requirements for their respective data marts, with data ownership being the ultimate goal. We must first have popular consensus and agreement on names, definitions, and business rules (calculations, aggregations, transformations) of data elements before we can populate proposed or current star schemas.

Dimensional models are usually better broken down according to localized needs and fed by a main quasi-normalized and self-contained warehouse that will port information out of the main repository and into dimensional marts, cubes, or XML mechanisms. The phased warehouse approach (normalized warehouse feeding dimensional marts) will also lend itself more readily--logically and physically--to n-tier architectures--with more options for the separation of data, presentation layers, procedure calls, business logic, transformations, and more. This will ultimately empower the knowledge users as they receive repository data, incarnated in data marts along their lines of business on demand. The star schema structure in itself does not provide the necessary information about the atomic level data--the very foundation of the data warehouse--required for the fact tables within the star schema. Going to a star schema too early in the game will not work--as one size does not fit all. For large warehouse implementations, star schema design must occur along specified measures and dimensions, and may not work effectively across various lines of business--likewise for aggregations and roll-ups. Star schema models, OLAP and ROLAP cubes, and virtual star schemas built from relational views or object database extensions, should occur downstream.

Enterprise information requirements and organization modeling are usually portrayed--in addition to UML constructs such as State Transition Diagrams and Case Hierarchies--via normalized entity relationship models. These models help segue into and form the basis of a data warehouse's vital meta data constructs and repositories by defining and capturing a more atomic and transactional level of data. Communication between an organization's data warehouse team and data stewardship team and (source-system) business analysts will be more effective if tangible common ground exists in the form of a ERD based model. The initial definition and implementation of a schema at such a diminutive data stratum is critical to the success of a main data warehouse or reporting repository. A main warehouse stratum founded on a star schema fabrication will not add value: I32f we fail to store incoming data at its lowest level of granularity in a normalized data layer we may find ourselves reextracting, rescrubbing, and retransforming source system data every time we need to add a new derived attribute to an operational or OLAP report. Such issues will be compounded with the introduction of historical and archiving requirements, which will pose new problems to the designers, DBAs, and developers. Implementing star structures in the wrong places can prevent you easily getting back to the detailed nature of your business in a pinch.

Furthermore, reorganizations, product introductions, new products, premium calculations, policy types, broker relationships, etc., are going to continually affect the warehouse. If the warehouse is going to stay current (and being current will be a big selling point of the warehouse), changes to the warehouse have to be made fast. A reporting or DSS warehouse/repository that is built on a logical and somewhat normalized data model, carefully abstracted from its source systems and properly insulated from downstream data marts (vis-$B!&(Bvis a staging areas) will be the exigent remedy. In other words: The normalized approach provides a magnitude of much greater capability and scalability--allowing for any question to be asked of the data by all interested parties, and the easy addition of more data in the future. If we have changing business requirements or the merging of companies/business units, the main data repository may have to be redesigned and/or re-adjusted to meet any new business requirements. This can be very expensive and also limit the future scalability and growth of the warehouse architecture. If the bulk of our dimensional, transactional and historical data is already stored in a normalized layer, we can react more quickly to changing user requirements

In contrast to a normalized main repository, data marts should, for the most part, be based on star schema models and contain data for specific subject areas. A good star schema will be scoped and processed along business or product lines. Data will be organized through multidimensional models that will support OLAP analysis, representing a conceptualization of the data closest to the way in which a manager perceives the corporate reality. This is why we must build star schemas along business lines: With smaller star schemas and dimension models in data marts, we can focus on distinct and trenchant business issues. In some instances, we may even want to affect the oft-used technique of creating duplicate dimensional panoramas by creating virtual star schema tables via relational database views. These views may be found in a de-normalized staging or reporting area; or hybrid relational-star-schema views could be used to feed OLAP data cubes as part of a load-extraction process for such tools as Cognos or Essbase.

Some other problems with using dimensional models to form the basis of a main reporting repository for the enterprise include the following:

If the main data warehouse is somewhat dynamic (daily inserts via batch feeds, data updates and purges), Referential Integrity constraints, triggers, and other database integrity checks will work implicitly better with a normalized repository structure. This may be an important consideration for ongoing validation requirements (if any) during database loads, refreshes, calculation engine processes, and more. Going directly from operational source data to a star schema can add difficulty to the repository's messy task of enforcing data integrity (with respect to the definition of the fact tables especially). We want the primary tables that drive the logical and physical model to retain a correctness of granularity at all times. Problems may occur when we insert an aggregated record in a basic fact table without proper validations or integrity constraints; if the database does not correctly flag, reject, or handle the rows, we may have synchronization problems with our aggregates.
In both loading a (somewhat normalized vs. dimensional) relational database and taking data from a similarly modeled repository for distribution to data marts, most ETL oriented tools will be subject to better integration performance and collaborative data stewardship. Here is one example: When cleaning and inserting data into a database from a flat file, an ETL tool will often have to match up external keyed values (such as policy or client number) with an appropriately corresponding surrogate keyed value in the repository. This type of integration and transformation effort-which will be common to any insurance or financial data cleansing tasks-will involve singleton lookups on the repository database, which will have to perform an index covered query on keyed column 'SARG' values. If huge concatenated fact table indexes come in to play, performance may start crawling.
Extraction and transformation of source data should be solved at a higher (country or companywide) level in order to properly take advantage of your metadata repository. According to historical vendor practice, his repository will be normalized by nature.
XML integration might be a very important argument for basing the main physical repository on a logical normalized model. Current and next-generation data warehouses should be built with an eye toward the interoperability of warehouse data with systems inside and outside the enterprise. This extensible architecture must be an architecture that will accommodate and work with XML schemas and transformation processes.

Data warehouses and data marts focus on structured data; however, many new epoch (B2B and CRM) projects and applications will have to operate on less structured, non-relational (EBDIC binary, flat files, documents) data. In today's distributed computed environments, it is likely that certain systems under your watch will have to feed data to and receive data from IT complexes that touch outside that comfortable domain you have come to know as simply$B!&(Bin-house".

The ability for the main repository to accommodate XML configurations is paramount, and justifies a normalized physical model. A relational repository will be better able to support and parse delimited XML feeds, extracts, and transaction logs. Data should be justly enabled so that it can be read out of the repository, brought into a XML transformation process, and written to a intranet portal or outside exchange-exponentially expanding the reach of IT systems and customer/client/broker relationships in an unlimited fashion. This same process can work in reverse, as XML data is sent to a warehouse staging area or an ETL tool's domain for extraction, cleansing, and transformation before taking residency in the repository. Whether feeding the repository from a legacy system, or taking a feed from the warehouse, XML transformation engines can carry out transformations and resolve semantic and business differences that may have occurred as a result of CRM initiatives, Business to Business exchanges, or integration of third-party systems.

Archived and historical data (and associated processing) are sometimes better served by logical and normalized schemas. One of many issues: the large overhead of a star or snowflake schemas. Aggregate fact tables can eat up great amounts of disk space. If there are many dimensions to the data, the combination of these aggregate tables and their indexes will eat up many times more space than the raw data itself, providing numerous potential headaches when re-building indexes, running repository statistics, and so on.
Aggregated dimensional structures that are fused, linked into, or used to form the basis of a central warehouse repository may become a processing and maintenance nightmare due to the variance of source system data availability and batch deltas, drastically skewing dimensions oriented to date and time sensitivity. Inconsistencies related to data availability may cause synchronization havoc between fact and dimension tables that attempt to capture and reconcile time sensitive measurements.
As star schemas get bigger they become "snowflakes"-and effective data browsing capabilities may suffer as performance and integration issues (front-end tool access, metadata, etc.) escalate.
A star schema dimensional model will not support warehouse/mart "hub-and-spoke" architectures, where departmental data marts are built component-wise from a single enterprise-wide design.
Common bulk load (BCP, and SQL*Loader, etc.) methods that originate from a source system will work better with a non-dimensional gateway into the warehouse.
A logical normalized main repository will offer more flexibility in dispersal data loads, as it is possible to target distinct subject and business areas of the data model for loading iteratively.
With a normalized reporting repository, emergency maintenance of a large repository may become easier if certain tables-related to a particular logical area, say claim recoveries or claim reserves-are partitioned (RAID, etc.) on separate physical disks; making it possible to do things like swap out disks during a segmented/apportioned load or extract without taking an entire enterprise-focused functional unit of measurement (used across multiple regions and boundaries of business) fully offline.

For distributed information requirements, the entity relationship diagram and the star-schema diagram need not compete with each other within the scope of a data warehouse or reporting repository project. The two are complementary and should be working together cohesively. Each are valuable but must be used different places with different expectations. Subscribing to the Bill Inmon camp of data warehouse architecture-flexible warehouse design should spring forth from an organization's core data, initially modeled as closely to "third normal form" as possible. From this point, the seasoned architect may proceed to formulate an enterprise data road map, replete with star schema data marts, de-normalized staging areas, XQuery enabled ODS structures, and other sundry architecture morphologies as determined by the system's requirements and collected elements of datum. A supple and amenable multilevel architecture with detail data in the foundation, augmented by the use of star joins and multidimensional modeling in associated data marts-each facilitating a different set of OLAP reports from a particular business perspective-should provide a flexible, maintainable, extensible, and friendly architecture for most firms.