Data Warehousing Articles - Loyer TCG - Technology Consulting Group

BUILDING A REPORTING DATA WAREHOUSE
by William Laurent

Imagine the following familiar scenario: Your company has disparate implementations of reporting systems/mechanisms throughout the world. These systems are currently used to satisfy various levels (local, regional and corporate) of operational and analytical informational requirements. Each system was developed (and now maintained) separately with variances in approach and scope, causing continued redundancy, inconsistency, and confusion in reporting efforts. There is no existing central reporting system or capability in place that will accommodate the majority of the analytical needs of more than a few groups. The demand for information far outstrips the supply; the desires of the business--real-world opportunities for business intelligence and not just proof-of-concept resolutions--span far beyond the boundaries of current systems.

You have been tasked with the unenviable job of helping your company re-define and re-deploy their overall reporting strategy and architecture. A vision of a consolidated data model that accommodates (and [possibly] ultimately replaces) these disparate data sources, as well as a design that supports incremental implementation (for example: by geography, source system, line of business, or user base) is beginning to materialize. Project scope will include architecture, data model(s), source/target mappings and recommendations for data movement/delivery to the end user via data marts. Ongoing support requirements (staffing, business continuity procedures, change management, etc.) and "best practices" are also considered within the scope of this strategy.

The key question to be answered will be whether or not it is feasible to develop a single model to address all levels and types of reporting needs, from the daily transaction-oriented reports produced for a local office, to the strategic analysis performed by head-quarter's staff.

An experienced data warehouse project manager will usually determine if a single common centralized data model and environment (that will act as a foundation to the reporting repository and accommodate future additional data sources) will get the job done. The physical implementation of this model should serve ongoing reporting needs dynamically and incrementally, ensuring an acceptable level of alignment and long-term compliance between centralized and distributed data structures and elements. The goals of the centralized repository will have to be completely understood and agreed upon by all stakeholders and members of the project team from the outset.

Analytical initiatives will be diverse. Some objective goals may be to:

Improve customer retention, profile identification, and behavior pattern recognition via improved and personalized portfolio management components.
Global views of multinational customer demographics.
CRM and sales force automation efforts.
Facilitate "rapid time to market" cross-selling and mass marketing of B-to-C and B&B products and services.
Speed underwriting of new business across diverse geographic lines and legalities.

During the Discovery Phase, architects should become familiar with the design concepts, approach, extent, intended audience, redundancies, and gaps of each reporting solution in the enterprise. A clear understanding will develop concerning key business drivers and objectives, and how they support analytical reporting efforts throughout the organization.

What will become clear during the discovery phase is that there are many levels of reporting perspectives and needs within the organization. At a minimum, a generalized and (often used) delineation can be made between operational reporting (done, for example, by local branches to report on their daily transactional activity at a detailed level) and analytical reporting (such as trend analysis or risk/portfolio management reviews performed by management or HQ staff.) Each reporting perspective has different informational objectives in terms of scope, volatility, and level of detail: Operational reporting requires access to all transaction details, including deals or sales in progress, on at least a daily basis; analytical reporting is typically performed against historical, somewhat summarized data that is updated less often (monthly or weekly.) Although the individual solutions for each reporting perspective may vary, many enterprises require a single and comprehensive reporting complex--a main/primary reporting warehouse repository that will aid in the effective, cohesive, and streamlined propagation of information across the enterprise.

The differences between operational and analytical informational needs are significant in three critical ways:

Reporting Perspective	Scope	Volatility	Detail
Operational ("tactical")	Includes in-process and active transaction data (i.e.: WIP)	High; usually requires daily (even real-time) updates	Lowest level - local system terminology, data values, business rules (i.e.: local product)
Analytical ("strategic")	Includes historical data	Low; static data updated less often (monthly, weekly)	Cleansed/aggregated to a higher level of detail consistent across the organization

The requirements for distinct multi-level reporting perspectives cannot be satisfied by a single and contradiction-laden "one size fits all" solution. Attempting to address both tactical and strategic reporting requirements in a single reporting data warehouse will compromise and greatly complicate architecture and design, reducing the likelihood of success. A reporting strategy must be developed for analytical and operational perspectives separately--one that is coordinated and complementary to both perspectives, so as to provide the greatest flexibility and functionality. A common and shared meta data schema that allows navigation between analytical and operational reporting areas should form the nucleus of said reporting strategy and associated architectural frameworks.

On most occasions, when undertaking reporting repository project, you will inherit some sort of database that is referred to as the "data warehouse". Although in its current format, this warehouse may not satisfy an acceptable amount of the analytical reporting requirements, it certainly has value as a foundation, and will serve as a jumping off point for the designing and building a full analytical reporting strategy and solution. Scrutiny must be paid to this existing foundation:

Does it contain corporate standards for P&L, financial, and other information used and reported by the majority of organizational units and affiliates?
Has it proven its value already as a reporting vehicle? Is it currently indispensable?
Is performance acceptable?
Is it stable? Scalable?
Can it be considered to be truly enterprise level in scope? What are the gaps?
Is it deployed and maintained in multiple instances throughout the world? If yes, will this local autonomy constrain the business in quickly responding to global challenges in business continuity, reporting, and application development initiatives.
Can it serve as a template for a robust repository?

For the architect, the discovery process will always unearth omnifarious challenges, both potential and decisive, that will continually impact the design and development of the information repository/warehouse. Typical snafus and challenges could be as noted:

The OLTP reporting systems may satisfy local reporting requirements, but are so tightly bound to their corresponding transactional system that they have limited viability beyond or above their local scope. Data sharing becomes burdensome and impossible due to issues with mappings, semantics, aggregations, and more.
Because many localities define, track, and even group product, services, and financial adjustments differently, keeping this local transactional data in a centralized data warehouse might be impossible because this data will not make sense out of its native context. The only way to solve this dilemma is to virtually standardize operational source systems: Company wide entities and business definitions are controlled and mapped to a enterprise-wide data model, data dictionary, and meta data repository in which domains are enforced along business lines. (Learn to recognize when various subsets and levels of transactional data from disparate source systems will conflict with each other in a centralized warehouse.)
Project managers must remember that current physical (staffing, bandwidth, etc.) and logical (political and budgetary) constraints will almost always prevent clean designs, testing cycles, implementations, and roll-outs of a centralized reporting or enterprise data warehouse repository. They will choose their battles carefully.
Designing models and architecture solutions without taking into account hardware and bandwidth limitations will be the kiss of death. Use caution in assuming that premium meta data or a main reporting repository can be sourced in one localized place. Know what kinds of result sets are going to be moving back and forth across the WANs. Ask for many detailed examples of existing and proposed reports, as well as their accompanying access protocols early in the project! Involve your data steward or stewardship team in gathering and compiling intra-group data elements and domains in a form and fashion that will add hard-and-fast value to the project.
Maintaining and adhering to a warehouse driven reporting architecture will be much harder than establishing the initial architecture. Ongoing design and management of the warehouse will have to be done with flexibility and extensibility in mind. Such things as constantly changing business rules and formulas, iterative data movement and migration processes, and introduction of new data elements in accordance with new business lines will present great challenges. Everything modeled and materialized in a true enterprise warehouse will need to conform to explicit enterprise standards located in the meta data repository, which tie back to the transactional systems of record.

Often, a good approach is to suggest, agree upon, and formalize the optimal and overall warehouse design while simultaneously implementing a incremental data architecture. This will help solve immediate business problems, with the understanding that the 'best practice architecture' and full model for the reporting repository/data warehouse will be iteratively implemented in the future. With a centralized meta data layer in the main data repository, the initial data mart audiences will have a place of mutual consensus on the definition and structure of current and proposed dimensions and fields. As new data marts are modeled with a dimensional slant, they will be populated from subject areas that have already materialized and been completed in the warehouse. The data marts should retain some degree of isolation from the repository. Modification in the repository should not affect the core values of marts downstream.

Clearly defined data marts will support the complex analysis OLAP tools along via pre-defined measures. Measures will be determined by the nature of each dimensional star schema model (its cluster of fact tables and associated dimensions) and/or OLAP cube.

To insure optimum distribution of data, expectations will have to be clarified and the following questions will need to be answered:

Are we providing text reports, graphs, charts etc. to the data audience?
Will there be browser-based interrogation of data through an intranet?
Will new quantitative meta data be created? Will a new meta data layer be needed (for CRM, portal, etc.)?
What about the quality of external feeds from third parties and outside sources such as Bloomberg, ADP, and others?
How do we address the issue of timing when reporting data from the warehouse? In other words, how can we prevent the skewing of financials by incremental or inconsistent implementations and releases of data?

Designing a logical architecture for your data warehouse must be done in parallel with that of the physical architecture. Database administrators and systems administrators must be brought in the loop, so as to expose hardware, network, and resource limitations or issues. In addition to the outline above, many more specific issues related to the physical implementation of the data warehouse will need to be addressed. Some of these logistical and administrative issues may be:

Where to drop and re-establish indices and how to set index fill factors.
When updating a large portion of the database (i.e. 25% up), do we want to drop indices before a database load and then re-establish them after the load? If indexes are not dropped, do we want to set index fill factors so server's disk drives do not waste time looking for space in which to write index updates?
Where to use bulk loading facilities and parallel BCP-speeding up repository and staging area loading at the expense of data integrity checking.
How to parallelize table load and index maintenance or re-creation.
When to drop indices and recreate them.
When to update statistics on the warehouse tables.
Whether to incrementally update or truncate and reload a table.
When to perform disk mirroring and hot backups for flexibility and breathing room.
How to schedule and multi-task loading processes. The warehouse will have many processes; a dependency list is needed up front in order to implement restart capabilities. Identification and availability of checkpoints will be a lifesaver when a warehouse load process crashes.
Do we have the have proper human and automated support scheduled for running warehouse batches and processes?

The list of implementation considerations goes on and on...

Summary

Your company is most likely "data rich, but knowledge poor". Various operational and reporting systems deployed around the globe house a wealth of information about your business; however, data consistency, semantics, and beyond, all need to be consolidated above the local level, providing input and direction to strategic corporate initiatives. It is important the project team recognize the logical and physical risks associated with all proposed solutions, and that NO single enterprise solution will adequately accommodate operational requirements at both the local/regional and enterprise levels without requiring excessive time and effort for development, deployment, customization. Success will only be assured by assiduous attention to every level of detail of meta data, data transformation, and overall architecture.

A sound recommendation is to create a somewhat normalized Entity Relationship Model(s) that will form the logical basis for the reporting repository. This model will initially be coupled with and (loosely) based on existing ODS and transactional systems and their common data elements. This should serve as a nascent template, offering the organization a starting point in the reporting warehouse design, and defining the most common data elements from the outset. If beneficial, child models should inherit from the main model, as there will be certain shared elements across child models that will be indivisible. Via a robust meta data schema (that underlies your reporting architecture) data elements should be able to be viewed consistently at each level and in relation to each other. This will safeguard a truly global view of the business at all levels. Child models, and thus reporting data elements at company local/branch levels, will share common meta data schema elements with the (Parent) main reporting repository, ensuring data consistency of mappings, values, specification, and semantics across the organization. To help deploy such a scenario cleanly, next-generation XML enabled repositories may offer value, but only if these repositories are able to evolve beyond a document-centric skeleton.