DATA WAREHOUSE ARCHITECTURAL CONSIDERATIONS
by William Laurent

Before we start talking about how we will implement an effective data warehouse, we need to talk about what (architecture, feeds, etc.) is already in place. I have broken this document into distinct bulleted sections that represent physical and/or logical particulars of warehouse design, in the hope that they will serve as a catalyst to help project colleagues and contributors start talking about the "hows" of implementation. From each salient point listed, the team can focus and drill down to determine issues and approach where applicable.

Much of what I have categorized must be imbued with, and determined by, definitive and measurable business objectives. Certain concepts and practices that are fundamental to designing an effective warehouse architecture and model need to be driven by business needs. Such things as aggregations, user access requirements, scheduled data availability, security roles, and so on--these are all dependent on user needs, Service Level Agreements, and technology limitations, just to name a few. With so many areas of responsibility and development, rolling out one functional area of a data warehouse to end users while simultaneously constructing other components is usually the best way to keep management content and ensure continued upper-level buy-in for the project. The permutations are endless. So in no particular order, I offer a mish-mash of notable issues that the perspicacious data warehouse analyst will want to snuggle up to.

Define Data Requirements and Determine How Users Want to Look at the Data.

  • Define general level of data granularity, data aggregations, etc.
  • Define hierarchies/trees.
  • Create definition of dimensions.
  • Devise calculations: sums, averages, statistics (across multiple dimensions), etc.
  • Define roll-ups, drill downs, aggregates.
  • Define history.
  • Define delta refresh methods and requirements.
  • Define preferred access methods.
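
The roll-ups and aggregates called for above can be sketched in plain Python. The fact rows, dimension names, and amounts below are all hypothetical, purely for illustration:

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, sales_amount)
facts = [
    ("East", "Widget", 100.0),
    ("East", "Gadget", 250.0),
    ("West", "Widget", 175.0),
]

def roll_up(rows, key_index):
    """Aggregate sales along one dimension (0 = region, 1 = product)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[2]
    return dict(totals)

by_region = roll_up(facts, 0)   # totals per region
by_product = roll_up(facts, 1)  # totals per product
```

In a real warehouse these aggregates would be precomputed into summary tables or cubes; the point is simply that each roll-up is an aggregation keyed on one level of a dimension hierarchy.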

Determine Automation and Execution of ETL Jobs and Other Batch-Oriented Processes.

  • Inventory all relevant components.
  • Determine interface requirements.
  • ETL and batch integration of parts.
  • Set up scheduling and cron processes on the operating system servers.
  • Create and define process control modules (PCM) and job wrappers.
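
A minimal sketch of a job wrapper, assuming a process control module that launches each batch step and records its start/end times and return code (the job name and command here are hypothetical):

```python
import logging
import subprocess
import sys
from datetime import datetime

logging.basicConfig(level=logging.INFO)

def run_job(name, argv):
    """Launch one batch step, logging start/end times and the return
    code so steps can be chained and monitored uniformly."""
    start = datetime.now()
    logging.info("job %s started at %s", name, start)
    result = subprocess.run(argv, capture_output=True, text=True)
    logging.info("job %s finished at %s with rc=%d",
                 name, datetime.now(), result.returncode)
    return result.returncode

# Hypothetical step: a trivial subprocess standing in for an ETL extract.
rc = run_job("extract_step", [sys.executable, "-c", "print('extract done')"])
```

A scheduler (cron or otherwise) would then call such a wrapper for each step, branching on the return code.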

QA and Testing

  • Determine how multiple terabyte environments will be maintained for test, QA, and production environments.
  • Determine subsets of production data in test and QA.
  • Dumps, loads, index creation, etc., of all environments.
  • Create migration plans, release notes, and production support documents.
  • Employ robust version control and tagging processes for parallel development efforts across all lines of business.
  • Don't overlook regression testing.
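
One regression check worth automating is row-count reconciliation between source and target. A minimal sketch, with hypothetical counts:

```python
def reconcile_counts(source_count, loaded_count, rejected_count):
    """Every source row must be either loaded or explicitly rejected --
    nothing may be silently dropped between environments."""
    return source_count == loaded_count + rejected_count

ok = reconcile_counts(1000, 990, 10)       # all rows accounted for
bad = reconcile_counts(1000, 985, 10)      # 5 rows unaccounted for
```

Checks like this, run against test, QA, and production subsets alike, catch silent data loss that functional testing misses.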

Reliance on "Third Party" Data

  • Grasp compliance and CRM issues.
  • Identify what data is hosted outside company--the system periphery.
  • Formulate Business Continuity and contingency plans accordingly (per above).
  • Leverage data by employment of XML interfaces and robust messaging.
  • Open the door for discussion on e-commerce, and how to better serve and empower the customer, vendors, and partners.

Know the Limitations and Strengths of Your ETL Tools and Processes.

  • May need to create pivots from long to wide tables.
  • Ask questions about native DB Drivers. Don't rely on slow ODBC connections.
  • Understand fully how your tools deal with data truncation issues or problems.
  • Understand fully how your tools deal with NULL values under all DML circumstances.
  • Understand array functionality (in order to simulate and create pivot tables.)
  • Know the strength of tool integration with meta data repositories.
  • Know the strength of tool integration with XML (output/input).
  • Identify potential parsing and integration problems with tool's error and process logging.
  • Spot physical dependencies and conflicts of ETL batch jobs on other processes.
  • With the help of meta data, document each ETL job's multiple sources, targets, hubs, and spokes. Multi-mode inserts and updates quickly become complex.
  • Ask hard questions about the tool's scalability, bulk and delta capabilities, and source file/scanner abilities.
  • Make sure transformation rules are not buried in ETL tool code but are instead kept in a repository.
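
The long-to-wide pivot mentioned above is the kind of transform an ETL tool may need array functionality to express. A sketch in plain Python, with hypothetical attribute rows:

```python
def pivot_long_to_wide(rows):
    """Pivot (entity, attribute, value) rows into one record per
    entity, with one column per attribute."""
    wide = {}
    for entity, attribute, value in rows:
        wide.setdefault(entity, {})[attribute] = value
    return wide

long_rows = [
    ("cust1", "city", "Boston"),
    ("cust1", "segment", "retail"),
    ("cust2", "city", "Austin"),
]
wide = pivot_long_to_wide(long_rows)
```

If the ETL tool cannot do this natively, the pivot must happen in SQL or in a scripted step, which is worth knowing before committing to the tool.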

OLAP

  • Understand implementation issues: Investigate the need for separate physical OLAP server.
  • Think about how OLAP tool will pull data from the warehouse.
  • Define a security model and responsible business analysts, approvers, publishers.
  • Shoot for multiple publishing and versioning capabilities: This can offer a very scalable data mart solution that keeps user warehouse access to a minimum. The data warehouse remains a black box and you have data separation on the last node of processing.

History

  • Know what kind of history is needed on all levels. For instance, different tables may have different requirements: transaction history changes frequently, while address change history changes infrequently.
  • Be familiar with history pitfalls: it can be extremely costly in performance, maintenance, etc.
  • History must be user driven and defined from the beginning of the project.
  • Define audit/access trails (user access and change history).
  • Reach early agreements with the business on transactional history issues.
  • Don't forget about process history (process times, benchmarks, etc.)!
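
One common way to keep change history, such as the address changes mentioned above, is a type-2 style table where the current row is closed and a new open-ended row is appended on each change. A minimal sketch with hypothetical customer address data:

```python
from datetime import date

def record_change(history, key, new_value, as_of):
    """Close the current open row for a key and append a new
    open-ended row whenever the tracked value changes."""
    for row in history:
        if row["key"] == key and row["end"] is None:
            if row["value"] == new_value:
                return history  # no change, no new version
            row["end"] = as_of
    history.append({"key": key, "value": new_value,
                    "start": as_of, "end": None})
    return history

hist = []
record_change(hist, "cust1", "12 Main St", date(2018, 1, 1))
record_change(hist, "cust1", "9 Oak Ave", date(2018, 6, 1))
# hist now holds two rows: the first closed, the second still open
```

The cost is that every historical query must filter on the effective-date columns, which is exactly why history requirements must be user driven and defined up front.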

Data Dictionary and Meta Data

  • Establish the need for an enterprise data dictionary and meta data repository from the beginning. (Mapping all the data points to new warehouse data is a massive job / Different physical functional areas may share data elements / Cross-reference scenarios many times are cause for revision in data models, etc.)
  • Know how to integrate this data with ETL tools, different schemas, data access layers, and onwards.
  • Make plans to host the data dictionary on corporate intranet.
  • Determine the best way to have centralized yet integrated meta data. You do not want meta data that quickly becomes orphaned or siphoned off into individual data marts, becoming local meta data. You need to share data definitions and semantics across the enterprise in order to avoid stovepipe data marts and provide a single version of the truth.
  • Know which business rules are extremely dynamic and function specific (i.e. compliance, legal); they may be candidates to stay on the application server.
  • Remind yourself that meta data should be approached with a mindset of integrating functional areas in a transparent way. Do not attempt to create federated meta data unless you really know what you are getting yourself into!

Error Checking and Logging.

  • Promulgate requirements and assessments of error handling, error logging, error reporting, and exception logic/flow.
  • Understand where errors may occur (UNIX, ETL, database RI, Java, OLAP, etc.) and handle them correctly, in other words, store them in a database table.
  • Prescribe an error threshold matrix: How many error instances of type n necessitate job failure or stoppage?
  • Define validation failure levels.
  • Create e-mail enabled error notification.
  • Know when to use RI and when to avoid it. The same goes for triggers and other database objects that may affect performance and/or integrity--often at the expense of each other.
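
The error threshold matrix prescribed above can be as simple as a lookup of per-type limits. The error types and limits below are hypothetical:

```python
# Hypothetical thresholds: how many errors of each type a job
# tolerates before it must be stopped.
THRESHOLDS = {"truncation": 100, "null_violation": 10, "ri_violation": 0}

def should_fail(error_counts):
    """Return True if any error type exceeds its configured limit."""
    return any(error_counts.get(kind, 0) > limit
               for kind, limit in THRESHOLDS.items())

tolerable = should_fail({"truncation": 50})   # under every limit
fatal = should_fail({"ri_violation": 1})      # zero tolerance for RI errors
```

The matrix itself is a good candidate for the meta data repository, so thresholds can be tuned without touching job code.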

Process and Flow Control Logging

  • Log job process information to attain effective benchmarking and comparative performance evaluations: logging a start and end time for each job establishes a benchmarking foundation as well as mean/median run times for SLAs, troubleshooting, etc.
  • Detail requirements for robust process control meta data. Processes that are part of batch jobs--stored procedures, shell scripts, OS and middleware, ETL routines, etc.--as well as parameters for those jobs can be maintained in meta data. Jobs can then be called sequentially in a programmatic fashion, and new jobs and parameters can be added easily. It will become easy to track activity by the user population in order to pinpoint performance problems, facilitate maintenance, perform load balancing on hardware, etc. This can include such things as query start and end times, the most used queries, the number of reads on the database, the number of rows returned, and other relevant factoids.
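
The start/end logging described above can be factored out so every job gets it for free. A minimal sketch using a decorator, with a list standing in for the process-control log table (job name and body are hypothetical):

```python
import functools
import time

process_log = []  # stands in for a process-control log table

def logged(job_name):
    """Decorator that records start/end times for each job run --
    the raw material for benchmarks and SLA reporting."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                process_log.append(
                    {"job": job_name, "start": start, "end": time.time()})
        return inner
    return wrap

@logged("load_customers")
def load_customers():
    return 42  # placeholder for real load logic

result = load_customers()
```

Because the logging lives in one place, mean/median run times and trend reports can be derived later without touching any individual job.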

Fault Tolerance and Recovery

  • If the goal for your warehouse is 24-7 availability, never forget it!
  • You always need an up-to-date and complete back up and recovery plan.
  • Conduct frequent hardware needs assessments (DB servers, application servers, OLAP servers, etc.) and inventory.
  • Know how your data's movement impacts the network and its bandwidth.
  • Study RAID 0 disk striping and RAID 5 fault tolerance. Understand fail-over and redundancy issues and implement a logical and physical architecture accordingly.
  • Practice disk mirroring. Split databases across multiple disks (one-to-one, segment to devices), enhancing use of segments. You must have a good DBA who knows VLDB and hardware tricks, and who feels up to the task.

Application Server

  • Must consider that technology decisions made concerning application servers will play a huge part in an overall data warehouse architecture and its future scalability. A true n-tier warehouse will be able to offer intelligent load balancing and fail-over (via application server clustering)--the ability to support any combination of front and back-ends, with high availability and transparency.
  • Is there an advantage to keeping system authentication in one place, not specific to databases or applications? Sometimes it may be a good idea to authenticate on the application server (Access Control Layer/ACL) and pass the role to the database. Connections may be pooled, offering better system scaling and a system less burdened with access logic.
  • Secure system access at all tiers--web server/NT, application server, database, as well as firewall. Understand single sign-on paradigms, with application server playing traffic cop via ACL, SSL, transaction objects, and brethren. With different front ends or a portal accessing warehouse, thinking about an application server becomes even more important.

Database Physical Design and Performance

  • Estimate the warehouse storage requirements: monthly, weekly, yearly.
  • Define ODS and staging areas for dynamic and extensible processing.
  • Chart database load, backup, replication, and restoration times.
  • Know when to use views and when to shy away from them.
  • Consider that your detail and summary data should probably be in different schemas, your main data warehouse non-volatile, and possibly a black-box like structure.
  • Practice debating and choosing sides in the battle of "Star/Snowflake Schema versus Normalization".
  • Try to detect when you may run into performance tradeoffs in your DML.
  • Will you use views as "virtual data marts"?
  • See the need for a staging area--either separate or incorporated into an ODS type structure. We don't want to upset the ODS with bulk loads; however, we don't want un-validated data to go into the black box data warehouse.
  • Know exactly which physical area of the database (ODS, staging, etc.) will be the best place to store aggregated data.
  • Understand the dangers and advantages of non-logged activities such as BCPs and table truncates. Weigh options that may minimize lock contention in the warehouse or data marts. For instance, BCPs and table truncates offer "non-logged" operations that may help you bypass bottlenecks.
  • The warehouse will require a well thought out sizing, partitioning, and indexing strategy as well as deft mapping/placement of database objects to hardware.
  • Clustered indexes should be used very carefully, with non-clustered indexes striped on a separate physical disk from the base table where needed.
  • Allow enough time windows to update indexes and statistics on huge tables and to clean up fragmentation, page splits, etc.
  • Will main warehouse never have to deal with the burden of validation? Will only pristine data be inserted into this area, with a staging area taking up the validation and scrubbing burdens?
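
The staging-versus-black-box division above boils down to a promotion step: rows pass validation in staging or they stay behind. A minimal sketch, with hypothetical row fields and rules:

```python
def validate_row(row):
    """Hypothetical staging-area checks: a row is pristine only if it
    has a key and a non-negative amount."""
    return row.get("id") is not None and row.get("amount", 0) >= 0

def promote(staging_rows):
    """Split staged rows into those fit for the warehouse and those
    held back for review, so only clean data crosses the boundary."""
    clean = [r for r in staging_rows if validate_row(r)]
    rejected = [r for r in staging_rows if not validate_row(r)]
    return clean, rejected

clean, rejected = promote([
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.0},  # fails validation, stays in staging
])
```

Keeping the validation and scrubbing burden entirely in staging is what lets the main warehouse remain non-volatile and black-box-like.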

I have only listed a few warehouse cogitations, with the hope that you (as the data architect, manager, or other important person) may set in motion the creative juices necessary for the actualization and manifestation of a great data warehouse. The variations on these themes will be endless. Numerous new considerations--far above and beyond what I have listed--will take you down roads never before traveled. Good Luck!

Article content is either copyrighted to the magazine that published it or to William Laurent.
Copyright © 2018 by Loyer TCG, Inc.