A FEW GOOD MODELERS
by William Laurent
When embarking on a data warehouse project, a team will be assembled that will be responsible for a multitude of diverse missions, from learning the core business to ferreting out the details of the current IT system’s topology—the architectural point of departure. The sheer complexity and size of even the smallest data warehouse projects will stretch existing resources. A formidable number of diverse tasks will have to be assigned to team members based on their strengths, personalities, availability, and other intangibles. As the data warehouse battle is joined, you may share the trenches with the likes of these allies:

 

  • Data Architects
  • Systems Architects
  • Data Modelers
  • Programmers
  • Business Analysts
  • Project Managers
  • Network Engineers
  • Database Administrators
Of the above roles, it is the Data Modeler whose common plights I have written about most recently. For the next few pages, I would like to enlist a few good men (or women)—a few good modelers!

 

It has always been my experience that the best data modelers are hybrids—hands-on technologists (preferably DBAs) blessed with a business understanding of the subject matter to be modeled. Firstly, having an understanding of the business facilitates a higher quality of design, as well as faster information and requirements gathering, with primary business objectives and values held in esteem. Secondly, being cut from the DBA mold will reap enormous benefits downstream as the data modeling process will iteratively take into account things like table growth, locking issues, disk striping, index strategies, object placement, and other hands-on architectural and logistic requisites. The days of the pure data modeler are waning rather quickly: Companies now seek additional value from modeling experts as the business comes to the realization that no modeling should be done in a vacuum. A trained business and technical eye directed towards physical model implementation is imperative.

 

Data models and systems/network architectural blueprints are among the first components to be tackled when rewriting, or building from scratch, a data warehouse, business intelligence system, or reporting repository. If these critical constructs are flawed, the project is likely doomed to failure from the beginning. A logical model that fails to properly capture, explain, and promulgate business rules is bad; however, a poor physical fulfillment of even the best logical models could spell greater long- and short-term disaster.

 

Think of the bountiful value and advantages added to a project when the people who are tasked with designing logical data models are also capable of realizing the model’s physical form. I can hear the arguments now: “The physical model is always going to be different from the logical model”; “The two are separate, and the physical model may vary greatly”; “Logical design must conform exactly to the business and these models should never be created with an eye toward their physical implementation.” I would agree in theory with all of these arguments; nevertheless, I have never seen a smoothly run project where a logical modeler turned over his model to a DBA without heavy—sometimes impassioned, always costly in time—discussions and back-and-forth Q&A (a.k.a. revision) sessions. Furthermore, having a business content expert create a logical model that mirrors exactly the business practice is usually a “nice to have”; it will seldom have a large captive audience. Also, let’s be honest: How many times are database objects created directly from an ER tool’s interface without bothering to keep the physical and logical models in sync in said tool? Lots! Time and resource constraints often dictate that the logical model is the physical model in most respects.
It is important for the data warehouse modeler to know the physical implications of his/her original logical model and how they fit in with the entire warehouse infrastructure. Models will have to constantly bend in order to accommodate the usual moving targets—from development objectives to project timelines to reporting and data-set distribution requirements—as well as physical hardware limitations; risk must constantly be weighed, evaluated, and managed on a daily basis. Who better to communicate the relevant issues and help make punctilious decisions than a data modeler who has DBA-related experience? Ongoing difficult judgments and compromises in normalization, disk partitioning, object-to-segment/disk placement, data redundancy and storage, and other server-side particulars will have to be made with wisdom and experience. Not only will the best data modelers be DBA conversant, they will have many of the unmistakable talents of a data architect and should be able to seize and run with the following recommendations:

 

  • Always be thinking about architecture, network, and deployment issues. Know the power, processors, and capability of your servers, and where every data mart and database object fits into the big and small picture. If the initiative is a rewrite or redeployment, architectural hurdles may prove less daunting, but not less important. Understand user needs and timing issues: How much data will need to be sent over the WAN? What are the batch windows? How can the modeling process add value to performance and scalability in the future?
  • Do not get caught up in the game of creating tables for attributes that resemble "two-state" fields or do not require decoding. The business will probably know that the abbreviation “USD” is a United States Dollar; these types of dimensions are not likely to change. So you may want to eliminate the CURRENCY table (attribute values = YEN, USD, CAD, etc.) altogether, for instance, because it simply contains lookups (a few static dimensions) for attribute values that are already ingrained in the end-user community’s minds. In other words, know when to say when with lookup tables on static or slowly changing dimensions. You may want to commingle slowly changing dimensional data in an overloaded domain lookup table. This will be a slowly-changing or static-valued table used for data mappings, and to resolve/translate lookup values based on things like keys, abbreviations, and acronyms. A domain table will be overloaded with values from opposing source systems and will be differentiated by a unique domain type per source system. On the initial load of the data warehouse, such a table may be a little difficult to populate; however, once populated it can fit nicely into both normalized and star schema model types, with fewer associative entities to boot.
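
To make the overloaded domain table concrete, here is a minimal sketch in Python with SQLite; the table and column names (domain_lookup, domain_type, and so on) are illustrative assumptions, not prescriptions:

```python
import sqlite3

# One table holds code/description pairs for several domains, differentiated
# by a unique domain type per source system (illustrative names throughout).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE domain_lookup (
        domain_type TEXT NOT NULL,  -- one domain per source system, e.g. CURRENCY_SYS_A
        domain_code TEXT NOT NULL,  -- the raw value as it arrives from the feed
        domain_desc TEXT NOT NULL,  -- the resolved/translated meaning
        PRIMARY KEY (domain_type, domain_code)
    )
""")
conn.executemany(
    "INSERT INTO domain_lookup VALUES (?, ?, ?)",
    [
        ("CURRENCY_SYS_A", "USD", "United States Dollar"),
        ("CURRENCY_SYS_A", "CAD", "Canadian Dollar"),
        ("CURRENCY_SYS_B", "840", "United States Dollar"),  # another feed uses numeric codes
    ],
)

def resolve(domain_type, code):
    """Translate a source-system code into its warehouse description."""
    row = conn.execute(
        "SELECT domain_desc FROM domain_lookup "
        "WHERE domain_type = ? AND domain_code = ?",
        (domain_type, code),
    ).fetchone()
    return row[0] if row else None
```

Both feeds resolve through the same table, differentiated only by domain type, so no additional associative entities are required.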

 

  • Know how queries are parsed and optimized. To be a true world class data modeler, you must know your way around query execution plans and how SQL statements are parsed and compiled by different databases.
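
As a hedged illustration of reading a plan, SQLite exposes EXPLAIN QUERY PLAN; every DBMS has its own equivalent with different syntax and wording, so treat the details below as SQLite-specific:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trade (trade_id INTEGER PRIMARY KEY, acct_num TEXT, amount REAL)")

# Without an index on acct_num, a filter on it forces a full table scan.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM trade WHERE acct_num = ?", ("A1",)
).fetchall()

conn.execute("CREATE INDEX i_trade_acct ON trade (acct_num)")

# With the index in place, the planner switches to an index search.
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM trade WHERE acct_num = ?", ("A1",)
).fetchall()

scan_detail = plan_before[0][3]    # the human-readable plan step, e.g. 'SCAN trade'
search_detail = plan_after[0][3]   # e.g. 'SEARCH trade USING INDEX i_trade_acct ...'
```

Spotting the difference between a SCAN and an index SEARCH is exactly the kind of check the bullet above has in mind.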

 

  • ERD modeling tools will not often have the built-in flexibility to accommodate unique “homemade” relationship constructs. (For example, if you declare an identifying relationship between a parent and child entity, don’t expect that ERWIN, ER/Studio, et al. will let you selectively delete foreign keys associated with this relationship in your child entity.) Don’t waste valuable time trying to fit round pegs in square holes at the logical level. If you need to devise unwieldy relationships, do so virtually at the physical level through code or constraints if applicable. If the ER tool will not let you do certain things, it may be trying to tell you something anyway: Rethink your approach and be careful; when you make a conscious decision to violate the principles of relational design (as in the above example where the primary and foreign keys are purposely not matching up) you may have a lot of extra coding work down the road, having to rely on multiple work tables and complex max/min/between logic in order to properly aggregate data or traverse through certain tables.

 

  • Different business groups have a very difficult time consistently defining, implementing, and agreeing on broad ranges of business semantics, rules, and process definitions. As a data modeler, you will be in the middle of these disagreements. When you sense this, choose your battles and questions carefully, and remember: You may never reach consensus or acquiescence on certain concepts (i.e. what constitutes a department, branch, class, portfolio, etc.). Move forward and forge ahead anyway, without unanimous ratification; eventually these contested boundaries, if important enough, will be escalated up the corporate ladder and ultimately resolved.

 

  • Formulate an ongoing consensus on how to model empirical entities that share real-world attributes but have different business roles in a system. In most warehouses, modeling PEOPLE and COMPANY entities [capitalized here for the sake of clarity; consider them as entities] represents a common and difficult undertaking. Case in point: A “person” could be a BROKER, TRADER, ACCOUNT OFFICER, ACCOUNT MANAGER, or GUARANTOR, with many further possibilities to boot. Likewise, a COMPANY could be a CLIENT, COUNTERPARTY, or represent any other third party (or first party) corporate entity/element. Creating different tables for each role type can quickly lead to redundancy and lateral complexity. Normalizing and grouping different roles together by logical supertype (with ROLE tables) will certainly improve normalization (although code to load such structures may require more effort). Thinking in object-oriented terms—and along the lines of UML practices—will help identify these problem areas that lend themselves to subtyping/supertyping, extension tables, or entity consolidation where role types are assigned.
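
One possible shape for this supertype-plus-role approach, sketched in SQLite with invented names (party, party_role); a real model would carry role-specific extension tables as well:

```python
import sqlite3

# A single PARTY supertype holds the shared real-world attributes; a
# PARTY_ROLE table assigns each party any number of business roles.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE party (
        party_id   INTEGER PRIMARY KEY,
        party_name TEXT NOT NULL,
        party_kind TEXT NOT NULL      -- 'PERSON' or 'COMPANY'
    );
    CREATE TABLE party_role (
        party_id  INTEGER NOT NULL REFERENCES party (party_id),
        role_code TEXT NOT NULL,      -- BROKER, TRADER, GUARANTOR, CLIENT, ...
        PRIMARY KEY (party_id, role_code)
    );
""")
conn.execute("INSERT INTO party VALUES (1, 'Jane Doe', 'PERSON')")
conn.executemany(
    "INSERT INTO party_role VALUES (?, ?)",
    [(1, "BROKER"), (1, "GUARANTOR")],   # same person, two roles, no duplicated rows
)
roles = [r[0] for r in conn.execute(
    "SELECT role_code FROM party_role WHERE party_id = 1 ORDER BY role_code"
)]
```

The same person carries two roles without a second PEOPLE-style table, which is the redundancy the bullet above warns against.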

 

  • It will help greatly to understand the locking conventions of the database that will host the physical model. Know how locks escalate to different thresholds, what lock types exist (i.e. shared, exclusive, update, and intent locks), and at what granularity (row, page, etc.) data may be affected by each lock type. For some data modelers, talk of increasing throughput and I/O or avoiding lock contention and dirty reads may sound like jumping way ahead into the future. I disagree.

 

  • Exalted is the data modeler who understands the concepts of table and index growth, and object allocation: minimum/maximum number of rows per page, fill factors and associated percentages, segment and extent apportionment. Estimations of a table’s size and row count over time, and how to properly control and manage it, are valuable forethoughts. Sometimes data distribution, storage, batch limitations, and other issues will force you to consider the option of horizontally or vertically slicing tables, although you will want to be careful of maintenance issues associated with this practice.
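
The sizing forethought above amounts to simple arithmetic; here is a back-of-the-envelope sketch, with the caveat that the page size, fill factor, and lack of per-row overhead are illustrative assumptions that vary by engine:

```python
# Estimate pages and total size for a projected row count, given an average
# row length, a page size, and a fill factor (all numbers illustrative).
def estimate_table_size(row_len_bytes, row_count, page_size=8192, fill_factor=0.80):
    usable = int(page_size * fill_factor)       # space actually filled per page
    rows_per_page = max(1, usable // row_len_bytes)
    pages = -(-row_count // rows_per_page)      # ceiling division
    return {
        "rows_per_page": rows_per_page,
        "pages": pages,
        "size_mb": pages * page_size / (1024 * 1024),
    }

# Ten million 200-byte rows at an 80% fill factor on 8 KB pages:
est = estimate_table_size(row_len_bytes=200, row_count=10_000_000)
```

Running the estimate at design time, and again as row counts are revised, is one way a modeler keeps object allocation honest before the DBA ever sees the tables.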
  • Understand fully the requirements and needs that the data warehouse or repository is indentured to meet. If the warehouse is primarily a reporting repository, getting caught up in perfectly modeling the business in normalized forms is probably not required. While data in an OLTP environment is structured and optimized for many different and dynamic purposes, data in an executive dashboard, ODS, data warehouse, staging area, or reporting system will be accessed in countless ways quite different from transactional systems. Is your data warehouse a means to an end? If so, model for quick decision-support query capabilities. Is the repository an intermediate step to the creation of data marts? If yes, model for quickness of batch inserts, data archiving, and turnaround and distribution of dynamic business knowledge.

 

History and Rerun Capability
Perpetually brainstorm about information archiving and history; ask questions about associated requirements (or paucity thereof) from the earliest opportunity: Do they really make sense? Are they attainable? As I often say, trying to capture history—to even get consensus on capturing history—is the number one killer of data warehousing projects. The data modeler should know the historical requirements for all data components: What gets inserted, updated, purged, offloaded, backed-up, aggregated, ignored, massaged [you get the idea] per each batch run? A general and sequential daily batch plan needs to be formulated—one that lists dependencies of jobs, often influenced by historical requirements (and vice versa). Tables will have to be loaded in the order that makes most sense. You may have to scan million-row tables to find out if a customer is new (and should be inserted), or if an address has changed—creating huge performance squeezes if DML logic is not properly situated within a batch context. Plan the timing of all processing, loading, and inserting of repository tables accordingly; furthermore, structure your batch dependencies around referential integrity dependencies and user distribution requirements.

 

A warehouse repository should have seamless restart and batch rerun capability. Assume that a batch will need to be re-run on occasion due to error conditions in data or processing, and that other batches may have run between the date of the original batch run and the date the error was detected. Many repositories are not able to go back in time and fully rerun without adverse effects—because of the way data is archived and stored, because of the way batches are structured, because no viable process control mechanisms are in place. Well thought out techniques that involve multiple timestamps and batch designators on each table—defining units of batch work and batch process hierarchies—will help tremendously. What is important is that units of batch work are captured and integrated with control processes in order to better handle mass data migration and movement—whether we append to a given table in a given batch, truncate and reload it, extend the table, or perform selective updates and move old rows to a history table. Usually a change-based timestamping approach with some degree of complexity is in order. [See Ralph Kimball’s excellent articles for exact details of different date dimension/timestamping techniques.] History will often be tracked and handled differently depending on table type: Fact tables with daily trades or policies will usually deserve more attention than companion lookup tables and other slowly changing dimensions; with these differences come various strategies to handle table rollbacks, surrogate key repair, and subsequent issues within the dominion of a system rerun.
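
One minimal way to realize batch designators and rerun capability, with invented table and column names (batch_control, batch_id) standing in for whatever control structures your shop adopts:

```python
import sqlite3

# Every warehouse row carries a batch_id, and a batch_control table records
# each run, so a failed batch can be backed out and rerun cleanly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE batch_control (
        batch_id INTEGER PRIMARY KEY,
        run_date TEXT NOT NULL,
        status   TEXT NOT NULL        -- 'RUNNING', 'COMPLETE', 'FAILED'
    );
    CREATE TABLE fact_trade (
        trade_id INTEGER PRIMARY KEY,
        amount   REAL NOT NULL,
        batch_id INTEGER NOT NULL REFERENCES batch_control (batch_id)
    );
""")

def load_batch(batch_id, run_date, rows):
    """Register a unit of batch work, then load its rows under that designator."""
    conn.execute("INSERT INTO batch_control VALUES (?, ?, 'RUNNING')", (batch_id, run_date))
    conn.executemany("INSERT INTO fact_trade VALUES (?, ?, ?)",
                     [(tid, amt, batch_id) for tid, amt in rows])
    conn.execute("UPDATE batch_control SET status = 'COMPLETE' WHERE batch_id = ?", (batch_id,))

def back_out(batch_id):
    """Rerun support: delete exactly the rows a bad batch loaded, nothing else."""
    conn.execute("DELETE FROM fact_trade WHERE batch_id = ?", (batch_id,))
    conn.execute("UPDATE batch_control SET status = 'FAILED' WHERE batch_id = ?", (batch_id,))

load_batch(1, "2018-01-02", [(100, 50.0)])
load_batch(2, "2018-01-03", [(101, 75.0)])   # suppose batch 2 turns out to be bad
back_out(2)
remaining = conn.execute("SELECT COUNT(*) FROM fact_trade").fetchone()[0]
```

Because every row is stamped with its batch designator, backing out batch 2 leaves batch 1 untouched; a production scheme would layer timestamps and batch hierarchies on top of this skeleton.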
Disparate Strands of Meta Data

 

Be fully aware of each tool’s value—its strengths and weaknesses in data creation, collection, and integration with respect to your overall meta data objectives. Always be heedful that you may (at least initially) wind up with two separate strands of meta data—one with model and dictionary information (business meanings, descriptions, notes, table growth estimations, and more) from a data modeling tool such as Erwin or ER Studio, and the other (with source to target information, transformation logic, etc.) from your ETL tool. Meta data from both streams is important to the long term health and success of any silo or warehouse project, and depending on the amount and type of meta data integration needed, co-mingling, combining, and merging the two strands may range from easy to difficult. There is no hard-and-fast meta data standard; many modeling tools allow the data modeler to store model meta data in a normalized RDBMS table format (via an XML export or similar operation); but these tables are usually not intuitive or user-friendly in naming or layout, nor will they “sync up” with dictionary meta data produced by competing tools. Because business users, data stewards, and systems administrators will want to see various degrees of dictionary information fairly early in the project, a few extra steps to accommodate, customize, and extend the disparate ETL and modeling tool meta-streams may have to be part of your project plan. In this case, a high level action plan may look something like this:

 

  1. Create data models and ETL transformations in parallel.
  2. Export meta data and dictionary from the modeling tool to Excel or the database du jour.
  3. Create custom meta data/dictionary tables in the database that will be defined to include such things as mappings and transformations.
  4. Bulk load (from the spreadsheet or tables) data created by the modeling tool into your custom meta data/dictionary tables via an ETL transformation—which merges the meta information created solely by the ETL tool with the modeling tool meta data.
  5. Publish and distribute.
  6. Devise an ongoing change management plan to prevent future disharmony of meta data strands.
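
Step 4 of the plan above might look like the following toy merge, with invented field names and sample rows; real meta data will be keyed and shaped by your particular modeling and ETL tools:

```python
# Dictionary meta data exported from a modeling tool (illustrative rows).
model_meta = {
    "account.acct_num": {"description": "Unique account number", "datatype": "char(12)"},
    "account.open_dt":  {"description": "Date the account was opened", "datatype": "date"},
}

# Source-to-target meta data produced by an ETL tool (illustrative rows).
etl_meta = {
    "account.acct_num": {"source": "SYS_A.ACCT.ACC_NO", "transform": "TRIM(ACC_NO)"},
}

# Merge the two strands into one dictionary, keyed on table.column; where a
# column appears in both, the combined record carries fields from each tool.
merged = {}
for column in set(model_meta) | set(etl_meta):
    merged[column] = {**model_meta.get(column, {}), **etl_meta.get(column, {})}
```

The merged records are what would be bulk loaded into the custom meta data/dictionary tables of step 3, ready to publish in step 5.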

 

Surrogate Keys and Normalization

 

I often argue against the use of surrogate keys in many data warehousing circumstances and predicaments; nonetheless, surrogate keys are often silver bullets that modelers have at their disposal for liberal use. Data modelers of all levels know that databases which consolidate information from multiple and distributed source systems will, more often than not, benefit from the use of surrogate keys, as they will help ensure the sanctity of physical primary keys. (We all know that across multiple source systems, fields representing things like customer numbers, account numbers, and other (often primary-keyed) attributes will often hold the exact same data value yet represent totally different customers, accounts, etc. depending on the system of origin.) Surrogate keys also offer other advantages:
  • Fact tables join much faster with smaller keys: Surrogates are usually 4-byte integers or the equivalent. Conversely, concatenated keys will potentially mean slower inserts and updates, as the negative impact on performance will grow with the number of keyed attributes. This problem will compound if you are trying to track history by adding dates (especially DATE datatypes) as part of a table’s key. Large concatenated keys will require the expertise of a DBA—somebody who will be able to properly plan and schedule the optimal rebuilding of indexes on data warehouse or VLDB tables.

 

  • Continuing the above idea, having a date or timestamp as part of a concatenated primary key may be the antecedent of painful adjustments in the future when asked to react to changing business conditions. For instance, if a data warehouse is architected and modeled to only accept source extracts and process batches on a daily basis, integration of new intra-day, weekly, or monthly change-based data from other systems may be difficult. How are different levels of date-stamps going to be reconciled? How can ragged levels coexist? A date field that wound up NULL—very possible given anomalies in system feeds—would pose a problem as one bad value could ruin a whole table’s load. Also, you can never assume that timestamps will match up consistently and correctly; when faced with a rerun, reload, or a late-arriving batch/transaction, there may be obstacles to overcome. If there has to be an actual date field in a primary key, usually you will want it to be from your DATE dimension table.

 

  • In this age, businesses change at light speed. Surrogate keys can help isolate you from system changes brought about by a merger or new corporate initiative. With surrogation, you won’t have to worry if a source-system changes the length or datatype of an attribute (from a character to a numeric for instance) because you will not be using this field attribute as part of a primary key in the data warehouse.

 

  • Keep making educated guesses about performance and the administrative overhead involved to manage the physical implementations of your logical tables during the initial modeling process. Ask questions iteratively and often about the nature and importance of business data—about what comprises the main data focus of the repository or warehouse: What are its driving entities—equities trading activity for example, or information about the customers that are making these trades? Identify predominant driving tables and functional areas—concepts associated more with OLAP than ERD—even in your normalized framework, because all historical and batch loading efforts will be affected by what the modeler perceives to be the center of the universe in their data models. The physical risks and advantages of various normalized and denormalized approaches to data models (which will be determined by the data focus mentioned above) should be understood by all data modelers.
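
The key-mapping mechanics behind surrogation can be sketched as follows; the in-memory map is a stand-in for what would really be a key-map table maintained during ETL:

```python
import itertools

# Surrogates are small monotonically assigned integers (4-byte in practice);
# the map from (source system, natural key) to surrogate is an assumption
# standing in for a persistent key-map table.
_next_key = itertools.count(1)
_key_map = {}

def surrogate_for(source_system, natural_key):
    """Return a stable surrogate key for a (system, natural key) pair."""
    pair = (source_system, natural_key)
    if pair not in _key_map:
        _key_map[pair] = next(_next_key)
    return _key_map[pair]

# Customer '1001' in SYS_A and customer '1001' in SYS_B are different people,
# so they receive different warehouse keys; repeat lookups stay stable.
a = surrogate_for("SYS_A", "1001")
b = surrogate_for("SYS_B", "1001")
a_again = surrogate_for("SYS_A", "1001")
```

The same natural value from two feeds lands on two different warehouse keys, which is precisely how surrogation preserves the sanctity of physical primary keys and insulates the model from source-system datatype changes.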

 

Denormalization will often have many selling points: performance, architectural, and knowledge-distribution advantages among them. Before undertaking any denormalization campaigns in the main warehouse repository, make sure that all source system data elements have been captured and modeled, that your universe is complete. Understand the tradeoffs of every denormalization action [see my white paper on denormalization tradeoffs at www.loyertcg.com for more information] and guard against unwanted data redundancy and maintenance problems that may arise. More often than not, though, the first cut of a repository model (both physical and logical) will be more normalized. There are many hidden arguments that can be made in support of normalization:

 

  • Tables will tend to be narrower and shorter in scope, with more rows squeezing onto a data page; hence searching, sorting, and creating indexes will be quicker to varied degrees. Overall, index searching and DML will be faster with fewer indexes.

 

  • With a more normalized approach there will be more tables, and thus more clustered indexes in the database. Tuning opportunities are increased, and calibration can be more narrowly focused. Although more joins will be in the cards for warehouse developers, the joins can be tuned and analyzed. Proper indexing (and knowledge of how to implement it) should keep performance robust. (Query plans should always be used to make sure unwanted table scans are not occurring.)

 

  • There will be less redundant (and derived) data and fewer null values, preventing many tables from becoming “space hogs” and making your database more compact. With less redundant data, fewer temporary tables should be needed for normal set operations; triggers will also execute more quickly.

 

  • You will be able to communicate more effectively with business users because data is stored in such a way that it will be contextually familiar to them.

 

  • Changes to existing design and tables are better self-contained and isolated from other areas of the database when distinct business elements are modeled atomically, even though doing so results in more tables.

 

  • You will have better control over the placement of physical data onto segments due to the increased proliferation of tables.

 

Naming standards and conventions

 

Naming standards and conventions will come into play upon commencement of any data modeling activity, physical database design and deployment, and the continual expansion and unfolding of corporate data dictionaries. Since standards and conventions are usually considered a simple and dry subject, they are often overlooked during critical phases of development, as team members focus their brains on sexier technological and political considerations. Don’t fall into this trap! When the IT auditors, new hires, or consultants show up in your cube asking questions about database objects and components (which may require you to state the obvious), you may bemoan your lack of focus on such a prosaic yet manageable subject.

 

Some good rules of thumb: There should be no plurals in database object names; numerical occurrences in object names should be avoided unless commonly accepted and legitimate business intimations are being represented. Although the system administrator, DBA, data steward, or data modeler will make the ultimate decisions on naming conventions, allow me to volunteer some suggestions for consideration.

 

  • Warehouse tables should be named either in all upper case or all lower case characters. Make use of underscores between words when needed. Example: ACCOUNT_STATUS, baseball_seas.
  • Warehouse work tables will be prefaced with WK and an underscore. Example: WK_ACCOUNT.
  • For all tables: Table name length will not exceed [x] number of characters; field length will not exceed [x] number of characters.
  • Table attributes will not be prefaced with table name information for keyed values or any other fields. All attributes that model the same exact real-world element should share the same name. For instance, an account number would be named acct_num whether in the account_details, payment, or customer table. This will save developers and DBAs lots of keystrokes and make code more readable. If a convention is followed where the attribute carries its table name in its own name (i.e. account_details.account_details_acct_num), you may exceed the maximum number of characters allowed by the database for object names—DB2/AS400, with an 18-character limit on field names, is a prime example.
  • In the same spirit as above, foreign key fields will have the same name as the primary key they refer to.
  • Stored procedures, triggers, and functions may follow many different conventions, being prefaced with s, t (or tr or trg), and f respectively. For triggers, it is usually a good idea to designate the operation type—insert (I), update (U), delete (D)—in their names as well.
  • Indexes will be prefaced by I_ + table name (or abbreviation) + descriptor. For example: I_ACCOUNT_STAT_PK.
  • Views will be prefaced with V_ and a descriptor. Example: V_ACCOUNT_HISTORY.
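
A few of these suggestions can even be enforced mechanically. Here is a small validation sketch; the 30-character cap, the object kinds, and the prefix map are illustrative assumptions, to be adjusted to whatever conventions your team actually adopts:

```python
# Illustrative convention parameters, not a standard.
MAX_NAME_LEN = 30
PREFIXES = {"work_table": "WK_", "view": "V_", "index": "I_"}

def check_name(kind, name):
    """Return a list of convention violations for a database object name."""
    problems = []
    if not (name.isupper() or name.islower()):
        problems.append("mixed case")                     # all upper or all lower only
    if len(name) > MAX_NAME_LEN:
        problems.append("name too long")
    prefix = PREFIXES.get(kind)
    if prefix and not name.upper().startswith(prefix):
        problems.append("missing " + prefix + " prefix")  # e.g. WK_ for work tables
    return problems
```

Running a checker like this against the schema before each deployment keeps the auditors, new hires, and consultants of the earlier paragraph from finding easy targets.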

 

Here are some basic proletarian examples of standard field attributes mated to suggested abbreviations.   Naming standardization practices are purely subjective processes and will depend on industry, personalities, and the strength (and quality) of the coffee in the snack room or employee cafeteria. (Again: these are only suggestions!)

 

Abbreviation           Description

(to be continued....)

Article content is either copyrighted to the magazine that published it or to William Laurent.
Copyright © 2018 by Loyer TCG, Inc.