Loyer TCG
EVALUATING ETL TOOLS
by William Laurent

  

 A good ETL tool will easily scale up (and down)--from a departmental solution to one suitable for the enterprise--to meet the most demanding requirements of any project, regardless of platform. More than 100 tools on the market purport to offer some ETL function, and many add complexity without delivering scalability. To load, change, reshape, and migrate data into both star-schema and normalized databases, you must rely heavily on both sequential and parallel processing. Although the point-and-click graphical features of many tools seem to promise quick and easy implementation, they rarely deliver on that promise--you will still have to write code to handle the bulk of your processing. But ETL is about more than transforming and scrubbing data; automation, scheduling, and communications utilities must all be provided within your ETL environment. Tools that focus primarily on straightforward extraction and vanilla transformation/loading should really be called data migration tools rather than full-function ETL processors. You must learn to identify the outstanding qualifications, as well as the caveats, of each ETL product you evaluate. Applying the questions and criteria that follow will help your evaluation team conduct an objective analysis.
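The "you will have to write code" point can be made concrete with a minimal sketch of the three ETL stages. This is an illustrative toy, not any vendor's API: the CSV source, in-memory target, and the `id` field used for rejection are all hypothetical.

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def extract(source):
    """Extract: read rows from a delimited source into dictionaries."""
    return list(csv.DictReader(source))

def transform(rows):
    """Transform: scrub and reshape--trim whitespace, reject rows
    missing the (hypothetical) key field, and log each rejection."""
    clean = []
    for row in rows:
        if not row.get("id"):
            log.warning("rejected row with no id: %r", row)
            continue
        clean.append({k: v.strip() for k, v in row.items()})
    return clean

def load(rows, target):
    """Load: append rows to the target store; return the count loaded."""
    target.extend(rows)
    return len(rows)

# Example run against an in-memory source and target.
source = io.StringIO("id,name\n1, Alice \n,Bob\n2,Carol\n")
target = []
loaded = load(transform(extract(source)), target)
```

Even this toy shows why the surrounding criteria matter: real pipelines add scheduling, restartability, and error reporting around exactly these three calls.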

 

  • Database Connectivity: What databases does this product connect to? What non-database data sources (mainframe, XML, flat files, etc.) will this product connect to?
  • Update Capabilities: Can the product dynamically insert flagged data into a table or update it when appropriate?
  • Surrogate Key Support: Does the product support the use and generation of surrogate keys in the warehouse?
  • Change Data Capture: Does the product support reading sources conditionally, so that only new or changed data is processed?
  • Scalability and Continuity: Will the scalability of the tool meet future demands as the system grows? How will it perform under unusual transactional or volumetric circumstances?
  • Conformity, Integration, and Independence: How will the tool behave and operate against existing architectural limitations and requirements? Does it support the hardware platforms and operating systems currently in use? Will it integrate with the current network and user security protocols? Is there true platform independence--i.e., will the application behave exactly the same on all platforms, with no nasty surprises when porting compiled code from NT to UNIX?
  • Intelligent Queries: Does the product write efficient queries? Can you leverage the database's performance tuning techniques from within the product?
  • Multi-Source Joins: Can the product join data from multiple sources in a simple and, more importantly, quick and efficient manner?
  • Aggregate Capabilities: Does the product aid in the creation of aggregates?
  • BI Tool Integration and Modeling Tool Connectivity: How does the tool integrate with other tools, such as design tools like ERWin and ER Studio, or reporting tools like Brio, Cognos, etc.?
  • Metadata Support: What types of metadata are collected? What reporting facilities are available within the tool?
  • Concurrency, Parallelization, and Tool Architecture: Does the tool take advantage of database and hardware efficiencies such as parallelism or SMP/MPP? What is the maximum number of simultaneous developers and production processes supported? Are true multithreaded or multiprocessor jobs possible? Is threading unlimited, or capped within the product? Are multiple jobs required to achieve high degrees of parallelization, or can processing only be done in a serial fashion?
  • Quality Assessment and Management: Does the tool support data quality measurement on an ongoing basis? What functionality to handle error detection and processing is inherent in the tool? How flexible is the error condition testing? How are errors logged?
  • Logging: What logging features help you monitor loads and handle error conditions? How robust are the debugging mechanisms?
  • Scheduling and Restartability: Can you schedule your loading processes (and restarts) using the product, or do you need to purchase another product to support this functionality--what is built into the tool and what is extra? What features does it have in place for data backup and local/distributed disaster recovery? Can the tool recover from a failed job without manual intervention? How flexible is the tool when executing dependent or sequential job streams? How well does the product respond to inevitable failures such as network problems, server crashes, or running out of disk space?
  • Ease of Development: How steep is the learning curve? How well does the tool support testing and debugging during development--do you have to run a whole job all the way through to get error results? Are transformation templates provided? Can transformations be built via drag-and-drop? How must the developer handle variable-length files with the tool? Can developers control the compilation process?
  • Price and Vendor Comfort: What is the bottom-line price (including maintenance) of the product? Are you comfortable with the vendor's stated price-to-performance ratio? What do IT advisory services like Gartner, Forrester, Yankee, and META have to say about the value of the product? Remember that real cost includes both the purchase price and the hidden support costs associated with each product: it may be necessary to purchase additional hardware, software, or training to realize the full potential of the product in the context of your project.
  • Administration: How well does the tool handle versioning and auditing? What form of documentation does the tool generate, and will it meet internal documentation standards, both departmental and enterprise? Does the tool support performance and throughput benchmarking? What reporting capabilities (per error log, run log, etc.) are provided?
  • Product Acceptance and Market: What is the global installation base for the product? Are there viable and visible user groups? Are there technical and discussion forums on the Internet for the toolset/suite?
  • Support: What product support and training is the vendor offering? Are you comfortable with the availability (24/7, consulting, offshore vs. onshore) of the support? Does the maintenance contract seem fair? How often does the vendor upgrade the product? How painful are the upgrade installations? Will the vendor really work with you to help you define and meet your ETL objectives?
  • Cursors: How well are cursors supported? Do ETL processes create cursor processing in the underlying native database that makes it impossible to view data updates and DML in real time? (Many products rely on system cursor-based processing and are therefore not 100% open to tuning or real-time, row-by-row analysis of data processing.)
  • Versioning and Security: Can security on database objects, data stores, files, and transformation scripts be controlled through the tool? How are versioning and change management functions handled within the tool? What happens when developers attempt to change the same routine or transformation object at the same time? Can older incarnations of various routines be retained for reference?
  • Proprietary Languages: What programming languages are supported: VB, C, C++, BASIC, COBOL? Are the built-in tool functions, string parsing, and macros robust? Does it have a custom script-based modeling language aimed specifically at data warehousing and the optimization of data movement and cleansing? How long will it take developers to get up to speed with new languages or methodologies?
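Two of the criteria above--surrogate key generation and change data capture--are easy to misjudge without a mental model of what the tool must do. The sketch below is a deliberately simplified illustration, not any product's mechanism: the in-memory "dimension table", the hash-based change detection, and the key counter are all hypothetical stand-ins.

```python
import hashlib
import itertools

# Hypothetical in-memory dimension table: natural key -> (surrogate key, row hash)
dimension = {}
next_key = itertools.count(1)  # surrogate key generator

def row_hash(attributes):
    """Hash the non-key attributes so changed rows can be detected cheaply."""
    return hashlib.sha256(repr(sorted(attributes.items())).encode()).hexdigest()

def upsert(natural_key, attributes):
    """Insert a new row with a freshly generated surrogate key, or update an
    existing row only when the attribute hash shows the source actually changed."""
    digest = row_hash(attributes)
    if natural_key not in dimension:
        sk = next(next_key)              # new row: assign a surrogate key
        dimension[natural_key] = (sk, digest)
    else:
        sk, old_digest = dimension[natural_key]
        if old_digest != digest:         # changed row: keep the key, update the hash
            dimension[natural_key] = (sk, digest)
    return sk

k1 = upsert("CUST-001", {"name": "Acme", "city": "Boston"})
k2 = upsert("CUST-001", {"name": "Acme", "city": "Chicago"})  # change detected, same key
```

A tool with real surrogate key and change data capture support does this bookkeeping for you, at scale, without a full source re-read--which is exactly what the questions above are probing for.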

      The data convergence and warehousing goals of each business will be unique to its needs. Desires will be diverse and numerous; therefore, high-level core needs should be identified and agreed upon up front. Examples:

 

  • We must have a tool that runs on {specific operating system}.
  • We must be able to process {amount} GB of data nightly in a {number}-hour batch window.
  • We must have connectivity to {type of data source}.
  • We must be able to distribute our data to {number and locations of users}.
  • We must have a tool that supports concurrent development by multiple sites.
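The batch-window requirement above reduces to simple arithmetic that is worth computing before talking to vendors, since it sets the sustained throughput any candidate tool must demonstrate. The figures in the example are placeholders, not recommendations.

```python
def required_throughput(gb_per_night, window_hours):
    """Minimum sustained load rate (in MB/s) needed to finish the nightly
    volume inside the batch window."""
    megabytes = gb_per_night * 1024
    seconds = window_hours * 3600
    return megabytes / seconds

# Placeholder figures: 500 GB nightly in a 4-hour window.
rate = required_throughput(500, 4)  # roughly 35.6 MB/s sustained
```

Comparing this number against a vendor's benchmarked (not theoretical) load rates, on hardware comparable to yours, quickly separates realistic candidates from optimistic ones.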

 

     After these high-level considerations are enumerated, you can examine and list the specific requirements of your project and begin to ask your vendors pointed questions, ensuring that all evaluations and purchases of ETL products are prudent.

 

Article content is either copyrighted to the magazine that published it or to William Laurent.
Copyright © 2018 by Loyer TCG, Inc.