Non-ETL Data Warehouse Tests You Can Do

Posted by Eric Jacobson at Thursday, June 02, 2011

Your mission is to test a new data warehouse table before its ETL process is even set up. Here are some ideas:

Start with the table structure. You can do this before the table even has data.

Do the columns logically fit the business needs? This was already discussed during design. Even if you attended the design, you may know more now. Look at each column again and visualize data values, asking yourself if they seem appropriate. You’ll need business domain knowledge for this.
Build your own query that creates the exact same table and populates it with correct values. Don’t look at the data warehouse source queries! If you do, you may trick yourself into thinking the programmer must be correct.

Once you have data in your table you can really get going.

Compare the record set your query produced with that of the data warehouse table. This is where 70% of your bugs are discovered.

Are the row counts the same?
Are the data values the same? This is your bread and butter test. This comparison should be done programmatically via an automated test so you can check millions of columns & rows (see my Automating Data Warehouse Tests post). Another option would be to use a diff tool like DiffMerge. A third option, just spot check each table manually.

Are there any interesting columns? If so, examine them closely. This is where 20% of your bugs are hiding. This testing can not be automated. Look at each column and think about the variety of record scenarios that may occur in the source applications; ask yourself if the fields in the target make sense and support those scenarios.

Columns that just display text strings like names are not all that interesting because they are difficult to screw up. Columns that are calculated are more interesting. Are the calculations correct?
Did the design specify a data type change between the source and the target? Maybe an integer needed to be changed to a bit to simplify data…was it converted properly? Do the new values make sense to the business?

How is corrupt source data handled? Does the source DB have orphaned records or referential integrity problems? Is this handled gracefully? Maybe the data warehouse needs to say “Not Found” for some values.
Build a user report based on the new data warehouse table. Do you have a trusted production report from the transactional DB? Rebuild it using the data warehouse and run a diff between the two.

What am I missing?

Keith Rozario said... June 2, 2011 at 10:28 PM

The first preliminary checking should also check for slightly 'technical' things like:
1) What kind of delimiters are we using
2) Is the report readable? fixed-width columns with some color coding (hopefully)

Once that's done. Some checking with business-domain knowledge is good. But the best would be to forward the reports (or part of it) to the business to validate it themselves. Users should be able to tell you what they missed, how the report can be improved..etc etc

Building your own query is a fantastic idea, and the underlying theme is that don't trust the developers by just eye-ball checking their data, try to re-create the results yourself and go through each case systematically. I'll definitely give this a try.
Shawn Marie said... June 15, 2011 at 2:50 PM

Are there foreign key relationships for the table? Should there be foreign key relationships? Are there any data constraints and are they enforced?

You mention verifying data types are correct based on the design but was the data type kept consistent between the source and the new table or did something change? If the data type did change from source to the new table how does this change the test or the test query?

Subscribe to posts