Is Automated Checking Valuable For Data Warehouses?
Posted by Eric Jacobson at Friday, February 26, 2016
My data warehouse team is adopting automated checking. Along the way, we are discovering some doubters. Doubters are a good problem. They challenge us to make sure automation is appropriate. In an upcoming meeting, we will try to answer the question in this blog post's title.
My short answer: Yes.
My long answer: See below.
The following are data warehouse (or database) specific:
- More suited to machines – Machines are better than humans at examining lots of data quickly.
- Not mentally stimulating for humans – (the flip side of the reason above) Manual DB testers are hard to find. Testers tend to like front-ends, so they gravitate toward app dev teams. DB testers need technical skills (e.g., DB dev skills), and people who have those skills prefer to do DB dev work.
- Straightforward, repeatable automation patterns – For each new dimension table, we normally want the same types of automated checks. This makes automated check design easier and the checks faster to code. The entire DW automation suite contains fewer design patterns than the average application.
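To make that concrete, here is a minimal sketch of one such repeatable pattern in T-SQL. The DimCustomer table and its CustomerBusinessKey/CustomerKey columns are invented for illustration; for every new dimension table we would want checks along these lines.
-- Hypothetical check: each business key should appear only once in the dimension
SELECT CustomerBusinessKey, COUNT(*) AS DuplicateCount
FROM dbo.DimCustomer
GROUP BY CustomerBusinessKey
HAVING COUNT(*) > 1;
-- Hypothetical check: every fact row should resolve to a dimension member
SELECT COUNT(*) AS OrphanedFactRows
FROM dbo.FactOrder f
LEFT JOIN dbo.DimCustomer d ON f.CustomerKey = d.CustomerKey
WHERE d.CustomerKey IS NULL;
An automated check simply asserts that the first query returns no rows and the second returns zero.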
The following are not limited to data warehouse (or database):
- Time to market – Automated checks help you go faster. Randy Shoup says it well at 9:55 in this talk: writing quick and dirty software leads to technical debt, which leads to no time to do it right (the “technical debt vicious cycle”); writing automated checks as you write software leads to a solid foundation, which leads to confidence, which leads to faster and better (the “virtuous cycle of quality”)…Randy’s words.
- Regression checking - In general, machines are better than humans at indicating something changed.
- Get the most from your human testing - Free the humans to focus on deep testing of new features, not shallow testing of old features.
- In case the business ever changes their mind - If you ever have to revisit code to make changes or refactor, automated checks will help you do it more quickly. If you think the business will never change their mind, then maybe automation is not as important.
- Automated checks help document current functionality.
- Easier to fix problems - Automated checks triggered by Continuous Integration find problems right after code is checked in. These problems are usually easier to fix while still fresh in a developer’s mind.
So, you’ve got a green thumb. You’ve been growing houseplants your whole life. Now try to grow an orchid. What you’ve learned about houseplants has taught you very little about orchids.
- Put one in soil and you’ll kill it (orchids grow on rocks or bark).
- Orchids need about a 20-degree Fahrenheit difference between day and night temperatures.
- Orchids need wind and humidity to thrive.
- Orchids need indirect sunlight. Lots of it. But put them in the sun and they’ll burn.
- Fading flowers do not mean your orchid is dying (orchids bloom in cycles).
So, you’re a skilled tester. You’ve been testing functional applications with user interfaces your whole career. Now try to test a data warehouse. What you’ve learned about functionality testing has taught you very little about data testing.
- “Acting like a user” will not get you far. Efficient data testing does not involve a UI and depends little on other interfaces. There are no buttons to click or text boxes to interrogate during a massive data quality investigation.
- Lack of technical skills will kill you. Interacting with a DB requires query language skills (e.g., T-SQL). Testing millions of rows of data requires coding skills to enlist the help of machine-aided exploratory testing (see the sketch after this list).
- Checking the health of your data warehouse prior to deployments probably requires automated checks.
- For functional testing, executing shallow tests first to cover breadth, then deep tests later is normally a good approach. In data testing, the opposite may be true.
- If you are skilled at writing bug reports with detailed repro steps, this skill may hinder your effectiveness at communicating data warehouse bugs, where repro steps may not be important.
- If you are used to getting by as a tester without reading books about the architecture or technology of your system-under-test, you may fail at data warehouse testing. To design valuable tests, a tester will need to study data warehouses until they grok concepts like Inferred Members, Junk Dimensions, Partitioning, Null handling, 3NF, grain, and Rapidly Changing Monster Dimensions.
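As a small example of what machine-aided exploratory testing can look like (the FactOrder table and its OrderStatus/OrderDate columns here are placeholders), a quick T-SQL profile of a column surfaces questions that no manual spot check would:
-- Hypothetical column profile: row count, distinct statuses, NULLs, and date range
SELECT
    COUNT(*) AS TotalRows,
    COUNT(DISTINCT OrderStatus) AS DistinctStatuses,
    SUM(CASE WHEN OrderStatus IS NULL THEN 1 ELSE 0 END) AS NullStatuses,
    MIN(OrderDate) AS EarliestOrderDate,
    MAX(OrderDate) AS LatestOrderDate
FROM dbo.FactOrder;
A profile like this runs in seconds over millions of rows and tells you where to dig next.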
Testers, let’s respect the differences in the projects we test, and grow our skills accordingly. Please don’t use a one-size-fits-all approach.
Testing Against Live Read-Only Production Data
Posted by Eric Jacobson at Tuesday, October 08, 2013
My data warehouse project team is configuring one of our QA environments to be a dynamic read-only copy of production. I’m salivating as I try to wrap my head around the testing possibilities.
We are taking about 10 transactional databases from one of our QA environments and replacing them with 10 databases replicated from their production counterparts. This means that when any of our users performs a transaction in production, that data change will be reflected in our QA environment instantly.
Expected Advantages:
- Excellent Soak Testing – We’ll be able to deploy a pre-production build of our product to our Prod-replicated-QA-environment and see how it handles actual production data updates. This is huge because we have been unable to find some bugs until our product builds experience real live usage.
- Use real live user scenarios to drive tests – We have a suite of automated checks that invoke fake updates in our transactional databases, then expect data warehouse updates within certain time spans. Those checks have always used fake updates. Until now. With the Prod-replicated-QA-environment, we are attempting to programmatically detect real live data updates via logging, and measure those against expected results.
- Comparing reports – A new flavor of automated check is now possible. With the Prod-replicated-QA-environment, we are attempting to use production report results as a golden master to compare against QA report results generated from the data warehouse running the pre-production QA build. Since the data warehouse data supporting the reports should be the same, we can expect the report results to match.
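A minimal sketch of that golden-master comparison, assuming both report result sets can be staged into tables on one server (ProdReportResults and QAReportResults are invented names):
-- Rows in the production report that the QA report is missing
SELECT * FROM dbo.ProdReportResults
EXCEPT
SELECT * FROM dbo.QAReportResults;
-- Rows in the QA report that production does not have
SELECT * FROM dbo.QAReportResults
EXCEPT
SELECT * FROM dbo.ProdReportResults;
Both queries returning zero rows is the pass condition.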
Expected Challenges:
- The Prod-replicated-QA-environment will be read-only. This means instead of creating fake user actions whenever we want, we will need to wait until they occur. What if some don’t occur…within the soak test window?
- No more data comparing? - Comparing transactional data to data warehouse data has always been a bread and butter automated check for us. These checks verify data integrity and data loading. Comparing a real live, quickly changing source to a slowly updating target will be difficult at best.
Non-ETL Data Warehouse Tests You Can Do
Posted by Eric Jacobson at Thursday, June 02, 2011
Your mission is to test a new data warehouse table before its ETL process is even set up. Here are some ideas:
Start with the table structure. You can do this before the table even has data.
- If you’ve got design specs, start with the basics (a metadata-query sketch for these checks appears after this list):
- Do the expected columns exist?
- Are the column names correct?
- Are the data types correct?
- Is the null handling type correct?
- Do the columns logically fit the business needs? This was already discussed during design, but even if you attended those design discussions, you may know more now. Look at each column again and visualize data values, asking yourself if they seem appropriate. You’ll need business domain knowledge for this.
- Build your own query that creates the exact same table and populates it with correct values. Don’t look at the data warehouse source queries! If you do, you may trick yourself into thinking the programmer must be correct.
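Here is a sketch of those structural checks as one metadata query against SQL Server's INFORMATION_SCHEMA views (swap in your own schema and table name). Compare the output to the design spec, by eye or via an automated check:
-- Column names, data types, lengths, and null handling for the new table
SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH, IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_NAME = 'FactOrder'
ORDER BY ORDINAL_POSITION;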
Once you have data in your table you can really get going.
- Compare the record set your query produced with that of the data warehouse table. This is where 70% of your bugs are discovered.
- Are the row counts the same?
- Are the data values the same? This is your bread and butter test. This comparison should be done programmatically via an automated test so you can check millions of columns & rows (see my Automating Data Warehouse Tests post). Another option would be to use a diff tool like DiffMerge. A third option is to just spot check each table manually. A T-SQL sketch of the programmatic comparison appears after this list.
- Are there any interesting columns? If so, examine them closely. This is where 20% of your bugs are hiding. This testing cannot be automated. Look at each column and think about the variety of record scenarios that may occur in the source applications; ask yourself if the fields in the target make sense and support those scenarios.
- Columns that just display text strings like names are not all that interesting because they are difficult to screw up. Columns that are calculated are more interesting. Are the calculations correct?
- Did the design specify a data type change between the source and the target? Maybe an integer needed to be changed to a bit to simplify data…was it converted properly? Do the new values make sense to the business?
- How is corrupt source data handled? Does the source DB have orphaned records or referential integrity problems? Is this handled gracefully? Maybe the data warehouse needs to say “Not Found” for some values.
- Build a user report based on the new data warehouse table. Do you have a trusted production report from the transactional DB? Rebuild it using the data warehouse and run a diff between the two.
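Here is a hedged sketch of that programmatic value comparison, assuming your hand-built query has been materialized into a table I'll call ExpectedFactOrder, keyed on OrderID (the other column names are invented):
-- Row counts should match
SELECT
    (SELECT COUNT(*) FROM dbo.ExpectedFactOrder) AS ExpectedRows,
    (SELECT COUNT(*) FROM dbo.FactOrder) AS ActualRows;
-- Value-level diff keyed on OrderID: any row returned is a bug candidate
-- (a NULL on either side of a compared column is not flagged here; wrap the columns with ISNULL if that matters)
SELECT e.OrderID,
       e.OrderName AS ExpectedName, a.OrderName AS ActualName,
       e.OrderTotal AS ExpectedTotal, a.OrderTotal AS ActualTotal
FROM dbo.ExpectedFactOrder e
FULL OUTER JOIN dbo.FactOrder a ON a.OrderID = e.OrderID
WHERE a.OrderID IS NULL
   OR e.OrderID IS NULL
   OR e.OrderName <> a.OrderName
   OR e.OrderTotal <> a.OrderTotal;
The same queries can run inside an automated test that fails when either one returns something unexpected.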
What am I missing?
Automating Change Data Capture Data Warehouse Tests
Posted by Eric Jacobson at Tuesday, March 29, 2011
Our Data Warehouse uses Change Data Capture (CDC) to keep its tables current. After collaborating with one of the programmers, we came up with a pretty cool automated test template that has allowed our non-programmer testers to successfully write their own automated tests for CDC.
We stuck to the same pattern as the tests I describe in my Automating Data Warehouse Tests post. Testers can copy/paste tests and only need to update the SQL statements, parameters, and variables. An example of our test is below. If you can’t read C#, just read the comments (they begin with //).
BTW – if you want a test like this but are not sure how to write it, just ask your programmers to write it for you. Programmers love to help with this kind of thing. In fact, they will probably improve upon my test.
Happy Data Warehouse Testing!
[TestMethod]
public void VerifyCDCUpdate_FactOrder()
{
    // Get some data from the source
    var fields = DataSource.ExecuteReader(@"
        SELECT TOP (1) OrderID, OrderName
        FROM Database.dbo.tblOrder");
    var OrderID = (int)fields[0];
    var originalValue = (string)fields[1];

    // Make sure the above data is currently in the Data Warehouse
    var DWMatch = new DataSource("SELECT OrderID, OrderName FROM FactOrder WHERE OrderID = @OrderID and OrderName = @OrderName",
        new SqlParameter("@OrderID", OrderID),
        new SqlParameter("@OrderName", originalValue));

    // Fail the test if the data does not match. This is still part of the test setup.
    DataSourceAssert.IsNotEmpty(DWMatch, "The value in the datawarehouse should match the original query");

    try
    {
        // Set a field in the source database to something else
        var newValue = "CDCTest";
        DataSource.ExecuteNonQuery(
            @"UPDATE Database.dbo.tblOrder SET OrderName = @NewValue WHERE OrderID = @OrderID",
            new SqlParameter("@NewValue", newValue),
            new SqlParameter("@OrderID", OrderID));

        var startTime = DateTime.Now;
        var valueInDW = originalValue;
        while (DateTime.Now.Subtract(startTime).Minutes < 10)
        {
            // Verify the value in the source database is still what we set it to, otherwise the test is invalid
            var updatedValueInSource = DataSource.ExecuteScalar<string>(@"SELECT OrderName FROM Database.dbo.tblOrder WHERE OrderID = @OrderID",
                new SqlParameter("@OrderID", OrderID));
            if (updatedValueInSource != newValue)
                Assert.Inconclusive("The value {0} was expected in the source, but {1} was found. Cannot complete test", newValue, updatedValueInSource);

            // Start checking the target to see if it has updated. Wait up to 10 minutes (CDC runs every five minutes).
            // This is the main check for this test. This is really what we care about.
            valueInDW = DataSource.ExecuteScalar<string>(@"SELECT OrderName FROM FactOrder WHERE OrderID = @OrderID",
                new SqlParameter("@OrderID", OrderID));
            if (valueInDW == newValue)
                break;

            Thread.Sleep(TimeSpan.FromSeconds(30));
        }

        if (valueInDW != newValue)
            Assert.Fail("The value {0} was expected in DW, but {1} was found after waiting for 10 minutes", newValue, valueInDW);
    }
    finally
    {
        // Set the value in the source database back to the original.
        // This will happen even if the test fails.
        DataSource.ExecuteNonQuery(
            @"UPDATE Database.dbo.tblOrder SET OrderName = @OriginalValue WHERE OrderID = @OrderID",
            new SqlParameter("@OriginalValue", originalValue),
            new SqlParameter("@OrderID", OrderID));
    }
}