Writing Tests for Data Access Code - Data Matters

When we write tests for our data access code, we use datasets for two different purposes:

  1. We initialize our database into a known state before our data access tests are run.
  2. We verify that the correct changes are found in the database.

These seem like easy tasks. However, it is very easy to mess things up in a way that makes our life painful and costs us a lot of time.

That is why I decided to write this blog post.

This blog post describes the three most common mistakes we can make when we use DbUnit datasets and, more importantly, explains how we can avoid making them.
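
To make these two purposes concrete, here is a minimal sketch of a DbUnit test that uses one dataset for each of them. This is an illustration, not code from any specific project: the class name, the dataSource field, and the dataset file names are assumptions.

import javax.sql.DataSource;

import org.dbunit.Assertion;
import org.dbunit.database.DatabaseConnection;
import org.dbunit.database.IDatabaseConnection;
import org.dbunit.dataset.IDataSet;
import org.dbunit.dataset.ITable;
import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;
import org.dbunit.operation.DatabaseOperation;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class CustomerRepositoryTest {

	// Assumed to be provided by the test configuration (e.g. a Spring context).
	private DataSource dataSource;

	private IDatabaseConnection connection;

	@Before
	public void initDatabase() throws Exception {
		connection = new DatabaseConnection(dataSource.getConnection());

		// Purpose 1: initialize the database into a known state. CLEAN_INSERT
		// first deletes the contents of the tables that appear in the dataset
		// and then inserts the rows of the dataset. Because this is done
		// before every test, our test cases stay independent of each other.
		IDataSet dataSet = new FlatXmlDataSetBuilder()
				.build(getClass().getResourceAsStream("/dataset.xml"));
		DatabaseOperation.CLEAN_INSERT.execute(connection, dataSet);
	}

	@Test
	public void update_shouldSaveCorrectInformation() throws Exception {
		// ... invoke the tested code here ...

		// Purpose 2: verify that the correct changes are found in the
		// database by comparing an actual table with an expected dataset.
		IDataSet expectedDataSet = new FlatXmlDataSetBuilder()
				.build(getClass().getResourceAsStream("/expected-dataset.xml"));
		ITable expected = expectedDataSet.getTable("customers");
		ITable actual = connection.createTable("customers");
		Assertion.assertEquals(expected, actual);
	}

	@After
	public void closeConnection() throws Exception {
		connection.close();
	}
}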

The Three Deadly Sins of DbUnit Datasets

The most common reason why libraries like DbUnit have such a bad reputation is that developers use them in the wrong way and complain after they have shot themselves in the foot.

It is true that when we use DbUnit datasets, we can make mistakes that cause a lot of frustration and cost us a lot of time. That is why we must understand what these mistakes are so that we can avoid making them.

There are three common (and costly) mistakes that we can make when we are using DbUnit datasets:

1. Initializing the Database by Using a Single Dataset

The first mistake that we can make is to initialize our database by using a single dataset. Although this is pretty handy if our application has only a handful of functions and a small database with a few database tables, this might not be the case if we are working in a real-life software project.

The odds are that our application has many functions and a large database with tens (or hundreds) of database tables. If we use this approach in a real-life software project, our dataset is going to be HUGE because:

  • Every database table increases the size of our dataset.
  • The number of tests increases the size of our dataset because different tests require different data.

The size of our dataset is a big problem because:

  • The bigger the dataset, the slower it is to initialize the database into a known state before our tests are run. To make matters worse, our tests become slower and slower when we add new database tables or write new tests.
  • It is impossible to find out what data is relevant for a specific test case without reading the tested code. If a test case fails, figuring out the reason for that is a lot harder than it should be.

Example:

Let's assume that we have to write tests for a CRM that is used to manage the information of our customers and offices. Also, each customer and office is located in a city. The first version of our dataset could look as follows:

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<cities id="1" name="Helsinki"/>
	
	<customers id="1" city_id="1" name="Company A"/>
	
	<offices id="1" city_id="1" name="Office A"/>
</dataset>

We can see immediately that our test suite has to invoke one unnecessary INSERT statement per test case. This might not seem like a big deal, but let’s see what happens when we have to write tests for functions that list the customers and offices that are located in a specific city. After we have written these tests, our dataset looks as follows:

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<cities id="1" name="Helsinki"/>
	<cities id="2" name="Tampere"/>
	<cities id="3" name="Turku"/>
	
	<customers id="1" city_id="1" name="Company A"/>
	<customers id="2" city_id="2" name="Company B"/>
	
	<offices id="1" city_id="1" name="Office A"/>
	<offices id="2" city_id="3" name="Office B"/>
</dataset>

As we can see,

  • Our test suite has to invoke three unnecessary INSERT statements per test case.
  • It is not clear what data is relevant for a specific test case because our dataset initializes the whole database before each test is run.

This might not seem like a catastrophic failure (and it isn’t), but this example still demonstrates why we shouldn't follow this approach when we write tests for real-life applications.

2. Creating One Dataset per Test Case or Group of Test Cases

We can solve the problems created by a single dataset by splitting that dataset into smaller datasets. If we decide to do this, we can create one dataset per test case or per group of test cases.

If we follow this approach, each one of our datasets should contain only the data that is relevant to the test case (or test cases). This seems like a good idea because our datasets are smaller and each dataset contains only the relevant data.

However, we must remember that the road to hell is paved with good intentions. Although our tests are faster than the tests that use a single dataset, and it is easy to find the data that is relevant for a specific test case, this approach has one major drawback:

Maintaining our datasets becomes hell.

Because many datasets contain data that is inserted into the same tables, maintaining these datasets takes a lot of work when (not if) the structure of those database tables changes.

Example:

If we use this approach when we write tests for the CRM that was introduced earlier, we could split our single dataset into two smaller datasets.

The first dataset contains the information that is required when we write tests for the functions that are used to manage the information of our customers. It looks as follows:

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<cities id="1" name="Helsinki"/>
	<cities id="2" name="Tampere"/>
	
	<customers id="1" city_id="1" name="Company A"/>
	<customers id="2" city_id="2" name="Company B"/>
</dataset>

The second dataset contains the information that we need when we are writing tests for the functions that are used to manage the information of our offices. The second dataset looks as follows:

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<cities id="1" name="Helsinki"/>
	<cities id="3" name="Turku"/>
	
	<offices id="1" city_id="1" name="Office A"/>
	<offices id="2" city_id="3" name="Office B"/>
</dataset>

What happens if we make changes to the structure of the cities table?

Exactly! That is why following this approach is not a good idea.

3. Asserting Everything

We can create a dataset that is used to verify that the correct data is found in the database by following these steps:

  1. Copy the content of the dataset that is used to initialize the database into a known state before our tests are run.
  2. Paste that content into the dataset that is used to verify that the correct data is found in the database.
  3. Make the required changes to it.

Following these steps is dangerous because it makes sense. After all, if we have initialized our database by using the dataset X, it seems logical that we use that dataset when we create the dataset that is used to ensure that the correct information is found in the database.

However, this approach has three drawbacks:

  • It is hard to figure out the expected result because these datasets often contain information that is not changed by the tested code. This is a problem especially if we have made either mistake one or two.
  • Because these datasets contain information that isn’t changed by the tested code (such as common database tables), maintaining them takes a lot of unnecessary work. If we change the structure of those database tables, we have to make the same change to our datasets as well. This is something that we don’t want to do.
  • Because these datasets often contain unnecessary information (information that is not changed by the tested code), verifying that the expected information is found in the database is slower than it could be.

Example:

Let’s assume that we have to write tests for a function that updates the information of a customer (the id of the updated customer is 2).

The dataset that initializes the database into a known state before this test is run looks as follows:

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<cities id="1" name="Helsinki"/>
	<cities id="2" name="Tampere"/>
	
	<customers id="1" city_id="1" name="Company A"/>
	<customers id="2" city_id="2" name="Company B"/>
</dataset>

The dataset that ensures that the correct information is saved to the database looks as follows:

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<cities id="1" name="Helsinki"/>
	<cities id="2" name="Tampere"/>
	
	<customers id="1" city_id="1" name="Company A"/>
	<customers id="2" city_id="1" name="Company B"/>
</dataset>

Let's go through the drawbacks of this solution one by one:

  • It is pretty easy to figure out what information should be updated because our dataset is so small, but it isn’t as easy as it could be. If our dataset were bigger, this would naturally be a lot harder.
  • This dataset contains the information found in the cities table. Because that information isn’t modified by the tested function, our tests have to make irrelevant assertions, which means that our tests are slower than they could be.
  • If we change the structure of the cities table, we have to modify the dataset that verifies that the correct information is saved to the database. This means that maintaining these datasets takes a lot of time and forces us to do unnecessary work.

Datasets Done Right

We have now identified the three most common mistakes developers make when they are using DbUnit datasets. Now it is time to find out how we can avoid making these mistakes and use datasets effectively in our tests.

Let’s start by taking a closer look at the requirements of a good test suite:

  • It must be easy to read. If our test suite is easy to read, it acts as documentation that is always up to date, and it is faster to figure out what is wrong when a test case fails.
  • It must be easy to maintain. A test suite that is easy to maintain will save us a lot of time that we can use more productively. Also, it will probably save us from a lot of frustration.
  • It must be as fast as possible because a fast test suite ensures fast feedback, and fast feedback means that we can use our time more productively. Also, we must understand that although an integration test suite is typically a lot slower than a unit test suite, it makes no sense to abandon this requirement. In fact, I claim that we must pay more attention to it because if we do so, we can significantly reduce the execution time of our test suite.

You might be wondering why I didn’t mention that each test case must be independent. This is indeed an important requirement of a good test suite, but I left it out because if we are already using a tool such as DbUnit, we have probably figured out that our test cases must not depend on other test cases.

Now that we know what the requirements of our test suite are, it is a whole lot easier to figure out how we can fulfil them by using DbUnit datasets.

If we want to fulfil these requirements, we must follow these rules:

1. Use Small Datasets

We must use small datasets because they are easier to read and they ensure that our tests are as fast as possible. In other words, we must identify the minimum amount of data that is required to write our tests and use only that data.

Example:

The dataset that is used to initialize our database when we test customer-related functions looks as follows:

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<cities id="1" name="Helsinki"/>
	<cities id="2" name="Tampere"/>
	
	<customers id="1" city_id="1" name="Company A"/>
	<customers id="2" city_id="2" name="Company B"/>
</dataset>

On the other hand, the dataset that initializes our database when we run tests for office-related functions looks as follows:

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<cities id="1" name="Helsinki"/>
	<cities id="3" name="Turku"/>
	
	<offices id="1" city_id="1" name="Office A"/>
	<offices id="2" city_id="3" name="Office B"/>
</dataset>

If we take a look at the cities rows of these datasets, we notice that they use different cities. We can fix this by modifying the second dataset to use the same cities as the first dataset. After we have done this, the second dataset looks as follows:

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<cities id="1" name="Helsinki"/>
	<cities id="2" name="Tampere"/>
	
	<offices id="1" city_id="1" name="Office A"/>
	<offices id="2" city_id="2" name="Office B"/>
</dataset>

So, what is the point? It might seem that we didn’t achieve much, but we were able to reduce the number of cities used from three to two. The reason why this little improvement is valuable becomes obvious when we take a look at the next rule.

Typically DbUnit datasets are big and messy, and they contain a lot of redundant data. If this is the case, following this approach will make our datasets a lot more readable and make our tests a lot faster.

2. Divide Large Datasets into Smaller Datasets

We have already created two datasets that contain the minimum amount of data that is required to initialize our database before our tests are run. The problem is that both datasets contain "common" data and this makes our datasets hard to maintain.

We can get rid of this problem by following these steps:

  1. Identify the data that is used in more than one dataset.
  2. Move that data to a separate dataset (or to multiple datasets).

Example:

We have two datasets that look as follows (both insert the same rows into the cities table):

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<cities id="1" name="Helsinki"/>
	<cities id="2" name="Tampere"/>
	
	<customers id="1" city_id="1" name="Company A"/>
	<customers id="2" city_id="2" name="Company B"/>
</dataset>

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<cities id="1" name="Helsinki"/>
	<cities id="2" name="Tampere"/>
	
	<offices id="1" city_id="1" name="Office A"/>
	<offices id="2" city_id="2" name="Office B"/>
</dataset>

We can eliminate our maintenance problem by creating a single dataset that contains the information inserted into the cities table. After we have done this, we have three datasets that look as follows:

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<cities id="1" name="Helsinki"/>
	<cities id="2" name="Tampere"/>
</dataset>

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<customers id="1" city_id="1" name="Company A"/>
	<customers id="2" city_id="2" name="Company B"/>
</dataset>

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<offices id="1" city_id="1" name="Office A"/>
	<offices id="2" city_id="2" name="Office B"/>
</dataset>

What did we just do?

Well, the most significant improvement is that if we make changes to the cities table, we have to make these changes only to one dataset. In other words, maintaining these datasets is a lot easier than before.
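
With plain DbUnit, the split datasets can be combined back into a single dataset before they are inserted into the database. Here is a minimal sketch that could live in the initDatabase() method of the earlier test skeleton; it assumes that the datasets above are stored in files called cities.xml and customers.xml (the names are illustrative), and it needs one extra import (org.dbunit.dataset.CompositeDataSet).

// Load the shared dataset and the dataset of the tested feature...
IDataSet cities = new FlatXmlDataSetBuilder()
		.build(getClass().getResourceAsStream("/cities.xml"));
IDataSet customers = new FlatXmlDataSetBuilder()
		.build(getClass().getResourceAsStream("/customers.xml"));

// ...and merge them so that they can be inserted in one operation.
IDataSet dataSet = new CompositeDataSet(cities, customers);
DatabaseOperation.CLEAN_INSERT.execute(connection, dataSet);

If we use a tool such as Spring Test DbUnit, we can achieve the same thing by listing several dataset files in its @DatabaseSetup annotation.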

3. Assert Only the Information that Can Be Changed by the Tested Code

Earlier we took a look at a dataset that ensured that the correct information is found in the database when we update the information of a customer. The problem is that this dataset contains data that cannot be changed by the tested code. This means that:

  • It is hard to figure out the expected result because our dataset contains irrelevant data.
  • Our tests are slower than they could be because they have to make irrelevant assertions.
  • Our tests are hard to maintain because if we make changes to the database, we have to make the same changes to our datasets as well.

We can solve every one of these problems by following this simple rule:

We must assert only the information that can be changed by the tested code.

Let's find out what this rule means.

Example:

Earlier we created a (problematic) dataset which ensures that the correct information is saved to the database when we update the information of a customer (the id of the updated customer is 2). This dataset looks as follows:

<?xml version='1.0' encoding='UTF-8'?>
<dataset>
	<cities id="1" name="Helsinki"/>
	<cities id="2" name="Tampere"/>
	
	<customers id="1" city_id="1" name="Company A"/>
	<customers id="2" city_id="1" name="Company B"/>
</dataset>

We can fix its problems by keeping the essential data and removing the rest. If we are writing a test that ensures that the information of the correct customer is updated to the database, it is pretty obvious that we don’t care about the information found in the cities table. The only thing that we care about is the data found in the customers table.

After we have removed the irrelevant information from our dataset, it looks as follows:

<?xml version='1.0' encoding='UTF-8'?>
<dataset>	
	<customers id="1" city_id="1" name="Company A"/>
	<customers id="2" city_id="1" name="Company B"/>
</dataset>

We should still assert the values of all columns found in the customers table. If we don’t do this, we might accidentally update the value of a wrong column and get away with it.

We have now fixed the performance and maintenance problems, but there is still one problem left:

Our dataset has two rows, and it is not clear which row contains the updated information. This isn’t a huge problem because our dataset is rather small, but it can become a problem when we use bigger datasets. We can fix this issue by adding a comment to our dataset.

After we have done this, our dataset looks as follows:

<?xml version='1.0' encoding='UTF-8'?>
<dataset>	
	<customers id="1" city_id="1" name="Company A"/>
	
	<!-- The information of the updated customer -->
	<customers id="2" city_id="1" name="Company B"/>
</dataset>

A lot better. Right?
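
Here is a minimal sketch of the verification step of this test. It reuses the connection of the earlier test skeleton and assumes that the expected dataset above is stored in a file called expected-customerData.xml (the file name is illustrative). Because we compare only the customers table, the cities table is left out of the comparison entirely.

// Load the expected dataset (the XML shown above).
IDataSet expectedDataSet = new FlatXmlDataSetBuilder()
		.build(getClass().getResourceAsStream("/expected-customerData.xml"));

// Compare only the customers table; the cities table is not asserted at all.
ITable expectedCustomers = expectedDataSet.getTable("customers");
ITable actualCustomers = connection.createTable("customers");
Assertion.assertEquals(expectedCustomers, actualCustomers);

Note that this comparison assumes that the expected dataset lists every column of the customers table. If the actual table had extra columns that cannot be changed by the tested code, we could exclude them from the comparison with DbUnit's DefaultColumnFilter.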

Summary

This blog post has taught us that:

  • The road to hell is paved with good intentions. The three most common mistakes that we can make when we are using DbUnit datasets seem like a good idea, but if we make these mistakes in a real-life software project, we shoot ourselves in the foot.
  • We can avoid the problems caused by DbUnit datasets by using small datasets, dividing large datasets into smaller datasets, and asserting only the information that can be changed by the tested code.

Comments
  • Greg Aug 21, 2014 @ 0:46

    For the "assert only" rule, how do you easily and reliably identify that program X is only able to update fields Y and Z? Unless you can do this reliably, are you not exposing yourself to bugs when you meant to update City but accidentally overwrite Street?

    I'm waiting for the Reddit-linked version of this post to crap all over you for even mentioning database updates and automated tests in the same sentence, but I totally understand where you are coming from.

    • Petri Aug 21, 2014 @ 9:23

      Oops.

      It seems that I should have been clearer about this. What I meant was that you should assert all the columns of the table (or tables) that can be updated by the tested code. This will of course make the dataset a bit harder to read, but you can add a comment to the modified row and explain what the expected outcome is.

      Thanks for pointing this out. I will update this blog post later today.
