
In this article, we discuss how to test code that depends on external infrastructure, and when it is worth testing against the real infrastructure (or a faithful fake of it) instead of abstracting it away behind test doubles.
To better understand the matter at hand, imagine a background service or job (a system under test) that consumes messages containing JSON documents from a queue and subsequently stores these documents in cloud object storage such as Azure Blob Storage or AWS S3. Is there a way to test such a service end-to-end that scales both with the number of tests and the number of team members? Can such a test be run easily on a local workstation by anybody on the team?
Architecture of the example service
To test such a service, typically one of the following strategies is used:
Abstraction: External infrastructure is abstracted from the code with patterns such as Gateway or Repository. These patterns are frequently paired with test doubles such as mocks, stubs, or fakes (a minimal sketch follows this list).
Real infrastructure: Code is tested against real infrastructure, either deployed in a dedicated testing instance (frequently shared by multiple engineers or teams) or, in some cases, the production infrastructure itself.
The code is not tested at all: This is an easy way out, but it is clearly not sustainable in the long term.
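To make the first strategy concrete, here is a minimal, hypothetical sketch of such an abstraction in C#; the interface and class names are illustrative and not taken from the example service, and using directives are omitted as in the other snippets in this article:
// Gateway-style abstraction hiding the object storage (hypothetical names).
public interface IDocumentGateway
{
    Task SaveAsync(string name, BinaryData document);
}

// Production implementation backed by Azure Blob Storage via the official SDK.
public class BlobDocumentGateway(BlobContainerClient container) : IDocumentGateway
{
    public Task SaveAsync(string name, BinaryData document)
        => container.UploadBlobAsync(name, document.ToStream());
}

// In-memory test double that replaces the real storage in tests.
public class InMemoryDocumentGateway : IDocumentGateway
{
    public Dictionary<string, BinaryData> Documents { get; } = new();

    public Task SaveAsync(string name, BinaryData document)
    {
        Documents[name] = document;
        return Task.CompletedTask;
    }
}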
Testing with real infrastructure
Let’s start the discussion by asking whether real infrastructure dependencies such as SQL databases, Azure’s Storage Accounts, Key Vaults, Service Bus, or Event Hubs, or AWS’s S3, DynamoDB, Kinesis, or SES should be used during testing.
Based on our observations, and with a slight grain of salt, there are two opposing camps of engineers with strong opinions on this topic.
On the more traditional side, there are engineers who intentionally design both production and test code so that external dependencies are used only in the final stages of the testing pipeline (such as acceptance or system tests). These engineers are often willing to go to great lengths to abstract away access to all the external dependencies in the code and use mocks or other test doubles heavily. They attempt to cover as much of the codebase as possible with unit tests or other kinds of tests that do not require external dependencies.
On the opposite side, some engineers find almost zero value in any tests that are not running against a fully integrated system and disapprove of introducing abstractions in the code solely to decouple logic from infrastructure access.
Obviously, both approaches have issues, and they both have merit. The optimal testing strategy will be a mixture of the two. However, selecting the optimal testing strategy in a given context is not a simple unidimensional decision; it requires evaluating the specific context of the given system and the team.
Now, let’s look into specific challenges we face when choosing one approach over the other.
When striving to use real infrastructure as little as possible (shifting right), most problems are caused by the introduced abstractions and the need for test doubles:
Incidental complexity: Abstracting all infrastructure access leads to additional code and non-essential (incidental) complexity. Navigating multiple layers of indirection when debugging a simple HTTP call is neither pleasant nor effective.
Skill transfer: Engineers who previously worked with a specific kind of infrastructure and used standard, widely known SDKs to access it cannot reuse their existing knowledge. Instead, they must learn a new set of application-specific abstractions and interfaces.
Effort spent on test doubles: The complexity of writing test doubles (mocks, stubs, fakes) might easily get out of hand. Especially with mocks, a large amount of non-reusable code is typically produced.
Coverage: Some parts of the codebase (namely, the infrastructure-access code) are tested only by tests that run less frequently, are more expensive to create, and are painful to keep up to date. This makes those parts of the codebase more difficult to maintain and more error-prone.
On the other hand, when using real infrastructure as much as possible (shifting left), problems arise with provisioning the infrastructure and with the complexity of the test code:
Setup: It is not straightforward to ensure that each test runs on infrastructure in the state expected by the test. For a database, this means creating and cleaning up relevant tables, making sure that tests use independent sets of rows/tables, or recreating the database for each test. Any of these approaches increases the complexity of the test code.
Time: Tests take much more time. Creating and cleaning up the infrastructure is not instant; for example, creating a new Azure Storage Account might take 30 seconds. Test execution is also slower: a typical HTTP call takes around 100 ms, while a method call on a mock takes around 100 ns, a difference of roughly six orders of magnitude (1 ms = 1 000 000 ns).
Need for isolation: Special effort must be made to isolate the test runs. Compared to tests using in-memory test doubles (which have exactly zero chance of interfering with other engineers’ test runs), tests using external infrastructure might conflict and thus be flaky or non-deterministic.
Costs: Tests might get expensive. Real infrastructure is not free, and the additional cost must be taken into account, especially when pay-as-you-go/serverless infrastructure cannot be used.
Fault simulation: When using real infrastructure, it is almost impossible to simulate specific faults, such as requests timing out or failing, to which the system under test should be resilient. With in-memory test doubles, simulating such faults is easy, as the sketch below illustrates.
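For illustration, here is a minimal sketch of such a fault-simulating test double; it derives from the real BlobContainerClient, whose members are virtual precisely to enable this kind of test double, and the class name is hypothetical:
// Hand-written test double that simulates a server-side failure, which is
// hard to trigger on demand against real infrastructure (hypothetical class).
public class FailingBlobContainerClient : BlobContainerClient
{
    public override Response<BlobContentInfo> UploadBlob(
        string blobName, Stream content, CancellationToken cancellationToken = default)
        => throw new RequestFailedException(500, "Simulated storage outage.");
}
A resilience test can then pass this client to the system under test and assert that the failure is retried or surfaced as expected.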
Now, imagine, just for a moment, that all the problems with real infrastructure in test code described above (setup, time, isolation, costs) disappear.
In such a hypothetical situation, would there be any reason not to use real infrastructure in all the tests? In our opinion, the answer is no, with one exception: when a test case needs to verify how the system under test responds to infrastructure faults or delays, using real infrastructure is, in most cases, not feasible.
Of course, this is not the reality we live in. However, with modern programming languages, infrastructure tools, cloud services, and testing frameworks, we are much closer to this ultimate state now (2024) than we were just a few years ago. This is especially true when compared to 10-20 years ago, when many best practices around software testing were devised. Despite these advancements, many people continue to dogmatically follow those older practices today.
We believe the advances in the tooling mentioned above will continue at the same or a higher pace, and thus it is worth betting on this when designing a testing strategy. However, we are still far from the ideal situation, so using real infrastructure in tests should be carefully considered. This leads to several principles, explained in the rest of this article.
As with software architecture, the correct testing strategy depends on many factors that continuously change. Factors to consider include:
How many engineers are on the development team? A system developed by a single indie developer should probably be tested differently than a system developed by a startup with 8 engineers, and such a startup has different needs than a department of 10 teams. The indie developer can successfully spin up one testing instance of a database and use it for all the tests. In contrast, an entire department using a single database instance might run into problems with a large number of tests running concurrently, differing requirements on seed data, and so on.
How many tests are there and how long do they take? Small and fast test suites can typically be run sequentially. In contrast, larger suites or suites containing long-running tests typically need to be run in parallel and thus designed accordingly.
Is the infrastructure access central to the functionality of the tested system? Consider two systems. First, a business application whose essential complexity is in evaluating business rules in memory over tabular data that can be fetched via ODBC from any relational database. Second, a data-intensive application that carefully manipulates blocks of Azure Storage blobs as part of its core functionality. Testing with real infrastructure surely brings more value to the second application than to the first. A SELECT query against a relational database is an operation that is both trivial to execute and easy to emulate without real infrastructure, thanks to existing tools such as SQLite or Entity Framework. Manipulating blocks of Azure Storage blobs, on the other hand, is not so straightforward, and existing emulation tools are lacking.
How sophisticated is the provisioning of the infrastructure? When infrastructure provisioning is fast and simple, it shifts the balance toward creating more tests relying on real infrastructure. For example, if an engineer is able to provision an AWS S3 bucket directly in test code via an API call, the cost-value ratio of writing a test that needs real infrastructure improves dramatically compared to a situation where creating the infrastructure requires manual steps or an infeasible amount of time.
What are the capabilities of the toolchain being used? Not all programming languages, ecosystems of libraries, and build tools are equal. This aspect is especially relevant for mocking or similar techniques that heavily rely on reflection, code emitting, or source generation. We might also find interesting differences between language ecosystems regarding time abstraction. In some ecosystems, the time abstraction is deeply integrated (such as .NET and its TimeProvider), while others need third-party libraries or custom abstractions (see the sketch below).
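As an illustration of the last point, here is a minimal sketch of a test using .NET’s built-in TimeProvider together with FakeTimeProvider from the Microsoft.Extensions.TimeProvider.Testing package; the TokenCache class is hypothetical:
// Production code depends on the built-in time abstraction instead of DateTime.UtcNow.
public class TokenCache(TimeProvider timeProvider)
{
    private DateTimeOffset _expiresAt;

    public void Store(TimeSpan lifetime) => _expiresAt = timeProvider.GetUtcNow() + lifetime;

    public bool IsExpired => timeProvider.GetUtcNow() >= _expiresAt;
}

[TestMethod]
public void Token_Should_Expire_After_Lifetime()
{
    var time = new FakeTimeProvider();
    var cache = new TokenCache(time);

    cache.Store(TimeSpan.FromMinutes(5));
    time.Advance(TimeSpan.FromMinutes(6)); // no real waiting needed

    cache.IsExpired.Should().BeTrue();
}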
Our main principle when deciding on a testing strategy is a rephrasing of Albert Einstein’s quote, “Everything should be made as simple as possible, but not simpler”: minimize the number of abstractions and test doubles (for the reasons explained above), but not beyond the threshold where qualities of the tests, such as their complexity, repeatability, and speed, would suffer.
Designing for testability is a sound principle, battle-tested not only in software but also in hardware and other kinds of engineering. In software engineering, it is known not just for making testing more efficient but also for making the tested code itself better.
Unfortunately, designing code for testability also often leads to adding more layers of abstraction in the quest to make code that uses external infrastructure testable without the actual infrastructure. Such abstractions serve no purpose other than hiding access to the infrastructure; they have no meaning related to the purpose of the application itself.
We are convinced that such abstractions are purely incidental complexity and should be eliminated where possible. However, this does not mean giving up on separation of concerns or breaking the layering/modularization rules imposed by our architecture of choice.
When using DDD, we are not advocating accessing infrastructure directly in the domain layer instead of using repositories. Repositories, event publishers, and the like have a special, well-understood function in DDD that goes beyond simply hiding infrastructure access. Repositories, for instance, are responsible for fetching domain model aggregates, which includes constructing and deconstructing domain objects and might involve orchestrating access to more than one external dependency.
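To make this concrete, here is a hypothetical sketch of a repository that orchestrates two external dependencies while assembling an aggregate; the Order record and the SQL schema are illustrative, and Dapper is assumed for the database access:
public record Order(Guid Id, string CustomerName, BinaryData Attachment);

// Hypothetical repository: it assembles the aggregate from two stores rather
// than merely hiding a single SDK call.
public class OrderRepository(IDbConnection db, BlobContainerClient attachments)
{
    public async Task<Order> GetAsync(Guid orderId)
    {
        // Order header from a relational database (Dapper assumed)...
        var customerName = await db.QuerySingleAsync<string>(
            "SELECT CustomerName FROM Orders WHERE Id = @orderId", new { orderId });

        // ...large attachment from blob storage.
        var blob = attachments.GetBlobClient($"orders/{orderId}/attachment");
        var attachment = (await blob.DownloadContentAsync()).Value.Content;

        return new Order(orderId, customerName, attachment);
    }
}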
In general, DDD’s domain layer is designed to be highly testable without reliance on real infrastructure. However, for the application and infrastructure layers, we advocate implementing at least a few tests that use real infrastructure or a reliable emulator. It is also crucial to avoid introducing abstractions, beyond those required by DDD, that serve only to hide access to infrastructure.
In software testing, many dogmas or “best practices” might have been valid only in the past or only in some specific contexts, but they are, even today, considered the only correct way to test code. We think it is very important to challenge these dogmas and find new ways that are adapted to a given system, team and state-of-the-art tools. Examples of such dogmas include:
“Unit tests cannot have any external dependencies”: If my code unit needs an external dependency to perform its core functionality and I am able to provide that dependency within the test with reasonable complexity, why should I use a mock or another test double instead?
“Each test must be executed on isolated infrastructure with the same initial state”: This practice definitely has its place in larger teams or for tests that validate code behavior under very specific, intricate conditions, such as concurrent access or particular kinds of stress. However, many, if not most, tests validate logical operations that should, by design, behave exactly the same under any circumstances. In many situations, such tests can run on shared infrastructure without issues.
“Test kinds other than unit tests require complex tooling and setup”: This might have been true in the past, when, for example, web servers required separate heavyweight processes and obtaining a new database took hours or days. These days, it is much easier: we can spin up a web server directly within the test process, connect to an in-memory database or a database created in the cloud, and finally use a standard HTTP client to make a request and verify the response, all within a single process.
Building on the last dogma, here is a sample code snippet in C# using ASP.NET Core, the Spotflow In-Memory Azure Test SDK, and the FluentAssertions library. It demonstrates how straightforward it can be to test a web API that depends on Azure Blob Storage.
First, let’s introduce the production code that bootstraps and starts the web application. The application accepts HTTP PUT requests on the /upload/{name} route:
static class ExampleApplication
{
    public static async Task<WebApplication> StartAsync(BlobContainerClient client)
    {
        var builder = WebApplication.CreateBuilder();

        // Register the storage client and the upload service for dependency injection.
        builder.Services.AddSingleton(client);
        builder.Services.AddSingleton<UploadService>();

        var app = builder.Build();

        app.MapPut("/upload/{name}", HandleRequest);

        await app.StartAsync();

        return app;
    }

    static async Task<IResult> HandleRequest(string name, HttpContext context, UploadService service)
    {
        await service.UploadAsync(name, context.Request.Body);
        return Results.Ok();
    }
}
The application makes use of the UploadService class for uploading the incoming data to Azure Blob Storage. This service internally uses BlobContainerClient from the official Azure SDK to perform the actual infrastructure operations:
class UploadService(BlobContainerClient client)
{
    public Task UploadAsync(string name, Stream stream) => client.UploadBlobAsync(name, stream);
}
Finally, let’s see what the test code looks like:
[TestMethod]
public async Task Example_Service_Should_Upload_Blob()
{
    // Arrange: create in-memory storage account, start the app and prepare HTTP client
    var storage = new InMemoryStorageProvider();
    var account = storage.AddAccount();
    var conn = account.CreateConnectionString();

    var containerClient = new InMemoryBlobContainerClient(conn, "data", storage);

    using var app = await ExampleApplication.StartAsync(containerClient);
    using var httpClient = new HttpClient { BaseAddress = new Uri(app.Urls.First()) };

    // Act: send HTTP request
    var data = BinaryData.FromString("test-content").ToStream();
    var content = new StreamContent(data);
    var response = await httpClient.PutAsync("/upload/test-name", content);

    // Assert: check response code and existing blobs in the storage
    response.Should().HaveStatusCode(HttpStatusCode.OK);

    containerClient.GetBlobs().Should().HaveCount(1);
    containerClient.GetBlobClient("test-name").Exists().Value.Should().BeTrue();
}
We believe that this is a great example of how easily, with modern tools, an application that requires two distinct pieces of infrastructure (Azure Storage and a web server) can be tested.
This might be a more controversial point, but we firmly believe fakes are superior to mocks on many levels when test doubles are needed.
For further discussion, consider the following example: a class MessageProcessor that requires the AppendBlobClient class for appending data to an Azure Storage append blob. MessageProcessor internally calls the AppendBlock(Stream content) method on the AppendBlobClient. The client also has a DownloadContent method that allows reading all appended blocks at once in the form of a Stream.
// Simplified view of the relevant members of AppendBlobClient (bodies elided)
class AppendBlobClient
{
    public virtual void AppendBlock(Stream content) { /* ... */ }
    public virtual Stream DownloadContent() { /* ... */ }
}
We want to test scenarios where one or more messages are sent to MessageProcessor, which should in turn append them to a blob so that all appended messages can be read at once.
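For the comparison below, assume a simplified MessageProcessor along these lines (a hypothetical sketch, not the real implementation):
// Simplified, hypothetical shape of the processor used in the comparison below.
class MessageProcessor(AppendBlobClient client)
{
    public void Process(string message)
        => client.AppendBlock(BinaryData.FromString(message + "\n").ToStream());
}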
With mocking, engineers typically craft mocks of the dependencies of the code being tested and then assert that specific methods with specific parameters were called by the tested code, sometimes also in a specific order (so-called behavior verification). Such mocks are crafted for each specific test case. In the context of the example above, the engineer would create a mock object for the AppendBlobClient, for example with a library like Moq or NSubstitute in .NET, and then assert that the AppendBlock method was called with the expected content.
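With Moq, such a behavior-verification test might look roughly like this (a sketch against the simplified classes above; the real AppendBlobClient can be mocked the same way because its members are virtual):
[TestMethod]
public void Processor_Should_Append_Message_Mock_Variant()
{
    // Arrange: a mock crafted specifically for this test case.
    var clientMock = new Mock<AppendBlobClient>();
    var processor = new MessageProcessor(clientMock.Object);

    // Act
    processor.Process("hello");

    // Assert: behavior verification - only the interaction is checked.
    clientMock.Verify(c => c.AppendBlock(It.IsAny<Stream>()), Times.Once());
}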
With fakes, engineers implement a working version of the dependency that mimics the real behavior as closely as necessary but is as simplified as possible (e.g., storing data in memory only). This simplified version is then used in the tested code. To verify the correctness of the system under test, various properties of the fake are asserted after the test case is executed (so-called state verification). In the example above, the engineer could create a class that inherits AppendBlobClient, e.g., InMemoryAppendBlobClient : AppendBlobClient, provide a simple in-memory implementation of the blob, and then assert that the blob has the expected content by calling the DownloadContent method.
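A corresponding fake and its state-verification test might look roughly like this (again a sketch against the simplified classes above, not against the real Azure SDK types):
// Reusable fake: a working, in-memory implementation of the dependency.
class InMemoryAppendBlobClient : AppendBlobClient
{
    private readonly MemoryStream _blob = new();

    public override void AppendBlock(Stream content) => content.CopyTo(_blob);

    public override Stream DownloadContent() => new MemoryStream(_blob.ToArray());
}

[TestMethod]
public void Processor_Should_Append_Message_Fake_Variant()
{
    // Arrange: the same fake can be reused across many test cases.
    var client = new InMemoryAppendBlobClient();
    var processor = new MessageProcessor(client);

    // Act
    processor.Process("hello");

    // Assert: state verification - the resulting blob content is checked.
    new StreamReader(client.DownloadContent()).ReadToEnd().Should().Be("hello\n");
}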
This might not seem like a big difference at first, but the implications are significant:
With mocks, the test typically verifies only that the AppendBlock method was called; with a fake, the test verifies the actual content that ends up in the blob.
Setting up mocks is tolerable for methods with a void/Unit return type, but it gets really tedious otherwise. With a fake, only one implementation is needed that might be reused for multiple test cases, tested, versioned, shared externally, and overall treated as any other code.
What happens when MessageProcessor is updated to buffer multiple messages and call AppendBlock only once? With mocks, the existing tests will start failing. With fakes, the existing tests remain intact.
So far, we have considered fakes to be implemented in the same language and used in the same process as the tested code. However, fakes can also be implemented in a language-agnostic way, most frequently as a Docker container that can be run locally. The tested code simply connects to it via the local network as it would connect to the real dependency, e.g., over the Internet.
Testing with in-memory test doubles/fakes
Testing with a fake implementation running as a Docker container on the local network
Moreover, with fakes, chances are that there are already existing fake implementations for many dependencies.
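One well-known example is Azurite, the official local emulator for Azure Storage, which can run as a Docker container; the tested code keeps using the regular Azure SDK client, and only the connection string points to the locally running fake (a sketch, assuming Azurite’s default ports and development connection string):
// Start the emulator locally, e.g.:
//   docker run -p 10000:10000 mcr.microsoft.com/azure-storage/azurite
//
// The production code keeps using the regular SDK client; only the connection
// string points at the locally running fake instead of the real service.
var containerClient = new BlobContainerClient("UseDevelopmentStorage=true", "data");
await containerClient.CreateIfNotExistsAsync();
await containerClient.UploadBlobAsync("test-name", BinaryData.FromString("test-content").ToStream());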
We recommend an excellent article by Martin Fowler on the topic for an in-depth discussion of the differences between fakes, mocks, and other kinds of test doubles, or more generally, between state verification and behavior verification.
When it comes to testing code that requires external infrastructure, there are two main approaches: abstracting access to the external infrastructure via patterns such as Gateway and Repository, or using real infrastructure during testing. Both approaches come with specific challenges, such as incidental code complexity when using abstraction or test setup complexity when using real infrastructure.
For most non-trivial systems, both of these approaches need to be combined to achieve the desired levels of code testability as well as test complexity, repeatability, and speed. Although there is no simple recipe for determining the testing strategy that fits a specific system, team, and available tooling, we have provided a framework for making that decision, alongside a few widely applicable principles. These principles can be summarized in the following points:
Use real infrastructure (or faithful fakes of it) in tests wherever it is feasible with reasonable effort and cost.
Minimize abstractions and test doubles, but not beyond the point where qualities of the tests such as complexity, repeatability, and speed suffer.
Design for testability, but avoid abstractions whose only purpose is to hide infrastructure access.
Challenge testing dogmas and adapt the strategy to the given system, team, and state-of-the-art tooling.
When test doubles are needed, prefer fakes over mocks.
The author wishes to thank Michal Zatloukal, Jakub Švehla, and David Nepožitek for their valuable insights that contributed to this article.