This is the approach my uTest testing library (https://github.com/com-lihaoyi/utest) takes. I don't think it's unique to functional tests, even unit tests tend towards this pattern. Tests naturally form a tree structure, for multiple reasons:
- You usually have shared initialization nearer the root and the various cases you want to assert at the leaves.
- You want to group related tests logically together, so it's not one huge flat namespace which gets messy
- You want to run groups of tests at the same time, e.g. when testing a related feature
Typically, these different ways of grouping tests all end up with the same grouping, so it makes a lot of sense to have your tests form a tree rather than a flat list of @Test methods or whatever
Naturally you can always emulate this yourself, e.g. having helper setup methods that call each other and form a hierarchy, having a tagging discipline that forms a hierarchy to let you run tests that are related, or simply using files as the leaf-level of the larger filesystem tree to organize your tests. All that works, but it is nice to be able to simply define a tree of tests in a single file and have all that taken care of for you.
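To make the shape concrete, here's a rough Python-flavoured sketch of the idea (not uTest's actual Scala syntax; names are made up): shared setup at the root, one assertion per leaf, and a runner that rebuilds the setup for each leaf (a real tree would nest groups further).

    def account_tests():
        account = {"name": "alice", "deleted": False}   # shared initialization near the root

        def can_rename():
            account["name"] = "bob"
            assert account["name"] == "bob"

        def can_delete():
            account["deleted"] = True
            assert account["deleted"]

        # one group of related leaves that share the setup above
        return {"can rename": can_rename, "can delete": can_delete}

    # Minimal runner: rebuild the tree for every leaf so each case gets a
    # fresh copy of the shared setup, the way a tree-structured framework would.
    def run(tree_factory):
        for name in tree_factory():
            tree_factory()[name]()
            print("ok:", name)

    run(account_tests)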
"One of the most essential practices for maintaining the long-term quality of computer code is to write automated tests that ensure the program continues to act as expected, even when other people (including your future self) muck with it."
That's such a great condensation of why automated tests are worthwhile.
"To write your own testing framework based on continuation trees, all you need is a stack of databases (or rather, a database that supports rolling back to an arbitrary revision)."
PostgreSQL and SQLite and MySQL all support SAVEPOINT these days, which is a way to have a transaction nested inside a transaction. I could imagine building a testing system on top of this which could support the tree pattern described by Evan here (as long as your tests don't themselves need to test transaction-related behavior).
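Roughly, the nesting looks like this with Python's built-in sqlite3 module (a minimal sketch; the table and savepoint names are placeholders):

    import sqlite3

    # Minimal sketch of nested rollback with SQLite SAVEPOINTs.
    conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transaction control
    conn.execute("CREATE TABLE users (name TEXT)")

    conn.execute("BEGIN")                              # root of the test tree
    conn.execute("INSERT INTO users VALUES ('shared-setup-user')")

    conn.execute("SAVEPOINT branch_a")                 # first branch
    conn.execute("INSERT INTO users VALUES ('only-in-branch-a')")
    assert conn.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 2
    conn.execute("ROLLBACK TO branch_a")               # undo branch A, keep the shared setup

    conn.execute("SAVEPOINT branch_b")                 # sibling branch only sees the setup
    assert conn.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1
    conn.execute("ROLLBACK TO branch_b")

    conn.execute("ROLLBACK")                           # discard everything at the end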
Since ChatGPT Code Interpreter works with o3-mini now I had that knock up a very quick proof of concept using Python and SQLite SAVEPOINT, which appears to work: https://chatgpt.com/share/67d36883-4294-8006-b464-4d6f937d99...
I feel uneasy about using transactions like that. Eventually somebody will be puzzled enough to enable statement logging, which will not contain the data they need. Then they'll set a breakpoint in the test and get a shell on the db to see what's actually there, and they'll find it empty.
And by eventually somebody, I mean two days ago me.
I'd much rather just have a utility that actually copies the database and hands my test the copy it's allowed to mess with.
One trick I've experimented with is setting things up so that, if an assertion fails, a copy of the full current state of the database is automatically captured and written out to a file on disk.
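A minimal sketch of that with sqlite3 (paths and names are made up): wrap the test so an assertion failure dumps the live database to disk before re-raising.

    import sqlite3

    def run_with_db_capture(test, conn, dump_path="failed_test_state.db"):
        try:
            test(conn)
        except AssertionError:
            dest = sqlite3.connect(dump_path)
            conn.backup(dest)          # copy the full current state to a file
            dest.close()
            raise

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice')")

    def failing_test(c):
        assert c.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 2  # fails

    try:
        run_with_db_capture(failing_test, conn)
    except AssertionError:
        print("state captured to failed_test_state.db")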
An approach we've used is Postgres' "create database foo2 template foo1" syntax to essentially snapshot the db under test at various points and use those snapshots to roll back as needed.
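A rough sketch of that snapshot/restore cycle with psycopg2 (database names are placeholders; CREATE DATABASE has to run outside a transaction, and the template can't have other open connections):

    import psycopg2

    admin = psycopg2.connect(dbname="postgres")
    admin.autocommit = True            # CREATE/DROP DATABASE can't run inside a transaction

    def snapshot(src, dest):
        with admin.cursor() as cur:
            cur.execute(f'DROP DATABASE IF EXISTS "{dest}"')
            cur.execute(f'CREATE DATABASE "{dest}" TEMPLATE "{src}"')

    def restore(snap, dest):
        # "rolling back" = dropping the scratch db and re-cloning the snapshot
        snapshot(snap, dest)

    snapshot("app_test", "app_test_snap1")   # checkpoint the current state
    # ... run destructive tests against app_test ...
    restore("app_test_snap1", "app_test")    # roll back to the checkpoint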
End-to-end (e2e) tests are slow and flaky. They don't have to be, but the effort to fix breakage starts consuming most of the available time.
One idea is to separate scraping from verification. The latter would run very fast and be reliable: it only tests against stored state.
Then scraping is just procedural, clicking things, waiting for page loads, and reading page elements into a database.
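A rough sketch of that split in Python (the page-reading step and the selectors are stand-ins): phase one scrapes page state into SQLite, phase two asserts only against the stored rows.

    import sqlite3

    def scrape(read_element, db):
        # slow, procedural, browser-driven part; read_element stands in for
        # "click things, wait for loads, read a selector"
        db.execute("CREATE TABLE IF NOT EXISTS page_state (field TEXT, value TEXT)")
        for field, selector in [("first_name", "#first-name"), ("email", "#email")]:
            db.execute("INSERT INTO page_state VALUES (?, ?)", (field, read_element(selector)))

    def verify(db):
        # fast, reliable part: asserts only against the stored state
        rows = dict(db.execute("SELECT field, value FROM page_state"))
        assert rows["first_name"], "integrity check: field scraped but empty"
        assert "@" in rows["email"]

    db = sqlite3.connect(":memory:")
    scrape(lambda sel: "alice@example.com" if "email" in sel else "Alice", db)
    verify(db)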
Some consequences: you need integrity checks to ensure data has actually been read (e.g. the first-name field selector was updated but never populated), self-healing selectors (AI, et al.), and a way to certify test results against known versions (for fixing the scraper amid a UI redesign).
A lot of effort is saved by using screenshot diffing of, say, React components, especially for edge cases. It also (hopefully) shifts test responsibility left to the devs.
Ideally, we only have some e2e tests, mostly happy paths, that also act as integration tests.
We could combine these ideas with "stacked databases" from the article and save on duplication.
Finally, the real trick is knowing, in the face of changes, which tests don't have to run, making the whole run take less time.
> End-to-end (e2e) tests are slow and flaky.
If they are slow, it means your application is slow. Good thing your tests make you realize it so you can work on it.
If they are flaky, either your application is flaky or your UI is hard to use. Either way, that's something your tests tell you you have to fix.
And last: if your tests are all independent, why not run them all in parallel? With IaC you should be able to provision one instance of your architecture per test (or maybe per dozen tests) easily.
Even in parallel, there are tradeoffs: run them all in one container, or chunk them out to available workers. The former runs into resource constraints; the latter takes up everything in the shared pool.
With IaC, emulating a constellation of all dependent services along with the site is technically feasible. (There are other possible constraints.)
What's your ideal scenario? For example, k8s + cloud, ephemeral db, auto-merged IaC file of a thousand services, push-button perf testing, regression suite with a hundred bots, etc.
> What's your ideal scenario?
An on-premise k8s cloud where you can deploy many instances of the exact same services you have in prod. Let's say an E2E test takes 5 s to run, your deployment takes 2 min, and you want to stay under the 5 min line for running your test suite: deploy an instance of all your services and their databases per batch of 30 tests (2 min + 30 × 5 s = 4.5 min per batch, with the batches running in parallel).
I can understand this not really being possible at an Amazon scale. But for most businesses? A good beefy server should be enough.
Even without the continuation piece, it has always puzzled me why the test frameworks that I've used (mostly pytest and catch) don't explicitly model dependencies between layers. Especially in a system where layers have been carefully levelized. Assuming for the sake of example that there are no mocks involved, if subsystem B depends on subsystem A (say A is some global utility classes), then I would want all of A's unit tests to pass before running any of subsystem B's tests. Not sure why this is absent, or perhaps I'm using the wrong test systems.
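A toy, framework-agnostic sketch of what that could look like (suite names and contents are made up): each suite declares its dependencies, suites run in dependency order, and dependents are skipped when anything below them fails.

    def suite_a():
        assert 1 + 1 == 2          # stand-in for subsystem A's unit tests

    def suite_b():
        assert "a".upper() == "A"  # stand-in for subsystem B's tests

    suites = {
        "A": (suite_a, []),        # (test function, dependencies)
        "B": (suite_b, ["A"]),     # B is only worth running if A passed
    }

    failed = set()
    for name, (run_suite, deps) in suites.items():  # assumes keys are in level order
        if any(d in failed for d in deps):
            print(f"SKIP {name}: depends on failed {deps}")
            continue
        try:
            run_suite()
            print(f"PASS {name}")
        except AssertionError:
            failed.add(name)
            print(f"FAIL {name}")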
You can absolutely do this with hypothesis stateful testing. https://hypothesis.readthedocs.io/en/latest/stateful.html
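For example, a minimal made-up machine looks roughly like this: Hypothesis explores sequences of rule calls and checks the invariant after every step, shrinking any failing sequence.

    from hypothesis import strategies as st
    from hypothesis.stateful import RuleBasedStateMachine, rule, invariant

    class CounterMachine(RuleBasedStateMachine):
        def __init__(self):
            super().__init__()
            self.real = 0    # stand-in for the system under test
            self.model = 0   # simplified model to compare against

        @rule(n=st.integers(min_value=0, max_value=100))
        def add(self, n):
            self.real += n
            self.model += n

        @rule()
        def reset(self):
            self.real = 0
            self.model = 0

        @invariant()
        def agrees_with_model(self):
            assert self.real == self.model

    # collected by pytest/unittest as a normal test case
    TestCounter = CounterMachine.TestCase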
The C++ testing framework Catch2 enables this kind of testing. The first time I saw it I couldn't figure out how some of the tests would even pass.
It turns out that using some evil macro magic, each test re-runs from the start for each inner section [1]. It also makes deduplicating setup code completely painless and natural.
You just have to get over the completely non-standard control flow. It's a good standard bearer for why metaprogramming is great, even if you're forced to do it in C/C++'s awful macro system.
[1] https://github.com/catchorg/Catch2/blob/devel/docs/tutorial....
If you specify the operations (API) of your system in a relational algebra, then you can use that algebra to generate valid state transitions. (this essentially can construct the tree of continuations the article is discussing or enumerate the paths of this tree)
If you create a query language, then the state can be verified to match expectations at any point.
I'm not sure why we don't program like this.
I would love to learn more about this too.
I don’t really know what you’re talking about, and have a hard time imagining how ideas from relational algebra can be applied to all APIs.
For example, many database-like things already use relational algebra and an actual query language, for sure. But how does this apply to, say, a GUI toolkit or an audio device driver?
Do you have a doc or article that describes more about this? I’ve worked with relational algebra before but I’ve never heard it described with an API before. Are the responses all table based? I suppose you could wrap all calls with a SQL style API?
Traditional advice is to keep tests independent of each other, that's why the setup part gets repeated instead of being inherited from the parent tests. Independent tests can be run in parallel, dependent tests cannot be.
But I can see how this approach allows for parallelism as well. I especially like the fact that you only get one failure in case one of the steps fails.
on one hand I suspect too much code has explicit and implicit global state for this technique to be useful; on the other hand using this from the beginning might prevent introducing that sort of global state in the first place.
You would lose the performance optimisations of a copy, but you can still express tests in this tree fashion with global state. You just need to recurse the tree from the top down and run the entire path up to each node. A simple example:
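Roughly, in Python (step names are made up):

    # Each node is (setup_step, children); each leaf is an assertion. The
    # runner walks the tree top-down and, for every leaf, replays the whole
    # path from the root, so shared/global state is rebuilt per branch.
    def run(node, path=()):
        setup, children = node
        for name, child in children.items():
            if isinstance(child, tuple):       # inner node: keep descending
                run(child, path + (setup,))
            else:                              # leaf: replay the path, then assert
                state = {}
                for step in path + (setup,):
                    step(state)
                child(state)
                print("ok:", name)

    def create_account(state): state["account"] = {"name": "alice"}
    def login(state):          state["session"] = state["account"]["name"]

    def can_rename(state):
        state["account"]["name"] = "bob"
        assert state["account"]["name"] == "bob"

    def can_delete(state):
        del state["account"]
        assert "account" not in state

    run((create_account, {"account actions": (login, {"can rename": can_rename,
                                                       "can delete": can_delete})}))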
As you can see, the first step of the test is re-run from scratch for every branch of it.
> on one hand I suspect too much code has explicit and implicit global state for this technique to be useful ...
Any collaborations having observable side effects such as these are very difficult to prove correct, regardless the test approach employed.
yes but resetting the environment before each test is one way to deal with them
>>> on one hand I suspect too much code has explicit and implicit global state for this technique to be useful ...
>> Any collaborations having observable side effects such as these are very difficult to prove correct, regardless the test approach employed.
> yes but resetting the environment before each test is one way to deal with them
Not really.
Resetting an environment for each test requires that distinct tests exist for all anticipated workflow permutations. While this is onerous when the side effects are limited to in-process mutable state (such as global, session, and thread-local data), it is infeasible when global state is a persistent store[0].
0 - https://en.wikipedia.org/wiki/Data_store
FWIW, languages which support Kleisli[0] types can achieve a similar benefit by defining functional tests composed of same. Many times, "lower level" Kleisli constructs can be shared across related functional tests to reduce duplication.
0 - https://en.wikipedia.org/wiki/Kleisli_category
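As a loose Python approximation (treating a "Kleisli arrow" as just a function from state to an optional new state; all names are made up), shared lower-level steps compose into different test pipelines without repeating the setup:

    from functools import reduce
    from typing import Callable, Optional

    Step = Callable[[dict], Optional[dict]]

    def kleisli(*steps: Step) -> Step:
        # compose steps left to right, short-circuiting on the first failure (None)
        def composed(state):
            return reduce(lambda s, f: None if s is None else f(s), steps, state)
        return composed

    def create_account(s): return {**s, "account": {"name": "alice"}}
    def login(s):          return {**s, "session": s["account"]["name"]}
    def rename(s):         return {**s, "account": {**s["account"], "name": "bob"}}
    def delete(s):         return {k: v for k, v in s.items() if k != "account"}

    logged_in = kleisli(create_account, login)   # shared lower-level construct

    rename_test = kleisli(logged_in, rename)
    delete_test = kleisli(logged_in, delete)

    assert rename_test({})["account"]["name"] == "bob"
    assert "account" not in delete_test({})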
I wouldn't call 5-level-nested code "surprisingly clean", and continuations are cursed. I wouldn't want to have to debug tests that relied on continuations unnecessarily.
The code doesn't have to be nested, it can be factored into methods. Additionally, these aren't technically continuations in the sense of the "continuation passing style" popular in web programming, so they don't carry the same drawbacks. Tests are essentially organised as a pipeline of functions which gets run; it's just that the pipeline has forks in it where the data flows along both forks.
Whenever I see a test suite do 9 steps of setup to assert one thing and then (mostly) the same 9 steps again to assert some other thing, I die a little inside. Especially when the setup takes multiple seconds for each case.
The lesser evil is to just “do what you need and test everything once you are arranged”.
You won’t get hundreds of neatly separated well-named test cases which fail for a single reason. But for slow tests that isn’t as important as keeping the redundant setup away.
I like the tree idea, but once we have simple pure/immutable setup we don't really have the problem of redundant setup being slow, just ugly.
Note that the tree concept is orthogonal between writing and running - tests written like a tree can just as well desugar into a linear list of all possible combinations, re-running the setup steps every time.
How do you “test once arranged” when your tests are modifying things that conflict?
Setup: login. Test 1: delete your account. Test 2: can change username.
The last time I saw this handled was a test copy of the db per thread, with each transactional test rolled back. Not great, but it did make our pipelines 10x faster and avoided locks and issues.
If they conflict then it's a separate "story". They couldn't be part of the same timeline. You don't end up with one big ball of mud test which tests everything in the app; you end up with N different tests based on what setup they need, each one doing as many asserts as is supported by that setup. Here it might be Test 1: delete account. Test 2: change things on the account (email, profile picture, ...) and assert that each of the changes works.
BUT obviously you can just test deleting the account at the end of the test that modifies the account.
SetAccountUserName(account, "New name"); GetAccount(account.id).UserName.Should().Be("New name");
SetAccountEmail(account, "new@email"); GetAccount(account.id).Email.Should().Be("new@email");
DeleteAccount(account); GetAccount(account.id).Should().BeNull();
This is a working timeline
A related evil is when people start with a test that orchestrates the whole flow, copy it, and make one small change towards the beginning of the flow. They know, at the time of doing this, that the part they care about happens after one second. They could just let it crash and burn after making the critical assertion and the test would be just as useful.
But they commit an entirely new test filled with redundant stuff that takes way longer than is necessary and makes it unclear which assertion was the critical one. Because hey, look at all that green text in my PR. I'm so thorough.