This is the approach my uTest testing library (https://github.com/com-lihaoyi/utest) takes. I don't think it's unique to functional tests, even unit tests tend towards this pattern. Tests naturally form a tree structure, for multiple reasons:
- You usually have shared initialization nearer the root and the various cases you want to assert at the leaves.
- You want to group related tests logically together, so it's not one huge flat namespace which gets messy
- You want to run groups of tests at the same time, e.g. when testing a related feature
Typically, these different ways of grouping tests all end up with the same grouping, so it makes a lot of sense to have your tests form a tree rather than a flat list of @Test methods or whatever
Naturally you can always emulate this yourself, e.g. having helper setup methods that call each other and form a hierarchy, having a tagging discipline that forms a hierarchy to let you run tests that are related, or simply using files as the leaf-level of the larger filesystem tree to organize your tests. All that works, but it is nice to be able to simply define a tree of tests in a single file and have all that taken care of for you.
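To make the shape concrete, here's a rough Python-flavoured sketch of the idea (not uTest's actual Scala syntax; names are made up): shared setup at the root, one assertion per leaf, and a runner that rebuilds the setup for each leaf (a real tree would nest groups further).

    def account_tests():
        account = {"name": "alice", "deleted": False}   # shared initialization near the root

        def can_rename():
            account["name"] = "bob"
            assert account["name"] == "bob"

        def can_delete():
            account["deleted"] = True
            assert account["deleted"]

        # one group of related leaves that share the setup above
        return {"can rename": can_rename, "can delete": can_delete}

    # Minimal runner: rebuild the tree for every leaf so each case gets a
    # fresh copy of the shared setup, the way a tree-structured framework would.
    def run(tree_factory):
        for name in tree_factory():
            tree_factory()[name]()
            print("ok:", name)

    run(account_tests)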
"One of the most essential practices for maintaining the long-term quality of computer code is to write automated tests that ensure the program continues to act as expected, even when other people (including your future self) muck with it."
That's such a great condensation of why automated tests are worthwhile.
"To write your own testing framework based on continuation trees, all you need is a stack of databases (or rather, a database that supports rolling back to an arbitrary revision)."
PostgreSQL and SQLite and MySQL all support SAVEPOINT these days, which is a way to have a transaction nested inside a transaction. I could imagine building a testing system on top of this which could support the tree pattern described by Evan here (as long as your tests don't themselves need to test transaction-related behavior).
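Roughly, the nesting looks like this with Python's built-in sqlite3 module (a minimal sketch; the table and savepoint names are placeholders):

    import sqlite3

    # Minimal sketch of nested rollback with SQLite SAVEPOINTs.
    conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transaction control
    conn.execute("CREATE TABLE users (name TEXT)")

    conn.execute("BEGIN")                              # root of the test tree
    conn.execute("INSERT INTO users VALUES ('shared-setup-user')")

    conn.execute("SAVEPOINT branch_a")                 # first branch
    conn.execute("INSERT INTO users VALUES ('only-in-branch-a')")
    assert conn.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 2
    conn.execute("ROLLBACK TO branch_a")               # undo branch A, keep the shared setup

    conn.execute("SAVEPOINT branch_b")                 # sibling branch only sees the setup
    assert conn.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1
    conn.execute("ROLLBACK TO branch_b")

    conn.execute("ROLLBACK")                           # discard everything at the end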
Since ChatGPT Code Interpreter works with o3-mini now I had that knock up a very quick proof of concept using Python and SQLite SAVEPOINT, which appears to work: https://chatgpt.com/share/67d36883-4294-8006-b464-4d6f937d99...
I feel uneasy about using transactions like that. Eventually somebody will be puzzled enough to enable statement logging, which will not contain the data they need. Then they'll set a breakpoint in the test and get a shell on the db to see what's actually there, and they'll find it empty.
And by eventually somebody, I mean two days ago me.
I'd much rather just have a utility that actually copies the database and hands my test the copy it's allowed to mess with.
One trick I've experimented with is setting things up so that, if an assertion fails, a copy of the full current state of the database is automatically captured and written out to a file on disk.
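A minimal sketch of that with sqlite3 (paths and names are made up): wrap the test so an assertion failure dumps the live database to disk before re-raising.

    import sqlite3

    def run_with_db_capture(test, conn, dump_path="failed_test_state.db"):
        try:
            test(conn)
        except AssertionError:
            dest = sqlite3.connect(dump_path)
            conn.backup(dest)          # copy the full current state to a file
            dest.close()
            raise

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice')")

    def failing_test(c):
        assert c.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 2  # fails

    try:
        run_with_db_capture(failing_test, conn)
    except AssertionError:
        print("state captured to failed_test_state.db")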
An approach we've used is Postgres' "create database foo2 template foo1" syntax to essentially snapshot the db under test at various points and use those snapshots to roll back as needed.
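A rough sketch of that snapshot/restore cycle with psycopg2 (database names are placeholders; CREATE DATABASE has to run outside a transaction, and the template can't have other open connections):

    import psycopg2

    admin = psycopg2.connect(dbname="postgres")
    admin.autocommit = True            # CREATE/DROP DATABASE can't run inside a transaction

    def snapshot(src, dest):
        with admin.cursor() as cur:
            cur.execute(f'DROP DATABASE IF EXISTS "{dest}"')
            cur.execute(f'CREATE DATABASE "{dest}" TEMPLATE "{src}"')

    def restore(snap, dest):
        # "rolling back" = dropping the scratch db and re-cloning the snapshot
        snapshot(snap, dest)

    snapshot("app_test", "app_test_snap1")   # checkpoint the current state
    # ... run destructive tests against app_test ...
    restore("app_test_snap1", "app_test")    # roll back to the checkpoint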
End-to-end (e2e) tests are slow and flaky. They don't have to be, but the effort to fix breakage starts consuming most of the available time.
One idea is to separate scraping from verification. The latter would run very fast and be reliable: it only tests against stored state.
Then scraping is just procedural, clicking things, waiting for page loads, and reading page elements into a database.
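A rough sketch of that split in Python (the page-reading step and the selectors are stand-ins): phase one scrapes page state into SQLite, phase two asserts only against the stored rows.

    import sqlite3

    def scrape(read_element, db):
        # slow, procedural, browser-driven part; read_element stands in for
        # "click things, wait for loads, read a selector"
        db.execute("CREATE TABLE IF NOT EXISTS page_state (field TEXT, value TEXT)")
        for field, selector in [("first_name", "#first-name"), ("email", "#email")]:
            db.execute("INSERT INTO page_state VALUES (?, ?)", (field, read_element(selector)))

    def verify(db):
        # fast, reliable part: asserts only against the stored state
        rows = dict(db.execute("SELECT field, value FROM page_state"))
        assert rows["first_name"], "integrity check: field scraped but empty"
        assert "@" in rows["email"]

    db = sqlite3.connect(":memory:")
    scrape(lambda sel: "alice@example.com" if "email" in sel else "Alice", db)
    verify(db)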
Some consequences: you need integrity checks to ensure data has actually been read (e.g. the first-name field selector was updated but never populated), self-healing selectors (AI, et al.), and a way to certify test results against known versions (for fixing the scraper amid a UI redesign).
A lot of effort is saved by using screenshot diffing of, say, React components, especially for edge cases. It also (hopefully) shifts test responsibility left to the devs.
Ideally, we only have some e2e tests, mostly happy paths, that also act as integration tests.
We could combine these ideas with "stacked databases" from the article and save on duplication.
Finally, the real trick is knowing, in the face of changes, which tests don't have to run, making the whole run take less time.
> End-to-end (e2e) tests are slow and flaky.
If they are slow, it means your application is slow. Good thing your tests make you realize it so you can work on it.
If they are flaky, either your application is flaky or your UI is hard to use. Either way, that's something your tests tell you you have to fix.
And last: if your tests are all independent, why not run them all in parallel? With IaC you should be able to provision one instance of your architecture per test (or maybe per dozen tests) easily.
Even in parallel, there are tradeoffs: run them all in one container, or chunk them out to available workers. The former runs into resource constraints; the latter takes up everything in the shared pool.
With IaC, emulating a constellation of all dependent services along with the site is technically feasible. (There are other possible constraints.)
What's your ideal scenario? For example, k8s + cloud, ephemeral db, auto-merged IaC file of a thousand services, push-button perf testing, regression suite with a hundred bots, etc.
> What's your ideal scenario?
An on-premise k8s cloud where you can deploy many instances of the exact same services you have in prod. Let's say an E2E test takes 5 s to run, your deployment takes 2 min, and you want to stay under the 5 min line for running your test suite: deploy an instance of all your services and their databases per batch of 30 tests (2 min + 30 × 5 s = 4.5 min per batch, with the batches running in parallel).
I can understand this not really being possible at an Amazon scale. But for most businesses? A good beefy server should be enough.
Even without the continuation piece, it has always puzzled me why the test frameworks that I've used (mostly pytest and catch) don't explicitly model dependencies between layers. Especially in a system where layers have been carefully levelized. Assuming for the sake of example that there are no mocks involved, if subsystem B depends on subsystem A (say A is some global utility classes), then I would want all of A's unit tests to pass before running any of subsystem B's tests. Not sure why this is absent, or perhaps I'm using the wrong test systems.
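A toy, framework-agnostic sketch of what that could look like (suite names and contents are made up): each suite declares its dependencies, suites run in dependency order, and dependents are skipped when anything below them fails.

    def suite_a():
        assert 1 + 1 == 2          # stand-in for subsystem A's unit tests

    def suite_b():
        assert "a".upper() == "A"  # stand-in for subsystem B's tests

    suites = {
        "A": (suite_a, []),        # (test function, dependencies)
        "B": (suite_b, ["A"]),     # B is only worth running if A passed
    }

    failed = set()
    for name, (run_suite, deps) in suites.items():  # assumes keys are in level order
        if any(d in failed for d in deps):
            print(f"SKIP {name}: depends on failed {deps}")
            continue
        try:
            run_suite()
            print(f"PASS {name}")
        except AssertionError:
            failed.add(name)
            print(f"FAIL {name}")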
You can absolutely do this with hypothesis stateful testing. https://hypothesis.readthedocs.io/en/latest/stateful.html
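For example, a minimal made-up machine looks roughly like this: Hypothesis explores sequences of rule calls and checks the invariant after every step, shrinking any failing sequence.

    from hypothesis import strategies as st
    from hypothesis.stateful import RuleBasedStateMachine, rule, invariant

    class CounterMachine(RuleBasedStateMachine):
        def __init__(self):
            super().__init__()
            self.real = 0    # stand-in for the system under test
            self.model = 0   # simplified model to compare against

        @rule(n=st.integers(min_value=0, max_value=100))
        def add(self, n):
            self.real += n
            self.model += n

        @rule()
        def reset(self):
            self.real = 0
            self.model = 0

        @invariant()
        def agrees_with_model(self):
            assert self.real == self.model

    # collected by pytest/unittest as a normal test case
    TestCounter = CounterMachine.TestCase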
The C++ testing framework Catch2 enables this kind of testing. The first time I saw it I couldn't figure out how some of the tests would even pass.
It turns out that using some evil macro magic, each test re-runs from the start for each inner section [1]. It also makes deduplicating setup code completely painless and natural.
You just have to get over the completely non-standard control flow. It's a good standard bearer for why metaprogramming is great, even if you're forced to do it in C/C++'s awful macro system.
[1] https://github.com/catchorg/Catch2/blob/devel/docs/tutorial....
If you specify the operations (API) of your system in a relational algebra, then you can use that algebra to generate valid state transitions. (this essentially can construct the tree of continuations the article is discussing or enumerate the paths of this tree)
If you create a query language, then the state can be verified to match expectations at any point.
I'm not sure why we don't program like this.
I would love to learn more about this too.
I don’t really know what you’re talking about, and have a hard time imagining how ideas from relational algebra can be applied to all APIs.
For example, many database-like things already use relational algebra and an actual query language, for sure. But how does this apply to, say, a GUI toolkit or an audio device driver?
Do you have a doc or article that describes more about this? I’ve worked with relational algebra before but I’ve never heard it described with an API before. Are the responses all table based? I suppose you could wrap all calls with a SQL style API?
Traditional advice is to keep tests independent of each other, that's why the setup part gets repeated instead of being inherited from the parent tests. Independent tests can be run in parallel, dependent tests cannot be.
But I can see how this approach allows for parallelism as well. I especially like the fact that you only get one failure in case one of the steps fails.
on one hand I suspect too much code has explicit and implicit global state for this technique to be useful; on the other hand using this from the beginning might prevent introducing that sort of global state in the first place.
You would lose the performance optimisations of a copy, but you can still express tests in this tree fashion with global state. You just need to recurse the tree from the top down and run the entire path up to each node. A simple example:
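Roughly, in Python (step names are made up):

    # Each node is (setup_step, children); each leaf is an assertion. The
    # runner walks the tree top-down and, for every leaf, replays the whole
    # path from the root, so shared/global state is rebuilt per branch.
    def run(node, path=()):
        setup, children = node
        for name, child in children.items():
            if isinstance(child, tuple):       # inner node: keep descending
                run(child, path + (setup,))
            else:                              # leaf: replay the path, then assert
                state = {}
                for step in path + (setup,):
                    step(state)
                child(state)
                print("ok:", name)

    def create_account(state): state["account"] = {"name": "alice"}
    def login(state):          state["session"] = state["account"]["name"]

    def can_rename(state):
        state["account"]["name"] = "bob"
        assert state["account"]["name"] == "bob"

    def can_delete(state):
        del state["account"]
        assert "account" not in state

    run((create_account, {"account actions": (login, {"can rename": can_rename,
                                                       "can delete": can_delete})}))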
As you can see, the first step of the test is re-run from scratch for every branch of it.
> on one hand I suspect too much code has explicit and implicit global state for this technique to be useful ...
Any collaborations having observable side effects such as these are very difficult to prove correct, regardless the test approach employed.
yes but resetting the environment before each test is one way to deal with them
>>> on one hand I suspect too much code has explicit and implicit global state for this technique to be useful ...
>> Any collaborations having observable side effects such as these are very difficult to prove correct, regardless the test approach employed.
> yes but resetting the environment before each test is one way to deal with them
Not really.
Resetting an environment for each test requires that distinct tests exist for all anticipated workflow permutations. While this is onerous when the side effects are limited to in-process mutable state (such as global, session, and thread-local data), it is infeasible when global state is a persistent store[0].
0 - https://en.wikipedia.org/wiki/Data_store
FWIW, languages which support Kleisli[0] types can achieve a similar benefit by defining functional tests composed of same. Many times, "lower level" Kleisli constructs can be shared across related functional tests to reduce duplication.
0 - https://en.wikipedia.org/wiki/Kleisli_category
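As a loose Python approximation (treating a "Kleisli arrow" as just a function from state to an optional new state; all names are made up), shared lower-level steps compose into different test pipelines without repeating the setup:

    from functools import reduce
    from typing import Callable, Optional

    Step = Callable[[dict], Optional[dict]]

    def kleisli(*steps: Step) -> Step:
        # compose steps left to right, short-circuiting on the first failure (None)
        def composed(state):
            return reduce(lambda s, f: None if s is None else f(s), steps, state)
        return composed

    def create_account(s): return {**s, "account": {"name": "alice"}}
    def login(s):          return {**s, "session": s["account"]["name"]}
    def rename(s):         return {**s, "account": {**s["account"], "name": "bob"}}
    def delete(s):         return {k: v for k, v in s.items() if k != "account"}

    logged_in = kleisli(create_account, login)   # shared lower-level construct

    rename_test = kleisli(logged_in, rename)
    delete_test = kleisli(logged_in, delete)

    assert rename_test({})["account"]["name"] == "bob"
    assert "account" not in delete_test({})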
I wouldn't call 5-level-nested code "surprisingly clean", and continuations are cursed. I wouldn't want to have to debug tests that relied on continuations unnecessarily.
The code doesn't have to be nested, it can be factored into methods. Additionally, these aren't technically continuations in the sense of the "continuation passing style" popular in web programming, so they don't carry the same drawbacks. Tests are essentially organised as a pipeline of functions which gets run; it's just that the pipeline has forks in it where the data flows along both forks.
Whenever I see a test suite do 9 steps of setup to assert one thing and then (mostly) the same 9 steps again to assert some other thing, I die a little inside. Especially when the setup takes multiple seconds for each case.
The lesser evil is to just “do what you need and test everything once you are arranged”.
You won’t get hundreds of neatly separated well-named test cases which fail for a single reason. But for slow tests that isn’t as important as keeping the redundant setup away.
I like the tree idea, but once we have simple pure/immutable setup we don't really have the problem of redundant setup being slow, just ugly.
Note that the tree concept is orthogonal between writing and running - tests written like a tree can just as well desugar into a linear list of all possible combinations, re-running the setup steps every time.
How do you “test once arranged” when your tests are modifying things that conflict?
Setup: login. Test 1: delete your account. Test 2: can change username.
The last time I saw this handled was a test copy of the db per thread, with each transactional test rolled back. Not great, but it did make our pipelines 10x faster and avoided locks and issues.
If they conflict then it's a separate "story". They couldn't be part of the same timeline. You don't end up with one big ball of mud test which tests everything in the app; you end up with N different tests based on what setup they need, each one doing as many asserts as is supported by that setup. Here it might be Test 1: delete account. Test 2: change things on the account (email, profile picture, ...) and assert that each of the changes works.
BUT obviously you can just test deleting the account at the end of the test that modifies the account.
SetAccountUserName(account, "New name"); GetAccount(account.id).UserName.Should().Be("New name");
SetAccountEmail(account, "new@email"); GetAccount(account.id).Email.Should().Be("new@email");
DeleteAccount(account); GetAccount(account.id).Should().BeNull();
This is a working timeline
A related evil is when people start with a test that orchestrates the whole flow, copy it, and make one small change towards the beginning of the flow. They know, at the time of doing this, that the part they care about happens after one second. They could just let it crash and burn after making the critical assertion and the test would be just as useful.
But they commit an entirely new test filled with redundant stuff that takes way longer than is necessary and makes it unclear which assertion was the critical one. Because hey, look at all that green text in my PR. I'm so thorough.