The Attraction of Complexity


Introduction

How is complexity distributed through a codebase? Does this distribution present similarities across different projects? And, especially, are we more likely to change complex code or simple code?

This summer, towards the end of my article about Technical Interest, I found that I wanted to write something like: “not only is working on complex pieces of code expensive, but we are also more likely to work on those complex pieces than on the simple ones”. This rings true to my ears, and everyone I talked with on the topic of complexity reasons the same way: the more complex a piece of code is, the more logic it contains; since the goal of a system modification is to change its logic, we are more likely to end up touching a logic-rich part than a low-complexity part. But is this reasoning backed by reality?

I could think of only one way to know: go through all the changes in the history of multiple projects and calculate the complexity of the functions containing each change, then group data together to see if it makes any sense.

Collecting the Data

It all starts with git, of course: I created a node.js script that scans the history of a GitHub project and checks out the files changed by every commit, putting them in a folder named after the commit id. The script also writes a file, changes.csv, which lists the filepath and line number of every change.
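I haven't included the actual script here, so what follows is only a minimal sketch of the idea: it assumes plain git commands, emits the changes.csv columns described above (commit id, filepath, line number) and skips the per-commit checkout of the changed files that the real script performs.

// Sketch only: walk the history and record the changed line numbers of every commit.
// `git show --unified=0 --format=` prints bare diffs; the `+start,count` part of each
// hunk header gives the changed lines in the new version of the file.
const { execSync } = require('child_process');
const fs = require('fs');

const run = (cmd) => execSync(cmd, { encoding: 'utf8', maxBuffer: 64 * 1024 * 1024 });

const commits = run('git rev-list --reverse HEAD').trim().split('\n');
const rows = [];

for (const sha of commits) {
  let file = null;
  for (const line of run(`git show --unified=0 --format= ${sha}`).split('\n')) {
    const fileHeader = line.match(/^\+\+\+ b\/(.+)$/);
    if (fileHeader) { file = fileHeader[1]; continue; }
    const hunk = line.match(/^@@ .* \+(\d+)(?:,(\d+))? @@/);
    if (hunk && file && file.endsWith('.js')) {
      const start = parseInt(hunk[1], 10);
      const count = hunk[2] === undefined ? 1 : parseInt(hunk[2], 10);
      for (let i = 0; i < count; i++) rows.push(`${sha},${file},${start + i}`);
    }
  }
}

fs.writeFileSync('changes.csv', rows.join('\n') + '\n');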

Then I wrote a script in SonarJS that reads changes.csv, parses every file mentioned therein and calculates the Cognitive Complexity of the function directly containing the line listed in the change. It stores the filepath, the function name, its complexity and its size in a new file: stats.csv.
The same script also calculates the complexity of all the functions present in the project at HEAD, and stores the aggregated size of all functions sharing the same complexity in another file: complexity_distribution.csv.

The Dataset

The projects I selected are:

I picked them to get a mix of large, medium and small projects, both libraries and final products. I also wanted a large number of changes overall: these five projects together account for roughly 220,000 JS function changes.

Why only JavaScript

Since I work on SonarJS most of the time, it's easy for me to customize it as I need, so focusing exclusively on JavaScript projects spared me a lot of effort. This leaves an interesting area of investigation wide open: how do these findings transpose, if at all, to code-bases written in languages with different features and domains of application?

Project Code Distribution

I collected the overall code distribution (the data found in complexity_distribution.csv) mostly because I needed it for the change frequency normalizations later on, but a first interesting finding is that, looking at how code is distributed across complexity at a given point in time, the distributions are quite similar across very diverse projects. Here are a couple of examples, followed by the aggregation of all five projects under study. The amount of code in each project's graph is normalized to the largest value, so all normalized values fall between 0 and 1. I calculated the absolute values by counting the number of expressions/statements and, depending on the project, the value 1 can represent tens of thousands or hundreds of thousands of those.

code_distribution_restbase
code_distribution_lighthouse
For this and any other aggregation across the five target projects, don't forget that I add normalized values; this way the relative size of each project has no impact (a sketch of this normalization follows the plots).
code_distribution_combined
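As a minimal sketch of what that normalization and aggregation amount to (the function names and the in-memory shape of complexity_distribution.csv are my assumptions, not the original scripts'):

// distribution: { complexityValue: totalExpressionCount } for one project
function normalize(distribution) {
  const max = Math.max(...Object.values(distribution));
  const normalized = {};
  for (const [complexity, count] of Object.entries(distribution)) {
    normalized[complexity] = count / max;      // the largest bucket becomes 1
  }
  return normalized;
}

// combined plot: sum the normalized values so every project weighs the same
function combine(projectDistributions) {
  const combined = {};
  for (const distribution of projectDistributions.map(normalize)) {
    for (const [complexity, value] of Object.entries(distribution)) {
      combined[complexity] = (combined[complexity] || 0) + value;
    }
  }
  return combined;
}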

Complexity and Change Frequency

We are now getting closer to the core of the matter: let's see how many times code is changed over the history of a project depending on the code's complexity. Here's Restbase:
code_changes_restbase
The vast majority of changes happens in low-complexity code. It would seem that there's an inverse relationship between the likelihood of change and the complexity of code; but this diagram is also strikingly similar to Restbase's general code distribution. If I normalize the number of changes and plot them together with the code distribution, the similarity is obvious:
code_changes_and_distribution_restbase

The same is true for the other projects, for instance, Keystone:

code_changes_and_distribution_keystone
And all projects combined:

code_changes_and_distribution_combined

This makes sense: if a project contains code, that code has to come from changes; more code, more changes. I can't stop here though: the sheer mass of code, and the changes that were required to initially write it, might be hiding the true trends of change frequency over complexity.

Change Frequency Density

If I normalize the number of changes over the amount of code, I am calculating how frequently a piece of code changes on average. I call this the change frequency density, or change frequency per expression. This measure can also be interpreted as how likely a piece of code is to change depending on the complexity of the function that contains it.
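In formula form (my notation, not taken from the original plots):

\[
\text{density}(c) \;=\; \frac{\text{number of changes to code inside functions of complexity } c}{\text{number of expressions inside functions of complexity } c}
\]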

density

This looks almost like random noise: after an initial peak (one change for every ten expressions) for code at zero complexity, the number of changes per expression drops and continues to decrease until well after complexity 15; then it starts to jump around randomly.

Function Gravity

The problem is that complexity is a function-based metric, and the goal of this whole exercise is to find out whether a function with a given complexity is more or less likely to change, not whether a specific expression is more or less likely to change when it belongs to a function with a given complexity. What I want is the number of changes in the history of a function of a given complexity. So I group together all the changes of every function (something I can do because I've collected the fully-defined name of the function enclosing every git change), then I calculate the average complexity of the function through its history, and all the changes of the function are accounted for at that average complexity. In short: function-aggregated changes.
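A minimal sketch of that aggregation, assuming stats.csv has been parsed into one row per changed line, each carrying the enclosing function's name and its complexity at the time of the change (the field names are mine):

// One input row per changed line: { functionName, complexity }
function aggregateByFunction(statRows) {
  const perFunction = new Map();                    // functionName -> { changes, complexities }
  for (const { functionName, complexity } of statRows) {
    const entry = perFunction.get(functionName) || { changes: 0, complexities: [] };
    entry.changes += 1;
    entry.complexities.push(complexity);
    perFunction.set(functionName, entry);
  }

  const changesPerComplexity = new Map();           // average complexity -> total changes
  for (const { changes, complexities } of perFunction.values()) {
    const average = Math.round(
      complexities.reduce((sum, c) => sum + c, 0) / complexities.length
    );
    changesPerComplexity.set(average, (changesPerComplexity.get(average) || 0) + changes);
  }
  return changesPerComplexity;
}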

aggregated-changes

Looking carefully, the pink plot shows some interesting bumps, especially at the higher complexities, but the overall code distribution still dominates the shape of the data. What if I divide the aggregated changes by the number of functions? That should really show how much a single function attracts changes, regardless of the sheer number of functions existing at a given complexity. I call this ‘Function Gravity’.
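Continuing the sketch above, the gravity of a complexity bucket would then be (again, my names, not the original script's):

// gravity(c) = function-aggregated changes at average complexity c / number of functions at c
function gravity(changesPerComplexity, functionsPerComplexity) {
  const result = new Map();
  for (const [complexity, changes] of changesPerComplexity) {
    const functionCount = functionsPerComplexity.get(complexity) || 0;
    if (functionCount > 0) result.set(complexity, changes / functionCount);
  }
  return result;
}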

gravity

I suspect the final peak is more of an anomaly due to the extreme rarity of functions with an average complexity above 25. Let's reduce the range to a maximum of 25, to stay in an area where the data-set is not too thin.

gravity-25

I distinguish two zones in this graph. From zero to complexity 10, the attraction grows almost monotonically; after 10 there are peaks and valleys, with a similar mean growth, but with minima that, despite sitting at more than twice the complexity, can have a much lower attraction to change than functions with average complexity 10.
This means that using the complexity of a function is a relatively good way to target the refactoring effort: functions with complexity between 8 and 12 today are likely to accumulate many more changes, as they get to higher complexities, than functions with complexity 0 to 6 would (remember the graph shows a function's average complexity over its life).
This is no longer true after complexity 12 though: while targeting a function that today has complexity 16 might allow you to refactor a piece of code that will change a lot, the vast oscillations in this part of the graph make this a hazardous bet; you can win big, but you can also spend days refactoring a complicated function that in the future will not change more frequently than a zero-complexity getter.
Can we get a better metric to spot functions which change very frequently?

Change Potential

Functions that have historically received many changes should be larger on average. But does that hold true? I believe the macroscopic approach has now reached its limits, and to proceed further I should start studying specific functions to see if there are any indicators of their future change frequency. Still, a final set of graphs:

gravity-and-size-25

This seems to confirm that, between 0 and 10, favoring the refactoring of higher complexity functions over lower complexity functions is a safe bet, especially in the range 7 – 10. It also shows the random nature of code changes after complexity 15: it becomes possible both to have lower average sizes with proportionally much larger change histories, and to have large functions which change very rarely. This might be because very complex code impacts developer behavior: for instance, it's not unusual for developers to group multiple changes to especially complex functions to avoid having to re-learn them multiple times. It's in those places, at high complexity, where function changes and function sizes diverge strongly, that I would like to abandon the aggregate view and look at specific cases, but that will have to wait.

Conclusions

There seems to be a correlation between function complexity and number of changes, beyond what mere function size would suggest, at least between complexity 0 and 10. After complexity 10 things start to change; after complexity 15 the appearance of relatively small functions with lots of changes, alongside large, high-complexity functions with few changes, makes for some great opportunities and big risks.

Side Notes

While I was writing this article I happened to think of the open-closed principle. OCP is notoriously hard to measure, and in a discussion on the XP mailing list years ago I even went so far as to say that it's impossible to measure (or apply) OCP up front; it can only be used to evaluate the soundness of a system's modularity after the fact. From this point of view, the distribution of changes over complexity might be seen as a metric of how well OCP is respected: if existing complexity is most often the site of new changes, OCP is not being respected; if new changes tend to happen mostly in low-complexity zones of the code instead, we are implementing new features without modifying existing logic, and OCP is respected.


Quantifying the cost of Technical Debt


TL/NR

To make sure a cleanup effort produces the maximum positive impact on the code base, in the past I used a heuristic that centers on frequently changed code that also suffers from Technical Debt and poor automated test coverage. The idea is to minimize Technical Interest, which I define as the effort lost due to Technical-Debt-generated resistance to change. In order to quantify Technical Interest accrued for a set of changes over a period, without measuring actual development time against a benchmark, I propose the following rough formula:

tlnr

Which Technical Debt?

In this article I mention Technical Debt frequently, but it's far from a well-defined term; moreover, the common understanding of this concept has diverged from what its creator, Ward Cunningham, originally meant.
Cunningham defined Technical Debt as the misalignment between a team’s current best understanding of the problem domain and what is instead expressed in the code: imagine a cab fleet management system written in perfectly clean code, its design expressively describing the concepts of drivers, vehicles and car positions. Let’s then imagine that at some moment the system’s developers discover that by introducing the concept of ‘areas of availability’ they would be able to both enrich and simplify their complicated cab-selection features. If they decide they can’t afford to do the necessary refactoring to introduce this new concept right now and instead keep this insight in their mind, but not in their code, for a while, they are accruing Cunningham’s definition of Technical Debt.

The common understanding of Technical Debt is instead related to code which is just poorly implemented: obscure and convoluted logic that does not express the underlying problem domain at all; poor responsibility distribution; absent, inconsistent or not isolated components and more hallmarks of poor technique. By this definition the cab fleet management system would suffer Technical Debt when the driver module owns and manages the current car’s mileage, the car position is a plain string containing gps coordinates that get passed around in every other entity of the system and slow calls to remote resources are all synchronous with the UI.
If the first definition is concerned about expertly painted portraits not catching facets of a complex personality, the second is about accidental brush strokes and misplaced facial features.

This article is concerned with Technical Debt by the common definition: the thing that is not just preventing software from excelling in the long term, but that is able to calcify its evolution to the point where, after just a few man-months of work, dozens of man-days are needed to add a new drop-down box, while refactorings are as traumatizing as full rewrites, and even less likely to succeed.

Debt and Interest

For years now people have been talking of Technical Debt: developers wail about it; seniors prove their salt attacking it with sweeping refactorings whenever the stakeholders are looking the other way, often losing themselves and their credibility in the crusade; many a failed project's corpse has been imputed to a manager letting this pest breed in it 'until after the release'. Regardless of how frequently it's blamed, calcifying low quality pervades the industry.
I believe that one of the reasons for stakeholder complacency in generating Technical Debt (a complacency that starts with obviously low-quality-producing staffing practices), and for developers' ineffectiveness in repaying it, is the fact that, while Technical Debt is quantified, its effects are not.
We know (or at least believe we know) how much we would need to work to 'fix it all', but we have no clue how much we are being slowed down by not fixing it right now. This also means that, given a large landscape of debt in a codebase, we don't know where reducing the Technical Debt will produce the greatest benefit for future efforts. Which part of our indebted code is generating the highest interest, the highest amount of attrition to our limited development resources?

Paying Interest

Unlike monetary debt, whose interest rate is expressed over time, Technical Debt does not generate interest linearly, nor continuously, with time. The most hideous working-mess-of-code will not generate extra effort if it never needs to evolve. Much like a game where the situation stays still until one of the players makes a move, in software development nothing happens unless you have to act on the code.
When you do have to act though, depending on the depth of debt in the area you are working on, you’ll pay more or less interest. This will happen in the form of time spent understanding complicated code, manually testing untested code and bug-fixing regressions.
So, when are we paying Technical Interest? When we change code. For every modified statement there’s a price to pay.
How much do we pay? We could measure it empirically by developing a feature in the system as it is, then refactoring the system until it gets to near-zero Technical Debt and re-developing exactly the same feature. The difference in effort is the Technical Interest. The problem with this approach is obvious: from a commercial point of view it is an exercise in futility. In the absence of such empirical data, I propose a formula that I believe might approximate the truth:

  • The effort to ‘understand the code’ is linear in the Cognitive Complexity of the function containing the changed statement: U * CC
  • U can be considered constant in most cases and can be set to 1 until the amount of effort actually spent in developing the change is available, at which point it can be set to u
  • The effort to ‘bug-fix’ the change and induced regressions is linear in the lack of automated test Branch Coverage of the function containing the changed statement: T * (1 – BC)
  • T can be considered constant in most cases and can be set to 1 until the amount of effort actually spent in developing the change is available, at which point it can be set to t
  • Finally, the overall interest paid for a given period is the sum of all the bits of interest paid for each software change applied in the period (see the rendering below the formula image):

first
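Since the original formula appears only as an image (the "first" placeholder above), here is my own LaTeX rendering of the first approximation as described by the bullets; the notation is mine and may differ from the original:

\[
\text{Interest}_{\text{period}} \;\approx\; \sum_{\text{changes in the period}} \Big( U \cdot CC(f_{\text{change}}) \;+\; T \cdot \big(1 - BC(f_{\text{change}})\big) \Big)
\]

where f_change is the function containing the changed statement, CC its Cognitive Complexity and BC its automated-test Branch Coverage.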

     

When U and T cannot be assumed constant

If system module boundaries are fuzzy (signatures are not there, or just wrong…) or entirely absent, a developer who needs to understand the code he needs to change will not be able to stop at function calls, but will actually have to move around reading the implementations of the things being called by the target code. Since efferent coupling gives an indication of how many things your code depends on, in the case of weak boundaries it also gives an indication of how many things the developer will have to understand beyond the immediate scope he has to act upon. This is why I suggest a second approximation that defines the U factor as a function of the Efferent Coupling of the function we are changing:

U_2

Similarly, if test isolation is very poor, the obvious symptom is that many tests break at every change, and the more tests break, the harder it is to find the source of the breakage. Presuming homogeneously poor test isolation, the more a change is depended upon, the more test failures it will cause (we might avoid making this assumption by using a metric that I call ‘test distance’ and which I've not yet documented). I thus suggest a second approximation that defines the T factor as a function of the Afferent Dependencies on the function we are changing:

T_2

Combining everything together, this gives the following second approximation formula:

second

Final Thoughts

  • In this article I've used Cognitive Complexity as a way to evaluate the effort to understand code; while it is not perfect, I prefer it over Cyclomatic Complexity. I believe that, by dropping the relation to logical branches, Cognitive Complexity better models how a human brain is impacted by code structures; but since Cognitive Complexity has only recently been defined and is supported by only some code quality tools, all my direct experiences on the topic of Technical Interest are based on Cyclomatic Complexity.
  • To reiterate the gist of the second section: what I'm proposing relates to targeting a cleanup effort, a raw code quality improvement, not a conceptual refactoring (again, referring to Cunningham's definition of Debt). Refactoring from one conceptual model to another should not be conditioned by change-frequency considerations, but rather by how salient the new model will be for the future of the system being developed.
  • While I've used variants of the formula proposed above on code-bases I was intimate with, in the absence of a dedicated tool it's impractical to go through the history of a project to find the hotspots. As a result, I often find myself (and others) taking a shortcut to find the next cleanup target: pick the most highly complex code. The rationale is that, since high complexity induces a high change cost and such a large amount of logic is very likely to attract future changes, high-complexity functions are a safe bet. I've recently developed a set of scripts to find out exactly this: are high-complexity functions a good target for refactoring if we don't have the luxury of a full historical analysis of Technical Interest? I hope to report the results soon.

Things I have learnt as the software engineering lead of a multinational

A surprisingly long cycle has just closed for me and I think it’s a good time to share some lessons learned.

I have been collecting these points in the last six months, but none of them popped up recently, they have taken shape over a few years and they include both things I did and that I failed to do.

Most of these points act as a personal reminder as well as a set of suggestions to others; don’t be surprised if some of them read cryptic.

So, here it is. A summary of what I learnt in the last six years:

The productive organization

1.  When the plan is foggy, that’s the moment to communicate as broadly as possible. In fact you should not frame it as a plan, you frame it as the situation and the final state you want to achieve. When you have a detailed plan, that’s when you DON’T need to communicate broadly. So, clearly state the situation and clearly state the foggy goal to everyone who will listen.

2.  Don't be prudish. If you fear that people will lose faith in you because of the foggy goals and the dire situation statement, you are painting yourself into a heroic corner. People just need to hear and share the situation they are in. Having a common understanding will act as a bond among them and with you, which is actually all you need to make them work out the right answers.

3.  Don’t assume that a specific communication medium can’t be effective. Mass emails and top-down communication are not taboo: just because most such communications are irrelevant it doesn’t mean yours will be perceived as such.

4.  Teams don’t self-organize unless you organize them to do so.

5.   Fostering personal initiative in every developer requires showing the vital need for personal initiative. These are birds that mostly don’t fly out of the cage just because you open the door. You must find a way to show them that the cage is burning.

6.  People sitting side-by-side can communicate less than people sitting a continent away. Communication is a chemical reaction that requires catalysers, the thing you get by co-locating people is lowering the cost of the catalysers, but no setup creates automatic communication.

7.  Within a development organization both good and bad communication exist, but they are not a function of politeness or rudeness, it’s much more a matter of clarity and goals. You need to learn what the good kind of communication looks like, find some examples and use them as a reference for everyone.

8.  Fire people whenever you can. There's often someone to fire, but not many opportunities to do so. When you are given a lot of opportunities to fire people, it is often due to a crisis situation and you'll likely fire or otherwise lose the wrong people. People appreciate it when you fire the right people, so don't worry about morale. Also, the average quality of people tends to grow more when dismissing than when recruiting.

9.  Hire only for good reasons. Being overworked is not a good reason to hire. Instead, hire to be ready to catch opportunities, not to survive the current battles.

10.  It's often better to lose battles than to staff desperately and win desperate battles at all costs (World War I anyone?).

11.  Don’t export recruitment, recruitment must be the direct responsibility of everyone in the organization.

12. People must select their future colleagues; there are infinite benefits in this, but it must not become a conclave. Keep the process in the hands of the people who do the work, but make it as transparent as possible.

13.  Always favor actual skill testing in recruitment. When you don't feel that you are directly testing the candidate's skills, you are either not competent enough in that skill or you have switched to just playing a set piece (I call this The Interview Theatre) and you will ultimately decide on a whim. Not good.

14.  Build some of your teams as training and testing grounds for freshmen. Put some of your best people there.

15.  Lack of vision is not agile, it is not data-driven, it is not about ‘taking decisions as late as possible’, it is not something that you should paint out in a positive light at all. It’s just lack of vision, and it’s not good.

16.  Construction work is not a good metaphor for software/product development. Factory work neither. Allied junior-officer initiative during the first week after D-Day in WWII is probably a good guideline, but it is still not a good metaphor overall and, anyway, not well known enough to base your communication on.

Yourself

17.  Train people to do all of the previous points. Including this one.

18.  Don’t shy away from leading without doing, it is unavoidable, so just do it. Then do some work to stay pertinent.

19.  If you are not able to hire and fire people, leave. Or stay for the retirement fund if you can stomach it.

20.  The Sith are right, rage propels. But the Jedi are right, you must not let it control you. What nobody tells you is that the rage game is intrinsically tiring and rage will take control as soon as you get too tired, so stop well before.

21.  Write down the situation, for your own understanding just as much as for the others’.

22.  If you feel like you don’t know what you are doing it’s probably because you don’t know what you are doing and that’s bad. Anyway, until you learn you don’t really have much of an alternative. Just don’t let that feeling of desperation numb your ability to learn. It does.

23.  There’s more and more good content to read and absorb on effective organizations. Don’t despair and don’t stop reading.

24.  Don’t let entropy get at your daily routine. Avoid entropy-driven work.

25.  Ask questions to people in order to make sure they understand. Trust people who do the same to you. “Do you understand?” is NOT a valid question.

26.  Avoid having people waiting on you. Don’t create direct dependencies on your work or decisions, make sure people feel that they can take decisions and still stay true to the vision without referring to you (hence the importance of point 1).

27.  Take the time to coach people in depth. Really, spend time with the people who are or have the potential to be great professionals in the organisation.

28.  The time you spend with the people you see most potential in is endorsement enough. Avoid any other kind of endorsement of individuals. Unless you are leaving.

The Entropic Organization

29.  An organization populated by a majority of incompetents has less than zero net worth: it is able to destroy other adjacent organizations that are not similarly populated.

30.  Incompetence is fiercely gregarious while knowledge is often fractious; the reason for this is that raw ideas transfer more easily through untrained minds than refined ideas transfer through trained minds. There's a reason why large organisations focus so much on simple messages; a pity that difficult problems often have simple solutions that don't work.

31.  Entropy self-selects. Hierarchical and other kinds of entropic organizations always favor solutions that survive within entropic organizations. Thus they will favor easy over simple, complex over difficult, responsibility-dilution over empowerment, accountability over learning, shock-therapy over culture-nurturing. This is the reinforcing loop that brings ever-increasing entropy into the system: entropy generates easy decisions with complex and broken implementations, which in turn generate more entropy. An example of an easy decision with a complexity-inducing implementation: the scenario "our company does not have a coherent strategy; as such, many projects tend to deliver results that are not coherent, hampering the organic growth of our capabilities" will be answered by the most classic knee-jerk decision-making pattern, "we don't know how to do X, so let's overlay a new Y to enforce X", in this case: "group together strategic projects into a big strategic program that will ensure coherence". The difficult but simple option will not even be entertained: "let's discover our real strategy and shape the organization around it."

32.  Delivery dates have impacts that are often irrelevant but very simple to understand. Good and bad solutions have impacts that are dramatic but very difficult to understand. The Entropic Organization will thus tend to make date-based decisions. The Entropic Organization will always worry about the development organization's ability to deliver by a given date, never about its ability to find the right solution. There are some very rare cases where the delivery date is more important than what you are delivering, but modern management seems to delight in generalizing this unusual occurrence to every situation. People do get promoted for having been able to deliver completely broken, useless and damaging solutions on time. If that's the measure of project success, you can expect dates to rule (even when they continuously slide). After all, if you are not a trained surgeon and the only thing you are told is that a given surgery should last no more than X hours, guess what will be the one criterion for all your actions during the operation. This showcases the direct link between the constituents' incompetence and the establishment of classic Entropic Organization decision-making.

33.  Having a strategy will only go so far when you face the Entropic Organization, since it will only be able to appropriate that strategy at the level of energy (understanding) it can attain, which, being entropic, is very low. This results in something that does not look like a strategy at all: ever seen a two-year-old play air-traffic control? He gets the basic idea of "talking to planes", but that's it.

34.  Partially isolating the Development Organization to stay effective does not work. Adapting your organization to be accepted by an incompetent background does not work either. What is left in the scope of alternatives is radical isolation, supported by the attempt at radical results and crossed fingers for top-management recognition (also known as 'Deus ex machina for the worthy'), and the top-down sales pitch (or POC) to the CEO (also known as "He who has the ear of the King…"). But don't forget: Nemo propheta in patria, so act and look like an outsider as long as you can.

35.  Growth-shrink symmetry. When an organization grows unhealthily (too fast, for bad reasons or through bad recruitment) it will also shrink unhealthily. When it grows it’s bold and confused, when it shrinks it’s scared and nasty.

36.  Most of the ideas that will pop up naturally from the Entropic Organization are bad in the context of modern knowledge-based work, but possess a superficial layer of common-sense to slide through. Exercise extreme prejudice.

Architecture

A quick thought.

Today I started reading Roy Fielding's PhD thesis, Architectural Styles and the Design of Network-based Software Architectures, and the first chapter begins with a priceless sentence:

In spite of the interest in software architecture as a field of research, there is little agreement among researchers as to what exactly should be included in the definition of architecture.

Of course he moves on to define it, and it is a reasonably good definition, with encapsulation at the core of it and a clear explanation that every level of abstraction manifests an architecture of its own.

Yet, there's something fishy about a topic that has professionals, books and courses named after it, whole hierarchies of people working in it, and still… what exactly should be included?

Test Driving a Unity Game


It’s been so long that I feel a newcomer to WordPress’ user interface.

A few years ago, just after the last Italian Agile Day I attended (2013), I was thinking of writing something about using TDD when developing in Unity.

Recently I saw an email passing by on the tdd mailing list asking about exactly this topic and, as it happens, I'm on holiday right now, so I finally got around to writing the article I should have written years ago. What a serendipitous accident.

First, some notes:

  1. This is not about testing pure “unity-neutral” C# code. In Unity, at some point, you start using objects which have no relationship with Unity types (like MonoBehaviour, Transform, etc.), but this happens quite low in the call stack and, at times, not at all. Besides, you can test drive those with normal tooling; describing that would be redundant with any good tdd book.
  2. It is unlikely that many of your interesting game features will be described entirely within the context of unity-neutral code. Unity is pervasive and it is not built for technological isolation (probably my biggest issue with a product I otherwise love). After all, if you are writing a 3d game, most of the stuff you need to code touches concepts like a transform, a collision, an animation; it is possible to express them neutrally, but Unity has decided to sacrifice isolation for immediacy. I've tried reintroducing that isolation and it's not nice, so I do it very selectively.
  3. A lot of the design questions you need to answer when developing in Unity are related to the distribution of responsibilities among the MonoBehaviours you attach to GameObjects (Unity’s game logic is structured around the Extension Object Pattern, see Gamma’s paper in PLoP 96). Skipping those parts of your logic just because they are unity-dependent pauperises tdd in Unity into irrelevance.
  4. Since a lot revolves around which MonoBehaviour of which GameObject does what, the collaborations between those behaviours are equally critical to your design; those collaborations are wired into life by Unity’s declarative dependency-injection mechanism: the Scene. The Scene is thus the seed of all your fixtures; trying to bypass it, while possible (factories, resource load, stubs), is often not worth it.

Now, all of the above is an admission that, if I want to apply my usual holistic approach to tdd, most of my tests will not look like unit tests and will need to cope with Unity’s environment.

It took me some time to accept this fact, but when I did, I started to see that there were interesting advantages in accepting a Unity scene as my test runtime; the rest of this post will be about how exploiting those advantages shaped my approach to doing tdd in Unity.

You can and should simulate unit test isolation by exploiting physical space locality

This means that, if the Scene is your runtime and you take care to build your test in such a way that its effects stay within a well-defined volume, your Scene will behave similarly to a suite of properly code-isolated tests in classic unit testing.

This has the side effect of forcing me to avoid as much as possible world-spanning searches of other objects through tags or names: all of my GameObject-to-GameObject interaction is defined by colliders and explicitly injected dependencies. The effort to keep the tests isolated in space is already influencing my design.

After a while I've come to actually materialise the bounded volume of a test with a cage. This makes boundary enforcement more natural while building and running the test (you can't ignore that something is leaving the test volume when that volume is graphically represented) and has the nice benefit of giving your test scene the look of a well-managed zoo:

the-zoo

If I really feel hardcore I can use a more advanced kind of cage that has colliders for walls; these colliders destroy anything they touch and throw an exception, failing the test. Frankly, while cool, this is overkill if you run your test Scene in Unity and glance at what is happening, but I believe that if the tests are meant to run in a headless runtime (not even sure if it is possible), say within a CI build, those exception-throwing cage walls become necessary: they are the only way to spot an abusive test early on, before it pollutes other tests.

A final note on physical space isolation: if you look at what Unity has provided as automated testing tools, they have taken a different approach. Every test is run sequentially and, while it runs, only the GameObject representing the test (and its hierarchy) exists; the test can thus play with the whole empty scene, without limitations. I was already using my "caged" approach before Unity published the testing tools, so I am biased towards my solution, but I can articulate a bit why I prefer it: first, the limitation of having everything present in the scene at the same time while running the tests informs my design, as I explained above: it ensures that my logic is intrinsically bounded in space and not world-spanning; second, my approach runs all tests at the same time, which is critical when most of your tests require one or two seconds to pass to let objects move around; I can run dozens of tests within a few seconds, with everything happening at the same time.

Here you can see what happens on the ground floor of my test zoo within a few seconds, a dozen tests doing their thing at the same time:

 

The cage must contain only the (Game)Objects you want to test

The problem with not limiting the test to pure, unity-neutral logic is that the test can grow to become a monster. Just like a classic unit test should not set up the whole system with dozens of components only to test a specific case, I always make sure that I can set up a minimum number of components, all of them neatly bounded by the cage, and still test what I need to test. If my design is correct, I will be able to demonstrate the feature I want with only the objects that will contain the feature.

This quality is core. Failing this, the tests are not declaring a unit of expected behaviour and everything devolves to automated smoke tests. Interesting, but big, clunky and of little value as a design tool.

Here's how I built the test that brought the “Planter” and the “ConstructionZone” concepts into my code. This is the very first cage I built for this game (it's about city construction, in case you were wondering):

planter-and-construction-zone

The test goal is to declare that the planter tool, when triggered by the user’s finger touching a construction zone, builds terrain and a building. I created a construction zone as a collider (the bottom, green square) and a dummy finger (the yellow line), replacing the user mouse or touch, that “clicks” on the square at Start. See below.

planter-click

Then I created a second collider, the top, green cube highlighted below, which contains an assertion that succeeds if a “building” collides with it.

planter-check

The solution to this test has been to implement two MonoBehaviours, one attached to the “finger”, the Planter tool, the other attached to the bottom collider, the ConstructionZone.

Once everything is working the construction zone and the planter tool spawn a building as soon as the finger “clicks”, the building collides with the assertion collider and the test passes. If something goes wrong, no building, no collision, failure exception in the Unity console.

Below is the result that appears when everything is fine. The building is the white cylinder, admittedly ugly, but that’s not the point.

planter-result

Below is the setup of the finger and the zone, showing how few and simple the objects involved are (the Dummy Finger and Tool User are the classes I developed to act on behalf of the user, the TestBuilding referenced in the Planter is the white cylinder).

After this first test was succeeding, I moved on to refine the behaviour of the Planter by, for instance, stating that it is not affected by any obstacles, which is a characteristic I need on every tool available to the user: if it touches an action zone, it doesn't matter whether it first intersects a cloud or another piece of landscape, it must still trigger it. Below you see the second cage: the flat, solid white panel is the obstacle which the Planter must ignore in order to touch the zone below and spawn the building. The rest of the test logic is the same as in the first cage, except that, by this time, I had also created the pavement mesh, with its nice grass & ground texture that you can see below the building in the previous test result; as such I could place the assertion collider below, where the pavement spawns, and I don't need to spawn an ugly cylinder: the pavement collides and passes the test.

obstacle

Much later on, after I had completed most of the building logic, I moved to develop artillery, with tests that look like the one below. Here you can see the cannon ball flying towards a rotated cube, where I attached my “Structure” MonoBehaviour that will get damaged (depending on the angle) by the ball colliding with it.

The assertion is also sitting on the cube, waiting for a collision and checking that the Structure’s health is lower than the initial health.

cannon-ball-damages-structure

You must write your own assertion and dummy player logic to simulate every interaction you need

What I did start to use out of Unity’s testing tools are some of the assertion components, but they are far from sufficient and anyway I always need to write custom scripts to generate the actions and transient behaviours that I need to simulate events that happen in the game and that my logic needs to react to.

For instance, in the movie below I’m testing that, even if the user wants to fire, the cannons on the wall will actually fire only when a target is in firing range.

The piece of pavement that moves on the left is the target; it contains a small script (part of my testing utilities for this game) that moves it at specific intervals by a specific amount. I called it the KinematicMover (since it does not use physics to move the object). “Using” statements edited out.

public class KinematicMover : MonoBehaviour {

        public Vector3 movement;
        public int steps;
        public float pause = 1;
        private Delay delay;

        void Start() {
                delay = gameObject.AddComponent<SystemDelay>();
                delay.repeat(steps, pause, () => this.transform.Translate(movement));
        }
}

Some of the test harness logic you create will likely remain just that, like the assertion checking that there were indeed some cannon balls flying within a collider during a specific time window. It is attached to the middle collider, to have the test pass or fail depending on the timing of the cannons firing from the wall.

public class ProjectileChecker : MonoBehaviour {
         
         public float windowStart = 0f;
         public float windowEnd = 10f;
         private bool complete = false;
         
         void OnTriggerEnter(Collider other) {
                 if(complete) return;
                 if(other.gameObject.GetComponent<Projectile>()) {
                         this.complete = true;
                         if(Time.fixedTime > windowStart && Time.fixedTime < windowEnd) {
                                 IntegrationTest.Pass(this.gameObject);
                         } else {
                                 IntegrationTest.Fail(this.gameObject);
                         }
                 }
         }

        void Update() {
                if(complete) return;
                if(Time.fixedTime > windowEnd + 1) {
                        this.complete = true;
                        IntegrationTest.Fail(this.gameObject);
                }
        }
}

On the other hand, I've found that, even more frequently than in classic non-Unity tdd, some of the logic driving the events for your tests turns out to be very useful game logic of its own; for instance, the gunner logic that tries to fire all the time became almost instantly part of the game's basic opponent AI. Meet the Aggressive Gunner:

public class AggressiveGunner : MonoBehaviour {

        public City city;
        
        void Update() {
                city.fire();
        }
}

In short, you must not be scared of creating quite a few test stubs, custom assertions, movers and shakers. They are key to well-isolated and focused tests, while at the same time being potentially reusable in the main code itself. They should be easy to write, if your design is ok.

Conclusion

My tdd approach in Unity ends up being what many people would define as very granular, very isolated integration testing, which only later on, for very specific logic, gets down to pure C# tests. It works pretty well and produces almost all of the design feedback I need, along with a nice test Scene (or Scenes) that makes me feel safe and in control as the game logic gets more complex.

Web Apps in TDD, Appendix, the User

Here's the User class and its collaborators as it is right now. It is a bit more evolved than its original form: when I first wrote it, all of the logic was in the User itself, as I had no need to evaluate JavaScript outside of html; later I separated the two responsibilities (parsing xml/html and evaluating JavaScript) since I had a need to evaluate JavaScript no matter where it came from.


public class User {
    private final JavaScript javaScript = new JavaScript();
    private final Result result = new Result();

    public User() {
        javaScript.evaluateFile("browser.js");
    }

    public User lookAt(String htmlPage) {
        JavaScriptSource source = new XomJavaScriptSource(htmlPage);
        source.evaluateWith(javaScript);
        triggerOnLoad();
        result.readOutput(javaScript);
        return this;
    }

    public String currentSight() {
        return result.nextValue();
    }

    private void triggerOnLoad() {
        javaScript.evaluateScript("window.onload();", "onload");
    }

}



And here’s “JavaScript”, which manages everything Rhino-related.


public class JavaScript {
    private final Context context;
    private final ScriptableObject scope;

    public JavaScript() {
        context = Context.enter();
        scope = context.initStandardObjects();
    }

    public Object valueOf(String variableName) {
        return scope.get(variableName);
    }

    public void evaluateScript(String script, String scriptName) {
        context.evaluateString(scope, script, scriptName, 1, null);
    }

    public void evaluateScript(String script) {
        evaluateScript(script, "script");
    }

    public void evaluateFile(String sourceFileName) {
        try {
            context.evaluateReader(scope, read(sourceFileName), sourceFileName, 1, null);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private InputStreamReader read(String sourceFileName) {
        return new InputStreamReader(getClass().getClassLoader().getResourceAsStream(sourceFileName));
    }
}




This is “Result”, which extracts values from the output array filled in by the JavaScript.


public class Result {
    private NativeArray output = new NativeArray(0);
    private int current = 0;

    public void readOutput(JavaScript javaScript) {
        output = (NativeArray) javaScript.valueOf("output");
    }

    public String nextValue() {
        return (String) output.get(current++);
    }
}



Finally, this is the class that hides the fact that scripts are mixed within html.


public class XomJavaScriptSource implements JavaScriptSource {

    private final Document document;

    public XomJavaScriptSource(String htmlPage) {
        document = parsePage(htmlPage);
    }

    @Override
    public void evaluateWith(JavaScript javaScript) {
        Nodes scriptNodes = document.query("//script");
        for (int i = 0; i < scriptNodes.size(); i++) {
            evaluateNode(scriptNodes.get(i), javaScript);
        }
    }

    private final Document parsePage(String htmlPage) {
        try {
            return new Builder().build(htmlPage, null);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private void evaluateNode(Node scriptNode, JavaScript javaScript) {
        if (scriptNode instanceof Element) {
            Attribute sourceAttribute = ((Element) scriptNode).getAttribute("src");
            if (sourceAttribute != null) {
                javaScript.evaluateFile(sourceAttribute.getValue());
                return;
            }
        }
        javaScript.evaluateScript(scriptNode.getValue());
    }
}

Web Apps in TDD, Part 4

Multiple buildings!

I bet you can tell where this is going: I have one building, now I want multiple buildings on my page, and then finally to put the player in the middle.
So, I add two more rectangles to my investigation.html.


            ...
            var map = new Raphael(0,0,600,400);
            map.rect(10,10,50,40);
            map.rect(80,10,30,40);
            map.rect(10,70,100,40);
            ...

Which produces:

Now, for the test that will lead me to implementing this…


    @Test
    public void shouldRenderMultipleBuildings() throws Exception {
        HtmlScreen htmlScreen = new HtmlScreen();
        htmlScreen.addBuilding(10,10,50,40);
        htmlScreen.addBuilding(80,10,30,40);
        htmlScreen.addBuilding(10,70,100,40);
        User user = new User().lookAt(htmlScreen.render());
        assertThat(user.currentSight(), is("A Rectangle at [10,10], 40px high and 50px wide"));
        assertThat(user.currentSight(), is("A Rectangle at [80,10], 40px high and 30px wide"));
        assertThat(user.currentSight(), is("A Rectangle at [10,70], 40px high and 100px wide"));
    }

The current result is:


Expected: is "A Rectangle at [10,10], 40px high and 50px wide"
     got: "A Rectangle at [10,70], 40px high and 100px wide"

Which is the last building only. This is due to my implementation of the html screen, which stores only the last building, overwriting the previous one. Easily fixed.


    private String renderBuildings() {
        String renderedBuildings = "";
        for (Building building : buildings) {
            renderedBuildings += building.render(vectorGraphics);
        }
        return renderedBuildings;
    }
    
    public Screen addBuilding(int x, int y, int width, int height) {
        buildings.add(new Building(x, y, width, height));
        return this;
    }

function Raphael(x, y, width, height){

    // test double for Raphael: every rect() call appends a description to the output array
    this.rect = function(x, y, width, height) {
        output[invocations++] = "A Rectangle at [" + x + "," + y + "], " +
                    height + "px high and " + width + "px wide";
    }

}

var output = [];
var invocations = 0;

    public String currentSight() {
        return (String) output.get(current++);
    }

Now, for the real-life test


    public static void main(String... args) throws Exception {
        HtmlScreen htmlScreen = new HtmlScreen();
        htmlScreen.addBuilding(10,10,50,40);
        htmlScreen.addBuilding(80,10,30,40);
        htmlScreen.addBuilding(10,70,100,40);
        new Boss(11111, htmlScreen);
    }

But the very first HtmlScreenBehavior test is not happy: it expected a rectangle, and now that rectangle needs to be added explicitly.


    @Test
    public void shouldRenderABuildingAsARectangle() throws Exception {
        User user = new User().lookAt(new HtmlScreen().addBuilding(10, 10, 50, 40).render());
        assertThat(user.currentSight(), is("A Rectangle at [10,10], 40px high and 50px wide"));
    }

Now it passes. All tests are green.

I'm so glad it's time to refactor, because my tests are looking very bad. For instance, have a look at the test right after the one I just modified:


    @Test
    public void shouldRenderTheBuildingWithTheRightPositionAndDimensions() {
        User user = new User().lookAt(
        		new HtmlScreen().addBuilding(50, 30, 80, 40).render());
        assertThat(user.currentSight(),
        		is("A Rectangle at [50,30], 40px high and 80px wide"));
    }

Yes, they are the same; the only difference is in the values. This is pretty much the only instance where I do consider erasing a test without a change in features: when it says exactly the same thing as another test.

So, adieu! I delete the second one, as I like the first one’s name better.

What else? Well, I’m growing bored of typing all of these “A Rectangle…”.


    @Test
    public void shouldRenderABuildingAsARectangle() throws Exception {
        User user = new User().lookAt(new HtmlScreen().addBuilding(10, 10, 50, 40).render());
        assertThat(user.currentSight(), is(aRectangle(10, 10, 50, 40)));
    }

    @Test
    public void shouldRenderMultipleBuildings() throws Exception {
        HtmlScreen htmlScreen = new HtmlScreen();
        htmlScreen.addBuilding(10, 10, 50, 40);
        htmlScreen.addBuilding(80, 10, 30, 40);
        htmlScreen.addBuilding(10, 70, 100, 40);
        User user = new User().lookAt(htmlScreen.render());
        assertThat(user.currentSight(), is(aRectangle(10, 10, 50, 40)));
        assertThat(user.currentSight(), is(aRectangle(80, 10, 30, 40)));
        assertThat(user.currentSight(), is(aRectangle(10, 70, 100, 40)));
    }

    private String aRectangle(int x, int y, int width, int height) {
        return "A Rectangle at [" + x + "," + y + "], " + 
        			height + "px high and " + width + "px wide";
    }