Monday, December 15, 2008

Consider separate booleans instead of a single enumeration field

[1]  It can apparently be tempting for programmers to introduce home-baked enumeration types in situations where modelling with separate booleans would make the resulting code more readable and maintainable. From what I have seen this applies equally to all types of data constructs that developers have the freedom to design, be it fields in a Java class, XML attributes, SQL table columns, etc. [2]

There is a dose of irony at play here, as it is more common to suggest [3] readability will be improved by replacing uses of booleans by enumeration types (obviously deemed more advanced than that most primitive of types). Particularly with respect to otherwise unclear method arguments (lacking support for named parameters) I think the advise to prefer enumeration types is sound. What I think are unfortunate uses of enumeration types though are really special cases of the general pitfall of not building enough structure into your data by throwing together different 'aspects of state' into a single field.

The problem with home-baked enumeration types ('home-baked' indicating that you're not merely formalizing a law of nature or widely held real-world convention, like 'days of the week' or 'suits in a deck of cards') is that what seems like a fixed set of mutually exclusive options now, might easily fail to facilitate desired flexibility in the future. But well... YAGNI and we can always refactor later, right? And it seems easy enough to add one or more values to the enumeration. Unfortunately, the effort involved in adapting the code's existing logic to the new situation is often underestimated and error prone. This is so because (non-)equality tests involving enumeration values often carry implicit context-specific assumptions that get overlooked easily. Even if you've been disciplined enough to have written a proper amount of unit tests, there's a fair chance these will keep succeeding spectacularly despite now faulty logic, because your test assertions are based on the same old assumptions regarding which values could be expected and what they mean.

The problem manifests itself most clearly when modelling process states. Consider for instance the (simple) enumeration [DRAFT, SENT, RECEIVED] as representations of the possible states of a message inside yet another social networking application's internal e-mail-like messaging subsystem (as, for instance, Facebook and LinkedIn have). Next to fields for sender, receiver, subject and body, every message in the system would have a 'state' attribute that contains one of these three values. Naive? Probably, but: 'worse is better', 'time to market', etc.

At some point the site's user community started to protest that the fact that a message was received didn't necessarily mean or convey that the message was actually read. No problem: a READ value was added to the enumeration...

"Ah, that gives it away! You're arguing a straw man situation. Nobody would do that. Of course read would always be a boolean..." Well, fair enough, but are you really sure about that? Cause I know I've seen some designs by experienced developers... (and who's the straw man here anyway?)

With the READ value added, the RECEIVED state got a more specific meaning: 'the addressed user has been notified of the incoming message, but has not yet read it'. Obviously, some (display) code had to be adapted to deal with the new model. Unfortu­nately, it then took a while to figure out why everybody was suddenly being notified that their messages couldn't be delivered, while in fact these had already been read. This was of course blamed on the junior programmer who had failed to write the following expression to simply check for the SENT state (although he defended himself by saying that all the unit tests succeeded and the QA department had fully approved the previous version):
// Java; ignoring the charming convention that fields should
// only be exposed through public 'getters' for conciseness.
if (!DRAFT.equals(message.state) &&
!RECEIVED.equals(message.state) &&
message.due < now) {

The State Pattern (moving all state depending logic into polymorph state implementing classes) could have been applied here and I believe this would take out some of the risk of introducing bugs like the above during refactoring. Even so, the State Pattern is obviously only available inside a General Purpose Language and not in a persisted or for-transport form of the data. Also, I am afraid the pattern is not well-known enough that I'd be confident that all future maintainers of the code base will consistently stick to it. Frankly, I have never seen it applied in production code... Had it been applied, the (now surprisingly bug free) code snippet would probably look like this:
or (obviously):
but I remain unconvinced that this is a good idea as the check is only relevant in one particular situation and being forced to build such a 'hook' into every state class suggests we are coupling things too tightly already... So probably this (contrived) example is not the best suited for the State Patter or am I missing something?

After the bug was fixed (with no State Pattern in sight), all users (5 of them by raw estimate) seemed content for a while, until some of them started to complain again. This was great news, because it meant the product was still being used. Actually, it was being used so actively that the users' mailboxes got pretty full, even with lots of messages they'd rather forget about as soon as possible. The team got together and decided that the system needed to be extended with a 'Delete' function (destroying all traces of the message's existence). This turned out not to be completely satisfactory either, so eventually an ARCHIVED state was proposed and added to the enumeration.

I hope the problem with that design is obvious... I grant that this example is a bit far fetched and perhaps not disastrous enough to really bring home my point. I wish I could have used the cases that I actually encountered, but I would have had a lot of explaining to do with respect to the domain context and we can't afford to give away our valuable trade secrets either. :-)

Just to be sure: while a READ state at least implicitly conveys that the message was previously RECEIVED, you can't deduce whether a message was READ or not from the fact that the message is currently in the ARCHIVED state. (Supposedly) relevant information was lost in the state transition. It will not come as a surprise now that I suggest the message 'state' should have been modelled with boolean flags sent and received from the beginning, to be extended with flags read and archived when those features were introduced. The code snippet would simply read:
if (message.sent &&
!message.received &&
message.due < now) {

This approach of modelling 'aspects' of state with separate flags [4] increases flexibility in a big way. Code becomes less interdependent. New aspects can be added later without pain regarding existing code reasoning over existing properties. Aspects of state that were originally modelled to be mutually exclusive (on purpose or by accident), like DRAFT and ARCHIVED, can effortlessly be made to play together nicely as system requirements develop... ("Hey, why shouldn't we allow draft messages to be archived?")

This touches on one argument that might be put forward in defense of using enumeration types: "they enforce data integrity much better, because with separate booleans an object/record might be allowed to contain values that represent an illegal/impossible state....!!!" This is true of course (e.g. sent=false, received=true). However, in all real-world cases that I've reviewed, a programmer would have to put in a lot of extra ill-conceived effort at the business logic level to make that happen. So, unless you're really worried about attackers breaking in at this level (which seems an unlikely proposition), I don't think you need to be focussing on this kind of safety net. And if programmers find a way to get to an supposedly illegal state after all, that state might actually turn out to represent a valid use case that was overlooked in the analysis phase.

Also in complex scenario's with multiple processes influencing an object/record's state, it is very helpful to be able to split up the state representation into separate fields. Depending on their given responsibility each process can then take sole ownership of a particular flag or field, taking away the danger of a 'newer state' (e.g. RECEIVED) getting overwritten with an 'older state' (like SENT), as the result of e.g. unexpected thread scheduling or network latency.

In case somebody is still tempted to protest that the Boolean type is just a special case of an enumeration type... ;-), it might be helpful to observe in summary that the Boolean is superior in its predictability. Never will a boolean need to facilitate for other values than just True or False [5] and never will the possible values of a boolean cease to be mutually exclusive. A programmer dealing with a Boolean field will of course never even give those nightmarish suggestions any thought and can therefore write clear, concise and safe code to reason over its potential values.

Next time you think you know beforehand which (3, 4, 5, etc.) states will only ever apply in a process, ask yourself the question whether those states could be expressed as a combination of meaningful booleans. This may require a bit of creative thinking in some cases, but is often easy to do. If you find this is indeed feasible, it seems obvious to me that you should do exactly that. I only wrote this down because apparently some other guys didn't see it that way... What do you think?

[1] The title of this post was modelled after some of the items in Joshua Bloch's educational ¨Effective Java (2nd edition)¨ that I enjoyed reading recently. The structure and authoritative, yet balanced style of the material alone made it a joy to read. I wouldn't dare to suggest that I think I could match it.

[2] Note that where booleans share pretty universal syntax and semantics across environments, an enumeration type can take many forms: a special language construct, a set of named integer constants, an agreed set of strings, a foreign key to a dedicated code table, a set of classes that implement the same interface (as in the State Pattern), etc.

[3] e.g. in ¨Effective Java (2nd ed.)¨ on page 190 and for instance in [], but note the discussion following the original argument there (no guarantees of course that that wiki entry will stay how I found it).

[4] Boolean flags don't necessarily need to be stored as such 'physically', as they can take an implicit form: a 'virtual' flag is considered 'set' (or 'unset' if that makes more sense in the context) when a corresponding optional field has a value (like a timestamp that indicates when the message was received; a value in the 'due' timestamp field referenced above might similarly imply that the message was sent). Depending on the environment, 'optional' here usually either means 'can have a null value' or 'can be omitted' (XML). And when the interface to the object is separate from its implementation (which is clearly not the case in plain XML and SQL) a client of that interface doesn't even need to know the flag is not actually stored as a separate boolean field. This type of redundancy minimization serves as a pro-boolean argument in itself: it is natural to take this approach when working from a mental model that focuses on boolean reasoning, but it doesn't work coming from an enumeration type perspective.

[5] I know: it is possible, fun and useful in some contexts to define a Boolean-like type that allows for one or more fuzzy values between True and False (e.g. Unknown), but as you write your code dealing with those instances there can never be any doubt that this is not your regular Boolean.