Do you really need to return that error?

We are taught to be conscientious about handling errors. Always check your return values. Always catch your exceptions.

Handling errors properly is obviously the right thing to do. But since it’s often such a drag, we have to force ourselves to bother doing it: Oh, you want to use this API? Call these seven functions. And oh, by the way, each of them can return an error you need to think about.

Hypothetically, if not all of the functions returned an error, wouldn’t it be much easier to write correct code?

But how can we make such an API? Wouldn’t we just be sweeping errors under the rug?

In this article I want to talk about a way of designing APIs that still gives the user of the API full control over what to do when errors happen, without forcing them to write an if-statement for every function call.

To go from abstract to concrete, consider how we would read a file in C. We need fopen, maybe some fseek and ftell to pre-allocate a buffer, fread in a loop and fclose. Each function can return an error or fail in some way.

This is not a great place to be. We have to make five different decisions, one for each f... function we call, about what to do when an error occurs. Do we retry the function call? Do we go down some else-path? Do we return an error to the user? Did we remember to call fclose?

To write a correct and robust program, we have to answer all these questions correctly, i.e. write the correct if/else for each call. And it’s not obvious how we could eliminate the errors: fopen, fread & co. are implemented on top of syscalls, and those syscalls might fail.
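To make the five decisions concrete, here is a minimal sketch of reading a whole file with every call checked, using a goto-based cleanup path. The function name read_whole_file and the policy of returning NULL on any failure are assumptions for illustration; a real program might instead retry, log which step failed, or return an error code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read the whole file at `path` into a malloc'd,
 * NUL-terminated buffer. Returns NULL on any error, otherwise stores the
 * number of bytes read in *out_len. Each stdio call gets its own check. */
char *read_whole_file(const char *path, size_t *out_len) {
    char *buf = NULL;
    long size = 0;
    size_t total = 0;

    FILE *f = fopen(path, "rb");
    if (f == NULL)                            /* decision 1: fopen */
        return NULL;

    if (fseek(f, 0, SEEK_END) != 0)           /* decision 2: fseek */
        goto fail;
    size = ftell(f);
    if (size < 0)                             /* decision 3: ftell */
        goto fail;
    if (fseek(f, 0, SEEK_SET) != 0)
        goto fail;

    buf = malloc((size_t)size + 1);
    if (buf == NULL)
        goto fail;

    while (total < (size_t)size) {            /* decision 4: fread, in a loop */
        size_t n = fread(buf + total, 1, (size_t)size - total, f);
        if (n == 0) {
            if (ferror(f))
                goto fail;
            break;                            /* EOF early: the file shrank */
        }
        total += n;
    }
    buf[total] = '\0';

    if (fclose(f) != 0) {                     /* decision 5: even fclose can fail */
        free(buf);
        return NULL;
    }
    *out_len = total;
    return buf;

fail:
    free(buf);                                /* free(NULL) is a no-op */
    fclose(f);                                /* don't leak the FILE on error */
    return NULL;
}
```

Only a handful of these lines do the actual reading; the rest decide what happens when something goes wrong.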

A complete example: sd-bus

Let’s take a step back, and construct a full example. We’ll use sd-bus to talk to systemd-networkd, and ask it which network links (eth0, wlan0, etc.) it is managing.

If you are already familiar with DBus, sd-bus, and/or systemd-networkd, you might want to skip this section and just look at the full example.

To find out which network links are managed, we first need to get a list of network links. We can do this by calling the ListLinks DBus method. On the command line, that looks something like this:

To talk to systemd-networkd, we’ll use systemd’s sd-bus API. To clarify, I like this API. It is clear, consistent, and has good documentation. My only problem with it is how it forces me as the user of the API to do error handling.

It boils down to this: You want to write many calls to sd_bus_message_read_... and sd_bus_message_append_.... Each of these functions returns an error. If an error occurs, you usually want to bail on whatever you are doing, do some cleanup, and return an error up the stack.

Using the sd-bus C API, we’d first call sd_bus_message_new_method_call, which gives us an sd_bus_message object. Since the call takes no parameters, we can move on to sd_bus_call, which sends the message and gives us a new sd_bus_message reply object.

This is where the fun starts: DBus messages have a tree-like structure (the contents of "data" you see in the JSON output above). To get the link names (lo, eth0, …) out of our sd_bus_message reply object, we have to walk through the tree.

sd_bus_message_enter_container(reply, 'a', "(iso)");    // enter the top-level array
while (sd_bus_message_peek_type(reply, NULL, NULL) > 0) {
    sd_bus_message_enter_container(reply, 'r', "iso");  // enter one of the nested (iso) structs

    int32_t index = 0;
    const char *name = NULL;
    const char *path = NULL;

    // read the three members of the struct:
    sd_bus_message_read_basic(reply, 'i', &index);
    sd_bus_message_read_basic(reply, 's', &name);
    sd_bus_message_read_basic(reply, 'o', &path);

    // now we can do something with our data

    sd_bus_message_exit_container(reply);
}
sd_bus_message_exit_container(reply);

(Note that there are some convenience functions, e.g. the variadic sd_bus_message_read, which could collapse the three sd_bus_message_read_basic calls into a single call. I’m skipping over these convenience functions for now, since I don’t think they significantly affect the basic point I’m trying to get at).

If this were all there was to using sd-bus, I would be very happy. But as you might have noticed, there is no error handling. What happens if systemd-networkd isn’t running? Or if our code doesn’t match the tree structure of the actual message (maybe because we made some mistake, or maybe because the DBus API has changed since we wrote the code)?

Because of how sd-bus reports errors, we have to change every sd_bus_... call into the following:

int res;
res = sd_bus_...
if (res < 0) {
    // handle the error...
}

For the error handling itself, I usually just log the error. If errors occur in this code, usually it is either because I incorrectly assumed the reply would have a different format and wrote the wrong sd_bus_message_... call, or because I forgot to check some precondition and the service I’m talking to isn’t actually running.

But in addition to error handling, we have to remember to do cleanup. At the very least, we’ll have to call sd_bus_message_unref, regardless of whether or not an error occurs. Also, we might have allocated memory for storing the result data, and if we want to discard the result when an error occurs we’ll have to free this memory.

In the full example, I deal with cleanup using goto. Writing the most straightforward, no-thinking version of the code to call ListLinks, with goto for error handling, we end up with about 4x as much code as in the example above.

Aside from goto being generally shunned, having to write if (res < 0) after every sd-bus call takes focus away from the basic structure of what our code is doing, making it harder to understand the code and to spot other mistakes.

What are our alternatives?

Instead of these three options, I’d like to explore another alternative: Poisoning APIs.

Poisoning!
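As a minimal sketch of the idea, here is a hypothetical Reader over an in-memory byte buffer. The names Reader, read_u8, read_u16 and sum_record are invented for this illustration; is_ok matches the name used in the text. The first failure sets a poisoned flag and records a message; every later call sees the flag, does nothing, and returns a harmless default.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical poisoning API: a reader over an in-memory byte buffer.
 * The first failure sets `poisoned` and records a message; afterwards
 * every call does nothing and returns a harmless default (0). */
typedef struct {
    const uint8_t *data;
    size_t len;
    size_t pos;
    bool poisoned;
    char error[64];
} Reader;

static void poison(Reader *r, const char *msg) {
    if (r->poisoned)
        return;                          /* keep the first (root-cause) error */
    r->poisoned = true;
    snprintf(r->error, sizeof r->error, "%s at offset %zu", msg, r->pos);
}

bool is_ok(const Reader *r) { return !r->poisoned; }

uint8_t read_u8(Reader *r) {
    if (r->poisoned)
        return 0;
    if (r->pos >= r->len) {
        poison(r, "unexpected end of data");
        return 0;
    }
    return r->data[r->pos++];
}

/* Little-endian u16 built from two read_u8 calls: poisoning composes. */
uint16_t read_u16(Reader *r) {
    uint16_t lo = read_u8(r);
    uint16_t hi = read_u8(r);
    return (uint16_t)(lo | (hi << 8));
}

/* Example caller: a record is a u16 count followed by `count` bytes.
 * No per-call checks, just one is_ok() at the end. */
int sum_record(Reader *r) {
    int sum = 0;
    uint16_t count = read_u16(r);
    for (uint16_t i = 0; i < count && is_ok(r); i++)   /* is_ok here is optional */
        sum += read_u8(r);
    return is_ok(r) ? sum : -1;
}
```

sum_record never checks individual reads. The `&& is_ok(r)` in the loop condition is not needed for correctness; it only stops early when a large count turns out to be bogus.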

Now, the user of our API can write code assuming that it never fails. Then, at the end, they just have to remember to call is_ok and log an error as appropriate. We’ve just gone from N places where the API user has to handle errors to 1.

And notice that we’ve not lost any control: If there is some piece of code which really shouldn’t run if an error has occurred, the user can always do if (is_ok(...)).

In the case of sd-bus, we can get a poisoning API by creating a wrapper. call_helper.h shows a proof of concept, and I’ve rewritten the example code to use this helper. Notice how the code now looks very similar to the rough outline we used to introduce sd-bus:

SdBusCall call = {bus};
call_init(&call, "org.freedesktop.network1", "/org/freedesktop/network1", "org.freedesktop.network1.Manager", "ListLinks");
call_run_with_timeout(&call, 1.0f);

call_read_enter_container(&call);
while (call_read_peek(&call)) {
    call_read_enter_container(&call);
    int32_t index = call_read_int32(&call);
    const char *name = call_read_string(&call);
    const char *path = call_read_opath(&call);
    call_read_exit_container(&call);

    // allocate some space to store data, etc...
}
call_read_exit_container(&call);

if (!call_ok(&call)) {
    printf("%s\n", call.error);
    // if we allocated space for data, we might want to free it here...
}
call_deinit(&call);
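A helper in the spirit of call_read_int32 could translate an error-returning function into the poisoning style roughly as follows. This is a sketch under assumptions, not call_helper.h itself: raw_read_int32 is a hypothetical stand-in for an sd-bus function that returns a negative errno on failure (so the example builds without libsystemd), and Wrapped, wrapped_read_int32 and wrapped_ok are invented names.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an sd-bus-style read: returns 0 on success or a
 * negative errno on failure. It yields *remaining, *remaining - 1, ...
 * and fails with -ENXIO once the counter reaches zero. */
static int raw_read_int32(int *remaining, int32_t *out) {
    if (*remaining <= 0)
        return -ENXIO;
    *out = (int32_t)(*remaining)--;
    return 0;
}

/* Hypothetical wrapper state; `remaining` plays the role of the message. */
typedef struct {
    int remaining;
    bool poisoned;
    char error[128];
} Wrapped;

bool wrapped_ok(const Wrapped *w) { return !w->poisoned; }

/* Turn "returns an error" into "poisons and returns a default". */
int32_t wrapped_read_int32(Wrapped *w) {
    int32_t value = 0;
    if (w->poisoned)
        return 0;                          /* an earlier call failed: do nothing */
    int r = raw_read_int32(&w->remaining, &value);
    if (r < 0) {
        w->poisoned = true;                /* first error wins */
        snprintf(w->error, sizeof w->error, "wrapped_read_int32: %s",
                 strerror(-r));
        return 0;
    }
    return value;
}
```

Because a poisoned wrapper skips the underlying call entirely, only the first error is ever recorded, which is usually the one worth logging: later failures tend to be consequences of it.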

I believe that it is much easier to write correct code when using a poisoning API than when using an each-function-returns-an-error style API. In most cases, you can write straightforward usage code and only have to think about error handling at the end.

And importantly: No longer do you have to write “this example skips error checking for clarity” in your documentation. Error checking should not have to come at the cost of clarity!

Conclusion

We’ve seen how a poisoning API can reduce the mental burden on the user of the API, by not forcing them to think about error handling at every step.

Such an API requires no special language features, and can be employed in any imperative language.

Of course, it should be said that the best option of all is to not have errors. If you can change your code so it cannot fail at all, then you don’t have to worry about error handling. Unfortunately, once you start dealing with IO and IPC, this stops being an option fairly quickly.

One thing to mention is that in many cases I would not bother creating a wrapper around an API just to turn an every-function-returns-an-error API into a poisoning API. As the sd-bus example shows, creating the wrapper means writing a bunch of extra code, and that eats into the benefit we gain from using a simpler API.

Looking at call_helper.h, if we were writing the API directly instead of writing a wrapper, we could have stuck bool poisoned directly into struct sd_bus_message or struct FILE.
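Revisiting the file-reading example from the start, here is a sketch of what a poisoning stdio-like API might look like. It is hypothetical throughout: struct FILE can’t be modified from outside libc, so the poisoned flag lives in a small PFile struct that owns the FILE *, and all the pf_ names are invented for this sketch. The calling function ends up with a single error check at the end.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical poisoning file API. struct FILE can't be changed from the
 * outside, so the flag lives in a small struct that owns the FILE *. */
typedef struct {
    FILE *f;
    bool poisoned;
    char error[128];
} PFile;

/* Record the first failure only; later failures are usually consequences. */
static void pf_fail(PFile *p, const char *what, const char *why) {
    if (p->poisoned)
        return;
    p->poisoned = true;
    snprintf(p->error, sizeof p->error, "%s: %s", what, why);
}

bool pf_ok(const PFile *p) { return !p->poisoned; }

void pf_open(PFile *p, const char *path) {
    if (p->poisoned)
        return;
    p->f = fopen(path, "rb");
    if (p->f == NULL)
        pf_fail(p, "fopen", strerror(errno));
}

/* Returns the file size, or 0 if the handle is (or becomes) poisoned. */
long pf_size(PFile *p) {
    long size = 0;
    if (p->poisoned)
        return 0;
    if (fseek(p->f, 0, SEEK_END) != 0 || (size = ftell(p->f)) < 0 ||
        fseek(p->f, 0, SEEK_SET) != 0) {
        pf_fail(p, "pf_size", strerror(errno));
        return 0;
    }
    return size;
}

/* Reads exactly n bytes into buf, or poisons the handle. */
void pf_read(PFile *p, void *buf, size_t n) {
    if (p->poisoned)
        return;
    if (fread(buf, 1, n, p->f) != n)
        pf_fail(p, "fread", ferror(p->f) ? strerror(errno) : "short read");
}

/* Always safe to call, poisoned or not, so cleanup needs no branching. */
void pf_close(PFile *p) {
    if (p->f == NULL)
        return;
    if (fclose(p->f) != 0)
        pf_fail(p, "fclose", strerror(errno));
    p->f = NULL;
}

char *read_whole_file_poisoned(const char *path, size_t *out_len) {
    PFile p = {0};
    pf_open(&p, path);
    long size = pf_size(&p);                      /* 0 if already poisoned */
    char *buf = malloc((size_t)size + 1);
    if (buf == NULL)
        pf_fail(&p, "malloc", "out of memory");   /* malloc isn't part of the API */
    pf_read(&p, buf, (size_t)size);               /* no-op once poisoned */
    pf_close(&p);

    if (!pf_ok(&p)) {                             /* the single error check */
        fprintf(stderr, "%s\n", p.error);
        free(buf);
        return NULL;
    }
    buf[size] = '\0';
    *out_len = (size_t)size;
    return buf;
}
```

The one remaining if in the happy path is for malloc, which is not part of the poisoning API; everything else is checked once, at the end.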
