Why signal retraction is hard (Part 1)

While dusting off some Iris code the other day, I found a crash in ServiceBrowser.  Apparently, deep down in the internals, some unexpected event was being received.  Specifically, an asynchronous operation was started and then later canceled, but a completion event for that operation was delivered anyway and the application would naively process it.

There are two solutions to a problem like this.  1) the application should ignore irrelevant/obsolete/invalid events, or 2) the subsystem should not deliver irrelevant/obsolete/invalid events.  Solution #1 is often the most straightforward, but it is also the least satisfying, and it requires some way of identifying an event as irrelevant/obsolete/invalid.  Solution #2 is more difficult, but it leads to a system that works in a way the user (of your API) is more likely to expect.  I’m reminded of what I wrote in my Signal Retraction article, about how the association between an event and the action that caused it is almost always implied in Qt programming.  That is, we don’t go around passing context ids in our signals.  You just have to assume that all Qt signals are valid.  So, for someone like me, the natural solution is #2.

(Side note: I’m not positive as to what the catalyst to my particular problem was, but I think it had to do with my recent IPv6 adjustments, where LAN queries are sent on both the IPv4 and IPv6 interfaces simultaneously if you happen to have both.  Once a reply is received on one interface, the query on the other is canceled.  The unexpected event was coming from the other interface.  Most likely it was delivering an answer to a query that I had canceled and no longer cared about.)

In any case, I started out by going through the layers, from the bottom up, to see where any problem situations were.

First up: JDNS.  In JDNS, you call jdns_query() to start a query and jdns_cancel_query() to stop a query.  You run the engine by calling jdns_step() as appropriate (without going into too much detail, jdns_step() performs one iteration of the JDNS engine and then returns instructions about how and when you’re supposed to call jdns_step() again).  After the jdns_step() call completes, there might be some events waiting for you, which you can pick up by calling jdns_next_event().  JDNS only ever appends events to the event queue.  It never removes events.  It is your job to pop them off with jdns_next_event().  When you start a query, you are given an integer id as a “handle” to associate with it.  You can use this id to cancel the query, and incoming events for the query have this id as a property.
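To make that calling pattern concrete, here is a toy model of it in C++.  This is only a sketch, not the real JDNS code: ToyEngine, ToyEvent and stepAndDrain() are names made up for illustration, and the "network" is assumed to answer every query on the very next step.  What it does mirror is the shape described above: integer handles, an append-only event queue, and a caller who pops events after each step.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Toy stand-in for a JDNS-style engine. These names are hypothetical,
// not the real JDNS API; the "network" always answers on the next step.
struct ToyEvent {
    int id;            // handle of the query this event belongs to
    std::string name;  // the name that was queried
};

class ToyEngine {
public:
    // Start a query and hand back an integer handle for it.
    int query(const std::string &name) {
        int id = nextId_++;
        pending_.push_back(ToyEvent{id, name});
        return id;
    }

    // One engine iteration: every pending query completes, and its
    // event is appended to the queue. Nothing is ever removed here.
    void step() {
        for (const ToyEvent &q : pending_)
            events_.push_back(q);
        pending_.clear();
    }

    // Pop the oldest event into `out`; returns false if the queue is empty.
    bool nextEvent(ToyEvent &out) {
        if (events_.empty())
            return false;
        out = events_.front();
        events_.pop_front();
        return true;
    }

private:
    int nextId_ = 0;
    std::vector<ToyEvent> pending_;
    std::deque<ToyEvent> events_;
};

// The usual driving pattern: step once, then drain the whole queue.
std::vector<ToyEvent> stepAndDrain(ToyEngine &engine) {
    engine.step();
    std::vector<ToyEvent> drained;
    ToyEvent e;
    while (engine.nextEvent(e))
        drained.push_back(e);
    return drained;
}

// Example: one query in, one event out, carrying the same handle.
void exampleUsage() {
    ToyEngine engine;
    int id = engine.query("jabber.org");
    std::vector<ToyEvent> events = stepAndDrain(engine);
    assert(events.size() == 1 && events[0].id == id);
}
```

Note that nothing in this model ever removes an event except the caller, which is exactly the property that matters for the rest of this article.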

My first question about JDNS was: is it possible to receive events for canceled queries?  Yes, I could think of some ways.  At the very least, the user could run a query, accumulate an event for it, and then cancel the query.

// start a query
int id = jdns_query(session, "jabber.org", JDNS_RTYPE_A);
...
// run the engine, long enough such that the query succeeds
jdns_step(session);

// cancel the query
jdns_cancel_query(session, id);

// read an event
jdns_event_t *event = jdns_next_event(session);
// event->id == id

There will be an event in the queue, not associated with any active query, that the user could misinterpret.  Before considering this to be a flaw in JDNS, I wondered if this could simply be considered a misuse of the API (the misuser being me, heh).

Maybe the user could just filter out invalid events by comparing the id of a received event with the ids of any active queries.

// a container to hold active queries
QSet<int> queries;
...
// start a query
int id = jdns_query(session, "jabber.org", JDNS_RTYPE_A);
queries.insert(id);
...
// run the engine, long enough such that the query succeeds
jdns_step(session);

// cancel the query
jdns_cancel_query(session, id);
queries.remove(id);

// there may be an event, but we can easily see if it is for a query we care about
jdns_event_t *event = jdns_next_event(session);
// queries.contains(event->id) == false

This should be possible as long as events are only generated during the jdns_step() call.  I say this because, well, imagine code like this:

// a container to hold active queries
QSet<int> queries;
...
// start a query
int id1 = jdns_query(session, "msn.com", JDNS_RTYPE_A);
queries.insert(id1);
...
// cancel it
jdns_cancel_query(session, id1);
queries.remove(id1);
...
// start another query
int id2 = jdns_query(session, "jabber.org", JDNS_RTYPE_A);
queries.insert(id2);
...
// run the engine, long enough such that the query succeeds
jdns_step(session);

jdns_event_t *event = jdns_next_event(session);
// queries.contains(event->id) == true, and answer is for "jabber.org"

The idea here is that even if the second query were to use the same exact id value as the first query (id1 == id2), there would be no problem for the application to differentiate events between the two.  Why?  Because there wouldn’t be any events for the first query, since jdns_step() wasn’t called while the query was valid.

Finally, for any of this to work, the user wouldn’t be allowed to leave events lying around in the queue, since otherwise they might get mismatched with future queries.  In practice, that means relying on the user not to initiate any new queries between the call to jdns_step() and the last call to jdns_next_event().  That’s probably a fair restriction, and with it the app should always be able to know what events are legit.

That’s what I thought at first, anyway.  However, I then discovered that JDNS caching doesn’t play by these rules. JDNS will cache answers for common queries, and there is at least one case where answer events might be queued up immediately at the call to jdns_query(). For example, if an answer for “msn.com” was available in the cache, then the jdns_query() call would immediately put that answer into the event queue.

// let's assume "msn.com" is cached
int id1 = jdns_query(session, "msn.com", JDNS_RTYPE_A);
// (there is now an event in the queue)
jdns_cancel_query(session, id1);
int id2 = jdns_query(session, "jabber.org", JDNS_RTYPE_A);
jdns_step(session);
jdns_event_t *event = jdns_next_event(session);
// event->id == id1, and answer is for "msn.com".
// this is a problem if id2 was allocated to the same value as id1

If the second call to jdns_query() happened to assign the same handle id as the first call, then the application could read the first event from the queue and associate it with the second query. Someone looking for jabber.org might end up at msn.com, and that would be a tragedy!

There are a couple of solutions to this problem.  One is that I could internally refrain from queuing the event until the next call to jdns_step(), even if the answer is available in the cache. In other words, jdns_query() shouldn’t be creating events. The other is that when jdns_cancel_query() is called, I could remove any related events from the event queue. Also, before you say it, I’m aware that I could just base the handle allocations on a strictly increasing integer (FWIW, this is actually what JDNS does anyway), but this is unsatisfying as a solution to the event mismatch problem. Theoretically, an event mismatch wouldn’t happen on a 32-bit counter unless you performed 4 billion queries without calling jdns_next_event(), which is incredibly impractical and would never happen, but… what can I say, I just don’t consider it a solution.
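For comparison, the first option (never creating events outside of the step call) might look roughly like the sketch below.  This is a toy model under my own assumptions, not JDNS internals: DeferringEngine and its methods are hypothetical names, the cache is just a map of name to answer, and the addresses are placeholders from the documentation range.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <string>

// Toy model only: hypothetical types, not the JDNS API or its internals.
struct Event {
    int id;
    std::string answer;
};

class DeferringEngine {
public:
    explicit DeferringEngine(const std::map<std::string, std::string> &cache)
        : cache_(cache) {}

    // Starting a query never touches the event queue, even on a cache hit.
    int query(const std::string &name) {
        int id = nextId_++;
        active_[id] = name;
        return id;
    }

    // A query cancelled before the next step() never produces an event.
    void cancel(int id) { active_.erase(id); }

    // Events are created here and nowhere else.
    void step() {
        for (auto it = active_.begin(); it != active_.end();) {
            auto hit = cache_.find(it->second);
            if (hit == cache_.end()) {
                ++it;  // cache miss: a real engine would do network I/O
                continue;
            }
            events_.push_back(Event{it->first, hit->second});
            it = active_.erase(it);
        }
    }

    // Pop the oldest event into `out`; returns false if the queue is empty.
    bool nextEvent(Event &out) {
        if (events_.empty())
            return false;
        out = events_.front();
        events_.pop_front();
        return true;
    }

    std::size_t queuedEvents() const { return events_.size(); }

private:
    int nextId_ = 0;
    std::map<std::string, std::string> cache_;
    std::map<int, std::string> active_;
    std::deque<Event> events_;
};

void exampleUsage() {
    std::map<std::string, std::string> cache;
    cache["msn.com"] = "192.0.2.1";     // placeholder, not a real record
    cache["jabber.org"] = "192.0.2.2";  // placeholder, not a real record
    DeferringEngine engine(cache);

    int id1 = engine.query("msn.com");
    assert(engine.queuedEvents() == 0);  // cache hit, but nothing queued yet
    engine.cancel(id1);

    int id2 = engine.query("jabber.org");
    engine.step();

    Event e;
    bool got = engine.nextEvent(e);
    assert(got && e.id == id2 && e.answer == "192.0.2.2");
}
```

The appeal of this approach is that it restores the "events only appear during a step" rule.  The cost is that the caller still has to drain the queue before starting new queries.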

I went with the second option, where events are cleared from the queue when jdns_cancel_query() is called.

// let's assume "msn.com" is cached
int id1 = jdns_query(session, "msn.com", JDNS_RTYPE_A);
// (there is now an event in the queue)
jdns_cancel_query(session, id1);
// (event now removed!)
int id2 = jdns_query(session, "jabber.org", JDNS_RTYPE_A);
jdns_step(session);
jdns_event_t *event = jdns_next_event(session);
// event->id == id2, and answer is for "jabber.org".
// it doesn't matter if id2 was allocated to the same value as id1

This pretty much solves all problems.  It even lifts that restriction discussed earlier about how the user would have to read all events immediately following a jdns_step() call. You can start or cancel queries wherever you like. It just makes everything work as expected.
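If you want to see the effect in isolation, here is a toy model of the purge-on-cancel idea.  It is a sketch under stated assumptions rather than the actual JDNS change: CachingEngine and firstAnswerAfterCancel() are hypothetical names, the addresses are placeholders, and the handle allocator deliberately reuses the smallest free id to force the worst case (JDNS itself doesn't allocate this way).

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <map>
#include <set>
#include <string>

// Toy model only: hypothetical names, not the actual JDNS source.
struct Event {
    int id;
    std::string answer;
};

class CachingEngine {
public:
    CachingEngine(const std::map<std::string, std::string> &cache, bool purgeOnCancel)
        : cache_(cache), purgeOnCancel_(purgeOnCancel) {}

    // A cache hit queues the answer immediately, before any step.
    int query(const std::string &name) {
        int id = allocateId();
        active_.insert(id);
        std::map<std::string, std::string>::const_iterator hit = cache_.find(name);
        if (hit != cache_.end())
            events_.push_back(Event{id, hit->second});
        return id;
    }

    // With purging enabled, cancelling also retracts queued events for the id.
    void cancel(int id) {
        active_.erase(id);
        if (!purgeOnCancel_)
            return;
        events_.erase(std::remove_if(events_.begin(), events_.end(),
                                     [id](const Event &e) { return e.id == id; }),
                      events_.end());
    }

    // Pop the oldest event into `out`; returns false if the queue is empty.
    bool nextEvent(Event &out) {
        if (events_.empty())
            return false;
        out = events_.front();
        events_.pop_front();
        return true;
    }

private:
    // Worst case on purpose: always reuse the smallest free handle.
    int allocateId() const {
        int id = 0;
        while (active_.count(id) != 0)
            ++id;
        return id;
    }

    std::map<std::string, std::string> cache_;
    bool purgeOnCancel_;
    std::set<int> active_;
    std::deque<Event> events_;
};

// Query a cached name, cancel it, query another cached name, and report
// which answer the application would read first.
std::string firstAnswerAfterCancel(bool purgeOnCancel) {
    std::map<std::string, std::string> cache;
    cache["msn.com"] = "192.0.2.1";     // placeholder, not a real record
    cache["jabber.org"] = "192.0.2.2";  // placeholder, not a real record
    CachingEngine engine(cache, purgeOnCancel);

    int id1 = engine.query("msn.com");
    engine.cancel(id1);
    int id2 = engine.query("jabber.org");
    assert(id1 == id2);  // the handle was reused

    Event e;
    bool got = engine.nextEvent(e);
    assert(got);
    return e.answer;
}
```

Calling firstAnswerAfterCancel(false) hands back the stale "msn.com" answer under the reused handle, while firstAnswerAfterCancel(true) hands back the "jabber.org" answer, even though both queries shared the same id.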

Despite JDNS being plain C and not Qt, this is still fundamentally a signal retraction problem (in fact, all of the Delta Object Rules are environment-neutral, despite my framing in a Qt/C++ context). The difficulty suggested by the title of this article is not necessarily about the implementation, as in this case it was rather easy (I just added a few lines to the jdns_cancel_query() function). No, here the difficulty was spotting the problem in the first place. Most signal retraction bugs are edge cases that are hard to replicate. How fortunate of me to get repeated crashes on something that was surely the result of a network race condition.

So, that fixes JDNS. In the next article of this series, I’ll show you how the next layer up, QJDns, managed to negate the fix and reintroduce the problem.
