Archive for July, 2008

Why signal retraction is hard (Part 1)

While dusting off some Iris code the other day, I found a crash in ServiceBrowser.  Apparently, deep down in the internals, some unexpected event was being received.  Specifically, an asynchronous operation was started and then later canceled, but a completion event for that operation was delivered anyway and the application would naively process it.

There are two solutions to a problem like this.  1) the application should ignore irrelevant/obsolete/invalid events, or 2) the subsystem should not deliver irrelevant/obsolete/invalid events.  Solution #1 is often the most straightforward, but it is also the least satisfying, and requires some way of identifying an event as irrelevant/obsolete/invalid.  Solution #2 is more difficult, but it leads to a system that works in a way the user (of your API) is more likely to expect.  I’m reminded of what I wrote in my Signal Retraction article, about how association between an event and the action that caused it is almost always implied in Qt programming.  That is, we don’t go around passing context ids in our signals.  You just have to assume that all Qt signals are valid.  So, for someone like me, the natural solution is #2.

(Side note: I’m not positive what the catalyst for my particular problem was, but I think it had to do with my recent IPv6 adjustments, where LAN queries are sent on both the IPv4 and IPv6 interfaces simultaneously if you happen to have both.  Once a reply is received on one interface, the query on the other is canceled.  The unexpected event was coming from the other interface.  Most likely it was delivering an answer to a query that I had canceled and no longer cared about.)

In any case, I started out by going through the layers, from the bottom up, to see where any problem situations were.

First up: JDNS.  In JDNS, you call jdns_query() to start a query and jdns_cancel_query() to stop a query.  You run the engine by calling jdns_step() as appropriate (without going into too much detail, jdns_step() performs one iteration of the JDNS engine and then returns instructions about how and when you’re supposed to call jdns_step() again).  After the jdns_step() call completes, there might be some events waiting for you, which you can pick up by calling jdns_next_event().  JDNS only ever appends events to the event queue.  It never removes events.  It is your job to pop them off with jdns_next_event().  When you start a query, you are given an integer id as a “handle” to associate with it.  You can use this id to cancel the query, and incoming events for the query have this id as a property.

My first question about JDNS was: is it possible to receive events for canceled queries?  Yes, I could think of some ways.  At the very least, the user could run a query, accumulate an event for it, and then cancel the query.

// start a query
int id = jdns_query(session, "jabber.org", JDNS_RTYPE_A);
...
// run the engine, long enough such that the query succeeds
jdns_step(session);

// cancel the query
jdns_cancel_query(session, id);

// read an event
jdns_event_t *event = jdns_next_event(session);
// event->id == id

There will be an event in the queue, not associated with any active query, that the user could misinterpret.  Before considering this to be a flaw in JDNS, I wondered if this could simply be considered a misuse of the API (the misuser being me, heh).

Maybe the user could just filter out invalid events by comparing the id of a received event with the ids of any active queries.

// a container to hold active queries
QSet<int> queries;
...
// start a query
int id = jdns_query(session, "jabber.org", JDNS_RTYPE_A);
queries.insert(id);
...
// run the engine, long enough such that the query succeeds
jdns_step(session);

// cancel the query
jdns_cancel_query(session, id);
queries.remove(id);

// there may be an event, but we can easily see if it is for a query we care about
jdns_event_t *event = jdns_next_event(session);
// queries.contains(event->id) == false

This should be possible as long as events are only generated during the jdns_step() call.  I say this because, well, imagine code like this:

// a container to hold active queries
QSet<int> queries;
...
// start a query
int id1 = jdns_query(session, "msn.com", JDNS_RTYPE_A);
queries.insert(id1);
...
// cancel it
jdns_cancel_query(session, id1);
queries.remove(id1);
...
// start another query
int id2 = jdns_query(session, "jabber.org", JDNS_RTYPE_A);
queries.insert(id2);
...
// run the engine, long enough such that the query succeeds
jdns_step(session);

jdns_event_t *event = jdns_next_event(session);
// queries.contains(event->id) == true, and answer is for "jabber.org"

The idea here is that even if the second query were to use the same exact id value as the first query (id1 == id2), there would be no problem for the application to differentiate events between the two.  Why?  Because there wouldn’t be any events for the first query, since jdns_step() wasn’t called while the query was valid.

Finally, for any of this to work, the user wouldn’t be allowed to leave events lying around in the queue, otherwise they might get mismatched with future queries.  It would basically rely on the user to not initiate any new queries between the call to jdns_step() and the last call to jdns_next_event().  That’s probably a fair restriction, and the app should always be able to know what events are legit.

That’s what I thought at first, anyway.  However, I then discovered that JDNS caching doesn’t play by these rules.  JDNS will cache answers for common queries, and there is at least one case where answer events might be queued up immediately at the call to jdns_query().  For example, if an answer for “msn.com” was available in the cache, then the jdns_query() call would immediately put that answer into the event queue.

// let's assume "msn.com" is cached
int id1 = jdns_query(session, "msn.com", JDNS_RTYPE_A);
// (there is now an event in the queue)
jdns_cancel_query(session, id1);
int id2 = jdns_query(session, "jabber.org", JDNS_RTYPE_A);
jdns_step(session);
jdns_event_t *event = jdns_next_event(session);
// event->id == id1, and answer is for "msn.com".
// this is a problem if id2 was allocated to the same value as id1

If the second call to jdns_query() happened to assign the same handle id as the first call, then the application could read the first event from the queue and associate it with the second query. Someone looking for jabber.org might end up at msn.com, and that would be a tragedy!

There are a couple of solutions to this problem.  One is that I could internally refrain from queuing the event until the next call to jdns_step(), even if the answer is available in the cache. In other words, jdns_query() shouldn’t be creating events. The other is that when jdns_cancel_query() is called, I could remove any related events from the event queue. Also, before you say it, I’m aware that I could just base the handle allocations on a strictly increasing integer (FWIW, this is actually what JDNS does anyway), but this is unsatisfying as a solution to the event mismatch problem. Theoretically, an event mismatch wouldn’t happen on a 32-bit counter unless you performed 4 billion queries without calling jdns_next_event(), which is incredibly impractical and would never happen, but… what can I say, I just don’t consider it a solution.
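To make the first option concrete, here is a rough sketch of what deferring cached answers until the next step could look like.  This is a toy model, not JDNS code: defer_session, defer_query_cached(), defer_cancel() and defer_step() are all hypothetical names invented for illustration, and the "cache" is reduced to a list of query ids whose answers are waiting for delivery.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the JDNS API. */
#define DEFER_MAX 8

typedef struct {
    int pending[DEFER_MAX]; /* ids of active queries whose cached answer awaits delivery */
    size_t pending_count;
    int events[DEFER_MAX];  /* ids carried by queued events, oldest first */
    size_t event_count;
} defer_session;

/* Start a query whose answer is already cached.  No event is created here;
   the answer is only remembered as pending.  Returns 0 if there is no room. */
int defer_query_cached(defer_session *s, int id)
{
    assert(s != NULL);
    if (s->pending_count == DEFER_MAX)
        return 0;
    s->pending[s->pending_count++] = id;
    return 1;
}

/* Cancel a query: forget any pending answer for it. */
void defer_cancel(defer_session *s, int id)
{
    assert(s != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < s->pending_count; ++i) {
        if (s->pending[i] != id)
            s->pending[kept++] = s->pending[i];
    }
    s->pending_count = kept;
}

/* Run one step: the only place where events are created.
   Returns the number of events that were queued. */
size_t defer_step(defer_session *s)
{
    assert(s != NULL);
    size_t moved = 0;
    while (moved < s->pending_count && s->event_count < DEFER_MAX)
        s->events[s->event_count++] = s->pending[moved++];

    /* Keep any answers that did not fit, preserving their order. */
    size_t rest = s->pending_count - moved;
    for (size_t i = 0; i < rest; ++i)
        s->pending[i] = s->pending[moved + i];
    s->pending_count = rest;
    return moved;
}
```

In this model, a query that is started and canceled before the next step never produces an event, so a later query that reuses the same id cannot be confused with it.  Note that events already delivered by a step stay in the queue, so this approach on its own would still seem to depend on the earlier restriction about reading all events promptly.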

I went with the second option, where events are cleared from the queue when jdns_cancel_query() is called.

// let's assume "msn.com" is cached
int id1 = jdns_query(session, "msn.com", JDNS_RTYPE_A);
// (there is now an event in the queue)
jdns_cancel_query(session, id1);
// (event now removed!)
int id2 = jdns_query(session, "jabber.org", JDNS_RTYPE_A);
jdns_step(session);
jdns_event_t *event = jdns_next_event(session);
// event->id == id2, and answer is for "jabber.org".
// it doesn't matter if id2 was allocated to the same value as id1

This pretty much solves all problems.  It even lifts that restriction discussed earlier about how the user would have to read all events immediately following a jdns_step() call. You can start or cancel queries wherever you like. It just makes everything work as expected.
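For the curious, the idea of clearing events on cancel can be sketched in a few lines.  This is a toy model under my own assumptions rather than the actual JDNS change: toy_queue, toy_push_event(), toy_cancel_query() and toy_next_event() are hypothetical names, and an event is reduced to nothing more than the query id it carries.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the JDNS API. */
#define TOY_MAX_EVENTS 16

typedef struct {
    int ids[TOY_MAX_EVENTS]; /* query id carried by each queued event, oldest first */
    size_t count;            /* number of events currently queued */
} toy_queue;

/* Append an event for the given query id.  Returns 0 if the queue is full. */
int toy_push_event(toy_queue *q, int id)
{
    assert(q != NULL);
    if (q->count == TOY_MAX_EVENTS)
        return 0;
    q->ids[q->count++] = id;
    return 1;
}

/* Cancel a query: drop every queued event that carries its id, keeping the
   remaining events in their original order.  Returns the number removed. */
size_t toy_cancel_query(toy_queue *q, int id)
{
    assert(q != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < q->count; ++i) {
        if (q->ids[i] != id)
            q->ids[kept++] = q->ids[i];
    }
    size_t removed = q->count - kept;
    q->count = kept;
    return removed;
}

/* Pop the oldest event.  Returns its query id, or -1 if the queue is empty. */
int toy_next_event(toy_queue *q)
{
    assert(q != NULL);
    if (q->count == 0)
        return -1;
    int id = q->ids[0];
    for (size_t i = 1; i < q->count; ++i)
        q->ids[i - 1] = q->ids[i];
    q->count--;
    return id;
}
```

Because the purge happens inside the cancel call itself, it does not matter when the stale event was queued (during a step or straight from the cache), and it does not matter whether the id is later reused.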

Despite JDNS being plain C and not Qt, this is still fundamentally a signal retraction problem (in fact, all of the Delta Object Rules are environment-neutral, despite my framing in a Qt/C++ context). The difficulty suggested by the title of this article is not necessarily about the implementation, as in this case it was rather easy (I just added a few lines to the jdns_cancel_query() function). No, here the difficulty was spotting the problem in the first place. Most signal retraction bugs are edge cases that are hard to replicate. How fortunate of me to get repeated crashes on something that was surely the result of a network race condition.

So, that fixes JDNS.  In the next article of this series, I’ll show you how the next layer up, QJDns, managed to negate the fixes and reintroduce the problem.


Introducing PsiMedia

[Screenshot: PsiMedia Test]

Voice (and video) chat is a feature we’ve wanted in Psi for a long time.  However, implementing voice/video chat is not straightforward, and this is partly due to all of the new concepts that have to be introduced into the application in order to make it happen.  Cameras, microphones, codecs, and RTP are all just very foreign to Psi.  The code necessary to handle a multimedia “stack” could easily exceed the amount of code in our own IM stack!  Fortunately, there are libraries out there to handle the task.

In 2004, we considered RealNetworks’ Helix framework.  For receiving content, we found this framework to be quite mature.  However, for transmitting content, it was clearly not designed for end-user desktop applications and was even GPL-incompatible in that scenario.  Quite some work went into the Psi+Helix effort, but ultimately it was abandoned.

In 2005, we considered Google’s libjingle.  We managed to get voice chat working with it, but the code never went beyond the experimental stage.  This was due to the limited platform support at the time (Linux audio only at first, though Remko managed to add in Mac audio support) and libjingle’s lack of maintenance.  Libjingle works as a black box, handling not only multimedia but also the Jingle protocol.  Unfortunately, this meant that as the Jingle protocol changed, libjingle fell out of spec.  We also felt it was a tad intrusive for libjingle to be handling XMPP stuff.

In 2006, we investigated GStreamer.  This framework has proved to be the most interesting thus far, for a number of reasons.  Unlike the limited libjingle black-box, GStreamer is a comprehensive and flexible multimedia framework, similar in nature to Helix.  It goes further than Helix though, by offering a better API for transmitting, by being GPL-compatible throughout, and by being easier to extend.  I feel confident we can accomplish everything we need with GStreamer.

Today there is Phonon, but it lacks input and transmission facilities at this time.  We will keep an eye on it for the future.  There is also Farsight, which integrates with GStreamer.  We may make use of Farsight, depending on our needs.

In any case, I’ve started a new “wrapper” project called PsiMedia.  The goal of PsiMedia is to offer an API designed for the purpose of adding voice and video chat to Psi or a Psi-like client.  All of the details the client does not care about will be hidden behind PsiMedia.  It solves only the multimedia aspects, and not Jingle/XMPP, as I consider these two problems to be orthogonal.  Currently PsiMedia wraps GStreamer, but the requirements are abstract enough that the client should not care what is actually wrapped.  PsiMedia can be considered the successor of the old “Media” module I started in 2004, to wrap Helix.

Below are the requirements of the system.

What PsiMedia does:

  • Tell you what audio and video devices are available.
  • Tell you what audio/video modes are possible (codecs, sample rates, video resolutions, etc).
  • Allow you to specify your desired modes, and the modes of the remote party, to arrive at a list of common modes.
  • Capture audio/video and encode as RTP into a series of QByteArrays.
  • Accept QByteArrays containing RTP, and play back any audio/video contained within.
  • Play back video in a QWidget.
  • Allow displaying video currently being captured (preview of yourself).
  • Provide volume controls.
  • Allow the backend to be separated into a plugin, so that no new compile-time dependencies are introduced to Psi.

(RTP, by the way, is a standard packet format for transporting multimedia data in real-time.  It is used by SIP, Jingle, and, well, everybody.)
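To give a feel for what is inside those QByteArrays, here is a small sketch that reads the 12-byte fixed header found at the start of an RTP packet, following the layout in RFC 3550.  This is not PsiMedia code: rtp_header and rtp_parse_header() are names made up for this example, and the sketch ignores CSRC entries and header extensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for illustration; not part of PsiMedia. */
typedef struct {
    unsigned version;      /* 2 bits, expected to be 2 */
    unsigned padding;      /* 1 bit */
    unsigned extension;    /* 1 bit */
    unsigned csrc_count;   /* 4 bits */
    unsigned marker;       /* 1 bit */
    unsigned payload_type; /* 7 bits, identifies the codec in use */
    uint16_t sequence;     /* increments by one per packet */
    uint32_t timestamp;    /* sampling instant, in codec clock units */
    uint32_t ssrc;         /* identifies the sending source */
} rtp_header;

/* Parse the fixed RTP header from buf.  Returns 1 on success, or 0 if the
   buffer is missing, too short, or does not carry RTP version 2. */
int rtp_parse_header(const uint8_t *buf, size_t len, rtp_header *out)
{
    assert(out != NULL);
    if (buf == NULL || len < 12)
        return 0;

    out->version      = buf[0] >> 6;
    out->padding      = (buf[0] >> 5) & 0x01;
    out->extension    = (buf[0] >> 4) & 0x01;
    out->csrc_count   = buf[0] & 0x0F;
    out->marker       = buf[1] >> 7;
    out->payload_type = buf[1] & 0x7F;

    /* Multi-byte fields are in network byte order (big-endian). */
    out->sequence  = (uint16_t)((buf[2] << 8) | buf[3]);
    out->timestamp = ((uint32_t)buf[4] << 24) | ((uint32_t)buf[5] << 16) |
                     ((uint32_t)buf[6] << 8)  | (uint32_t)buf[7];
    out->ssrc      = ((uint32_t)buf[8] << 24) | ((uint32_t)buf[9] << 16) |
                     ((uint32_t)buf[10] << 8) | (uint32_t)buf[11];

    return out->version == 2;
}
```

The client should never need to do this itself, which is rather the point: it just moves the opaque packets between PsiMedia and the network.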

What PsiMedia does not do:

  • Use the network.
  • Implement Jingle or anything XMPP.
  • Expose anything more than very basic multimedia details.  There are no filters, no pipelines, etc.

In short, PsiMedia should make implementing voice/video chat in Psi straightforward.
