<!--# set var="title" value="Optimistic parsing" -->
<!--# set var="date" value="2019-05-05" -->
<!--# include file="include/top.html" -->
<p>I've been writing a <a href="https://www.mit.edu/~yandros/doc/specs/fcgi-spec.html">FastCGI</a> protocol parser. The protocol is reasonably well designed (if not accurately documented), and parsing it follows a familiar pattern:</p>
<ul>
<li>Sequential structures/records (e.g. header followed by body)</li>
<li>Stateful: subsequent parsing depends on past parsing (e.g. length field in header tells you the size of body)</li>
<li>Variable length: most messages are short, but some can be long</li>
<li>Stream-based: underlying protocol doesn't provide any help with message boundaries</li>
</ul>
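<p>As a concrete instance of this pattern, here is a minimal sketch of decoding a FastCGI record header. The 8-byte layout (version, type, two-byte request ID, two-byte content length, padding length, reserved) follows the FastCGI specification; the names <code>RecordHeader</code>, <code>DecodeHeader</code> and <code>RecordSize</code> are made up for illustration and are not taken from any particular library.</p>

```cpp
#include <cstddef>
#include <cstdint>

// Decoded form of the fixed-size FastCGI record header.
// (Hypothetical struct for illustration; field names are assumptions.)
struct RecordHeader {
  uint8_t version;
  uint8_t type;
  uint16_t request_id;
  uint16_t content_length;
  uint8_t padding_length;
};

// Every FastCGI record starts with an 8-byte header.
constexpr size_t kHeaderSize = 8;

// Decodes a header from the first 8 bytes of data.
// Returns false if fewer than 8 bytes are available, which is the
// "ran out of data mid-structure" case a stream parser has to handle.
bool DecodeHeader(const uint8_t* data, size_t len, RecordHeader* out) {
  if (len < kHeaderSize) {
    return false;
  }
  out->version = data[0];
  out->type = data[1];
  // Multi-byte fields are big-endian (high byte first).
  out->request_id = static_cast<uint16_t>((data[2] << 8) | data[3]);
  out->content_length = static_cast<uint16_t>((data[4] << 8) | data[5]);
  out->padding_length = data[6];
  // data[7] is reserved and ignored here.
  return true;
}

// Total bytes on the wire for the record this header describes:
// the header itself, then the content, then the padding.
size_t RecordSize(const RecordHeader& h) {
  return kHeaderSize + h.content_length + h.padding_length;
}
```

<p>The statefulness is visible here: until the header is decoded, the parser cannot know how many body bytes to wait for. A header with a content length of 5 and a padding length of 3 describes a 16-byte record.</p>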
<p>This pattern presents an implementation challenge: how to balance efficiency with implementation complexity. It seems easy to read and parse each structure/record sequentially, but there are pitfalls and fundamental problems:</p>
<p>The solution to all three of these is the same: buffering. This can be a dirty word because it hides magic: many programmers have wondered at some point why their output didn't get written, only to discover that they're using <a href="https://en.cppreference.com/w/cpp/io/c/fopen"><code>fopen()</code></a> and didn't write a newline, or similar. Buffering used well, however, can keep execution in user space. Optimistic buffering can make control flow even easier to understand.</p>
<p>The biggest thing that makes such code ugly is handling running out of data while in the middle of parsing a structure/record. If you have some state (say, a header), preserving that state until the remaining data arrives is messy and error-prone, especially in a many-connections-per-thread model. However, you've already read that data from the socket/buffer, so you can't just throw it away. But what if you had a smarter buffer?</p>
<p>Consider:</p>
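<pre><code>void HandleReadable() {
  // Pull whatever is available from the socket into the buffer.
  buffer.Read(socket);
  while (1) {
    // Rewind any previous reads past our last commit point.
    buffer.ResetRead();
    if (!ParseHeader(buffer)) {
      // Not enough data for a whole header yet; wait for the next call.
      break;
    }
    if (!ParseBody(buffer)) {
      // Header parsed but body incomplete; the header will be re-parsed later.
      break;
    }
    HandleRequest();
    // The buffer can discard any data that we've read so far.
    buffer.Commit();
  }
  // Move data to make room for the next Read().
  buffer.Consume();
}
</code></pre>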
<p>You still keep state between calls to this function, but it's entirely inside the buffer.</p>
<p>Internally, the buffer object has two pointers: one to the next byte to read, and one to the byte to rewind to. <code>ResetRead()</code> moves the read pointer to the rewind pointer. <code>Commit()</code> moves the rewind pointer to the read pointer. <code>Consume()</code> moves any data starting at the rewind pointer to the start of the buffer, then moves both pointers back by the same amount.</p>
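<p>The pointer bookkeeping described above can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not firebuf's actual API: the class name <code>RewindBuffer</code>, the <code>Append()</code> method (standing in for reading from a socket) and <code>ReadableBytes()</code> are all hypothetical.</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Fixed-capacity buffer with a read pointer and a rewind (commit) pointer.
// Hypothetical sketch; names and signatures are assumptions.
class RewindBuffer {
 public:
  explicit RewindBuffer(size_t capacity) : data_(capacity) {}

  // Copies up to len bytes into the free space at the end of the buffer.
  // Returns the number of bytes actually stored.
  size_t Append(const void* src, size_t len) {
    size_t room = data_.size() - end_;
    size_t n = len < room ? len : room;
    if (n > 0) {
      std::memcpy(data_.data() + end_, src, n);
    }
    end_ += n;
    return n;
  }

  // Returns a pointer to the next len bytes and advances the read pointer,
  // or returns nullptr (without advancing) if that much data isn't here yet.
  const uint8_t* Read(size_t len) {
    if (end_ - read_ < len) {
      return nullptr;
    }
    const uint8_t* p = data_.data() + read_;
    read_ += len;
    return p;
  }

  // Moves the read pointer back to the rewind pointer.
  void ResetRead() { read_ = rewind_; }

  // Moves the rewind pointer up to the read pointer.
  void Commit() { rewind_ = read_; }

  // Moves uncommitted data to the start of the buffer and shifts both
  // pointers back by the same amount, freeing space at the end.
  void Consume() {
    size_t remaining = end_ - rewind_;
    if (rewind_ > 0 && remaining > 0) {
      std::memmove(data_.data(), data_.data() + rewind_, remaining);
    }
    read_ -= rewind_;
    end_ = remaining;
    rewind_ = 0;
  }

  // Bytes available between the read pointer and the end of the data.
  size_t ReadableBytes() const { return end_ - read_; }

 private:
  std::vector<uint8_t> data_;
  size_t rewind_ = 0;  // Start of data not yet committed.
  size_t read_ = 0;    // Next byte to read; never behind rewind_.
  size_t end_ = 0;     // One past the last valid byte.
};
```

<p>A parse function built on this would call <code>Read()</code> for each field and return false as soon as it gets a null pointer; the caller's next <code>ResetRead()</code> undoes the partial progress.</p>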
<p>This is "optimistic" because it assumes that most of the time, we'll have a whole record to parse in the buffer. If data arrives one byte at a time, we'll waste lots of effort re-parsing the early bytes. That's not the way it works in reality for many protocols, though, and for things like FastCGI this can parse multiple records before it calls <code>Consume()</code>, minimizing not only syscalls but data copies.</p>
<p>In exchange for that risk of multiple re-parsing, we get a relatively clean, readable implementation. The <code>break</code> statements are probably the ugliest, but function return error handling patterns are a topic for another day.</p>
<p>Here's a simple implementation of such a smart buffer: <a href="https://github.com/firestuff/firebuf">firebuf</a></p>