2019-05-05-optimistic-parsing.html

<!--# set var="title" value="Optimistic parsing" -->
<!--# set var="date" value="2019-05-05" -->

<!--# include file="include/top.html" -->

<p>I've been writing a <a href="https://www.mit.edu/~yandros/doc/specs/fcgi-spec.html">FastCGI</a> protocol parser. The protocol is reasonably well designed (if not accurately documented), and parsing it follows a familiar pattern:</p>

<ul>
<li>Sequential structures/records (e.g. header followed by body)</li>
<li>Stateful: subsequent parsing depends on past parsing (e.g. length field in header tells you the size of body)</li>
<li>Variable length: most messages are short, but some can be long</li>
<li>Stream-based: underlying protocol doesn't provide any help with message boundaries</li>
</ul>

<p>This pattern presents an implementation challenge: how to balance efficiency with implementation complexity. It seems easy to read and parse each structure/record sequentially, but there are pitfalls and fundamental problems:</p>

<ul>
<li>You have to consistently handle interrupted system calls (<code>EINTR</code>) and short reads for every <code>read()</code></li>
<li>It only works in a thread-per-connection model</li>
<li>It results in extra system calls, which are <a href="2019-05-05-syscall-efficiency.html">expensive</a></li>
</ul>

<p>The solution to all three of these is the same: buffering. This can be a dirty word because it hides magic — many programmers have wondered at some point why their output didn't get written, only to discover that they're using <a href="https://en.cppreference.com/w/cpp/io/c/fopen"><code>fopen()</code></a> and didn't write a newline, or similar. Buffering used well, however, can keep execution in user space. Optimistic buffering can make control flow even easier to understand.</p>

<p>The biggest thing that makes such code ugly is handling running out of data while in the middle of parsing a structure/record. If you have some state (say, a header), preserving that state until the remaining data arrives is messy and error-prone, especially in a many-connections-per-thread model. However, you've already read that data from the socket/buffer, so you can't just throw it away. But what if you had a smarter buffer?</p>

<p>Consider:</p>

<pre><code>void HandleReadable() {
    buffer.Read(socket);

    while (1) {
        // Rewind any previous reads past our last commit point.
        buffer.ResetRead();

        if (!ParseHeader(buffer)) {
            break;
        }

        if (!ParseBody(buffer)) {
            break;
        }

        HandleRequest();

        // The buffer can discard any data that we've read so far.
        buffer.Commit();
    }

    // Move data to make room for the next Read().
    buffer.Consume();
}
</code></pre>

<p>You still keep state between calls to this function, but it's entirely inside the buffer.</p>

<p>Internally, the buffer object has two pointers: one to the next byte to read, and one to the byte to rewind to. <code>ResetRead()</code> moves the read pointer to the rewind pointer. <code>Commit()</code> moves the rewind pointer to the read pointer. <code>Consume()</code> moves any data starting at the commit pointer to the start of the buffer, then moves the pointers back to match.</p>

<p>This is "optimistic" because it assumes that most of the time, we'll have a whole record to parse in the buffer. If data arrives one byte at a time, we'll waste lots of effort re-parsing the early bytes. That's not the way it works in reality for many protocols, though, and for things like FastCGI this can parse multiple records before it calls <code>Consume()</code>, minimizing not only syscalls but data copies.</p>

<p>In exchange for that risk of multiple re-parsing, we get a relatively clean, readable implementation. The <code>break</code> statements are probably the ugliest, but function return error handling patterns are a topic for another day.</p>

<p>Here's a simple implementation of such a smart buffer: <a href="https://github.com/firestuff/firebuf">firebuf</a></p>

<!--# include file="include/bottom.html" -->
Optimistic parsing (partial) 2019-05-05 20:55:37 +00:00			`<!--# set var="title" value="Optimistic parsing" -->`
			`<!--# set var="date" value="2019-05-05" -->`

			`<!--# include file="include/top.html" -->`

			`<p>I've been writing a <a href="https://www.mit.edu/~yandros/doc/specs/fcgi-spec.html">FastCGI</a> protocol parser. The protocol is reasonably well designed (if not accurately documented), and parsing it follows a familiar pattern:</p>`

			`<ul>`
			`<li>Sequential structures/records (e.g. header followed by body)</li>`
			`<li>Stateful: subsequent parsing depends on past parsing (e.g. length field in header tells you the size of body)</li>`
			`<li>Variable length: most messages are short, but some can be long</li>`
			`<li>Stream-based: underlying protocol doesn't provide any help with message boundaries</li>`
			`</ul>`

			`<p>This pattern presents an implementation challenge: how to balance efficiency with implementation complexity. It seems easy to read and parse each structure/record sequentially, but there are pitfalls and fundamental problems:</p>`

			`<ul>`
Syscall efficiency 2019-05-05 23:11:57 +00:00			`<li>You have to consistently handle interrupted system calls (<code>EINTR</code>) and short reads for every <code>read()</code></li>`
Optimistic parsing (partial) 2019-05-05 20:55:37 +00:00			`<li>It only works in a thread-per-connection model</li>`
Syscall efficiency 2019-05-05 23:11:57 +00:00			`<li>It results in extra system calls, which are <a href="2019-05-05-syscall-efficiency.html">expensive</a></li>`
Optimistic parsing (partial) 2019-05-05 20:55:37 +00:00			`</ul>`

Syscall efficiency 2019-05-05 23:11:57 +00:00			`<p>The solution to all three of these is the same: buffering. This can be a dirty word because it hides magic — many programmers have wondered at some point why their output didn't get written, only to discover that they're using <a href="https://en.cppreference.com/w/cpp/io/c/fopen"><code>fopen()</code></a> and didn't write a newline, or similar. Buffering used well, however, can keep execution in user space. Optimistic buffering can make control flow even easier to understand.</p>`
Optimistic parsing (partial) 2019-05-05 20:55:37 +00:00
			`<p>The biggest thing that makes such code ugly is handling running out of data while in the middle of parsing a structure/record. If you have some state (say, a header), preserving that state until the remaining data arrives is messy and error-prone, especially in a many-connections-per-thread model. However, you've already read that data from the socket/buffer, so you can't just throw it away. But what if you had a smarter buffer?</p>`

			`<p>Consider:</p>`

			`<pre><code>void HandleReadable() {`
			`buffer.Read(socket);`

			`while (1) {`
			`// Rewind any previous reads past our last commit point.`
			`buffer.ResetRead();`

			`if (!ParseHeader(buffer)) {`
			`break;`
			`}`

			`if (!ParseBody(buffer)) {`
			`break;`
			`}`

			`HandleRequest();`

			`// The buffer can discard any data that we've read so far.`
			`buffer.Commit();`
			`}`

			`// Move data to make room for the next Read().`
			`buffer.Consume();`
			`}`
			`</code></pre>`

			`<p>You still keep state between calls to this function, but it's entirely inside the buffer.</p>`

Syscall efficiency 2019-05-05 23:11:57 +00:00			`<p>Internally, the buffer object has two pointers: one to the next byte to read, and one to the byte to rewind to. <code>ResetRead()</code> moves the read pointer to the rewind pointer. <code>Commit()</code> moves the rewind pointer to the read pointer. <code>Consume()</code> moves any data starting at the commit pointer to the start of the buffer, then moves the pointers back to match.</p>`
Optimistic parsing (partial) 2019-05-05 20:55:37 +00:00
Syscall efficiency 2019-05-05 23:11:57 +00:00			`<p>This is "optimistic" because it assumes that most of the time, we'll have a whole record to parse in the buffer. If data arrives one byte at a time, we'll waste lots of effort re-parsing the early bytes. That's not the way it works in reality for many protocols, though, and for things like FastCGI this can parse multiple records before it calls <code>Consume()</code>, minimizing not only syscalls but data copies.</p>`
Optimistic parsing (partial) 2019-05-05 20:55:37 +00:00
			`<p>In exchange for that risk of multiple re-parsing, we get a relatively clean, readable implementation. The <code>break</code> statements are probably the ugliest, but function return error handling patterns are a topic for another day.</p>`

			`<p>Here's a simple implementation of such a smart buffer: <a href="https://github.com/firestuff/firebuf">firebuf</a></p>`

			`<!--# include file="include/bottom.html" -->`