Syscall efficiency

2019-05-05 23:11:57 +00:00
parent 23ca1a4609
commit aac374c331
7 changed files with 197 additions and 11 deletions
--- a/2019-05-05-optimistic-parsing.html
+++ b/2019-05-05-optimistic-parsing.html
@@ -15,12 +15,12 @@
 <p>This pattern presents an implementation challenge: how to balance efficiency with implementation complexity. It seems easy to read and parse each structure/record sequentially, but there are pitfalls and fundamental problems:</p>
 <ul>
-<li>You have to consistently handle interrupted system calls (EINTR) and <a href="TODO">short reads</a> for every read()</li>
+<li>You have to consistently handle interrupted system calls (<code>EINTR</code>) and short reads for every <code>read()</code></li>
 <li>It only works in a thread-per-connection model</li>
-<li>It results in extra system calls, which are <a href="TODO">expensive</a></li>
+<li>It results in extra system calls, which are <a href="2019-05-05-syscall-efficiency.html">expensive</a></li>
 </ul>
-<p>The solution to all three of these is the same: buffering. This can be a dirty word because it hides magic — many programmers have wondered at some point why their output didn't get written, only to discover that they're using <a href="https://en.cppreference.com/w/cpp/io/c/fopen">fopen()</a> and didn't write a newline, or similar. Buffering used well, however, can keep execution in user space. Optimistic buffering can make control flow even easier to understand.</p>
+<p>The solution to all three of these is the same: buffering. This can be a dirty word because it hides magic — many programmers have wondered at some point why their output didn't get written, only to discover that they're using <a href="https://en.cppreference.com/w/cpp/io/c/fopen"><code>fopen()</code></a> and didn't write a newline, or similar. Buffering used well, however, can keep execution in user space. Optimistic buffering can make control flow even easier to understand.</p>
 <p>The biggest thing that makes such code ugly is handling running out of data while in the middle of parsing a structure/record. If you have some state (say, a header), preserving that state until the remaining data arrives is messy and error-prone, especially in a many-connections-per-thread model. However, you've already read that data from the socket/buffer, so you can't just throw it away. But what if you had a smarter buffer?</p>
@@ -54,9 +54,9 @@
 <p>You still keep state between calls to this function, but it's entirely inside the buffer.</p>
-<p>Internally, the buffer object has two pointers: one to the next byte to read, and one to the byte to rewind to. ResetRead() moves the read pointer to the rewind pointer. Commit() moves the rewind pointer to the read pointer. Consume() moves any data starting at the commit pointer to the start of the buffer, then moves the pointers back to match.</p>
+<p>Internally, the buffer object has two pointers: one to the next byte to read, and one to the byte to rewind to. <code>ResetRead()</code> moves the read pointer to the rewind pointer. <code>Commit()</code> moves the rewind pointer to the read pointer. <code>Consume()</code> moves any data starting at the commit pointer to the start of the buffer, then moves the pointers back to match.</p>
-<p>This is "optimistic" because it assumes that most of the time, we'll have a whole record to parse in the buffer. If data arrives one byte at a time, we'll waste lots of effort re-parsing the early bytes. That's not the way it works in reality for many protocols, though, and for things like FastCGI this can parse multiple records before it calls Consume(), minimizing not only syscalls but data copies.</p>
+<p>This is "optimistic" because it assumes that most of the time, we'll have a whole record to parse in the buffer. If data arrives one byte at a time, we'll waste lots of effort re-parsing the early bytes. That's not the way it works in reality for many protocols, though, and for things like FastCGI this can parse multiple records before it calls <code>Consume()</code>, minimizing not only syscalls but data copies.</p>
 <p>In exchange for that risk of multiple re-parsing, we get a relatively clean, readable implementation. The <code>break</code> statements are probably the ugliest, but function return error handling patterns are a topic for another day.</p>
--- a/2019-05-05-syscall-efficiency.html
+++ b/2019-05-05-syscall-efficiency.html
@@ -0,0 +1,93 @@
 <!--# set var="title" value="Syscall efficiency" -->
 <!--# set var="date" value="2019-05-05" -->
 <!--# include file="include/top.html" -->
 <p>System calls have come a long way since the interrupt days. <a href="http://man7.org/linux/man-pages/man7/vdso.7.html">vDSO</a> has moved many simple syscalls (e.g. <code>getpid()</code>) into pure userspace, making them as fast as indirect function calls.</p>
 <p>That still doesn't work for I/O, though. Arkanis has some <a href="http://arkanis.de/weblog/2017-01-05-measurements-of-system-call-performance-and-overhead">simple measurements</a> that show the difference between <code>getpid()</code> and <code>read()</code>, and it's clear that minimizing <code>read()</code> calls is still worth the effort.</p>
 <p>So, how's the world doing on that? Here's <a href="https://nghttp2.org/documentation/h2load-howto.html">h2load</a>, an HTTP/2 load testing client that clearly has a reason for a performance focus, connecting and making 3 requests to a server:</p>
 <pre><code>socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 5
 setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
 connect(5, {sa_family=AF_INET6, sin6_port=htons(443), inet_pton(AF_INET6, "...", &amp;sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
 epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLOUT, {u32=5, u64=4294967301}}) = 0
 epoll_wait(4, [{EPOLLOUT, {u32=5, u64=4294967301}}], 64, 59743) = 1
 getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
 write(5, "\26\3\1\2\0\1\0\1\374\3\3$Xo\3a5:I\275\351Ge;\216b\201\300\312.\202\326"..., 517) = 517
 read(5, 0x7fd01ac32ac3, 5)  = -1 EAGAIN (Resource temporarily unavailable)
 epoll_ctl(4, EPOLL_CTL_MOD, 5, {EPOLLIN, {u32=5, u64=8589934597}}) = 0
 epoll_wait(4, [{EPOLLIN, {u32=5, u64=8589934597}}], 64, 59743) = 1
 read(5, "\26\3\3\0z", 5)    = 5
 read(5, "\2\0\0v\3\3\233\317P\30\33f\17\3772\6K\2446\267N\304\313\232\177\364\224\n\334\37\324{"..., 122) = 122
 read(5, "\24\3\3\0\1", 5)   = 5
 read(5, "\1", 1)            = 1
 read(5, "\27\3\3\0 ", 5)    = 5
 read(5, "8\315\331bnq\370w\346+j~\310\330\213\30\17\301~9\200ezU:\321\228\6\274\365\212", 32) = 32
 read(5, "\27\3\3\n!", 5)    = 5
 read(5, "\265\366\242\vqr\241\260\373G\325\370\330\227A\376e\243JY\317\340\226x}Fo\\\364=\306\306"..., 2593) = 2593
 read(5, "\27\3\3\1\31", 5)  = 5
 read(5, "&amp;\356\321\337\346\240q\317\n\307^\23\362\1\35]D\304gc\334\302\367\340\34C\273\313\316\307yo"..., 281) = 281
 read(5, "\27\3\3\0E", 5)    = 5
 read(5, "\246m\377@\212@\244\276#vB\211\362\277\23-\17&gt;\324c\237\5\265\265&lt;\370\276\374\20k\353\352"..., 69) = 69
 write(5, "\24\3\3\0\1\1\27\3\3\0E)\235\322 \3\247N\216\202\300+\200\25K\177\341\376\300\212\235k"..., 80) = 80
 write(5, "\27\3\3\0\200\314Z.\260t2\0\336\315\351L&amp;\256\221\344\222\2453;\201\n\223\305\r\27=\260"..., 133) = 133
 epoll_wait(4, [{EPOLLIN, {u32=5, u64=8589934597}}], 64, 59743) = 1
 read(5, "\27\3\3\0J", 5)    = 5
 read(5, "\325\27U^\346\20\2\320\266h\215\356DV\168\300K+\265\f\316|\355\337h\361\177\242V\213\271"..., 74) = 74
 read(5, "\27\3\3\0J", 5)    = 5
 read(5, "\0012/\311\351g\270\201\3176\273\245t\230[\257\336j3,\377\365\356\213W\225\301K\304\371\334l"..., 74) = 74
 read(5, "\27\3\3\09", 5)    = 5
 read(5, "yz,\375\306\0177T-\313M\7\256\376\r\vpY\215\222m\233\16\n\364\345\201\235\302\n\276\255"..., 57) = 57
 read(5, "\27\3\3\0\32", 5)  = 5
 read(5, "\346^[\215\243iA\324\343\37\200\245\311\270\331S\332\243\311\261\236\325\374\237\v\247", 26) = 26
 read(5, 0x7fd01acb91c3, 5)  = -1 EAGAIN (Resource temporarily unavailable)
 write(5, "\27\3\3\0\32\377Imr\325\234U\307Ez\253Un\346\276\6P\256\262\214\260\324J\250,\274", 31) = 31
 epoll_wait(4, [{EPOLLIN, {u32=5, u64=8589934597}}], 64, 59743) = 1
 read(5, "\27\3\3\0s", 5)    = 5
 read(5, "\225\322\26\225\266\356\305hL\6\232)_\242_\310P\32H|zL\277\351c?\317\7G\252Z]"..., 115) = 115
 read(5, 0x7fd01acb91c3, 5)  = -1 EAGAIN (Resource temporarily unavailable)
 write(5, "\27\3\3\0%L\334\214\234\245X}1\213\4\0\3*bky\370:\344\357T\316\360\342\f\243;"..., 42) = 42
 epoll_wait(4, [{EPOLLIN, {u32=5, u64=8589934597}}], 64, 59743) = 1
 read(5, "\27\3\3\0s", 5)    = 5
 read(5, "\351+\3gyf\305.V\272\373\354\240\231UI\222\25oq\244\200\0255c\221\301\t\2m\273\326"..., 115) = 115
 read(5, 0x7fd01acb91c3, 5)  = -1 EAGAIN (Resource temporarily unavailable)
 write(5, "\27\3\3\0%\300\337r\tR\252O,2\223B\256b\222\271\5\322\265\227\243\327\304G\10\272_3"..., 42) = 42
 epoll_wait(4, [{EPOLLIN, {u32=5, u64=8589934597}}], 64, 59743) = 1
 read(5, "\27\3\3\0s", 5)    = 5
 read(5, "\323D\n\313\323\2\353q\324\210\317\213\30\2550\3738L\200D&lt;\326\214b\366\24\335\"\31\311w\223"..., 115) = 115
 read(5, 0x7fd01acb91c3, 5)  = -1 EAGAIN (Resource temporarily unavailable)
 write(5, "\27\3\3\0\"\217P\25^*\232\311vCy=u\207;~\"Gf\203h\247\25m&amp;\255\272`"..., 39) = 39
 write(5, "\27\3\3\0\23\234SF\221\214\213\7&lt;\336`\316\252\315\307\204\3\342AL", 24) = 24
 shutdown(5, SHUT_WR)        = 0
 close(5)                    = 0
 </code></pre>
 <p>(╯°□°)╯︵ ┻━┻</p>
 <p>If you don't spend a lot of time reading <code>strace</code> output, that may not look odd or bad. It's a complete mess, though. Here's what stands out:</p>
 <ul>
 <li>The socket is created with <code>SOCK_NONBLOCK</code>. That's useful in some niche cases, but here it's mixed with <code>epoll_wait()</code> and this code is one-socket-per-thread anyway. That leads to behavior like calling <code>connect()</code>, getting <code>EINPROGRESS</code> back, and immediately calling <code>epoll_ctl()</code> and <code>epoll_wait()</code> to reconstruct blocking <code>connect()</code> behavior.</li>
 <li><code>TCP_NODELAY</code> is set on the socket. That can be great if you know that you're writing chunks that should go out immediately. In this code, though, there are multiple <code>write()</code> calls one after another. With <code>TCP_NODELAY</code>, those get broken into separate packets, with lots of wasted overhead on the wire and at both ends.</li>
 <li>There's a call to <code>getsockopt(SO_ERROR)</code> just before a call to <code>write()</code>. If the socket was closed, the <code>write()</code> would fail with a distinct errno, so the <code>getsockopt()</code> is redundant.</li>
 <li>Wow, that's a lot of <code>read()</code> calls. Lots of small calls that are fulfilled with all the bytes they asked for, then immediately another <code>read()</code> for more data. h2load doesn't do this in unencrypted mode, so it's probably OpenSSL's fault.</li>
 <li><code>shutdown()</code> gets called just before <code>close()</code> on the same file descriptor, uselessly.</li>
 </ul>
 <p>Issues like this frequently creep in with abstractions. Attempts to separate layers of parser logic, or to make the OpenSSL library API sensible, require you to do work in discrete chunks. It's possible to work around much of it with good buffer usage, but it can complicate things.</p>
 <p>Want to avoid this in your code? Some suggestions:</p>
 <ul>
 <li>Make a conscious choice of model, rather than stumbling into one. Are you using <code>epoll</code> or one-socket-pre-thread?</li>
 <li>Consider <a href="2019-05-05-optimistic-parsing.html">optimisitic parsing</a>.</li>
 <li>Avoid non-blocking sockets (<code>SOCK_NONBLOCK</code> and <code>O_NONBLOCK</code>). You really don't need them in either model.</li>
 <li>Use <a href="https://linux.die.net/man/2/readv"><code>readv()</code></a> and <a href="https://linux.die.net/man/2/writev"><code>writev()</code></a>. They let you avoid copying in/out a buffer for I/O if you're dealing with complex structures. Copies are better than syscalls, but they're still not free.</li>
 <li>Only set <code>TCP_NODELAY</code> if you're being careful with <code>write()</code>.</li>
 <li>Don't pre-check for failure (<code>getsockopt(SO_ERROR)</code>). The socket can still go into error state between that call and your next <code>read()</code> or <code>write()</code>, so you have to handle errors in those anyway. That means the pre-check is useless, even if it catches 99% of closed connections.</li>
 <li>Run your code under <code>strace</code> regularly during development, and explain to yourself why each syscall is there.</li>
 </ul>
 <!--# include file="include/bottom.html" -->
--- a/include/style.css
+++ b/include/style.css
@@ -94,7 +94,7 @@ ul li, ol li {
 code {
 	font-family: monospace, Courier;
 	font-size: 16px;
-	color: #22bb00;
+	color: #5ac941;
 }
 pre code {
@@ -103,6 +103,8 @@ pre code {
 	border: 1px solid grey;
 	color: #22bb00;
 	overflow: scroll;
 	padding: 1ch;
--- a/index.html
+++ b/index.html
@@ -3,6 +3,8 @@
 <!--# include file="include/top.html" -->
 <ol>
 <li>2019-05-05: <a href="2019-05-05-syscall-efficiency.html">Syscall efficiency</a></li>
 <li>2019-05-05: <a href="2019-05-05-optimistic-parsing.html">Optimistic parsing</a></li>
 <li>2019-04-28: <a href="2019-04-28-h2load-process-request-failure.html">h2load Process Request Failure</a></li>
 <li>2019-04-21: <a href="2019-04-21-old-code-roundup.html">Old code roundup</a></li>
 <li>2019-04-14: <a href="2019-04-14-reboot.html">Reboot</a></li>
--- a/markdown/2019-05-05-optimistic-parsing.md
+++ b/markdown/2019-05-05-optimistic-parsing.md
@@ -12,11 +12,11 @@ I've been writing a [FastCGI](https://www.mit.edu/~yandros/doc/specs/fcgi-spec.h
 This pattern presents an implementation challenge: how to balance efficiency with implementation complexity. It seems easy to read and parse each structure/record sequentially, but there are pitfalls and fundamental problems:
-* You have to consistently handle interrupted system calls (EINTR) and [short reads](TODO) for every read()
+* You have to consistently handle interrupted system calls (`EINTR`) and short reads for every `read()`
 * It only works in a thread-per-connection model
-* It results in extra system calls, which are [expensive](TODO)
+* It results in extra system calls, which are [expensive](2019-05-05-syscall-efficiency.html)
-The solution to all three of these is the same: buffering. This can be a dirty word because it hides magic — many programmers have wondered at some point why their output didn't get written, only to discover that they're using [fopen()](https://en.cppreference.com/w/cpp/io/c/fopen) and didn't write a newline, or similar. Buffering used well, however, can keep execution in user space. Optimistic buffering can make control flow even easier to understand.
+The solution to all three of these is the same: buffering. This can be a dirty word because it hides magic — many programmers have wondered at some point why their output didn't get written, only to discover that they're using [`fopen()`](https://en.cppreference.com/w/cpp/io/c/fopen) and didn't write a newline, or similar. Buffering used well, however, can keep execution in user space. Optimistic buffering can make control flow even easier to understand.
 The biggest thing that makes such code ugly is handling running out of data while in the middle of parsing a structure/record. If you have some state (say, a header), preserving that state until the remaining data arrives is messy and error-prone, especially in a many-connections-per-thread model. However, you've already read that data from the socket/buffer, so you can't just throw it away. But what if you had a smarter buffer?
@@ -49,9 +49,9 @@ Consider:
 You still keep state between calls to this function, but it's entirely inside the buffer.
-Internally, the buffer object has two pointers: one to the next byte to read, and one to the byte to rewind to. ResetRead() moves the read pointer to the rewind pointer. Commit() moves the rewind pointer to the read pointer. Consume() moves any data starting at the commit pointer to the start of the buffer, then moves the pointers back to match.
+Internally, the buffer object has two pointers: one to the next byte to read, and one to the byte to rewind to. `ResetRead()` moves the read pointer to the rewind pointer. `Commit()` moves the rewind pointer to the read pointer. `Consume()` moves any data starting at the commit pointer to the start of the buffer, then moves the pointers back to match.
-This is "optimistic" because it assumes that most of the time, we'll have a whole record to parse in the buffer. If data arrives one byte at a time, we'll waste lots of effort re-parsing the early bytes. That's not the way it works in reality for many protocols, though, and for things like FastCGI this can parse multiple records before it calls Consume(), minimizing not only syscalls but data copies.
+This is "optimistic" because it assumes that most of the time, we'll have a whole record to parse in the buffer. If data arrives one byte at a time, we'll waste lots of effort re-parsing the early bytes. That's not the way it works in reality for many protocols, though, and for things like FastCGI this can parse multiple records before it calls `Consume()`, minimizing not only syscalls but data copies.
 In exchange for that risk of multiple re-parsing, we get a relatively clean, readable implementation. The `break` statements are probably the ugliest, but function return error handling patterns are a topic for another day.
--- a/markdown/2019-05-05-syscall-efficiency.md
+++ b/markdown/2019-05-05-syscall-efficiency.md
@@ -0,0 +1,88 @@
 <!--# set var="title" value="Syscall efficiency" -->
 <!--# set var="date" value="2019-05-05" -->
 <!--# include file="include/top.html" -->
 System calls have come a long way since the interrupt days. [vDSO](http://man7.org/linux/man-pages/man7/vdso.7.html) has moved many simple syscalls (e.g. `getpid()`) into pure userspace, making them as fast as indirect function calls.
 That still doesn't work for I/O, though. Arkanis has some [simple measurements](http://arkanis.de/weblog/2017-01-05-measurements-of-system-call-performance-and-overhead) that show the difference between `getpid()` and `read()`, and it's clear that minimizing `read()` calls is still worth the effort.
 So, how's the world doing on that? Here's [h2load](https://nghttp2.org/documentation/h2load-howto.html), an HTTP/2 load testing client that clearly has a reason for a performance focus, connecting and making 3 requests to a server:
    socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 5
 	setsockopt(5, SOL_TCP, TCP_NODELAY, [1], 4) = 0
 	connect(5, {sa_family=AF_INET6, sin6_port=htons(443), inet_pton(AF_INET6, "...", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
 	epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLOUT, {u32=5, u64=4294967301}}) = 0
 	epoll_wait(4, [{EPOLLOUT, {u32=5, u64=4294967301}}], 64, 59743) = 1
 	getsockopt(5, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
 	write(5, "\26\3\1\2\0\1\0\1\374\3\3$Xo\3a5:I\275\351Ge;\216b\201\300\312.\202\326"..., 517) = 517
 	read(5, 0x7fd01ac32ac3, 5)  = -1 EAGAIN (Resource temporarily unavailable)
 	epoll_ctl(4, EPOLL_CTL_MOD, 5, {EPOLLIN, {u32=5, u64=8589934597}}) = 0
 	epoll_wait(4, [{EPOLLIN, {u32=5, u64=8589934597}}], 64, 59743) = 1
 	read(5, "\26\3\3\0z", 5)    = 5
 	read(5, "\2\0\0v\3\3\233\317P\30\33f\17\3772\6K\2446\267N\304\313\232\177\364\224\n\334\37\324{"..., 122) = 122
 	read(5, "\24\3\3\0\1", 5)   = 5
 	read(5, "\1", 1)            = 1
 	read(5, "\27\3\3\0 ", 5)    = 5
 	read(5, "8\315\331bnq\370w\346+j~\310\330\213\30\17\301~9\200ezU:\321\228\6\274\365\212", 32) = 32
 	read(5, "\27\3\3\n!", 5)    = 5
 	read(5, "\265\366\242\vqr\241\260\373G\325\370\330\227A\376e\243JY\317\340\226x}Fo\\\364=\306\306"..., 2593) = 2593
 	read(5, "\27\3\3\1\31", 5)  = 5
 	read(5, "&\356\321\337\346\240q\317\n\307^\23\362\1\35]D\304gc\334\302\367\340\34C\273\313\316\307yo"..., 281) = 281
 	read(5, "\27\3\3\0E", 5)    = 5
 	read(5, "\246m\377@\212@\244\276#vB\211\362\277\23-\17>\324c\237\5\265\265<\370\276\374\20k\353\352"..., 69) = 69
 	write(5, "\24\3\3\0\1\1\27\3\3\0E)\235\322 \3\247N\216\202\300+\200\25K\177\341\376\300\212\235k"..., 80) = 80
 	write(5, "\27\3\3\0\200\314Z.\260t2\0\336\315\351L&\256\221\344\222\2453;\201\n\223\305\r\27=\260"..., 133) = 133
 	epoll_wait(4, [{EPOLLIN, {u32=5, u64=8589934597}}], 64, 59743) = 1
 	read(5, "\27\3\3\0J", 5)    = 5
 	read(5, "\325\27U^\346\20\2\320\266h\215\356DV\168\300K+\265\f\316|\355\337h\361\177\242V\213\271"..., 74) = 74
 	read(5, "\27\3\3\0J", 5)    = 5
 	read(5, "\0012/\311\351g\270\201\3176\273\245t\230[\257\336j3,\377\365\356\213W\225\301K\304\371\334l"..., 74) = 74
 	read(5, "\27\3\3\09", 5)    = 5
 	read(5, "yz,\375\306\0177T-\313M\7\256\376\r\vpY\215\222m\233\16\n\364\345\201\235\302\n\276\255"..., 57) = 57
 	read(5, "\27\3\3\0\32", 5)  = 5
 	read(5, "\346^[\215\243iA\324\343\37\200\245\311\270\331S\332\243\311\261\236\325\374\237\v\247", 26) = 26
 	read(5, 0x7fd01acb91c3, 5)  = -1 EAGAIN (Resource temporarily unavailable)
 	write(5, "\27\3\3\0\32\377Imr\325\234U\307Ez\253Un\346\276\6P\256\262\214\260\324J\250,\274", 31) = 31
 	epoll_wait(4, [{EPOLLIN, {u32=5, u64=8589934597}}], 64, 59743) = 1
 	read(5, "\27\3\3\0s", 5)    = 5
 	read(5, "\225\322\26\225\266\356\305hL\6\232)_\242_\310P\32H|zL\277\351c?\317\7G\252Z]"..., 115) = 115
 	read(5, 0x7fd01acb91c3, 5)  = -1 EAGAIN (Resource temporarily unavailable)
 	write(5, "\27\3\3\0%L\334\214\234\245X}1\213\4\0\3*bky\370:\344\357T\316\360\342\f\243;"..., 42) = 42
 	epoll_wait(4, [{EPOLLIN, {u32=5, u64=8589934597}}], 64, 59743) = 1
 	read(5, "\27\3\3\0s", 5)    = 5
 	read(5, "\351+\3gyf\305.V\272\373\354\240\231UI\222\25oq\244\200\0255c\221\301\t\2m\273\326"..., 115) = 115
 	read(5, 0x7fd01acb91c3, 5)  = -1 EAGAIN (Resource temporarily unavailable)
 	write(5, "\27\3\3\0%\300\337r\tR\252O,2\223B\256b\222\271\5\322\265\227\243\327\304G\10\272_3"..., 42) = 42
 	epoll_wait(4, [{EPOLLIN, {u32=5, u64=8589934597}}], 64, 59743) = 1
 	read(5, "\27\3\3\0s", 5)    = 5
 	read(5, "\323D\n\313\323\2\353q\324\210\317\213\30\2550\3738L\200D<\326\214b\366\24\335\"\31\311w\223"..., 115) = 115
 	read(5, 0x7fd01acb91c3, 5)  = -1 EAGAIN (Resource temporarily unavailable)
 	write(5, "\27\3\3\0\"\217P\25^*\232\311vCy=u\207;~\"Gf\203h\247\25m&\255\272`"..., 39) = 39
 	write(5, "\27\3\3\0\23\234SF\221\214\213\7<\336`\316\252\315\307\204\3\342AL", 24) = 24
 	shutdown(5, SHUT_WR)        = 0
 	close(5)                    = 0
 (╯°□°)╯︵ ┻━┻
 If you don't spend a lot of time reading `strace` output, that may not look odd or bad. It's a complete mess, though. Here's what stands out:
 * The socket is created with `SOCK_NONBLOCK`. That's useful in some niche cases, but here it's mixed with `epoll_wait()` and this code is one-socket-per-thread anyway. That leads to behavior like calling `connect()`, getting `EINPROGRESS` back, and immediately calling `epoll_ctl()` and `epoll_wait()` to reconstruct blocking `connect()` behavior.
 * `TCP_NODELAY` is set on the socket. That can be great if you know that you're writing chunks that should go out immediately. In this code, though, there are multiple `write()` calls one after another. With `TCP_NODELAY`, those get broken into separate packets, with lots of wasted overhead on the wire and at both ends.
 * There's a call to `getsockopt(SO_ERROR)` just before a call to `write()`. If the socket was closed, the `write()` would fail with a distinct errno, so the `getsockopt()` is redundant.
 * Wow, that's a lot of `read()` calls. Lots of small calls that are fulfilled with all the bytes they asked for, then immediately another `read()` for more data. h2load doesn't do this in unencrypted mode, so it's probably OpenSSL's fault.
 * `shutdown()` gets called just before `close()` on the same file descriptor, uselessly.
 Issues like this frequently creep in with abstractions. Attempts to separate layers of parser logic, or to make the OpenSSL library API sensible, require you to do work in discrete chunks. It's possible to work around much of it with good buffer usage, but it can complicate things.
 Want to avoid this in your code? Some suggestions:
 * Make a conscious choice of model, rather than stumbling into one. Are you using `epoll` or one-socket-pre-thread?
 * Consider [optimisitic parsing](2019-05-05-optimistic-parsing.html).
 * Avoid non-blocking sockets (`SOCK_NONBLOCK` and `O_NONBLOCK`). You really don't need them in either model.
 * Use [`readv()`](https://linux.die.net/man/2/readv) and [`writev()`](https://linux.die.net/man/2/writev). They let you avoid copying in/out a buffer for I/O if you're dealing with complex structures. Copies are better than syscalls, but they're still not free.
 * Only set `TCP_NODELAY` if you're being careful with `write()`.
 * Don't pre-check for failure (`getsockopt(SO_ERROR)`). The socket can still go into error state between that call and your next `read()` or `write()`, so you have to handle errors in those anyway. That means the pre-check is useless, even if it catches 99% of closed connections.
 * Run your code under `strace` regularly during development, and explain to yourself why each syscall is there.
 <!--# include file="include/bottom.html" -->
--- a/markdown/index.md
+++ b/markdown/index.md
@@ -2,6 +2,7 @@
 <!--# include file="include/top.html" -->
 1. 2019-05-05: [Syscall efficiency](2019-05-05-syscall-efficiency.html)
 1. 2019-05-05: [Optimistic parsing](2019-05-05-optimistic-parsing.html)
 1. 2019-04-28: [h2load Process Request Failure](2019-04-28-h2load-process-request-failure.html)
 1. 2019-04-21: [Old code roundup](2019-04-21-old-code-roundup.html)