Project 4 FAQ

Network Programming FAQ

Do we need to use the textbook author's RIO package?

No. We would like you to use the base code we provide instead, particularly the bufio package we wrote.

Both our bufio package and RIO provide an I/O layer on top of sockets with two core pieces of functionality: handling short reads and buffering read data. Handling short reads is a correctness requirement, whereas buffering is a performance optimization.

We believe that bufio is superior to RIO because it is (a) designed for use in a multi-threaded environment where exiting is not an option, and (b) provides dynamically allocated buffers.

How can I verify that my server responds correctly to HTTP requests?

Run it under strace. You may wish to pass '-s 1024' so you can examine all data read and written, and possibly use '-e network,read,write' or a similar subset to restrict which system calls are traced.

How can I verify that my server handles persistent connections correctly?

A simple trick is to use curl -v and specify the same URL twice, as in this hypothetical example:

$ curl -v http://cs3214.cs.vt.edu:9011/loadavg http://cs3214.cs.vt.edu:9011/loadavg
* About to connect() to cs3214.cs.vt.edu port 9011
*   Trying 128.173.41.123... connected
* Connected to cs3214.cs.vt.edu (128.173.41.123) port 9011
...
Connection #0 to host cs3214.cs.vt.edu left intact
* Re-using existing connection! (#0) with host cs3214.cs.vt.edu
...
Connection #0 to host cs3214.cs.vt.edu left intact
* Closing connection #0

Look for the lines shown above, particularly "Re-using existing connection".

How do I use valgrind/strace when running under the test harness?

Your server may fail when the test harness executes its tests in sequence. One way to debug such problems is to use wrappers for valgrind and strace. To do this, create files swrapper and vwrapper like so:

  • swrapper should contain
#!/bin/sh
#
# invoke strace, passing along any arguments
strace -s 1024 -e network,read,write -ff -o stracelog ./server "$@"
  • vwrapper should contain
#!/bin/sh
#
# invoke valgrind, passing along any arguments
valgrind --log-file=valgrind.log ./server "$@"

Make sure to make those scripts executable (chmod +x ?wrapper). Then you can run the test harness via server_unit_test_pserv.py -s ./vwrapper -o output to see valgrind output in output (and analogously with ./swrapper for strace).

Does the body of an HTTP response have to be terminated with \r\n?

No. CRLF is used only to separate header lines (and to end the header). The body consists of arbitrary content, which isn't necessarily line-oriented at all.
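
For instance, a correctly framed response could look like the following sketch (header values are illustrative):

// The header section ends with an empty line (CRLF CRLF); the body is
// exactly Content-Length bytes, with no CRLF appended after it.
const char *response =
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/plain\r\n"
    "Content-Length: 5\r\n"
    "\r\n"
    "hello";        // exactly 5 body bytes, no trailing \r\n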

curl indicates that persistent connections work, but we still fail the test.

A possible reason is that you append additional bytes beyond the number you announced in Content-Length. curl skips whitespace when looking for the start of the next response, so if the extra bytes you send are whitespace (such as \r\n), curl will hide the problem.

Do we need to use select() (or poll())?

To meet the basic requirements of the assignment, no. The only possible use I could see is if you wanted to implement a time-out feature without issuing a blocking read() call. In this case, you would call select() before read(), passing a read set that consists only of the file descriptor you intend to read from. If select() times out without the file descriptor having become readable, you treat it as a timeout. Otherwise, you perform a single read() call on the file descriptor, which you now know won't block.
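
As a sketch, assuming conn_fd is the connection's file descriptor, a 5-second timeout, and the usual headers (<sys/select.h>, <unistd.h>) included:

char buf[1024];
fd_set readfds;
FD_ZERO(&readfds);
FD_SET(conn_fd, &readfds);
struct timeval timeout = { .tv_sec = 5, .tv_usec = 0 };
int rc = select(conn_fd + 1, &readfds, NULL, NULL, &timeout);
if (rc == 0) {
    // timed out: treat as a client timeout and close the connection
} else if (rc > 0) {
    ssize_t n = read(conn_fd, buf, sizeof buf);  // will not block
    // ... process the n bytes read (n == 0 means EOF)
}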

An alternative to that is the use of a timer and a signal, as shown here, although select() is likely the simpler solution.

If our server is supposed to handle multiple clients simultaneously, do we need to bind() one (or multiple) sockets to multiple ports?

No. TCP uses a quadruple (src ip, src port, dst ip, dst port) to identify each connection, so different connections can go to the same dst port as long as the src ip or src port differ. You will obtain a socket that refers to a specific connection (i.e., to a client with a specific src ip/src port pair) as the return value of the accept() call.
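
As a sketch, assuming listen_fd is your bound, listening socket:

// One listening socket on one port serves many clients; each accept()
// returns a new fd identifying one connection (one src ip/src port pair).
for (;;) {
    int client_fd = accept(listen_fd, NULL, NULL);
    if (client_fd == -1) {
        perror("accept");
        continue;
    }
    // hand client_fd off to a worker thread ...
}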

curl claims to speak HTTP/1.1, but does not send a Connection: close header

Using the Connection: close header to announce that no more requests or responses will be sent on an existing TCP connection is a courtesy, but not a requirement, in HTTP/1.1. Both client and server are free to simply close the connection instead. In fact, such closing is needed when a client and/or server opportunistically keeps an HTTP/1.1 connection open, only to later realize that nothing more needs to be requested (in the client's case) or that the connection needs to be closed, perhaps to avoid running out of file descriptors or ports (in the server's case). For details, read Section 6.5 of RFC 7230, which updates Section 8 of RFC 2616, particularly Section 8.1.4, Practical Considerations.

Why do I get a SIGPIPE signal/why does writing to a socket fail with errno=EPIPE?

According to write(2):

EPIPE  fd is connected to a pipe or socket whose reading end is closed. When this happens the writing process will also receive a SIGPIPE signal. (Thus, the write return value is seen only if the program catches, blocks or ignores this signal.)

This means the client has closed its file descriptor and would be unable to read() the data sent through this connection. Note that due to buffering and the internals of the TCP protocol, there may be a delay before EPIPE is returned. When it is, you should stop trying to send data on that connection and close the fd.

You can also disable this mechanism, but then you must make sure that you do not create runaway threads that repeatedly try to read from already closed sockets, waiting for responses that will never come. Hence, use at your own risk.

Three ways of doing that are discussed here: ignoring SIGPIPE altogether, using send/recv with the MSG_NOSIGNAL flag, or using a setsockopt() option (not available on Linux, however) to turn off this behavior.

The provided bufio code uses the send call with the MSG_NOSIGNAL flag; however, if you call sendfile, you may still be subject to SIGPIPE. That's why SIGPIPE is ignored in main.
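
The one-time setup referred to above looks like this (requires <signal.h>):

// Ignore SIGPIPE process-wide: write()/sendfile() on a connection whose
// reading end is closed then fail with errno == EPIPE instead of
// terminating the process.
signal(SIGPIPE, SIG_IGN);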

My server works with curl, but fails with the test harness.

Before starting any tests, the test harness checks whether your server is running by connecting to it, then disconnecting. Your server must not crash when that happens (the book's tiny.c does). You can detect this situation when, after accepting a client, the very first read()/recv() call returns 0, indicating EOF on that socket. Make sure you handle it correctly.

To that end, you need to understand the provided bufio code.
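
In terms of the raw system calls, the situation looks like the following sketch (inside a hypothetical per-connection handler; with bufio, the corresponding read function reports EOF in its own way):

char buf[1024];
ssize_t n = read(client_fd, buf, sizeof buf);   // first read after accept()
if (n == 0) {
    // EOF: the client connected and disconnected without sending a request
    close(client_fd);
    return;
}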

How should we handle IPv6/protocol-independent programming?

Handling both IPv6 and IPv4 clients is (surprisingly) complicated, even more than two decades after the IPv6 programming model was developed. Nevertheless, in our opinion, it is not acceptable for network programs today to be IPv4-only.

The first consideration is that your code must work on a machine that has only an IPv4 address, on a machine that has only a global IPv6 address, and on a machine that has both. (We don't ask that you support link-local IPv6 addresses, which would require additional programming effort.) Thus, your program needs to learn which addresses are available on a given system. It should use getaddrinfo() for that, with AI_PASSIVE set, as shown in Drepper's Userlevel IPv6 Programming Introduction.
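
A sketch of that call, assuming port_string holds the server's port number as a string and the usual headers (<netdb.h>, <string.h>, <stdio.h>, <stdlib.h>) are included:

struct addrinfo hints, *info;
memset(&hints, 0, sizeof hints);
hints.ai_family = AF_UNSPEC;       // both IPv4 and IPv6
hints.ai_socktype = SOCK_STREAM;
hints.ai_flags = AI_PASSIVE;       // addresses suitable for bind()
int err = getaddrinfo(NULL, port_string, &hints, &info);
if (err != 0) {
    fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
    exit(1);
}
for (struct addrinfo *p = info; p != NULL; p = p->ai_next) {
    // p->ai_family is AF_INET or AF_INET6; p->ai_addr is the address to bind
}
freeaddrinfo(info);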

getaddrinfo() will return a list of addresses (typically one or two). You need to use the returned addresses in a particular manner, described below. First, note that there is (apparently) no guarantee as to the order in which the addresses are returned, as I note in this email, which contradicts the statement Drepper makes on his page. I believe Drepper may have wrongly interpreted RFC 3484, or couldn't convince the libc maintainers of the major Linux distributions to accept his interpretation; in any event, on our rlogin machines (running CentOS 6, 7, or now Stream), which have both global IPv4 (via NAT) and IPv6 connectivity, the IPv4 address is returned before the IPv6 address, contrary to Drepper's statement.

There are two approaches to dealing with IPv4 and IPv6. The first approach is to keep them separate: two separate sockets, separately bound via bind(), with separate calls to listen() and accept(). Note that this approach requires multiple threads (or a form of I/O multiplexing such as select()/poll()), because you must avoid the following pitfall:

while (1) { // server loop
    int ipv4client = accept(ipv4socket, ....);   // blocks until an IPv4 client connects
    handleConnection(ipv4client);
    int ipv6client = accept(ipv6socket, ....);   // IPv6 clients wait until then
    handleConnection(ipv6client);
}

In the above loop, while we wait for an IPv4 client, we cannot accept and serve pending IPv6 clients, and vice versa. Instead, you need one thread for the IPv4 accept loop and one for the IPv6 accept loop, as sketched below. Since you're using multiple threads in this project anyway, implementing this should not be much of a burden.
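
A sketch of the two-thread structure (handle_connection is a hypothetical handler; requires <pthread.h>):

static void *accept_loop(void *arg) {
    int listen_fd = *(int *)arg;
    for (;;) {
        int client_fd = accept(listen_fd, NULL, NULL);
        if (client_fd != -1)
            handle_connection(client_fd);   // hypothetical handler
    }
    return NULL;
}

// in main(), after binding and listening on both sockets:
pthread_t t4, t6;
pthread_create(&t4, NULL, accept_loop, &ipv4socket);
pthread_create(&t6, NULL, accept_loop, &ipv6socket);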

The second approach is to use the so-called dual-bind feature, which was introduced to make porting existing servers from IPv4 to IPv6 simpler by allowing an application to accept both IPv4 and IPv6 clients with just one socket. For this to work, the socket must be bound to the IPv6 address. If a socket is dual-bound to IPv6/IPv4, a subsequent attempt to bind another socket to the IPv4 port will fail with EADDRINUSE. (And vice versa: if the IPv4 port is already bound to another socket, an attempt to dual-bind to the IPv6 address will fail with EADDRINUSE!)

On most Linux systems, including ours, dual-bind is enabled by default. That is, if you bind a socket to an IPv6 address, it will automatically be bound to the corresponding IPv4 address as well. There is an option to turn dual-bind off for a given socket. If you wish to use the two-socket approach, you must turn it off before calling bind(), like so:

// 'pinfo' obtained from getaddrinfo(); 's' is the socket created
// with socket(pinfo->ai_family, ...)
// do not dual-bind if this is an IPv6 address
if (pinfo->ai_family == AF_INET6) {
    int on = 1;
    if (setsockopt(s, IPPROTO_IPV6, IPV6_V6ONLY, (void *)&on, sizeof(on)) == -1)
        perror("setsockopt");
}
// now bind(s, pinfo->ai_addr, ...)

So, in summary:

  • Dual-bind approach: bind a single socket and use a single thread to do the accept. Caveat: you can't assume the IPv6 address comes before the IPv4 address in the list; Drepper's technique does not work on CentOS. Keep in mind that your code should still work on any configuration (IPv4-only, IPv6-only, dual-stack). You may need to iterate through the list returned by getaddrinfo twice: first to detect whether it contains an IPv6 address and, if so, bind to it (and only to it); otherwise, bind to the IPv4 address (see the sketch after this list). A drawback of this approach is that it will not work on systems that do not allow dual-binding, such as OpenBSD.

  • Two-socket approach: handle the protocols separately, with two sockets. You need two threads (these could be tasks in your thread pool, but don't forget that they are long-running, so you'll want to increase the number of threads in the pool, since two of them will be running the accept() loops). And you need to avoid an accidental dual-bind by explicitly turning on IPV6_V6ONLY for the socket you intend to bind to IPv6, and for that socket only.
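
For the dual-bind approach, the two-pass scan over the list returned by getaddrinfo() might look like this sketch ('info' as returned by getaddrinfo()):

// Prefer an IPv6 address if one is present; otherwise fall back to IPv4.
struct addrinfo *chosen = NULL;
for (struct addrinfo *p = info; p != NULL; p = p->ai_next)
    if (p->ai_family == AF_INET6)
        chosen = p;                 // IPv6 found: dual-bind (only) to it
if (chosen == NULL)
    for (struct addrinfo *p = info; p != NULL; p = p->ai_next)
        if (p->ai_family == AF_INET)
            chosen = p;             // no IPv6: bind to the IPv4 address
// now create a socket with chosen->ai_family and bind it to chosen->ai_addr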

Is supporting IPv6 really that complicated?

As of this writing, no simpler approach is known. A Jan 2016 conversation with Steinar Gunderson, the lead implementor of Google's 2010 transition to IPv6, appears to confirm this lack of a universal method.

Does the IPv6 code in the 3rd edition of the textbook work?

Unfortunately, the 3rd edition of our textbook advocates the wrong method in its open_listenfd function. To make matters worse, I am credited with this wrong method in the book's preface.

How does the benchmarking work?

Benchmarking web servers is a surprisingly difficult task, as was recognized as early as 1998.

In our approach, we use a tool called wrk by Will Glozer. This tool simulates n simultaneous clients that repeatedly access a URL, each over a separate persistent HTTP/1.1 connection, recording whether each request succeeded (with a 200 response), failed (with a non-200 code), timed out, or incurred another error, such as a connection reset. wrk's scalable implementation ensures that the client does not become the bottleneck. wrk then tabulates the results. For each successful request, it also records and tabulates the latency.

There are multiple possible performance metrics for web servers: throughput measured in requests per second, throughput measured in bytes per second, and latency. Often, a performance profile is created: for instance, examining how throughput changes with the number of concurrent connections (it should rise up to a point and then not decay). A related profile is that of latency: generally, as the number of concurrent connections increases, so will latency (up to the point where connections may time out). Latency is often reported using percentiles and/or averages.

We use multiple types of workloads for our benchmarking here and record requests per second, payload throughput in bytes/s, the number of errors, and the mean latency. The number of errors should stay zero (or small). You can read each workload's description by running server_bench.py -l.

Some of the benchmarks are intended to make your server CPU-bound, some will make your server network-bound, and some will stress your server's behavior when there is a large number of connections. (We use 10k as large here, although a server can probably handle up to 30k, at which point the port space limitation becomes an issue.)

A limitation of wrk is that it cannot maintain a constant request rate (unlike, say, the clients of a slashdotted server); rather, it backs off (delays sending the next request on a given connection) if the server cannot keep pace. We ignore this restriction here, but please recognize that wrk testing says little about your server's behavior under realistic overloads.

To the best of my knowledge, there is currently no tool that can reliably produce constant (and high) request rates. httperf does not work on recent OSs because it does not support epoll(); autobench relies on httperf. (Gil Tene's derived wrk2 claims to support a constant request rate, though our initial testing was not successful. We may revisit this in the future.)

Can we use libevent/libuv, etc.?

For the purposes of this assignment, which is to gain familiarity with low-level socket interfaces, no. But you can implement your own epoll() loop if you want.

Can we use sendfile(2)?

Yes. In fact, the provided bufio code does.

Note that sendfile may cause a SIGPIPE to be sent if the client closes the connection mid-transfer. This may happen if the client uses a short timeout and your server does not respond within that timeout.

What exactly is bufio_offset2ptr/bufio_ptr2offset for?

The bufio package automatically grows its buffer as more data is received on a connection. To that end, it uses the realloc function, which may or may not move the data in the buffer to a different location. (It doesn't move the data if the memory allocator can extend the block in place because its right neighbor is currently unused.) If the data is moved, any pointers you hold into the buffer become invalid.

In particular, while reading the headers of an HTTP request, if you detect a Cookie header, you cannot store a char * pointer to this header's field value and expect to access it after the full request has been read.

To handle this scenario, store an offset instead of a pointer, and later convert the offset back into a pointer relative to the buffer's current address.
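
A sketch of this pattern (consult the provided bufio header for the exact signatures; the variable names here are illustrative):

// While parsing: remember where the Cookie value starts, as an offset.
size_t cookie_off = bufio_ptr2offset(bufio, cookie_ptr);
// ... subsequent reads may realloc (and thus move) the buffer ...
// Later: recover a valid pointer relative to the buffer's current address.
char *cookie = bufio_offset2ptr(bufio, cookie_off);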

What is bufio_truncate for?

On a long-running HTTP/1.1 connection, the buffer into which all data read from the client is placed may grow over time. To discard requests that have already been processed, call bufio_truncate.

How do I write a fast, multi-threaded, and event-based server?

You will need to combine a number of techniques. First, your server should be multi-threaded, but you will no longer dedicate threads to connections. Rather, each thread will multiplex many connections. To that end, each thread should maintain an epoll set of the file descriptors for which it is responsible. Each thread then executes an event loop: at the head of the loop, it calls epoll_wait(); in the body, it handles each ready file descriptor. Handling a file descriptor here means read()ing whatever data is available on it and processing that data. Since there is no guarantee how much data is available, your processing should be state-machine based so that it can remember how far along it is in parsing the HTTP request. Once a request is complete, the event-handling thread processes it.
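
A sketch of such a per-thread event loop (level-triggered; conn_for_fd() and process_available_data() are hypothetical helpers that look up a connection's state and advance its state machine; requires <sys/epoll.h>):

int epfd = epoll_create1(0);
// for each connection this thread is responsible for:
struct epoll_event ev = { .events = EPOLLIN, .data.fd = client_fd };
epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);

struct epoll_event events[64];
for (;;) {                                   // the event loop
    int n = epoll_wait(epfd, events, 64, -1);
    for (int i = 0; i < n; i++) {
        struct connection *conn = conn_for_fd(events[i].data.fd);
        process_available_data(conn);        // read() available data, advance state machine
    }
}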

You should probably use epoll() in level-triggered mode, which requires that you partition your file descriptors across worker threads so that exactly one worker thread handles all interactions on a given file descriptor. It is probably also beneficial to separate out your accepting file descriptors (and have a dedicated thread handle those) to avoid unnecessary latency.

The strategies described so far will probably allow you to build a server that achieves high throughput in our benchmarking, but a real-world event-based server would need to go further. For instance, it would need to make sure that event threads are never blocked, which requires special handling when a thread writes responses back to the client (such writes may block if the TCP pipeline is full). This is handled by placing the socket in non-blocking mode, attempting the write, and then using epoll to continue the write when it is safe to do so. Furthermore, if a thread needed to contact other tiers (a database, for instance), the progress of any such processing would also need to be modeled as a state machine. A real-world server would also require some form of load balancing to ensure that all threads handle roughly equal numbers of connections.