Building Secure and Reliable Network Applications

The Essential Tasks
Kenneth P. Birman

1996, Hardbound, 591 pages
ISBN 0137195842
Sample Chapter 4
© Manning Publications Co. All Rights Reserved.


Remote Procedure Calls and the Client/Server Model

4.1  The Client/Server Model
4.2  RPC Protocols and Concepts
4.3  Writing an RPC-Based Client or Server Program
4.4  The RPC Binding Problem
4.5  Marshaling and Data Types
4.6  Associated Services
4.7  The RPC Protocol
4.8  Using RPC in Reliable Distributed Systems
4.9  Related Reading


4.1     The Client/Server Model

The emergence of real distributed computing systems is often identified with the client/server paradigm and a protocol called remote procedure call (RPC), which is normally used in support of this paradigm. The basic idea of a client/server system architecture involves a partitioning of the software in an application into a set of services, which provide a set of operations to their users, and client programs, which implement applications and issue requests to services as needed to carry out the purposes of the application. In this model, the application processes do not cooperate directly with one another, but instead share data and coordinate actions by interacting with a common set of servers and by the order in which the application programs are executed.

There are a great number of client/server system structures in a typical distributed computing environment. Some examples of servers include the following:

  • File servers: These are programs (or, increasingly, combinations of special-purpose hardware and software) that manage disk storage units on which file systems reside. The operating system on a workstation that accesses a file server acts as the client, thus creating a two-level hierarchy: The application processes talk to their local operating system. The operating system on the client workstation functions as a single client of the file server, with which it communicates over the network.
  • Database servers: The client/server model operates in a similar way for database servers, except that it is rare for the operating system to function as an intermediary in the manner that it does for a file server. In a database application, there is usually a library of procedure calls with which the application accesses the database, and this library plays the role of the client in a client/server communication protocol to the database server.
  • Network name servers: Name servers implement some form of map from a symbolic name or service description to a corresponding value, such as an IP address and port number for a process capable of providing a desired service.
  • Network time servers: These are processes that control and adjust the clocks in a network, so that clocks on different machines give consistent time values (values with limited divergence from one another). The server for a clock is the local interface by which an application obtains the time. The clock service, in contrast, is the collection of clock servers and the protocols they use to maintain clock synchronization.
  • Network security servers: Most commonly, these consist of a type of directory in which public keys are stored, as well as a key generation service for creating new secure communication channels.
  • Network mail and bulletin board servers: These are programs for sending, receiving, and forwarding e-mail and messages to electronic bulletin boards. A typical client of such a server would be a program that sends an electronic mail message or that displays new messages to a user who is using a newsreader interface.
  • WWW servers: As we learned in the introduction, the World Wide Web is a large-scale distributed document management system developed at CERN in the early 1990s and subsequently commercialized. The Web stores hypertext documents, images, digital movies, and other information on Web servers, using standardized formats that can be displayed through various browsing programs. These systems present point-and-click interfaces to hypertext documents, retrieving documents using Web document locators from Web servers, and then displaying them in a type-specific manner. A Web server is thus a type of enhanced file server on which the Web access protocols are supported.
    In most distributed systems, services can be instantiated multiple times--for example, a distributed system can contain multiple file servers or multiple name servers. We normally use the term service to denote a set of servers. Thus, the network file system service consists of the network file servers for a system, and the network information service is a set of servers, provided on UNIX systems, that maps symbolic names to ASCII strings encoding values or addresses. An important question to ask about a distributed system concerns the binding of applications to servers.

    We say that a binding occurs when a process that needs to talk to a distributed service becomes associated with a specific server that will perform requests on its behalf. Various binding policies exist, differing in how the server is selected. For an NFS distributed file system, binding is a function of the file path name being accessed--in this file system protocol, the servers all handle different files, so that the path name maps to a particular server that owns that file. A program using the UNIX network information server normally starts by looking for a server on its own machine. If none is found, the program broadcasts a request and binds to the first NIS that responds, the idea being that this NIS representative is probably the least loaded and will give the best response times. (On the negative side, this approach can reduce reliability: Not only will a program now be dependent on availability of its file servers, but it may be dependent on an additional process on some other machine, namely the NIS server to which it became bound.) The CICS database system is well known for its explicit load-balancing policies, which bind a client program to a server in a way that attempts to give uniform responsiveness to all clients.

    Algorithms for binding, and for dynamically rebinding, represent an important topic to which we will return in Chapter 17, once we have the tools at our disposal to solve the problem in a concise way.

    A distributed service may or may not employ data replication, whereby a service maintains more than one copy of a single data item to permit local access at multiple locations or to increase availability during periods when some server processes may have crashed--for example, most network file services can support multiple file servers, but they do not replicate any single file onto multiple servers. In this approach, each file server handles a partition of the overall file system, and the partitions are disjoint from one another. A file can be replicated, but only by giving each replica a different name, placing each replica on an appropriate file server, and implementing hand-crafted protocols for keeping the replicas coordinated. Replication, then, is an important issue in designing complex or highly available distributed servers.

    Caching is a closely related issue. We say that a process has cached a data item if it maintains a copy of that data item locally, for quick access if the item is required again. Caching is widely used in file systems and name services, and permits these types of systems to benefit from locality of reference. A cache hit is said to occur when a request can be satisfied out of cache, avoiding the expenditure of resources needed to satisfy the request from the primary store or primary service. The Web uses document caching heavily, as a way to speed up access to frequently used documents.

    Caching is similar to replication, except that cached copies of a data item are in some ways second-class citizens. Generally, caching mechanisms recognize the possibility that the cache contents may be stale, and they include a policy for validating a cached data item before using it. Many caching schemes go further, and include explicit mechanisms by which the primary store or service can invalidate cached data items that are being updated, or refresh them explicitly. In situations where a cache is actively refreshed, caching may be identical to replication--a special term for a particular style of replication.

    However, "generally" does not imply that this is always the case. The Web, for example, has a cache validation mechanism but does not actually require that Web proxies validate cached documents before providing them to the client; the reasoning is presumably that even if the document were validated at the time of access, nothing prevents it from changing immediately afterwards and hence being stale by the time the client displays it. Thus, a periodic refreshing scheme in which cached documents are refreshed every half hour or so is in many ways equally reasonable. A caching policy is said to be coherent if it guarantees that cached data are indistinguishable to the user from the primary copy. The Web caching scheme is thus one that does not guarantee coherency of cached documents.


    4.2     RPC Protocols and Concepts

    The most common protocol for communication between the clients of a service and the service itself is remote procedure call. The basic idea of an RPC originated in work by Birrell and Nelson in the early 1980s (see Birrell and Nelson). Nelson worked in a group at Xerox PARC that was developing programming languages and environments to simplify distributed computing. At that time, software for supporting file transfer, remote login, electronic mail, and electronic bulletin boards had become common. PARC researchers, however, had ambitious ideas for developing other sorts of distributed computing applications, with the consequence that many researchers found themselves working with the lowest-level message-passing primitives in the PARC distributed operating system, which was called Cedar.

    Much like a more modern operating system, message communication in Cedar supported three communication models:

  • Unreliable datagram communication, in which messages could be lost with some (hopefully low) probability
  • Broadcast communication, also through an unreliable datagram interface
  • Stream communication, in which an initial connection was required, after which data could be transferred reliably
    Programmers found these interfaces hard to work with. Any time a program, p, needed to communicate with another program, s, it was necessary for p to determine the network address of s, encode its requests in a way that s would understand, send off the request, and await a reply. Programmers soon discovered that certain basic operations needed to be performed in almost any network application and that each developer was developing his or her own solutions to these standard problems. Some programs used broadcasts to find a service with which they needed to communicate; others stored the network address of services in files or hard-coded them into the application; and still others supported directory programs with which services could register themselves, supporting queries from other programs at run time. Not only was this situation confusing, it turned out to be difficult to maintain the early versions of PARC software: A small change to a service might break all sorts of applications that used it, so it became hard to introduce new versions of services and applications.

    Surveying this situation, Bruce Nelson started by asking what sorts of interaction programs were really needed in distributed settings. He concluded that the problem was really no different from a function or procedure call in a nondistributed program that uses a presupplied library; that is, most distributed computing applications would prefer to treat other programs with which they interact much as they treat presupplied libraries, with well-known, documented, procedural interfaces. Talking to another program would then be as simple as invoking one of its procedures--a remote procedure call (RPC).

    The idea of a remote procedure call is compelling. If distributed computing can be transparently mapped to a nondistributed computing model, all the technology of nondistributed programming could be brought to bear on the problem. In some sense, we would already know how to design and reason about distributed programs; how to prove them to be correct; how to test, maintain, and upgrade them; and all sorts of preexisting software tools and utilities would be readily applicable to the problem.

    Unfortunately, the details of supporting a remote procedure call turn out to be nontrivial, and some aspects result in visible differences between remote and local procedure invocations. Although this wasn't evident in the 1980s when RPC really took hold, the subsequent ten or fifteen years saw considerable theoretical activity in distributed computing, out of which ultimately emerged a deep understanding of how certain limitations on distributed computing are reflected in the semantics, or properties, of a remote procedure call. In some ways, this theoretical work finally led to a major breakthrough in the late 1980s and early 1990s, when researchers learned how to create distributed computing systems in which the semantics of RPC are precisely the same as for local procedure calls (LPC). In Part III of this book, we will study the results and necessary technology underlying such a solution, and we will see how to apply it to RPC. We will also see, however, that such approaches involve subtle tradeoffs between the semantics of the RPC and the performance that can be achieved; the faster solutions also weaken semantics in fundamental ways. Such considerations ultimately lead to the insight that RPC cannot be transparent, however much we might wish that this were not the case.

    Making matters worse, during the same period of time a huge engineering push behind RPC elevated it to the status of a standard--and this occurred before it was understood how RPC could be made to accurately mimic LPC. The result of this is that the standards for building RPC-based computing environments (and, to a large extent, the standards for object-based computing that followed RPC in the early 1990s) embody a nontransparent and unreliable RPC model, and this design decision is often fundamental to the architecture in ways that the developers who formulated these architectures probably did not appreciate. In the next chapter, when we discuss stream-based communication, we will see that the same sort of premature standardization affected the standard stream technology, which as a result also suffers from serious limitations that could have been avoided had the problem simply been better understood at the time the standards were developed.

    In the remainder of this chapter, we will focus on standard implementations of RPC. We will look at the basic steps by which an RPC is coded in a program, how that program is translated at compile time, and how it becomes bound to a service when it is executed. Then, we will study the encoding of data into messages and the protocols used for service invocation and for collecting replies. Finally, we will try to pin down a semantics for RPC: a set of statements that can be made about the guarantees of this protocol and that can be compared with the guarantees of LPC.

    We do not, however, give detailed examples of the major RPC programming environments: the Distributed Computing Environment (DCE) and Open Network Computing (ONC). These technologies, which emerged in the mid-1980s, represented proposals to standardize distributed computing by introducing architectures within which the major components of a distributed computing system would have well-specified interfaces and behaviors and within which application programs could interoperate using RPC by virtue of employing standard RPC interfaces. DCE, in particular, has become relatively standard, and it is available on many platforms today (see Open Software Foundation). However, in the mid-1990s, a new generation of RPC-oriented technology emerged through the Object Management Group (OMG), which set out to standardize object-oriented computing. In a short period of time, the CORBA (see Object Management Group and X/Open) technologies defined by OMG swept past the RPC technologies, and it now makes more sense to focus on CORBA, which we discuss in Chapter 6. CORBA has not so much changed the basic issues as broadened the subject of discourse by covering more kinds of system services than did previous RPC systems. Moreover, many CORBA systems are implemented as a layer over DCE or ONC. Thus, although RPC environments are important, they are more and more hidden from typical programmers and hence there is limited value in seeing examples of how one would program applications using them directly.

    Many industry analysts talk about CORBA implemented over DCE, meaning that they like the service definitions and object orientation of CORBA, and they feel it makes sense to assume that these were built using the service implementations standardized in DCE. In practice, however, CORBA makes as much sense on a DCE platform as on a non-DCE platform; hence, it would be an exaggeration to claim that CORBA on DCE is a de facto standard today, as one sometimes reads in the popular press.

    The use of RPC leads to interesting problems of reliability and fault handling. As we will see, it is not hard to make RPC work if most of the system is working well. When a system malfunctions, however, RPC can fail in ways that leave the user with no information at all about what has occurred and with no apparent strategy for recovering from the situation. There is nothing new about the situations we will be studying--indeed, for many years, it was simply assumed that RPC was subject to intrinsic limitations, and since there was no obvious way to improve on the situation, there was no reason that RPC shouldn't reflect these limitations in its semantic model. As we advance through the book, however, and it becomes clear that there are realistic alternatives that might be considered, this point of view becomes increasingly open to question.

    Indeed, it may now be time to develop a new set of standards for distributed computing. The existing standards are flawed, and the failure of the standards community to repair these flaws has erected an enormous barrier to the development of reliable distributed computing systems. In a technical sense, these flaws are not tremendously hard to overcome--although the solutions would require some reengineering of communication support for RPC in modern operating systems. In a practical sense, however, one wonders if it will take a Tacoma Narrows event to create real industry interest in taking such steps.

    One could build an RPC environment that would have few, if any, user-visible incompatibilities from a more fundamentally rigorous approach. The issue then is one of education--the communities that control the standards need to understand the issue better, and they need to understand the reasons that this particular issue represents such a huge barrier to progress in distributed computing. They also need to recognize that the opportunity vastly outweighs the reengineering costs that would be required to seize it. With this goal in mind, let's take a close look at RPC.


    4.3     Writing an RPC-Based Client or Server Program

    The programmer of an RPC-based application employs what is called a stub-generation tool. Such a tool is somewhat like a macro preprocessor: It transforms the user's original program into a modified version, which can be linked to an RPC run-time library.

    From the point of view of the programmer, the server or client program looks much like any other program. Normally, the program will import or export a set of interface definitions, covering the remote procedures that will be obtained from remote servers or offered to remote clients, respectively. A server program will also have a name and a version, which are used to connect the client to the server. Once coded, the program is compiled in two stages: First the stub generator is used to map the original program into a standard program with added code to carry out the RPC, and then the standard program is linked to the RPC run-time library for execution.

    RPC-based application or server programs are coded in a programming style very similar to a nondistributed program written in C for UNIX: There is no explicit use of message passing. However, there is an important aspect of RPC programming that differs from programming with local procedure calls: the separation of the service interface definition, or IDL,[1] from the code that implements it. In an RPC application, a service is considered to have two parts. The interface definition specifies the way that the service can be located (its name), the data types used in issuing requests to it, and the procedure calls that it supports. A version number is included to provide for evolution of the service over time, the idea being that if a client is developed to use version 1.1 of a service, there should be a way to check for compatibility if it turns out that version 1.0 or 2.3 is running when the client actually gets executed.

    The basic actions of the RPC library were described earlier. In the case of a server program, the library is responsible for registering the program with the RPC directory service program, which is normally provided as part of the RPC run-time environment. An RPC client program will automatically perform the tasks needed to query the directory to find this server and to connect to it, creating a client/server binding. For each of the server operations it invokes, code will be executed to marshal a representation of the invocation into a message--that is, information about the way that the procedure was called and values of the parameters that were passed. Code is included to send this message to the service and to collect a reply; on the server side, the stub generator creates code to read such a message, invoke the appropriate procedure with the arguments used by the remote caller, and marshal the results for transmission back to the caller. Issues such as user-ID handling, security and privacy, and handling of exceptions are often packaged as part of a solution. Finally, back on the caller side, the returning message will be demarshaled and the result made to look like the result of a local procedure.
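    To give a feel for what the generated marshaling code does, the sketch below packs a program number, interface version, procedure number, and a single integer argument into a message in network byte order. This wire layout is an illustrative assumption; real stubs also carry transaction identifiers, authentication data, and typed argument lists.

        #include <stdint.h>
        #include <string.h>
        #include <arpa/inet.h>   /* htonl: host-to-network byte order */

        /* Pack an invocation as four 32-bit fields in network byte order and
         * return the number of bytes written; buf must hold at least 16 bytes. */
        size_t marshal_request(uint8_t *buf, uint32_t prog, uint32_t vers,
                               uint32_t proc, uint32_t arg)
        {
            uint32_t fields[4] = { htonl(prog), htonl(vers), htonl(proc), htonl(arg) };
            memcpy(buf, fields, sizeof fields);
            return sizeof fields;
        }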

    Although much of this mechanism is automatic and hidden from the programmer, RPC programming differs from LPC programming in many ways. Most noticeable is that most RPC packages limit the types of arguments that can be passed to a remote server, and some also limit the size (in bytes) of the argument information--for example, suppose that a local procedure is written to search a list, and an LPC is performed to invoke this procedure, passing a pointer to the head of the list as its argument. One can ask whether this should work in an RPC environment--and, if so, how it can be supported. If a pointer to the head of the list is actually delivered to a remote program, that pointer will not make sense in the remote address space where the operation will execute. So, it would be natural to propose that the pointer be dereferenced, by copying the head of the list into the message. Remotely, a pointer to the copy can be provided to the procedure. Clearly, however, this will only work if one chases all the pointers in question--a problem because many programs that use pointers have some representation for an uninitialized pointer, and the RPC stub generator may not know about this.

    In building a balanced tree, it is common to allocate nodes dynamically as items are inserted. A node that has no descendants would still have left and right pointer fields, but these would be initialized to nil and the procedure to search nodes would check for the nil case before dereferencing these pointers. If an RPC marshaling procedure were to automatically make a copy of a structure to send to the remote server (see Figure 4.1), it would need to realize that for this particular structure, a pointer value of nil has a special meaning and should not be chased.

    FIGURE 4.1. Remote procedure call involves creating a message that can be sent to the remote server, which unpacks it, performs the operation, and sends back a message encoding the result.
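    A minimal sketch of such a structure traversal is shown below: it flattens a binary tree into a preorder byte stream, emitting a one-byte marker for nil rather than chasing the pointer. The encoding is an illustrative assumption, not the format of any particular RPC package.

        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>
        #include <arpa/inet.h>

        struct node {
            int32_t      value;
            struct node *left;
            struct node *right;
        };

        /* Flatten a tree in preorder.  Each node is encoded as a one-byte tag
         * (1 = node present, 0 = nil) followed, for present nodes, by its value in
         * network byte order.  Returns the number of bytes written; the caller
         * must supply a buffer large enough for the whole tree. */
        size_t marshal_tree(const struct node *n, uint8_t *buf)
        {
            if (n == NULL) {          /* nil has a special meaning: do not chase it */
                buf[0] = 0;
                return 1;
            }
            buf[0] = 1;
            uint32_t v = htonl((uint32_t)n->value);
            memcpy(buf + 1, &v, sizeof v);
            size_t used = 1 + sizeof v;
            used += marshal_tree(n->left,  buf + used);
            used += marshal_tree(n->right, buf + used);
            return used;
        }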

    The RPC programmer sees issues such as these as a set of restrictions. Depending on the RPC package used, different approaches may be used to attack them. In many packages, pointers are simply not legal as arguments to remote procedures. In others, the user can control a copying mechanism to some degree, and in still fancier systems, the user must provide general-purpose structure traversal procedures, which will be used by the RPC package to marshal arguments. Further complications can occur if a remote procedure can modify some of its arguments. Again, the degree to which this is supported, and the degree to which the programmer must get involved, varies from package to package.

    Perhaps ironically, RPC programmers tend to complain about this aspect of RPC no matter how it is handled. If a system is highly restrictive, the programmer finds that remote procedure invocation is annoying, because one is constantly forced to work around the limitations of the invocation package--for example, if an RPC package imposes a size limit on the arguments to a procedure, an application that works perfectly well in most situations may suddenly fail because some dynamically defined object has grown too large to be accepted as an RPC parameter. Suddenly, what was a single RPC becomes a multi-RPC protocol for passing the large object in chunks, and a perfectly satisfied programmer has developed distinct second thoughts about the transparency of RPC. At the other extreme are programming languages and RPC packages in which RPC is extremely transparent. These, however, often incur high overheads to copy information in and out, and the programmer is likely to be very aware of these because of their cost implications--for example, a loop that repeatedly invokes a procedure having one changing parameter as well as others (including a pointer to some large object) may be quite inexpensive to invoke in the local case. But if the large object will be copied to a remote program on every invocation, the same loop may cost a fortune when coded as part of a distributed client/server application, forcing the program to be redesigned to somehow pass the object to the remote server prior to the computational loop. These sorts of issues, then, make programming with RPC quite different from programming with LPC.

    RPC also introduces error cases that are not seen in LPC, and the programmer needs to deal with these. An LPC would never fail with a binding error, a version mismatch, or a timeout. In the case of RPC, all of these are possibilities--a binding error would occur if the server were not running when the client was started. A version mismatch might occur if a client were compiled against version 1 of a server, but the server has now been upgraded to version 2. A timeout could result from a server crash, a network problem, or even a problem on the client's computer. Many RPC applications would view these sorts of problems as unrecoverable errors, but fault-tolerant systems will often have alternative sources for critical services and will need to fail-over from a primary server to a backup. The code to do this is potentially complex, and in most RPC environments, it must be implemented by the application developer on a case-by-case basis.
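    The application-level fail-over logic this implies might be structured along the following lines. Everything here is hypothetical: do_rpc and the status codes stand in for whatever calls and error reports the RPC package in use actually provides.

        /* Hypothetical status codes and call interface, standing in for those of
         * a real RPC package. */
        enum rpc_status { RPC_OK, RPC_TIMEOUT, RPC_BIND_ERROR, RPC_VERSION_MISMATCH };

        enum rpc_status do_rpc(const char *server, const char *op,
                               const void *args, void *result);  /* supplied elsewhere */

        /* Try the primary server first, then each backup in turn.  Only timeouts
         * and binding errors trigger fail-over; a version mismatch is returned to
         * the caller, since retrying elsewhere is unlikely to help. */
        enum rpc_status rpc_with_failover(const char *servers[], int nservers,
                                          const char *op, const void *args,
                                          void *result)
        {
            enum rpc_status st = RPC_BIND_ERROR;
            for (int i = 0; i < nservers; i++) {
                st = do_rpc(servers[i], op, args, result);
                if (st == RPC_OK || st == RPC_VERSION_MISMATCH)
                    break;
            }
            return st;
        }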


    4.4     The RPC Binding Problem

    The binding problem occurs when an RPC client program needs to determine the network address of a server capable of providing some service it requires. Binding can be approached from many perspectives, but the issue is simplified if issues associated with the name service used are treated separately, as we do here.

    Disregarding its interactions with the name service, a binding service is primarily a protocol by which the RPC system verifies compatibility between the client and server and establishes any connections needed for communication.

    The compatibility problem is important in systems that will operate over long periods of time, during which maintenance and the development of new versions of system components will inevitably occur. Suppose that a client program, c, was developed and tested using server s, but that we now wish to install a new version of s, c, or both. Upgrades such as these create a substantial risk that some old copy of c will find itself talking to a new copy of s, or vice versa--for example, in a network of workstations it may be necessary to reload c onto the workstations one by one, and if some machines are down when the reload occurs, an old copy of c could remain on its disk. Unless c is upgraded as soon as the machine is rebooted--and this may or may not occur, depending on how the system is administered--one would find an old c talking to an upgraded s. It is easy to identify other situations in which problems such as this could occur.

    It would be desirable to be able to assume that all possible versions of s and c could somehow communicate with all other versions, but this is not often the case. Indeed, it is not necessarily even desirable. Accordingly, most RPC environments support a concept of version number, which is associated with the server IDL. When a client program is compiled, the server IDL version is noted in software. This permits the inclusion of the client's version of the server interface directly in the call to the server. When the match is not exact, the server could reject the request as being incompatible, perform some operation to map the old-format request to a new-format request, or even preserve multiple copies of its functionality, running the version matched to the caller.
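    As an illustration, one common convention (an assumption here, not a rule imposed by any particular RPC standard) splits the version into a major and a minor number: the interface may gain procedures across minor revisions but never change existing ones, so a server can accept any caller whose major number matches and whose minor number it covers.

        #include <stdio.h>

        struct idl_version { int major, minor; };

        /* Compatibility rule described above: same major number, and the server's
         * minor number is at least as recent as the client's. */
        int version_compatible(struct idl_version server, struct idl_version client)
        {
            return server.major == client.major && server.minor >= client.minor;
        }

        int main(void)
        {
            struct idl_version server     = { 1, 1 };
            struct idl_version old_client = { 1, 0 };
            struct idl_version new_client = { 2, 3 };
            printf("1.0 client vs 1.1 server: %s\n",
                   version_compatible(server, old_client) ? "accepted" : "rejected");
            printf("2.3 client vs 1.1 server: %s\n",
                   version_compatible(server, new_client) ? "accepted" : "rejected");
            return 0;
        }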

    Connection establishment is a relatively mechanical stage of binding. Depending on the type of client/server communication protocol that will be used, messages may be transmitted using unreliable datagrams or over reliable communication streams such as X.25 or TCP. Unreliable datagram connections normally do not require any initial setup, but stream connections typically involve some form of open or initialization operation. Having identified the server to which a request will be issued, the binding mechanism would normally perform this open operation.
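    For a stream transport such as TCP, the open operation is essentially the standard sockets connect sequence, sketched below as a generic POSIX fragment rather than the code of any particular binder.

        #include <string.h>
        #include <unistd.h>
        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <arpa/inet.h>

        /* Open a TCP connection to the server address chosen during binding.
         * Returns a connected socket descriptor, or -1 on failure. */
        int open_connection(const char *ip, unsigned short port)
        {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            if (fd < 0)
                return -1;

            struct sockaddr_in addr;
            memset(&addr, 0, sizeof addr);
            addr.sin_family = AF_INET;
            addr.sin_port   = htons(port);
            if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1 ||
                connect(fd, (struct sockaddr *)&addr, sizeof addr) != 0) {
                close(fd);
                return -1;    /* connection error: report failure to the binder */
            }
            return fd;
        }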

    The binding mechanism is sometimes used to solve two additional problems. The first of these is called the factory problem and involves starting a server when a service has no currently operational server. In this approach, the first phase of binding looks up the address of the server and learns that the server is not currently operational (or, in the connection phase, a connection error is detected and from this the binder deduces that the server has failed). The binder then issues a request to a factory in which the system designer has stored instructions for starting up a server when needed. After a suitable pause, the binder cycles back through its first phase, which presumably succeeds.

    The second problem occurs in the converse situation, when the binder discovers multiple servers that could potentially handle this client. The best policy to use in such situations depends very much on the application. For some systems, a binder should always pick a server on the same machine as the client, if possible, and should otherwise pick randomly. Other systems require some form of load-balancing, while still others may implement an affinity policy under which a certain server might be especially well suited to handling a particular client for reasons such as the data it has cached in memory or the type of requests the client is expected to issue once binding has been completed.

    Binding is a relatively expensive operation--for example, in the DCE RPC environment, binding can be more than ten times as costly as RPC. However, since binding only occurs once for each client/server pair, this high cost is not viewed as a major problem in typical distributed computing systems.


    4.5     Marshaling and Data Types

    The purpose of a data marshaling mechanism is to represent the caller's arguments in a way that can be efficiently interpreted by a server program. In the most general cases, this mechanism deals with the possibility that the computer on which the client is running uses a data representation different from the computer on which the server is running.

    Marshaling has been treated at varying levels of generality, and there exists a standard, ASN.1, for self-describing data objects in which a specific representation is recommended. In addition to ASN.1, several major vendors have adopted data representations of their own, such as Sun Microsystems' External Data Representation (XDR) format, which is used in the widely popular Network File System (NFS) protocol.

    The basic issues that occur in a data marshaling mechanism, then, are these. First, integer representations vary for the most common CPU chips. On some chips the most significant byte of an integer is stored at the lowest memory address (first), while on others the most significant byte is stored at the highest address (last). These are called big-endian and little-endian representations, respectively. At one point in the 1980s, computers with other representations--other byte permutations--were on the market, but at the time of this writing I am not aware of any other surviving formats.
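    The difference is easy to see in a few lines of C. The fragment below prints the in-memory layout of a 32-bit integer on whatever machine it runs on, and shows the conventional fix of converting to a single agreed byte order ("network byte order", which is big-endian) before transmission.

        #include <stdio.h>
        #include <stdint.h>
        #include <arpa/inet.h>   /* htonl: convert to network (big-endian) byte order */

        int main(void)
        {
            uint32_t x = 0x01020304;
            uint8_t *p = (uint8_t *)&x;

            /* On a big-endian machine the most significant byte (0x01) comes first
             * in memory; on a little-endian machine the least significant (0x04) does. */
            printf("in-memory layout: %02x %02x %02x %02x (%s-endian)\n",
                   p[0], p[1], p[2], p[3], p[0] == 0x01 ? "big" : "little");

            uint32_t wire = htonl(x);
            p = (uint8_t *)&wire;
            printf("after htonl:      %02x %02x %02x %02x\n", p[0], p[1], p[2], p[3]);
            return 0;
        }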

    A second representation issue concerns data alignment. Some computers require that data be aligned on 32-bit or even 64-bit boundaries, while others may have weaker alignment rules--for example, by supporting data alignment on 16-bit boundaries. Unfortunately, such issues are extremely common. Compilers know about these rules, but the programmer is typically unaware of them. However, when a message arrives from a remote machine that may be using some other alignment rule, the issue becomes an important one. An attempt to fetch data directly from a message without attention to this issue could result in some form of machine fault, or it could result in retrieval of garbage. Thus, the data representation used in messages must encode sufficient information to permit the destination computer to find the start of each object in the message, or the sender and destination must agree in advance on a packed representation that will be used for messages on the wire even if the sender and destination themselves share the same rules and differ from the standard. Needless to say, this is a topic capable of generating endless and fascinating debate among computer vendors whose machines use different alignment or data representations.
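    A common defensive technique, sketched below, is never to cast a pointer into the middle of a message buffer at all, but instead to copy each field out of the packed wire representation at an agreed offset; memcpy places no alignment requirement on its source.

        #include <stdint.h>
        #include <string.h>
        #include <arpa/inet.h>

        /* Extract a 32-bit field that begins at an arbitrary (possibly unaligned)
         * offset in a received message.  Dereferencing (uint32_t *)(buf + offset)
         * could fault on a machine with strict alignment rules; memcpy is safe on
         * any machine. */
        uint32_t get_u32(const uint8_t *buf, size_t offset)
        {
            uint32_t tmp;
            memcpy(&tmp, buf + offset, sizeof tmp);
            return ntohl(tmp);      /* also undo the agreed wire byte order */
        }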

    A third issue arises from the existence of multiple floating-point representations. Although there is an IEEE standard floating-point representation, which has become widely accepted, some computer vendors use nonstandard representations for which conversion would be required, and even within computers using the standard, byte-ordering issues can still occur.

    A fourth issue concerns pointers. When transmitting a complex structure in which there are pointers, the marshaling mechanism needs to either signal that the user has requested something illegal, or somehow represent these pointers in a way that will permit the receiving computer to fix them upon reception of the request. This is especially tricky in languages such as LISP, which requires pointers and hence cannot easily legislate against them in RPC situations. On the other hand, passing pointers raises additional problems: Should the pointed-to object be included in the message, transferred only upon use (a "lazy" scheme), or handled in some other way?

    Finally, a marshaling mechanism may need to deal with incompatibilities in the basic data types available on computers (see Figure 4.2)--for example, a pair of computers supporting 64-bit integers in hardware may need to exchange messages containing 64-bit integer data. The marshaling scheme should therefore be able to represent such integers. On the other hand, when this type of message is sent to a computer that uses 32-bit integers, the need arises to truncate the 64-bit quantities so that they will fit in the space available, with an exception being generated if data would be lost by such a truncation. Yet, if the message is merely being passed through some sort of intermediary, one would prefer that data not be truncated, since precision would be lost. In the reverse direction, sign extension or padding may need to be performed to convert a 32-bit quantity into an equivalent 64-bit quantity, but only if the data sent are a signed integer. Thus, a completely general RPC package needs to put a considerable amount of information into each packet, and it may need to do quite a bit of work to represent data in a universal manner. Such an approach may be much more costly than one that supports only a very limited set of possible representations or that compiles the data marshaling and demarshaling operations directly into in-line code.

    FIGURE 4.2. The same number (here, a 32-bit integer) may be represented very differently on different computer architectures. One role of the marshaling and demarshaling process is to modify data representations (here, by permuting the bytes) so that values can be interpreted correctly upon reception.
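    The signed truncation test described above is short but easy to get wrong; one way to write it is sketched here.

        #include <stdint.h>

        /* Narrow a 64-bit signed integer received in a message to the local 32-bit
         * representation.  Returns 0 and stores the value if it fits; returns -1 if
         * information would be lost, the case in which an exception should be
         * raised rather than silently truncating. */
        int narrow_i64_to_i32(int64_t wide, int32_t *out)
        {
            if (wide < INT32_MIN || wide > INT32_MAX)
                return -1;
            *out = (int32_t)wide;   /* sign is preserved for in-range values */
            return 0;
        }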

    The approach taken to marshaling varies from RPC package to package. Sun's XDR system is extremely general, but requires the user to code marshaling procedures for data types other than the standard base types of the system. With XDR, one can represent any desired data structure, even dealing with pointers and complex padding rules. At the other end of the spectrum are marshaling procedures that transmit data in the binary format used by the sender, are limited to only simple data types, and perhaps do little more than compatibility checking on the receiving side. Finally, schemes such as ASN.1 are often used with RPC stub generators, which automatically marshal and demarshal data, but impose some restrictions on the types of objects that can be transmitted.

    As a general rule of thumb, users will want to be aware that the more general solutions to these problems are also more costly. If the goal is extreme speed, it may make sense to design the application itself to produce data in a form that is inexpensive to marshal and demarshal. The cost implications of failing to do so can be surprising, and, in many cases, it is not difficult to redesign an interface so that RPCs to it will be inexpensive.


    4.6     Associated Services

    No RPC system lives in isolation. As we will see, RPC is often integrated with a security mechanism and security keys, using timestamps with a clock synchronization mechanism. For this reason, one often talks about distributed computing environments that include tools for implementing client/server applications such as RPC mechanisms, security services, and time services. Elaborate environments may go well beyond this, including system instrumentation, management interfaces and tools, fault-tolerant tools, and so-called Fourth-Generation Language (4GL) tools for building applications using graphical user interfaces (GUIs). Such approaches can empower even unskilled users to develop sophisticated distributed solutions. In this section we briefly review the most important of these services.

    4.6.1   Naming Services

    A naming service maintains one or more mappings from some form of name (normally symbolic) to some form of value (normally a network address). Naming services can operate in a very narrow, focused way--for example, the Domain Name Service of the TCP/IP protocol suite maps short service names, in ASCII, to IP addresses and port numbers, requiring exact matches. At the other extreme, one can talk about general naming services, which are used for many sorts of data, allow complex pattern matching on the name, and may return other types of data in addition to, or instead of, an address. One can even go beyond this to talk about secure naming services, which can be trusted to give out only validated addresses for services, and dynamic naming services, which deal with applications such as mobile computing systems in which hosts have addresses that change constantly.

    In standard computer systems at the time of this writing, three naming services are widely supported and used. As previously mentioned, the Domain Name Service (DNS) is the least functional but most widely used. It responds to requests on a standard network port address, and for the domain in which it is running can map short ASCII names to Internet addresses and port numbers. DNS is normally used for static services, which are always running when the system is operational and do not change port numbers at all--for example, the e-mail protocol uses DNS to find the remote mail daemon capable of accepting incoming e-mail to a user on a remote system.
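    The client side of such a lookup is typically a single library call. The fragment below is a generic UNIX sketch: gethostbyname asks the resolver, which consults DNS, for a host's addresses, and getservbyname finds a well-known port in the local services database. The host name example.org is just a placeholder.

        #include <stdio.h>
        #include <string.h>
        #include <netdb.h>
        #include <arpa/inet.h>
        #include <netinet/in.h>

        int main(void)
        {
            /* Ask the resolver (which consults DNS) for the host's addresses. */
            struct hostent *h = gethostbyname("example.org");
            if (h == NULL)
                return 1;

            /* A host may be known by several addresses; print each of them. */
            for (char **a = h->h_addr_list; *a != NULL; a++) {
                struct in_addr addr;
                memcpy(&addr, *a, sizeof addr);
                printf("%s -> %s\n", h->h_name, inet_ntoa(addr));
            }

            /* The port for a well-known service comes from the services database. */
            struct servent *s = getservbyname("smtp", "tcp");
            if (s != NULL)
                printf("smtp is port %d\n", ntohs(s->s_port));
            return 0;
        }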

    The Network Information Service (NIS), previously called Yellow Pages (YP), is considerably more elaborate. NIS maintains a collection of maps, each of which has a symbolic name (e.g., hosts, services) and maps ASCII keywords to an ASCII value string. NIS is used on UNIX systems to map host names to Internet addresses, service names to port numbers, and so forth. Although NIS does not support pattern matching, there are ways for an application to fetch the entire NIS database, one line at a time, and it is common to include multiple entries in an NIS database for a single host that is known by a set of aliases. NIS is a distributed service that supports replication: The same data are normally available from any of a set of servers, and a protocol is used to update the full set of servers if an entry changes. However, NIS is not designed to support rapid updates: The assumption is that NIS data consist of mappings, such as the map from host name to Internet address, which change very rarely. A 12-hour delay before NIS information is updated is not unreasonable given this model--hence, the update problem is solved by periodically refreshing the state of each NIS server by having it read the contents of a set of files in which the mapping data are actually stored. As an example, NIS is often used to store password information on UNIX systems.

    X.500 is an international standard that many expect will eventually replace NIS. This service, which is designed for use by applications running the ISO standard remote procedure call interface and ASN.1 data encoding, operates much like an NIS server. No provision has been made in the standard for replication or high-performance update, but the interface does support some limited degree of pattern matching. As might be expected from a standard of this sort, X.500 addresses a wide variety of issues, including security and recommended interfaces. However, reliability issues associated with availability and consistency of the X.500 service (i.e., when data are replicated) have not yet been tackled by the standards organization.

    There is considerable interest in using X.500 to implement general-purpose White Pages (WP) servers, which would be explicitly developed to support sophisticated pattern matching on very elaborate databases with detailed information about abstract entities. Rapid update rates, fault-tolerant features, and security are all being considered in these proposals. At the time of this writing, it appears that the Web will require such services and that work on universal resource naming for use in the Web will be a major driving force for evolution in this overall area.

    4.6.2   Time Services

    With the launch of the so-called Global Positioning System (GPS) satellites, microsecond accuracy became possible in workstations equipped with inexpensive radio receivers. Unfortunately, however, accurate clocks remain a major problem in the most widely used computer workstations and network technologies. We will discuss this in more detail in Chapter 20, but some background may still be useful here.

    At the time of this writing, the usual clock for a PC or workstation consists of a quartz-based chip much like the one in a common wristwatch, accurate to within a few seconds per year. The initial value of such a clock is either set by the vendor or by the user, when the computer is booted. As a result, in any network of workstations, clocks can give widely divergent readings and can drift with respect to one another at significant rates. For these reasons, there has been considerable study of algorithms for clock synchronization, whereby the clocks on individual machines can be adjusted to give behavior approximating that of a shared global clock. In Chapter 20, we will discuss some of the algorithms that have been proposed for this purpose, their ability to tolerate failures, and the analyses used to arrive at theoretical limits on clock accuracy.
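    As a taste of what such algorithms involve (the algorithms themselves are deferred to Chapter 20), one classic building block estimates the offset between a local clock and a remote server's clock from a single round trip, on the assumption that the request and the reply spend roughly equal time in the network.

        /* t0: local clock when the request was sent
         * ts: remote server's clock when it generated its reply
         * t1: local clock when the reply arrived
         * If the two network legs take about the same time, the server read its
         * clock near the midpoint of the round trip, so the estimated offset
         * (server time minus local time) is: */
        double estimate_offset(double t0, double ts, double t1)
        {
            return ts - (t0 + t1) / 2.0;
        }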

    However, much of this work has a limited lifetime. GPS receivers can give extremely accurate time, and GPS signals are transmitted frequently enough so that even inexpensive hardware can potentially maintain time accurate to microseconds. By broadcasting GPS time values, this information can be propagated within a network of computers, and although some accuracy is necessarily lost when doing so, the resulting clocks are still accurate and comparable to within tens of microseconds. This development can be expected to have a major impact on the way that distributed software is designed--from a world of asynchronous communication and clocks that can be inaccurate by many times the average message latency in the network, GPS-based time could catapult us into a domain in which clock resolutions considerably exceed the average latency between sending a message and when it is received. Such developments make it very reasonable to talk about synchronous (time-based) styles of software design and the use of time in algorithms of all sorts.

    Even coarsely synchronized clocks can be of value in distributed software--for example, when comparing versions of files, microsecond accuracy is not needed to decide if one version is more current than another: Accuracy of seconds or even tens of seconds may be adequate. Security systems often have a concept of expiration associated with keys, but for these to be at risk of attacks an intruder would need a way to set a clock back by days, not fractions of a second. And, although we will see that RPC protocols use time to detect and ignore very old, stale messages, as in the case of a security mechanism a clock would need to be extremely inaccurate for such a system to malfunction.

    4.6.3   Security Services

    In the context of an RPC environment, security is usually concerned with the authentication problem. Briefly stated, this is the problem of providing applications with accurate information about the user-ID on behalf of which a request is being performed. Obviously, one would hope that the user-ID is related in some way to the user, although this is frequently the weak link in a security architecture. Given an accurate source of user identifications, however, the basic idea is to avoid intrusions that can compromise user-ID security through break-ins on individual computers and even replacements of system components on some machines with versions that have been compromised and hence could malfunction. As in the case of clock services, we will look more closely at security later in the book (Chapter 19) and hence limit ourselves to a brief review here.

    To accomplish authentication, a typical security mechanism (e.g., the Kerberos security architecture for DCE [see Schiller, Steiner et al.]) will request some form of password or one-time key from the user at log-in time, and periodically thereafter, as keys expire on the basis of elapsed time. This information is used to compute a form of secure user identification, which can be employed during connection establishment. When a client binds to a server, the security mechanism authenticates both ends and (at the option of the programmer) arranges for data to be encrypted on the wire, so that intruders who witness messages being exchanged between the client and server have no way to decode the data contained within them. (Unfortunately, however, this step is so costly that many applications disable encryption and simply rely upon the security available from the initial connection setup.) Notice that for such a system to work correctly, there must be a way to trust the authentication server itself: The user needs a way to confirm that it is actually talking to the authentication server and to legitimate representatives of the services it wishes to use. Given the anonymity of network communication, these are potentially difficult problems.

    In Chapter 19, we will look closely at distributed security issues (e.g., we will discuss Kerberos in much more detail) and also at the relationship between security and other aspects of reliability and availability--problems that are often viewed as mutually exclusive, since one replicates information to make it more available, and the other tends to restrict and protect the information to make it more secure. We will also look at emerging techniques for protecting privacy, namely the true user-IDs of programs active in a network. Although the state of the art does not yet support construction of high-performance, secure, private applications, this should be technically feasible within the not-too-distant future. Of course, technical feasibility does not imply that the technology will become widely practical and therefore useful in building reliable applications, but at least the steps needed to solve the problems are increasingly understood.

    Threads: A Personal Perspective

    Rather than choosing between threads and event dispatch, an approach that supports threads as an option over event dispatch offers more flexibility to the developer. Speaking from personal experience, I have mixed feelings on the issue of threads (versus event dispatch). Early in my career I worked with protocols implemented directly over a UDP datagram model. This turned out to be very difficult: Such a system needs to keep track of protocol state in some form of table, matching replies with requests, and is consequently hard to program--for example, suppose that a distributed file server is designed to be single-threaded. Such a file server may handle many applications at the same time, so it will need to send off one request, perhaps to read a file, but remain available for other requests, perhaps by some other application that wants to write a file. The information needed to keep track of the first request (the read that is pending) will have to be recorded in some sort of pending activities table and later matched with the incoming reply from the remote file system. Having implemented such an architecture once, I would not want to do it again.

    This motivated me to move to RPC-style protocols, using threads. We will be talking about the Isis Toolkit, which is a system that I implemented (with help from others!) in the mid-1980s, in which lightweight threads were employed extensively. Many Isis users commented to me that they had never used threads before working with Isis, and they were surprised at how much the approach simplified things. This is certainly the case: In a threaded system, the procedure handling the read would simply block waiting for the reply, while other procedures could be executed to handle other requests. The necessary bookkeeping is implicit: The blocked procedure has a local state consisting of its calling stack, local variables, and so forth. Thus, there is no need to constantly update a table of pending activities.

    Of course, threads are also a potential source of insidious programming bugs. In Isis, the benefits of threads certainly outweighed the problems associated with them, but it is also clear that this model requires a degree of programming sophistication that goes somewhat beyond standard single-threaded programming. It took me at least a year to get in the habit of thinking through the potential reentrance and ordering issues associated with concurrency and to become comfortable with the various styles of locking needed to overcome these problems. Many users report the same experience. Isis, however, is perhaps an unusually challenging case because the order in which events happen is very important in this system, for reasons that we will study in Part III.

    In more recent work, I have teamed up with Robbert van Renesse, who is the primary author of the Horus system (we discuss this in considerable detail in Chapter 18). Horus, like Isis, was initially designed to use threads and is extremely sensitive to event ordering. But when testing very demanding applications, van Renesse found that threads were a serious source of overhead and code bloat: overhead because a stack for a thread consumes 16 KB or more of space, which is a lot of space in a system that can handle tens of thousands of messages per second, and excess code because of the necessary synchronization. Yet, as in the case of Isis, Horus sometimes needs threads: They often make it easy to do things that would be very hard to express in a nonthreaded manner.

    Van Renesse eventually extended Horus to use an event dispatch model similar to the one in Windows NT, which offers threads as an option over a basic event dispatch mechanism. This step, which substantially simplified many parts of Horus, left me convinced that supporting threads over an event dispatch architecture is the right way to go. For cases in which a thread is needed, it is absolutely vital that it be available. However, threads bring a considerable amount of baggage, which may be unnecessary in many settings. An event dispatch style of system gives the developer freedom to make this choice and has a lightweight and fast default behavior. I am, however, still convinced that event dispatch systems that lack the option of forking a thread when one is needed are often unwieldy and very difficult to use; this approach should be avoided.

    4.6.4   Threads Packages

    A fourth component of a typical RPC system is the lightweight threads package, which enables a single program to handle multiple tasks at the same time. Although threads are a general concept and indeed have rather little to do with communication per se, they are often viewed as necessary in distributed computing systems because of the potential for deadlock if threads are not present.

    To understand this point, it is helpful to contrast three ways of designing a communication system. A single-threaded, message-based approach would correspond to a conventional style of programming extended directly to message passing. The programmer would use system calls such as sendto and recvfrom as needed to send and receive messages. If there are several things happening at the same time in a program structured this way, however, the associated bookkeeping can be a headache (see Break-out 4.1).

    Threads offer a simple way to eliminate this problem: Each thread executes concurrently with the others, and each incoming request spawns a new thread to handle it. While an RPC is pending, the thread that issues it blocks (waits) in the procedure call that invoked the RPC. To the degree that there is any bookkeeping to worry about, the associated state is represented directly in the local variables of this procedure and in the call itself: When the reply is received, the procedure returns (the thread resumes execution), and there is no need to track down information about why the call was being done--this is obvious to the calling procedure. Of course, the developer does need to implement adequate synchronization to avoid concurrency-related bugs, but in general this is not a difficult thing to do. The approach overcomes many forms of problems that are otherwise difficult to address.
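    In outline, this structure looks like the pthreads sketch below; the request type, the handler, and the use of sleep to stand in for a blocking nested RPC are all illustrative.

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        struct request { int client_id; };

        /* Each incoming request gets its own thread.  While this thread blocks
         * (here, in sleep, standing in for a nested RPC), other threads continue
         * to serve other clients; the pending state lives on this thread's stack. */
        void *handle_request(void *arg)
        {
            struct request *req = arg;
            printf("client %d: issuing nested call...\n", req->client_id);
            sleep(1);
            printf("client %d: reply received, returning result\n", req->client_id);
            free(req);
            return NULL;
        }

        int main(void)
        {
            pthread_t tid[3];
            for (int i = 0; i < 3; i++) {        /* pretend three requests arrive */
                struct request *req = malloc(sizeof *req);
                req->client_id = i;
                pthread_create(&tid[i], NULL, handle_request, req);
            }
            for (int i = 0; i < 3; i++)
                pthread_join(tid[i], NULL);
            return 0;
        }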

    Consider a situation in which an RPC server is also the client of some other server, which is in turn the client of still additional servers. It is entirely possible that a cycle could form, in which an RPC a by process x on process y leads to an RPC b by y on z, and so forth, until finally some process in the chain makes a request back to the original process, x. If these calls were LPC calls, such a sequence would simply be a form of recursion. For a single-threaded RPC system, however, x will be busy performing RPC a and hence would be unresponsive, creating a deadlock. Alternatively, x would need to somehow save the information associated with sending RPC a while it is handling this new incoming request. This is the bookkeeping problem alluded to above.

    Yet a third option is known as event dispatch and is typical of windowing systems, in which each action by the user (mouse motion or clicks, keyboard entries) results in delivery of an event record to a central dispatching loop. The application program typically registers a set of procedure callbacks to perform when events of interest are received: If the left mouse button is pressed, invoke left_button(). Arguments to these callbacks tell the program exactly what occurred: The cursor was at position (132, 541) when the mouse button was pressed, this is inside such and such a window, and so forth. One can use the same approach to handle event dispatch in message-based systems: Incoming messages are treated as events and result in callbacks to handler procedures.

    The approaches can also be combined: Event dispatch systems can, for example, fork a new thread for each incoming message. In the most general approach, the callback is registered with some indication of how it should be performed: by forking a thread, by direct procedure call, or perhaps even by some other method, such as enqueuing the event on an event queue. This last approach is used in the Horus system, which we will discuss in Chapter 18.
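
    The sketch below illustrates this combined style: each handler is registered together with an indication of how it should be run, either by direct procedure call or by forking a thread (the event-queue option used in Horus is omitted for brevity). The event codes, handler names, and dispatch table are invented for the example and do not correspond to any particular system.

        /* Event dispatch with a per-handler execution mode: a handler that may
         * block is run in a forked thread, a short one by direct call. Event
         * codes and handler names are invented for the example. */
        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        enum run_mode { DIRECT_CALL, FORK_THREAD };

        struct handler {
            void (*fn)(int event);
            enum run_mode mode;
        };

        static void on_message(int event) { printf("message event %d\n", event); }
        static void on_timer(int event)   { printf("timer event %d\n", event); }

        static struct handler table[] = {
            { on_message, FORK_THREAD },     /* may block, so give it a thread */
            { on_timer,   DIRECT_CALL },     /* short and non-blocking */
        };

        struct thread_arg { void (*fn)(int); int event; };

        static void *run_in_thread(void *p)
        {
            struct thread_arg *a = p;
            a->fn(a->event);
            free(a);
            return NULL;
        }

        static void dispatch(int code, int event)
        {
            struct handler *h = &table[code];

            if (h->mode == DIRECT_CALL) {
                h->fn(event);
            } else {
                struct thread_arg *a = malloc(sizeof(*a));
                a->fn = h->fn;
                a->event = event;
                pthread_t t;
                pthread_create(&t, NULL, run_in_thread, a);
                pthread_detach(t);
            }
        }

        int main(void)
        {
            dispatch(0, 42);                 /* on_message runs in a forked thread */
            dispatch(1, 7);                  /* on_timer runs by direct call */
            pthread_exit(NULL);              /* let the forked handler finish */
        }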

    At the time of this writing, although this is not universally the case, many RPC systems are built directly over a lightweight threads package. Each incoming RPC is handled by a new thread, eliminating the risk of deadlock, but forcing the programmer to learn about lightweight threads, preemption, mutual exclusion mechanisms, and other issues associated with concurrency. In this book, we will present some protocols in which processes are assumed to be multithreaded, so that the initiator of a protocol can also be a participant in it. However, we will not explicitly discuss thread packages or make use of any special features of particular packages.

    The use of threads in this manner remains debatable. UNIX programs have heavily favored this approach, and the UNIX community generally understands the issues that must be addressed and minimizes their difficulty. Indeed, with experience, threaded programming is not all that difficult. One merely needs to get in the habit of enforcing necessary synchronization using appropriate interlocks. However, the PC community tends to work with an event-based model that lacks threads, in which the application is visualized as a dispatcher for incoming events and all callbacks are by procedure invocation. Thus, the PC community has its own style of programming, and it is largely nonthreaded. Windows NT further complicates this picture: It supports threads, and yet uses an event-oriented style of dispatching throughout the operating system; if a user wants to create a thread to handle an event, this is easily done but not forced upon the programmer.


    4.7     The RPC Protocol

    The discussion up to this point has focused on client/server computing and RPC from the perspective of the user. A remote procedure call protocol is concerned with the actual mechanism by which the client process issues a request to a server and by which the reply is transmitted back from the server to the client. We now look at this protocol in more detail.

    Abstractly, the remote procedure call problem, which an RPC protocol undertakes to solve, consists of emulating LPC using message passing. LPC has a number of properties--a single procedure invocation results in exactly one execution of the procedure body, the result returned is reliably delivered to the invoker, and exceptions are raised if (and only if) an error occurs.

    Given a completely reliable communication environment, which never loses, duplicates, or reorders messages, and given client and server processes that never fail, RPC would be trivial to solve. The sender would merely package the invocation into one or more messages and transmit these to the server. The server would unpack the data into local variables, perform the desired operation, and send back the result (or an indication of any exception that occurred) in a reply message. The challenge, then, is created by failures.

    Were it not for the possibility of process and machine crashes, an RPC protocol capable of overcoming limited levels of message loss, disorder, and even duplication would be easy to develop (Figure 4.3). For each process to which it issues requests, a client process maintains a message sequence number. Each message transmitted carries a unique sequence number, and (in most RPC protocols) a timestamp from a global clock--one that returns roughly the same value throughout the network, up to clock synchronization limits. This information can be used by the server to detect very old or duplicate copies of messages, which are discarded, and to acknowledge received messages using what are called acknowledgment protocol messages.
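
    The following fragment sketches the kind of acceptance test a server might apply, using a per-client sequence number record and a bound on message age. The 300-second bound and the data structure are illustrative only; real RPC packages differ in the details.

        /* Server-side acceptance test: reject messages that are very old or
         * whose sequence numbers have already been seen. The 300-second bound
         * and the per-client record are illustrative only. */
        #include <stdbool.h>
        #include <stdio.h>
        #include <time.h>

        #define MAX_AGE_SECONDS 300

        struct client_state {
            unsigned long last_seqno;        /* highest sequence number processed */
        };

        /* Returns true if the message should be executed and acknowledged,
         * false if it is stale or a duplicate and should be discarded. */
        static bool accept_message(struct client_state *c,
                                   unsigned long seqno, time_t sent_at)
        {
            time_t now = time(NULL);

            if (now - sent_at > MAX_AGE_SECONDS)
                return false;                /* very old message: discard */
            if (seqno <= c->last_seqno)
                return false;                /* duplicate: discard */

            c->last_seqno = seqno;           /* accept and acknowledge */
            return true;
        }

        int main(void)
        {
            struct client_state c = { 0 };
            time_t now = time(NULL);

            printf("%d\n", accept_message(&c, 1, now));        /* 1: fresh message */
            printf("%d\n", accept_message(&c, 1, now));        /* 0: duplicate     */
            printf("%d\n", accept_message(&c, 2, now - 600));  /* 0: too old       */
            return 0;
        }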

    The basic idea, then, is that the client process transmits its request and, until acknowledgments have been received, continues to retransmit the same messages periodically. The server collects messages and, when the full request has been received, performs the appropriate procedure invocation. When it transmits its reply, the same sort of reliable communication protocol is used. Often, the acknowledgment is delayed briefly in the hope that the reply will be sent soon and can be used in place of a separate acknowledgment.
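
    A client-side retransmission loop of this kind might be sketched as follows. The retry interval, retry limit, port number, and wire format are arbitrary choices made for the example; a real RPC package would also match each reply against the sequence number of the outstanding request.

        /* Client-side retransmission: resend the request once per second until
         * a reply arrives or the retry limit is reached. Port, retry interval,
         * and wire format are arbitrary choices for the example. */
        #include <stdio.h>
        #include <string.h>
        #include <sys/types.h>
        #include <sys/select.h>
        #include <sys/socket.h>
        #include <netinet/in.h>
        #include <arpa/inet.h>

        #define RETRY_SECONDS 1
        #define MAX_RETRIES   5

        int main(void)
        {
            int s = socket(AF_INET, SOCK_DGRAM, 0);
            struct sockaddr_in srv;

            memset(&srv, 0, sizeof(srv));
            srv.sin_family = AF_INET;
            srv.sin_port = htons(9000);               /* hypothetical server port */
            inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);

            const char *request = "seq=17 op=READ";   /* invented wire format */
            char reply[1024];

            for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
                /* (Re)transmit; the server discards duplicates it has seen. */
                sendto(s, request, strlen(request), 0,
                       (struct sockaddr *)&srv, sizeof(srv));

                fd_set fds;
                FD_ZERO(&fds);
                FD_SET(s, &fds);
                struct timeval tv = { RETRY_SECONDS, 0 };

                if (select(s + 1, &fds, NULL, NULL, &tv) > 0) {
                    ssize_t n = recvfrom(s, reply, sizeof(reply) - 1, 0, NULL, NULL);
                    if (n >= 0) {
                        reply[n] = '\0';
                        printf("reply: %s\n", reply);
                        return 0;
                    }
                }
                /* Timed out: fall through and retransmit. */
            }
            fprintf(stderr, "no response; the server may have failed\n");
            return 1;
        }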

    A number of important optimizations have been proposed by developers of RPC-oriented distributed computing environments--for example, if one request will require the transmission of multiple messages, because the request is large, it is common to inhibit the sending of acknowledgments during the transmission of the burst of messages. In this case, a negative acknowledgment is sent if the receiver detects a missing packet; a single acknowledgment confirms reception of the entire burst when all packets have been successfully received (Figure 4.4). Similarly, it is common to delay the transmission of acknowledgment packets in the hope that the reply message itself can be transmitted instead of an acknowledgment; obviously, the receipt of a reply implies that the corresponding request was delivered and executed.
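
    The receiving side of such a burst protocol can be sketched as follows. The fixed burst size is an assumption made for the example, and the send_nack and send_ack helpers simply print; the structure shows the essential idea, namely that a gap in the packet sequence provokes a negative acknowledgment, while a single acknowledgment is sent only once the entire burst is present.

        /* Receiver side of a burst: a gap in the arriving packet indices
         * provokes a negative acknowledgment; one acknowledgment covers the
         * whole burst once every packet is present. Helpers just print. */
        #include <stdbool.h>
        #include <stdio.h>

        #define BURST_SIZE 8

        static bool received[BURST_SIZE];

        static void send_nack(int missing) { printf("NACK %d\n", missing); }
        static void send_ack(void)         { printf("ACK: burst complete\n"); }

        /* Called as each packet of the burst arrives. */
        static void on_packet(int index)
        {
            received[index] = true;

            /* Any gap below the newest packet means an earlier one was lost. */
            for (int i = 0; i < index; i++)
                if (!received[i])
                    send_nack(i);

            for (int i = 0; i < BURST_SIZE; i++)
                if (!received[i])
                    return;                  /* still incomplete: no ack yet */
            send_ack();
        }

        int main(void)
        {
            on_packet(0);
            on_packet(1);
            on_packet(3);                    /* packet 2 missing: prints "NACK 2" */
            on_packet(2);                    /* retransmission fills the gap */
            for (int i = 4; i < BURST_SIZE; i++)
                on_packet(i);                /* last packet triggers the ACK */
            return 0;
        }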


    FIGURE 4.3. Simple RPC interaction, showing packets that contain data (dark) and acknowledgments (light).

    Process and machine failures, unfortunately, render this very simple approach inadequate. The essential problem is that because communication is over unreliable networking technologies, when a process is unable to communicate with some other process, there is no way to determine whether the problem is a network failure, a machine failure, or both (if a process fails but the machine remains operational, the operating system will often provide some status information, permitting this one case to be accurately sensed).

    When an RPC protocol fails by timing out, but the client or server (or both) remains operational, it is impossible to know what has occurred. Perhaps the request was never received, perhaps it was received and executed but the reply was lost, or perhaps the client or server crashed while the protocol was executing. This creates a substantial challenge for the application programmer who wishes to build an application that will operate reliably despite failures of some of the services upon which it depends.

    A related problem concerns the issue of what are called exactly once semantics. When a programmer employs LPC, the invoked procedure will be executed exactly once for each invocation. In the case of RPC, however, it is not evident that this problem can be solved. Consider a process, c, that issues an RPC to a service offered by process s. Depending upon the assumptions we make, it may be very difficult even to guarantee that s performs this request at most once. (Obviously, the possibility of a failure precludes a solution in which s would perform the operation exactly once.)

    To understand the origin of the problem, consider the possible behaviors of an arbitrary communication network. Messages can be lost in transmission, and as we have seen this can prevent process c from accurately detecting failures of process s. But, the network might also misbehave by delivering a message after an unreasonably long delay--for example, suppose that a network router device fails by jamming up in such a manner that until the device is serviced, the software within it will simply wait for the hardware to be fixed. Obviously, there is no reason to simply assume that routers won't behave this way, and in fact it is known that some routers definitely could behave this way. Moreover, one can imagine a type of attack upon a network in which an intruder records messages for future replay.

    One could thus imagine a situation in which process s performs a request from c, but then is presented with the same request after a very long delay (Figure 4.5). How can process s recognize this as a duplicate of the earlier request?


    FIGURE 4.4. RPC using a burst protocol; here the reply is sent soon enough so that an acknowledgment to the burst is not needed.


    FIGURE 4.5. If an old request is replayed, perhaps because of a transient failure in the network, a server may have difficulty protecting itself against the risk of reexecuting the operation.

    Depending upon the specific protocol used, an RPC package can use a variety of barriers to protect itself against replays of long-delayed messages--for example, the package might check timestamps in the incoming messages, rejecting any that are very old. Such an approach, however, presumes that clocks are synchronized to a reasonable degree and that there is no danger that a message will be replayed with a modified timestamp--an action that might well be within the capabilities of a sophisticated intruder. The server could use a connect-based binding to its clients, but this merely pushes the same problem into the software used to implement network connections--and, as we shall see shortly, the same issues occur and remain just as intractable at that level of a system. The server might maintain a list of currently valid users and could insist that each message be identified by a monotonically increasing sequence number--but a replay could, at least theoretically, reexecute the original binding protocol.

    Analyses such as these lead us to two possible conclusions. One view of the matter is that an RPC protocol should take reasonable precautions against replay but not be designed to protect against extreme situations such as replay attacks. In this approach, an RPC protocol might claim to guarantee at most once semantics, meaning that provided that the clock synchronization protocol has not been compromised or some sort of active attack been mounted upon the system, each operation will result in either a single procedure invocation or, if a communication or process failure occurs, in no invocation. An RPC protocol can similarly guarantee at least once semantics, meaning that if the client system remains operational indefinitely, the operation will be performed at least once but perhaps more than once. Notice that both types of semantics come with caveats: conditions (hopefully very unlikely ones) under which the property would still not be guaranteed. In practice, most RPC environments guarantee a weak form of at most once semantics: Only a mixture of an extended network outage and a clock failure could cause such systems to deliver a message twice, and this is not a very likely problem.

    A different approach, also reasonable, is to assume a very adversarial environment and protect the server against outright attacks that could attempt to manipulate the clock, modify messages, and otherwise interfere with the system. Security architectures for RPC applications commonly start with this sort of extreme position, although it is also common to weaken the degree of protection to obtain some performance benefits within less hostile subsets of the overall computing system. We will return to this issue and discuss it in some detail in Chapter 19.


    4.8     Using RPC in Reliable Distributed Systems

    The uncertainty associated with RPC failure notification and the weak RPC invocation semantics seen on some systems pose a challenge to the developer of a reliable distributed application.

    A reliable application would typically need multiple sources of critical services, so that if one server is unresponsive or faulty the application can reissue its requests to another server. If the server behaves as a read-only information source, this may be an easy problem to solve. However, as soon as the server is asked to deal with dynamically changing information, even if the changes are infrequent compared to the rate of queries, a number of difficult consistency and fault-tolerance issues arise. Even questions as simple as load-balancing, so that each server in a service spanning multiple machines will do a roughly equal share of the request processing load, can be very difficult to solve.

    Suppose that an application will use a primary-backup style of fault tolerance, and the requests performed by the server affect its state. The basic idea is that an application should connect itself to the primary, obtaining services from that process as long as it is operational. If the primary fails, the application will fail-over to the backup. Such a configuration of processes is illustrated in Figure 4.6. Notice that the figure includes multiple client processes, since such a service might well be used by many client applications at the same time.

    FIGURE 4.6. Idealized primary-backup server configuration. Clients interact with the primary and the primary keeps the backup current.

    Consider now the design of a protocol by which the client can issue an RPC to the primary-backup pair such that if the primary performs the operation, the backup learns of the associated state change. In principle, this may seem simple: The client would issue an RPC to the server, which would compute the response and then issue an RPC to the backup, sending it the request it performed, the associated state change, and the reply being returned to the client. Then the primary would return the reply, as shown in Figure 4.7.

    FIGURE 4.7. Simplistic RPC protocol implementing primary-backup replication.
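
    In outline, the primary's side of this simplistic protocol might look like the following sketch. The request, reply, and state-change structures and the rpc_update_backup stub are invented for the example; notice that if the nested RPC times out, the primary has no way to know whether the backup applied the update.

        /* The primary's request handler: perform the operation, propagate the
         * state change to the backup with a nested RPC, then reply. The data
         * structures and the rpc_update_backup() stub are invented here. */
        #include <stdbool.h>
        #include <stdio.h>

        struct request     { int op; int arg; };
        struct reply       { int status; int value; };
        struct state_delta { int new_value; };

        static int current_value;            /* the primary's replicated state */

        /* Stub for the nested RPC to the backup; returns false on timeout, in
         * which case the primary cannot tell what the backup actually saw. */
        static bool rpc_update_backup(const struct request *req,
                                      const struct state_delta *delta,
                                      const struct reply *rep)
        {
            printf("forwarding op %d (new value %d, status %d) to backup\n",
                   req->op, delta->new_value, rep->status);
            return true;
        }

        /* Invoked by the RPC runtime when a client request arrives. */
        static struct reply handle_client_request(const struct request *req)
        {
            struct reply rep;
            struct state_delta delta;

            /* 1. Perform the operation locally and record the state change. */
            current_value += req->arg;
            delta.new_value = current_value;
            rep.status = 0;
            rep.value = current_value;

            /* 2. Nested RPC: bring the backup up to date before replying. */
            if (!rpc_update_backup(req, &delta, &rep))
                rep.status = -1;             /* timeout: outcome at backup unknown */

            /* 3. The runtime now sends this reply back to the client. */
            return rep;
        }

        int main(void)
        {
            struct request req = { 1, 5 };   /* hypothetical "add 5" operation */
            struct reply rep = handle_client_request(&req);
            printf("reply: status=%d value=%d\n", rep.status, rep.value);
            return 0;
        }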

    This simple protocol is, however, easily seen to be flawed if the sorts of problems we discussed in the previous section can occur while it is running (see Birman and Glade). Take the issue of timeout (see Figure 4.8). In this solution, two RPCs occur, one nested within the other. Either of these, or both, could fail by timeout, in which case there is no way to know with certainty in what state the system was left. If, for example, the client sees a timeout failure, there are quite a few possible explanations: The request may have been lost, the reply may have been lost, and either the primary or the primary and the backup may have crashed. Fail-over to the backup would only be appropriate if the primary were indeed faulty, but there is no accurate way to determine if this is the case, except by waiting for the primary to recover from the failure--not a very available approach.

    FIGURE 4.8. RPC timeouts can create inconsistent states, such as this one, in which two clients are connected to the primary, one to the backup, and one is disconnected from the service. Moreover, the primary and backup have become disconnected from one another--each considers the other faulty. In practice, such problems are easily provoked by transient network failures. They can result in serious application-level errors--for example, if the clients are air traffic controllers and the servers advise them on the safety of air traffic routing changes, this scenario could lead two controllers to route different planes into the same sector of the airspace!

    The matter is further complicated by the presence of more than one client. One could easily imagine that different clients could observe different and completely uncorrelated outcomes for requests issued simultaneously but during a period of transient network or computer failures. Thus, one client might see a request performed successfully by the primary, another might conclude that the primary is apparently faulty and try to communicate with the backup, and yet a third may have timed out both on the primary and the backup! We use the term "inconsistent" in conjunction with this sort of uncoordinated and potentially incorrect behavior. An RPC system clearly is not able to guarantee the consistency of the environment, at least when the sorts of protocols discussed above are employed, and hence reliable programming with RPC is limited to very simple applications.

    The line between easily solved RPC applications and very difficult ones is not a very clear one--for example, one major type of file server accessible over the network is accessed by an RPC protocol with very weak semantics, which can be visible to users. Yet this protocol, called the Network File System protocol, is widely popular and has the status of a standard, because it is easy to implement and widely available on most vendor computing systems. NFS is discussed in more detail in Section 7.3 and so we will be very brief here.

    One example of a way in which NFS behavior reflects an underlying RPC issue occurs when creating a file. NFS documentation specifies that the file creation operation should return the error code EEXISTS if a file already exists at the time the create operation is issued. However, there is also a case in which NFS can return error EEXISTS even though the file did not exist when the create was issued. This occurs when the create RPC times out, even though the request was delivered to the server and was performed successfully. NFS automatically reissues requests that fail by timing out and will retry the create operation, which now attempts to reexecute the request and fails because the file is now present. In effect, NFS is unable to ensure at most once execution of the request, and hence can give an incorrect return code. Had NFS been implemented using LPC (as in the LFS or local file system), this behavior would not be possible.
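
    The scenario can be captured in a few lines. In the sketch below, the server and the lost reply are simulated, and rpc_create stands in for the client-side stub; the retried create finds the file already present, so the client is told that the file exists even though its own create succeeded. (The POSIX EEXIST code stands in for the NFS error here.)

        /* Simulation of the retried create: the operation succeeds, the reply
         * is lost, and the automatic retry is told the file already exists.
         * rpc_create() stands in for the client-side stub; the POSIX EEXIST
         * code stands in for the NFS error. */
        #include <errno.h>
        #include <stdbool.h>
        #include <stdio.h>

        static bool file_exists;             /* simulated server state */

        /* Simulated server call: the first reply is "lost" in the network. */
        static int rpc_create(const char *name, bool *reply_received)
        {
            static int calls;
            int status = file_exists ? EEXIST : 0;

            (void)name;
            file_exists = true;              /* the create itself succeeds */
            *reply_received = (++calls > 1); /* drop the first reply */
            return status;
        }

        /* Client stub that, like NFS, reissues a request that times out. */
        static int create_with_retry(const char *name)
        {
            for (;;) {
                bool got_reply;
                int status = rpc_create(name, &got_reply);
                if (got_reply)
                    return status;           /* the retry returns EEXIST */
                /* Reply lost: time out and retransmit the create. */
            }
        }

        int main(void)
        {
            int status = create_with_retry("newfile");
            printf("create returned %s\n", status == EEXIST ? "EEXIST" : "OK");
            return 0;
        }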

    NFS illustrates one approach to dealing with inconsistent behavior in an RPC system. By weakening the semantics presented to the user or application program, NFS is able to provide acceptable behavior despite RPC semantics that create considerable uncertainty when an error is reported. In effect, the erroneous behavior is simply redefined to be a feature of the protocol.

    A second broad approach that will interest us here involves the use of agreement protocols by which the components of a distributed system maintain consensus on the status (operational or failed) of one another. A rigorous derivation of the obligations upon such consensus protocols, the limitations on this approach, and the efficient implementation of solutions will be discussed later in this book (see Section 13.3). Briefly, however, the idea is that any majority of the system can be empowered to vote that a minority (often, just a single component) be excluded on the basis of apparently faulty behavior. Such a component is cut off from the majority group: If it is not really faulty, or if the failure is a transient condition that corrects itself, the component will be prevented from interacting with the majority system processes and will eventually detect that it has been dropped. It can then execute a rejoin protocol, if desired, after which it will be allowed back into the system.
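
    The core of the exclusion rule is just a majority vote, as the following sketch shows. The membership size and the suspicion matrix are simulated for the example; the real protocols, taken up later in the book, must also agree on when the vote is taken and on the resulting membership list.

        /* The exclusion rule: a member is dropped only when a strict majority
         * of the current membership has voted that it appears faulty. The
         * membership size and the suspicion matrix are simulated. */
        #include <stdbool.h>
        #include <stdio.h>

        #define N_MEMBERS 5

        /* suspects[i][j] is true if member i suspects member j has failed. */
        static bool suspects[N_MEMBERS][N_MEMBERS];

        static bool majority_excludes(int target)
        {
            int votes = 0;
            for (int i = 0; i < N_MEMBERS; i++)
                if (suspects[i][target])
                    votes++;
            return votes > N_MEMBERS / 2;    /* strict majority required */
        }

        int main(void)
        {
            /* Members 0, 1, and 2 suspect member 4, perhaps because a transient
             * network problem has cut it off from the rest of the system. */
            suspects[0][4] = suspects[1][4] = suspects[2][4] = true;

            if (majority_excludes(4))
                printf("member 4 is excluded and must later rejoin\n");
            else
                printf("no majority: member 4 remains in the membership\n");
            return 0;
        }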

    With this approach, failure becomes an abstract event--true failures can trigger this type of event, but because the system membership is a self-maintained property of the system, the inability to accurately detect failures need not be reflected through inconsistent behavior. Instead, a conservative detection scheme can be used, which will always detect true failures while making errors infrequently (discussed in more detail in Section 13.9).

    By connecting an RPC protocol to a group membership protocol that runs such a failure consensus algorithm, a system can resolve one important aspect of the RPC error-reporting problems discussed above. The RPC system will still be unable to accurately detect failures; hence, it will be at risk of incorrectly reporting operational components as having failed. However, the behavior will now be consistent throughout the system: If component a observes the failure of component b, then component c will also observe the failure of b, unless c is also determined to be faulty. In some sense, this approach eliminates the concept of failure entirely, replacing it with an event that might be called exclusion from membership in the system. Indeed, in the case where b is actually experiencing a transient problem, the resulting execution is much like being exiled from one's country: b is prevented from communicating with other members of the system and learns this. Conversely, the concept of a majority allows the operational part of the system to initiate actions on behalf of the full membership in the system. The system now becomes identified with a rigorous concept: the output of the system membership protocol, which can itself be defined formally and reasoned about using formal tools.

    As we move beyond RPC to consider more complex distributed programming paradigms, we will see that this sort of consistency is often required in nontrivial distributed applications. Indeed, there appears to be a dividing line between the distributed applications that give nontrivial coordinated behavior at multiple locations, and those that operate as completely decoupled interacting components, with purely local correctness criteria. The former type of system requires the type of consistency we have encountered in this simple case of RPC error reporting. The latter type of system can manage with error detection based upon timeouts--but is potentially unsuitable for supporting any form of consistent behavior.


    4.9     Related Reading

    A tremendous amount has been written about client/server computing, and several pages of references could easily have been included here. Good introductions to the literature, including more detailed discussions of DCE and ASN.1, can be found in Birrell and Nelson, Comer and Stevens (1993), Coulouris et al., Tanenbaum.

    On RPC performance, the classic reference is Schroeder and Burrows. Critiques of the RPC paradigm appear in Birman and van Renesse, Tanenbaum and van Renesse.

    On the problem of inconsistent failure detection with RPC, see Birman and Glade.

    Other relevant publications include Bal et al. (1992), Bellovin and Merritt, Berners-Lee et al. (1994, 1995), Birrell and Nelson, Braun and Diot, Brockschmidt, Engler et al., Govindan and Anderson, Heidemann and Popek (1994), Jacobson (1988, 1990), Mullender et al., Rashid, Schroeder and Burrows, Thekkath and Levy, von Eicken et al. (1995).

    A good reference to DCE is Open Software Foundation and to OLE-2 is Brockschmidt.

    Kerberos is discussed in Bellovin and Merritt, Schiller, Steiner et al.



         1 It is common to call the interface to a program its IDL, although IDL is actually shorthand for Interface Definition Language, which is the language used to write down the description of such an interface. Historically, this seems to represent a small degree of resistance to the overuse of acronyms by the distributed system standardization community. Unfortunately, the resistance seems to have been short-lived: CORBA introduces at least a dozen new three-letter acronyms, ATM has swept the networking community, and four- and five-letter acronyms (as the available three-letter combinations are used up) seem certain to follow!