Survivability in computer systems has become a primary concern in modern computer networks, due to the prevalence of both malicious faults and benign faults and the increasing need to ensure that critical systems can provide services during those faults. The paper Survivable Systems Based on an Adaptive NMR Algorithm, by Wing Kai Ma, Wei Li, I-Ling Yen, Farokh Bastani, and Ing-Ray Chen [1], explores limitations in current methods to attain survivability and explores a coordination protocol called an Adaptive NMR Algorithm that seeks to address specific limitations.
Survivability is maintained primarily in three ways:
1) Replication – this is an essential method to the survivability of any system. The system model used in the paper assumes that there is replication of both the server nodes and the client nodes, such that some faults can be tolerated in either the server group or the client groups.
2) Data Protection – this is primarily provided through data encryption and partitioning schemes. The problem with these approaches, however, is that the data can be comprised when it is executed upon in the server nodes. While the paper explores Blackey’s Data Partitioning Scheme combined with a new secret-share approach that allows execution of data while it is still partitioned, it relegates the proof of its correctness to another paper by one of the authors, as this is not the primary purpose of this paper [2].
3) Agreement Protocols – this is the primary way in which coordination, and through coordination detection of faults, is achieved. Current agreement protocols, called NMR algorithms or schemes, do achieve survivability but impose certain limitations upon the system through high communication overhead. The Adaptive NMR algorithm proposed seeks to reduce the communication overhead while still performing agreement between the nodes.
The NMR schemes apply to systems that have some group of replicated servers and have some groups of clients who request services from those servers (Figure 1, [1]). For this system model we assume that the messages are signed, which means that if we tolerate k failures per client group and t failures for the server group we need n = 2k + 1 client nodes for each client group and m = 2t + 1 server nodes. Additionally we assume that the model is loosely synchronized via time-out procedures, as purely asynchronous systems cannot achieve coordination via agreement protocols.
The communication patterns for the NMR scheme (Figure 2a, [1]) rely upon messages from the clients being sent to the servers, and the servers sending replies back to the clients. The clients examine the replies back from the servers, and decide on a majority of the replies received. A server that did not send the majority reply value is faulty. Faulty client nodes are detected by the servers via faulty request messages. The primary problem with NMR schemes is the overhead imposed at the server nodes. Each server receives ng messages, where n is the number of client nodes per group and g is the number of client groups. This bottleneck severely limits the scalability and performance of the system. Thus, the primary purpose of the Adaptive NMR (ANMR) scheme is to reduce this bottleneck and still maintain the survivability of the system. It does this by reducing the number of messages sent to the server nodes during normal operation, and in the presence of some faults, switch back to the NMR scheme in order to achieve agreement. The typical communication protocol (Figure 2b, [1]) for the ANMR scheme is achieved by having only one client node per client group send messages to the server nodes during normal operation. This node, called the Client Executor or CE, is responsible for the messages, while the other nodes in the client groups, called the Client Verifiers or CV, shadow the operations of the CE by listening in on the messages sent. They perform the same operations based on the requests and replies sent. If a fault is detected, they revert back to the NMR scheme and re-request replies back from the server nodes. Based on the resent replies, they determine if the server node is faulty or if the CE is faulty. If a CV is faulty, we cannot determine its typical actions. If however, in the worst case, it sends faulty requests to the servers, the servers reply back only to the CV that sent the message. If more than k requests are sent per group, than the server knows that the CE is faulty. Since a server only replies directly to a CV and only replies to one resent request, any faulty CV will only result in a few extra messages. Since the primary difference between the ANMR scheme and a traditional NMR scheme is in the workings of the CEs, the proof of the correctness of the ANMR scheme is based around proving that the scheme works when the CEs are not faulty and when they are.
The results obtained in the paper show that during normal execution, the average response time from the server nodes to the client nodes decreases due to fewer messages ((n – 1)m messages) being sent to the server nodes. However, when a CE fails additional messages are generated ((n – 1)m messages). Thus, the frequency of CE failures plays a role in whether the ANMR scheme is appropriate for a given system. A few interesting scenarios can also occur during the ANMR scheme. If a CE fails, then numerous extra messages are sent as the CVs attempt to discern what is at fault, the servers or the CE. Multiple CE failures across several client groups at the same time could flood the server group with messages, since additional messages above the NMR scheme occur when a CE fails. This is one failure case that isn’t explored in the paper. One case that is noted however is that during a server failure, the ANMR scheme has much better performance in terms of server response time as there are fewer messages being sent to the servers initially, while in the NMR scheme the bottleneck essentially becomes narrower. Additionally, if a client group experiences more than k failures in the client group, the server node cannot say whether the failures occurred in the CE or the CVs. Likewise, if more than t failures occur in the server group, the model fails. Neither of these scenarios however is different than what occurs in the NMR schemes.
An interesting result of this paper is the possible applications for the ANMR scheme. As mentioned in the introduction and background, there are many different types of systems that are considered to be critical, from military command and control systems to banking servers, company webservers and networks, etc. The motivations behind this paper therefore are clear and the results obtained useful. A increase in the response time from the server nodes due to an agreement protocol would be a severe limitation in any of these types of systems, and hence it can be assumed that the frequency with which the agreement protocol can be executed has to be rarer. Additionally the size of the system would be restricted. This would increase the likelihood that malicious attacks could go unnoticed for some time, as well as benign faults. The fact that the ANMR scheme decreases response time from the servers implies that not only can the system be scaled better but also that the frequency with which the agreement protocol can be executed would increase; thereby increasing the chance that faults would not go undetected for a significant amount of time.
Some interesting future work could also be conducted into the ANMR scheme. Survivability is only one aspect of surviving a malicious attack; integrating the scheme with other network intrusion detection and adaptation techniques could prove interesting. It would also be interesting to see how the scheme would perform when the network is being utilized by different tasks, or during different attacks, i.e. if a malicious attack causes a denial of service attack upon the server nodes via a flooding of the network with packets, can the ANMR scheme itself still survive? While this paper provides a solid backbone in terms of the algorithm itself, there could be a significant amount of work done in applying the algorithm to specific problem scenarios and system models. Overall, my opinion of this paper and the ANMR scheme is very positive, as it demonstrates a significant advantage over the NMR scheme in many situations and provides an interesting glimpse into the issue of survivability for critical systems.
Miss Any?
Alright, we're gonna give this a shot
January 02, 2005
Let's see how badly I failed these last year
December 31, 2004
Okay, so its trendy
December 28, 2004
Its just like that asshole, Joe fucking Lieberman. Annoying and rather pointless.
December 27, 2004
Is this a typical Christmas?
December 26, 2004