Computing in the RAIN: A reliable array of independent nodes

Citation
V. Bohossian et al., Computing in the RAIN: A reliable array of independent nodes, IEEE Transactions on Parallel and Distributed Systems, 12(2), 2001, pp. 99-114
Number of citations
57
Subject categories
Computer Science & Engineering
Journal title
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
Journal ISSN
1045-9219
Volume
12
Issue
2
Year of publication
2001
Pages
99 - 114
Database
ISI
SICI code
1045-9219(200102)12:2<99:CITRAR>2.0.ZU;2-8
Abstract
The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly-available video server, a highly-available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology.