The RAIN project is a research collaboration between Caltech and NASA-JPL
on distributed computing and data storage systems for future spaceborne
missions. The goal of the project is to identify and develop key building
blocks for reliable distributed systems built with inexpensive
off-the-shelf components. The RAIN platform consists of a heterogeneous
cluster of computing and/or storage nodes connected via multiple
interfaces to networks configured in fault-tolerant topologies. The RAIN
software components run in conjunction with operating system services and
standard network protocols. Through software-implemented fault tolerance,
the system tolerates multiple node, link, and switch failures, with no
single point of failure. The RAIN technology has been transferred to
Rainfinity, a start-up company focusing on creating clustered solutions
for improving the performance and availability of Internet data centers.
In this paper, we describe the following contributions: 1) fault-tolerant
interconnect topologies and communication protocols providing consistent
error reporting of link failures, 2) fault management techniques based on
group membership, and 3) data storage schemes based on computationally
efficient error-control codes. We present several proof-of-concept
applications: a highly available video server, a highly available Web
server, and a distributed checkpointing system. We also describe a
commercial product, Rainwall, built with the RAIN technology.