Design and evaluation of an HPVM-based Windows NT supercomputer

Citation
A. Chien et al., Design and evaluation of an HPVM-based Windows NT supercomputer, INT J HI PE, 13(3), 1999, pp. 201-219
Number of citations
41
Subject Categories
Computer Science & Engineering
Journal title
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS
ISSN journal
1094-3420
Volume
13
Issue
3
Year of publication
1999
Pages
201 - 219
Database
ISI
SICI code
1094-3420(199923)13:3<201:DAEOAH>2.0.ZU;2-6
Abstract
We describe the design and evaluation of a 192-processor Windows NT cluster for high performance computing based on the High Performance Virtual Machine (HPVM) communication suite. While other clusters have been described in the literature, building a 58 GFlop/s NT cluster to be used as a general-purpose production machine for NCSA required solving new problems. The HPVM software meets the challenges represented by the large number of processors, the peculiarities of the NT operating system, the need for a production-strength job submission facility and the requirement for mainstream programming interfaces. First, HPVM provides users with a collection of standard APIs like MPI, Shmem, Global Arrays with supercomputer class performance (13 μs minimum latency, 84 MB/s peak bandwidth for MPI), efficiently delivering Myrinet's hardware performance to application programs. Second, HPVM provides cluster management and scheduling (through integration with Platform Computing's LSF). Finally, HPVM addresses Windows NT's remote access problem, providing convenient remote access and job control (through a graphical Java-applet front-end). Given the production nature of the cluster, the performance characterization is largely based on a sample of the NCSA scientific applications the machine will be running. The side-by-side comparison with other present-generation NCSA supercomputers shows the cluster to be within a factor of 2 to 4 of the SGI Origin 2000 and Cray T3E performance at a fraction of the cost. The inherent scalability of the cluster design produces a comparable or better speedup than the Origin 2000 despite a limitation in the HPVM flow control mechanism.