EXPLOITING PARTIAL REPLICATION IN UNBALANCED PARALLEL LOOP SCHEDULING ON MULTICOMPUTERS

Citation
S. Orlando and R. Perego, EXPLOITING PARTIAL REPLICATION IN UNBALANCED PARALLEL LOOP SCHEDULING ON MULTICOMPUTERS, Microprocessing and Microprogramming, 41(8-9), 1996, pp. 645-658
Citations number
17
Categorie Soggetti
Computer Sciences; Computer Science Hardware & Architecture
ISSN journal
0165-6074
Volume
41
Issue
8-9
Year of publication
1996
Pages
645 - 658
Database
ISI
SICI code
0165-6074(1996)41:8-9<645:EPRIUP>2.0.ZU;2-P
Abstract
We consider the problem of scheduling parallel loops whose iterations operate on large array data structures and are characterized by highly varying execution times (unbalanced or non-uniform parallel loops). A general parallel loop implementation template for message-passing distributed-memory multiprocessors (multicomputers) is presented. Assuming that it is impossible to statically determine the distribution of the computational load over the data accessed, the template exploits a hybrid scheduling strategy. The data are partially replicated in the processors' local memories, and iterations are statically scheduled until the first load imbalances are detected. At this point an effective dynamic scheduling technique is adopted to move iterations among nodes holding the same data. Most of the communications needed to implement dynamic load balancing are overlapped with computations, as a very effective prefetching policy is adopted. The template scales very well, since knowing where data are replicated makes it possible to balance the load without introducing high overheads. In the paper a formal characterization of the load imbalance related to a generic problem instance is also proposed. This characterization is used to derive an analytical cost model for the template and, in particular, to tune those parameters of the template that depend on the costs related to the specific features of the target machine and the specific problem. The template and the related cost model are validated by experiments conducted on a 128-node nCUBE 2, whose results are reported and discussed.
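To make the scheduling idea concrete, the following is a minimal, hypothetical sketch (in C with MPI), not the authors' nCUBE 2 template: nodes that replicate the same array partition form a group; each node first executes an even static share of the group's iterations, and the remaining iterations are handed out dynamically in chunks by a group "foreman" that keeps computing between requests. The constants (N_ITER, GROUP_SIZE, STATIC_FRAC, CHUNK) and the loop body are invented for illustration, and the prefetching and runtime imbalance-detection machinery of the actual template is omitted.

/* Illustrative sketch only, not the template from the paper: hybrid
   static/dynamic scheduling of a non-uniform loop, restricted to the group
   of nodes that replicate the same array partition.  All constants and the
   loop body are hypothetical.  Ranks and iterations are assumed to divide
   evenly among groups.  Build with an MPI C compiler (e.g. mpicc ... -lm). */
#include <mpi.h>
#include <math.h>
#include <stdio.h>

#define N_ITER      4096   /* total loop iterations (hypothetical)              */
#define GROUP_SIZE  4      /* nodes replicating the same data partition         */
#define STATIC_FRAC 0.7    /* fraction of a group's iterations scheduled statically */
#define CHUNK       8      /* iterations per dynamic hand-out                   */
#define TAG_REQ     1
#define TAG_WORK    2

/* Stand-in for the real loop body; its cost varies strongly with i. */
static double body(int i) {
    double x = 0.0;
    for (int k = 0; k < 1 + (i % 97) * 40; k++) x += sin((double)(i + k));
    return x;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Nodes holding the same replicated partition form one scheduling group. */
    MPI_Comm grp;
    MPI_Comm_split(MPI_COMM_WORLD, rank / GROUP_SIZE, rank, &grp);
    int me, nodes;
    MPI_Comm_rank(grp, &me);
    MPI_Comm_size(grp, &nodes);

    int ngroups   = size / GROUP_SIZE;
    int per_group = N_ITER / ngroups;
    int base      = (rank / GROUP_SIZE) * per_group;  /* group's iteration range */
    int n_static  = (int)(per_group * STATIC_FRAC);
    int pool      = base + n_static;                  /* first dynamic iteration */
    int pool_end  = base + per_group;

    /* Static phase: an even, cyclic share of the statically scheduled iterations. */
    double sum = 0.0;
    for (int i = base + me; i < base + n_static; i += nodes) sum += body(i);

    if (me == 0) {
        /* Group foreman: hands out the remaining iterations in chunks and
           keeps computing chunks itself while no request is pending. */
        int stopped = 0;
        while (pool < pool_end || stopped < nodes - 1) {
            int flag;  MPI_Status st;
            MPI_Iprobe(MPI_ANY_SOURCE, TAG_REQ, grp, &flag, &st);
            if (flag) {
                int dummy, span[2];
                MPI_Recv(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_REQ, grp, &st);
                span[0] = pool;
                span[1] = (pool_end - pool > CHUNK) ? pool + CHUNK : pool_end;
                pool = span[1];
                if (span[0] == span[1]) stopped++;    /* empty reply: peer stops */
                MPI_Send(span, 2, MPI_INT, st.MPI_SOURCE, TAG_WORK, grp);
            } else if (pool < pool_end) {
                sum += body(pool++);                  /* compute between requests */
            }
        }
    } else {
        /* Workers: request chunks from the foreman until the pool is empty. */
        for (;;) {
            int dummy = 0, span[2];
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQ, grp);
            MPI_Recv(span, 2, MPI_INT, 0, TAG_WORK, grp, MPI_STATUS_IGNORE);
            if (span[0] == span[1]) break;
            for (int i = span[0]; i < span[1]; i++) sum += body(i);
        }
    }

    double total;
    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("checksum = %f\n", total);
    MPI_Comm_free(&grp);
    MPI_Finalize();
    return 0;
}

Unlike the paper's template, which switches to dynamic scheduling only when an imbalance is actually detected and overlaps the resulting communication with computation via prefetching, the sketch simply reserves a fixed fraction of each group's iterations for dynamic hand-out in order to keep the example short.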