We provide the first optimal algorithms in terms of the number of inpu
t/outputs (I/Os) required between internal memory and multiple seconda
ry storage devices for the problems of sorting, FFT, matrix transposit
ion, standard matrix multiplication, and related problems. Our two-lev
el memory model is new and gives a realistic treatment of parallel blo
ck transfer, in which during a single I/O each of the P secondary stor
age devices can simultaneously transfer a contiguous block of B record
s. The model pertains to a large-scale uniprocessor system or parallel
multiprocessor system with P disks. In addition, the sorting, FFT, pe
rmutation network, and standard matrix multiplication algorithms are t
ypically optimal in terms of the amount of internal processing time. T
he difficulty in developing optimal algorithms is to cope with the par
titioning of memory into P separate physical devices: Our algorithms'
performances can be significantly better than those obtained by the we
ll-known but nonoptimal technique of disk striping. Our optimal sortin
g algorithm is randomized, but practical;the probability of using more
than l times the optimal number of I/Os is exponentially small in l(l
og l) log(M/B), where M is the internal memory size.