Data sets in large applications are often too massive to fit completely ins
ide the computer's internal memory. The resulting input/output communicatio
n (or I/O) between fast internal memory and slower external memory (such as
disks) can be a major performance bottleneck. In this article we survey th
e state of the art in the design and analysis of external memory (or EM) al
gorithms and data structures, where the goal is to exploit locality in orde
r to reduce the I/O costs. We consider a variety of EM paradigms for solvin
g hatched and online problems efficiently in external memory. For the batch
ed problem of sorting and related problems such as permuting and fast Fouri
er transform, the key paradigms include distribution and merging. The parad
igm of disk striping offers an elegant way to use multiple disks in paralle
l. For sorting, however, disk striping can be nonoptimal with respect to I/
O, so to gain further improvements we discuss distribution and merging tech
niques for using the disks independently. We also consider useful technique
s for batched EM problems involving matrices (such as matrix multiplication
and transposition), geometric data (such as finding intersections and cons
tructing convex hulls), and graphs (such as list ranking, connected compone
nts, topological sorting, and shortest paths). In the online domain, canoni
cal EM applications include dictionary lookup and range searching. The two
important classes of indexed data structures are based upon extendible hash
ing and B-trees. The paradigms of filtering and bootstrapping provide a con
venient means in online data structures to make effective use of the data a
ccessed from disk. We also reexamine some of the above EM problems in sligh
tly different settings, such as when the data items are moving, when the da
ta items are variable-length (e.g., text strings), or when the allocated am
ount of internal memory can change dynamically. Programming tools and envir
onments are available for simplifying the EM programming task. During the c
ourse of the survey, we report on some experiments in the domain of spatial
databases using the TPIE system (transparent parallel I/O programming envi
ronment). The newly developed EM algorithms and data structures that incorp
orate the paradigms we discuss are significantly faster than methods curren
tly used in practice.