External memory algorithms and data structures: Dealing with massive data

Authors
Citation
Js. Vitter, External memory algorithms and data structures: Dealing with massive data, ACM C SURV, 33(2), 2001, pp. 209-271
Citations number
265
Categorie Soggetti
Computer Science & Engineering
Journal title
ACM COMPUTING SURVEYS
ISSN journal
03600300 → ACNP
Volume
33
Issue
2
Year of publication
2001
Pages
209 - 271
Database
ISI
SICI code
0360-0300(200106)33:2<209:EMAADS>2.0.ZU;2-4
Abstract
Data sets in large applications are often too massive to fit completely ins ide the computer's internal memory. The resulting input/output communicatio n (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this article we survey th e state of the art in the design and analysis of external memory (or EM) al gorithms and data structures, where the goal is to exploit locality in orde r to reduce the I/O costs. We consider a variety of EM paradigms for solvin g hatched and online problems efficiently in external memory. For the batch ed problem of sorting and related problems such as permuting and fast Fouri er transform, the key paradigms include distribution and merging. The parad igm of disk striping offers an elegant way to use multiple disks in paralle l. For sorting, however, disk striping can be nonoptimal with respect to I/ O, so to gain further improvements we discuss distribution and merging tech niques for using the disks independently. We also consider useful technique s for batched EM problems involving matrices (such as matrix multiplication and transposition), geometric data (such as finding intersections and cons tructing convex hulls), and graphs (such as list ranking, connected compone nts, topological sorting, and shortest paths). In the online domain, canoni cal EM applications include dictionary lookup and range searching. The two important classes of indexed data structures are based upon extendible hash ing and B-trees. The paradigms of filtering and bootstrapping provide a con venient means in online data structures to make effective use of the data a ccessed from disk. We also reexamine some of the above EM problems in sligh tly different settings, such as when the data items are moving, when the da ta items are variable-length (e.g., text strings), or when the allocated am ount of internal memory can change dynamically. Programming tools and envir onments are available for simplifying the EM programming task. During the c ourse of the survey, we report on some experiments in the domain of spatial databases using the TPIE system (transparent parallel I/O programming envi ronment). The newly developed EM algorithms and data structures that incorp orate the paradigms we discuss are significantly faster than methods curren tly used in practice.