Huge data sets and the frontiers of computational feasibility

Citation
Wegman, Edward J., Huge data sets and the frontiers of computational feasibility, Journal of Computational and Graphical Statistics, 4(4), 1995, pp. 281-295
ISSN journal
1061-8600
Volume
4
Issue
4
Year of publication
1995
Pages
281-295
Database
ACNP
Abstract
Recently, Huber offered a taxonomy of data set sizes ranging from tiny (10² bytes) to huge (10¹⁰ bytes). This taxonomy is particularly appealing because it quantifies the meaning of tiny, small, medium, large, and huge. Indeed, some investigators consider 300 small and 10,000 large, while others consider 10,000 small. In Huber's taxonomy, most statistical and visualization techniques are computationally feasible with tiny data sets. With larger data sets, however, computers run out of computational horsepower and graphics displays run out of resolution fairly quickly. In this article, I discuss aspects of data set size and computational feasibility for general classes of algorithms in the context of CPU performance, memory size, hard disk capacity, screen resolution, and massively parallel architectures. I discuss some strategies, such as recursive formulations, that mitigate the impact of size. I also discuss the potential for scalable parallelization to mitigate the effects of computational complexity.
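
The recursive formulations mentioned in the abstract can be illustrated with a minimal sketch: a single-pass, Welford-style recursive update of the sample mean and variance, which summarizes an arbitrarily large stream in O(1) memory. This example is not drawn from the paper itself; the function name and the choice of Python are assumptions made purely for illustration.

import random

def recursive_mean_var(stream):
    # Recursive (single-pass) update of the sample mean and variance.
    # Each observation updates the running statistics in O(1) time and
    # O(1) memory, so the full data set never needs to fit in RAM.
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n          # recursive mean update
        m2 += delta * (x - mean)   # uses both old and new mean
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance

if __name__ == "__main__":
    # A generator stands in for a huge data set: values are produced
    # one at a time and never stored.
    data = (random.gauss(0.0, 1.0) for _ in range(10**6))
    print(recursive_mean_var(data))

Because each observation is folded into the running statistics and then discarded, the cost per observation is constant, which is the essential property that lets such formulations scale to the huge end of Huber's taxonomy.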