Eight large chemical databases have been analyzed and compared to each
other. Central to this comparison is the open National Cancer Institute
(NCI) database, consisting of approximately 250 000 structures. The other
databases analyzed are the Available Chemicals Directory ("ACD," from MDL,
release
1.99, 3D-version); the ChemACX ("ACX," from CamSoft, Version 4.5); the
Maybridge Catalog and the Asinex database (both as distributed by CamSoft
as part of ChemInfo 4.5); the Sigma-Aldrich Catalog (CD-ROM, 1999 Version);
the
World Drug Index ("WDI," Derwent, version 1999.03); and the organic part of
the Cambridge Crystallographic Database ("CSD," from Cambridge
Crystallographic Data Center, 1999 Version 5.18). The database properties
analyzed are
internal duplication rates; compounds unique to each database; cumulative
occurrence of compounds in an increasing number of databases; overlap of
identical compounds between two databases; similarity overlap; diversity; and
others. The crystallographic database CSD and the WDI show somewhat less
overlap with the other databases than those databases show with each other.
In particular, the collections of commercial compounds and compilations of
vendor catalogs
have a substantial degree of overlap with one another. Still, no database
is completely a subset of any other, and each appears to have its own niche
and thus "raison d'etre". The NCI database has by far the highest number of
compounds that are unique to it. Approximately 200 000 of the NCI
structures were not found in any of the other analyzed databases.
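
Several of the properties listed above (internal duplication rate, compounds
unique to each database, pairwise overlap of identical compounds, and
occurrence across an increasing number of databases) can be illustrated with
plain set operations. The following is only a minimal sketch, not the
procedure used in the study: it assumes that every structure has already
been reduced to a canonical identifier string so that identical compounds
compare equal, and the function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def duplication_rate(records):
    """Fraction of records repeating an identifier already present in the same database."""
    if not records:
        return 0.0
    return 1.0 - len(set(records)) / len(records)


def unique_to_each(databases):
    """Map each database name to the identifiers found in no other database."""
    unique = {}
    for name, ids in databases.items():
        others = set().union(*(set(v) for k, v in databases.items() if k != name))
        unique[name] = set(ids) - others
    return unique


def pairwise_overlap(databases):
    """Number of identical identifiers shared by each pair of databases."""
    return {
        (a, b): len(set(databases[a]) & set(databases[b]))
        for a, b in combinations(sorted(databases), 2)
    }


def occurrence_histogram(databases):
    """Map k to the number of distinct identifiers present in exactly k databases."""
    per_compound = Counter()
    for ids in databases.values():
        per_compound.update(set(ids))  # count each database at most once
    return dict(Counter(per_compound.values()))


# Hypothetical toy data: three tiny "databases" of canonical identifiers.
toy = {
    "A": ["c1", "c2", "c2", "c3"],
    "B": ["c2", "c3", "c4"],
    "C": ["c3", "c5"],
}

dup_a = duplication_rate(toy["A"])
unique = unique_to_each(toy)
overlap = pairwise_overlap(toy)
histogram = occurrence_histogram(toy)
```

Summing the histogram from k upward gives the number of compounds present in
at least k databases, which is one possible reading of the cumulative
occurrence mentioned above. Similarity overlap and diversity need a
structural similarity measure and are outside the scope of this sketch.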