T. Potter and H. Matter, RANDOM OR RATIONAL DESIGN - EVALUATION OF DIVERSE COMPOUND SUBSETS FROM CHEMICAL-STRUCTURE DATABASES, Journal of medicinal chemistry, 41(4), 1998, pp. 478-488
The performance of rational design to maximize the structural diversity
of databases for lead finding and lead refinement was investigated.
Rational methods such as maximum dissimilarity methods or hierarchical
cluster analysis for designing compound subsets were compared to a
random approach to study their efficiency for an enhancement of the
diversity of three different databases. All investigations were done based
on 2D fingerprints as a validated molecular descriptor. To compare the
performance of the rational selection methods to a random approach,
we additionally used probability calculations. When using maximum
dissimilarity-based selections, a single compound can be a member of
different neighborhoods as defined by the similarity threshold value,
while in hierarchical clustering each compound is assigned to only a
single cluster. Therefore the relationship between the similarity threshold
of the maximum diversity selection method and a 2D similarity search
threshold was studied. In contrast to hierarchical clustering analysis,
maximum dissimilarity selections allow the use of a similarity threshold
for adding a new compound to an already selected compound list.
Reasonable values for this similarity threshold are presented here. More
diverse subsets were designed using maximum dissimilarity selections,
which cover more biological classes than random selections. An
optimally diverse subset without redundant structures containing only
38% of one original dataset was generated, where no structure is more
similar than 0.85 to its nearest neighbor, but all biological classes
were represented. When it is acceptable to cover only 90% of all
biological targets, 3.5-3.7 times more compounds need to be selected using
a random approach than in a rational design approach. This coverage
rate shows the highest efficiency of design techniques compared to a
random approach. In those subsets no compound is closer than 0.70 to its
nearest neighbor. Furthermore, a comparative molecular field analysis
(CoMFA) is used to evaluate designed and randomly chosen subsets for a
database consisting of inhibitors of the angiotensin-converting enzyme.
It was shown that designed subsets using maximum dissimilarity methods
lead to more stable quantitative structure-activity relationship (QSAR)
models with higher predictive power compared to randomly chosen
compounds. This predictive power is especially high when there is no
compound in the test dataset with a similarity coefficient less than 0.7
to its nearest neighbor in the training set.
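
The threshold-controlled maximum dissimilarity selection described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact algorithm or software used, so everything below is an assumption made for illustration: fingerprints are modeled as sets of "on" bit indices, similarity is taken to be the Tanimoto coefficient, candidates are picked greedily by lowest nearest-neighbor similarity, and the names `tanimoto`, `max_dissimilarity_select`, and the toy fingerprints are hypothetical.

```python
from typing import FrozenSet, List, Sequence

# Assumption: a 2D fingerprint is modeled as the set of its "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient of two bit sets: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def max_dissimilarity_select(
    fps: Sequence[Fingerprint], threshold: float, seed_index: int = 0
) -> List[int]:
    """Greedy selection: repeatedly add the compound least similar to the
    subset, and stop once no candidate is below the similarity threshold."""
    if not fps:
        return []
    selected = [seed_index]
    # nearest[i] = highest similarity of compound i to any selected compound
    nearest = [tanimoto(fp, fps[seed_index]) for fp in fps]
    remaining = set(range(len(fps))) - {seed_index}
    while remaining:
        # Candidate farthest from the current subset (ties broken by index).
        cand = min(remaining, key=lambda i: (nearest[i], i))
        if nearest[cand] >= threshold:
            break  # every remaining compound is too close to the subset
        selected.append(cand)
        remaining.discard(cand)
        for i in remaining:
            nearest[i] = max(nearest[i], tanimoto(fps[i], fps[cand]))
    return selected


# Toy data (made up for illustration, not taken from the paper).
toy_fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),     # similarity 0.6 to compound 0
    frozenset({1, 2, 3, 4, 6}),  # similarity 0.8 to compound 0
    frozenset({7, 8, 9}),        # shares no bits with the others
]
subset_070 = max_dissimilarity_select(toy_fps, threshold=0.70)
subset_085 = max_dissimilarity_select(toy_fps, threshold=0.85)
print(subset_070, subset_085)
```

With a threshold of 0.70, compound 2 is rejected in this toy set because it is 0.8 similar to compound 0, which has already been selected. Raising the threshold to 0.85 admits it. This mirrors the abstract's point that the threshold fixes how close any selected compound may be to its nearest neighbor in the subset.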