H. Matter, SELECTING OPTIMALLY DIVERSE COMPOUNDS FROM STRUCTURE DATABASES - A VALIDATION-STUDY OF 2-DIMENSIONAL AND 3-DIMENSIONAL MOLECULAR DESCRIPTORS, Journal of medicinal chemistry, 40(8), 1997, pp. 1219-1229
The efficiency of the drug discovery process can be significantly improved
using design techniques to maximize the diversity of structure databases or
combinatorial libraries. Here, several physicochemical descriptors were
investigated to quantify molecular diversity. Based on the 2D or 3D
topological similarity of molecules, the relationship between physicochemical
metrics and biological activity was studied to find
valid descriptors. Using those descriptors, several compounds were selected
from a database containing diverse templates and 55 biological classes. It was
evaluated whether the obtained subsets represent all biological properties and
structural variations of the original database. In addition, hierarchical
cluster analyses were used to group molecules from the parent database that
are expected to have similar biological properties. Using various sets of
structurally similar molecules, it was
possible to derive quantitative measures for compound similarities in
relation to biological properties. A similarity radius for 2D fingerprints
and molecular steric fields was estimated; compounds within this
radius of another molecule were shown to have comparable biological
properties. This study demonstrates that 2D fingerprints, alone or in
combination with other metrics, can serve as the primary descriptor for
handling global diversity. In addition, standard atom-pair descriptors or
molecular steric fields can be used to correlate structural diversity with
biological activity. Hence, the latter two descriptors can be classified as
secondary descriptors useful for analog library design, while 2D fingerprints
are suited to designing a general library for lead discovery. Based on these
findings, an optimally diverse subset containing
only 38% of the entire IC93 database was generated using 2D fingerprints. In
this subset, no structure has a Tanimoto coefficient greater than 0.85 to any
other, yet all biological classes are represented. This reduction of
redundancy led to a child database that spans the same physicochemical
diversity space and contains the same information as the original database.
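
The 0.85 Tanimoto cutoff described above can be illustrated with a minimal Python sketch. The abstract does not state which selection algorithm was used, so the greedy single pass below is an assumed stand-in, not the author's method. Fingerprints are represented as sets of on-bit indices, and the names `tanimoto` and `select_diverse` are hypothetical.

```python
from typing import FrozenSet, List, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def select_diverse(fps: Sequence[FrozenSet[int]],
                   max_similarity: float = 0.85) -> List[int]:
    """Greedy pass (an assumption, not the paper's procedure): keep a compound
    only if its similarity to every compound already kept is <= max_similarity."""
    selected: List[int] = []
    for i, fp in enumerate(fps):
        if all(tanimoto(fp, fps[j]) <= max_similarity for j in selected):
            selected.append(i)
    return selected


# Toy fingerprints, invented for illustration only.
fps = [
    frozenset(range(1, 11)),   # 10 bits set
    frozenset(range(1, 12)),   # shares 10 of 11 bits with the first (~0.91)
    frozenset({20, 21, 22}),   # no bits in common with the others
]
print(select_diverse(fps))  # [0, 2]
```

The second toy compound is dropped because its similarity to the first (10/11) exceeds 0.85. The result of a greedy pass depends on input order, so a different ordering can yield a different, equally valid subset.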