A modular framework is proposed for modeling and understanding the relationships
between molecular profile data and other domain knowledge using a combination
of generative (here, graphical models) and discriminative [Support
Vector Machines (SVMs)] methods. As an illustration, naive Bayes models, simple
graphical models, and SVMs were applied to published transcription profile
data for 1,988 genes in 62 colon adenocarcinoma tissue specimens labeled
as tumor or nontumor. These unsupervised and supervised learning methods
identified three classes or subtypes of specimens, assigned tumor or nontumor
labels to new specimens and detected six potentially mislabeled specimens.
The probability parameters of the three classes were utilized to develop
a novel gene relevance, ranking, and selection method. SVMs trained
to discriminate nontumor from tumor specimens using only the 50-200
top-ranked genes had the same or better generalization performance
than SVMs trained on the full repertoire of 1,988 genes. Approximately
90 marker genes were pinpointed for use in
understanding the basic biology of colon adenocarcinoma, defining targets
for therapeutic intervention and developing diagnostic tools. These potential
markers highlight the importance of tissue biology in the etiology of cancer.
Comparative analysis of molecular profile data is proposed as a mechanism for
predicting the physiological function of genes in instances when comparative
sequence analysis proves uninformative, such as with human and yeast
translationally controlled tumour protein. Graphical models and SVMs hold
promise as the foundations for developing decision support systems for
diagnosis, prognosis, and monitoring as well as inferring biological networks.
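
The rank-then-select step summarized above (score each gene, keep the top-ranked ones, then train a classifier on the reduced profile) can be illustrated with a minimal sketch. The score used here, the absolute difference of the two class means divided by the sum of the class standard deviations, is an assumption chosen for simplicity; it is not the ranking method derived from the class probability parameters in this work. The function names `rank_genes` and `select_top` and the toy expression values are hypothetical.

```python
from statistics import mean, stdev


def rank_genes(expr, labels):
    """Return gene indices ordered from most to least class-separating.

    expr   : list of specimens, each a list of expression values (one per gene)
    labels : list of class labels, exactly two distinct values
    Score  : |mean_a - mean_b| / (sd_a + sd_b)  (illustrative assumption)
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("exactly two classes are required")
    a, b = classes
    n_genes = len(expr[0])
    scores = []
    for g in range(n_genes):
        xa = [row[g] for row, y in zip(expr, labels) if y == a]
        xb = [row[g] for row, y in zip(expr, labels) if y == b]
        spread = stdev(xa) + stdev(xb)
        # A gene with no within-class variation gets score 0 to avoid dividing by zero.
        scores.append(abs(mean(xa) - mean(xb)) / spread if spread > 0 else 0.0)
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)


def select_top(expr, ranking, k):
    """Keep only the k top-ranked genes in every specimen."""
    keep = ranking[:k]
    return [[row[g] for g in keep] for row in expr]


# Toy data (made up for illustration): 6 specimens x 3 genes.
toy_expr = [
    [1.0, 5.0, 0.2],
    [1.2, 5.1, 0.9],
    [0.9, 4.9, 0.4],
    [3.0, 5.0, 0.5],
    [3.1, 5.2, 0.1],
    [2.9, 4.8, 0.8],
]
toy_labels = ["nontumor"] * 3 + ["tumor"] * 3

ranking = rank_genes(toy_expr, toy_labels)
reduced = select_top(toy_expr, ranking, k=1)
```

In this toy example only the first gene differs clearly between the two labels, so it is ranked first. The reduced matrix would then be passed to a classifier such as an SVM in place of the full profile.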