I was involved in some of the earlier work toward learning stochastic
circuit models (a kind of temporal graphical model) of gene regulation
from microarray expression data. This method was a statistical
physics-based (deterministic annealing) clustering model and algorithm
in which a datum could truly belong to multiple clusters
simultaneously. (Mjolsness, Castano, and Gray, Multi-Parent
Clustering Algorithms for Large-Scale Gene Expression, JPL Report
1999. Mjolsness at al. Clustering Methods for the Analysis of
C. elegans Gene Expression Array Data, PSB 1999.)
|
|
Computational Chemistry,
Drug Discovery, Biology
| |
Along with many other people today, I believe that applied
mathematical approaches (computational, statistical, engineering) to
chemical and biological problems represent one of the most fruitful
scientific goldmines for the next few decades at least. I would say,
however, that this is really a special case of a more general
opportunity -- all of the natural sciences should and will be transformed
by such approaches in the near future. We have a large collaboration with a
major pharmaceutical company.
|
Molecule Ranking for Virtual Screening
| |
Automated high-throughput drug screening constitutes a critical
emerging approach in modern pharmaceutical research. The statistical
task of interest is that of discriminating active versus inactive
molecules given a target molecule, in order to rank potential drug
candidates for further testing. Because the core problem is one of
ranking, our approach concentrates on accurate estimation of unknown
class probabilities, in contrast to popular non-probabilistic methods
which simply estimate decision boundaries. While this motivates
nonparametric density estimation, we are faced with the fact that the
molecular descriptors used in practice typically contain thousands of
binary features. In this paper we attempt to improve the extent to
which kernel density estimation can work well in high-dimensional
classification settings. We present a synthesis of techniques
(SLAMDUNK: Sphere, Learn A Metric, Discriminate Using Nonisotropic
Kernels) which yields favorable performance in comparison to previous
published approaches to drug screening such as support vector
machines, as tested on a large high-dimensional proprietary
pharmaceutical dataset. (Gray, Komarek, Liu, and Moore,
High-Dimensional Probabilistic Classification for Drug Discovery [pdf],
[ps]
Computational Statistics 2004.) I consider this to just be a first
dip into the sea of this problem. Whether we use this method as a
component of the eventual system depends on the next step.
Next step:
The ultimate goal is to automatically choose which molecules
to test, in a running system. I am working right now on a new formulation
of active learning which is not based on the traditional least-squares
approach to experimental design. Another huge sub-problem in this
enterprise is the effective representation of molecules.
|
|