A data-driven approach to drug discovery: from machine learning to deep learning
The concept of chemical space, which comprises all possible organic and inorganic molecules, is central to drug discovery. The location of a molecule in this space can be described by a high-dimensional feature vector encoding structural and/or functional information (i.e., descriptors). Since the exhaustive characterization of all molecules is infeasible, virtual screening has emerged as a computational alternative for identifying new chemicals with a specific structure and activity. Virtual screening is based on the chemical similarity principle, which states that similar compounds have comparable properties and functions. Chemical similarity has been used since the early 1960s to establish Quantitative Structure-Property (or Structure-Activity) Relationships (QSPR/QSAR), which provide in silico estimates of physicochemical properties and bioactivity profiles for new chemicals.
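As an illustration of the similarity principle behind virtual screening, the following is a minimal sketch that ranks a small library against a query compound using the Tanimoto coefficient on binary fingerprints. The fingerprints, the compound names, and the helper functions `tanimoto` and `rank_by_similarity` are hypothetical and chosen only for illustration; they are not taken from the work presented in the talk.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit indices."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(a | b)


def rank_by_similarity(query_fp, library):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical fingerprints: each set holds the indices of the bits that are on.
query = {1, 4, 7, 9}
library = {
    "cmpd_A": {1, 4, 7, 9, 12},
    "cmpd_B": {2, 3, 5},
    "cmpd_C": {1, 4, 8},
}

ranking = rank_by_similarity(query, library)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In a real screening workflow the fingerprints would be computed from molecular structures by a cheminformatics toolkit, and the top-ranked compounds would be prioritized for experimental testing.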
Progress in high-throughput characterization techniques, together with advances in computational chemistry, facilitates the generation of large data repositories that can be used to develop predictive models to support virtual screening tasks. In this data-rich context, the use of machine learning and deep learning approaches provides the basis for developing accurate QSPR/QSAR models. However, as the amount of available data and the complexity of the learning algorithms increase, new challenges emerge in the development, validation, application and interpretation of these models. In the first part of the talk, we will discuss several models developed using “traditional” machine learning approaches (support vector machines, ensemble learning) to predict endpoints related to the toxicity of chemicals and nanoparticles. The second part of the talk will focus on the use of deep learning, where representation learning simplifies the feature engineering process required in traditional QSPR/QSAR development. We will discuss two deep learning architectures based on (1) a convolutional neural network using molecular drawings (Chemception), and (2) a recurrent neural network using SMILES strings (SMILES2vec). Finally, we will present ChemNet, a novel deep learning architecture that leverages weakly supervised learning and transfer learning for enhanced predictive accuracy.
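To make concrete how a SMILES string can be fed to a sequence model without hand-crafted descriptors, the following is a minimal sketch of character-level integer encoding with padding. It is an assumed, simplified preprocessing step and not the SMILES2vec implementation; the helpers `build_vocab` and `encode`, the example molecules, and the `max_len` value are hypothetical.

```python
def build_vocab(smiles_list):
    """Map each distinct character to a positive integer; 0 is kept for padding."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(smiles, vocab, max_len):
    """Convert a SMILES string into a fixed-length list of integer token ids."""
    ids = [vocab[ch] for ch in smiles[:max_len]]   # truncate long strings
    return ids + [0] * (max_len - len(ids))        # right-pad short strings


# Example molecules: ethanol, benzene, acetic acid.
# Note (simplification): splitting by character would break two-letter
# atoms such as Cl or Br; none appear in this toy example.
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_vocab(smiles)
encoded = [encode(s, vocab, max_len=8) for s in smiles]

for s, ids in zip(smiles, encoded):
    print(s, ids)
```

The resulting integer sequences would typically pass through an embedding layer before reaching the recurrent layers, so that the network learns its own representation of each token.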
This is joint work with Garrett Goh and Nathan Baker. This work was supported by the National Institute of Environmental Health Sciences of the NIH (R01 ES022190), the National Institute of General Medical Sciences (P41 GM103493), the NIH (P42 ES027704), and by the Pauling Fellowship and the Deep Learning for Scientific Discovery Laboratory Directed Research and Development Programs at Pacific Northwest National Laboratory, a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy.