Blast from the past: Is QSAR relevant to drug discovery?
Storks, overfitting and a descriptor glut
Another oldie but goodie with a provocative title. The author makes several significant points that are as relevant today as they were back then. While some of these problems are much better recognized today and can be mitigated, they still present important pitfalls that should be part of a QSAR checklist, and many of them are relevant to applications of AI (which is often a more sophisticated version of QSAR) in drug discovery.
On correlation vs causation:
"Time-dependent correlations constitute a special case of correlations that generally have no causative associations. For example, a gradual decrease in the number of breeding storks in Germany from 1960 to 1985 was determined to be strongly correlated with a similar decrease in the number of newborn babies. Data obtained from the US Census Bureau in 1986 demonstrated that the US population increased as the number of civil executions decreased. And, of course, a large fire is associated with more fire engines at the scene. What conclusions would be reached by these apparent correlations? Storks deliver babies, executions lower the population and fire engines cause fire damage. Thus, these examples of correlation illusions lead to false hypotheses about causation. Although the correlations are indeed factual, causation does not necessarily follow. Logically, the observation of a correlation should be pursued by experimentation to determine cause and effect."
On problems with using multiple descriptors to account for a specific observable (like potency or an ADME property):
"When a variety of molecular descriptors are presented for a correlation analysis to some specific, observable endpoint, how can investigators determine which descriptors are illusory? Chance correlations are surprisingly likely to occur. Many descriptors should therefore elicit considerable caution, even if only a few descriptors remain that can yield good correlations....If 20 random descriptors were presented to the same set of 15 observations, the resulting correlation equation would have an average R^2 of 0.73. Thus, most of the data were explained by chance. Any QSAR model built in this manner (ie, using a multitude of descriptors) is subject to a 'chance factor' effect, and could be seriously flawed as a result."
On over-reliance on R^2 values for correlations, especially considering the errors in assays:
"When the resulting model boasts an R^2 value that is higher than the data can support, overfitting may be responsible for the discrepancy. A possible tactic to avoid this issue is to consider how experimental error affects R^2. Such error has been demonstrated to limit the maximum R^2 value of a perfect QSAR model (J Comp Aided Mol De (2008) 22(2):81-89). For example, an observational error of approximately 2-fold (typical of many biological assays) applied to a dataset of 19 compounds was determined to be equivalent to a standard error of 0.2 to 0.3 log units (assuming the observational data spanned approximately 3 orders of magnitude or 3 log units). This level of error will limit the perfect QSAR model to a maximum R^2 of 0.77 to 0.88; a larger experimental error will limit the maximum R^2 even further. Thus, QSAR models that yield impressive training set R^2 values should be viewed with suspicion."
On problems with the common "leave-one-out" method for preventing overfitting:
"Once a QSAR model has been developed, it is then evaluated for its ability to predict the activity of new molecules. Although the best approach for such evaluation involves actually testing new molecules, investigators appear to prefer to statistically determine the likelihood of 'predictivity'. Q2, or LOO (leave-one-out), is often assessed in such testing. For a dataset containing N molecules, this procedure involves creating N models using N–1 compounds (omitting a different compound each time). The QSAR model is built from each N–1 set of data and is then used to predict the activity of the omitted compound. The correlation coefficient obtained for this LOO process is referred to as Q2. Although this approach may appear to be a clever and effective method of 'testing' a QSAR mode without actually testing new compounds, given that each omitted compound serves the role of a test compound, the apparent predictivity of the model may be misleading."
Ultimately, as the review concludes, QSAR can be useful when the model is tightly constructed and statistically validated and uses as many descriptors as are necessary but no more, especially if those descriptors are physically meaningful and can be readily interpreted by medicinal chemists. In that sense, my favorite "QSAR" model of all time was created in 1899.