Derek has a good post on another recent blast of optimistic AI prognostications which posits that AI will ensure that the end of disease will more or less be here in another decade. His basic critique has to do with the large number of unknowns that exist in biology - the fundamental barrier in discovering breakthrough drugs is simple ignorance in the face of biological complexity - that neither AI nor any other technology is going to magically address.
It’s an evergreen point, and one that made me contemplate how so much of what we do in drug discovery might be usefully parsed through the lens of the known known/known unknown/unknown unknown framework. I’m breaking down the domains of knowledge that exist in each of these arenas for a typical drug discovery project; I say “typical” because you will always find cases which may not fall into a certain category.
Known knowns:
In most drug discovery projects, there will be some known knowns. These will include things like a known protein expression and purification or crystallization system, a standard set of assays like SPR or AlphaLISA or ADME assays that will be used to look at binding affinity and function, the general scope of synthetic feasibility including a standard reaction toolkit that will be used to synthesize leads and candidates, and a standard set of computational techniques like docking, similarity searching or molecular dynamics.
If you are lucky, target validation may be a known known in terms of both biochemical and clinical or genetic validation. If you are really lucky you may even have an animal model in which hitting the target is robust and well-validated. There are few pieces of data as valuable to a drug hunter as the complete validation of a target, from cells all the way through to patients. There are other things like the IP space of target leads that may also belong to known knowns.
But just because these techniques and datasets are known knowns does not mean they will always work. That leads us to the second and more important category - the known unknowns. These are variables which you don’t know beforehand but are at least aware of because you have the experience and knowledge to know that they can matter.
Known unknowns:
Great, so you have a standard set of binding and functional assays based on a known protein expression system. Do they have anything to do with the in vivo behavior of the target in an animal or human? How do you know that those assays will translate to effective target engagement and off-target non-engagement?
Often this is where the rubber meets the road and drug discovery scientists start praying; pretty much every drug discovery scientist carries, like battle scars, the memory of failed assays that performed beautifully in isolation but had nothing to do with the real world. Having the target validated in downstream models might give them a way to test their hypothesis with varying degrees of confidence, but it’s rarely the case that you can interpolate between clinical target validation and your in-house assays with ramrod straight confidence.
Even within the narrow domains of specific assays there are many questions that could be formulated in terms of known unknowns: does your protein expression system encapsulate the full spectrum of biologically relevant entities - ions, cofactors, chaperones, other binding partners? Is the protein full length or just the catalytic domain? If catalytic, exactly what parts are you missing and how do they matter? Are you sure that you are adequately washing the chip every time you run a new SPR experiment? Are compounds and proteins sticking to your wells? Assay scientists could quote countless other questions, but all these should amount to a checklist of known unknowns that are useful for gaining confidence in your testing system.
Synthesis is now a highly evolved art and science, and the availability of billions of on-demand and in-stock reagents and purchasable compounds from vendors like Enamine has revolutionized the practice of drug discovery. But the dirty secret is still in the details, and synthetic chemists often make compounds that they can, not all the compounds that they want. If you find a hit, how easy or hard would it be to quickly scale it in time units measured not in weeks or months but in startup man and woman-hours? And how easy or hard would it be to make the nearest or the not-so-near neighbors? Just because making that thiazole was easy does not mean that making the related oxazole or thiadiazole would be equally easy. What about stereochemistry: we got an active racemate, but what about the stereochemically pure compound?
The list of known unknowns is available for computational techniques as well. In any standard computational protocol there’s a known list of potentially unknown variables that can impact the conformation, binding energy and ranking of a set of molecules: this includes strain, solvation energy, tautomers and ionization states, displaceable or not-so-displaceable water molecules, flexible protein loops and missing electron density in the structure in case of structure-based design, to name a few. A glut of descriptors is available for ligand-based techniques, but you never know if a simple set composed of easily interpretable ones like clogP or PSA correlates well if you don’t try it. A good modeler knows that the practice of computational chemistry is as much about the art of eliminating the irrelevant known unknowns as it is of finding the relevant ones.
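The "you never know if you don’t try it" point about simple descriptors can be sketched in a few lines. This is a minimal illustration, not a recommended workflow: the helper names (`pearson_r`, `rank_descriptors`) are my own, and the descriptor and activity values are invented toy numbers, not data from any real compound series.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def rank_descriptors(descriptors, activity):
    """Return (name, r) pairs sorted by |r|, strongest correlation first."""
    scored = [(name, pearson_r(values, activity))
              for name, values in descriptors.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)


# Toy, invented numbers for illustration only; not real compounds or data.
toy_descriptors = {
    "clogP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "PSA": [90.0, 60.0, 120.0, 75.0, 100.0],
}
toy_pic50 = [5.0, 5.5, 6.0, 6.5, 7.0]

for name, r in rank_descriptors(toy_descriptors, toy_pic50):
    print(f"{name}: r = {r:+.2f}")
```

A correlation on five points proves nothing, of course; the only point is that checking the simple, interpretable descriptors first is cheap enough that there is little excuse not to do it.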
The known unknowns get really thorny when you are skirting the outer boundaries of preclinical drug discovery and wading into in vivo pharmacology. Often there are strikingly unexpected, non-linear relationships between parameters like plasma protein binding, free drug fraction, volume of distribution and clearance. These relationships are often hard or impossible to predict through early modeling or measurement. This part is also often when the simple ADME assays may not correlate with advanced ADME measurements in animal models (for instance a lack of a correlation between ‘acceptable’ hERG inhibition and what’s found through expensive telemetry in dogs). Seasoned drug hunters are aware of these known unknown traps, even if all they can do is await them with about the same enthusiasm and inevitability that Londoners awaited the Luftwaffe during the Blitz.
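To make the non-linearity concrete, here is a small sketch using two textbook relationships that I am bringing in for illustration: the well-stirred liver model for hepatic clearance and the one-compartment half-life formula. The function names, the hepatic blood flow of 90 L/h and the intrinsic clearance values are assumptions chosen for the example, not measurements.

```python
from math import log


def hepatic_clearance_well_stirred(q_h, fu, cl_int):
    """Well-stirred liver model: CL_h = Q_h * fu * CL_int / (Q_h + fu * CL_int)."""
    if q_h <= 0 or cl_int < 0 or not 0 <= fu <= 1:
        raise ValueError("need q_h > 0, cl_int >= 0 and 0 <= fu <= 1")
    return q_h * fu * cl_int / (q_h + fu * cl_int)


def half_life(vd, cl):
    """One-compartment elimination half-life: t1/2 = ln(2) * Vd / CL."""
    if vd <= 0 or cl <= 0:
        raise ValueError("need vd > 0 and cl > 0")
    return log(2) * vd / cl


Q_H = 90.0  # L/h; assumed, illustrative hepatic blood flow

# Double the free fraction (1% -> 2%) for a low- and a high-extraction compound.
for label, cl_int in (("low CL_int", 10.0), ("high CL_int", 10000.0)):
    cl_before = hepatic_clearance_well_stirred(Q_H, 0.01, cl_int)
    cl_after = hepatic_clearance_well_stirred(Q_H, 0.02, cl_int)
    print(f"{label}: clearance changes by a factor of {cl_after / cl_before:.2f}")
```

In this simplified model, doubling the free fraction nearly doubles clearance for the low-extraction compound but changes it much less for the high-extraction one, because clearance saturates toward blood flow. Real compounds add transporters, tissue binding and species differences on top of this, which is exactly where the known unknowns live.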
Unknown unknowns:
Of course, even after making your way through this thicket of complications, you are always going to end up with things that you are simply unaware of. Since these unknown unknowns are, well, unknown by definition, it’s not really possible to list them. But based on our collective experience it’s probably worth at least hinting at what could possibly surface in this category, especially since in some sense you can cast unknown unknowns as rare known unknowns. Unknown unknowns can include things like:
Rare genetic variants impacting drug response.
Rare but serious side-effects leading to major health crises like heart attacks or deaths, again stemming from rare genetic polymorphisms or compensatory biochemical mechanisms.
Runaway reactions causing problems in process scale-up: process chemists sometimes have to drastically modify med chem protocols because of weird, unexpected and dangerous behavior of compounds at scale or at high temperatures or pressures. Minor impurities at small scale may turn into major headaches on a large scale.
Polymorphs that show up only after a drug is marketed (ok, technically this is very much a known unknown today, but at the time, the fact that a new insoluble polymorph of ritonavir showed up only after the drug was marketed put this observation in the unknown unknown category).
Novel off-target mechanisms that exist only in human patients and not in any animal models.
Novel drug-drug interactions that haven’t been documented before, especially for novel modalities like ADCs or PROTACs.
Unknown immune cross-reactivity with endogenous targets arising from lack of selectivity that can manifest itself because of its incremental, additive nature, months after the therapy is administered.
There are of course many more possible unknown unknowns, and it’s safe to say that all of them arise because we simply cannot grasp the totality of the complexity of biological systems.
What about the unknown knowns?
“The Unknown Known” refers to the title of a film by the brilliant documentary maker Errol Morris that starred former Secretary of Defense Donald Rumsfeld (who, in an otherwise disastrous tenure as secretary, did gift the world the valuable known-unknown classification). An “unknown known” according to Morris is a fact that is true or known but one whose existence or details are denied or obscured because of vested self-interests or some other form of self-denial.
This is where things get interesting in drug discovery. Drug discovery scientists are human beings, after all, and many unknown knowns arise from cognitive biases that preclude people from admitting that they are going down the wrong path. For the moment, let us ignore biases that result simply from greed or naked financial interests like wanting to shore up the stock price. Often those who exhibit these biases are entirely well-meaning, honest and brilliant scientists, so there are clearly reasons beyond disingenuous self-interest why unknown knowns might show up. Here are three examples:
A particular hypothesis might be held on to in spite of repeated data that proves otherwise (lots of biases could be applied to describe such a situation, including the sunk-cost fallacy and confirmation bias). In the last few decades, the beta-amyloid hypothesis of Alzheimer’s disease might be regarded as this kind of hypothesis. Repeated clinical trials that cleared up amyloid failed to show clinical efficacy, but billions of dollars and resources had been invested in the hypothesis, so the drugs and trials kept on coming. Reasons were invented: perhaps we were targeting insoluble instead of soluble amyloid, or perhaps we were simply intervening too late. All of these reasons could potentially be true, but at some point it’s worth asking whether we are chasing an unknown known.
At some point small molecules were supposed to be dead, or at least on their deathbeds. You simply could not get the kind of efficacy and total target engagement that you got from antibody drugs, even if these needed to be injected. This belief persisted in spite of a steady drumbeat of new small molecules for both traditional targets like kinases and GPCRs and novel ones like transcription factors and PPIs. Clearly in the last decade or so we have seen a stunning emergence of new small(ish) molecule modalities like PROTACs, glues and ADCs. Biologics will continue to prosper, but the death of small molecules was clearly exaggerated.
Druglike metrics have been with us for almost thirty years now, starting with Lipinski’s rules that were published in 1997. Since metrics like “flatness” and ligand efficiency could be easily understood and calculated, they were widely adopted by drug discovery scientists to create and filter their screening decks and make go/no-go decisions in projects. But these metrics not only had limited validity but were often based on flawed statistical or chemical foundations. At best they should have been applied carefully, after their domain of applicability had been evaluated. But their use was overextended and rampant, and at least a few promising molecules that did not fit the rules were weeded out, setting back projects by years.
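Part of the appeal of these metrics is how little code they take, which is also part of the danger. The sketch below uses the commonly quoted rule-of-five thresholds and a common approximation of ligand efficiency (binding free energy estimated from pIC50, divided by heavy atom count). The class and function names are my own, and the example compound is hypothetical, with invented property values.

```python
from dataclasses import dataclass
from math import log

GAS_CONSTANT = 1.987e-3  # kcal / (mol * K)


@dataclass(frozen=True)
class Properties:
    mol_weight: float
    clogp: float
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(p):
    """List which of the commonly quoted rule-of-five thresholds are exceeded."""
    violations = []
    if p.mol_weight > 500:
        violations.append("MW > 500")
    if p.clogp > 5:
        violations.append("clogP > 5")
    if p.h_bond_donors > 5:
        violations.append("HBD > 5")
    if p.h_bond_acceptors > 10:
        violations.append("HBA > 10")
    return violations


def ligand_efficiency(pic50, heavy_atoms, temperature_k=298.15):
    """Approximate LE in kcal/mol per heavy atom, treating IC50 as a proxy for Kd."""
    if heavy_atoms <= 0:
        raise ValueError("heavy_atoms must be positive")
    return log(10) * GAS_CONSTANT * temperature_k * pic50 / heavy_atoms


# Hypothetical compound with invented properties, for illustration only.
candidate = Properties(mol_weight=650.0, clogp=5.6, h_bond_donors=2, h_bond_acceptors=9)
print(rule_of_five_violations(candidate))
print(f"LE = {ligand_efficiency(pic50=8.0, heavy_atoms=30):.2f} kcal/mol per heavy atom")
```

Note that the function returns the list of violations rather than a pass/fail verdict. That is a deliberate choice in this sketch: a hard cutoff in a screening filter is exactly how a promising molecule gets silently discarded, whereas a list of flags at least forces someone to look.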
One can go on, and perhaps every one of us can list some bias or another that led to compounds being selected or dropped. Succumbing to unknown knowns is very human, but that is precisely why it behooves us to understand and possibly try to mitigate them.
The world of literature often says better what the world of science struggles to articulate. As Mark Twain memorably said, “It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.” That’s a useful lesson for our community. But perhaps the even more useful lesson is the fact that there is no definitive evidence that Twain ever said that.