AI in Drug Discovery: How Machine Learning Turns Data Into Drugs

Drug discovery is a notoriously difficult process. It typically takes over a decade, costs billions of dollars, and has a failure rate of over 90%. The root of this challenge lies in the sheer complexity of biology and the vast chemical space that must be explored to find effective, safe drugs. Artificial Intelligence is now being deployed to tackle these fundamental challenges. At its core, AI in drug discovery is about making better predictions and decisions in the face of overwhelming complexity and uncertainty.

The drug discovery pipeline consists of six key steps:

| Drug Discovery Step | Traditional Timeframe | AI-Assisted Timeframe | Potential Time Savings |
| --- | --- | --- | --- |
| 1. Target Identification and Validation | 2-5 years | 1-3 years | 30-50% |
| 2. Hit Discovery | 0.5-2 years | 3-12 months | 30-50% |
| 3. Lead Optimization | 1-3 years | 6-18 months | 40-60% |
| 4. Preclinical Development | 1-3 years | 8-24 months | 20-30% |
| 5. Clinical Trials | 6-7 years | 5-6 years | 15-25% |
| 6. Regulatory Review and Approval | 0.5-1 year | 5-10 months | 10-20% |
| Total | 11-21 years | 7.5-14 years | 30-40% |

In the early steps of discovery, AI sifts through vast biological datasets to identify promising drug targets. AI models help predict a compound’s behavior in biological systems, allowing researchers to focus their efforts on the most promising candidates. Later, it helps design new molecules with optimal properties and even plans how to synthesize them.

In this article, we’ll dive deep into each step of drug discovery, exploring how AI is transforming the field and what it means for the future of medicine.


Target Identification and Validation

The drug discovery process begins with a very time-consuming step called target identification and validation. A “target” is usually a protein or gene in the body that plays a key role in a disease process. For example, in cancer, a target might be an enzyme that promotes tumor growth or a protein that helps cancer cells evade the immune system.

Traditionally, this step typically takes anywhere from 2 to 5 years. Initial target identification involves 6 months to 2 years of literature review, genetic studies, and other research to generate a list of potential targets. Then, those potential targets must be validated through animal models and extensive cellular studies, which take another 1.5 to 3 years.

With the help of AI, that timeline could be reduced by roughly 30% to 50%, down to 1 to 3 years in total. Most of the savings come in the early stages (initial identification and preliminary validation). Here’s how:

Network Analysis

In a typical human cell, including cancer cells, there are about 20,000 different types of proteins. Each protein could have several hundred characteristics (e.g., amino acid composition, physical traits, expression levels, etc.) that scientists care about. This is known as “high-dimensional” data, and it’s the perfect arena for AI to shine.

Let’s say 5,000 of those types of proteins are known to be involved in various cancer processes. The tricky part is that these proteins don’t exist in a vacuum. Each of these proteins might interact with dozens or even hundreds of others. This creates a complex web of over 100,000 known interactions.

To make sense of this vast network, researchers can use a method in AI called a graph neural network (GNN). The GNN represents this biological system as a mathematical graph, where each of the 5,000 cancer-related proteins is a node, and the 100,000+ interactions are edges connecting these nodes.


Initially, the GNN assigns each protein node a vector of “features” – let’s say 200 different characteristics. These could include the protein’s size, its cellular location (nucleus, cytoplasm, membrane, etc.), its expression level in various types of cancer cells, and many other properties.

The magic of the GNN lies in its ability to update these features through a process called “message passing.” Think of this as peer pressure in a social network. If your best friend is raving about a new movie on her feed, you might be tempted to watch it yourself. And if you end up loving the movie too, you might post about it as well. Eventually, everyone’s profile reflects not only their own interests, but also the influence of their close social circle.

Likewise, after several rounds (typically 5-10) of message passing, the features of each protein node come to reflect not just its inherent properties, but also its position and importance in the overall network. A protein that interacts with many others involved in cell division might develop features indicating it’s a key regulator of cancer cell growth.
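
To make this concrete, below is a minimal sketch of one message-passing round in plain PyTorch. The node counts, feature sizes, and randomly generated edge list are stand-ins for the protein network described above, not real biological data, and production systems typically use a dedicated GNN library such as PyTorch Geometric:

```python
import torch

# Toy stand-ins for the example above: 5,000 protein nodes, each with
# 200 features, connected by ~100,000 interactions (treated as directed
# src -> dst pairs here for simplicity).
num_nodes, num_feats, num_edges = 5000, 200, 100_000

x = torch.randn(num_nodes, num_feats)                # node feature vectors
edges = torch.randint(0, num_nodes, (2, num_edges))  # random (src, dst) pairs

# One learnable layer: each node averages its neighbors' features,
# concatenates them with its own, and passes the result through a linear map.
W = torch.nn.Linear(2 * num_feats, num_feats)

def message_passing_round(x, edges):
    src, dst = edges
    agg = torch.zeros_like(x)
    agg.index_add_(0, dst, x[src])  # sum incoming neighbor features per node
    deg = torch.zeros(x.size(0)).index_add_(0, dst, torch.ones(num_edges))
    agg = agg / deg.clamp(min=1).unsqueeze(1)  # mean over neighbors
    return torch.relu(W(torch.cat([x, agg], dim=1)))

# After several rounds, each node's vector reflects its network context.
for _ in range(5):
    x = message_passing_round(x, edges)
```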

The final output is a ranked list of 50-100 proteins that the AI identifies as the most promising targets for cancer research. Some of these might be well-known cancer proteins, validating the model’s approach. Others might be proteins that haven’t been strongly associated with cancer before, revealing brand new avenues for research.

What makes AI so powerful here is its ability to consider all proteins and interactions at once. This parallel processing isn’t just a matter of raw computing power; it’s baked into the architecture of GNNs. The entire process – analyzing 5,000 proteins, each with 200 features, connected by over 100,000 interactions – can be completed by the AI in a matter of hours or days. In contrast, a human researcher might spend months or years trying to understand even a small subset of these relationships manually.

In Silico Validation

In silico validation (computer simulations) allows researchers to narrow down potential targets before investing in lab work. Traditional molecular dynamics simulations have been used in drug discovery for decades. These simulations use classical physics equations to model how atoms and molecules interact over time. While useful, they are extremely computationally expensive.

Instead of relying solely on physics equations, AI can learn to predict molecular interactions from data. For example, let’s say we have a dataset of 10,000 known protein-drug interactions, each with 100 different measured properties (binding energy, conformational changes, etc.). We can then use this data to train a powerful neural network to predict these properties for new, unseen protein-drug pairs.

Figure: in silico gene prioritization by integrating multiple data sources, with different modeling methods tested against each other.

This AI model runs much faster than physics-based simulations because it isn’t modeling the atoms directly. Instead, it uses pattern recognition and past empirical evidence to jump straight to the end result we actually care about. This “shortcut” allows AI to make educated guesses about molecular interactions in a fraction of the time of a full-blown simulation.

Of course, the AI approach does sacrifice some theoretical accuracy. But to alleviate this, many AI methods (like Bayesian neural networks) can estimate their own uncertainty. For each prediction, the AI doesn’t just give a single number. It can give a range such as, “predicted binding energy is between -8.5 and -9.5 kcal/mol, with 95% confidence.”
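
One lightweight way to produce such ranges, sketched below as a stand-in for a full Bayesian neural network, is Monte Carlo dropout: keep dropout switched on at prediction time and treat the spread of repeated predictions as an uncertainty estimate. The 100 input features mirror the measured interaction properties from the example above; everything else is illustrative:

```python
import torch

# Small regressor over ~100 measured interaction properties, with dropout.
model = torch.nn.Sequential(
    torch.nn.Linear(100, 64), torch.nn.ReLU(), torch.nn.Dropout(0.2),
    torch.nn.Linear(64, 1),   # predicted binding energy (kcal/mol)
)

def predict_with_uncertainty(model, features, n_samples=100):
    model.train()  # keep dropout active at inference time (MC dropout)
    with torch.no_grad():
        preds = torch.stack([model(features) for _ in range(n_samples)])
    mean, std = preds.mean(0), preds.std(0)
    # Rough ~95% interval under a normality assumption on the samples.
    return mean - 1.96 * std, mean + 1.96 * std

low, high = predict_with_uncertainty(model, torch.randn(1, 100))
```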

Once an AI model is trained on one set of protein-drug interactions, it can be fine-tuned for related tasks via a method called transfer learning. An AI trained on kinase inhibitors can be quickly adapted to predict interactions with a newly discovered kinase, even with limited data on the new protein. This allows researchers to carry knowledge across different drug discovery projects, something much harder to do with traditional simulation methods.
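
A minimal sketch of that fine-tuning step in PyTorch might look like the following: freeze the layers that learned general interaction patterns and retrain only the final layer on the small dataset for the new kinase. The architecture and tensors are illustrative placeholders:

```python
import torch

# A stand-in for a model already trained on a large kinase-inhibitor dataset.
model = torch.nn.Sequential(
    torch.nn.Linear(100, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

# Freeze everything except the final layer.
for param in model[:-1].parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(model[-1].parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

# Hypothetical small dataset: 50 compounds measured against the new kinase.
x_new, y_new = torch.randn(50, 100), torch.randn(50, 1)
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x_new), y_new)
    loss.backward()
    optimizer.step()
```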

Despite their advantages in speed and scalability, AI methods don’t replace traditional simulations entirely. Instead, they complement them, serving as an initial screen before more detailed, physics-based simulations are applied to the most promising candidates.

Literature Mining

The sheer volume of scientific literature published each year is staggering. In the biomedical field alone, over 1 million new papers are published annually. Crucial findings that could lead to breakthrough treatments are constantly being published, but it’s humanly impossible to read and comprehend it all.

This is where AI-powered literature mining comes in. The AI is trained on millions of scientific papers, learning the specialized language and concepts of biology and medicine. This process uses a technique called natural language processing (NLP), which enables computers to understand and interpret human language.

Once trained, the AI can process new papers at a rate far exceeding human capability. While a human researcher might read 200-300 papers a year, an AI system can process tens of thousands of papers in a day. Of course, the NLP model doesn’t just recognize words; it understands context and can extract key information such as:

  • Names of proteins and genes
  • Descriptions of biological processes
  • Associations between genes and diseases
  • Experimental results and their implications
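
As a rough sketch of what this extraction can look like in code, here is the Hugging Face transformers pipeline applied to a made-up abstract. The model name is a placeholder; a team would substitute whichever biomedical NER model they actually use:

```python
from transformers import pipeline

# "a-biomedical-ner-model" is a placeholder, not a real model name.
ner = pipeline("token-classification",
               model="a-biomedical-ner-model",
               aggregation_strategy="simple")  # merge word pieces into entities

abstract = ("Inhibition of KRAS G12C reduces tumor growth and sensitizes "
            "cells to EGFR blockade in lung adenocarcinoma.")

for entity in ner(abstract):
    # Each hit carries the matched text, its predicted type, and a confidence.
    print(entity["word"], entity["entity_group"], round(entity["score"], 2))
```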

Beyond just extracting facts, the AI can identify connections across papers and even across different fields of study. For instance, it might notice that a protein frequently mentioned in cancer research papers is also appearing in recent studies on neurological disorders. Or it could identify a drug that’s been used for one disease but has properties that suggest it might be effective against another, entirely different condition.

Based on these connections, the AI can generate new hypotheses or research directions. These could include new potential drug targets, unexpected side effects of existing drugs or promising combinations of drugs for complex diseases.

As new papers are published, the AI continuously updates its knowledge base. This means it can spot emerging trends or shifts in scientific understanding almost in real-time. Once again, AI doesn’t replace human expertise. Instead, it acts almost as a powerful “intelligence network” that scans for useful information 24/7. Then, human researchers can focus their attention where it’s most needed.

AI-Assisted Target Identification and Validation

In practice, a typical AI-assisted process is a hybrid workflow. It might involve:

  1. Using NLP to analyze research papers and create a knowledge graph.
  2. Applying graph neural networks to this knowledge graph to identify potential targets.
  3. Using predictive models to score these targets for how likely they are to be good drug targets.
  4. Subjecting the highest-scoring candidates to in silico validation.
  5. Finally, moving the most promising targets to experimental testing in the lab.

Hit Discovery

After identifying and validating a target, the next step in drug discovery is hit discovery. A “hit” is a compound that interacts with the target in a desired way, such as inhibiting or activating it. Traditionally, this process involves high-throughput screening (HTS) of large compound libraries.

A compound library is a collection of hundreds or thousands of small plastic plates, each about the size of your palm. Each plate has 96, 384, or even 1536 tiny wells. And each well contains a different chemical compound dissolved in a solvent.

In HTS, the target protein is first prepared in large quantities and added to similar plates. Precise robotic arms then transfer tiny droplets of each compound from the library plates to the plates containing the target protein. Once the compounds and targets are mixed, the system measures if there’s any interaction or “hits.”

The hit discovery process typically takes 6 months to 2 years and can cost several million dollars. Libraries of compounds can cost hundreds of dollars per milligram for some rare molecules, and large amounts of purified target protein are needed. AI can potentially reduce hit discovery timelines by 30-50% while also improving hit quality. Here’s how:

Virtual Screening

Virtual screening is similar in concept to in silico validation, which we discussed earlier. But while in silico validation aims to narrow down potential drug targets, virtual screening comes into play once a target has been validated. Think of virtual screening as the bridge between target validation and experimental hit discovery.

AI’s role in virtual screening is similar. Traditional virtual screening relies on physics-based calculations that are computationally intensive, especially when dealing with millions of potential compounds. Here too, AI serves as a “shortcut” to the desired answer, learning patterns from large datasets of known drug-target interactions.

These models are also known as Quantitative Structure-Activity Relationship (QSAR) models. Basically, each compound in the library is converted into a set of numerical descriptors that capture various aspects of its structure and properties. The model then assigns a score to each compound, allowing researchers to rank the entire library and focus on the most promising candidates.

3D visualization of a protein-ligand complex used to inform QSAR models.

Unlike physics-based methods that simulate molecular interactions directly, QSAR models learn statistical relationships between molecular features and biological activity. This allows them to make rapid predictions for millions of compounds, significantly speeding up the virtual screening process. Experimental results are then fed back into the AI models, helping them learn and improve over time.
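
Below is a minimal QSAR screening sketch using RDKit fingerprints and a scikit-learn random forest. The training compounds and labels are toy placeholders; a real project would train on thousands of assay results against the validated target:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def featurize(smiles):
    # Morgan (circular) fingerprint: one common choice of QSAR descriptor.
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048))

# Toy stand-ins for known actives (1) and inactives (0) against the target.
training_smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
training_labels = [0, 1, 1]

model = RandomForestClassifier(n_estimators=500)
model.fit([featurize(s) for s in training_smiles], training_labels)

# Score the library by predicted probability of activity, then rank it.
library = ["CCN", "c1ccc2[nH]ccc2c1", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
scores = model.predict_proba([featurize(s) for s in library])[:, 1]
ranked = sorted(zip(library, scores), key=lambda pair: -pair[1])
```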

De Novo Drug Design

De novo drug design involves creating entirely new drug molecules from scratch, rather than searching existing databases. AI’s role in de novo drug design is to act as a highly sophisticated molecular architect. The AI, often using techniques like generative adversarial networks (GANs) or variational autoencoders (VAEs), is trained on databases of known drugs and their properties.

Once trained, the AI can generate new molecular structures that are optimized for specific properties. For instance, you might ask the AI to design a molecule that binds to a certain target, is small enough to be taken orally, and doesn’t interact with liver enzymes. What makes this approach powerful is that the AI can explore chemical spaces that humans might not even think to investigate.

The AI doesn’t truly “understand” chemistry in the way a human does. Instead, it’s learning statistical patterns from vast amounts of data. It’s as if the AI has internalized the rules of a complex game, and can now play that game (design molecules) extremely well.
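
To give a feel for the machinery, here is a stripped-down VAE skeleton in PyTorch that operates on fixed-length, one-hot-encoded SMILES strings. All sizes are toy values, and real systems use recurrent or transformer decoders plus validity checks on the output:

```python
import torch
import torch.nn as nn

VOCAB, MAXLEN, LATENT = 32, 64, 16  # toy vocabulary, string length, latent size

class MolVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(),
                                 nn.Linear(VOCAB * MAXLEN, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, LATENT)
        self.to_logvar = nn.Linear(256, LATENT)
        self.dec = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                                 nn.Linear(256, VOCAB * MAXLEN))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z).view(-1, MAXLEN, VOCAB), mu, logvar

model = MolVAE()
logits, mu, logvar = model(torch.randn(4, MAXLEN, VOCAB))  # dummy batch

# After training, decoding random latent vectors yields new candidate
# molecules (which still need to be checked for chemical validity).
samples = model.dec(torch.randn(10, LATENT)).view(-1, MAXLEN, VOCAB)
```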

Predictive ADMET

ADMET stands for Absorption, Distribution, Metabolism, Excretion, and Toxicity. These critical properties determine whether a compound can become a successful drug. Traditionally, they were only tested in later stages of drug development, leading to many late-stage failures due to poor ADMET profiles.

Predictive ADMET uses AI to forecast these properties early in the drug discovery process. Machine learning models are trained on large datasets of known drugs and their measured ADMET properties. These AI models can then learn to associate specific molecular features with ADMET properties.

For example, the AI might learn that molecules with lipophilic (fat-loving) groups are more likely to be absorbed in the gut, or that a nitro group is associated with higher risk of liver toxicity. Equipped with these learned associations, the AI can rapidly evaluate new drug candidates, flagging those with potentially problematic features early.
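
The two associations in that example are easy to check directly with RDKit, as the sketch below shows. A trained ADMET model would weigh hundreds of such signals at once, but these two hand-written flags illustrate the idea:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# SMARTS pattern for a nitro group, a classic structural alert.
NITRO = Chem.MolFromSmarts("[N+](=O)[O-]")

def quick_admet_flags(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return {
        "logP": Descriptors.MolLogP(mol),             # lipophilicity (absorption proxy)
        "nitro_alert": mol.HasSubstructMatch(NITRO),  # liver-toxicity risk feature
    }

print(quick_admet_flags("O=[N+]([O-])c1ccccc1"))  # nitrobenzene -> alert fires
```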

AI-Assisted Hit Discovery

In practice, a typical AI-assisted hit discovery process might involve:

  1. Using AI to virtually screen large compound libraries and prioritize promising candidates.
  2. Applying de novo design algorithms to generate novel structures.
  3. Using predictive models to estimate ADMET properties of potential hits.
  4. Conducting physical HTS on a smaller, more focused set of compounds.
  5. Employing machine learning to analyze screening results and identify the most promising hits.

Lead Optimization

Once promising hit compounds have been identified, the next crucial phase is lead optimization. The goal of this step is to transform these hit compounds into “lead” compounds that are more likely to become successful drugs. This involves systematically modifying the chemical structure of the hits to enhance their potency, selectivity, and overall drug-like properties.

Lead optimization is a tricky problem, often described as multi-parameter optimization. Improving one property (e.g., potency) frequently comes at the cost of negatively impacting others (e.g., solubility or toxicity). Moreover, the chemical space to be explored is vast, with an enormous number of possible modifications for each molecule.

Traditionally, lead optimization typically takes 1 to 3 years and can cost tens of millions of dollars. Medicinal chemists synthesize hundreds or even thousands of analogs of the initial hits, each of which must be rigorously tested for its effect on the target, its drug-like properties, and its behavior in biological systems.

This is where AI has the potential to revolutionize the process. It can allow medicinal chemists to explore a much wider range of potential improvements to their hit compounds. With AI, researchers can potentially reduce timelines by 40-60% while also improving the quality of lead compounds. Here’s how:

Predictive Modeling (QSAR)

In lead optimization, QSAR models become more specialized and focused compared to their use in virtual screening. While virtual screening QSAR casts a wide net, lead optimization QSAR aims to fine-tune the properties of promising compounds that have already shown some activity against the target.

These models work on the same principle as in virtual screening – converting molecular structures into numerical descriptors. However, in lead optimization, these descriptors are often more detailed and the models are trained on data from closely related compounds, often synthesized as part of the project.

For example, a medicinal chemist might wonder: “What if we replace this methyl group with an ethyl group?” The QSAR model can predict how this change might affect the compound’s potency, selectivity, solubility, and other critical properties. This allows researchers to explore structural modifications virtually before committing resources to synthesis and testing.

Unlike the more general models used in virtual screening, lead optimization QSAR often deals with smaller structural changes and aims to predict their effects more precisely. These models consider a broader range of properties simultaneously, balancing factors like target affinity, selectivity, and ADMET properties. Advanced QSAR models can even suggest which parts of a molecule to modify to achieve desired property changes.

The process is highly iterative. As new compounds are synthesized and tested based on QSAR predictions, their data is fed back into the model, continually improving its accuracy. This creates a virtuous cycle where the model guides synthesis, and experimental results refine the model, accelerating the optimization process.
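
The loop itself fits in a few lines. In the sketch below, the descriptors are random vectors and `synthesize_and_assay` is a random stand-in for real lab work; the point is the shape of the design-make-test-analyze cycle, where each measured result retrains the model:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def synthesize_and_assay(features):
    # Hypothetical stand-in for actual synthesis and potency measurement.
    return float(rng.normal())

# Toy project data: descriptor vectors and potencies for compounds made so far,
# plus a pool of virtual analogs proposed by the chemists.
X = [rng.normal(size=16) for _ in range(20)]
y = [float(rng.normal()) for _ in range(20)]
proposed = [rng.normal(size=16) for _ in range(100)]

model = RandomForestRegressor(n_estimators=300)
for _ in range(5):  # each round = one design-make-test-analyze cycle
    model.fit(X, y)
    preds = model.predict(proposed)
    best = int(np.argmax(preds))           # most promising analog to make next
    X.append(proposed.pop(best))
    y.append(synthesize_and_assay(X[-1]))  # measured result feeds the model
```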

Generative Chemistry

While QSAR models predict properties of existing molecules, generative chemistry takes a more creative approach. It uses AI to design entirely new molecular structures that optimize desired properties, potentially leading to more innovative solutions in Lead Optimization.

These AI models, often based on techniques like generative adversarial networks (GANs) or variational autoencoders (VAEs), are trained on large databases of known drugs and their properties. They learn the underlying patterns and rules that make a molecule drug-like and effective.

Generative AI can be applied to generate molecules with desired properties.

For instance, a researcher might input: “Design a molecule similar to our lead compound, but with improved solubility and reduced liver toxicity.” The AI then generates a range of new molecular structures that aim to meet these criteria.

Unlike traditional methods that make incremental changes to a lead compound, generative chemistry can propose more radical structural changes. This allows it to explore areas of chemical space that human chemists might not intuitively consider, possibly leading to unexpected breakthroughs.

That said, the AI doesn’t truly “understand” chemistry. Instead, it’s learning and applying complex statistical patterns from its training data. The generated molecules still need to be evaluated by expert chemists and validated through synthesis and testing.

Once again, the process is highly iterative: a virtuous cycle in which AI-generated molecules guide physical testing, and the results feed back into the AI model to make it better.

Retrosynthesis Planning

While QSAR and generative chemistry focus on designing and optimizing compounds, retrosynthesis planning tackles a different challenge: how to actually make these molecules in the lab. Traditionally, planning the synthesis of a complex molecule was a time-consuming task requiring extensive knowledge and experience.

AI is revolutionizing this process through its ability to learn from vast databases of known chemical reactions. Neural networks are trained on these databases to recognize patterns in how molecules can be built up from simpler components. When presented with a target molecule, these AI systems can suggest potential synthetic routes, considering factors like reagent availability, reaction conditions, and potential side products.

They can rapidly evaluate many possible routes, considering trade-offs between factors like yield, cost, and the number of steps. This AI-driven approach allows medicinal chemists to quickly assess the synthetic feasibility of proposed compounds and even suggest alternative structures that might be easier to synthesize. As with other AI applications in drug discovery, the process is iterative – as new reactions are performed and added to the database, the AI’s predictions continue to improve.
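
At its core, this is a recursive search: break the target into precursors, break those down in turn, and stop when everything is purchasable. The toy sketch below uses a tiny hand-written template table where a real system would apply reaction templates learned from millions of known reactions:

```python
# Toy disconnection rules: product -> required precursors.
TEMPLATES = {
    "amide": ["carboxylic_acid", "amine"],  # amide coupling
    "carboxylic_acid": ["alcohol"],         # oxidation
    "ether": ["alcohol", "alkyl_halide"],   # Williamson ether synthesis
}
PURCHASABLE = {"alcohol", "amine", "alkyl_halide"}

def plan_route(target, depth=0):
    # Recursively break the target into precursors until everything is buyable.
    if target in PURCHASABLE:
        print("  " * depth + f"buy: {target}")
        return True
    if target not in TEMPLATES:
        return False  # dead end: no known disconnection
    print("  " * depth + f"make: {target}")
    return all(plan_route(p, depth + 1) for p in TEMPLATES[target])

plan_route("amide")
```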

AI-Assisted Lead Optimization

In practice, a typical AI-assisted lead optimization process might involve:

  1. Using AI to generate and virtually screen thousands of potential analogs of the hit compounds.
  2. Applying predictive models to estimate the properties (potency, selectivity, ADMET) of these virtual analogs.
  3. Using generative models to suggest novel structures that might have improved properties.
  4. Employing AI-guided molecular dynamics simulations to understand and optimize target binding.
  5. Utilizing AI for retrosynthesis planning to efficiently synthesize the most promising candidates.
  6. Iterating this process based on experimental feedback.

Preclinical Development

Preclinical development is a critical phase where promising lead compounds are extensively tested outside the human body. This stage involves both in vitro studies (in cell cultures) and in vivo studies (in animal models). The primary goals are to assess the compound’s efficacy, toxicity, pharmacokinetics (how the body processes the drug), and pharmacodynamics (how the drug affects the body).

During this phase, researchers aim to confirm the drug’s mechanism of action, determine safe dosage ranges, identify potential side effects and toxicities, and understand how the drug is absorbed, distributed, metabolized, and excreted. This stage typically takes 1-3 years and is crucial for spotting potential safety issues before human trials begin.

While AI use is less extensive here than in earlier stages, it’s gaining traction. AI models are being developed to reduce the number of drug candidates that fail in later, more expensive clinical stages. Specifically, these models aim to:

  • Predict toxicity and side effects more accurately, potentially reducing animal testing
  • Analyze complex biological data from animal studies, including genomics and proteomics data
  • Optimize dosing regimens based on pharmacokinetic/pharmacodynamic (PK/PD) modeling (a minimal PK simulation follows this list)
  • Identify potential biomarkers for efficacy or toxicity
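
As a flavor of the PK/PD modeling mentioned in the list, here is the standard one-compartment oral-dosing equation simulated in a few lines of Python. All parameter values are illustrative, not taken from any real drug:

```python
import numpy as np

# One-compartment oral PK model (the textbook Bateman equation).
F, dose, V = 0.8, 100.0, 40.0  # bioavailability, dose (mg), volume of distribution (L)
ka, ke = 1.0, 0.2              # absorption / elimination rate constants (1/h)

t = np.linspace(0, 24, 97)     # hours after dosing
conc = (F * dose * ka) / (V * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))

# A simple dosing question a PK/PD model answers: when and how high is the peak?
print(f"peak {conc.max():.2f} mg/L at t = {t[conc.argmax()]:.1f} h")
```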

Clinical Trials

Clinical trials involve testing the drug in humans and typically occur in three phases:

  • Phase I: Tests the drug’s safety in a small group (20-100) of healthy volunteers. This phase assesses the drug’s side effects and determines safe dosage ranges.
  • Phase II: Evaluates the drug’s efficacy and side effects in a larger group (100-300) of patients with the target disease. This phase also aims to determine optimal dosing.
  • Phase III: Involves an even larger group of patients (300-3000 or more) and compares the new drug to existing treatments or placebos. This phase provides more comprehensive data on efficacy and safety.

Throughout these trials, researchers collect extensive data on the drug’s efficacy, safety, and optimal dosing. The entire clinical trial process can take 6-7 years or more.

AI is increasingly being used in this step to make clinical trials faster, more cost-effective, and more likely to succeed. Specifically, AI models are being trained to:

  • Help identify patients most likely to respond to treatment to make trials more efficient (a toy sketch follows this list)
  • Analyze trial data in real-time to spot safety signals or efficacy trends earlier
  • Optimize trial designs to reduce the number of patients needed or the duration of the trial
  • Analyze complex datasets, including electronic health records and multi-omics data from trial participants
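
As a toy illustration of the first bullet above, the sketch below trains a simple classifier on made-up baseline patient features and uses it to enrich enrollment with predicted responders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# 500 past patients, 8 random stand-ins for biomarker / clinical features,
# with a synthetic "responded to treatment" label.
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=500)) > 0

model = LogisticRegression().fit(X, y)

# Enroll preferentially from the predicted high-response group.
new_patients = rng.normal(size=(100, 8))
p_response = model.predict_proba(new_patients)[:, 1]
enriched_cohort = np.argsort(-p_response)[:40]  # top 40 predicted responders
```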

Regulatory Review and Approval

The final step involves submitting a comprehensive dossier of all data from preclinical and clinical studies to regulatory agencies like the FDA. These agencies conduct a rigorous review to determine if the drug’s benefits outweigh its risks. This process typically takes 6-12 months.

Key aspects of this review include evaluation of all safety and efficacy data, the drug’s manufacturing process, and the proposed labeling information. If approved, the drug can be marketed and prescribed to patients.

While AI use is more limited in this stage due to regulatory requirements for human oversight, it is being explored for:

  • Assisting in the preparation and review of regulatory submissions
  • Predicting potential regulatory issues based on historical data from similar submissions
  • Enhancing post-marketing surveillance by analyzing real-world data to detect safety signals

While AI use is currently more prevalent in the earlier stages of drug discovery, its application is expanding into these later stages. As AI continues to advance and gain regulatory acceptance, it has the potential to streamline the entire drug development process, from initial discovery to market approval and beyond.
