Foundations & Frontiers

The Computational Biology Revolution

A deep dive into how AI is breathing new life into biological research.

Anna-Sofia Lesiv

May 23, 2024

The inner workings of biology remain one of the biggest mysteries in modern science. On the one hand, we’ve observed all the components that make up a cell, we have discovered the structure and sequence of genetic material, and we’ve even gleaned how genetic material is translated into useful proteins that perform crucial biological functions.

On the other hand, despite all these scientific milestones, we are still unable to answer precise questions about how various diseases, like cancer, arise. We have limited visibility into the effect of hormones on the growth and development of organisms. We are unable to say for certain how medicines might affect a given patient, just as we have little understanding of how organisms age and die.

For decades, biologists have undertaken a first-principles approach to understanding the processes that govern life. The hope was that by identifying the different components within a cell, and the ways in which those components interact, we could create a mechanistic model that would help us predict cell, and ultimately, organism-level behavior. Even Richard Feynman once said, “It is very easy to answer many of these fundamental biological questions; you just look at the thing!”

With time, it became clear that the problem was knowing what to look at. Each cell is a universe of interactions between proteins, genetic material, organelles, and other molecules all running separate and intertwined metabolic processes at the same time. How do you isolate what’s important?

For a long time, our observational tools simply couldn't produce this type of rich data, and our analytical tools couldn't process it effectively enough to yield meaningful insights. Until now.

The past decade has delivered remarkable improvements in artificial intelligence systems which have demonstrated an uncanny ability to identify patterns in rich, complex data. It’s exactly this capability that the field of biology has been lacking thus far, and it might be the one we need to finally get answers to biology’s biggest outstanding questions.

Complexity in Biology

Mathematics, physics, and chemistry are concerned with finding fundamental laws that govern the behavior of all physical phenomena. Over centuries, these fields sought to demonstrate that the idiosyncratic, multifarious nature of the reality we experience is in fact produced by a series of simple yet elegant laws such as Newton’s laws of motion, the forces of gravity, and electromagnetism. As Sir Isaac Newton once put it, “Truth is ever to be found in the simplicity, and not in the multiplicity and confusion of things.” Framing the world through the lens of classical mechanics convinced many that the world could be understood, even solved, by boiling everything down to the essential principles underpinning it.

It seemed that if we could make a model to explain the behavior of rotating bodies in the heavens, we could probably make a model of anything. It was natural to extend this line of thinking from inanimate objects like heavenly bodies to animate ones like biological organisms. In 1966, Francis Crick said, “The ultimate aim of the modern movement in biology is to explain all biology in terms of physics and chemistry.” 

After the discovery of the structure of DNA by Crick and Watson, and the elucidation of gene expression at a molecular level, the prospect of producing a mechanistic model of biological activity seemed increasingly plausible. 

However, whenever we zoomed in to build up our understanding of life, worlds of ever-greater complexity were all that we found. 

Take, for instance, the smallest possible unit of life — the cell. Geophysicist Robert Hazen wrote, "we [know] that the simplest living cell is intricate beyond imagining, because every cell relies on the interplay of millions of molecules engaged in hundreds of interdependent chemical reactions,” adding that “human brains seem ill suited to grasp such multidimensional complexity." 

Still, efforts like those undertaken by Dr. Gerhard Michal at the pharmaceutical giant Roche have managed to produce simplified models of the processes in a cell. For instance, the diagram below is the result of a gargantuan undertaking, summarizing the cell’s main metabolic pathways in eye-wateringly convoluted detail.

Source: Roche

Making heads or tails of diagrams like this, let alone extracting useful insights from them, is a challenge in itself, not only because of their inherent illegibility but because even this model omits critical elements.

So, if we wanted to get more complex than this, say model the cell in full detail at the atomic level, we would be talking about a model consisting of the interactions between some 100 trillion atoms — and that would just be a snapshot. 

What’s important to understand are the mechanics of the cell, how its processes work and evolve through time. For this, we would need to build a simulation capable of tracking the state of one hundred trillion atoms in millisecond increments over the course of hours to build an accurate model of a basic process like mitosis.
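A back-of-envelope calculation makes the scale concrete. The figures below are illustrative assumptions, not parameters from any real simulator: per-atom state as 3D position and velocity in 32-bit floats, and a one-hour process sampled at millisecond increments.

```python
# Rough storage estimate for an atomic-resolution simulation of one cell.
# Assumptions (illustrative): position + velocity per atom as float32,
# millisecond snapshots over a one-hour process such as mitosis.
atoms = 100e12                       # ~100 trillion atoms in a cell
bytes_per_atom = 6 * 4               # x, y, z position + velocity, float32
snapshot_bytes = atoms * bytes_per_atom   # ≈ 2.4e15 B, i.e. ~2.4 petabytes

steps = 60 * 60 * 1000               # one hour in millisecond increments
total_bytes = snapshot_bytes * steps      # ≈ 8.6e21 B, i.e. ~8.6 zettabytes

print(f"{snapshot_bytes / 1e15:.1f} PB per snapshot")
print(f"{total_bytes / 1e21:.1f} ZB for the full trajectory")
```

Even storing the trajectory, never mind computing the inter-atomic forces at each step, would dwarf today's largest data centers.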

And that’s what it takes to understand just one cell. A human has 37 trillion of them, all interacting through a variety of complex intercellular processes.

As the prospect of building a mechanistic model began to pose greater challenges, the goal of building knowledge of biological systems from the ground up seemed doomed. 

Was it even possible to reduce the complexity of biological systems without omitting crucial information? We didn’t know, and without reducing the complexity, the inner workings of biology remained a black box. 

Tackling Complexity With AI

Just as the pursuit of biological understanding from the ground up was waning, new systems of inference were being developed which would soon allow insights to be gleaned by observing systems, in all their complexity, from the top down. The first piece of evidence that pointed to the watershed potential of AI came with the release of AlphaFold in 2018. 

AlphaFold was built to tackle one of the largest, open-ended questions of its time — the protein folding problem. In 1972, Christian B. Anfinsen, a Nobel Prize-winning biochemist, predicted that in theory, it should be possible to determine the three-dimensional structure of a protein, simply from knowledge of the one-dimensional sequence of amino acids that make it up.

Of course, this was much easier said than done. A single protein composed of a chain of 100 amino acids, with 99 peptide bonds linking them, could exist in any of 3^198 possible configurations. Sampling these states sequentially to find the correct fold, even at a rate of one configuration per picosecond, would take longer than the age of the universe. And this is for a protein with only 100 amino acids – most complex proteins have hundreds to thousands of them.
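The arithmetic behind that claim can be checked in a few lines. The three-states-per-dihedral-angle count follows Levinthal's classic back-of-envelope argument (two backbone angles per bond), which is where the 3^198 figure comes from:

```python
import math

# Levinthal-style estimate for a 100-residue protein (99 peptide bonds),
# assuming 3 possible states for each of the 2 backbone dihedral angles.
n_bonds = 99
log10_configs = 2 * n_bonds * math.log10(3)       # log10(3^198) ≈ 94.5

# Sampling one configuration per picosecond (1e-12 s):
log10_seconds = log10_configs + math.log10(1e-12)  # ≈ 10^82.5 seconds

age_of_universe_s = 13.8e9 * 365.25 * 24 * 3600    # ≈ 4.4e17 seconds
log10_ratio = log10_seconds - math.log10(age_of_universe_s)

print(round(log10_configs, 1))   # ≈ 94.5 → ~10^94 configurations
print(round(log10_ratio, 1))     # ≈ 64.8 → ~10^65 universe lifetimes
```

The exhaustive search would take roughly 10^65 times the age of the universe, which is why brute-force sampling was never a viable route to the fold.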

Proteins have a hand in nearly every biochemical process, which makes understanding them critical for biology and medicine, where grasping the nature of interactions between drugs and receptor proteins is essential. The key to understanding a protein’s function lies in its structure, but for a long time, determining that structure was a gargantuan task.

For decades, learning the exact structure of proteins took years and hundreds of thousands of dollars. Scientists relied on techniques like nuclear magnetic resonance and X-ray crystallography. It was slow and expensive work. In short, protein folding was not a computable problem, and there was no real view to solving it before AlphaFold.

On the back of DeepMind’s success with AlphaGo, a model that beat world-renowned Go player Lee Sedol in 2016, DeepMind decided to see if its AI models could make inroads on one of science’s biggest open questions — protein folding. DeepMind’s AlphaFold model was trained on the known structures and sequences of around 100,000 different proteins, accessible via the Protein Data Bank.

The goal was to test the model’s performance at CASP, the Critical Assessment of Structure Prediction, a biennial experiment organized to give researchers the opportunity to compete on and evaluate their methods of predicting the structures of a handful of proteins whose structures were known but not yet revealed to the public.

AlphaFold debuted in 2018, at the thirteenth edition of CASP, and won. Two years later, at CASP14, AlphaFold 2 predicted the structure of two-thirds of the proteins in that year’s competition with over 90% accuracy. It was a watershed moment since, thereafter, the protein folding problem was considered solved. Biologist John Moult said, “This is the first time a serious scientific problem has been solved by AI.”

Source: Nature

In the following year, DeepMind released a database of over 350,000 protein structures, including all of the proteins that exist in the human body, known as the human proteome. Additionally, it published the full proteomes of yeast, fruit flies, and mice — the organisms traditionally used in biological and medical research. The AlphaFold 2 paper was cited more than 20,000 times in scientific papers, which made it one of the top 500 most-cited papers of all time. 

Then, on May 8th, 2024, DeepMind released AlphaFold 3 in conjunction with Isomorphic Labs. AlphaFold 3 did not limit itself to predicting protein structure. Instead, it had the ability to predict the structure and interactions of all of life’s molecules. This newest model could predict not only the structures of DNA, RNA, and molecules like ligands, which comprise most drugs, but, crucially, how all of these entities bind and interact with each other.

Insights extracted from AlphaFold tools have already contributed to progress on everything from enzymes that can break down single-use plastics to new vaccine candidates for diseases like malaria.

Of course, AlphaFold doesn’t reveal the actual mechanisms by which these structures assemble — the underlying mechanics are still a mystery. It merely yields the correct three-dimensional structure for a given input sequence. Even so, the breakthrough has saved vast sums that would otherwise have gone into making piecewise progress on the structures and interactions of these molecules.

Increasingly, computational tools are showing us that a fully mechanistic understanding of biology might not be necessary to get the answers we seek. Since the early days of AlphaFold, models have moved beyond just protein folding.

In early 2024, substantial inroads were made toward this goal with the release of Evo, a model developed by the Arc Institute, Stanford, and TogetherAI. Evo is the first publicly available model that can make inferences over an entire genome. The model was trained on a data set of 2.7 million prokaryotic and phage genomes and is capable of generating DNA sequences at single-nucleotide precision.

This allows it to produce some rather remarkable predictions. Since it was trained on entire genomes, it can do everything from predicting the function of proteins to identifying which genes are critical to an organism’s survival to even generating its own CRISPR complexes for use in genetic engineering. 

In a blog post announcing the release of Evo, researchers at Stanford’s Hazy Research Lab wrote, “As we were training Evo … it felt like we were observing a “GPT” moment in biology. A simple unsupervised task was getting competitive zero-shot performance by modeling across the central dogma of biology, and generalizing across DNA, RNA and protein modalities.”

Evo is just the first in what will likely be a series of models able to make inferences and predictions over a much larger biological context window. These models represent a promising new chapter in biology and medicine, in which new gene-editing tools will be designed, and some of our most difficult questions answered, not through painstaking observation but through generative AI.

The new tools taking over biology are producing a philosophical sea-change in how many are viewing the direction of the field. That view is well reflected in the words of Demis Hassabis, the CEO of DeepMind, who said, “at its most fundamental level, I think biology can be thought of as an information processing system, albeit an extraordinarily complex and dynamic one. Just as mathematics turned out to be the right description language for physics, biology may turn out to be the perfect type of regime for the application of AI.”

The Future of Computational Biology

It is difficult to overstate the future impact of AI in biology. In time, these systems have the potential to become oracles capable of answering any queries we might have about the most puzzling riddles in biology.

But even before then, it’s clear where AI is likely to have the greatest impact in the near term, and that’s medicine. Developing effective treatments for the most pernicious diseases, like cancer, has been a slog for decades. In 2012, scientist and entrepreneur Jack Scannell noted that the pharmaceutical industry seemed to be subject to Eroom’s Law — that is, Moore’s Law spelled backwards. This was the observation that despite improvements in technology, innovation in drug discovery was declining while the costs of development continued to increase. By 2023, it was estimated that developing a new drug takes a decade and costs more than $2 billion.

Instead of years of lab work and trial and error required to identify the appropriate target and candidate molecule for an effective drug, AI drug discovery platforms will be capable of generating such candidate complexes for any number of conditions. The job of the scientists will be to check and verify these answers, rather than having to go through the painstaking process of coming up with them on their own. 

This could prove particularly powerful in cases where individualized therapeutics are required, allowing for faster development of patient-specific treatments. In the future, these tools could even help us better understand the flaws of treatments like CAR-T, which, in certain patients, may induce off-target toxicity.

Beyond this, AI models can increase our search space for new compounds. Already, scientists are fascinated by the secrets hidden in the fungal genome. Fungi are well known for their ability to produce powerful therapeutics. Penicillin is the most obvious example. However, the accidental discovery of the antibacterial properties of Penicillium rubens underscores the degree of luck involved in massive medical breakthroughs like this. One can only imagine what the world might look like now if Alexander Fleming hadn’t unthinkingly left Petri dishes of staphylococcus bacteria exposed to airborne mold spores in 1928.

Since then, fungi have been the source of a number of other compounds, like cyclosporin, used as an immunosuppressant, and lovastatin, used as a treatment for coronary heart disease. The thinking is that there’s likely more where this came from. And, as antibiotic resistance becomes a growing concern, finding new compounds that can counter this resistance is critically important.

The only thing standing in the way of this future is the need for better models, and getting better models amounts to a need for more data and more compute. Luckily, the past two decades have offered a deluge of new biological data. As the cost of sequencing DNA has collapsed, the quantity of genomic data has exploded. Since 2014, GenBank, the NIH’s genetic sequence database, has seen the number of DNA bases it contains double every 18 months.

Source: Osaka University
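An 18-month doubling time compounds dramatically. A quick sketch shows the implied growth over a decade, assuming the doubling rate holds exactly (an idealization of the GenBank figure above):

```python
# Implied growth from an 18-month doubling time, held constant for a decade.
doubling_months = 18
years = 10
doublings = years * 12 / doubling_months   # ≈ 6.7 doublings per decade
growth_factor = 2 ** doublings             # ≈ 100x growth per decade

print(round(growth_factor))                # prints 102
```

In other words, at that pace the database grows by roughly two orders of magnitude every ten years, an expansion rate few other scientific data sources can match.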

On top of this, the rise of technologies like transcriptomics, which profiles RNA transcripts, and epigenomics, which studies the chemical modifications that regulate gene expression, has given unparalleled insight into some of the core processes within cells. Collectively, these technologies, which offer snapshots of activity within individual cells, are called omics. They have already helped researchers better understand the differences between healthy and diseased cell variants, the response of cells to drugs or genetic modification, and more.

The rate at which we can read individual cells is also growing at an exponential pace. In 2009, we could read omics data from just one cell at a time. As of 2020, we could read from millions at once. What’s more, multimodal omics technologies allow researchers to sample several modalities simultaneously, reading the genome, transcriptome, and epigenome of a cell at the same time.

The limiting factor with single-cell omics is that to sample their state, cells must be removed from their natural context in living tissue. This in vivo context provides important information about how cells work together, and it is lost when cells are isolated. Spatial omics technologies, however, are now emerging that allow researchers to mark and study the behavior of cells over time in their native context.

Take the increasing throughput at which we can study cells, add the ability to monitor various modalities at once, and allow for all of this to be done in the cells’ natural context — all of this amounts to a lot of tremendously useful data that can be used to train an AI to produce remarkably precise inferences about a host of critical questions. 

The emergence of general-purpose foundation models in biology is only just beginning, but progress has already been remarkably rapid. In 2018, as Frances Arnold was accepting her Nobel Prize in Chemistry, she noted, “Today we can for all practical purposes read, write, and edit any sequence of DNA, but we cannot compose it. The code of life is a symphony, guiding intricate and beautiful parts performed by an untold number of players and instruments. Maybe we can cut and paste pieces from nature’s compositions, but we do not know how to write the bars for a single enzymic passage.”

Within just six years of that observation, writing the bars for a single enzymic passage is very possible indeed. As the Evo model has demonstrated, we are in the early innings of the ability to generate not only enzymic passages but entire genomes themselves. 

It’s clear that AI models are revolutionizing our ability to answer questions about biological processes so complicated that they have stumped researchers for decades. What’s interesting about this new direction in biological research is the requirement that we let go of the need to build out a fully precise mechanistic understanding of the underlying biological processes. 

Models are capable of giving us the answers we seek, but not telling us how they got there. It could be that the mechanistic understanding we’ve been pursuing thus far is itself a kind of simplification, legible for human understanding, but incapable of explaining emergent biological properties that arise as a result of incredible complexity. 

Computational biology is increasingly pushing us to look upon not just biological study in particular, but scientific study in general through a new lens. Are we willing to give up perfect understanding, if that means we can get the answers we want? If so, the future of this field may yield systems capable of modeling the behavior of not only cells, but one day, entire organisms.

Disclosure: Nothing presented within this article is intended to constitute legal, business, investment or tax advice, and under no circumstances should any information provided herein be used or considered as an offer to sell or a solicitation of an offer to buy an interest in any investment fund managed by Contrary LLC (“Contrary”) nor does such information constitute an offer to provide investment advisory services. Information provided reflects Contrary’s views as of a time, whereby such views are subject to change at any point and Contrary shall not be obligated to provide notice of any change. Companies mentioned in this article may be a representative sample of portfolio companies in which Contrary has invested in which the author believes such companies fit the objective criteria stated in commentary, which do not reflect all investments made by Contrary. No assumptions should be made that investments listed above were or will be profitable. Due to various risks and uncertainties, actual events, results or the actual experience may differ materially from those reflected or contemplated in these statements. Nothing contained in this article may be relied upon as a guarantee or assurance as to the future success of any particular company. Past performance is not indicative of future results. A list of investments made by Contrary (excluding investments for which the issuer has not provided permission for Contrary to disclose publicly, Fund of Fund investments and investments in which total invested capital is no more than $50,000) is available at www.contrary.com/investments.

Certain information contained in here has been obtained from third-party sources, including from portfolio companies of funds managed by Contrary. While taken from sources believed to be reliable, Contrary has not independently verified such information and makes no representations about the enduring accuracy of the information or its appropriateness for a given situation. Charts and graphs provided within are for informational purposes solely and should not be relied upon when making any investment decision. Please see www.contrary.com/legal for additional important information.