Multi-omic biosequence transformers: the next foundation layer for protein–nucleic acid biology (and drug R&D)
Why this matters now
Foundation models have already transformed protein understanding (structure/function inference) and are increasingly influential in genome-scale sequence interpretation. But the core mechanics of cellular biology—and a growing share of therapeutic modalities—sit between omics: transcription factor binding, RNA-binding protein specificity, chromatin regulation, and replication/repair all depend on protein–nucleic acid interactions, as do many drug concepts in RNA/oligo therapeutics.
The attached paper introduces OmniBioTE, a large-scale multi-omic transformer trained jointly on protein + nucleic acid sequences at unprecedented scale, and demonstrates that multi-omic pretraining can unlock measurable gains on interaction-centric tasks—without giving up single-omic capability (Chen et al., 2025).
1) From single-omic “language models” to cross-omic “interaction models”
Transformers became the dominant sequence modelling paradigm because attention-based architectures efficiently capture long-range dependencies (Vaswani et al., 2017). In biology, the “single-omic” era produced strong specialist models for proteins (e.g., large protein LMs enabling structure inference) and for genomes (foundation models for promoter/splice/epigenetic prediction), but these models are structurally constrained by what they see during training: one modality at a time.
The OmniBioTE work argues (and empirically supports) that the real leverage comes from learning a shared representation space where nucleic acids and proteins are not separate languages but coupled distributions shaped by the central dogma and molecular recognition (Chen et al., 2025).
This direction aligns with a broader trend toward generalist biological foundation models that unify multiple sequence types across species (He et al., 2025) and with genome foundation models that increasingly emphasize scalable tokenization and standardized evaluation (Zhou et al., 2023).
2) What OmniBioTE shows (and why it’s non-trivial)
A. Emergent gene–protein alignment without supervision
A striking claim is that, despite being trained only with self-supervision (masked modelling), the model learns joint representations that align genes and their corresponding proteins—recoverable via lightweight contrastive projection on a small fraction of paired data (Chen et al., 2025).
For senior leaders, the implication is strategic: you may not need to explicitly curate perfect multi-omic pairing labels to benefit from cross-modal structure—scale + diversity can induce alignment as an emergent property.
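The alignment claim can be made concrete with a toy sketch. Below, "gene" and "protein" embeddings are synthetic views of a shared latent structure, and a simple least-squares linear map, fit on a small fraction of paired examples, stands in for the paper's lightweight contrastive projection. All names, dimensions, and data here are illustrative assumptions, not OmniBioTE's actual embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: paired "gene" and "protein" embeddings that share latent structure.
latent = rng.normal(size=(200, 16))            # shared biological factors
gene_emb = latent @ rng.normal(size=(16, 64))  # gene-side view
prot_emb = latent @ rng.normal(size=(16, 48))  # protein-side view

def fit_alignment(x, y, n_pairs=50):
    """Closed-form linear map from x-space to y-space, fit on a small
    fraction of paired examples (a stand-in for a learned contrastive
    projection head)."""
    w, *_ = np.linalg.lstsq(x[:n_pairs], y[:n_pairs], rcond=None)
    return w

def retrieval_accuracy(x, y, w):
    """Fraction of genes whose projected embedding is nearest (by cosine
    similarity) to its own protein among all candidates."""
    xp = x @ w
    xp /= np.linalg.norm(xp, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    sims = xp @ yn.T
    return float((sims.argmax(axis=1) == np.arange(len(x))).mean())

w = fit_alignment(gene_emb, prot_emb)
acc = retrieval_accuracy(gene_emb[50:], prot_emb[50:], w)  # held-out pairs
```

The point of the sketch is the workflow, not the math: if cross-modal structure is already latent in the embeddings, a cheap projection fit on few pairs suffices to recover gene→protein retrieval.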
B. Predicting binding energetics (ΔG) directly from sequence
OmniBioTE is fine-tuned to predict protein–nucleic acid binding free energy (ΔG) on ProNAB—an experimentally grounded dataset containing thermodynamic parameters for protein–DNA and protein–RNA complexes (Harini et al., 2022). The paper reports materially improved performance versus compute-matched single-omic controls, indicating that the joint pretraining actually transfers to a native multi-omic endpoint (Chen et al., 2025).
Why ΔG matters: it’s closer to the decision-making substrate for many programs than binary binding labels, because it supports ranking, robustness assessment, and mutation sensitivity—key for oligo design, TF/RBP targeting hypotheses, and specificity risk screens (Harini et al., 2022).
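As a minimal sketch of what a ΔG fine-tuning head does, the snippet below fits a closed-form ridge regression from pooled complex embeddings to binding free energies and checks that predictions preserve the measured ranking. The embeddings and ΔG values are synthetic; in the paper this role is played by OmniBioTE representations fine-tuned on ProNAB:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pooled complex embeddings and measured ΔG values (kcal/mol).
X = rng.normal(size=(300, 32))
true_w = rng.normal(size=32)
dG = X @ true_w + 0.1 * rng.normal(size=300)   # noisy linear ground truth

def ridge_fit(X, y, lam=1e-2):
    """Closed-form ridge regression: a minimal stand-in for a ΔG head."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w = ridge_fit(X[:200], dG[:200])
pred = X[200:] @ w
# Does predicted ΔG correlate with (i.e., rank complexes like) measured ΔG?
corr = float(np.corrcoef(pred, dG[200:])[0, 1])
```

Because ΔG is continuous, a held-out correlation like `corr` directly measures the ranking ability that matters for triage, which a binary binding classifier cannot provide.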
C. Structural signal emerges without explicit structural training
Another high-impact result: attention-derived probes can extract contact-like structural information after fine-tuning on binding energetics, even though the base model never sees structures during pretraining (Chen et al., 2025).
This is especially relevant given the field’s rapid progress in structure prediction of complexes—including protein–nucleic acid complexes—via specialized models like RoseTTAFoldNA (Baek et al., 2024) and AlphaFold 3 (Abramson et al., 2024). The OmniBioTE result suggests a complementary route: sequence → interaction energetics can itself impose constraints that recover structural correlates, potentially enabling cheaper “structure-aware” triage earlier in pipelines.
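The attention-probe idea can be sketched in a few lines: symmetrize an attention map over a joint protein+nucleic-acid sequence and read its strongest off-diagonal entries as predicted contacts. The attention map below is synthetic and constructed to concentrate mass on a toy contact set; it illustrates the probe's mechanics, not the model's actual attention:

```python
import numpy as np

rng = np.random.default_rng(2)
L = 30  # toy joint protein + nucleic-acid sequence length

# Toy "true" contacts: a few long-range residue/base pairs.
true_contacts = np.zeros((L, L), dtype=bool)
for i, j in [(2, 25), (5, 20), (10, 28)]:
    true_contacts[i, j] = true_contacts[j, i] = True

# Hypothetical attention map that (noisily) concentrates mass on contacts.
attn = 0.05 * rng.random((L, L))
attn[true_contacts] += 1.0

def contacts_from_attention(attn, top_k):
    """Symmetrize attention, then take the top-k off-diagonal pairs as
    predicted contacts."""
    sym = 0.5 * (attn + attn.T)
    iu = np.triu_indices_from(sym, k=1)
    order = np.argsort(sym[iu])[::-1][:top_k]
    return list(zip(iu[0][order], iu[1][order]))

pred = contacts_from_attention(attn, top_k=3)
precision = float(sum(bool(true_contacts[i, j]) for i, j in pred)) / 3
```

In practice such probes are evaluated with precision-at-k against experimentally solved complexes, exactly as contact probes were for protein LMs.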
3) Where this sits vs. other state-of-the-art approaches
Structure-first (complex prediction)
- AlphaFold 3 expands complex prediction to proteins, nucleic acids, ligands, ions, and modified residues (Abramson et al., 2024). Public discussion has also highlighted tension around openness/access compared to prior releases (Sample, 2024; Pujol-Mazzini, 2024).
- RoseTTAFoldNA targets protein–DNA/RNA complex structure prediction and is positioned for modelling naturally occurring complexes and designing sequence-specific binders (Baek et al., 2024).
What OmniBioTE adds: a foundation-model route that’s natively multi-omic, optimized around interaction-relevant objectives (ΔG; specificity perturbations) and potentially more compute-efficient for early triage than structure+simulation pipelines (Chen et al., 2025).
Sequence-first (single-omic scaling)
- Protein LMs at scale can encode structure directly from sequence (Lin et al., 2023).
- Genome foundation models are becoming more standardized, with strong emphasis on tokenization and benchmarking (Zhou et al., 2023).
What OmniBioTE adds: evidence that mixing modalities can be non-inferior on single-omic benchmarks while improving multi-omic tasks—suggesting a platform consolidation opportunity (Chen et al., 2025).
Multi-omic generalists
- LucaOne is a prominent attempt at generalized DNA/RNA/protein foundation modelling across many species (He et al., 2025).
What OmniBioTE adds: a clear, interaction-centric evaluation story (ΔG + contact interpretability) tied directly to protein–nucleic acid biology (Chen et al., 2025).
4) Practical implications for pharma R&D leaders
A. Target discovery & validation for regulation-heavy biology
For programs involving transcription factors, RNA-binding proteins, chromatin remodelers, or regulatory motifs, multi-omic models could enable:
- rapid in silico screening of candidate binding interactions and mutation sensitivity
- improved prioritization of regulatory hypotheses for wet-lab validation
- earlier identification of “specificity cliffs” that frequently derail translation
JASPAR’s curated TF binding profiles provide a useful external check on specificity behaviour; OmniBioTE’s perturbation experiments (mutating consensus sequences) are directionally consistent with motif sensitivity expectations (Rauluseviciute et al., 2024; Chen et al., 2025).
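The motif-sensitivity check has a simple mechanical core: score sequences against a position weight matrix (PWM) and confirm that mutating a consensus base drops the score. The 4-bp PWM below is hand-made for illustration, in the spirit of JASPAR profiles but not a real TF profile:

```python
import numpy as np

BASES = "ACGT"

# Toy PWM for a 4-bp motif: each position strongly prefers one base.
# Values are log-odds vs a uniform 0.25 background (illustrative only).
pwm = np.log2(np.array([
    [0.85, 0.05, 0.05, 0.05],   # position 1 prefers A
    [0.05, 0.85, 0.05, 0.05],   # position 2 prefers C
    [0.05, 0.05, 0.85, 0.05],   # position 3 prefers G
    [0.05, 0.05, 0.05, 0.85],   # position 4 prefers T
]) / 0.25)

def motif_score(seq):
    """Sum of per-position log-odds: higher = closer to consensus."""
    return float(sum(pwm[i, BASES.index(b)] for i, b in enumerate(seq)))

consensus_score = motif_score("ACGT")
mutant_score = motif_score("AGGT")   # single substitution at position 2
```

A model whose predicted binding degrades in proportion to such PWM score drops is behaving consistently with curated motif expectations, which is the directional check the paper's perturbation experiments perform.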
B. Oligo/RNA therapeutic design & off-target risk
Even without claiming end-to-end aptamer/ASO/siRNA design, the demonstrated ability to model ΔG and motif disruption suggests near-term applications in:
- ranking candidate sequences against protein targets (or vice versa)
- stress-testing sequence robustness to small edits (ΔΔG-like reasoning)
- early-stage risk filters for unintended protein binding modes
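The stress-testing idea above reduces to a simple loop: enumerate every single-base edit of a candidate, score each with a ΔG predictor, and flag the candidate if any edit shifts predicted ΔG beyond a tolerance. In this sketch `predict_dG` is a deliberately naive GC-counting stand-in for a fine-tuned model's scorer; the function names and tolerance are assumptions for illustration:

```python
BASES = "ACGT"

def predict_dG(seq):
    """Toy scorer: GC-rich sequences 'bind' more tightly (more negative ΔG).
    A real pipeline would call a fine-tuned model here."""
    return -1.5 * sum(b in "GC" for b in seq)

def ddg_profile(seq):
    """Predicted ΔΔG for every single-base substitution of `seq`."""
    base_dG = predict_dG(seq)
    out = []
    for i, b in enumerate(seq):
        for alt in BASES:
            if alt != b:
                mut = seq[:i] + alt + seq[i + 1:]
                out.append((i, alt, predict_dG(mut) - base_dG))
    return out

def is_robust(seq, tol=2.0):
    """Robust if no single edit moves predicted ΔG by more than `tol`."""
    return all(abs(d) <= tol for _, _, d in ddg_profile(seq))

profile = ddg_profile("ACGT")   # 4 positions × 3 alternatives = 12 edits
robust = is_robust("ACGT")
```

The same loop inverted (keep edits that *improve* ΔG against the intended target while leaving off-target scores flat) is the basic shape of a sequence-first triage filter.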
C. Platform strategy: fewer foundation models, more leverage
If multi-omic pretraining maintains competitive single-omic performance-per-compute while improving interaction tasks, this supports a strategic simplification:
- consolidate internal FM stacks (one multi-omic backbone vs. separate protein/genome models)
- standardize embedding interfaces across discovery, translational, and bioinformatics teams
- reduce duplicated MLOps burden and model governance overhead
5) What this means for pharma (2026–2030): likely direction of travel
- Interaction-native AI becomes the default for regulation and RNA biology (TF/RBP/chromatin), not an add-on.
- Sequence-first triage shifts earlier, reducing dependence on expensive structural pipelines until later-stage confirmation.
- Generalist multi-omic backbones become the center of platform strategy, with task heads for binding, specificity, expression, and safety-relevant endpoints.
- Access and reproducibility become procurement criteria: leaders will weigh closed vs open ecosystems (e.g., AlphaFold 3 access debate) as part of long-term platform risk management (Sample, 2024; Pujol-Mazzini, 2024).
References
Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., Pritzel, A., … Jumper, J. M. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630, 493–500. https://www.nature.com/articles/s41586-024-07487-w
Baek, M., McHugh, R., Anishchenko, I., Jiang, H., Baker, D., & DiMaio, F. (2024). Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA. Nature Methods, 21, 117–121. https://www.nature.com/articles/s41592-023-02086-5
Chen, S. F., Steele, R. J., Hocky, G. M., Lemeneh, B., Lad, S. P., & Oermann, E. K. (2025). Large-Scale Multi-omic Biosequence Transformers for Modeling Protein–Nucleic Acid Interactions. https://pmc.ncbi.nlm.nih.gov/articles/PMC11998858/
Harini, K., Gupta, R., Vishwakarma, S., & Srinivasan, N. (2022). ProNAB: Database for binding affinities of protein–nucleic acid complexes and their mutants. Nucleic Acids Research, 50(D1), D1528–D1536. https://academic.oup.com/nar/article/50/D1/D1528/6381138
He, Y., et al. (2025). Generalized biological foundation model with unified nucleic acid and protein language. Nature Machine Intelligence. https://www.nature.com/articles/s42256-025-01044-4
Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., … Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123–1130. https://www.science.org/doi/10.1126/science.ade2574
Pujol-Mazzini, A. (2024, May 28). AlphaFold 3 … frustre les chercheurs [AlphaFold 3 … frustrates researchers] [News article]. Le Monde. https://www.lemonde.fr/sciences/article/2024/05/28/alphafold-3-le-logiciel-phare-de-deepmind-pour-modeliser-les-proteines-frustre-les-chercheurs_6236027_1650684.html
Rauluseviciute, I., Riudavets-Puig, R., Blanc-Mathieu, R., Castro-Mondragon, J., Ferenc, K., Kumar, V., … Mathelier, A. (2024). JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Research, 52(D1), D174–D182. https://academic.oup.com/nar/article/52/D1/D174/7420101
Sample, I. (2024, May 8). Google DeepMind’s ‘leap forward’ in AI could unlock secrets of biology [News article]. The Guardian. https://www.theguardian.com/science/article/2024/may/08/google-deepmind-ai-biology-alphafold
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS. https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Zhou, Z., Ji, Y., Li, W., Dutta, P., Davuluri, R., & Liu, H. (2023). DNABERT-2: Efficient foundation model and benchmark for multi-species genome. arXiv. https://arxiv.org/abs/2306.15006