Genie2 is an improved diffusion model based on the Genie architecture. It is reported to outperform existing models on designability, diversity, and novelty, and on both single- and multi-motif scaffolding.
1️⃣ Training Data Augmentation
Genie2 significantly expands its training dataset by using the AlphaFold database (AFDB), which includes over 200 million predicted protein structures. By employing a subsampled AFDB set, filtered for structural similarity and high-confidence predictions (pLDDT > 80), Genie2 utilizes 588,570 structures for training. This marks a substantial increase from the PDB's dataset and not only enriches the model with a broader variety of protein folds but also enhances its performance.
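As a rough illustration of the confidence filter described above, the sketch below keeps only predicted structures whose pLDDT exceeds 80. The record layout, the `accession` and `mean_plddt` field names, and the sample values are hypothetical and are not taken from the Genie2 codebase or from AFDB. The structural-similarity subsampling step is not shown.

```python
# Minimal sketch of a pLDDT confidence filter (hypothetical data layout).
PLDDT_THRESHOLD = 80.0  # cutoff quoted in the text: pLDDT > 80


def filter_high_confidence(structures, threshold=PLDDT_THRESHOLD):
    """Return only the records whose mean pLDDT is strictly above the threshold."""
    return [s for s in structures if s["mean_plddt"] > threshold]


# Made-up example records; real AFDB entries carry far more metadata.
candidates = [
    {"accession": "example_a", "mean_plddt": 92.3},
    {"accession": "example_b", "mean_plddt": 80.0},  # not strictly above 80, dropped
    {"accession": "example_c", "mean_plddt": 64.1},
    {"accession": "example_d", "mean_plddt": 85.7},
]

kept = filter_high_confidence(candidates)
kept_ids = [s["accession"] for s in kept]
print(kept_ids)
```

The comparison is strict to match the "pLDDT > 80" wording, so a structure scoring exactly 80 is excluded.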
2️⃣ Unconditional Protein Generation
Within its training limit of 256 aa, Genie2 outperforms its predecessors at unconditional generation. Beyond that limit, up to 500 aa, it still shows robust out-of-distribution generative ability. Comparatively, it surpasses Chroma and FrameFlow and is reported to be superior to RFDiffusion.
3️⃣ In-Distribution Performance
Genie2 was assessed alongside Chroma and RFDiffusion by generating five structures of every length ranging from 50 to 256 aa (1,035 structures in total). Relative to RFDiffusion and Chroma, Genie2 achieves comparable designability and much higher diversity and novelty, suggesting its ability to best capture foldable protein structure space.
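The total of 1,035 structures follows directly from the sampling schedule: 207 lengths (50 through 256 inclusive) with five samples each. The short sketch below enumerates that schedule. The variable names are illustrative only and do not correspond to any Genie2 script.

```python
# Sketch of the in-distribution sampling schedule described above.
MIN_LENGTH = 50          # shortest evaluated length (aa)
MAX_LENGTH = 256         # longest evaluated length (aa), also the training limit
SAMPLES_PER_LENGTH = 5   # structures generated at each length

# range() excludes its upper bound, so add 1 to include 256.
lengths = range(MIN_LENGTH, MAX_LENGTH + 1)

# One (length, sample_index) pair per structure to generate.
schedule = [(length, i) for length in lengths for i in range(SAMPLES_PER_LENGTH)]

num_lengths = len(lengths)
total_structures = len(schedule)
print(num_lengths, total_structures)
```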
4️⃣ Length-Based Generative Performance
With 100 generated structures per length at a subset of sequence lengths from 50 to 500 aa, Genie2 shows designability comparable to RFDiffusion but higher diversity. For short proteins (<200 aa), Genie2 exhibits significantly higher diversity, doubling that of RFDiffusion at 100 aa, which is remarkable because shorter lengths constitute smaller design spaces.
5️⃣ Motif Scaffolding (Single/Multi)
Across 24 single-motif scaffolding tasks, Genie2 solved 23 compared to RFDiffusion’s 22, with 1,445 and 889 unique successes, respectively. Genie2 solved the RSV G-protein 2D10 site-scaffolding task (5WN9), where RFDiffusion failed. Its only failure among the 24 problems was scaffolding the RSV F-protein site-0, which RFDiffusion also failed. In multi-motif scaffolding, Genie2 solved 4 out of 6 tasks.
6️⃣ Limitations
Genie2’s sampling is slower because it uses more denoising iterations: 1,000, which is 10x as many as FrameFlow (100), 2x as many as Chroma (500), and 20x as many as RFDiffusion (50), as measured on a single A6000 GPU (48GB). Experimental validation of the designed proteins also remains to be done.
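The multipliers quoted above can be checked from the iteration counts alone. The sketch below does that arithmetic. It compares step counts only, so it should not be read as a wall-clock estimate, since the cost per denoising step differs between models.

```python
# Denoising iteration counts as quoted in the text.
denoising_steps = {
    "Genie2": 1000,
    "FrameFlow": 100,
    "Chroma": 500,
    "RFDiffusion": 50,
}

# How many times more iterations Genie2 runs than each other model.
step_ratios = {
    name: denoising_steps["Genie2"] / steps
    for name, steps in denoising_steps.items()
    if name != "Genie2"
}

for name, ratio in step_ratios.items():
    print(f"Genie2 uses {ratio:g}x the iterations of {name}")
```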
Open Questions and Future Directions
Addressing the open questions around Genie2, including the limitations above, would let researchers further explore and expand its capabilities in protein design and engineering. Continued development and interdisciplinary collaboration will be key to realizing its potential in scientific and industrial settings.
Resources
Code: Genie 2: Designing and Scaffolding Proteins at the Scale of the Structural Universe