The combination of data-driven and physics-based modeling with application in protein formulations

J.G.E.M. Fraaije, P. Petris
Siemens Culgi,
Netherlands

Keywords: AI, protein formulations

Summary:

With the advent of the data revolution, new avenues open up to integrate data-driven and physics-based modeling. We discuss the application of such integration to protein formulations. One finds protein formulations in many industries: personal care, food, and drug discovery and development. It is estimated that 20% of all new drug discovery currently concerns biologicals. In all these systems, in one way or another, the protein solution goes through a phase in which the formulation is susceptible to instabilities, for example when the concentration is relatively high or when solvent conditions change. Instabilities manifest themselves as aggregation, denaturation, or a combination thereof. Until recently, the protein formulator could do little more than rely on experimental trial and error, perhaps aided by robotic screening or ancient wisdom. With the phenomenal technological advancement in AI-driven protein structure prediction by Google’s DeepMind (1,2) and similar academic AI initiatives (3), we now suddenly have the possibility to generate structures on a (multiple-)proteome-wide scale. Perhaps only a few of those structures will be of atomic accuracy, which is the resolution one needs for small-molecule drug discovery. But for many a formulation challenge, one does not need atomistic resolution: a relatively rough 3D structure, organized on the level of groups of atoms (‘beads’), could be enough. However, there is one challenge: in the translation of structure to formulation, one needs both atomic positions (albeit rough) and thermodynamic interactions. It is precisely at the level of overlaying the AI-generated rough structure with coarse-grained (CG) modeling that one can hybridize data-driven and physics-based modeling into a new AI-CG hybrid algorithm. The AI method is determined by statistics, whereas the coarse-grained modeling relies on physics.
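To make the idea of organizing a structure on the level of groups of atoms concrete, the following is a minimal sketch of one generic mapping: each bead is placed at the center of mass of the atoms assigned to it. This is not the fragmentation and parameterization scheme used in the examples; the function name `coarse_grain`, the `group_ids` labels (for instance residue numbers), and the toy coordinates are all hypothetical and chosen only for illustration. The thermodynamic interactions between beads are not addressed here.

```python
import numpy as np


def coarse_grain(coords, masses, group_ids):
    """Map atoms to beads: one bead per group, at the group's center of mass.

    coords    : (N, 3) atomic positions
    masses    : (N,) atomic masses
    group_ids : (N,) label of the bead each atom belongs to (assumed given)
    """
    coords = np.asarray(coords, dtype=float)
    masses = np.asarray(masses, dtype=float)
    group_ids = np.asarray(group_ids)

    # Sorted unique bead labels, plus the bead index of every atom.
    labels, inverse = np.unique(group_ids, return_inverse=True)

    # Total mass per bead.
    bead_mass = np.bincount(inverse, weights=masses)

    # Mass-weighted mean position per bead, one Cartesian axis at a time.
    bead_pos = np.zeros((len(labels), 3))
    for axis in range(3):
        weighted = np.bincount(inverse, weights=masses * coords[:, axis])
        bead_pos[:, axis] = weighted / bead_mass

    return labels, bead_pos, bead_mass


# Toy input: four atoms assigned to two beads (illustrative numbers only).
coords = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 4.0], [0.0, 2.0, 4.0]]
masses = [1.0, 1.0, 1.0, 3.0]
group_ids = [1, 1, 2, 2]

labels, bead_pos, bead_mass = coarse_grain(coords, masses, group_ids)
```

In this toy case the first bead sits midway between its two equal-mass atoms, and the second bead is pulled toward its heavier atom. A predicted structure would supply `coords` and `masses`, while the choice of `group_ids` is where the actual fragmentation method enters.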
We showcase the AI-CG algorithm with a few examples in which we take protein structures generated by DeepMind and then coarse-grain them with Simcenter Culgi’s Automated Fragmentation and Parameterization method. Once on the coarse-grained level, it is relatively easy to calculate, for example, the second virial coefficient, or even to simulate the diffusion of a few coarse-grained protein molecules by Stokesian Particle Dynamics. The hybrid AI-CG algorithm takes only a few minutes, or at most a few hours, to execute on a modest PC, which is still efficient enough for screening purposes.

References:

1. Jumper, J. et al. Nature https://doi.org/10.1038/s41586-021-03819-2 (2021).
2. Tunyasuvunakool, K. et al. Nature https://doi.org/10.1038/s41586-021-03828-1 (2021).
3. Baek, M. et al. Science https://doi.org/10.1126/science.abj8754 (2021).