Human α-lactalbumin (α-LA, UniProt: P00709) plays a crucial role in nutrition and healthy development. As a high-quality protein source, its nutritional value depends not only on its comprehensive essential amino acid composition but, more profoundly, on the efficient absorption of its degradation products by small intestinal epithelial cells.
The absorption of amino acids by small intestinal epithelial cells occurs primarily through two pathways:
1. Absorption of free amino acids via various specific transporters.
2. Absorption of dipeptides and tripeptides through a highly efficient and broad-spectrum transport system: the proton-coupled peptide transporter 1 (PepT1, gene name SLC15A1).
Substantial evidence indicates that the oligopeptide transport pathway is dominant in overall amino acid absorption. The PepT1 transporter, located on the brush border membrane of the small intestine, utilizes the transmembrane proton gradient as a driving force to efficiently transport thousands of different dipeptides and tripeptides into the cell. This physiological mechanism offers a new perspective for optimizing the nutritional value of proteins: if the hydrolysis pathway of a protein during digestion can be precisely controlled through engineering to preferentially produce oligopeptide fragments easily recognized and transported by PepT1, the overall bioavailability of the target protein could be significantly enhanced.
Based on this, this project proposes a rational protein engineering strategy for human α-LA. The core hypothesis is to achieve "molecular tailoring" of its proteolytic product profile by introducing specific protease cleavage sites on the surface of the α-LA molecule. This tailoring aims to maximize the proportion of dipeptides and tripeptides in the digestive products, thereby fully utilizing the efficient PepT1 absorption pathway.
The primary principle for the rational modification of any protein is to deeply understand and respect its natural structural and functional limitations. Any mutation intended to introduce new properties must not compromise the protein's correct folding, structural stability, or core biological activity. For α-LA, its precise structure is not only the basis of its biological function but also the guarantee of its integrity in the harsh environment of the digestive tract.
The tertiary structure of mature α-LA is held together by four disulfide bonds: Cys6–Cys120, Cys28–Cys111, Cys61–Cys77, and Cys73–Cys91. These eight cysteine residues are almost absolutely conserved throughout the α-LA/lysozyme superfamily, and any mutation targeting them is highly likely to cause severe protein misfolding and loss of function. Therefore, these eight cysteine residues must be excluded when screening for modification sites.
One of the most significant structural and functional features of α-LA is its high-affinity calcium ion (Ca²⁺) binding site, where a precise spatial conformation is formed by multiple amino acid side chains and main-chain carbonyls coordinating a calcium ion. Key residues involved in calcium coordination include the main-chain carbonyl oxygens of Lys79 and Asp84, and the side-chain carboxyl oxygens of Asp82, Asp87, and Asp88.
Studies have shown that, compared to the calcium-bound "holo-protein," the calcium-free "apo-protein" adopts a partially folded "molten globule" state under physiological conditions. This molten globule has a loose structure with an exposed hydrophobic core, leading to a sharp decrease in thermal stability and making it extremely sensitive to proteolytic degradation. The residues of the calcium-binding site must therefore also be excluded from modification.
| Feature | Residue Position (Precursor, 1-142) | Residue Position (Mature Protein, 1-123) | Rationale for Conservation |
|---|---|---|---|
| Disulfide Bond 1 | Cys25, Cys139 | Cys6, Cys120 | Essential for maintaining tertiary structure and global folding. |
| Disulfide Bond 2 | Cys47, Cys130 | Cys28, Cys111 | Essential for maintaining tertiary structure and global folding. |
| Disulfide Bond 3 | Cys80, Cys96 | Cys61, Cys77 | Essential for maintaining tertiary structure and global folding. |
| Disulfide Bond 4 | Cys92, Cys110 (predicted) | Cys73, Cys91 (predicted) | Essential for maintaining tertiary structure and global folding. |
| Calcium-Binding Loop | 98-107 | 79-88 | Coordinates Ca²⁺ ions, greatly enhancing protein structural stability and resistance to proteolysis. |
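The two numbering schemes in the table differ by a constant offset, which is easy to get wrong when moving between database entries and structure files. The sketch below assumes a 19-residue signal peptide, a value inferred from the table itself (precursor position minus mature position); the function names are illustrative only.

```python
# Offset inferred from the table above (e.g. Cys25 -> Cys6); treated as an assumption.
SIGNAL_PEPTIDE_LENGTH = 19
PRECURSOR_LENGTH = 142
MATURE_LENGTH = 123


def mature_to_precursor(pos: int) -> int:
    """Convert mature-protein numbering (1-123) to precursor numbering (1-142)."""
    if not 1 <= pos <= MATURE_LENGTH:
        raise ValueError(f"mature position {pos} is outside 1-{MATURE_LENGTH}")
    return pos + SIGNAL_PEPTIDE_LENGTH


def precursor_to_mature(pos: int) -> int:
    """Convert precursor numbering to mature numbering; signal-peptide residues are rejected."""
    if not SIGNAL_PEPTIDE_LENGTH < pos <= PRECURSOR_LENGTH:
        raise ValueError(f"precursor position {pos} is not part of the mature chain")
    return pos - SIGNAL_PEPTIDE_LENGTH


# Calcium-binding loop, mature numbering 79-88, expressed in precursor numbering.
loop = (mature_to_precursor(79), mature_to_precursor(88))
print(loop)
```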
From a structural and evolutionary perspective, α-LA shares striking similarities with the c-type lysozyme family, including about 40% sequence identity and a highly conserved three-dimensional fold. Although their biological functions have diverged significantly, by comparing a large and functionally diverse family of homologous proteins, we can identify which residues are essential for maintaining the core structure and which have changed with functional specialization.
By quantifying the degree of conservation at each amino acid position, we can create a "functional importance map." On this map, regions with low conservation scores (i.e., high variability) are the potential "hotspots" for modification. Mutations at these sites are most likely to introduce new, desired properties without disrupting the overall structure and function of the protein.
To translate the above theory into an executable analysis, we adopt a standardized computational biology workflow, which mainly includes the following four steps:
Homolog Discovery: Using our target human α-LA sequence as a "probe," we search large protein sequence databases (such as UniProt/Swiss-Prot) to find its homologous proteins in other species.
Multiple Sequence Alignment (MSA): All homologous sequences collected in the previous step are arranged so that evolutionarily corresponding residues are aligned in the same column.
Phylogenetic Tree Inference: Simply counting the amino acid frequencies in each column is subject to sampling bias. To correct for this, we need to construct a phylogenetic tree from the MSA, which reflects the evolutionary relationships among all sequences. By integrating the tree information into the conservation calculation, we can assign different weights to sequences from different evolutionary branches, thus obtaining a more accurate conservation assessment.
Position-Specific Conservation Scoring: After obtaining the MSA and phylogenetic tree, we can calculate a quantitative conservation score for each column of the MSA. Various algorithms can achieve this, ranging from simple frequency statistics to complex, evolution-based probabilistic methods.
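To illustrate why sequence weighting matters in step three, the sketch below uses position-based (Henikoff-style) weights. This is a simpler stand-in and not the tree-based weighting described above: it needs no phylogenetic tree, but it follows the same idea of down-weighting groups of near-identical sequences. The function name and the toy alignment are hypothetical.

```python
from collections import Counter


def henikoff_weights(msa):
    """Position-based sequence weights for an aligned list of equal-length strings.

    In each column, a sequence receives 1 / (k * n), where k is the number of
    distinct residue types in the column and n is the number of sequences that
    share its residue. Gaps ('-') are skipped. Weights are normalised to sum to 1.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    weights = [0.0] * len(msa)
    for col in range(length):
        column = [seq[col] for seq in msa]
        counts = Counter(res for res in column if res != "-")
        k = len(counts)
        if k == 0:
            continue  # column contains only gaps
        for i, res in enumerate(column):
            if res != "-":
                weights[i] += 1.0 / (k * counts[res])

    total = sum(weights)
    if total == 0:
        raise ValueError("alignment contains no residues")
    return [w / total for w in weights]


# Toy alignment: three identical sequences and one divergent sequence.
toy_msa = ["ACDE", "ACDE", "ACDE", "GCKE"]
toy_weights = henikoff_weights(toy_msa)
print(toy_weights)
```

In the toy alignment the divergent sequence receives a larger weight than each of the three identical ones, so an over-sampled clade no longer dominates the column frequencies.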
To quantify the variability of each site, we use a classic concept from information theory: Shannon entropy. For a single alignment column, H = −Σᵢ pᵢ log₂ pᵢ, where pᵢ is the frequency of amino acid i in that column. Intuitively, Shannon entropy measures the uncertainty or "disorder" of the amino acid composition at a site.
Low Entropy Value (approaching 0): When a column consists almost entirely of one type of amino acid (e.g., 99% alanine), one value of pᵢ is close to 1 and the rest are close to 0. In this case, the entropy H approaches 0. This indicates low uncertainty, meaning the site is highly conserved.
High Entropy Value: When multiple amino acids appear with similar frequencies in a column, the uncertainty is maximal, and so is the entropy. For example, if 20 amino acids appear with equal frequency (5% each), the entropy will reach its maximum value of log₂(20) ≈ 4.32. This indicates the site is highly variable.
In this project, we will calculate the Shannon entropy for each position in the α-LA homologous sequence alignment and use it as a "variability score." Our goal is to screen for sites with the highest variability scores, as they represent the regions most tolerant to change during evolution.
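As a minimal sketch of the variability score, the function below computes the base-2 Shannon entropy of one alignment column from unweighted residue frequencies. Ignoring gaps and omitting sequence weights are simplifying assumptions made here, and the example columns are invented for illustration.

```python
import math
from collections import Counter


def shannon_entropy(column, gap="-"):
    """Base-2 Shannon entropy of one alignment column (gap characters are ignored)."""
    residues = [res for res in column if res != gap]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # H = sum_i p_i * log2(1 / p_i), equivalent to -sum_i p_i * log2(p_i)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


conserved = "C" * 20                  # one residue type only
variable = "ACDEFGHIKLMNPQRSTVWY"     # all 20 residue types, 5% each

print(shannon_entropy(conserved))
print(round(shannon_entropy(variable), 2))
```

The two extreme columns reproduce the limits described above: zero for a fully conserved column and log₂(20) for a uniformly mixed one.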
Sequence Logo Analysis: Sequence logos provide us with more detailed conservation information. In these diagrams, the total height of the stack of letters at each position represents the information content (i.e., degree of conservation), while the relative height of each letter reflects the frequency of that specific amino acid at that position. Positions with low information content (i.e., short stacks with several letters of similar height) correspond to the peaks in the Shannon entropy plot and are potential regions for modification. Conversely, stacks composed of a single, tall letter, such as the highly conserved Cysteine residues forming disulfide bonds, represent structurally or functionally irreplaceable sites that must be avoided in any modification plan.
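The relationship between stack height and entropy can be made explicit. The sketch below assumes the common definition of information content for proteins, R = log₂(20) − H, and omits the small-sample correction that logo tools usually apply; the function name is hypothetical.

```python
import math
from collections import Counter


def logo_heights(column, gap="-"):
    """Return (information content in bits, per-letter heights) for one column.

    Information content is log2(20) minus the Shannon entropy; each letter's
    height is its frequency multiplied by the information content.
    """
    residues = [res for res in column if res != gap]
    if not residues:
        return 0.0, {}
    total = len(residues)
    freqs = {aa: c / total for aa, c in Counter(residues).items()}
    entropy = sum(p * math.log2(1 / p) for p in freqs.values())
    info = math.log2(20) - entropy
    return info, {aa: p * info for aa, p in freqs.items()}


# A fully conserved cysteine column gives one tall letter.
cys_info, cys_heights = logo_heights("CCCCCCCCCC")
# A uniformly mixed column gives a stack of nearly zero height.
mixed_info, mixed_heights = logo_heights("ACDEFGHIKLMNPQRSTVWY")
print(round(cys_info, 2), round(mixed_info, 2))
```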
The bar chart quantifies the analysis above, listing the ten sites with the highest variability scores after excluding critical functional residues. Alignment position 89 leads with the highest score of 2.82, followed by position 39 (2.45) and position 150 (2.27). This list constitutes our initial pool of candidates for subsequent structural validation, providing clear targets for screening.
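The selection behind such a ranking can be sketched as a filter followed by a sort. In the example below only the three scores quoted above (positions 89, 39, and 150) come from the analysis; every other score and the excluded positions are placeholders invented for illustration. Note that the exclusion set must be expressed in alignment coordinates, which differ from residue numbers whenever the alignment contains gaps.

```python
def top_variable_sites(scores, excluded, n=10):
    """Return the n highest-scoring (position, score) pairs not in the exclusion set."""
    candidates = [(pos, s) for pos, s in scores.items() if pos not in excluded]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:n]


# Alignment position -> variability score.
scores = {
    89: 2.82, 39: 2.45, 150: 2.27,   # values quoted in the text
    12: 3.10, 55: 1.40, 101: 0.05,   # hypothetical placeholders
}
# Hypothetical alignment columns holding protected residues (cysteines, Ca2+ site).
excluded = {12, 101}

ranking = top_variable_sites(scores, excluded, n=3)
print(ranking)
```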
Although 2D maps are informative, they cannot fully represent the spatial position and accessibility of residues. A highly variable residue located in the protein core is far less valuable for modification than a variable residue on the surface, which is easily accessible to solvents and other molecules. Therefore, we map the variability scores onto the protein's three-dimensional structure, using a color gradient to visually distinguish between "cold" (conserved) and "hot" (variable) regions.
Overall Trend: The per-residue solvent-accessible surface area (SASA) curve fluctuates dramatically, with many high "peaks" and deep "valleys." This is a typical SASA distribution for a globular protein.
Significance of "Peaks": Each peak (high SASA value) corresponds to a residue highly exposed on the protein surface. These residues, located in flexible loop regions or at the turns of α-helices/β-sheets, are our primary targets for surface engineering.
Significance of "Valleys": Each valley (low SASA value) corresponds to a residue buried deep within the protein's interior. These are the cornerstones for maintaining the protein's structural stability and should be avoided for modification.
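Reading peaks and valleys off a curve can be turned into a simple rule by thresholding relative SASA. The sketch below assumes a 25% cutoff, which is an arbitrary choice made here and not a value from this analysis. The 3.63% figure for ALA92 is the one reported elsewhere in this document; the other values are hypothetical.

```python
def classify_exposure(relative_sasa, exposed_cutoff=25.0):
    """Label each residue 'exposed' or 'buried' from its relative SASA in percent."""
    return {
        residue: "exposed" if value >= exposed_cutoff else "buried"
        for residue, value in relative_sasa.items()
    }


relative_sasa = {
    "ALA92": 3.63,    # reported in this document
    "GLU121": 60.0,   # hypothetical placeholder
    "LEU123": 85.0,   # hypothetical placeholder
}

labels = classify_exposure(relative_sasa)
print(labels)
```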
A clear trend is that many of the residues in the Top 10 list (such as LEU123, LYS122, GLU121, GLU116, LYS114, GLU113) are concentrated in the C-terminal region of the protein sequence. The C-terminus of α-LA is a highly flexible and fully exposed region, which makes it a promising area for modification.
Introduce Trypsin Cleavage Sites: Trypsin specifically cleaves the peptide bond on the C-terminal side of lysine (Lys, K) and arginine (Arg, R). By placing new K/R residues at suitable positions, digestion can be steered toward oligopeptides whose length and composition make them easily absorbed by PepT1.
Biochemical Principles of Trypsin
Core Requirement: The target amino acid for cleavage must be Lysine (K) or Arginine (R). Trypsin specifically recognizes and binds these positively charged amino acid side chains via an aspartate residue at the bottom of its binding pocket.
"Veto" Condition: If a proline (Pro, P) residue immediately follows the target K/R residue on its C-terminal side (i.e., at the P1' position), the cleavage activity at that site will be severely inhibited.
Efficiency Impact: If an acidic residue (such as Aspartate, Asp, D; Glutamate, Glu, E) is present in the immediate vicinity of the cleavage site (P2 or P1' position), the rate of cleavage will be slowed down.
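The three rules above can be combined into a small site finder. This is a rule-of-thumb sketch, not a kinetic model: it treats the proline veto as absolute and flags acidic neighbours as "slowed" without estimating a rate. The function name and the test peptides are hypothetical.

```python
ACIDIC = {"D", "E"}


def trypsin_sites(seq):
    """List candidate trypsin sites as (1-based position, residue, note).

    A site is a K or R (the P1 residue) that is not the last residue and is not
    followed by proline at P1'. Sites with D/E at P2 or P1' are marked 'slowed'.
    """
    sites = []
    for i, aa in enumerate(seq):
        if aa not in ("K", "R"):
            continue
        if i == len(seq) - 1:
            continue  # no peptide bond after the C-terminal residue
        p1_prime = seq[i + 1]
        p2 = seq[i - 1] if i > 0 else ""
        if p1_prime == "P":
            continue  # proline veto
        slowed = p1_prime in ACIDIC or p2 in ACIDIC
        sites.append((i + 1, aa, "slowed" if slowed else "normal"))
    return sites


# Hypothetical peptide: K2 is vetoed by P3, R5 is followed by D6, K9 is preceded by E8.
print(trypsin_sites("AKPGRDAEKL"))
print(trypsin_sites("GKAL"))
```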
Any amino acid substitution can affect the overall stability of a protein. To quickly screen out mutation schemes that might severely disrupt protein folding, we predict the change in the free energy of folding (ΔΔG) caused by the mutation. By thermodynamic definition, ΔΔG = ΔG_mutant - ΔG_wildtype. A large positive value (typically > 2 kcal/mol) indicates a significant destabilizing effect and is a warning sign.
To avoid the inherent bias of a single prediction algorithm, we adopted a "consensus approach," using two prediction servers based on different principles simultaneously: CUPSAT (an empirical method based on amino acid atomic potentials and main-chain dihedral angle statistics) and MUpro (a machine learning method based on Support Vector Machines (SVM)), and integrated their results for judgment. As shown in the table below, none of the predicted values approaches the 2 kcal/mol warning threshold, which supports our design.
| Mutation | CUPSAT ΔΔG (kcal/mol) | MUpro ΔΔG (kcal/mol) | Consolidated Analysis and Final Recommendation Grade |
|---|---|---|---|
| L123K | -0.28 | +0.11 | Conflicting results: CUPSAT predicts stabilization, while MUpro predicts very slight destabilization. Considering its prime structural advantages (high SASA, flexible region), the risk is extremely low. |
| L123R | -0.54 | +0.32 | Conflicting results: Same as above. CUPSAT's stabilization prediction is stronger, and MUpro's destabilization prediction is still within a safe range. |
| E121K | -0.20 | -0.54 | Perfect consensus: Both servers predict increased stability. This is the most ideal and least controversial modification plan. |
| E121R | -0.24 | -0.40 | Perfect consensus: Same as above. Reconfirms that E121 is a golden site for modification. |
| I98K | +0.86 | +0.12 | Consensus reached: Both servers predict "destabilizing," but the values are well below the dangerous threshold of 2.0. |
| I98R | +0.58 | +0.30 | Consensus reached: Same as above. Confirms this site's status as a backup option. |
| A92K | -0.43 | -0.02 | Consensus reached: Stability prediction is good. However, it cannot overcome the fatal flaw of its excessively low solvent accessibility (3.63%). |
| A92R | -0.12 | -0.02 | Consensus reached: Same as above. Stability is not the issue; functionality (accessibility) is. |
| L26K | +0.55 | +0.23 | Consensus reached: Both servers predict "destabilizing." |
| L26R | +1.14 | +0.40 | Consensus reached: Same as above. CUPSAT's prediction for L26R (>1.0) further confirms that this site should be abandoned. |
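The consensus logic in the table can be written down as a short rule. The sketch below uses the ΔΔG values from the table and the sign convention stated in the text (positive means destabilizing, with 2 kcal/mol as the warning threshold). The label wording and the function name are choices made here, and the rule deliberately ignores the structural criteria such as solvent accessibility.

```python
WARNING_THRESHOLD = 2.0  # kcal/mol, as stated in the text

# Mutation -> (CUPSAT ddG, MUpro ddG), copied from the table above.
PREDICTIONS = {
    "L123K": (-0.28, 0.11), "L123R": (-0.54, 0.32),
    "E121K": (-0.20, -0.54), "E121R": (-0.24, -0.40),
    "I98K": (0.86, 0.12), "I98R": (0.58, 0.30),
    "A92K": (-0.43, -0.02), "A92R": (-0.12, -0.02),
    "L26K": (0.55, 0.23), "L26R": (1.14, 0.40),
}


def classify(cupsat, mupro):
    """Summarise agreement between the two predictors for one mutation."""
    if max(cupsat, mupro) > WARNING_THRESHOLD:
        return "high risk"
    if cupsat < 0 and mupro < 0:
        return "consensus: stabilizing"
    if cupsat > 0 and mupro > 0:
        return "consensus: destabilizing"
    return "conflicting"


summary = {name: classify(*values) for name, values in PREDICTIONS.items()}
for name, label in summary.items():
    print(f"{name}: {label}")
```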
Based on the site selection results described above, we focused our modification efforts on the C-terminal sites. The specific mutation plan was to replace the original amino acids at these sites with lysine (K) or arginine (R), resulting in four single-point mutants: E121K, E121R, L123K, and L123R.
When evaluating these designs, it is necessary to go beyond simply "introducing a cleavage site" and conduct an in-depth "product-oriented" analysis. The ultimate goal of the project is to generate dipeptides or tripeptides that can be transported by PepT1. The natural C-terminal sequence of human α-LA is ...Cys(120)-Glu(121)-Lys(122)-Leu(123).
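Modification at the E121 site: The sequence will become ...Cys(120)-Lys(121)-Lys(122)-Leu(123) for E121K (E121R places Arg at position 121 instead). This is an ideal result because cleavage after the new residue 121 releases the C-terminal dipeptide Lys(122)-Leu(123), which is a potential substrate for the PepT1 transporter.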
Modification at the L123 site: The sequence will become ...Cys(120)-Glu(121)-Lys(122)-Lys(123). In this case, trypsin is most likely to cleave the peptide bond C-terminal to the naturally occurring Lys122, which would release the single amino acid Lys. Single amino acids cannot be absorbed via the PepT1 pathway, making this a suboptimal design.
Through this product-oriented analysis, an important conclusion can be drawn: modifications at the E121 site are theoretically more efficient at producing the desired dipeptide product, because the new cleavage site sits immediately upstream of the natural Lys122-Leu123 pair, which is then released as a dipeptide. Therefore, from a functional output perspective, E121K/R are the superior options.
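The product-oriented comparison can be checked by enumerating which C-terminal fragment each possible cut would release. The sketch below applies only the basic trypsin rule (cut after K/R unless followed by P) to the four-residue tails discussed above. It lists every possible product and does not model which bond is cleaved first; the function name is hypothetical.

```python
def c_terminal_products(tail):
    """Return the C-terminal fragments released by each possible trypsin cut in the tail."""
    products = []
    for i, aa in enumerate(tail[:-1]):  # the last residue has no bond after it
        if aa in ("K", "R") and tail[i + 1] != "P":
            products.append(tail[i + 1:])
    return products


# Residues 120-123 of the wild type and the four single-point mutants.
tails = {
    "wild type": "CEKL",
    "E121K": "CKKL",
    "E121R": "CRKL",
    "L123K": "CEKK",
    "L123R": "CEKR",
}

for name, tail in tails.items():
    fragments = c_terminal_products(tail)
    pept1_sized = [f for f in fragments if len(f) in (2, 3)]
    print(f"{name}: fragments={fragments}, di/tripeptides={pept1_sized}")
```

Only the E121 variants yield a PepT1-sized fragment (the dipeptide KL); the wild type and the L123 variants can release only a single C-terminal amino acid.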
The screening criteria will comprehensively consider the following factors:
Distance from Key Functional Regions: Even if a site is highly variable, it should be treated with caution if it is in close proximity to the calcium-binding loop or regions related to substrate binding.
Variability Score (Shannon Entropy): A higher score indicates a stronger tolerance for mutation and is a primary screening indicator.
Solvent Accessibility: Residues on the protein surface are more suitable for modification than those buried inside because they can directly interact with the external environment (e.g., proteases, intestinal epithelial cells).
Structural Context: Residues in flexible loop regions are generally easier to modify than those in regular secondary structures (α-helices or β-sheets) because modifications to loops typically have a smaller impact on the overall structure.
Change in Protein Folding Free Energy upon Mutation (ΔΔG): A large positive value (usually > 2 kcal/mol) signals a significant destabilizing effect and is a warning sign.
Characteristics of the Cleavage Products: The desired cleavage products are dipeptides or tripeptides that can be transported by PepT1.
Based on these criteria, E121K/R were ultimately chosen as the final scheme for introducing the trypsin cleavage site.
Key Analysis Metrics:
Global Stability: Calculate the Root Mean Square Deviation (RMSD) of the protein's backbone atoms relative to the initial structure. A stable and converged RMSD trajectory indicates that the protein maintained its overall fold during the simulation.
Local Flexibility: Calculate the Root Mean Square Fluctuation (RMSF) for each residue. By comparing the RMSF profiles of the mutant and wild-type, it is possible to precisely identify which regions have changed in flexibility. This includes not only the mutation site itself but, more importantly, reveals whether the mutation affects the flexibility of distant functional sites (such as the calcium-binding loop) through allosteric effects.
Compactness: Calculate the Radius of Gyration (Rg) of the protein. A stable Rg value indicates that the protein remains compact, whereas a continuous increase may signal a tendency to unfold.
Structural Integrity: Analyze the evolution of secondary structures using algorithms like DSSP to monitor whether key α-helices or β-sheets remain stable throughout the simulation. Also, analyze the occupancy of key hydrogen bonds (especially the hydrogen bond network that maintains the conformation of the calcium-binding loop) to assess whether their stability is affected.
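The first three metrics reduce to short formulas over atomic coordinates. The sketch below is a dependency-free illustration that assumes the frames have already been superimposed on the reference and that all atoms have equal mass. Trajectory analysis tools normally handle alignment and mass weighting, so this is not a substitute for them. The toy coordinates are invented.

```python
import math


def rmsd(frame, reference):
    """Root mean square deviation between two pre-aligned coordinate sets."""
    squared = sum((a - b) ** 2 for p, q in zip(frame, reference) for a, b in zip(p, q))
    return math.sqrt(squared / len(reference))


def radius_of_gyration(frame):
    """Radius of gyration about the geometric centre (equal masses assumed)."""
    n = len(frame)
    centre = [sum(p[k] for p in frame) / n for k in range(3)]
    squared = sum((p[k] - centre[k]) ** 2 for p in frame for k in range(3))
    return math.sqrt(squared / n)


def rmsf(trajectory):
    """Per-atom root mean square fluctuation about each atom's mean position."""
    n_frames = len(trajectory)
    result = []
    for i in range(len(trajectory[0])):
        mean = [sum(frame[i][k] for frame in trajectory) / n_frames for k in range(3)]
        squared = sum((frame[i][k] - mean[k]) ** 2 for frame in trajectory for k in range(3))
        result.append(math.sqrt(squared / n_frames))
    return result


# Toy trajectory: two atoms, two frames; only the second atom moves.
frame0 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
frame1 = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
trajectory = [frame0, frame1]

print(rmsd(frame1, frame0), radius_of_gyration(frame0), rmsf(trajectory))
```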
Analysis from the STRING database indicates a functional interaction between human trypsin and α-lactalbumin, with an interaction score of 0.449 (medium confidence). This provides biological context for our rational design and introduction of a highly efficient enzyme cleavage site.
The docking scores for the two mutants (E121K: -372.37, E121R: -315.40) are both more favorable (more negative) than that of the wild-type (-287.67), which is consistent with our core design principle. Replacing the electrostatically repulsive glutamic acid (E) of the wild-type with a positively charged lysine (K) or arginine (R) is expected to create a strong electrostatic attraction (salt bridge), thereby enhancing the theoretical binding affinity between α-lactalbumin and trypsin. In the static model, E121K shows the most favorable docking score.
The number of hydrogen bonds formed at the binding interface by E121K (12) and E121R (11) is greater than the 7 formed by the wild-type. This is consistent with enhanced binding affinity at the atomic level: the stronger electrostatic attraction pulls the two proteins closer together in a better-matched pose, promoting the formation of a more extensive and stable hydrogen bond network at the interface. This provides a physical basis for the stability of the complex.
The ligand RMSD values for both mutants (E121K: 0.22 nm, E121R: 0.25 nm) are markedly lower than that of the wild-type (0.35 nm). This indicates that trypsin "wobbles" less when bound to the mutants, resulting in a more stable binding pose.
The average catalytic distance for the wild-type (5.8 Å) is outside the ideal range for catalysis. In contrast, both mutants bring this distance into a range compatible with catalysis (E121K: 3.5 Å, E121R: 3.7 Å), positioning the cleavage site favorably relative to the catalytic residues. This suggests that the enzymatic cleavage rates of these two mutants are likely to be higher than that of the wild-type.