Engineering Success

Engineering is the art of turning imagination into innovation.


"Accurately mining polyamine synthases from metagenomes through iterative computational cycles."

Introduction

Converting massive, unstructured environmental sequencing data into a small set of high-confidence functional genes is a classic bioinformatics engineering problem. Our goal was to mine polyamine synthases from hot-spring metagenomes, and to that end we designed and implemented a multi-stage, iterative Design-Build-Test-Learn (DBTL) cycle. This narrative shows in detail how we overcame each obstacle through successive optimizations and ultimately completed gene mining and functional inference at the computational level.

DBTL Cycle 1: Building the baseline metagenomic pipeline and the setbacks of first exploration

1) Design:

Our initial design followed the classical paradigm of metagenomic analysis: obtain an assembly that is as complete and contiguous as possible, then perform gene prediction and functional annotation on top of it. We chose a combination of tools widely considered robust: FastQC for visual diagnosis of raw read quality, and Trimmomatic for quality control, trimming low-quality bases and adapters with its flexible parameters. As the assembler we chose metaSPAdes, which has been widely reported to perform well on complex microbial community data. At the gene-mining level our strategy was simple and direct: Prodigal for gene prediction, followed by BLASTP alignment of the predicted protein sequences against the large NCBI non-redundant (nr) protein database, and finally screening of the results with keywords such as 'polyamine synthase' and 'spermidine synthase'. We believed this design covered the whole path from raw data to target annotation.

2) Build:

The build phase involved a large amount of command-line work and pipeline scripting. We first set up the bioinformatics working environment and installed all required software and dependencies. We then wrote a Shell script chaining FastQC and Trimmomatic, setting Trimmomatic's initial parameters to ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36, the standard values recommended by many tutorials at the time. The quality-controlled reads were passed to metaSPAdes for assembly with the --meta option, which tunes it for metagenomic data. After assembly we ran Prodigal on the resulting contigs for gene prediction, using the -p meta option for metagenome mode. Finally, we wrote a batch BLAST script that searched the tens of thousands of predicted protein sequences against a locally built copy of the nr database and summarized all hits into one large report.
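The final keyword-screening step of this first pipeline can be sketched as follows. This is a minimal illustration rather than our production script: it assumes a hypothetical tabular BLASTP report produced with -outfmt "6 qseqid sseqid pident evalue stitle", where the last column carries the subject annotation that is searched for keywords.

```python
# Minimal sketch of the keyword screen over batch BLASTP output.
# Assumed (hypothetical) column order per tab-separated line:
# qseqid, sseqid, pident, evalue, stitle.
KEYWORDS = ("polyamine synthase", "spermidine synthase")

def screen_hits(lines, keywords=KEYWORDS):
    """Yield (query_id, subject_title) for hits whose annotation
    contains any target keyword (case-insensitive)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed lines
        qseqid, stitle = fields[0], fields[4]
        if any(kw in stitle.lower() for kw in keywords):
            yield qseqid, stitle

# Example with two mock hits: only the first passes the screen.
hits = [
    "q1\tWP_0001\t35.2\t1e-30\tputative spermidine synthase",
    "q2\tWP_0002\t28.0\t1e-05\thypothetical protein",
]
print(list(screen_hits(hits)))
```

As the Test phase below shows, the weakness of this screen is not the code but the input: most nr annotations never contain the keywords at all.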

3) Test:

The test phase brought unexpected setbacks. First, when we evaluated the assembly with QUAST, the key metric N50 was only 2.3 kb and the number of contigs ran into the hundreds of thousands. The assembly was therefore highly fragmented, with many complete genes likely split across different contigs, undermining any subsequent mining of full-length genes. Second, and more seriously, when we analyzed the BLAST results we found that very few sequences passed the keyword screen, and the annotations of most hits were extremely vague, such as 'putative transferase', 'hypothetical protein' or 'DUF1234 domain-containing protein'. Such annotations gave no definite functional direction, making it impossible to reliably identify true polyamine synthases among tens of thousands of sequences. The signal-to-noise ratio of our initial mining strategy was so low that the approach was effectively a failure.
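The N50 statistic reported by QUAST is simple to state but easy to misread, so a small reference implementation may help (an illustration of the definition, not QUAST's own code):

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    covered = 0
    for length in lengths:
        covered += length
        if covered >= half_total:
            return length
    return 0  # empty input

# Two assemblies of equal total size: the fragmented one has a
# much lower N50, which is exactly the problem we observed.
print(n50([10_000, 8_000, 6_000]))      # few long contigs
print(n50([3_000] * 4 + [2_000] * 6))   # many short contigs
```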

4) Learn:

The first cycle taught us a hard but valuable lesson. We recognized that the default parameters of a standard pipeline do not suit every data set: our hot-spring samples have high microbial diversity and are rich in unknown taxa, so they demand more stringent preprocessing to improve assembly quality. More importantly, we realized that in the ocean of metagenomic 'dark matter', relying on sequence-based annotations and simple keyword screening is naive and inefficient. Annotations in public databases may themselves be inaccurate or non-specific, and this kind of 'annotation based on annotation' propagates errors on a large scale. We had to abandon passive screening in favor of a more targeted, active search strategy. We also confirmed that improving assembly quality underpins all downstream analysis and must be prioritized.

DBTL Cycle 2: Targeted optimization and the shift to an active search strategy

1) Design (Redesign):

Building on the lessons of the first round, we carried out a thorough redesign with two clear goals: obtain a higher-quality metagenomic assembly, and establish a precise BLAST mining workflow that sharply reduces false positives. For the assembly problem we adopted a two-pronged strategy. First, we tightened quality control, changing the Trimmomatic parameters to the stricter SLIDINGWINDOW:5:25 MINLEN:80 to remove low-quality reads more thoroughly. Second, we introduced MEGAHIT, an efficient and memory-frugal metagenomic assembler, as a second assembler, planning to pick the final assembly by comparing metaSPAdes and MEGAHIT on N50, contig length distribution and other metrics. For the core mining problem we designed a fundamentally new strategy: a manually curated, high-quality reference database. Instead of relying on the massive and noisy annotations of nr, we downloaded experimentally verified, clearly annotated polyamine synthase protein sequences from UniProtKB and built a small, high-confidence custom BLAST database.
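One routine part of curating such a reference set is removing exact duplicate sequences before database construction. A minimal sketch of that deduplication step (the simplified FASTA parsing and the function name are ours, for illustration only):

```python
def dedup_fasta(lines):
    """Parse simple FASTA lines and return (header, sequence) pairs,
    keeping only the first record for each distinct sequence."""
    records, header, seq = [], None, []

    def flush():
        if header is not None:
            records.append((header, "".join(seq)))

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            flush()
            header, seq = line, []
        else:
            seq.append(line)
    flush()

    seen, unique = set(), []
    for h, s in records:
        if s not in seen:
            seen.add(s)
            unique.append((h, s))
    return unique

# Two records share the same sequence; only the first is kept.
print(dedup_fasta([">sp|A", "MKV", ">sp|B", "MKV", ">sp|C", "MAA"]))
```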

2) Build (Rebuild):

We rebuilt the entire bioinformatics pipeline. The new quality-control and dual-assembly steps were encoded in a Snakemake workflow for automation and reproducibility. For the database, we downloaded more than 200 functional polyamine synthase sequences from different species from UniProt and built a dedicated database with the makeblastdb command. We then ran BLASTP searches of all proteins predicted from the optimized assembly against this custom 'gold standard' database, applying relatively strict thresholds of E-value < 1e-10 and identity > 40% in order to retain homologues conserved in the functional core region.
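The E-value and identity cutoffs amount to a small filter over tabular BLASTP output. The sketch below mirrors the thresholds named above; the column layout (a hypothetical -outfmt "6 qseqid sseqid pident evalue") is illustrative rather than our exact script:

```python
MIN_IDENTITY = 40.0   # percent identity cutoff from the text
MAX_EVALUE = 1e-10    # E-value cutoff from the text

def filter_candidates(lines, min_ident=MIN_IDENTITY, max_evalue=MAX_EVALUE):
    """Return query IDs of hits passing both thresholds."""
    passed = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed lines
        qseqid, _sseqid, pident, evalue = fields[:4]
        if float(pident) > min_ident and float(evalue) < max_evalue:
            passed.append(qseqid)
    return passed

mock = [
    "g1\tsp|P0001\t52.3\t3e-45",  # passes both cutoffs
    "g2\tsp|P0002\t38.0\t1e-20",  # identity too low
    "g3\tsp|P0003\t55.0\t1e-06",  # E-value too high
]
print(filter_candidates(mock))
```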

3) Test (Retest):

The optimization paid off immediately. On assembly quality, the stricter quality control raised the N50 of the metaSPAdes assembly to 5.8 kb, and MEGAHIT also performed well; we ultimately selected the more contiguous metaSPAdes assembly for downstream analysis. On gene mining, the new BLAST strategy brought a qualitative leap: compared with the hundreds of fuzzy hits of the first round, this round returned only 27 high-confidence candidate sequences, each showing significant similarity to one or more polyamine synthases of known function. The results list became clear, manageable and biologically meaningful.

4) Learn (Relearning):

This round's success made us appreciate the true meaning of 'quality over quantity' in bioinformatics. A carefully curated small database is far more valuable than a large but noisy general one; by actively defining the search space, we took the initiative in mining. Likewise, strict quality control is essential for downstream analysis and well worth the time invested. Still, alongside the satisfaction we remained cautious: sequence similarity, even to enzymes of known function, is only indirect functional evidence. To give our candidate genes a higher chance of success before entering time-consuming, labor-intensive wet-lab verification, we needed stronger and more direct evidence for their putative function. This led us into the next, more innovative cycle.

DBTL Cycle 3: From sequence similarity to structural conservation — computational verification of function

1) Design (Redesign):

In this round our goal was to elevate the functional evidence for the candidate genes from one-dimensional sequence space into three-dimensional structural space. We planned to use the protein structure prediction tool AlphaFold2 to build high-accuracy three-dimensional models of our candidates. The core hypothesis: if an unknown sequence folds into a structure highly similar to known polyamine synthases, and in particular its active-site region is conserved, it is very likely to carry out a similar biochemical function. We selected the five candidate genes with the highest sequence diversity for this analysis. The verification scheme was to compare each predicted structure with a known polyamine synthase crystal structure downloaded from the PDB, using the root-mean-square deviation (RMSD) as a quantitative measure of superposition accuracy.

2) Build (Rebuild):

We deployed AlphaFold2 on a computing server equipped with a high-performance GPU. For each candidate sequence we ran the complete prediction workflow, which searches large sequence databases for homologues to build a multiple sequence alignment and then performs attention-based deep-learning structure modeling. Each prediction yielded five models together with per-residue confidence scores. We performed the structural alignment in PyMOL: superimpose each predicted model onto the reference crystal structure, compute the RMSD over Cα atoms, and focus the analysis on the atomic-level spatial arrangement of the substrate-binding pocket and the catalytic residues.
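For reference, the Cα RMSD reported by such an alignment is defined over paired atoms of the two structures after superposition. A minimal pure-Python sketch of the metric (assuming the coordinates have already been optimally superimposed, as PyMOL does before reporting RMSD):

```python
from math import sqrt

def ca_rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) Cα
    coordinates, assumed already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two equal-length, non-empty coordinate lists")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return sqrt(total / len(coords_a))

# Identical coordinates give 0; a single 3-4-5 displacement gives 5.
print(ca_rmsd([(0, 0, 0), (1, 1, 1)], [(0, 0, 0), (1, 1, 1)]))
print(ca_rmsd([(0, 0, 0)], [(3, 4, 0)]))
```

In practice PyMOL also performs the rotational/translational fit itself; this sketch covers only the distance calculation that follows.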

3) Test (Retest):

The results exceeded our expectations. All five models were predicted with high confidence, with pLDDT scores above 85. Structural alignment showed striking conservation of the core catalytic domain relative to the reference templates, with RMSD values between 1.5 Å and 2.0 Å, firmly in the highly similar range. In PyMOL's viewer we could clearly see the amino acid residues forming the active center overlapping almost perfectly in three-dimensional space, even though their one-dimensional sequence identity to the references was only around 40-50%. This strongly suggests that the candidate genes are not merely related to polyamine synthases in sequence but also possess the structural prerequisites for catalyzing polyamine synthesis.

4) Learn (Relearning):

This final cycle lifted the project to a new level. We learned that protein three-dimensional structure is a more ancient and more conserved indicator of function than linear sequence. AlphaFold2's power lets us 'see' the probable function of a candidate gene at near-atomic resolution before any wet experiment, greatly reducing the blindness of subsequent research. We successfully upgraded a screening process based on indirect evidence (BLAST similarity) into a computational verification process that supplies direct structural evidence. This combined 'BLAST primary screen + structural verification' strategy has become our team's powerful and reliable paradigm for mining functional genes from metagenomes.

Conclusion

Through these three progressive, self-correcting DBTL loops we completed an engineering journey from data to knowledge. We not only obtained several computationally validated polyamine synthase candidate genes as valuable biological parts for this project, but, more importantly, we established and optimized a standardized computational workflow for mining functional genes from complex environments. The workflow began with a failed traditional approach, was improved by raising data quality and adopting an active search strategy, and finally reached a new level of functional inference with the help of cutting-edge structural biology AI tools. This embodies the core engineering spirit of synthetic biology: define the problem, iterate the design, and integrate frontier technologies to solve biological questions. The high-confidence computational candidates have now entered experimental verification by cloning and in vitro enzyme activity assays, to complete the transformation from 'in silico parts' to 'in vitro parts' and lay a solid foundation for the construction of our thermophilic cell factory.

Figures


Fig.1 Examples of code errors encountered during pipeline construction


Fig.2 Phylogenetic tree of polyamine synthases and related enzymes

A. Arginine decarboxylase; B. Ornithine decarboxylase; C. SAM-decarboxylase;
D. Spermidine synthase; E. Aspartate semialdehyde dehydrogenase; F. Aspartate kinase.


Fig.3 Structural comparison of polyamine synthases and related enzymes

(A-B) Ornithine decarboxylase; (C-D) SAM-decarboxylase; (E-J) Spermidine synthase;
(K) Aspartate kinase; (L) Aspartate semialdehyde dehydrogenase.