
Model-Psi-Blast

 Core Algorithm Principle of PSI-BLAST

PSI-BLAST (Position-Specific Iterated BLAST) is a protein sequence search tool known for detecting remote homologs with significantly higher sensitivity than standard BLAST. That sensitivity comes from an iterative cycle of searching and profile building.

The Iterative Workflow of Traditional PSI-BLAST

Traditional PSI-BLAST operates as a serial, iterative process. Its core innovation is using the results from each search to refine the "probe" for the subsequent search, as visualized below:

PSI-BLAST
  1. Initial Search: The query sequence is searched against the database with standard BLASTP, scored with a fixed substitution matrix (BLOSUM62 by default).
  2. Build PSSM: All hits whose E-value passes the significance threshold are extracted and aligned with the query sequence to build a Multiple Sequence Alignment (MSA). From this MSA, a Position-Specific Scoring Matrix (PSSM) is calculated. The PSSM is an L x 20 matrix (where L is the query length and 20 is the number of standard amino acids) that records the preference for each amino acid at every position of the query. Unlike a fixed substitution matrix, the PSSM incorporates evolutionary information from the aligned sequences, so conserved sites are scored differently from variable ones.
  3. Iterative Search: The newly built PSSM is used as the new "query" to search the database again. Because the PSSM is more sensitive than a single sequence, this often identifies more distantly related homologs.
  4. Convergence Check: The process of "building a PSSM and searching" is repeated until no new significant sequences are found, or a predefined maximum number of iterations is reached.
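The "Build PSSM" step above can be illustrated with a minimal sketch. It assumes a gap-aware column count, a uniform background distribution, and a fixed pseudocount of 1 per amino acid; all of these are simplifications chosen for illustration. The real PSI-BLAST additionally weights sequences and uses data-dependent pseudocounts, which this sketch does not attempt to reproduce. The function name `build_pssm` is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(msa, background=None, pseudocount=1.0):
    """Return an L x 20 matrix of log-odds scores (in bits) from an MSA.

    msa: list of equal-length aligned sequences; '-' marks a gap.
    background: amino-acid background frequencies (assumed uniform if None).
    pseudocount: count added to every amino acid in every column (assumption).
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    pssm = []
    for col in range(length):
        # Count only standard residues; gaps and unknown symbols are skipped.
        counts = Counter(seq[col] for seq in msa if seq[col] in background)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = []
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            row.append(math.log2(freq / background[aa]))
        pssm.append(row)
    return pssm


# Toy alignment: position 0 is fully conserved, positions 1 and 2 vary.
toy_pssm = build_pssm(["ACD", "ACE", "AGD"])
```

In the toy alignment, the conserved alanine at position 0 receives the highest score in its row, while amino acids never observed in that column receive negative scores.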

 Innovation: Distributed PSI-BLAST for Massive Databases

Conventional PSI-BLAST faces significant computational and memory bottlenecks with massive datasets. To overcome this limitation, we have designed and implemented a distributed, sharded PSI-BLAST analysis pipeline.

Our core philosophy is "Divide and Conquer, Unify and Learn." The specific workflow is as follows:

Phase I: Parallel First-Round Search

Sharding

The complete protein database is intelligently partitioned into multiple manageable, non-overlapping sub-databases (shards).
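As a minimal sketch of the sharding step, the code below reads FASTA records and distributes them round-robin into non-overlapping shards. The partitioning criterion used in the actual pipeline is not specified here, so round-robin assignment is only a stand-in, and the helper names `read_fasta` and `shard_records` are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shard_records(records, num_shards):
    """Assign each record to exactly one shard (round-robin, an assumption)."""
    shards = [[] for _ in range(num_shards)]
    for i, record in enumerate(records):
        shards[i % num_shards].append(record)
    return shards


# Toy database with three sequences split into two shards.
toy_fasta = [">a", "MK", "LV", ">b", "GG", ">c", "AA"]
toy_shards = shard_records(read_fasta(toy_fasta), 2)
```

Every record lands in exactly one shard, so the union of the shards reproduces the original database without overlap.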

Parallel Computation

Each shard is assigned to an independent computing node. All nodes simultaneously run a BLASTP search (the first iteration of PSI-BLAST) against their respective shards using the same query sequence.
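The per-shard search could be launched along the following lines. This is a sketch, not the pipeline's actual launcher: it runs the searches as local subprocesses in a thread pool rather than on separate computing nodes, and the file and database names are placeholders. The `blastp` options used (`-query`, `-db`, `-evalue`, `-outfmt`, `-out`, `-dbsize`) are standard BLAST+ options. Passing `-dbsize` is an optional assumption of this sketch: because E-values scale with the size of the search space, setting the effective database size to that of the full database makes E-values from different shards comparable.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def blastp_command(query, shard_db, out_path, evalue, full_db_letters=None):
    """Build a blastp command line for one shard (tabular output, format 6)."""
    cmd = [
        "blastp",
        "-query", query,
        "-db", shard_db,
        "-evalue", str(evalue),
        "-outfmt", "6",
        "-out", out_path,
    ]
    if full_db_letters is not None:
        # Optional: report E-values as if the full database had been searched.
        cmd += ["-dbsize", str(full_db_letters)]
    return cmd


def run_all(commands, max_workers=4):
    """Run all shard searches concurrently and return their exit codes.

    Requires BLAST+ to be installed; local threads stand in for compute nodes.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda c: subprocess.run(c, check=False), commands))
    return [r.returncode for r in results]


# Placeholder shard names; one command per shard, same query for all.
commands = [
    blastp_command("query.fasta", f"shard_{i}", f"shard_{i}.tsv", 1e-23)
    for i in range(3)
]
```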

Advantage

Because the shards are searched independently, this step parallelizes with no communication between nodes, which drastically reduces the initial "data filtering" time and overcomes single-machine memory constraints.

Phase II: Centralized Iterative Refinement

Result Integration

Results from the first round are collected from all nodes, merged, and deduplicated to form a global, high-quality set of candidate homologous sequences.
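A minimal sketch of the merge-and-deduplicate step follows. It assumes each node writes BLAST tabular output in the default format 6 column order, where the subject ID is the second column and the E-value is the eleventh, and it keeps the lowest E-value seen for each subject. The function name `merge_hits` and the optional `max_evalue` filter are hypothetical.

```python
def merge_hits(shard_outputs, max_evalue=None):
    """Merge tabular BLAST hits from all shards, one entry per subject.

    shard_outputs: iterable of iterables of tab-separated lines (outfmt 6).
    Returns a list of (subject_id, best_evalue) sorted by ascending E-value.
    """
    best = {}
    for lines in shard_outputs:
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            subject, evalue = fields[1], float(fields[10])
            if max_evalue is not None and evalue > max_evalue:
                continue
            # Deduplicate: keep only the most significant hit per subject.
            if subject not in best or evalue < best[subject]:
                best[subject] = evalue
    return sorted(best.items(), key=lambda item: item[1])


# Toy input: subject P1 appears twice in the first shard's output.
shard_1 = [
    "q\tP1\t90\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200",
    "q\tP1\t50\t40\t20\t0\t1\t40\t5\t44\t1e-5\t50",
]
shard_2 = ["q\tP2\t80\t100\t20\t0\t1\t100\t1\t100\t1e-40\t250"]
merged = merge_hits([shard_1, shard_2])
```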

Global PSSM Construction

A Multiple Sequence Alignment is built from this global set, which is then used to generate a global PSSM. This PSSM integrates homologous signals from all shards, making it more comprehensive and accurate than any PSSM built from a single shard.

Iterative Search

Subsequent PSI-BLAST iterations use this global PSSM to search the full database, continuously updating the PSSM until convergence is achieved.
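The refinement loop and its stopping rule can be sketched as below. The search itself is injected as a callable, `search_with_profile`, which stands in for rebuilding the PSSM from the included sequences and searching the full database; it is a hypothetical placeholder, as is the default of five iterations.

```python
def iterate_until_converged(initial_hits, search_with_profile, max_iterations=5):
    """Repeat profile search until no new sequences appear or the cap is hit.

    initial_hits: sequence IDs included after the first round.
    search_with_profile: callable taking the included IDs and returning the
        IDs of significant hits (placeholder for PSSM rebuild plus search).
    Returns (included_ids, iterations_run, converged).
    """
    included = set(initial_hits)
    for iteration in range(1, max_iterations + 1):
        hits = set(search_with_profile(included))
        new_hits = hits - included
        if not new_hits:
            # Converged: this round added nothing to the profile.
            return included, iteration, True
        included |= new_hits
    return included, max_iterations, False


# Toy stand-in: each included sequence "reveals" its neighbors.
neighbors = {"q": {"a"}, "a": {"b"}, "b": set()}


def toy_search(included):
    found = set(included)
    for seq_id in included:
        found |= neighbors.get(seq_id, set())
    return found


final_set, rounds, converged = iterate_until_converged({"q"}, toy_search)
```

In the toy example the profile grows by one sequence in each of the first two rounds, and the third round finds nothing new, so the loop stops there.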

 Key Technology: Biologically-Driven E-value Threshold

In a distributed architecture, sequence composition can vary between shards (e.g., one shard might be rich in bacterial sequences, another in eukaryotic sequences). A uniform E-value threshold might therefore cause biologically relevant remote homologs to be missed in some shards. To address this, we developed a novel E-value threshold calibration method based on prior biological knowledge, which sets the search sensitivity separately for each shard.

The essence of this method is transforming the E-value from a fixed statistical parameter into a dynamic variable linked to biological objectives. The calibration process for a single shard is shown below:

Cast a Wide Net

For a given shard, a BLAST search is performed using a permissive E-value cutoff (e.g., 1e-23 [2]). The goal is to capture all potential sequence signals.

Validate

All candidate sequences are validated using NCBI BLAST against a comprehensive reference database to confirm biological identity (e.g., "kinase" or "receptor") and record E-values.

Calibrate

Identify the maximum (least stringent) E-value among the sequences confirmed as "target proteins". This value is the "most permissive threshold" that still retains every biologically relevant homolog in the shard.

Apply

The calibrated E-value is formally set as the fixed threshold for that shard in the subsequent distributed PSI-BLAST analysis (Table 1-1).

Table 1-1

Indicator    E value
A            5.64e-28
B            3.61e-30
C            2.67e-41
D            3.34e-32
E            1.06e-26
F            8.52e-24
G            1.14e-24
H            6.90e-33
I            1.93e-30
J            1.03e-33
K            8.26e-27
L            2.14e-26
M            2.78e-24
N            2.92e-36
O            1.03e-37
P            1.45e-27
Q            2.41e-26
R            1.97e-29
S            2.66e-28
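The calibration rule described above (take the largest E-value among validated target proteins in a shard) can be sketched as follows. The candidate records, their IDs, and their E-values are invented purely for illustration, and the function name `calibrate_threshold` is hypothetical.

```python
def calibrate_threshold(candidates):
    """Return the largest E-value among validated targets, or None.

    candidates: iterable of (sequence_id, evalue, is_target) tuples, where
        is_target is True if validation confirmed the biological identity.
    """
    target_evalues = [evalue for _, evalue, is_target in candidates if is_target]
    return max(target_evalues) if target_evalues else None


# Illustrative values only; not taken from any real shard.
toy_candidates = [
    ("seq1", 2.0e-40, True),
    ("seq2", 7.5e-29, True),
    ("seq3", 3.0e-24, False),  # passed the wide net but failed validation
]
shard_threshold = calibrate_threshold(toy_candidates)
```

Here the threshold is set by `seq2`, the weakest hit that was still confirmed as a target, while the unconfirmed `seq3` does not loosen the cutoff. Returning `None` when no target is confirmed leaves the choice of a fallback threshold to the caller.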

 References


1. Schäffer AA, et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001;29(14):2994-3005. doi: 10.1093/nar/29.14.2994.

2. Seo H, Hong H, Park J, Lee SH, Ki D, Ryu A, et al. Landscape profiling of PET depolymerases using a natural sequence cluster framework. Science. 2025 Jan 3;387:eadp5637. doi: 10.1126/science.adp5637.
