Software

Overview

Video 1: An introduction to the SynProtease software, covering its highlights, features, and usage process

SynProtease is a sequence-level evaluation tool designed for the early stages of protein engineering. It primarily supports two common types of design decisions:

Assessing the interface compatibility and overall stability of Protein1–Linker–Protein2 fusion constructs;
Comparing the feasibility of multiple candidate protease recognition sites within a target protein and prioritizing among them (illustrated here with the ENLYFQ/TEV motif).

The core principle of SynProtease is to provide interpretable, reproducible, and operationally useful quantitative evidence before three-dimensional structures or wet-lab data are available, thereby helping teams set priorities ahead of cloning–expression–purification workflows.

Highlights

Interpretability at the core

All scores are derived from well-defined physicochemical descriptors. Sources of metrics are transparent, the calculation process is clear, and both reviewers and team members can trace and verify intermediate results.

Reliability of the program

SynProtease relies on primary sequences and classical physicochemical property tables—flexibility, hydrophobicity, secondary structure propensity, disorder tendency—that have been validated over decades in literature and engineering practice. These are normalized, weighted, and integrated, then mapped through explicit thresholds into actionable categories such as “High / Medium / Low,” providing reproducible decision criteria and trustworthy quantitative evidence.

Low barrier to use

Only the primary sequence and a few parameters (e.g., window size, recognition motif, cleavage index) are required. The complete analysis can be run directly in a command-line environment, with all necessary figures and result files exported automatically.

Decision-oriented outputs

Results include bar plots, radar charts, distribution plots, heatmaps, and a comprehensive report. These outputs explicitly support “Go / Consider / No-go” style design decisions, enabling rapid identification of viable candidates.

Reproducible evidence chain

Each run generates consistently named PNG figures and text summaries, ready to be incorporated into Wikis, lab notebooks, presentation slides, or manuscripts—ensuring a clear, auditable record of decision-making.

Ease of extension

Weights, window sizes, and recognition motifs are all openly configurable, enabling gradual calibration with team-specific experimental data and continuous improvement of predictive accuracy and applicability.

Applications

For usage and our complete documentation, please visit our GitLab repository!

User groups

Research teams that need to rapidly evaluate the rationality of protein design choices;
Investigators seeking to formalize “rules of thumb” into standardized screening workflows;
Teaching groups that need to guide material selection or project scoping.

Fusion construct design selection

Module A decomposes interface risks into four measurable features—flexibility, hydrophobic balance, interfacial charge, and secondary structure propensity. It is applicable to evaluating linker length and chemical composition (e.g., G/S repeats), deciding N- vs C-terminal tag placement, and determining whether to introduce or remove charged residues at interfaces. Teams can compare scores under “fixed protein / varied linker” or “fixed linker / varied protein” conditions, and use thresholds at 0.4 / 0.6 / 0.8 to decide whether to abandon, iterate, or advance.

Cleavage site planning

Module B assigns a composite score (0–1) to all candidate recognition motifs in a target sequence and ranks them. The scoring scheme emphasizes P1′ preference (weight 0.35), with hydrophilicity (0.25), flexibility (0.20), and disorder propensity (0.20) capturing accessibility. This guides the choice of primary test sites or the need for point mutations to improve cleavability.

Iterative design and local optimization

Bar plots, radar charts, and heatmaps identify which domains or residues act as bottlenecks (global vs local issues), thereby guiding specific point mutations or linker replacements.

Resource prioritization and risk management

With ranked candidate lists generated in minutes, teams can quickly identify cloning and expression priorities and reduce investment in experiments that are unlikely to succeed. Consistently named figures and summaries form verifiable decision records, supporting defense presentations and internal reviews.

Teaching and knowledge transfer

Because weights and intermediate variables are transparent, SynProtease is naturally suited for reports, demonstrations, and teaching. As a training tool, it helps new members understand the direct link between sequence, physicochemical properties, and design choices, while embedding team experience into reproducible workflows.

Design Rationale

Starting point of design

From the user’s perspective, we structured the algorithm around two central questions:

Is the interface of the fusion construct stable when joined?
Can the protease efficiently cleave at the designated site?

At the sequence level, both issues can be effectively captured by a set of simple yet reliable physicochemical features. SynProtease addresses them by calculating four measurable indices and upholding principles of interpretability and traceability.

Core principle

“Design feasibility” is decomposed at the sequence level into a set of measurable physicochemical signals. Each feature is calculated using transparent methods so that the final score retains both engineering relevance and auditability.

Module A: Fusion protein stability (four guiding rules)

Given the amino acid sequence of a specific linker and a user-defined flanking window, SynProtease evaluates four influencing factors:

Whether the linker is sufficiently flexible;
Hydrophobicity of the linker and the hydrophobic similarity between its two termini;
Whether the linker’s overall charge is near neutral and whether the charge distribution at the interface is complementary;
The linker’s secondary structure propensity and tendency toward complexity.

Each feature was chosen based on its long-standing validation in literature and engineering practice. Calculations use established empirical scales—for example, flexibility (Karplus–Schulz), hydrophobicity (Kyte–Doolittle), and secondary structure propensity (Chou–Fasman). SynProtease extracts sequence windows from both termini (according to user input), evaluates them together with the linker for these four features, normalizes the results, and integrates them into a 0–1 composite score. Visualizations are generated in parallel. From the user’s viewpoint, the principle is straightforward: if any of the four rules is problematic, it will be highlighted in the charts, prompting timely parameter adjustments.

Module B: Protease cleavage efficiency (four signals)

Using TEV protease as an example, centered on the ENLYFQ motif, SynProtease evaluates four signals within a weighted window:

Suitability of the P1′ residue: the key determinant of the upper limit of cleavage efficiency;
Local hydrophilicity: higher values favor solvent exposure and accessibility to the catalytic pocket;
Local flexibility: greater flexibility facilitates access to the active site;
Local disorder: disordered regions are generally more accessible.

The four signals are normalized to a common scale and combined into a 0–1 efficiency score. Results are presented through visualizations. If cleavage efficiency is suboptimal, users can identify the limiting factor: whether the P1′ residue itself is unfavorable, or if the surrounding region is too hydrophobic or too ordered. This supports decisions between site replacement and local point mutation.

Algorithm

This section outlines the main implementation steps, normalization strategies, and quality-control points, to allow reviewers to assess methodological validity.

Data and parameters

Feature calculations rely on well-established scales: Kyte–Doolittle (hydrophobicity), Karplus–Schulz (flexibility), and Chou–Fasman (secondary structure propensity), combined with charged-residue counts at pH 7.4. Weights, window sizes, and thresholds are explicit parameters accessible in program settings. The contents of this document have been cross-checked against the current code version.

In Module A, predict_stability() first calls _construct_fusion_protein() to concatenate the C-terminal end of Protein1, the N-terminal start of Protein2, and the linker according to the specified windows. It then calculates four features:

calculate_linker_flexibility() computes the mean Karplus–Schulz flexibility for the linker;
calculate_hydrophobicity_balance() combines (i) hydrophilicity of the linker (inverse Kyte–Doolittle) and (ii) similarity of hydrophobicity between the two flanking regions (smaller difference is preferred);
calculate_interface_charge() counts R/K/H as positive, D/E as negative, others neutral at pH 7.4; it rewards “near-neutral linker + complementary flanking charges”;
calculate_secondary_structure() identifies the maximum α/β propensity and inverts it to penalize strong structural tendencies.

The four features are weighted by default (0.25 / 0.30 / 0.20 / 0.25) and mapped to categorical stability levels.
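As a rough illustration of two of the quantities named above, the following sketch computes an inverse Kyte–Doolittle hydrophilicity score and a residue-count net charge. The names `hydrophilicity_score` and `net_charge` are hypothetical and are not taken from the SynProtease code. The sketch does not show how the program combines linker hydrophilicity with flanking similarity, nor how it rewards complementary flanking charges.

```python
# Kyte-Doolittle hydropathy values (published scale; higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
KD_MIN, KD_MAX = -4.5, 4.5  # extremes of the scale (Arg and Ile)


def hydrophilicity_score(seq: str) -> float:
    """Inverse min-max of the mean hydropathy: 1.0 = most hydrophilic."""
    mean_kd = sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
    return (KD_MAX - mean_kd) / (KD_MAX - KD_MIN)


def net_charge(seq: str) -> int:
    """Count R/K/H as +1 and D/E as -1, as described for pH 7.4 above."""
    positive = sum(seq.count(aa) for aa in "RKH")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


linker = "GGGGS"  # a common G/S linker unit, used here only as an example
print(round(hydrophilicity_score(linker), 3))  # 0.553
print(net_charge(linker))                      # 0
```

A G/S linker comes out as mildly hydrophilic and exactly neutral, which is the profile the charge rule above rewards.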

In Module B, predict_cleavage_efficiency() operates as follows:

find_cleavage_sites() locates all recognition sites;
get_site_sequence() extracts centered sequence fragments;
get_center_weights() generates position-specific weights (the cleavage site and adjacent residues are weighted most heavily; for windows ≥5, ±2 positions are also weighted);
calculate_property_score() applies min–max normalization to the four features, with inverse=True for hydrophobicity to reward hydrophilicity.

Final composite weights are 0.35 (P1′ residue), 0.25 (hydrophilicity), 0.20 (flexibility), and 0.20 (disorder). Scores are mapped to text outputs: Low / Moderate / Good / Excellent.
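The normalization and weighting described above can be sketched as follows. This is a minimal illustration, not the program's code: the function names, the clamping to the 0–1 range, and the dictionary keys are assumptions made for the example. Only the four weights and the use of an inverse flag for hydrophobicity come from the description above.

```python
# Composite weights as stated above for Module B.
CLEAVAGE_WEIGHTS = {
    "p1_prime": 0.35,
    "hydrophilicity": 0.25,
    "flexibility": 0.20,
    "disorder": 0.20,
}


def min_max(value: float, lo: float, hi: float, inverse: bool = False) -> float:
    """Scale value from [lo, hi] to [0, 1]; inverse=True flips the direction."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = (value - lo) / (hi - lo)
    scaled = min(1.0, max(0.0, scaled))  # clamping is an assumption of this sketch
    return 1.0 - scaled if inverse else scaled


def composite_score(features: dict) -> float:
    """Weighted sum of four normalized feature scores (each in [0, 1])."""
    return sum(CLEAVAGE_WEIGHTS[name] * features[name] for name in CLEAVAGE_WEIGHTS)


# A mean Kyte-Doolittle hydropathy of -1.5 on the [-4.5, 4.5] scale, inverted so
# that hydrophilic windows score higher.
hydrophilicity = min_max(-1.5, -4.5, 4.5, inverse=True)
print(round(hydrophilicity, 3))  # 0.667
```

Because the weights sum to 1 and every feature is scaled to 0–1, the composite also stays within 0–1.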

The main script runs Module B twice:

The first pass uses a heuristic (cleavage_position+2 window) to generate a concise text report;
The second pass uses the user-specified window to generate full visualizations. Functions include plot_efficiency_distribution(), plot_feature_radar(), plot_feature_heatmap(), and visualize_all().

Workflow

To facilitate reproducibility and verification, this section details user operations, inputs/outputs, and branching logic.

User inputs

The program prompts the user to sequentially provide:

N-terminal sequence,
C-terminal sequence,
Linker sequence,
Target protein sequence,
Window size for linker flanking regions (integer),
Recognition sequence (default: ENLYFQ),
Cleavage site index (entered as a 1-based position).

Internally, cleavage site indices are converted to 0-based indexing.

Module A flow

The system extracts user-defined windows from both termini, merges them with the linker, computes the four stability features, and outputs a composite score. Visualizations include bar plots, radar charts, combined stability plots, and a comprehensive panel. Filenames are fixed, and the working directory is printed in the console.
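The window extraction step can be pictured with a short sketch. The function name `build_interface_region` and its argument order are hypothetical; the program's own `_construct_fusion_protein()` may differ in detail.

```python
def build_interface_region(protein1: str, linker: str, protein2: str, window: int):
    """Return (C-terminal window of protein1, linker, N-terminal window of protein2)."""
    if window < 1:
        # Guard against window == 0: protein1[-0:] would return the whole sequence.
        raise ValueError("window must be a positive integer")
    left = protein1[-window:]   # last `window` residues of the N-terminal partner
    right = protein2[:window]   # first `window` residues of the C-terminal partner
    return left, linker, right


# Illustrative sequences only; they do not correspond to real proteins.
left, linker, right = build_interface_region("MKTAYIAK", "GGGGS", "DEVLKQ", 3)
print(left + linker + right)  # IAKGGGGSDEV
```

If the window is longer than a flanking protein, Python slicing simply returns the whole sequence, so no extra length check is needed in this sketch.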

Module B flow & branching

The system scans the target sequence for exact matches to the recognition motif:

If no sites are detected, the program notifies the user in the console and terminates the cleavage branch;
If sites are detected, for each site the program builds a weighted window and calculates the four features.
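The scan and the index conversion mentioned above could look roughly like this. The helper names are hypothetical. The sketch also assumes that the P1′ residue is the one immediately after the motif, which holds for TEV protease cleaving after the Q of ENLYFQ; SynProtease itself takes the cleavage index as a user input.

```python
def find_motif_sites(sequence: str, motif: str = "ENLYFQ") -> list:
    """Return 0-based start positions of every exact motif match."""
    sites = []
    start = sequence.find(motif)
    while start != -1:
        sites.append(start)
        start = sequence.find(motif, start + 1)  # step by 1 to allow overlaps
    return sites


def p1_prime_residue(sequence: str, site_start: int, motif: str = "ENLYFQ"):
    """Residue right after the motif, or None if the motif ends the sequence."""
    idx = site_start + len(motif)
    return sequence[idx] if idx < len(sequence) else None


def to_zero_based(index_1_based: int) -> int:
    """Convert a user-facing 1-based position to a 0-based index."""
    if index_1_based < 1:
        raise ValueError("1-based positions start at 1")
    return index_1_based - 1


target = "MAENLYFQGSSENLYFQSA"  # made-up sequence containing two motifs
sites = find_motif_sites(target)
if not sites:
    print("No recognition sites found; skipping cleavage analysis.")
for start in sites:
    print(start, p1_prime_residue(target, start))
# 2 G
# 11 S
```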

Two steps follow:

A short window (cleavage_position+2) is used to generate a concise text report;
The user-specified window generates full visualizations, including distribution plots, radar charts, heatmaps, and a comprehensive report.

The site with the highest predicted efficiency is visualized in detail, alongside efficiency distribution plots and multi-site × multi-feature heatmaps.

All images and summaries are saved under the current working directory (or a dedicated results folder). The exact paths are printed in the console, and the program waits for a key press before exiting.

Figures & Reading Guide

This section explains the visual outputs of Modules A and B, describing the purpose of each figure in decision-making and the follow-up actions it supports.

Figure 1. Stability Feature Score Bar Chart

This bar chart presents four core features for assessing linker stability: flexibility, hydrophobicity balance, interface charge complementarity, and secondary structure propensity. Each feature score is not the raw absolute value but has been normalized by mapping the minimum possible value to 0 and the maximum possible value to 1, thereby aligning them on a common axis. This normalization avoids scale discrepancies between different algorithms and allows users to directly compare the relative performance of the four features. The chart helps identify potential weaknesses; for instance, a notably low hydrophobicity balance score would suggest adjustments in linker composition or terminal residues are needed.

Fig.1 Stability Feature Score Bar Chart

Figure 2. Stability Feature Radar Chart

This radar chart translates the four feature scores from Figure 1 into an integrated shape, emphasizing overall balance. Ideally, the axes are relatively equal, forming a near-square or circular shape, which indicates that the design is balanced without obvious weaknesses. If one dimension contracts significantly—such as a low charge complementarity score—the shape reveals a clear concavity, directly highlighting design risks. Compared with the quantitative comparison offered by the bar chart, the radar chart stresses structural balance, helping teams decide whether minor local adjustments (e.g., residue substitutions) are sufficient or whether a major redesign of the linker is necessary.

Fig.2 Stability Feature Radar Chart

Figure 3. Composite Stability Pie Chart

This figure adopts a dual-ring circular design to visualize both the weighted contributions of four stability features and the overall stability level of a fusion construct. The inner circle functions as a conventional pie chart, where the full circle represents 100% of the composite score. The four features—Linker Flexibility, Hydrophobicity Balance, Interface Charge Complementarity, and Secondary Structure Propensity—are sequentially filled clockwise according to their weighted contributions, thereby showing the relative impact of each feature on the final score. In the current version, the weights are defined as follows: Flexibility 25%, Hydrophobicity Balance 30%, Charge Complementarity 20%, and Secondary Structure Propensity 25%. By comparing sector sizes, users can quickly identify which feature dominates the total score and which contributes less. Unlike single-dimension comparisons, this view emphasizes “share within the composite score,” helping users understand the driving factors behind the overall assessment.

The outer ring represents graded thresholds with four reference marks at 0.4, 0.55, 0.7, and 0.8, which divide the spectrum into five levels: unstable, low, medium, high, and extremely high. The final composite score is highlighted along this radius, clearly indicating the category into which the design falls. With the combination of inner and outer layers, users not only gain a direct sense of the overall stability level but can also trace the contribution of individual features. For example, if the composite score falls into the medium range while the pie chart shows Flexibility as disproportionately high, this would suggest that optimization should focus on Hydrophobicity Balance or Charge Complementarity. In essence, this figure integrates “overall level + weighted components,” enabling teams to judge stability levels at a glance while pinpointing specific optimization targets.
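The mapping from composite score to the five levels can be sketched with a sorted cutoff list. The cutoffs and level names are those given above; how the program treats a score that falls exactly on a cutoff is not stated, so this sketch assumes the boundary value belongs to the higher level.

```python
from bisect import bisect_right

STABILITY_CUTOFFS = [0.4, 0.55, 0.7, 0.8]
STABILITY_LEVELS = ["unstable", "low", "medium", "high", "extremely high"]


def stability_level(score: float) -> str:
    """Map a 0-1 composite score to one of the five levels."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("composite score must lie in [0, 1]")
    # bisect_right counts how many cutoffs are <= score, which is the level index.
    return STABILITY_LEVELS[bisect_right(STABILITY_CUTOFFS, score)]


print(stability_level(0.830))  # extremely high
print(stability_level(0.60))   # medium
```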

Fig.3 Composite Stability Pie Chart

In the example shown, the composite score is 0.830, which places the design in the extremely high category, indicating outstanding stability. The weighted contributions of the four features are: Flexibility 0.244, Hydrophobicity Balance 0.208, Charge Complementarity 0.200, and Secondary Structure Propensity 0.178. The contributions are relatively balanced, with no single feature disproportionately weakening the result, further supporting the conclusion of an excellent overall stability level.
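These figures can be checked by hand. Writing $w_i$ for the default weights and $f_i$ for the normalized feature scores (symbols introduced here for illustration only), and reading each reported contribution as $w_i f_i$, the composite score is the sum of the contributions, and each normalized score is recovered by dividing a contribution by its weight:

```latex
\begin{aligned}
S &= \sum_{i=1}^{4} w_i f_i = 0.244 + 0.208 + 0.200 + 0.178 = 0.830,\\
f_{\text{flexibility}} &= 0.244 / 0.25 = 0.976,\\
f_{\text{hydrophobicity}} &= 0.208 / 0.30 \approx 0.693,\\
f_{\text{charge}} &= 0.200 / 0.20 = 1.000,\\
f_{\text{secondary}} &= 0.178 / 0.25 = 0.712.
\end{aligned}
```

On this reading, charge complementarity is at its maximum and hydrophobicity balance is the lowest-scoring feature, although all four remain well above the midpoint.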

Figure 4. Cleavage Site Efficiency Distribution

This chart presents the overall efficiency scores of all candidate cleavage sites within the target protein sequence. The x-axis indicates the site position, while the y-axis shows the weighted composite score normalized to the 0–1 range. Two reference lines are drawn at 0.3 and 0.6, dividing sites into Low Efficiency, Moderate Efficiency, and High Efficiency levels. Each bar is labeled with the corresponding P1′ residue, enabling direct linkage between efficiency and local sequence features. By examining the distribution, users can quickly identify whether there are sites that stand out above the average. For instance, when multiple candidates are present, attention should focus on those exceeding 0.6, which can be prioritized for experimental validation. The core role of this figure is to provide a global overview of efficiency, serving as the foundation for more detailed analysis.
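A minimal sketch of the ranking and bucketing behind this figure is shown below. The site positions and scores are invented for illustration, the function names are hypothetical, and the treatment of scores exactly at 0.3 or 0.6 is an assumption.

```python
def efficiency_level(score: float) -> str:
    """Bucket a 0-1 efficiency score using the 0.3 and 0.6 reference lines."""
    if score < 0.3:
        return "Low Efficiency"
    if score < 0.6:
        return "Moderate Efficiency"
    return "High Efficiency"


def rank_sites(site_scores: dict) -> list:
    """Sort sites by score (best first) and attach the efficiency level."""
    ranked = sorted(site_scores.items(), key=lambda item: item[1], reverse=True)
    return [(pos, score, efficiency_level(score)) for pos, score in ranked]


# Hypothetical scores keyed by site position.
example_scores = {12: 0.42, 87: 0.671, 140: 0.18}
for pos, score, level in rank_sites(example_scores):
    print(pos, score, level)
# 87 0.671 High Efficiency
# 12 0.42 Moderate Efficiency
# 140 0.18 Low Efficiency
```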

Fig.4 Cleavage Site Efficiency Distribution

Figure 5. Cleavage Feature Score Bar Chart

This chart visualizes the four feature scores of a single cleavage site: P1′ preference, hydrophilicity, flexibility, and disorder propensity. Each feature is normalized independently to ensure comparability across different physicochemical scales. By comparing bar heights, users can immediately spot strengths and weaknesses. For example, if the P1′ score is near maximum but flexibility is very low, the recognition signal may be favorable, while local accessibility could be limiting. Although the composite score balances these effects, this figure allows users to pinpoint decisive factors. Its value lies in helping researchers understand how the total score is constructed and whether a site merits experimental trial or further attention.

Fig.5 Cleavage Feature Score Bar Chart

Figure 6. Cleavage Feature Radar Chart

This chart illustrates the overall balance of the four features for a single site. Each axis corresponds to one feature, and higher scores extend further outward. A roughly square shape indicates a balanced profile with no major weaknesses, whereas a significant contraction in one direction—such as low flexibility or disorder—suggests limited accessibility. The radar chart complements the bar chart: while the bar chart emphasizes individual comparison, the radar chart highlights global balance. Through this visualization, teams can quickly judge whether a high-scoring site is a “well-rounded” candidate or one with uneven strengths. Balanced sites are more suitable for direct use, while skewed profiles require careful consideration of context.

Fig.6 Cleavage Feature Radar Chart

Figure 7. Composite Efficiency Pie Chart

This figure decomposes the composite score into its contributing features. The inner circle shows weighted proportions: P1′ preference 35%, hydrophilicity 25%, flexibility 20%, and disorder 20%. For the example site, the weighted contributions are 0.350, 0.155, 0.118, and 0.047, respectively, indicating that efficiency is primarily driven by P1′ preference and hydrophilicity, while lower flexibility and disorder emerge as limiting factors. This breakdown allows users to understand the mechanisms behind the total score rather than relying solely on a single value.

Fig.7 Composite Efficiency Pie Chart

The outer ring defines efficiency levels, segmented at 0.3 and 0.6 into Low Efficiency, Moderate Efficiency, and High Efficiency. The total score is explicitly marked on the radial scale, making it easy to locate the level. Unlike the stability pie chart in Module A, this visualization emphasizes whether a site achieves usable efficiency. Scores below 0.3 indicate negligible utility, 0.3–0.6 suggest an uncertain region requiring caution, and scores above 0.6 highlight candidates worth advancing.

In the example, the site's total weighted score is 0.671, classified as High Efficiency. This indicates strong feasibility for cleavage and supports prioritizing the site in experimental design. At the same time, the inner composition highlights issues to note: while P1′ and hydrophilicity perform strongly, limited flexibility and disorder may influence cleavage under specific conditions. Thus, this chart not only conveys a positive conclusion but also provides nuanced risk awareness, enabling researchers to make more robust and balanced decisions.
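To make the limiting-factor reading concrete, the sketch below divides each weighted contribution from the example by its weight to recover the normalized feature score, then picks the lowest one. The function names and dictionary keys are hypothetical; the weights and contributions are the values quoted for Figure 7.

```python
WEIGHTS = {
    "p1_prime": 0.35,
    "hydrophilicity": 0.25,
    "flexibility": 0.20,
    "disorder": 0.20,
}


def normalized_scores(contributions: dict) -> dict:
    """Recover each 0-1 feature score as contribution / weight."""
    return {name: contributions[name] / WEIGHTS[name] for name in WEIGHTS}


def limiting_feature(contributions: dict) -> str:
    """Name of the feature with the lowest normalized score."""
    scores = normalized_scores(contributions)
    return min(scores, key=scores.get)


example = {
    "p1_prime": 0.350,
    "hydrophilicity": 0.155,
    "flexibility": 0.118,
    "disorder": 0.047,
}
for name, score in normalized_scores(example).items():
    print(name, round(score, 3))
# p1_prime 1.0
# hydrophilicity 0.62
# flexibility 0.59
# disorder 0.235
print(limiting_feature(example))         # disorder
print(round(sum(example.values()), 3))   # 0.67
```

The contributions sum to 0.670, which agrees with the reported total of 0.671 up to rounding of the individual values. Disorder is the weakest signal by a clear margin, so it is the first place to look if cleavage underperforms.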

Scope & Outlook

In its current version, SynProtease already covers two essential aspects of our protein design pipeline: linker stability assessment and protease recognition site cleavage efficiency prediction. This reduces unnecessary trial-and-error in experimental design. However, there remain several boundaries and opportunities for expansion.

Improving predictive accuracy.

At present, modeling relies on conventional physicochemical parameters and empirically derived scoring schemes. While these provide reasonable guidance in most cases, extreme sequences or novel folds may lead to misclassification. Future developments could incorporate larger experimental datasets to capture more complex sequence–structure–function relationships and improve generalization.

Enhancing interpretability of results.

Current outputs already include feature-level scores and visualizations to help users trace score origins. In practice, however, users often want causal-level insights—e.g., how a specific amino acid change will affect stability, or under what structural context cleavage efficiency improves significantly. Future versions could introduce residue-level sensitivity analysis, combined with three-dimensional modeling or molecular dynamics simulations, to offer more fine-grained interpretive frameworks.

Expanding functional modules.

At present, the software focuses on stability and cleavage efficiency. In real fusion protein applications, however, factors such as expression levels, solubility, and membrane localization are also critical. Future expansions could add modules for predicting transcription/translation feasibility, folding likelihood, or solubility risk, thereby building a more comprehensive computational pipeline.

Diversifying application scenarios.

In the future, the tool could be deployed on a web-based platform, integrated with databases and interactive visualization. This would allow researchers without programming expertise to use SynProtease efficiently.

Summary

The current version of SynProtease already delivers practical value, but significant room for improvement remains. Its future development should not be limited to refining single-function accuracy but should also encompass broadening the workflow into a more integrated design pipeline.
