Software Overview

Index

Background

Primary Function of LEAPS-Software

User-Friendly UI Design

An Intuitive UI

Design Details

Implementation of Site Design

Improvements Based on User Feedback

Pursuit of Safety with a Safety Protocol

Dangers of AI in Protein Engineering

Safety Protocol

Disclaimer and Terms of Use

Background

In recent years, the use of AI and machine learning in protein engineering and design has become widespread. However, handling these methods requires a high level of expertise. As a result, protein researchers and student teams who wish to explore new design techniques often face the significant barrier of machine learning. To address this challenge, we have designed and developed LEAPS, a protein engineering model that operates entirely in silico (on a computer) and can achieve multi-objective optimization even with a small amount of data. Although LEAPS is a highly efficient model capable of simultaneously optimizing the properties of a target protein using just 40 assay data points, its complex architecture, which combines multiple machine learning methods, currently limits its use to a small number of researchers. Therefore, to make this technology accessible to a broader audience, we have decided to release LEAPS as an open platform named “LEAPS-Software.”

Primary Function of LEAPS-Software

LEAPS-Software is an open platform that enables researchers and students to efficiently advance their protein design projects without needing specialized knowledge in machine learning or large datasets. It features an intuitive UI and flexible optimization settings, allowing users to switch between multi-objective and single-objective optimization based on their goals.

Multi-objective Optimization

LEAPS-Software can perform multi-objective optimization, weighing the trade-offs between multiple properties while simultaneously enhancing them or adjusting them to fall within a specific range. Users simply input values for properties such as reaction rate, thermal stability, substrate specificity, and optimal pH, and assign a goal to each property: maximize, minimize, or range. The algorithm then automatically generates sequence candidates that satisfy these multiple conditions. For example, it can propose candidates that meet several objectives at once, such as “maximizing reaction rate while maintaining the optimal pH in the neutral range and keeping thermal stability within a specified range.”
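The three goal types can be illustrated with a minimal Python sketch. This is not the LEAPS algorithm itself: the `Goal` class, the `to_cost` and `dominates` helpers, the property names, and the numeric bounds are all hypothetical, and Pareto dominance is assumed here only as one common way to compare candidates under several goals at once.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Goal:
    """One optimization goal for a property (hypothetical structure)."""
    mode: str                                     # "maximize", "minimize", or "range"
    bounds: Optional[Tuple[float, float]] = None  # only used when mode == "range"

    def __post_init__(self):
        if self.mode not in ("maximize", "minimize", "range"):
            raise ValueError(f"unknown goal mode: {self.mode}")
        if self.mode == "range" and self.bounds is None:
            raise ValueError("a 'range' goal needs (low, high) bounds")


def to_cost(goal: Goal, value: float) -> float:
    """Map a predicted property value to a cost where lower is always better."""
    if goal.mode == "maximize":
        return -value
    if goal.mode == "minimize":
        return value
    low, high = goal.bounds
    # Inside the range the cost is 0; outside, it is the distance to the range.
    return max(low - value, 0.0, value - high)


def dominates(goals: Dict[str, Goal], a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if a is at least as good as b on every goal and strictly better on one."""
    costs_a = [to_cost(g, a[name]) for name, g in goals.items()]
    costs_b = [to_cost(g, b[name]) for name, g in goals.items()]
    no_worse = all(x <= y for x, y in zip(costs_a, costs_b))
    better_somewhere = any(x < y for x, y in zip(costs_a, costs_b))
    return no_worse and better_somewhere


# Hypothetical goals mirroring the example in the text above.
goals = {
    "reaction_rate": Goal("maximize"),
    "optimal_pH": Goal("range", (6.5, 7.5)),
    "thermal_stability": Goal("range", (50.0, 60.0)),
}
candidate_a = {"reaction_rate": 12.0, "optimal_pH": 7.0, "thermal_stability": 55.0}
candidate_b = {"reaction_rate": 9.0, "optimal_pH": 8.0, "thermal_stability": 55.0}
print(dominates(goals, candidate_a, candidate_b))
```

Converting every goal to a "lower is better" cost lets maximize, minimize, and range goals be compared with the same rule. Single-objective optimization is then simply the case where `goals` contains one entry.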

Single-objective Optimization

In addition to multi-objective optimization, LEAPS-Software also supports single-objective optimization, which allows for rapid improvements by focusing on a single metric. By selecting one target label and setting its goal to maximize, minimize, or range, the software automatically generates sequence candidates that meet the specified condition. For instance, in cases like “increasing only thermostability” or “improving only substrate specificity,” single-objective optimization enables efficient design modifications in a short amount of time.

User-Friendly UI Design

An Intuitive UI

The core design principle of LEAPS-Software is an “intuitive UI.” We aimed to create a seamless, unidirectional workflow that allows users to proceed from uploading a dataset to reviewing the results without any confusion.

Initially, we considered a UI where users could figure out settings through a dialogue with an LLM. However, we determined that highly flexible inputs could easily lead to deviations from standard procedures and reduce reproducibility, making it unsuitable for our use case. Instead, we adopted a format that presents necessary information in stages, designed to make the next step immediately clear.

Design Details

Every input field is carefully designed from the perspectives of “why,” “where,” “what,” “when,” “who,” and “how.” We reduced the number of decisions users must make by minimizing the choices offered, grouped related items for clarity, and presented only the necessary fields. For processes that take time, a loading indicator is displayed. By maintaining a consistent design and terminology throughout the entire interface, we have lowered the psychological cost for the user.

Implementation of Site Design

State Visualization and Input Assistance

Information is divided into steps using collapsible sections, and the status of a task (waiting, running, success, failure, aborted) is conveyed at a glance with icons and colors. For dataset input, the software automatically detects whether the file is comma-separated or tab-separated, displays a preview in a table format, and shows a short message about header mismatches or missing values directly below the input field. Detailed adjustment options are hidden under “Advanced Settings,” allowing most users to proceed with the default settings.
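The dataset checks described above could look roughly like the following Python sketch. It is an illustration under assumptions, not the software's actual code: the function names, the message wording, and the column names in the example are hypothetical, and the delimiter is guessed by simply counting commas and tabs in the first line.

```python
import csv
import io
from typing import List, Tuple


def detect_delimiter(text: str) -> str:
    """Guess whether the text is comma- or tab-separated from its first line."""
    first_line = text.splitlines()[0] if text else ""
    return "\t" if first_line.count("\t") > first_line.count(",") else ","


def preview_dataset(text: str, required: List[str]) -> Tuple[List[List[str]], List[str]]:
    """Return parsed rows for a table preview plus short messages about problems."""
    delimiter = detect_delimiter(text)
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return rows, ["The file is empty."]
    messages = []
    header = rows[0]
    # Header mismatch: a required column name is absent from the first row.
    missing_columns = [name for name in required if name not in header]
    if missing_columns:
        messages.append("Missing columns: " + ", ".join(missing_columns))
    # Missing values: a row is shorter than the header or has an empty cell.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header) or any(cell.strip() == "" for cell in row):
            messages.append(f"Line {line_no}: missing value")
    return rows, messages


# Hypothetical tab-separated upload with one empty cell and one absent column.
sample = "sequence\tthermal_stability\nMKTAYIAK\t55.2\nMKTAYIAR\t\n"
rows, messages = preview_dataset(sample, ["sequence", "reaction_rate"])
for message in messages:
    print(message)
```

Returning the messages as a list, rather than raising on the first problem, matches the idea of showing every issue directly below the input field.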

Optimization of Result Display and Alerts

Generated sequences and their predicted values are quickly displayed in a list, and a loading indicator provides visual feedback during processing. The safety protocol, which excludes potentially hazardous proteins, only displays the ID of a single protein if a match is found. By using alerts sparingly, we maintain their quality and prevent the problem of users becoming desensitized to frequent warnings.

Improvements Based on User Feedback

We conducted repeated validations based on the criterion of “can it be used as intended without confusion,” simulating actual use cases. We ran simple tests with non-engineers in our immediate circle and evaluated the usability based on established principles.

Specific Improvement Examples

Improved Error Display: We changed from a system that notified users of all errors after submission to a new system where errors are validated and displayed in real-time for each input field. The system now provides a written explanation for any edits.

Optimized Item Grouping: Instead of a long page with numerous input fields, we grouped related items into smaller, organized sections. This reduces the amount of information users need to remember at one time.

Pursuit of Safety with a Safety Protocol

Dangers of AI in Protein Engineering

In recent years, the advancement of AI technologies such as AlphaFold and protein language models (pLMs) has made protein structure prediction and design more efficient than ever before. These technologies are being applied in a wide range of fields, including drug discovery, enzyme engineering, and synthetic biology. Design tasks that once took years can now be completed in weeks or even days. However, the proliferation of automated AI-based design also introduces new risks. One concern is the possibility that AI could unintentionally create proteins with dangerous functions. In particular, the risk of generating sequences similar to genes associated with pathogens or toxins (Sequences of Concern, SOCs) has been pointed out. It has been confirmed that AI can indeed create harmful sequences, sparking international debate. Therefore, it is essential for protein design tools like LEAPS-Software to incorporate screening functions that detect the generation of potentially hazardous proteins and automatically exclude dangerous sequences. While the power of AI can lead to new discoveries beyond human imagination, it also increases the risk of misuse and accidents. Consequently, researchers have a responsibility to advance technology with a constant awareness of not only its convenience but also its safety and ethical implications.

Safety Protocol

To reduce the risks associated with LEAPS-Software, we have introduced a safety protocol that employs two independent methods: sequence homology and structural similarity. For sequence homology, we use MMseqs2 to compare amino acid sequences against a database of sequences with confirmed risks. This process extracts HIT IDs of toxic proteins with similar sequences. For structural similarity, we use Foldseek with its default parameters to check for similarities at the 3D structure level against a database of hazardous structures. This provides HIT IDs of structurally similar toxic proteins. By displaying the HIT IDs in the software, users can understand the potential risks of their engineered proteins and proactively avoid safety issues.
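As a rough illustration of the two searches, the Python sketch below builds an `easy-search` command for either tool and extracts hit IDs from the default tab-separated output, where the second column is the target ID. The helper names, file names, and database names are hypothetical, and how LEAPS-Software actually invokes the tools is not stated here.

```python
from typing import List, Set


def build_search_command(tool: str, query: str, target_db: str,
                         result: str, tmp_dir: str) -> List[str]:
    """Build an `easy-search` call with default parameters for mmseqs or foldseek."""
    if tool not in ("mmseqs", "foldseek"):
        raise ValueError(f"unsupported tool: {tool}")
    # Both tools share the argument order: query, target, result file, tmp dir.
    return [tool, "easy-search", query, target_db, result, tmp_dir]


def parse_hit_ids(m8_text: str) -> Set[str]:
    """Collect target IDs (second column) from tab-separated search output."""
    hits = set()
    for line in m8_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            hits.add(fields[1])
    return hits


# Hypothetical file and database names; pass the list to subprocess.run(...)
# on a machine where the tool is installed.
sequence_cmd = build_search_command(
    "mmseqs", "candidates.fasta", "hazard_seq_db", "seq_hits.m8", "tmp")
structure_cmd = build_search_command(
    "foldseek", "candidates_pdb", "hazard_struct_db", "struct_hits.m8", "tmp")

# Made-up output lines: query, target, then the remaining alignment columns.
example_output = "cand1\tTOX_A\t0.91\ncand2\tTOX_B\t0.83\ncand3\tTOX_A\t0.78\n"
print(sorted(parse_hit_ids(example_output)))
```

Building the command as a list rather than a shell string avoids quoting problems with unusual file names.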

Scientific Rationale for the Dual-Evaluation Approach

The reason for conducting structural similarity searches in addition to sequence similarity searches is based on a critical biological principle in protein risk assessment. Even if sequence-level similarity is low, a protein’s 3D structure may resemble a dangerous motif, such as the active site of a toxin. Conversely, even if a sequence is partially similar, the actual 3D structure may be significantly different and lack biological activity. Therefore, by cross-checking both sequence and structure, we can capture risks that might be missed by a single method, thereby improving the accuracy and reliability of the evaluation.
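The cross-check can be pictured as combining two sets of hit IDs. The sketch below assumes one simple way to do this, labeling each ID by which screen flagged it. The function name and the labels are hypothetical, and the way LEAPS-Software merges the two results may differ.

```python
from typing import Dict, Set


def combine_screens(sequence_hits: Set[str], structure_hits: Set[str]) -> Dict[str, str]:
    """Label each hit ID by which screen(s) flagged it."""
    report = {}
    for hit_id in sorted(sequence_hits | structure_hits):
        in_sequence = hit_id in sequence_hits
        in_structure = hit_id in structure_hits
        if in_sequence and in_structure:
            report[hit_id] = "sequence+structure"
        elif in_sequence:
            report[hit_id] = "sequence only"
        else:
            report[hit_id] = "structure only"
    return report


# Made-up IDs: TOX_C is caught only by the structural screen, which is the
# case a sequence-only check would miss.
report = combine_screens({"TOX_A", "TOX_B"}, {"TOX_B", "TOX_C"})
for hit_id, label in report.items():
    print(hit_id, label)
```

Taking the union keeps every hit from either method, while the label preserves which kind of evidence supports it.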

Tools and Technical Specifications

・MMseqs2

MMseqs2 (Many-against-Many sequence searching) is a tool capable of performing high-speed similarity searches against large-scale databases containing millions to billions of sequences. It is significantly faster than BLAST, enabling the efficient screening of a vast number of sequences. In this project, we used the default parameters of MMseqs2, conducting the analysis under standard search conditions that balance sensitivity and specificity.

・Foldseek

Foldseek is a tool for high-speed comparison and searching of 3D protein structures. It encodes each structure as a sequence over the 3Di (3D interaction) structural alphabet, and then evaluates structural similarity using fast algorithms similar to those used for sequence searching. This allows for much faster searches of large structural databases compared to conventional structure comparison methods. For Foldseek, we also adopted the default parameters, adhering to standard criteria for determining structural similarity.

Disclaimer and Terms of Use

In addition to the safety mechanisms mentioned above, we have created a disclaimer and terms of use, and implemented a system requiring users to agree to them before using LEAPS-Software. This serves primarily as a deterrent against misuse and clarifies the scope of our responsibility. It also compensates for the technical limitations of the safety mechanism from the perspectives of ethical and social responsibility.

Initially, we intended for the safety mechanism to reject the input of harmful proteins into LEAPS-Software and to issue warnings for output sequences with structural similarities to harmful proteins, thereby preventing the software from being used for inhumane acts such as bioterrorism. However, through advice from experts during our Human Practices work, we were confronted with the inadequacies of our initial system. We were advised of the difficulty in defining toxins, the impossibility of creating a perfect list of harmful proteins, and the technical limitations of filtering. Furthermore, it was pointed out that modifying viral proteins or toxins, when conducted as proper research, could contribute to the development of new pharmaceuticals. A blanket ban on the use of sequences on a list of harmful proteins could inadvertently hinder potentially beneficial research.

In addition to these challenges with filtering technology, several experts suggested that we clarify the locus of responsibility in case of user misuse through a disclaimer and deter misuse by explicitly stating our concerns and collecting user information.

We were also aware that researchers might be hesitant to input information related to unpublished research into the software.

Based on these considerations, the final safety mechanism only displays a warning when there is high similarity to the sequence or 3D structure of a harmful protein, allowing the user to proceed with using LEAPS-Software after acknowledging the warning. We then formulated a disclaimer to clarify the scope of responsibility for both the user and us, the service providers, and terms of use that prohibit misuse. By displaying these documents and requiring user agreement, we encourage the appropriate use of LEAPS-Software. We also attempted to alleviate concerns about data retention by creating terms of use and a privacy policy.

The following is an overview of what is stated in each document. For specific details, please refer to the PDF files.

Disclaimer

The disclaimer plays a role in making users re-evaluate the potential risks of their sequences by asking them to confirm whether the input amino acid sequence—the protein they wish to improve—is toxic or pathogenic, or if it matches a list of proteins requiring special attention that we provide. It also states that the user is the person responsible for handling the sequence and requires them to agree to exercise the utmost caution, including compliance with laws and regulations. It demands that the user take full responsibility for managing both the input and output sequences.

Our team created original lists: the Japanese version is based on the “Table of Pathogens and Corresponding Diseases,” a list of regulated agents created by the Ministry of Health, Labour and Welfare of Japan, while the English version is based on the “Select Agents and Toxins List,” jointly created by the Centers for Disease Control and Prevention (CDC) under the U.S. Department of Health and Human Services (HHS) and the Department of Agriculture (USDA). This is because the lists differ between Japan and other countries, and our intention is to provide a list that is more relevant to the user’s research environment.

LEAPS_Disclaimer_en.pdf

Terms of Use

The terms of use first state the philosophy behind the development of LEAPS-Software. This clarifies our stance that the software was developed to contribute to the advancement of protein research, thereby encouraging appropriate use by users.

It also clearly defines the responsibilities and rights of both the user and us as the service provider. It explicitly prohibits the use of LEAPS-Software for acts that violate public safety and ethics and mandates that users agree to the disclaimer, terms of use, and privacy policy displayed before use. Compliance with laws and regulations is described in more detail than in the disclaimer. This establishes that the responsibility for any misuse lies with the user. On the other hand, our responsibilities, particularly the strict management of input information, are described in detail. This is intended to show that we, as the service provider, take seriously the gravity of researchers using unpublished research data and aim to provide a tool they can use for their research with confidence.

Other important provisions regarding rights, such as service suspension and use of the logo, are also included.

LEAPS_Terms of Use_en.pdf

Privacy Policy

From the perspective of research information retention, LEAPS requires user registration with an email address to prevent third-party access to user chats. Accordingly, we have created a privacy policy that defines the handling of personal information. It specifies in detail what constitutes personal information, its intended use after collection, and the retention period. The purpose is to ensure the safety and security of users by clearly stating our response to protect their rights from infringement.

LEAPS_Privacy Policy_en.pdf

The repository used to create this website is available at gitlab.igem.org/2025/tsukuba.