Safety and Security

Lab Works

Safe Use and Recombination of P1-Classified Organisms and Genes

The Plan and Its Rationale

We conducted all laboratory experiments in strict compliance with the Cartagena Act. We used Escherichia coli strains JM109 and BL21 as hosts, relatively safe genes such as GFP, β-lactamase, and PETase, and vectors with narrow host ranges (pUC and pET28a). All recombinant DNA experiments were performed exclusively in designated P1-level laboratory facilities.

As containment measures, all used equipment and genetically modified organisms (E. coli) were inactivated by autoclaving (121°C, 20 minutes). Equipment that could not be autoclaved was sterilized with sodium hypochlorite solution at concentrations above 100 ppm or by UV irradiation. Double packaging was strictly enforced during transport. Together, these operations served as appropriate measures to prevent the dispersal of recombinant organisms.

To ensure experimental safety, all personnel involved in these experiments completed the Recombinant DNA Training Course at the University of Tsukuba, where they learned about experimental facilities, the knowledge appropriate to the biosafety level of the target organisms, and proper transport and disposal methods.

A distinctive feature of this project is the use of AI-generated gene sequences. Mutants generated by AI may exhibit unexpected functions or biological activities. To address this, recombinant E. coli expressing the generated mutant proteins were handled using the biological containment procedures described above.

Relevant Articles

This work complies with Articles 12 and 13 of the “Act on the Conservation and Sustainable Use of Biological Diversity through Regulations on the Use of Living Modified Organisms” (Cartagena Act).

Toxic Protein Filtering for Software

Risks of AI-Based Protein Design

In recent years, the life sciences have seen remarkable advances in artificial intelligence (AI). Among these developments, technologies such as AlphaFold and protein language models (pLMs) have attracted significant attention as tools that dramatically improve the efficiency of protein structure prediction and design. These technologies have broad application potential in drug discovery, enzyme engineering, and biomaterial development, and they allow design processes that previously required years to be completed within weeks or even days.

However, the rapid proliferation of AI-driven protein design brings new risks from biosafety and biosecurity perspectives. The primary concern is the relative ease with which design and modification can be performed, potentially leading to the creation of novel proteins with dangerous functions contrary to researchers’ intentions. For example, there are risks that modifying normally harmless enzymes could generate mutants with toxic side effects, or that such technologies could be misused to enhance pathogenic functions [1][2][3][4].

The Concept of SOC (Sequences of Concern)

Particularly noteworthy in this context is the concept of SOC (Sequences of Concern). SOC refers to gene sequences related to pathogens or toxins, or sequences whose components may exhibit dangerous functions. The international research community recognizes the importance of identifying sequences that fall under SOC and appropriately regulating and monitoring their use in research and industrial applications. AI-driven design tools inherently carry the potential to inadvertently generate sequences that approximate SOC, making robust screening mechanisms essential [1][2].

Red Teaming

A practical approach to exposing the dangers of AI technology is red teaming. Originally a method used in cybersecurity, red teaming tests a system's vulnerabilities from the perspective of a potential attacker. In recent years, this approach has been introduced in the life sciences and AI-bio domains to verify whether AI can be used to “intentionally” generate dangerous proteins. Such an evaluation demonstrated that specific algorithms can indeed design sequences with potentially harmful properties, sending shockwaves through the international community [5].

Design Tools Like LEAPS-Software and Risk Management

Efficient and innovative protein design tools such as LEAPS-Software must incorporate filtering mechanisms and safety management systems to prevent the generation of hazardous sequences. Researchers bear the responsibility not only to create novel sequences but also to utilize AI technology within socially acceptable boundaries by combining SOC databases with hazard screening technologies.

For these reasons, LEAPS-Software includes a screening mechanism that examines whether the protein sequences it improves carry unintended harmful functions or dual-use concerns. Methods with high exploration and design capabilities, such as LEAPS-Software, may accidentally confer novel functions beyond existing human intuition. In particular, if arrangements of active residues similar to those of virulence factors, or motifs resembling toxic domains, are incorporated during the design process, serious risks could arise during experimental handling or third-party use. Prescreening generated sequences for potential hazards is therefore essential.

Overview of the Screening Workflow

The screening targets the novel modified protein sequences generated by LEAPS-Software. Each sequence is evaluated in parallel for sequence similarity and for structural similarity to detect hazard signals.

We included structural similarity searches in addition to sequence homology searches because sequences with low sequence similarity may still converge structurally on dangerous motifs, which makes cross-checking between the two approaches critical. Default parameter settings were used for MMseqs2 and Foldseek.
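
The cross-check between the two searches can be sketched as follows. This is a minimal illustration under stated assumptions, not the LEAPS-Software implementation: it assumes both tools wrote their hits as BLAST-style tab-separated lines with the E-value in column 11 and the bit score in column 12, and the names `Hit`, `parse_hits`, `hazard_signals`, and the E-value cutoff of 1e-3 are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str
    target: str
    evalue: float
    bits: float


def parse_hits(tsv_text: str) -> list[Hit]:
    """Parse BLAST-style tabular hits (12 tab-separated columns per line)."""
    hits = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        # Columns 1-2 are query and target IDs; columns 11-12 are E-value and bit score.
        hits.append(Hit(cols[0], cols[1], float(cols[10]), float(cols[11])))
    return hits


def hazard_signals(seq_hits, struct_hits, max_evalue=1e-3):
    """Map each query ID to the set of searches that flagged it.

    max_evalue is an assumed, illustrative cutoff, not a value from the project.
    """
    flagged = {}
    for source, hits in (("sequence", seq_hits), ("structure", struct_hits)):
        for hit in hits:
            if hit.evalue <= max_evalue:
                flagged.setdefault(hit.query, set()).add(source)
    return flagged
```

A design that appears in the result with only `"structure"` in its set is the case the paragraph above describes: low sequence similarity but a structural match to a blacklisted protein.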

Fig. 1. Overview of screening workflow

Blacklist Construction and Publication Policy

Initially, we considered creating a comprehensive list of hazardous proteins for the iGEM community. However, through our Integrated Human Practices (IHP) activities, we learned that comprehensive lists of dangerous proteins carry their own risk of misuse. Furthermore, because iGEM requires all deliverables to be published on GitLab, any database we built would become public. We therefore decided not to release a comprehensive database. Instead, we will publish only a provisional blacklist on GitLab for use with LEAPS-Software.

The provisional blacklist used in this work was obtained through the following procedure:

1. Dataset for MMseqs2 sequence homology search

  1. On 2025/10/07, sequences were retrieved from UniProtKB using keywords KW-0800 and KW-0261 with the “Reviewed” filter, and duplicates were removed.

  2. To create a concise dataset with reduced redundancy, the sequences were clustered using MMseqs2. (Fig. 2 shows the FASTA sequences embedded with ESM2, with the 1280-dimensional embeddings projected via PCA. This visualization confirms that closely related sequences form clusters.)

  3. Clustering yielded 2,128 clusters, from which a sequence dataset containing only representative sequences was created.

Fig. 2. Clustering using ESM2 embeddings

2. Dataset for Foldseek structural similarity search

  1. Three-dimensional structures were predicted for the 2,128 representative sequences using ColabFold to create a structural database.
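
The deduplication and representative-selection steps of the sequence dataset can be sketched as follows. This is an illustration rather than the team's actual script: it assumes the cluster assignments are available as a two-column tab-separated table (representative ID, member ID), and the helper names `read_fasta`, `deduplicate`, and `representatives` are hypothetical.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def deduplicate(records):
    """Keep only the first record for each distinct amino acid sequence."""
    seen, unique = set(), []
    for header, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq))
    return unique


def representatives(cluster_tsv, records):
    """Keep records whose ID is listed as a cluster representative.

    cluster_tsv is assumed to hold 'representative<TAB>member' lines;
    a record's ID is taken to be the first word of its FASTA header.
    """
    reps = {line.split("\t")[0] for line in cluster_tsv.splitlines() if line.strip()}
    return [(h, s) for h, s in records if h.split()[0] in reps]
```

Applied in order (`read_fasta`, then `deduplicate`, then `representatives`), this reduces a retrieved FASTA file to one sequence per cluster, which is the form used for the homology search database described above.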

Limitations and Future Prospects of Screening

Improving Computational Efficiency and Future Standardization

To enable comparison against large-scale databases, this project used MMseqs2 (sequence similarity search) and Foldseek (structural similarity search), both of which are fast and highly accurate tools. For teams implementing similar safety protocols in the future, we strongly recommend referring to the common protocol “Common Mechanism” (Commec) proposed by the International Biosecurity and Biosafety Initiative for Science (IBBIS) [7][8]. Commec is a comprehensive framework aimed at standardizing safety assessments in synthetic biology projects and is expected to contribute to safety improvements across the research community.

Fundamental Limitations of Similarity-Based Methods

The screening approach adopted in this project, based on sequence and structural similarity, relies on the principle of predicting function through comparison with known databases. While this method efficiently detects known hazardous factors, it has the significant limitation of potentially missing unknown dangerous functions or novel biological functions not recorded in existing databases due to low similarity. Furthermore, technical limitations such as uncertainty in protein structure prediction remain unavoidable challenges at present. These issues suggest the possibility of false negatives in safety assessments and represent areas requiring future improvement.

Recommendations for the iGEM Community

There is no doubt that projects centered on AI-driven DNA and protein design will increase throughout the iGEM community in the future. Amid this wave of technological innovation, we emphasize the importance of next-generation researchers possessing not only technical knowledge but also sufficient understanding of biosafety and biosecurity. We strongly urge the entire community to cultivate a culture of responsible research practices and risk assessment to realize safe and ethical AI-driven synthetic biology projects.

Disclaimer and Terms of Use

Background

Limitations of Filtering

LEAPS-Software implements filtering using a blacklist to ensure safety. This filtering is highly effective for detecting toxic proteins and viral protein sequences. However, this mechanism has several limitations. Filtering faces two main challenges: blacklist selection and interference with legitimate use, which are explained below.

Note that safety management measures for the software, including filtering, were determined and implemented based on Human Practices outcomes. For details, see the Software section of the IHP page.

Blacklist Selection

The primary misuse we are concerned about is the generation of improved sequences with high toxicity and their use in biological weapons. Proteins that should be selected for the blacklist to prevent this are those with toxicity or pathogenicity. However, drawing clear boundaries for toxicity or pathogenicity is difficult.

For example, toxins cause adverse effects on organisms above certain exposure thresholds but may be harmless below them. Ultimately, toxins can be described simply as highly bioactive substances. Therefore, substances not generally recognized as toxic may become harmful if the tolerance level is exceeded or certain properties are enhanced. For pathogenicity, whether something is dangerous depends on the host: some agents do not infect humans but do infect birds or dogs.

As shown above, whether the same protein sequence is considered a toxin varies depending on quantity and conditions. Given this property, creating a blacklist that completely identifies proteins with misuse potential using uniform criteria is extremely difficult.

Interference with Legitimate Use

Even if a blacklist could be constructed perfectly, new challenges would emerge.

There are two main methods for filtering using blacklists. The first prevents the input of dangerous proteins with toxicity or pathogenicity, blocking the use of protein sequences contained in the blacklist altogether. However, this system risks interfering with legitimate use.

Research on toxins and pathogens includes pharmaceutical development. Toxins acting in appropriate quantities at appropriate sites become medicines, and research on proteins with pathogenicity directly connects to vaccine development. Such research is not the misuse we anticipate but rather highly beneficial research. If filtering with the blacklist prevents software use, it would uniformly block such legitimate research. This would greatly contradict our principle stated when developing LEAPS-Software: “to broadly provide protein improvement using machine learning, thereby promoting the development of the protein research field.”

To avoid this interference with legitimate use, the second filtering method was considered: allowing the use of protein sequences contained in the blacklist while issuing only warnings. LEAPS-Software adopted this filtering approach, prioritizing broad access to benefits of protein research. However, since this filtering alone would also facilitate misuse, additional countermeasures were necessary.
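
The difference between the two filtering methods can be illustrated with a short sketch. It is not taken from the LEAPS-Software code base: the `ScreeningResult` type, the function names, and the warning text are all hypothetical, and the list of blacklist matches is assumed to come from an earlier similarity search.

```python
from dataclasses import dataclass, field


@dataclass
class ScreeningResult:
    sequence_id: str
    allowed: bool = True
    warnings: list[str] = field(default_factory=list)


def warn_only_filter(sequence_id, blacklist_matches):
    """Second method: never block, but attach one warning per blacklist match."""
    result = ScreeningResult(sequence_id)
    for match in blacklist_matches:
        result.warnings.append(
            f"{sequence_id} resembles blacklisted entry {match}; review before use."
        )
    return result


def blocking_filter(sequence_id, blacklist_matches):
    """First method: same warnings, but any match also blocks the sequence."""
    result = warn_only_filter(sequence_id, blacklist_matches)
    result.allowed = not blacklist_matches
    return result
```

With the warn-only variant, a vaccine researcher working on a blacklisted protein still receives results, along with a warning. This is why the approach preserves legitimate use but needs the additional countermeasures noted above.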

Concerns About Inputting Unpublished Research Data

To use LEAPS-Software, users must first input a dataset containing sequence information together with numerical evaluations of function. This dataset constitutes unpublished research information. Unpublished research data is highly confidential, and our IHP work revealed that researchers are reluctant to upload it to the internet. Additionally, as service providers, we must retain data online so that we can respond to misuse, and we need to explain this requirement to users. We must also promise not to misuse the information we obtain.

Determined Countermeasures

Based on these considerations, we formulated a Disclaimer, Terms of Use, and Privacy Policy according to the following principles:

  • Clarify the scope of user responsibility
  • Clarify the scope of service provider responsibility
  • Deter misuse by malicious third parties

By displaying these documents and obtaining user agreement to them, we aimed to reinforce LEAPS-Software’s safety and create software that users can employ with greater confidence.

Disclaimer

The disclaimer primarily serves to make users aware of whether their sequences pose hazards.

Specifically, users are asked via yes/no questions to confirm whether the protein is derived from toxins or pathogens and whether the input amino acid sequence matches the list of proteins requiring special attention. Though simple in structure, this mechanism aims to deter misuse by making users consciously recognize when handling dangerous sequences.

Regarding proteins requiring attention, we created separate lists for Japanese and English versions, considering regulatory differences between countries. While pathogens like Ebola virus are recognized for their high risk globally and regulated in many countries regarding gene possession and synthesis, regulations differ for endemic pathogens such as Japanese encephalitis virus. Therefore, the Japanese version references the “Correspondence Table of Pathogen Names and Disease Names” issued by Japan’s Ministry of Health, Labour and Welfare, while the English version references the “Select Agents and Toxins List” jointly created by the U.S. Centers for Disease Control and Prevention (CDC) and the U.S. Department of Agriculture (USDA). By allowing users to check lists appropriate to their country or region, LEAPS-Software can be used more smoothly.
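
The disclaimer logic described above can be sketched as a small data model. This is a hedged illustration only: the source does not state what the software does after a "yes" answer, so `needs_hazard_acknowledgement` is a hypothetical helper, and the language codes, the fallback to the English list, and all other names are assumptions made for the example.

```python
from dataclasses import dataclass

# Reference list shown for each interface language (titles as given in the text).
ATTENTION_LISTS = {
    "ja": "Correspondence Table of Pathogen Names and Disease Names (MHLW, Japan)",
    "en": "Select Agents and Toxins List (CDC/USDA, United States)",
}


@dataclass(frozen=True)
class DisclaimerAnswers:
    derived_from_toxin_or_pathogen: bool  # first yes/no question
    matches_attention_list: bool          # second yes/no question


def attention_list_for(language):
    """Return the list title for a language code; falling back to English is an assumption."""
    return ATTENTION_LISTS.get(language, ATTENTION_LISTS["en"])


def needs_hazard_acknowledgement(answers):
    """True if either answer is 'yes' (hypothetical follow-up condition)."""
    return answers.derived_from_toxin_or_pathogen or answers.matches_attention_list
```

Keeping the list titles in a single mapping would make it straightforward to add a reference list for another country or region.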

Fig. 3. Project Start Screen

The disclaimer is displayed at project initiation. For the full disclaimer text and lists within, please refer to the PDF files.

LEAPS_Disclaimer_en.pdf

Terms of Use

The Terms of Use primarily serve to clarify the scope of responsibility for both users and service providers. They also present the principles behind LEAPS-Software development. Stating these principles emphasizes that LEAPS-Software is provided to contribute to the advancement of protein research and is intended to guide users toward appropriate use.

User responsibility encompasses all handling of sequences input to and output from the software, with all benefits and disadvantages arising from this use attributed to the user. Particularly, users are explicitly prohibited from using LEAPS-Software for acts contrary to public safety and ethics.

Service provider responsibility encompasses all handling of sequences input to the software, service maintenance, and other personal information obtained. However, we stated that we cannot guarantee service accuracy regarding output sequence performance or processing time. Here, by explicitly addressing unpublished research data, we sincerely acknowledged its significance to researchers and promised users strict management.

Additionally, we established provisions for other important rights, including service suspension and logo usage.

These Terms of Use are presented during account creation at service initiation, with user agreement deemed complete upon account creation.

By defining the scope of responsibility for both users and service providers, including information handling, we intend to enable users to employ the service with confidence. Additionally, by demonstrating the service provider’s stance against complicity in misuse, we have implemented measures to sustain the service as much as possible in the event of misuse.

Fig. 4. Terms of Use Display Screen

For the full text, please refer to the PDF files. In creating the Terms of Use, we referenced IDT’s “IDT Online Terms and Conditions of Sale” [9].

LEAPS_Terms of Use_en.pdf

Privacy Policy

From the perspective of protecting research information, LEAPS requires user registration via email address to prevent third-party access to user chats. Accordingly, we created a Privacy Policy defining the handling of personal information. It details what personal information encompasses, its use after collection, retention periods, and related matters. By clearly stating our response procedures, we aim to ensure user safety and security without infringing on user rights.

Since our team lacks legal experts, we received assistance from ChatGPT, a large language model, for suggestions and refinement in wording and clarity during Privacy Policy creation. This was strictly supplementary support; content decisions were made by the team.

Fig. 5. Privacy Policy Display Screen

For the full text, please refer to the PDF files.

LEAPS_Privacy Policy_en.pdf

References

  1. International Gene Synthesis Consortium. (n.d.). IGSC Harmonized Screening Protocol v3.0 [PDF]. Retrieved from https://genesynthesisconsortium.org/wp-content/uploads/IGSC-Harmonized-Screening-Protocol-v3.0-1.pdf

  2. International Gene Synthesis Consortium. (n.d.). Home (IGSC). Retrieved from https://genesynthesisconsortium.org

  3. National Center for Biotechnology Information. (n.d.). NCBI Bookshelf. Retrieved from https://www.ncbi.nlm.nih.gov/books/NBK614591

  4. Integrated DNA Technologies (IDT). (n.d.). Biosecurity challenges in the age of AI. In Decoded+ Support & Education. Retrieved from https://sg.idtdna.com/page/support-and-education/decoded-plus/biosecurity-challenges-in-the-age-of-ai

  5. Ikonomova, S., Wittmann, B., Piorino Macruz de Oliveira, F., Ross, D., Schaffter, S., Vasilyeva, O., Strychalski, E., Horvitz, E., Diggans, J., Lin-Gibson, S., & Taghon, G. (2025). Experimental evaluation of AI-driven protein design risks using safe biological proxies. Science. Retrieved from https://www.nist.gov/publications/experimental-evaluation-ai-driven-protein-design-risks-using-safe-biological-proxies

  6. PMC. (n.d.). Article in PMC. Retrieved from http://pmc.ncbi.nlm.nih.gov/articles/PMC12158449/

  7. Laird, T. S., Flyangolts, K., Bartling, C., Gemler, B. T., Beal, J., Mitchell, T., Murphy, S. T., Berlips, J., Foner, L., Doughty, R., Quintana, F., Nute, M., Treangen, T. J., Godbold, G., Ternus, K., Alexanian, T., Wheeler, N., & Forry, S. P. (2025). Inter-tool analysis of a NIST dataset for assessing baseline nucleic acid sequence screening. bioRxiv (Cold Spring Harbor Laboratory).

  8. Wittmann, B. J., Alexanian, T., Bartling, C., Beal, J., Clore, A., Diggans, J., Flyangolts, K., Gemler, B. T., Mitchell, T., Murphy, S. T., Wheeler, N. E., & Horvitz, E. (2024). Toward AI-Resilient Screening of Nucleic Acid Synthesis Orders: process, results, and recommendations. bioRxiv (Cold Spring Harbor Laboratory).

  9. Integrated DNA Technologies. (n.d.). IDT Online Terms and Conditions of Sale.

© 2025 - Content on this site is licensed under a Creative Commons Attribution 4.0 International license

The repository used to create this website is available at gitlab.igem.org/2025/tsukuba.