Software Introduction

The Neighborhood Analysis Tool is a specialized bioinformatics application designed for clustering protein sequences based on their evolutionary relationships. Built with C++ for high-performance computation and Qt for an intuitive user interface, this tool implements an advanced adaptive clustering algorithm that automatically adjusts to the local density of sequence networks.

Unlike traditional clustering methods that use a single global threshold, our tool treats each sequence as an individual entity, allowing it to define its own "social circle" based on local network characteristics. This approach produces more biologically meaningful clusters that accurately reflect evolutionary relationships.

The source code can be viewed or downloaded from the Software Repository, where it is stored in the project files(visual studio) directory. Developers can also clone the repository to a local environment.
For users who need a ready-to-run version, the Installation_package.zip file in the repository is a pre-packaged archive: unzip it and run the software, with no additional installation required.

How It Works: The Science Behind the Tool

Core Algorithm Principle

The tool implements a three-tiered adaptive threshold system that mimics how relationships form in biological networks:

• Dense Regions: Sequences with many close relatives use strict thresholds to maintain high-quality connections
• Sparse Regions: Isolated sequences use more permissive thresholds to avoid excessive fragmentation
• Intermediate Regions: Sequences with moderate connectivity use balanced thresholds

This adaptive approach prevents two common problems in traditional clustering:

• Over-fragmentation: Isolated sequences being split off into meaningless singleton clusters
• Over-merging: Distinct functional groups being lumped together in "hairball" networks
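
To make the three-tier idea concrete, here is a minimal Python sketch. It is not the tool's C++ implementation, and it rests on two assumptions that this guide does not confirm: local density is measured by counting how many neighbors a sequence keeps at each tier, and an edge is retained when it passes the threshold of either endpoint. The function names `pick_threshold` and `adaptive_edges` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

# Default tier values from the parameter section of this guide.
LOOSE, MEDIUM, STRICT = 0.5, 0.7, 0.9


def pick_threshold(scores: Sequence[float], min_neighbors: int = 3) -> float:
    """Return the strictest tier that still leaves `min_neighbors` neighbors.

    `scores` holds one sequence's similarity to every other sequence.
    Assumption: density is judged by the neighbor count at each tier.
    """
    for threshold in (STRICT, MEDIUM, LOOSE):
        if sum(score >= threshold for score in scores) >= min_neighbors:
            return threshold
    return LOOSE  # sparse region: fall back to the most permissive tier


def adaptive_edges(
    ids: List[str], matrix: List[List[float]], min_neighbors: int = 3
) -> Tuple[List[float], Set[Tuple[str, str]]]:
    """Choose a threshold per sequence, then filter edges with it.

    Assumption: an edge survives if it passes either endpoint's threshold.
    """
    n = len(ids)
    thresholds = [
        pick_threshold([matrix[i][j] for j in range(n) if j != i], min_neighbors)
        for i in range(n)
    ]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] >= min(thresholds[i], thresholds[j]):
                edges.add((ids[i], ids[j]))
    return thresholds, edges


# The four-sequence example matrix from the input format section.
ids = ["Seq1", "Seq2", "Seq3", "Seq4"]
matrix = [
    [1.0, 0.8, 0.3, 0.2],
    [0.8, 1.0, 0.4, 0.1],
    [0.3, 0.4, 1.0, 0.9],
    [0.2, 0.1, 0.9, 1.0],
]
thresholds, edges = adaptive_edges(ids, matrix, min_neighbors=1)
print(dict(zip(ids, thresholds)))
print(sorted(edges))
```

With `min_neighbors=1`, Seq3 and Seq4 qualify for the strict tier because they are 0.9 similar to each other, while Seq1 and Seq2 settle on the medium tier. The weak cross-links (0.1 to 0.4) are dropped, leaving two pairs.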

Technical Architecture

The software consists of two main components:

Backend Engine (C++)

• Processes similarity matrices and performs computational clustering
• Implements graph theory algorithms for network analysis
• Handles large datasets efficiently through optimized data structures
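
This guide does not say which graph algorithms the backend uses. The output file name components.txt suggests that clusters may be read off as connected components of the filtered network, so the Python sketch below shows that step with a union-find structure. Treat it as an illustration under that assumption, not as the engine's code; `connected_components` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def connected_components(
    nodes: List[str], edges: Iterable[Tuple[str, str]]
) -> List[List[str]]:
    """Group nodes into connected components using union-find."""
    parent: Dict[str, str] = {node: node for node in nodes}

    def find(node: str) -> str:
        # Walk up to the root, shortening the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups: Dict[str, List[str]] = defaultdict(list)
    for node in nodes:
        groups[find(node)].append(node)
    # Largest clusters first; ties keep their input order.
    return sorted(groups.values(), key=len, reverse=True)


nodes = ["Seq1", "Seq2", "Seq3", "Seq4", "Seq5"]
kept_edges = [("Seq1", "Seq2"), ("Seq3", "Seq4")]
components = connected_components(nodes, kept_edges)
print(components)
```

Union-find keeps memory proportional to the number of sequences, which is one common way to stay efficient on networks with thousands of nodes.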

Frontend Interface (Qt)

• Provides user-friendly parameter controls
• Displays real-time progress and results
• Enables easy export to visualization tools

Step-by-Step User Guide

1. Getting Started

Launch the Application

• Double-click ClusteringAppWithGUI.exe
• The main window will open with default parameters already set

Understanding the Interface

The main window is divided into three sections:

• File Settings (top): For input and output file selection
• Analysis Parameters (middle): For adjusting clustering sensitivity
• Progress & Log (bottom): For monitoring the analysis

2. Preparing Your Data

Input File Format

Your input file should be a tab-separated text file with this structure:

• First row: Sequence identifiers
• First column: Sequence identifiers
• Remaining cells: Similarity scores (0.0 to 1.0)

Example Input:

	Seq1	Seq2	Seq3	Seq4
Seq1	1.0	0.8	0.3	0.2
Seq2	0.8	1.0	0.4	0.1
Seq3	0.3	0.4	1.0	0.9
Seq4	0.2	0.1	0.9	1.0
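
Before loading a file into the tool, it can help to check it against the format above. The following Python sketch is a standalone pre-flight check, not part of the software. The function name `read_similarity_matrix` is hypothetical, and the checks cover only what this guide states: tab separation, matching row and column identifiers, and scores between 0.0 and 1.0.

```python
import io
from typing import List, TextIO, Tuple


def read_similarity_matrix(handle: TextIO) -> Tuple[List[str], List[List[float]]]:
    """Parse a tab-separated similarity matrix and validate its layout."""
    lines = [line.rstrip("\r\n") for line in handle if line.strip()]
    if not lines:
        raise ValueError("file is empty")
    # The first cell of the header row is empty, so skip it.
    ids = lines[0].split("\t")[1:]
    matrix: List[List[float]] = []
    for row_number, line in enumerate(lines[1:]):
        cells = line.split("\t")
        if row_number >= len(ids) or cells[0] != ids[row_number]:
            raise ValueError(f"row {row_number + 1}: identifier does not match header")
        row = [float(cell) for cell in cells[1:]]
        if len(row) != len(ids):
            raise ValueError(f"row {cells[0]}: expected {len(ids)} scores, got {len(row)}")
        if any(not 0.0 <= value <= 1.0 for value in row):
            raise ValueError(f"row {cells[0]}: score outside 0.0 to 1.0")
        matrix.append(row)
    if len(matrix) != len(ids):
        raise ValueError("matrix is not square")
    return ids, matrix


# The example input from this section, with tabs written explicitly.
EXAMPLE = (
    "\tSeq1\tSeq2\tSeq3\tSeq4\n"
    "Seq1\t1.0\t0.8\t0.3\t0.2\n"
    "Seq2\t0.8\t1.0\t0.4\t0.1\n"
    "Seq3\t0.3\t0.4\t1.0\t0.9\n"
    "Seq4\t0.2\t0.1\t0.9\t1.0\n"
)
seq_ids, scores = read_similarity_matrix(io.StringIO(EXAMPLE))
print(seq_ids, scores[2][3])
```

To check a real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `open("my_matrix.txt", encoding="utf-8")`, where the file name is a placeholder.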

3. Setting Up Your Analysis

Step 1: Select Input File

• Click the "Browse" button next to "Input File"
• Navigate to your similarity matrix file
• Select the file and click "Open"

Step 2: Choose Output Directory

• Click the "Browse" button next to "Output Directory"
• Select where you want results saved
• Click "Choose Folder"

Step 3: Adjust Analysis Parameters (Optional)

The four parameters control clustering sensitivity:

Loose Threshold (0.5 by default):

• Removes only very weak connections
• Use higher values for more conservative clustering

Medium Threshold (0.7 by default):

• Balanced filtering for moderate-density regions
• Good starting point for most analyses

Strict Threshold (0.9 by default):

• Keeps only strong, high-confidence connections
• Use for very dense sequence families

Minimum Neighbors (3 by default):

• Minimum connections each sequence should retain
• Higher values create more connected clusters

Parameter Tips for Beginners:

• Start with the default values for initial exploration
• If you get too many small clusters, try lowering the thresholds
• If clusters seem too mixed, try increasing the thresholds
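
When experimenting with several parameter sets, it may be convenient to record them in one place and sanity-check them first. The Python sketch below uses the default values listed above. The class name `ClusteringParameters` is hypothetical, and the rule that thresholds should increase from loose to strict is an assumption inferred from their names, not a documented requirement of the tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusteringParameters:
    # Defaults as listed in this guide.
    loose_threshold: float = 0.5
    medium_threshold: float = 0.7
    strict_threshold: float = 0.9
    min_neighbors: int = 3

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the set looks sane."""
        issues = []
        for name in ("loose_threshold", "medium_threshold", "strict_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                issues.append(f"{name}={value} is outside 0.0 to 1.0")
        # Assumption: tiers are ordered loose <= medium <= strict.
        if not self.loose_threshold <= self.medium_threshold <= self.strict_threshold:
            issues.append("thresholds should increase from loose to strict")
        if self.min_neighbors < 0:
            issues.append("min_neighbors must not be negative")
        return issues


defaults = ClusteringParameters()
print(defaults, defaults.problems())

# A more permissive variant to try when results show too many small clusters.
relaxed = ClusteringParameters(loose_threshold=0.4, medium_threshold=0.6)
print(relaxed, relaxed.problems())
```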

4. Running the Analysis

Step 1: Start Analysis

• Click the "Start Analysis" button
• A progress bar will show the analysis status
• Detailed logs appear in the text area below

Step 2: Monitor Progress

Watch the log messages for:

• File loading confirmation
• Graph construction status
• Neighborhood analysis progress
• Cluster identification
• Export completion

Step 3: Review Results

• Wait for the "Analysis Complete!" message
• The "View Results" button will become active

5. Understanding Your Results

Output Files Generated:

Cluster List File (components.txt)
• Lists all sequences grouped by cluster
• Shows cluster sizes and member sequences
Network File for Visualization (cytoscape_edges.txt)
• Formatted for import into Cytoscape
• Contains connection strengths and relationships
Adjacency Matrix Files
• Mathematical representation of clusters
• Useful for advanced analysis

Interpreting Results:

• Large Clusters: Groups of closely related sequences
• Small Clusters: Specialized or unique sequences
• Singleton Clusters: Highly divergent sequences
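
A quick tally of cluster sizes gives a first impression of a run. The Python sketch below works on clusters already held in memory as lists of sequence identifiers, because the exact layout of components.txt is not described here. The function name `summarize_clusters` and the cutoff of 10 members for a "large" cluster are illustrative assumptions; choose a cutoff that suits your dataset.

```python
from typing import Dict, List


def summarize_clusters(clusters: List[List[str]], large_cutoff: int = 10) -> Dict[str, int]:
    """Count clusters per size category (the cutoff is an arbitrary example)."""
    summary = {"large": 0, "small": 0, "singleton": 0}
    for members in clusters:
        if len(members) == 1:
            summary["singleton"] += 1
        elif len(members) >= large_cutoff:
            summary["large"] += 1
        else:
            summary["small"] += 1
    return summary


# Made-up clusters, for illustration only.
example_clusters = [
    [f"Seq{i}" for i in range(1, 13)],  # 12 members
    ["Seq13", "Seq14", "Seq15"],        # 3 members
    ["Seq16"],
    ["Seq17"],
]
cluster_summary = summarize_clusters(example_clusters)
print(cluster_summary)
```

A result dominated by singletons points toward the "too many small clusters" case in the troubleshooting section, while a single large cluster points toward the "one big cluster" case.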

6. Visualizing Results

Using Cytoscape (Recommended):

• Install Cytoscape from https://cytoscape.org/
• Open Cytoscape and go to File → Import → Network from File
• Select the cytoscape_edges.txt file from your output directory
• Use Layout → Prefuse Force Directed for optimal clustering visualization

Understanding the Network:

• Node Size: Can represent sequence length or importance
• Edge Thickness: Represents similarity strength
• Cluster Colors: Group related sequences visually

Troubleshooting Common Issues

Problem: Analysis fails to start

• Check that the input file is properly formatted
• Ensure all similarity scores are between 0 and 1
• Verify that the file is not open in another program

Problem: Results show only one big cluster

• Try increasing the strict threshold
• Reduce the minimum neighbors parameter
• Check if input similarity scores are too high

Problem: Results show too many small clusters

• Try decreasing the loose threshold
• Increase the minimum neighbors parameter
• Check if input similarity scores are too low

Problem: Cannot open results in Cytoscape

• Ensure you're selecting the correct file type in Cytoscape
• Check that the output directory is accessible
• Verify Cytoscape version compatibility

If you have other questions, create an issue or contact us. Contact information is available in the footer.

Advanced Features

Batch Processing

• The tool can be run from the command line for automated processing
• Useful for analyzing multiple datasets

Custom Export Formats

• Additional export options are available for specialized analysis
• Can be customized in the source code for specific needs

Performance Optimization

• Handles networks with thousands of sequences efficiently
• Memory-optimized for large datasets