Software Introduction
The Neighborhood Analysis Tool is a specialized bioinformatics application designed for clustering protein sequences based on their evolutionary relationships. Built with C++ for high-performance computation and Qt for an intuitive user interface, this tool implements an advanced adaptive clustering algorithm that automatically adjusts to the local density of sequence networks.
Unlike traditional clustering methods that use a single global threshold, our tool treats each sequence as an individual entity, allowing it to define its own "social circle" based on local network characteristics. This approach produces more biologically meaningful clusters that accurately reflect evolutionary relationships.
To access the source code, view or download it from the Software Repository; the source files are stored in the project files(visual studio) directory. Developers can also clone the repository to a local environment.
For users who need a ready-to-run version, the Installation_package.zip file in the repository is a pre-packaged archive: unzip it and run the software with no additional installation.
How It Works: The Science Behind the Tool
Core Algorithm Principle
The tool implements a three-tiered adaptive threshold system that mimics how relationships form in biological networks:
- Dense Regions: Sequences with many close relatives use strict thresholds to maintain high-quality connections
- Sparse Regions: Isolated sequences use more permissive thresholds to avoid excessive fragmentation
- Intermediate Regions: Sequences with moderate connectivity use balanced thresholds
This adaptive approach prevents two common problems in traditional clustering:
- Over-fragmentation: Isolated sequences being split off into meaningless singleton clusters
- Over-merging: Distinct functional groups being lumped together in "hairball" networks
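The three tiers can be pictured as a per-sequence threshold choice. The following is a minimal sketch, not the tool's actual implementation: it assumes that local density is measured as the number of neighbors scoring at or above the loose threshold, and the tier boundaries sparseMax and denseMin, like the names Thresholds and pickThreshold, are hypothetical values chosen for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical parameter bundle; the defaults mirror the GUI defaults.
struct Thresholds {
    double loose = 0.5;
    double medium = 0.7;
    double strict = 0.9;
};

// Count how many other sequences score at or above `cutoff` against sequence i.
std::size_t countCloseNeighbors(const std::vector<std::vector<double>>& sim,
                                std::size_t i, double cutoff) {
    std::size_t count = 0;
    for (std::size_t j = 0; j < sim[i].size(); ++j) {
        if (j != i && sim[i][j] >= cutoff) ++count;
    }
    return count;
}

// Pick a threshold for sequence i from its local density.
// ASSUMPTION: density = number of neighbors above the loose cutoff, and the
// tier boundaries (sparseMax, denseMin) are illustrative, not the tool's.
double pickThreshold(const std::vector<std::vector<double>>& sim, std::size_t i,
                     const Thresholds& t,
                     std::size_t sparseMax = 2, std::size_t denseMin = 10) {
    const std::size_t density = countCloseNeighbors(sim, i, t.loose);
    if (density <= sparseMax) return t.loose;   // sparse region: permissive
    if (density >= denseMin) return t.strict;   // dense region: strict
    return t.medium;                            // intermediate region: balanced
}
```

With these illustrative boundaries, a sequence with only one or two close relatives would be filtered at the loose threshold, while a member of a large family would be filtered at the strict one.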
Technical Architecture
The software consists of two main components:
Backend Engine (C++)
- Processes similarity matrices and performs computational clustering
- Implements graph theory algorithms for network analysis
- Handles large datasets efficiently through optimized data structures
Frontend Interface (Qt)
- Provides user-friendly parameter controls
- Displays real-time progress and results
- Enables easy export to visualization tools
Step-by-Step User Guide
1. Getting Started
Launch the Application
- Double-click ClusteringAppWithGUI.exe
- The main window will open with default parameters already set
Understanding the Interface
The main window is divided into three sections:
- File Settings (top): For input and output file selection
- Analysis Parameters (middle): For adjusting clustering sensitivity
- Progress & Log (bottom): For monitoring the analysis
2. Preparing Your Data
Input File Format
Your input file should be a tab-separated text file with this structure:
- First row: Sequence identifiers
- First column: Sequence identifiers
- Remaining cells: Similarity scores (0.0 to 1.0)
Example Input:
Seq1 Seq2 Seq3 Seq4
Seq1 1.0 0.8 0.3 0.2
Seq2 0.8 1.0 0.4 0.1
Seq3 0.3 0.4 1.0 0.9
Seq4 0.2 0.1 0.9 1.0
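For readers who want to load such a file in their own code, here is a minimal C++ sketch of a reader for this layout. It is an illustration under stated assumptions, not the tool's own parser: the names SimilarityMatrix, splitTabs and readMatrix are hypothetical, and the sketch accepts a header row either with or without a leading empty cell, since the exact convention is not stated above.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the parsed file.
struct SimilarityMatrix {
    std::vector<std::string> ids;              // sequence identifiers
    std::vector<std::vector<double>> scores;   // scores[i][j] in row order
};

// Split one line on tab characters, dropping a trailing '\r' (Windows files).
std::vector<std::string> splitTabs(std::string line) {
    if (!line.empty() && line.back() == '\r') line.pop_back();
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, '\t')) fields.push_back(field);
    return fields;
}

SimilarityMatrix readMatrix(std::istream& in) {
    SimilarityMatrix m;
    std::string line;
    if (!std::getline(in, line)) throw std::runtime_error("empty input");
    m.ids = splitTabs(line);
    // ASSUMPTION: tolerate a header that starts with an empty corner cell.
    if (!m.ids.empty() && m.ids.front().empty()) m.ids.erase(m.ids.begin());

    while (std::getline(in, line)) {
        if (line.empty()) continue;
        const std::vector<std::string> fields = splitTabs(line);
        // Each data row is one identifier followed by one score per sequence.
        if (fields.size() != m.ids.size() + 1)
            throw std::runtime_error("unexpected column count in row: " + line);
        std::vector<double> row;
        for (std::size_t k = 1; k < fields.size(); ++k)
            row.push_back(std::stod(fields[k]));   // throws on non-numeric cells
        m.scores.push_back(row);
    }
    if (m.scores.size() != m.ids.size())
        throw std::runtime_error("matrix is not square");
    return m;
}
```

Passing a std::ifstream opened on the input file to readMatrix would return the identifiers and the score rows, or throw with a message that points at the malformed row.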
3. Setting Up Your Analysis
Step 1: Select Input File
- Click the "Browse" button next to "Input File"
- Navigate to your similarity matrix file
- Select the file and click "Open"
Step 2: Choose Output Directory
- Click the "Browse" button next to "Output Directory"
- Select where you want results saved
- Click "Choose Folder"
Step 3: Adjust Analysis Parameters (Optional)
The four parameters control clustering sensitivity:
Loose Threshold (0.5 by default):
- Removes only very weak connections
- Use higher values for more conservative clustering
Medium Threshold (0.7 by default):
- Balanced filtering for moderate-density regions
- Good starting point for most analyses
Strict Threshold (0.9 by default):
- Keeps only strong, high-confidence connections
- Use for very dense sequence families
Minimum Neighbors (3 by default):
- Minimum connections each sequence should retain
- Higher values create more connected clusters
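One way to picture how a threshold and Minimum Neighbors could interact is sketched below. This is an assumption about the behavior, not a description of the tool's code: the hypothetical function keptNeighbors keeps every connection that passes the threshold and, if that leaves fewer than minNeighbors, tops the list up with the next-best connections.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Return the indices of the neighbors that one sequence keeps.
// `row` holds that sequence's similarity to every sequence; `self` is its own index.
std::vector<std::size_t> keptNeighbors(const std::vector<double>& row,
                                       std::size_t self, double threshold,
                                       std::size_t minNeighbors) {
    // Rank all other sequences by similarity, highest first.
    std::vector<std::size_t> order;
    for (std::size_t j = 0; j < row.size(); ++j) {
        if (j != self) order.push_back(j);
    }
    std::stable_sort(order.begin(), order.end(),
                     [&row](std::size_t a, std::size_t b) { return row[a] > row[b]; });

    std::vector<std::size_t> kept;
    for (std::size_t j : order) {
        // ASSUMPTION: edges below the threshold survive only while the
        // sequence still has fewer than minNeighbors connections.
        if (row[j] >= threshold || kept.size() < minNeighbors) {
            kept.push_back(j);
        } else {
            break;  // order is descending, so nothing later can qualify
        }
    }
    return kept;
}
```

Under this reading, raising Minimum Neighbors adds weaker edges around isolated sequences, which is consistent with the note that higher values create more connected clusters.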
Parameter Tips for Beginners:
- Start with the default values for initial exploration
- If you get too many small clusters, try lowering the thresholds
- If clusters seem too mixed, try increasing the thresholds
4. Running the Analysis
Step 1: Start Analysis
- Click the "Start Analysis" button
- A progress bar will show the analysis status
- Detailed logs appear in the text area below
Step 2: Monitor Progress
Watch the log messages for:
- File loading confirmation
- Graph construction status
- Neighborhood analysis progress
- Cluster identification
- Export completion
Step 3: Review Results
- Wait for the "Analysis Complete!" message
- The "View Results" button will become active
5. Understanding Your Results
Output Files Generated:
- Cluster List File (components.txt)
  - Lists all sequences grouped by cluster
  - Shows cluster sizes and member sequences
- Network File for Visualization (cytoscape_edges.txt)
  - Formatted for import into Cytoscape
  - Contains connection strengths and relationships
- Adjacency Matrix Files
  - Mathematical representation of clusters
  - Useful for advanced analysis
Interpreting Results:
- Large Clusters: Groups of closely related sequences
- Small Clusters: Specialized or unique sequences
- Singleton Clusters: Highly divergent sequences
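The file name components.txt suggests that clusters correspond to connected components of the filtered network, although that is an inference rather than something stated above. Under that assumption, grouping sequences into clusters can be sketched with a union-find structure; the names findRoot and labelComponents are hypothetical.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Find the representative of x's set, shortening the path as we go.
std::size_t findRoot(std::vector<std::size_t>& parent, std::size_t x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

// Label each of n sequences with a cluster id, numbered from 0 in order of
// first appearance. `edges` lists the connections that survived filtering.
std::vector<std::size_t> labelComponents(
        std::size_t n,
        const std::vector<std::pair<std::size_t, std::size_t>>& edges) {
    std::vector<std::size_t> parent(n);
    std::iota(parent.begin(), parent.end(), std::size_t{0});
    for (const auto& e : edges) {
        const std::size_t a = findRoot(parent, e.first);
        const std::size_t b = findRoot(parent, e.second);
        if (a != b) parent[b] = a;  // merge the two clusters
    }
    std::vector<std::size_t> label(n);
    std::vector<std::size_t> rootToId(n, n);  // n means "no id assigned yet"
    std::size_t next = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t r = findRoot(parent, i);
        if (rootToId[r] == n) rootToId[r] = next++;
        label[i] = rootToId[r];
    }
    return label;
}
```

A sequence with no surviving connections ends up alone in its own component, which matches the singleton clusters described above.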
6. Visualizing Results
Using Cytoscape (Recommended):
- Install Cytoscape from https://cytoscape.org/
- Open Cytoscape and go to File → Import → Network from File
- Select the cytoscape_edges.txt file from your output directory
- Use Layout → Prefuse Force Directed for optimal clustering visualization
Understanding the Network:
- Node Size: Can represent sequence length or importance
- Edge Thickness: Represents similarity strength
- Cluster Colors: Group related sequences visually
Troubleshooting Common Issues
Problem: Analysis fails to start
- Check that the input file is properly formatted
- Ensure all similarity scores are between 0 and 1
- Verify the file is not open in another program
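The first two checks can be automated before launching the tool. The sketch below is a hypothetical pre-check, not part of the application: findMatrixProblem takes an already parsed matrix and returns a description of the first problem it finds, or an empty string if the matrix is square and every score lies between 0 and 1.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Return a message describing the first problem found, or "" if none.
// Row and column numbers in messages are 1-based and refer to score cells only.
std::string findMatrixProblem(const std::vector<std::vector<double>>& sim) {
    const std::size_t n = sim.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (sim[i].size() != n) {
            return "row " + std::to_string(i + 1) + " has " +
                   std::to_string(sim[i].size()) + " scores, expected " +
                   std::to_string(n);
        }
    }
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const double v = sim[i][j];
            if (std::isnan(v) || v < 0.0 || v > 1.0) {
                return "score outside [0, 1] at row " + std::to_string(i + 1) +
                       ", column " + std::to_string(j + 1);
            }
        }
    }
    return "";
}
```

An empty return value only means these two checks passed; it does not rule out other causes such as a locked file.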
Problem: Results show only one big cluster
- Try increasing the strict threshold
- Reduce the minimum neighbors parameter
- Check if input similarity scores are too high
Problem: Results show too many small clusters
- Try decreasing the loose threshold
- Increase the minimum neighbors parameter
- Check if input similarity scores are too low
Problem: Cannot open results in Cytoscape
- Ensure you're selecting the correct file type in Cytoscape
- Check that the output directory is accessible
- Verify Cytoscape version compatibility
If you have other questions, create an issue or contact us; contact information is available in the footer.
Advanced Features
Batch Processing
- The tool can be run from the command line for automated processing
- Useful for analyzing multiple datasets
Custom Export Formats
- Additional export options available for specialized analysis
- Can be customized in the source code for specific needs
Performance Optimization
- Handles networks with thousands of sequences efficiently
- Memory-optimized for large datasets