Software

Glucagon Sequence Collection and Processing Workflow


Objective

Collect glucagon and glucagon-related sequences from multiple search services (PSI-BLAST, HMMER, and a cluster search), clean and deduplicate them, compute sequence statistics, identify unique sequences, and prepare the dataset for MUSCLE multiple sequence alignment (MSA).

Step 1. Data Collection

Reference Sequence

We begin with the glucagon reference sequence from PDB 3IOL, Chain B:

>3IOL_2|Chain B|Glucagon|Homo sapiens (9606)
HAEGTFTSDVSSYLEGQAAKEFIAWLVKGRG
  1. PSI-BLAST

    Website: https://www.ebi.ac.uk/jdispatcher/sss/psiblast

    Paste the reference sequence above.

    Download results as: psiblast-entries-fasta.txt

    Example entry:

    >UNIPROT:GLUC_MOUSE UNIPROT:GLUC_MOUSE P55095 Pro-glucagon (Glicentin) (Glicentin-related polypeptide) ...
    MKTIYFVAGLLIMLVQGSWQHALQDTEENPRSFPASQTEAHEDPDEMNEDKRHSQGTFTSDYSKYLDSRRAQDFVQWLMNTKRNRNNIAKRHDEFERHAEGTFTSDVSSYLEGQAAKEFIAWLVKGRGRRDFPEEVAIAEELGRRHADGSFSDEMSTILDNLATRDFINWLIQTKITDKK
    
  2. HMMER (phmmer)

    Website: https://www.ebi.ac.uk/Tools/hmmer/search/phmmer

    Paste the same reference sequence.

    Download results as: hmmer-entries-fasta.fa

    Example entries:

    >GLUC1_XENLA/53-83 HSQGTFTSDYSKYLDSRRAQDFVQWLMNTKR
    >GLUC1_XENLA/97-127 HAEGTFTSDVTQQLDEKAAKEFIDWLINGGP
    
  3. Cluster Search

    Paste the same reference sequence and download the results as: seqdump.txt

    Example entries:

    >AAT00451.1 glucagon, partial [Capra hircus]
    NNIAKRHDEFERHAEGTFTSDVSSYLEGQAAKEFIAWLVKGRGRR
    >BAW32319.1 glucagon, partial [Felis catus]
    HSQGTFTSDYSKYLDSRRAQDFVQWLMNTKRNKNNIAKRHDEFERHAEGTFTSDVSSYLEGQAAKEFIAWLVKGRGRRDF
    

Step 2. Data Integration & Cleaning

  • Combine all FASTA files (psiblast-entries-fasta.txt, hmmer-entries-fasta.fa, seqdump.txt) into one master file.
  • A short Python script can do this (see the sketch after this list). Name the master file something like: combined_sequences.txt.
  • Remove duplicates (identical amino acid sequences; ignore headers). Keep only one copy. Save as combined_sequences_clean.txt.
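
A minimal sketch of this combine-and-deduplicate step, assuming the three Step 1 exports are in the working directory (the script name combine_and_dedup.py is only a suggestion):

# combine_and_dedup.py -- sketch of Step 2; file names follow Step 1.

INPUT_FILES = ["psiblast-entries-fasta.txt", "hmmer-entries-fasta.fa", "seqdump.txt"]

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line, []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Master file: plain concatenation of all three exports.
with open("combined_sequences.txt", "w") as out:
    for path in INPUT_FILES:
        for header, seq in read_fasta(path):
            out.write(f"{header}\n{seq}\n")

# Cleaned file: drop entries whose amino-acid sequence was already seen,
# ignoring headers and keeping the first copy.
seen = set()
with open("combined_sequences_clean.txt", "w") as out:
    for header, seq in read_fasta("combined_sequences.txt"):
        if seq not in seen:
            seen.add(seq)
            out.write(f"{header}\n{seq}\n")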

Step 3. Sequence Statistics

For each dataset (PSI-BLAST, HMMER, Cluster, Combined), compute:

  • Sequence count
  • Mean sequence length
  • Median length
  • Standard deviation (SD)
  • Variance

Save results as: Sequence_statistics.txt
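
A sketch of how the report could be generated, assuming the Step 6 helper script has been saved as glucagon_helpers.py (a hypothetical module name; its example usage is guarded by __main__, so importing it has no side effects):

# stats_report.py -- sketch of Step 3.
# "glucagon_helpers" is a hypothetical name for the Step 6 script.
from glucagon_helpers import load_fasta, compute_stats

DATASETS = {
    "PSI-BLAST": "psiblast-entries-fasta.txt",
    "HMMER": "hmmer-entries-fasta.fa",
    "Cluster": "seqdump.txt",
    "Combined": "combined_sequences_clean.txt",
}

with open("Sequence_statistics.txt", "w") as out:
    for name, path in DATASETS.items():
        s = compute_stats(load_fasta(path))
        out.write(
            f"{name}: count={s['count']}, mean={s['mean']:.1f}, "
            f"median={s['median']}, sd={s['stddev']:.1f}, "
            f"variance={s['variance']:.1f}\n"
        )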

Step 4. Uniqueness Filtering

Criteria for uniqueness:

  • Length filter: Compute the cutoff as mean + 2 × SD of the sequence lengths (e.g., ~243 aa). Exclude sequences longer than this cutoff.
  • Within the same length group: identical sequences → keep one; different sequences → keep both (see the toy illustration below).

Save filtered dataset as: Unique_sequences.txt
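
To make the two criteria concrete, here is a toy illustration (the mini-sequences are made up) using the filter_unique helper defined in Step 6:

# Toy check of the Step 4 criteria (hypothetical short sequences).
from glucagon_helpers import filter_unique  # hypothetical module name, see Step 3

toy = {
    ">a": "HSQGTF",   # kept
    ">b": "HSQGTF",   # identical to >a at the same length -> dropped
    ">c": "HAEGTF",   # same length, different sequence -> kept
    ">d": "H" * 300,  # longer than the cutoff -> excluded
}
print(list(filter_unique(toy, cutoff=243)))  # ['>a', '>c']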

Step 5. Multiple Sequence Alignment (MSA)

Use the file Unique_sequences.txt. Upload it to MUSCLE: https://www.ebi.ac.uk/jdispatcher/msa/muscle

Run the alignment and save results as: unique_sequences_msa.aln
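
If you prefer the command line to the web form, MUSCLE can also be run locally; a sketch assuming MUSCLE v5 is on your PATH (older v3 releases use -in/-out instead, and v5 writes the alignment in FASTA format regardless of the file extension):

muscle -align Unique_sequences.txt -output unique_sequences_msa.aln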

Step 6. Python Helper Script

The following Python script automates data loading, sequence statistics (Step 3), and uniqueness filtering (Step 4).

from statistics import mean, median, pstdev, pvariance

def load_fasta(filename):
    """Parse a FASTA file into a {header: sequence} mapping."""
    seqs = {}
    with open(filename) as f:
        header, seq = None, []
        for line in f:
            line = line.strip()
            if line.startswith(">"):
                if header:
                    seqs[header] = "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
        if header:
            seqs[header] = "".join(seq)
    return seqs

def compute_stats(sequences):
    """Return length statistics (population SD and variance) for the mapping."""
    lengths = [len(seq) for seq in sequences.values()]
    return {
        "count": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "stddev": pstdev(lengths),
        "variance": pvariance(lengths),
        "max": max(lengths),
        "min": min(lengths),
    }

def filter_unique(sequences, cutoff):
    """Apply the Step 4 criteria: drop sequences longer than the cutoff, and
    within each length group keep one copy of each distinct sequence."""
    seen, unique = {}, {}
    for header, seq in sequences.items():
        length = len(seq)
        if length > cutoff:
            continue  # length filter: exclude sequences above mean + 2*SD
        group = seen.setdefault(length, set())
        if seq not in group:
            group.add(seq)
            unique[header] = seq
    return unique

# Example usage (guarded so the module can be imported without side effects)
if __name__ == "__main__":
    seqs = load_fasta("combined_sequences_clean.txt")
    stats = compute_stats(seqs)
    cutoff = stats["mean"] + 2 * stats["stddev"]
    unique_seqs = filter_unique(seqs, cutoff)

    # File name matches the one used in Steps 4 and 5.
    with open("Unique_sequences.txt", "w") as f:
        for h, s in unique_seqs.items():
            f.write(f"{h}\n{s}\n")

Installation & Setup


Requirements: Python 3.9+ (standard library only). Optional: Biopython for extended FASTA parsing.

python3 -m venv .venv
source .venv/bin/activate
# No external dependencies are required for the minimal workflow.
# Optional, for Biopython-based parsing (see the sketch below):
# pip install biopython
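
If you do install Biopython, the hand-rolled parser can be swapped for Bio.SeqIO; a minimal equivalent of load_fasta from Step 6:

# Optional Biopython-based parser, equivalent to load_fasta in Step 6.
from Bio import SeqIO

def load_fasta_biopython(filename):
    # Keep the ">" prefix so headers match the hand-rolled parser.
    return {">" + rec.description: str(rec.seq)
            for rec in SeqIO.parse(filename, "fasta")}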

Quick Start


  1. Export search results: psiblast-entries-fasta.txt, hmmer-entries-fasta.fa, seqdump.txt.
  2. Concatenate and deduplicate into combined_sequences_clean.txt.
  3. Run the helper script to compute stats and write Unique_sequences.txt.
  4. Upload Unique_sequences.txt to MUSCLE and save unique_sequences_msa.aln.

Architecture Overview


  • Data Sources: PSI-BLAST, HMMER, Cluster search.
  • Processing: Concatenate → deduplicate → stats → filter by (mean + 2·SD).
  • Outputs: Clean sequences, uniqueness-filtered set, MUSCLE alignment.

API Overview


def load_fasta(filename: str) -> dict[str, str]:
    """Return a mapping of header → sequence."""

def compute_stats(sequences: dict[str, str]) -> dict[str, float]:
    """Return count, mean, median, stddev, variance, max, and min of sequence lengths."""

def filter_unique(sequences: dict[str, str], cutoff: float) -> dict[str, str]:
    """Return the subset filtered by the length cutoff and deduplicated within length groups."""

Repository