Collect glucagon and glucagon-related sequences from multiple search tools (PSI-BLAST, HMMER, Cluster Search), clean and deduplicate them, compute sequence-length statistics, identify unique sequences, and prepare the dataset for multiple sequence alignment (MSA) with MUSCLE.
We begin with the glucagon reference sequence from PDB 3IOL, Chain B:
>3IOL_2|Chain B|Glucagon|Homo sapiens (9606)
HAEGTFTSDVSSYLEGQAAKEFIAWLVKGRG
PSI-BLAST
Website: https://www.ebi.ac.uk/jdispatcher/sss/psiblast
Paste the reference sequence above.
Download results as: psiblast-entries-fasta.txt
Example entry:
>UNIPROT:GLUC_MOUSE UNIPROT:GLUC_MOUSE P55095 Pro-glucagon (Glicentin) (Glicentin-related polypeptide) ...
MKTIYFVAGLLIMLVQGSWQHALQDTEENPRSFPASQTEAHEDPDEMNEDKRHSQGTFTSDYSKYLDSRRAQDFVQWLMNTKRNRNNIAKRHDEFERHAEGTFTSDVSSYLEGQAAKEFIAWLVKGRGRRDFPEEVAIAEELGRRHADGSFSDEMSTILDNLATRDFINWLIQTKITDKK
HMMER (phmmer)
Website: https://www.ebi.ac.uk/Tools/hmmer/search/phmmer
Paste the same reference sequence.
Download results as: hmmer-entries-fasta.fa
Example entry:
>GLUC1_XENLA/53-83
HSQGTFTSDYSKYLDSRRAQDFVQWLMNTKR
>GLUC1_XENLA/97-127
HAEGTFTSDVTQQLDEKAAKEFIDWLINGGP
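The phmmer example headers carry the matched residue range after the slash (for example 53-83). The sketch below assumes every header follows the `>NAME/START-END` pattern seen in those two entries and checks that the range agrees with the sequence length. `parse_hit_header` is a hypothetical helper written for this illustration, not part of HMMER.

```python
import re

# Assumed header layout, taken from the example entries: >NAME/START-END
HIT_HEADER = re.compile(r"^>(?P<name>[^/\s]+)/(?P<start>\d+)-(?P<end>\d+)")


def parse_hit_header(header):
    """Return (name, start, end) for a phmmer-style header, or None if it does not match."""
    match = HIT_HEADER.match(header)
    if match is None:
        return None
    return match["name"], int(match["start"]), int(match["end"])


name, start, end = parse_hit_header(">GLUC1_XENLA/53-83")
sequence = "HSQGTFTSDYSKYLDSRRAQDFVQWLMNTKR"
# The range is inclusive, so its span should equal the sequence length.
range_matches_length = (end - start + 1) == len(sequence)
print(name, start, end, range_matches_length)
```

Headers from the other tools (UniProt or GenBank style) do not match this pattern, so the helper returns None for them.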
Cluster Search
Paste the reference sequence and download results as: seqdump.txt
Example entries:
>AAT00451.1 glucagon, partial [Capra hircus]
NNIAKRHDEFERHAEGTFTSDVSSYLEGQAAKEFIAWLVKGRGRR
>BAW32319.1 glucagon, partial [Felis catus]
HSQGTFTSDYSKYLDSRRAQDFVQWLMNTKRNKNNIAKRHDEFERHAEGTFTSDVSSYLEGQAAKEFIAWLVKGRGRRDF
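Several hits are longer precursor fragments that contain the query peptide rather than matching it end to end. As a small illustration, the sketch below locates the 3IOL reference sequence inside the Capra hircus example entry using plain substring search. `locate` is a hypothetical helper, and exact matching is an assumption that will miss hits with substitutions.

```python
REFERENCE = "HAEGTFTSDVSSYLEGQAAKEFIAWLVKGRG"  # 3IOL chain B, as given in this document
# Capra hircus partial entry (AAT00451.1), copied from the Cluster Search examples.
HIT = "NNIAKRHDEFERHAEGTFTSDVSSYLEGQAAKEFIAWLVKGRGRR"


def locate(reference, hit):
    """Return the 1-based inclusive (start, end) of reference within hit, or None."""
    index = hit.find(reference)
    if index == -1:
        return None
    return index + 1, index + len(reference)


span = locate(REFERENCE, HIT)
print(span)
```

Because hits like this include flanking residues, their lengths vary widely, which is why the later steps look at length statistics before alignment.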
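Merge the three downloads (psiblast-entries-fasta.txt, hmmer-entries-fasta.fa, seqdump.txt) into one master file.
Save the merged file as: combined_sequences.txt
Clean and deduplicate it, and save the result as: combined_sequences_clean.txt
For each dataset (PSI-BLAST, HMMER, Cluster, Combined), compute the following over the sequence lengths: count, mean, median, standard deviation, variance, maximum, and minimum.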
Save results as: Sequence_statistics.txt
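The merge, clean, and statistics steps can be sketched with the standard library alone. The cleaning rule here is an assumption (upper-case, letters only, drop empty records and repeated headers), since the document does not spell one out. The function names, the tab-separated report layout, and the choice to compute per-tool statistics on cleaned records are likewise illustrative.

```python
from pathlib import Path
from statistics import mean, median, pstdev, pvariance


def read_fasta(path):
    """Return a list of (header, sequence) records from a FASTA file."""
    records, header, chunks = [], None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def clean(records):
    """Assumed rule: upper-case, keep letters only, drop empty and repeated headers."""
    seen, cleaned = set(), []
    for header, seq in records:
        seq = "".join(ch for ch in seq.upper() if ch.isalpha())
        if seq and header not in seen:
            seen.add(header)
            cleaned.append((header, seq))
    return cleaned


def write_fasta(records, path):
    Path(path).write_text("".join(f"{h}\n{s}\n" for h, s in records))


def length_stats(records):
    """Length statistics; raises if records is empty."""
    lengths = [len(seq) for _, seq in records]
    return {"count": len(lengths), "mean": mean(lengths), "median": median(lengths),
            "stddev": pstdev(lengths), "variance": pvariance(lengths),
            "max": max(lengths), "min": min(lengths)}


def build_report(sources, out_dir="."):
    """Merge, clean, and write statistics for each dataset plus the combined set."""
    out_dir = Path(out_dir)
    raw = {name: read_fasta(path) for name, path in sources.items()}
    combined = [rec for records in raw.values() for rec in records]
    write_fasta(combined, out_dir / "combined_sequences.txt")
    datasets = {name: clean(records) for name, records in raw.items()}
    datasets["Combined"] = clean(combined)
    write_fasta(datasets["Combined"], out_dir / "combined_sequences_clean.txt")
    lines = []
    for name, records in datasets.items():
        fields = "\t".join(f"{k}={v}" for k, v in length_stats(records).items())
        lines.append(f"{name}\t{fields}")
    (out_dir / "Sequence_statistics.txt").write_text("\n".join(lines) + "\n")
    return datasets
```

A call such as `build_report({"PSI-BLAST": "psiblast-entries-fasta.txt", "HMMER": "hmmer-entries-fasta.fa", "Cluster": "seqdump.txt"})` would write all three output files into the current directory.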
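Criteria for uniqueness (as implemented in the helper script below):
The sequence length is at most the cutoff (mean + 2 × standard deviation of the sequence lengths).
The sequence is not an exact duplicate of one already kept; the first header encountered is retained.
Save filtered dataset as: unique_sequences.txt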
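The helper script in this document filters by a length cutoff of mean + 2 × stddev and drops exact repeats. The sketch below works through that rule on made-up toy data, so the effect of the cutoff is easier to see. The function names and the toy sequences are illustrative only.

```python
from statistics import mean, pstdev


def length_cutoff(sequences, k=2.0):
    """Cutoff = mean + k * population standard deviation of the lengths."""
    lengths = [len(s) for s in sequences]
    return mean(lengths) + k * pstdev(lengths)


def unique_within_cutoff(sequences, cutoff):
    """Keep sequences no longer than cutoff, dropping exact repeats (first wins)."""
    seen, kept = set(), []
    for seq in sequences:
        if len(seq) <= cutoff and seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept


# Toy data (made up): ten copies each of two 30-residue peptides plus one long outlier.
toy = ["A" * 30] * 10 + ["G" * 30] * 10 + ["K" * 180]
cutoff = length_cutoff(toy)
kept = unique_within_cutoff(toy, cutoff)
print(round(cutoff, 1), [len(s) for s in kept])
```

Here the 180-residue outlier exceeds the cutoff and the repeats collapse to two sequences. One caveat worth knowing: with five or fewer sequences, no length can exceed mean + 2 × the population standard deviation, so the cutoff only starts removing outliers on larger datasets.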
Use the file unique_sequences.txt. Upload it to MUSCLE: https://www.ebi.ac.uk/jdispatcher/msa/muscle
Run the alignment and save results as: unique_sequences_msa.aln
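Before the upload step, it can help to confirm that the FASTA file is well-formed. The sketch below is one possible check. The accepted letter set (the 20 standard amino acids plus X, B, Z, U, O) is an assumption, not a documented MUSCLE requirement, and `check_fasta_text` is a hypothetical helper.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYXBZUO")  # assumption, see lead-in


def check_fasta_text(text):
    """Return a list of human-readable problems found in FASTA text (empty if none)."""
    problems, headers, header, has_sequence = [], set(), None, False
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None and not has_sequence:
                problems.append(f"{header}: no sequence")
            if line in headers:
                problems.append(f"line {number}: duplicate header {line}")
            headers.add(line)
            header, has_sequence = line, False
        elif header is None:
            problems.append(f"line {number}: sequence before first header")
        else:
            has_sequence = True
            bad = sorted(set(line.upper()) - VALID_RESIDUES)
            if bad:
                problems.append(f"line {number}: unexpected characters {''.join(bad)}")
    if header is not None and not has_sequence:
        problems.append(f"{header}: no sequence")
    if len(headers) < 2:
        problems.append("an alignment needs at least two sequences")
    return problems
```

For example, `check_fasta_text(open("unique_sequences.txt").read())` returns an empty list when nothing looks wrong.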
Helper Python script to automate data loading, statistics, and uniqueness filtering.
from statistics import mean, median, pstdev, pvariance


def load_fasta(filename):
    """Read a FASTA file into a dict mapping header line to sequence.

    If the same header appears more than once, the last record wins.
    """
    seqs = {}
    with open(filename) as f:
        header, seq = None, []
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record.
                if header is not None:
                    seqs[header] = "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
        # Store the final record.
        if header is not None:
            seqs[header] = "".join(seq)
    return seqs


def compute_stats(sequences):
    """Return length statistics (population stddev and variance)."""
    lengths = [len(seq) for seq in sequences.values()]
    return {
        "count": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "stddev": pstdev(lengths),
        "variance": pvariance(lengths),
        "max": max(lengths),
        "min": min(lengths),
    }


def filter_unique(sequences, cutoff):
    """Drop sequences longer than cutoff, then keep the first copy of each sequence."""
    seen, unique = {}, {}
    for header, seq in sequences.items():
        length = len(seq)
        if length > cutoff:
            continue
        # Group seen sequences by length; sets give fast membership tests.
        group = seen.setdefault(length, set())
        if seq not in group:
            group.add(seq)
            unique[header] = seq
    return unique


# Example usage
seqs = load_fasta("combined_sequences_clean.txt")
stats = compute_stats(seqs)
cutoff = stats["mean"] + 2 * stats["stddev"]
unique_seqs = filter_unique(seqs, cutoff)
with open("unique_sequences.txt", "w") as f:
    for h, s in unique_seqs.items():
        f.write(f"{h}\n{s}\n")
Requirements: Python 3.9+ (standard library only). Optional: Biopython for extended FASTA parsing.
python3 -m venv .venv
source .venv/bin/activate
# No external deps required for the minimal workflow
# If you use extras, then:
# pip install biopython
Download psiblast-entries-fasta.txt, hmmer-entries-fasta.fa, seqdump.txt.
Merge and clean them into combined_sequences_clean.txt.
Run the helper script to produce unique_sequences.txt.
Upload unique_sequences.txt to MUSCLE and save unique_sequences_msa.aln.
def load_fasta(filename) -> dict[str, str]:
    """Return mapping of header → sequence."""
def compute_stats(sequences: dict[str, str]) -> dict[str, float]:
    """Return count, mean, median, stddev, variance, max, min."""
def filter_unique(sequences: dict[str, str], cutoff: float) -> dict[str, str]:
    """Return subset filtered by length cutoff and deduped within length groups."""