A task orchestration framework for reproducible, distributable bioinformatic workflows.
Introduction
maestro is a framework for writing and distributing computational workflows in a configurable, reproducible, and scalable manner. It comprises two core components: the libmaestro library and the maestro-cli command-line interface. libmaestro enables developers to define atomic processes and compose them into complex workflows, while leveraging compile-time and startup-time analyses to validate the correctness of these workflows before they ever run. maestro-cli, on the other hand, provides modern tooling to developers and users alike, enabling fast builds, process control, and standardized distribution.
Key highlights
maestro powers meduCA’s key structural prediction and molecular dynamics pipelines
The maestro framework empowers bioinformaticians in iGEM and beyond to define more reliable, reproducible, and redistributable workflows
Once defined, workflows are easy to use for generalist and expert users alike
Configuring pre-defined workflows is simple; input files and execution parameters are specified in a single configuration file
maestro integrates extensively with commonly-used, existing technologies
Docker, Podman, and Apptainer are supported for containerized workflows
Slurm execution is supported, enabling workflows to execute on most academic high-performance compute clusters
In this document, the following definitions apply:
maestro: the maestro framework, comprising libmaestro, maestro-cli, and the maestro user experience philosophy
libmaestro: the library component of maestro, i.e., the Rust crate that enables developers to write workflows
maestro-cli: the command-line tool which comprises maestro’s run, build, and distribution system
compile-time: the time at which the developer compiles (“builds”) code into an executable program
runtime: the time at which the program is run by the user
program startup: the moment at which the user begins running a workflow (i.e., the start of runtime)
Preamble
Bioinformatics is defined as the “application of tools of computation and analysis to the capture and interpretation of biological data.” This field, though still somewhat nascent, has cemented itself as essential for the scalable management and analysis of data in modern biology. As such, informatic workflows are a critical facet of many a team’s iGEM project (including ours, with our phylogenetic analysis and modelling pipelines), and the development of improved technologies in this field is critical to the advancement of synthetic biology.
At their core, bioinformatic workflows are typically the summation of many smaller analyses, composed together to transform data from raw inputs into a meaningful analysis. This closely mirrors what is commonly referred to as the “Unix philosophy,” a set of maxims based on the experience of the leading developers of the Unix operating system. The first two are as follows:
Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features.”
Expect the output of every program to become the input to another, as yet unknown, program…
Bioinformaticians typically develop programs following this philosophy: as atomic, independent, composable components that can be joined together by making the outputs of one component the input to the next. Then, higher-level programs can build off these independent components; for instance, the TreeSAPP tool used for our team’s phylogenetic analysis is built off HMMER, Prodigal, RAxML-NG, and many other dependencies.
Lately, various tools with the explicit goal of assisting bioinformaticians in composing processes into workflows have emerged, with Nextflow and Snakemake leading the charge. Though our team has leveraged these tools in the past — especially in our modelling workflows — we experienced various shortcomings, most notably in parallel composition, correctness verification, and redistribution. As such, maestro was born, designed along the following guiding principles:
Verify correctness and fail early
Run every possible analysis on the defined workflow to ensure that it will execute properly at runtime
If issues are found, either in workflow definition or user configuration, fail the workflow early (at compile-time or program startup), rather than risking unpredictable behavior or errors later
Scale with complexity
Make it easy to define simple AND complex workflows
A workflow composed of 50 components should not require additional specialized knowledge to write compared to a workflow composed of 5 components
Workflows should be easily optimized for parallelism where possible
Enable redistribution
End-users should be able to re-use previously defined workflows without being required to fully understand how the workflow was defined
End-users should be able to easily configure workflows to execute on whatever environment they wish, with whatever input data they wish
Workflows should be easily distributable, and run in a reproducible manner across diverse environments
Improve the accessibility of bioinformatics
maestro should be usable by beginners and advanced users alike
It should be welcoming, integrate well with existing tooling (i.e., editor completion via the language server protocol, diagnostics, build/test infrastructure, etc.), and be easy to understand and set up
Distributed maestro pipelines should be easily usable by non-technical individuals, making it easier for researchers to integrate informatic analyses into their research
User experience
maestro’s user and developer experience closely follows these guiding principles. Every maestro project compiles into a platform-native executable, along with two user-facing configuration/information files, Maestro.toml and procinfo.toml.
Maestro.toml allows users to define process arguments and execution environments
This file is distributed alongside the executable, and allows the user to configure the workflow to their needs at runtime
procinfo.toml provides documentation for each process in a workflow
This file is auto-generated at compile time from the source code itself, providing a docstring, information about which executor is associated with each process, and the dependencies of each process
Furthermore, the following high-level features are provided by maestro:
Verify correctness and fail early
Developers are able to define processes in familiar shell syntax
These shell scripts are statically analyzed by ShellCheck at compile-time, and static analysis errors are converted directly into compiler errors
Attempted compilation of a shell script with two errors. The $test_dir variable is misspelt, and a missing line escape (\) is causing a redirection without a command. As such, compilation fails with the error forwarded to the developer.
To our knowledge, this feature is novel among workflow execution tools, and it has the potential to catch entire classes of shell bugs at compile-time
This feature is also dear to our team: we had a modelling analysis (running under Nextflow) crash 16 hours into a job due to a missing line escape, an issue that maestro would have caught at compile-time
User-defined arguments and inputs (in Maestro.toml) are verified to exist at program startup rather than process runtime, ensuring the user is immediately made aware of any configuration errors rather than having workflows fail much later
Ryan has extensively leveraged existing workflow execution tools — notably Nextflow — in his research and work. As such, we had a conversation to compare and contrast maestro’s features against currently established tools.
Ryan McLaughlin
PhD Candidate, Bioinformatics
In our discussion, Ryan noted that
[maestro] presents improvements over the top of the line.
Specifically, he was impressed with maestro’s compile-time script validation (noting that shell syntax errors are a common frustration when defining workflows) as well as support for easy configuration of execution environments. Furthermore, Ryan expressed that he wishes to leverage maestro in some of his research/work moving forward.
Scale with complexity
maestro provides a simple interface for enabling user configuration of complex, multi-step workflows
Workflows are self-documenting via procinfo.toml, enabling user transparency on process dependencies and execution configuration
User configuration is designed to be simple, intuitive, and non-repetitive
Maestro.toml allows inheritance for executor definitions, where “children” can override attributes of their parents
Enable redistribution
maestro-cli provides the maestro bundle command, allowing generation of bundled, self-contained workflows
Optionally, these bundles can be packaged directly into a single compressed archive
The --arch option enables developers to automatically build for multiple operating systems and architectures
A bundled maestro workflow built for multiple architectures, provided by the “maestro bundle” command in maestro-cli. Shown in its raw (maestro bundle --arch all) and compressed (maestro bundle --arch all --compress xz) formats.
Bundled workflows are platform-native executables, allowing them to be executed directly like any other application without requiring the user to install maestro.
Improve the accessibility of bioinformatics
maestro provides dual APIs in Rust (primary) and Python (simplified)
The Python API provides fewer correctness guarantees (due to the nature of the Python language), but improves the accessibility of developing bioinformatics workflows due to the language’s simplicity
The Rust API is more fully featured and geared toward advanced users, with improved correctness checks and bundling integration
Type hinting and LSP integration are provided for both the Rust and Python APIs
As such, any editor with language server support can provide code highlighting, completion, and diagnostics inline
maestro’s language server support in both Rust (rust-analyzer) and Python (ty)
Installation
libmaestro is distributed as a standard Rust library. To add it to your project, simply run
Users can interface with a maestro workflow via two files, Maestro.toml and procinfo.toml. Maestro.toml is used to configure the behaviour of workflows, while procinfo.toml provides workflow dependency and execution documentation.
Maestro.toml
[inputs]
This table is designed to allow passing sets of input paths into the program.
Example:
[inputs]
# Input PDB files to the molecular dynamics workflow
input_files = ["data/1v9e_ph4.pdb", "data/1v9e_ph10.pdb", "data/BtCAII_ph4.pdb"]
[executor]
Type: top-level table containing subtables, with each subtable representing an executor definition. The subtable name defines the executor name (e.g., [executor.default] defines an executor named “default”).
This table is designed to allow defining execution environments for processes. Each executor definition MAY either fully define an executor, OR inherit from another executor’s definition. An executor definition MAY be of type "Local" or "Slurm". The Local executor SHALL directly execute the process, whereas the Slurm executor SHALL schedule the process onto a compute node via Slurm. The following configuration options are available:
modules
Defines a list of modules to import before the process runs
Equivalent to module load {name}
cpus
Defines how many CPUs to request for the process
memory
Defines how much memory to request for the process
type must be "per_node" or "per_cpu"
amount must be a value in megabytes
gpus
Defines how many GPUs to request for the process
tasks
Advises the Slurm controller that the process will launch a maximum of n tasks
Equivalent to the ntasks Slurm directive
nodes
Defines how many nodes should be assigned to the process
partition
Defines which partition the process should spawn on
time
Defines the maximum time to allocate for the process
Valid fields to set are days, hours, mins, and secs
mins and secs should be valid (<60)
account
Defines the account that should be charged for resources associated with the process
mail_user
Defines an email address for mail notifications
mail_type
Defines what alerts should send an email
Values MUST be NONE, ALL, BEGIN, END, FAIL, REQUEUE, INVALID_DEPEND, STAGE_OUT, TIME_LIMIT_50, TIME_LIMIT_80, TIME_LIMIT_90, TIME_LIMIT, or ARRAY_TASKS
Chained inheritance is allowed; the latest item in an inheritance chain takes priority
For example, if slurm_inherited2 inherits from slurm_inherited1 which inherits from slurm_base, and all three define some value for cpus, a process which executes on slurm_inherited2 will prioritize the cpus value in slurm_inherited2
All inheritance MUST be of the same type; e.g., inheriting from a Local executor and setting cpus is not allowed (there is no cpus configuration option for Local executors, only Slurm executors)
procinfo.toml
This file describes all processes in the workflow. It contains one table per process, along with information concerning the process.
# Runs tleap to prepare input files for MD simulation
["src/main.rs:39:17"]
executor = "direct"
deps = ["tleap"]

# Executes parmed to convert from AMBER into a GROMACS-compatible format
["src/main.rs:68:17"]
executor = "direct"
deps = ["python", "py:parmed"]

# Executes the primary GROMACS molecular dynamics process
["src/main.rs:100:17"]
executor = "slurm"
deps = ["gmx"]
Documentation string (above table)
Additional information provided by the developer about the process
executor
Describes which executor is used for the process
This executor MUST be defined in Maestro.toml
deps
Defines the dependencies of the process
These dependencies MUST be available when the process runs
Developer Reference
libmaestro’s main mechanism for defining an atomic workflow component is the process! macro. These blocks use a specialized syntax whose arguments parametrize the process.
Formal Specification
Augmented Backus-Naur Form is a metalanguage — a language for formally specifying the syntax of other languages — outlined in IETF Request for Comment (RFC) 5234. libmaestro’s process! syntax is defined in ABNF as follows (note: %s syntax from RFC 7405 is used):
where the special rule EXPR represents a Rust expression, IDENT represents a Rust identifier, LITSTR represents a Rust string literal, and LITBOOL represents a Rust boolean literal. EXECUTOR and SCRIPT are required, and PARAM types may not be repeated, except for ARG (where max. one definition of each arg type inputs/args/outputs is permitted).
Informal Specification
Less formally, libmaestro exposes its primary process definition API via process! blocks. These blocks contain arguments that specify the process to execute, its inputs and outputs, and its execution environment.
process! blocks MAY begin with a documentation string:
process! {
    /// This is a docstring that describes this process
    /// Maybe I talk more about what it does
    /// ...so the user knows how they should configure its resources
which is included in procinfo.toml to provide information about the process to the user
# This is a docstring that describes this process
# Maybe I talk more about what it does
# ...so the user knows how they should configure its resources
["lib/examples/workflow.rs:31:17"]
Following this doc string, various fields are allowed. Each field MUST follow the format LHS = RHS, where LHS is the field name and RHS is a Rust expression (the constraints on which expressions are allowed depend on the field).
name
Purpose
This field sets the process name, which also becomes the name of the folder in which the process is run. If the process will run multiple times in the same workflow with different inputs, it is desirable to make the process name depend on its input in some way (so that each run creates a unique folder and does not overwrite past runs).
Any arbitrary Rust expression which produces a value that implements ToString.
Optional
Yes. If a name is not set, a randomly generated ID will be used.
executor
Purpose
This field determines which user-defined executor will be used to execute the process. This information is documented in procinfo.toml; for instance, a process that depends on an executor named default will list executor = "default" in its procinfo.toml entry.
inputs, args, and outputs
Purpose
These fields allow injection of Rust variables into the script. Inputs/outputs must be file paths, and their existence will be checked before/after the script runs.
An array, where each element is a Rust identifier that points to a value. For inputs/outputs, these values MUST implement AsRef<OsStr> so that they can be constructed into Path objects. For args, these values MUST implement ToString. The values are then injected into the script as shell variables, allowing the use of $<identifier> to refer to them (see the sketch below).
Optional
Yes.
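As a rough sketch of how these fields fit together (the variable and process names here are hypothetical, and the remaining process! fields are elided with ... as in the examples below — this is not a complete definition):

let reference = Path::new("data/reference.fasta"); // input: must implement AsRef<OsStr>
let threads = 8;                                   // arg: must implement ToString
let alignment = Path::new("aligned.bam");          // output: checked for existence after the script runs

let process_output = process! {
    name = "align_reads",
    executor = "default",
    inputs = [reference],
    args = [threads],
    outputs = [alignment],
    // within the script, these values are available as the shell variables
    // $reference, $threads, and $alignment
    ...
}?;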
dependencies
Purpose
To document the dependencies of the process. These dependencies are injected into the process’s entry in procinfo.toml (under the deps key).
An array, where each element MUST be a Rust string literal. maestro will attempt to auto-determine dependencies by parsing the script definition as Bash syntax; entries specified in this array will be appended to these auto-determined dependencies (for information on this parsing algorithm, see the rundown here). Array entries with the form "!<name>" will instead direct maestro to ignore the auto-determined dependency matching name (for example, "!cat" would cause an auto-detected cat dependency to be ignored). If any array entry matches "!", ALL auto-determined dependencies will be ignored.
Optional
Yes.
inline
Purpose
To specify whether the script is defined inline or in a separate file.
The script itself is given as a Rust string literal. If inline is true (the default behaviour), this string literal MUST contain the body of the script. A shebang MAY be included; if no shebang is detected, #!/bin/bash will be injected. If inline is false, this string literal MUST be a path to a script file; the contents of this file will be read and included into the workflow at compile-time. Leading/trailing whitespace in the script will be stripped, and leading whitespace on each line will be stripped.
Optional
No.
Output format
The output of the process! block is a list of paths (Vec<PathBuf>), which contains the canonicalized paths associated with each item defined in the process’s outputs block in order, then the process’s working directory. For instance, when executing the following:
let output1 = Path::new("out.txt");
let output2 = Path::new("err.txt");
let process_output = process! {
    name = "my_process",
    ...
    outputs = [output1, output2],
    ...
}?;
the variable process_output would be a vector of length 3, containing the canonicalized path associated with output1, the canonicalized path associated with output2, and the path to the process’s execution directory. For instance, it could be set to:
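(Illustrative values only — the actual locations depend on where the workflow’s session directory is created.)

// hypothetical paths; the working directory is created by maestro
vec![
    PathBuf::from("/home/user/maestro_work/my_process/out.txt"),
    PathBuf::from("/home/user/maestro_work/my_process/err.txt"),
    PathBuf::from("/home/user/maestro_work/my_process"),
]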
maestro also provides an efficient API to destructure these vectors via lossy conversion into arrays (the .into_array() method). For example, replacing the above with
let [process_out1, process_out2] = process! {
    name = "my_process",
    ...
    outputs = [output1, output2],
    ...
}?.into_array();
binds output1 from the process definition to process_out1, and output2 to process_out2. The third element of the vector (the process’s execution directory) will remain unbound and be discarded.
Additional APIs
arg! and inputs!
Used to parse arguments and input paths from Maestro.toml. The body MUST be a Rust string literal; this SHOULD match an entry in the [args] or [inputs] tables in Maestro.toml. The existence of matching entries in Maestro.toml is checked at program startup, thus ensuring that processes do not fail at execution time due to missing configuration. arg! yields the argument value as a &str; inputs! yields its paths as a &[&Path].
Example
let init_msg = arg!("init_msg");
let input_files = inputs!("input_files");
#[maestro::main]
This is an attribute macro which can be attached to functions; it MUST be attached to the program’s main function. This provides an ergonomic pattern to verify the Maestro.toml configuration and initialize the workflow’s session directory, as well as inject session teardown functionality. The following:
#[maestro::main]
fn main() {
    // code in main...
}
is effectively identical to
fn main() {
    maestro::initialize();
    let main_result = {
        // code in main...
    };
    maestro::deinitialize();
    main_result
}
dagger integration
libmaestro is explicitly designed to integrate with dagger. Each process! invocation is entirely stateless and self-contained, enabling processes to be parallelized and scaled at will by leveraging dagger’s primitives. Furthermore, maestro is designed to represent process I/O as function I/O, enabling seamless integration with dagger’s “parallelization based on data flow” model. As shown above, a process’s outputs can be destructured; the logical conclusion of this approach is that the outputs of one process can become the inputs to the next. For instance, in our molecular dynamics maestro workflow:
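(The snippet below is an illustrative sketch rather than the workflow’s exact source: the function names come from the description that follows, while their signatures and the intermediate file names are assumptions.)

// tleap produces AMBER topology/coordinate files from the input structure
let [prmtop, inpcrd] = tleap(&input_pdb)?.into_array();
// parmed converts them into a GROMACS-compatible format
let [gro, top] = parmed(&prmtop, &inpcrd)?.into_array();
// gromacs runs the molecular dynamics simulation itself
let [trajectory] = gromacs(&gro, &top)?.into_array();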
Here, tleap, parmed, and gromacs are all functions which wrap a process! invocation. The outputs of each process are destructured, then passed as inputs to the next, demonstrating the flow of information between processes in the workflow (this closely mirrors pipes on Unix systems). If, for instance, we wished to execute multiple processes on the outputs of tleap, we could leverage dagger!:
resulting in the following parallel process (this visualization is rendered directly by dagger):
For our molecular dynamics simulations, we were more interested in parallelizing the process on multiple input paths (.pdb files for analysis). Specifically, we wished to enable the user to input multiple paths:
and execute the full molecular dynamics pipeline on each file in parallel. This was implemented by leveraging dagger’s parallelize primitive:
let input_files = inputs!("input_files");
let workflow = |path: &&Path| -> NodeResult<PathBuf> {
    // molecular dynamics workflow defined here
};
// executes the `workflow` function on each input file in parallel
let process_results = parallelize(input_files, workflow)
    .into_iter()
    .map(|result| result.expect("No processes should panic!"));
Our molecular dynamics pipeline running on the ARC Sockeye HPC cluster. Three input files (1v9e_ph4, 1v9e_ph10, BtCAII_ph4) are being processed in parallel. The tleap and parmed steps execute directly, and the gromacs step is scheduled on Slurm. Video is sped up 8x.
Python API
A set of Python bindings to libmaestro is available, enabling users to build workflows directly in Python. This binding set is comprehensive, and largely provides users the same APIs as are available in Rust. However, most of the correctness checks and compile-time processes executed by libmaestro (e.g., checking process definitions for errors, verifying that all args/inputs/executor definitions are configured in Maestro.toml, generating procinfo.toml) are not available when using libmaestro from Python. This is simply due to constraints in the language (Python has no compile-time execution mechanisms that enable libmaestro to execute these processes).
Python bindings to libmaestro are generated using pyo3, and the stub file is generated using pyo3-stub-gen.
Stub file
The full Python API of libmaestro is available in the stub file here. This file is bundled as part of the Python package wheel, and informs your favourite editor’s code completion and diagnostics. The stub contains a full set of type annotations, enabling type validation through static analysis tools such as mypy, as well as language servers such as basedpyright and ty.
Core API
Processes can be defined via the constructor of the Process type.
maestro-cli
Installation
This will install the maestro binary and load it into your PATH
Commands
To see all available commands, run maestro help:
λ maestro help
Subcommands in the maestro CLI

Usage: maestro <COMMAND>

Commands:
  init          Initialize a new maestro project
  bundle        Compile a project and package it for redistribution
  build         Build a project
  run           Run a binary or project
  kill          Kill a running maestro process
  update-cache  Update the libmaestro cache
  help          Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version
init
λ maestro init --help
Initialize a new maestro project

Usage: maestro init [PATH]

Arguments:
  [PATH]  [default: .]

Options:
  -h, --help  Print help
Initializes a new maestro project, with all dependencies set up and libmaestro pre-built. src/main.rs contains a simple demo workflow pre-defined.
One positional argument to specify a target path at which the project should be initialized is supported
If unset, the current directory will be used (it must be empty)
bundle
λ maestro bundle --help
Compile a project and package it for redistribution

Usage: maestro bundle [OPTIONS] [CARGO_ARGS]...

Arguments:
  [CARGO_ARGS]...  Arguments to pass to cargo build

Options:
  -c, --compress <COMPRESS>  Compresses the bundle into an archive [possible values: zip, gzip, xz, bzip2, zstd, lzma]
  -a, --arch <ARCH>          Bundle for a target architecture; defaults to the host arch [possible values: linux, apple, all]
  -r, --runtime <RUNTIME>    Container runtime for multi-arch builds; only read if --arch is set [possible values: docker, podman, apptainer]
  -h, --help                 Print help
Bundles the current maestro project for redistribution.
--compress allows the user to compress the bundled folder into a single archive file; multiple compression algorithms are supported
--arch allows the user to build multiple executables for a target operating system: one executable will be built for the x86_64 architecture, and one for the aarch64 (ARM) architecture
--arch builds in a container to enable cross-compilation; --runtime sets which runtime should be used to spawn the container
build
λ maestro build --help
Build a project

Usage: maestro build [CARGO_ARGS]...

Arguments:
  [CARGO_ARGS]...  Arguments to pass to cargo build

Options:
  -h, --help  Print help
Builds the current maestro project.
run
λ maestro run --help
Run a binary or project

Usage: maestro run [OPTIONS] [BINARY] [CARGO_ARGS]... [ARGS]...

Arguments:
  [BINARY]         A binary to run; when unspecified, the current project will be run
  [CARGO_ARGS]...  Arguments to pass to cargo run
  [ARGS]...        Arguments to pass to the program

Options:
  -b, --background  Run detached from the current shell session
  -h, --help        Print help
Runs the current maestro project, or the target binary.
One positional argument which specifies the path to an already built binary is supported
Built executables can also be run directly (e.g., ./<bin_name>)
--background spawns the process in the background and detaches it, similar to nohup
kill
λ maestro kill --help
Kill a running maestro process

Usage: maestro kill <TARGET>

Arguments:
  <TARGET>  The process to kill, by name or path

Options:
  -h, --help  Print help
Kills a running maestro process.
One positional argument which specifies the process to kill is required
If a process is running with id elated-bat (meaning it is running at maestro_work/elated-bat), running maestro kill elated-bat OR maestro kill maestro_work/elated-bat will work
update-cache
λ maestro update-cache --help
Update the libmaestro cache

Usage: maestro update-cache

Options:
  -h, --help  Print help
Rebuilds the cached version of libmaestro. This is done automatically when running maestro run/build/bundle.
help
Displays the full help documentation for maestro-cli
λ maestro help
Subcommands in the maestro CLI

Usage: maestro <COMMAND>

Commands:
  init          Initialize a new maestro project
  bundle        Compile a project and package it for redistribution
  build         Build a project
  run           Run a binary or project
  kill          Kill a running maestro process
  update-cache  Update the libmaestro cache
  help          Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version
DBTLs and Development
finalflow
The first proof of concept design sketch of a workflow executor. Designed for basic local execution and piping, with inbuilt parallelism primitives.
finalflow was implemented as a single-crate Rust library.
finalflow was able to execute simple (e.g., 2-step) processes consistently, but was riddled with small bugs and lacked proper error handling and stdout/stderr piping.
For future versions, parallelism should be outsourced to a dedicated framework (i.e., dagger) and support for additional execution environments should be expanded.
finalflow was highly reliant on mutable global state, making it difficult to parallelize due to high reliance on locks for safe, concurrent access. Additionally, it was not built to be extended for other execution platforms. This version is designed to make each process atomic, stateless, and configurable to a specific executor.
finalflow was rewritten to improve its atomicity, making it solely reliant on a one-time session directory setup. Furthermore, execution was offloaded to an Executor trait, allowing extensibility for future execution platforms. This DBTL is related to this Git commit.
This version of maestro is able to more reliably execute local processes, and all parallelization primitives have been stripped in favour of designing with dagger integration in mind. Execution is done by passing a process definition to a struct that implements Executor.
For future versions, execution should be extended to support additional platforms other than direct execution.
maestro was initially only able to execute scripts directly. This DBTL cycle aimed to add support for SLURM execution and configuration.
Slurm support was implemented by defining a new SlurmExecutor struct and implementing the Executor trait for it. The SlurmExecutor struct contains a slurm_config field that holds fields wrapping sbatch directives. Slurm support was added in this Git commit.
This version of maestro was tested to successfully schedule and execute scripts via Slurm on the University of British Columbia’s high performance compute cluster, ARC Sockeye.
For future versions, execution should be dynamic based on user configuration rather than hardcoded. Additionally, a framework needs to be developed for passing arguments into workflows.
Instead of hardcoding executor configurations, they will be configured at runtime via a Maestro.toml file that contains tables to define executors and input arguments.
All executor fields are made deserializable via serde-derive, and TOML parsing is done via toml. The process! API is updated to leverage custom, user-defined executors, and arg! is provided to access user-provided arguments.
Maestro.toml support was added in this Git commit. Various configurations were tested to ensure all deserialization is parsed as expected, including nested fields and tagged enumerations.
The runtime-configuration format provides far more transparent user configuration than hardcoded configurations. The most important learning point from this DBTL cycle is that all configuration must be checked at program startup, thus ensuring that execution does not fail unexpectedly midway through program execution.
Dr. Wong, who is familiar with bioinformatic workflows primarily from a biological perspective, observed that interacting with job schedulers poses a major challenge for novice bioinformaticians attempting to run complex computational workflows. He noted that maestro’s design — allowing workflows to be written independent of a specific executor and enabling execution environments to be configured through a config file — could significantly improve the accessibility of bioinformatics.
Dr. Donald Wong
Professor of bioinformatics
Library Internals
process!
process! is implemented as a function-like procedural macro. Broadly, these act like functions, but rather than taking in and outputting data, they take in and output raw code tokens. Effectively, procedural macros act as programmatic compile-time preprocessors which arbitrarily transform their inputs into new code which is executed. The input tokens to a process! block are first parsed into the following data structure by leveraging a custom parser built on top of the syn library:
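(The definition below is an assumed reconstruction for illustration; the field names mirror the fields documented in the Developer Reference rather than maestro’s exact internals.)

// assumed shape of the parsed process definition, built with syn types
struct ProcessDefinition {
    doc: Vec<String>,               // docstring lines, forwarded to procinfo.toml
    name: Option<syn::Expr>,        // optional process name expression
    executor: syn::LitStr,          // name of an executor defined in Maestro.toml
    inputs: Vec<syn::Ident>,        // identifiers injected as input path variables
    args: Vec<syn::Ident>,          // identifiers injected as plain arguments
    outputs: Vec<syn::Ident>,       // identifiers injected as output path variables
    dependencies: Vec<syn::LitStr>, // extra or ignored dependencies to record
    inline: Option<syn::LitBool>,   // whether the script literal is a body or a file path
    script: syn::LitStr,            // the script literal itself
}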
The tokens then undergo various transformations. For instance, if the script is not inlined (i.e., the content is in a separate file), the external file is read and the path is replaced with its contents; if a name is missing, a random ID is generated. Then, two critical steps are executed:
Script validation
Scripts are passed to ShellCheck for static analysis. However, the script must first undergo various transforms to improve the quality of ShellCheck’s analysis. A shebang is inserted if it is missing; leading and trailing whitespace is cleaned, and a variable definition is added for each Rust injection (the inputs, outputs, and args fields). Once ShellCheck has completed, its output is taken and parsed to identify if any errors were detected. If so, all error paths are transformed to point to a proper file/line/column (GCC format is used for integration with editor “path links”). For instance:
This script contains an error due to a missing line break; by clicking on the generated error message, the user is able to easily navigate to exactly where the error occurred.
Dependency analysis
The script is then analyzed to identify its dependencies. This is implemented via a custom parsing algorithm, which searches for binary identifiers at the start of lines, after | / && / || and within constructs such as {...} and (...). Keywords such as if / fi, for, case, etc. are explicitly ignored, as well as shell builtins such as alias, cd, break, echo, etc. The full list of shell builtins was sourced from the Bash man pages (you can execute
man bash | col -b | less +$(man bash | col -b | grep -n "SHELL BUILTIN COMMANDS" | tail -1 | cut -d: -f1)
to jump to the relevant section). This information is then appended to the growing procinfo.toml file, including the analyzed dependencies, the specified executor, etc.
Finally, various other transformations on the input tokens take place, and they are re-emitted to construct a maestro::Process object and execute it on the specified executor:
let executor_tokens = quote! {
    maestro::submit_request! {
        maestro::RequestedExecutor(#executor, file!(), line!(), column!())
    };
    maestro::config::MAESTRO_CONFIG.executors[#executor].exe(process)
};
quote! {{
    let process = maestro::Process::new(
        #name.to_string(),
        #container,
        vec![#(#input_pairs),*],
        vec![#(#arg_pairs),*],
        vec![#(#output_pairs),*],
        ::std::borrow::Cow::Borrowed(#process_lit),
    );
    #executor_tokens
}}
where quote! is a proc-macro for generation of source code tokens from raw syntax tree elements (for instance a Vec<LitStr>). This newly output source code is what the compiler sees, and what is executed at runtime.
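As a small, self-contained illustration of that interpolation (not taken from maestro’s source), the #(#x),* pattern splices every element of a collection into the generated tokens:

use proc_macro2::Span;
use quote::quote;
use syn::LitStr;

fn render_deps() -> proc_macro2::TokenStream {
    let deps: Vec<LitStr> = vec![
        LitStr::new("tleap", Span::call_site()),
        LitStr::new("gmx", Span::call_site()),
    ];
    // expands to the tokens: vec!["tleap", "gmx"]
    quote! { vec![#(#deps),*] }
}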
The magic of link sections
Readers experienced in the arcane magiks of writing procedural macros (although I am slightly skeptical that such a reader exists, I would be pleasantly surprised to learn otherwise) may have curiously noticed the maestro::submit_request! macro call embedded within the token expansion above. Every process!, arg!, and inputs! site includes such a call, where a RequestedExecutor/RequestedArg/RequestedInputs struct is submitted. Within libmaestro itself, a thread-safe global collection is initialized to store requested executors, args, and inputs, and each submit_request! site expands to define a function that submits the given struct into the global collection. These function calls are then annotated with a link-section attribute, similar to the following:
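(The snippet below is a rough reconstruction of the general init-array trick rather than maestro’s literal expansion; the section name shown is Linux-specific — macOS, for instance, uses __DATA,__mod_init_func.)

extern "C" fn __submit_executor_request() {
    // push the RequestedExecutor into libmaestro's global collection here
}

// Place a pointer to the function in the loader's init-array section so it is
// invoked before main() runs; #[used] keeps the static from being optimized away.
#[used]
#[link_section = ".init_array"]
static __SUBMIT_EXECUTOR_REQUEST: extern "C" fn() = __submit_executor_request;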
which ensures the submission function runs before main. As such, right at the start of main, all arguments and executors which are expected to be defined in Maestro.toml have already been submitted into the relevant global collections.
As seen above, the #[maestro::main] annotation injects a maestro::initialize() call at the start of main. This function iterates over the “expected arg/executor” collections and short-circuits when it detects an item missing in Maestro.toml. By leveraging this maestro::submit_request! API, libmaestro ensures that no matter where in the program a configuration-dependent value is requested, it is always checked at startup.
maestro build
maestro-cli does not simply offload building to cargo build. Internally, maestro caches a built copy of libmaestro, specific to each version of libmaestro and rustc:
λ ls -l .maestro_cache
total 0
drwxr-xr-x@  4 seb-hyland  staff   128 Sep 30 19:25 maestro-0.2.10_rustc-1.90.0
drwxr-xr-x@  4 seb-hyland  staff   128 Oct  4 18:11 maestro-0.2.10_rustc-1.92.0-nightly
drwxr-xr-x@ 41 seb-hyland  staff  1312 Oct  4 18:11 vendor
Source code for a specific version of libmaestro and its dependencies is obtained via cargo vendor; then, dependency paths are normalized and libmaestro is built into a static library (libmaestro.rlib). This allows libmaestro to be rebuilt only when the libmaestro or compiler version changes (which is infrequent), making builds significantly faster.
A side-by-side comparison of building the same project via `maestro build` and `cargo build --release`. Both are fresh builds (no Cargo build cache) and have all optimizations enabled (release profile). Building with `maestro` only takes 1.31s, while building with `cargo build` takes 18.26s.
Implications and Proof of Concept
As a proof of maestro’s potential in building practical, real-world workflows, we rewrote our structural prediction, molecular dynamics, and bioinformatics workflows to leverage libmaestro. For each workflow, libmaestro’s self-documentation mechanisms (the Maestro.toml and procinfo.toml files) provide information on workflow arguments and processes.
Structural prediction and molecular dynamics
Our structural prediction workflow was originally implemented in Nextflow, and our molecular dynamics workflow as a simple shell script. Both were converted to run under maestro to improve their reliability, as well as to make them available for other iGEM teams to use. The structural prediction workflow is available here, and the molecular dynamics workflow is available here.
Our molecular dynamics pipeline running on the ARC Sockeye HPC cluster. Three input files (1v9e_ph4, 1v9e_ph10, BtCAII_ph4) are being processed in parallel. The tleap and parmed steps execute directly, and the gromacs step is scheduled on Slurm. Video is sped up 8x.
Bioinformatics
We originally had a collection of shell scripts for interacting with EggNOG database files. We decided to combine these discrete one-off tools into a bundled workflow that can be re-used; it is available here.
Active Development
libmaestro is still under active development, with additional features being added consistently.
SSH execution
Ongoing development is primarily focused on supporting remote process execution over SSH. The Local and Slurm executors in Maestro.toml will support an ssh field, enabling users to specify a remote server on which to execute the process. All SSH connections will be initialized at program startup, thus supporting interactive (password/2FA) login systems.
Binding to libssh2
maestro binds to libssh2 by leveraging Rust’s ssh2 crate. This enables maestro to construct and hold persistent SSH connections, and split the connection into channels on which commands or file transfers can be executed. Furthermore, ssh2 provides primitives for keyboard-interactive authentication; maestro leverages this mechanism, building a user prompting interface for session connection.
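To give a feel for these primitives, here is a minimal, generic ssh2 sketch (not maestro’s implementation; the host, user, and executed command are placeholders, and agent-based authentication stands in for the keyboard-interactive flow):

use ssh2::Session;
use std::io::Read;
use std::net::TcpStream;

fn run_remote(host: &str, user: &str) -> Result<String, Box<dyn std::error::Error>> {
    // establish the TCP connection and perform the SSH handshake
    let tcp = TcpStream::connect((host, 22))?;
    let mut sess = Session::new()?;
    sess.set_tcp_stream(tcp);
    sess.handshake()?;
    // authenticate; ssh2 also exposes userauth_keyboard_interactive for 2FA prompts
    sess.userauth_agent(user)?;

    // split the persistent session into a channel and execute a command on it
    let mut channel = sess.channel_session()?;
    channel.exec("squeue --me")?;
    let mut output = String::new();
    channel.read_to_string(&mut output)?;
    channel.wait_close()?;
    Ok(output)
}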
Abstraction
Currently, maestro uses std::process::Command to spawn processes and std::fs::copy/std::os::unix::fs::symlink for staging files. The SSH feature requires a rework of these mechanisms altogether; process execution and file transfers are hidden behind an enum implementation, allowing swapping between native mechanisms and SSH/SFTP. Additionally, file paths require a rework, so that paths to remote files can be supported alongside local files. This is roughly implemented as follows:
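(An assumed shape for illustration — the names and details below may differ from the actual implementation.)

use std::path::PathBuf;

enum WorkflowPath {
    /// A file on the machine driving the workflow
    Local(PathBuf),
    /// A file on a remote execution host, addressed by host name and remote path
    Remote { host: String, path: PathBuf },
}

impl WorkflowPath {
    /// Stage the file locally, transferring it only when it lives on a remote host.
    fn to_local(self) -> std::io::Result<PathBuf> {
        match self {
            WorkflowPath::Local(path) => Ok(path),
            WorkflowPath::Remote { .. } => todo!("download the file over SFTP"),
        }
    }
}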
As such, processes executed on remote hosts return a Vec<WorkflowPath::Remote>, while local processes return a Vec<WorkflowPath::Local>. This allows chained processes which run on the same remote server to avoid transferring large files to the local client, though files can still be manually moved by leveraging WorkflowPath’s abstractions: