maestro

A task orchestration framework for reproducible, distributable bioinformatic workflows.

Introduction

maestro is a framework for writing and distributing computational workflows in a configurable, reproducible, and scalable manner. It comprises two core components: the libmaestro library and the maestro-cli command line interface. libmaestro enables developers to define atomic processes and compose them into complex workflows, while leveraging compile-time and startup-time analyses to validate the correctness of these workflows before they ever run. maestro-cli, on the other hand, provides modern tooling to developers and users alike, enabling fast builds, process control, and standardized distribution.

Key highlights

Disambiguation

In this document, the following definitions apply:

Preamble

Bioinformatics is defined as the “application of tools of computation and analysis to the capture and interpretation of biological data.” Though still somewhat nascent, the field has cemented itself as essential for the scalable management and analysis of data in modern biology. As such, informatic workflows are a critical facet of many a team’s iGEM project (including our own, with our phylogenetic analysis and modelling pipelines), and the development of improved technologies in this field is critical to the advancement of synthetic biology.

At their core, bioinformatic workflows are typically the composition of many smaller analyses, joined together to transform raw inputs into a meaningful result. This closely mirrors what is commonly referred to as the “Unix philosophy,” a set of maxims based on the experience of the leading developers of the Unix operating system. The first two are as follows:
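  1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features”.
  2. Expect the output of every program to become the input to another, as yet unknown, program.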

Bioinformaticians typically develop programs following this philosophy: as atomic, independent, composable components that can be joined together by making the outputs of one component the inputs of the next. Higher-level programs can then build on these independent components; for instance, the TreeSAPP tool used for our team’s phylogenetic analysis is built on HMMER, Prodigal, RAxML-NG, and many other dependencies.

Lately, various tools with the explicit goal of assisting bioinformaticians in composing processes into workflows have emerged, with Nextflow and Snakemake leading the charge. Though our team has leveraged these tools in the past — especially in our modelling workflows — we experienced various shortcomings, most notably in parallel composition, correctness verification, and redistribution. As such, maestro was born, designed along the following guiding principles:

  1. Verify correctness and fail early
  2. Scale with complexity
  3. Enable redistribution
  4. Improve the accessibility of bioinformatics

User experience

maestro’s user and developer experience closely follows this grounding philosophy. Every maestro project compiles into a platform-native executable, as well as two user-facing configuration/information files, Maestro.toml and procinfo.toml.

Furthermore, the following high-level features are provided by maestro:

Verify correctness and fail early

In our discussion, Ryan noted that

[maestro] presents improvements over the top of the line.

Specifically, he was impressed with maestro’s compile-time script validation (noting that shell syntax errors are a common frustration when defining workflows) as well as support for easy configuration of execution environments. Furthermore, Ryan expressed that he wishes to leverage maestro in some of his research/work moving forward.

Scale with complexity

Enable redistribution

Improve the accessibility of bioinformatics

Installation

libmaestro is distributed as a standard Rust library. To add it to your project, simply run

cargo add --git https://gitlab.igem.org/2025/software-tools/ubc-vancouver maestro
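From there, the APIs described in the Developer Reference below become available. As a rough sketch (assuming libmaestro’s macros are exported at the crate root), a minimal program might look like:

use maestro::*;

#[maestro::main]
fn main() {
    // `init_msg` must be defined under [args] in Maestro.toml;
    // this is verified at program startup
    let init_msg = arg!("init_msg");
    println!("{init_msg}");
}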

User Reference

Users can interface with a maestro workflow via two files, Maestro.toml and procinfo.toml. Maestro.toml is used to configure the behaviour of workflows, while procinfo.toml provides workflow dependency and execution documentation.

Maestro.toml

[args]

Type: top-level table containing key = string mappings

This table is designed to allow passing arguments into the program.

Example:

[args]
# Message printed at initialization
init_msg = "Hello, world!"

[inputs]

Type: top-level table containing key = array<string> mappings

This table is designed to allow passing sets of input paths into the program.

Example:

[inputs]
# Input PDB files to the molecular dynamics workflow
input_files = ["data/1v9e_ph4.pdb", "data/1v9e_ph10.pdb", "data/BtCAII_ph4.pdb"]

[executor]

Type: top-level table containing subtables, with each subtable representing an executor definition. The subtable name defines the executor name (e.g., [executor.default] defines an executor named “default”).

This table is designed to allow defining execution environments for processes. Each executor definition MAY either fully define an executor, OR inherit from another executor’s definition. An executor definition MAY be of type "Local" or "Slurm". The Local executor SHALL directly execute the process, whereas the Slurm executor SHALL schedule the process onto a compute node via Slurm. The following configuration options are available:

Local

[executor.name]
type = "Local"
container = { Podman = "ubuntu:rolling" }
staging_mode = "Copy"

Slurm

[executor.slurm_base]
type = "Slurm"
staging_mode = "Copy"
container = { Podman = "ubuntu:rolling" }
cpus = 4
memory = { type = "per_cpu", amount = 8192 }
gpus = 2
tasks = 1
nodes = 1
partition = "skylake"
time = { days = 1, hours = 2 }
account = "my-account-id"
mail_user = "myemail@gmail.com"
mail_type = ["NONE", "TIME_LIMIT_50"]
additional_options = [
    ["qos", "high"]
]

Inheritance

[executor.slurm_inherited]
inherit = "slurm_base"
tasks = 2
cpus = 10

procinfo.toml

This file describes all processes in the workflow. It contains one table per process, holding information about that process.

# Runs tleap to prepare input files for MD simulation
["src/main.rs:39:17"]
executor = "direct"
deps = ["tleap"]

# Executes parmed to convert from AMBER into a GROMACS-compatible format
["src/main.rs:68:17"]
executor = "direct"
deps = ["python", "py:parmed"]

# Executes the primary GROMACS molecular dynamics process
["src/main.rs:100:17"]
executor = "slurm"
deps = ["gmx"]

Developer Reference

libmaestro’s main mechanism for defining an atomic workflow component is the process! macro. These blocks accept a specialized syntax whose arguments parametrize the process.

Formal Specification

Augmented Backus-Naur Form is a metalanguage — a language for formally specifying the syntax of other languages — outlined in IETF Request for Comments (RFC) 5234. libmaestro’s process! syntax is defined in ABNF as follows (note: the %s syntax from RFC 7405 is used):

; === WHITESPACE ===
WSPCHAR     =  %x20 / %x09 / %x0A / %x0D / %x0C
WSP         =  *WSPCHAR

; === MACRO BODY ===
PROCESS     =  %s"process!" WSP "{" WSP *DOCSTR PARAM *(WSP "," WSP PARAM) WSP [","] "}"
PARAM       =  NAME / EXECUTOR / ARG / DEPS / INLINE / SCRIPT

; === PARAMS ===
DOCSTR      =  "///" WSP *(%x00-09 / %x0B-0C / %x0E-FF) (CRLF / LF / CR)
NAME        =  %s"name" WSP "=" WSP EXPR
EXECUTOR    =  %s"executor" WSP "=" WSP LITSTR
ARG         =  (%s"inputs" / %s"args" / %s"outputs") WSP "=" WSP ARG_GROUP
ARG_GROUP   =  "[" WSP IDENT *(WSP "," WSP IDENT) WSP [","] WSP "]"
DEPS        =  %s"dependencies" WSP "=" WSP DEPS_GROUP
DEPS_GROUP  =  "[" WSP LITSTR *(WSP "," WSP LITSTR) WSP [","] WSP "]"
INLINE      =  %s"inline" WSP "=" WSP LITBOOL
SCRIPT      =  %s"script" WSP "=" WSP LITSTR

where the special rule EXPR represents a Rust expression, IDENT represents a Rust identifier, LITSTR represents a Rust string literal, and LITBOOL represents a Rust boolean literal. EXECUTOR and SCRIPT are required, and PARAM types may not be repeated, except for ARG (where at most one definition of each arg type, inputs/args/outputs, is permitted).
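For instance, the smallest invocation this grammar admits supplies only the two required parameters (the script here is purely illustrative):

process! {
    executor = "default",
    script = "echo 'Hello, world!'",
}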

Informal Specification

Less formally, libmaestro exposes its primary process definition API via process! blocks. These blocks contain arguments that specify the process to execute, its inputs and outputs, and its execution environment.

process! blocks MAY begin with a documentation string:

process! {
    /// This is a docstring that describes this process
    /// Maybe I talk more about what it does
    /// ...so the user knows how they should configure its resources

which is included in procinfo.toml to provide information about the process to the user:

# This is a docstring that describes this process
# Maybe I talk more about what it does
# ...so the user knows how they should configure its resources
["lib/examples/workflow.rs:31:17"]

Following this doc string, various fields are allowed. Each field MUST follow the format LHS = RHS, where LHS is the field name and RHS is a Rust expression (the constraints on which expressions are allowed depend on the field).

name

Purpose

This field sets the process name, which also becomes the name of the folder in which the process is run. If the process will run multiple times in the same workflow with different inputs, it is desirable to make the process name depend on its input in some way (so each run creates a unique folder and does not overwrite past runs).

Example
fn my_workflow(run_id: i32) -> NodeResult {
    ...
    process! {
        ...
        name = format!("my_workflow_{run_id}"),
RHS

Any arbitrary Rust expression which produces a value that implements ToString.

Optional

Yes. If a name is not set, a randomly generated ID will be used.

executor

Purpose

This field determines which user-defined executor will be used to execute the process. This information is documented in procinfo.toml; for instance, a process that depends on an executor named default would show up as follows:

["lib/examples/workflow.rs:31:17"]
executor = "default"

The corresponding executor MUST be configured in Maestro.toml:

[executor.default]
type = "Local"

If it is missing, the workflow will identify the error at program startup and inform the user.

Example
process! {
        ...
    executor = "default",
RHS

A Rust string literal.

Optional

No.

inputs/args/outputs

Purpose

These fields allow injection of Rust variables into the script. Inputs and outputs must be file paths; the existence of inputs is checked before the script runs, and the existence of outputs after.

Example
process! {
    ...
    inputs = [
        test_fasta,
        test_dir
    ],
    args = [
        num_cpus
    ],
    outputs = [
        output_path
    ],
    ...
    script = r#"
        ls -R "$test_dir" > "$output_path"
    "#,
RHS

An array, where each element is a Rust identifier that points to a value. For inputs/outputs, these values MUST implement AsRef<OsStr> so that they can be constructed into Path objects. For args, these values MUST implement ToString. The values are then injected into the script as shell variables, allowing the use of $<identifier> to refer to them.

Optional

Yes.

dependencies

Purpose

To document the dependencies of the process. These dependencies are injected into procinfo.toml:

["src/main.rs:67:17"]
executor = "direct"
deps = ["gromacs"]

Example
process! {
    ...
    dependencies = ["!cat", "gromacs"],
    script = r#"
        cat "$test_fasta" > "$output_path"
        ...
    "#,
RHS

An array, where each element MUST be a Rust string literal. maestro will attempt to auto-determine dependencies by parsing the script definition as Bash syntax; entries specified in this array are appended to these auto-determined dependencies (for information on this parsing algorithm, see the rundown here). Array entries of the form "!<name>" instead direct maestro to ignore the auto-determined dependency matching name (in the above example, cat will be ignored). If any array entry matches "!", ALL auto-determined dependencies will be ignored; for example, dependencies = ["!", "gromacs"] would discard every auto-determined dependency and document only gromacs.

Optional

Yes.

inline

Purpose

To specify whether the script is defined inline or in a separate file.

Example
process! {
    ...
    inline = false,
RHS

A Rust boolean literal.

Optional

Yes. If unspecified, defaults to true.

script

Purpose

To define the script that is executed when this process runs.

Example
process! {
    ...
    script = r#"
        cat "$test_fasta" > "$output_path"
    "#,
RHS

A Rust string literal. If inline is true (the default behaviour), this string literal MUST contain the body of the script. A shebang MAY be included; if no shebang is detected, #!/bin/bash will be injected. If inline is false, this string literal MUST be a path to a script file, whose contents are read and inlined into the workflow at compile-time. Leading/trailing whitespace around the script is stripped, as is leading whitespace on each line.

Optional

No.
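For instance, a process whose script lives in a separate file (hypothetical path) might be defined as follows; the file’s contents are inlined at compile-time:

process! {
    executor = "default",
    inline = false,
    script = "scripts/prepare_inputs.sh",
}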

Output format

The output of the process! block is a list of paths (Vec<PathBuf>), containing the canonicalized path associated with each item defined in the process’s outputs block (in order), followed by the process’s working directory. For instance, when executing the following:

let output1 = Path::new("out.txt");
let output2 = Path::new("err.txt");

let process_output = process! {
        name = "my_process",
        ...
        outputs = [output1, output2],
        ...
}?;

the variable process_output would be a vector of length 3, containing the canonicalized path associated with output1, the canonicalized path associated with output2, and the path to the process’s execution directory. For instance, it could be set to:

[
    "/home/my-user/Documents/maestro_work/cheery_panther/my_process/out.txt",
    "/home/my-user/Documents/maestro_work/cheery_panther/my_process/err.txt",
    "/home/my-user/Documents/maestro_work/cheery_panther/my_process/"
]

maestro also provides an efficient API to destructure these vectors via lossy conversion into arrays (the .into_array() method). For example, replacing the above with

let [process_out1, process_out2] = process! {
        name = "my_process",
        ...
        outputs = [output1, output2],
        ...
}?.into_array();

binds output1 from the process definition to process_out1, and output2 to process_out2. The third element of the vector (the process’s execution directory) will remain unbound and be discarded.

Additional APIs

arg! and inputs!

Used to parse arguments and input paths from Maestro.toml. The body MUST be a Rust string literal; this SHOULD match an entry in the [args] or [inputs] table of Maestro.toml. The existence of matching entries in Maestro.toml is checked at program startup, ensuring that processes do not fail once they are executing. arg! yields the argument value as a &str; inputs! yields its paths as a &[&Path].

Example
let init_msg = arg!("init_msg");
let input_files = inputs!("input_files");

#[maestro::main]

This is an attribute macro which can be attached to functions; it MUST be attached to the program’s main function. This provides an ergonomic pattern to verify the Maestro.toml configuration and initialize the workflow’s session directory, as well as inject session teardown functionality. The following:

#[maestro::main]
fn main() {
        // code in main...
}

is effectively identical to

fn main() {
        maestro::initialize();
        let main_result = {
                // code in main...
        };
        maestro::deinitialize();
        main_result
}

dagger integration

libmaestro is explicitly designed to integrate with dagger. Each process! invocation is entirely stateless and self-contained, enabling processes to be parallelized and scaled at will by leveraging dagger’s primitives. Furthermore, maestro is designed to represent process I/O as function I/O, enabling seamless integration with dagger’s “parallelization based on data flow” model. As shown above, a process’s outputs can be destructured; the logical conclusion of this approach is that the outputs of one process can become the inputs to the next. For instance, in our molecular dynamics maestro workflow:

let [prmtop, inpcrd] = tleap(path, molecule_name)?.into_array();
let [gro, topol] = parmed(prmtop, inpcrd, molecule_name)?.into_array();
let [gromacs_workdir] = gromacs(gro, topol, molecule_name)?.into_array();

Here, tleap, parmed, and gromacs are all functions which wrap a process! invocation. The outputs of each process are destructured, then passed as inputs to the next, demonstrating the flow of information between processes in the workflow (this closely mirrors pipes on Unix systems). If, for instance, we wished to execute multiple processes on the outputs of tleap, we could leverage dagger!:

dagger! {
        tleap_out :: tleap(path, molecule_name);
        analysis_1 :: analysis_process_1(tleap_out);
        analysis_2 :: analysis_process_2(tleap_out);
        analysis_3 :: analysis_process_3(analysis_2);
}

resulting in the following parallel process (this visualization is rendered directly by dagger):

For our molecular dynamics simulations, we were more interested in parallelizing the process on multiple input paths (.pdb files for analysis). Specifically, we wished to enable the user to input multiple paths:

[inputs]
input_files = [
        "data/1v9e_ph4.pdb",
        "data/1v9e_ph10.pdb",
        "data/BtCAII_ph4.pdb",
        "data/HpCA_ph7.pdb"
]

and execute the full molecular dynamics pipeline on each file in parallel. This was implemented by leveraging dagger’s parallelize primitive:

let input_files = inputs!("input_files");
let workflow = |path: &&Path| -> NodeResult<PathBuf> {
    // molecular dynamics workflow defined here
};

// executes the `workflow` function on each input file in parallel
let process_results = parallelize(input_files, workflow)
    .into_iter()
    .map(|result| result.expect("No processes should panic!"));

Our molecular dynamics pipeline running on the ARC Sockeye HPC cluster. Three input files (1v9e_ph4, 1v9e_ph10, BtCAII_ph4) are being processed in parallel. The tleap and parmed steps execute directly, and the gromacs step is scheduled on Slurm. Video is sped up 8x.

Python API

A set of Python bindings to libmaestro is available, enabling users to build workflows directly in Python. This binding set is comprehensive, largely providing the same APIs as are available in Rust. However, most of the correctness checks and compile-time processing performed by libmaestro (e.g., checking process definitions for errors, verifying that all args/inputs/executor definitions are configured in Maestro.toml, generating procinfo.toml) are not available when using libmaestro from Python. This is simply due to constraints of the language (Python has no compile-time execution mechanism that would enable libmaestro to perform these steps).

Python bindings to libmaestro are generated using pyo3, and the stub file is generated using pyo3-stub-gen.

Stub file

The full Python API of libmaestro is available in the stub file here. This file is bundled as part of the Python package wheel, and informs your favourite editor’s code completion and diagnostics. The stub contains a full set of type annotations, enabling type validation through static analysis tools such as mypy, as well as language servers such as basedpyright and ty.

Core API

Processes can be defined via the constructor of the Process type. This function is defined as follows:

class Process:
    def __init__(
        self,
        name: builtins.str,
        script: builtins.str,
        inputs: typing.Mapping[builtins.str, builtins.str | os.PathLike | pathlib.Path],
        outputs: typing.Mapping[
            builtins.str, builtins.str | os.PathLike | pathlib.Path
        ],
        args: typing.Mapping[builtins.str, builtins.str],
    ) -> Process: ...

Executors, arguments, and input files from Maestro.toml can be queried via the arg, inputs, and executor functions:

def arg(name: builtins.str) -> builtins.str: ...
def executor(name: builtins.str) -> GenericExecutor: ...
def inputs(name: builtins.str) -> builtins.list[pathlib.Path]: ...

As such, the following is an example of a simple workflow definition using the Python API:

from maestro import *

proc_inputs = {
    "test_fasta": "data/seq1.fasta",
    "test_dir": "data/",
}
proc_outputs = {
    "output_path": "out.txt"
}
proc = Process(
    name = "my_proc",
    inputs = proc_inputs,
    outputs = proc_outputs,
    args = {},
    script =
    """
    #!/bin/bash
    cat "$test_fasta"
    tree "$test_dir" > "$output_path"
    """
)

proc_executor = executor("default")
output_files = proc_executor.exe(proc)
print(output_files)

maestro-cli

Installation

If you have the Rust toolchain installed, maestro-cli can be installed from source by using cargo install:

cargo install --git https://gitlab.igem.org/2025/software-tools/ubc-vancouver maestro-cli

This will install the maestro binary and place it on your PATH.

Commands

To see all available commands, run maestro help:

λ maestro help
Subcommands in the maestro CLI

Usage: maestro <COMMAND>

Commands:
  init          Initialize a new maestro project
  bundle        Compile a project and package it for redistribution
  build         Build a project
  run           Run a binary or project
  kill          Kill a running maestro process
  update-cache  Update the libmaestro cache
  help          Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

init

λ maestro init --help
Initialize a new maestro project

Usage: maestro init [PATH]

Arguments:
  [PATH]  [default: .]

Options:
  -h, --help  Print help

Initializes a new maestro project, with all dependencies set up and libmaestro pre-built. A simple demo workflow is pre-defined in src/main.rs.

bundle

λ maestro bundle --help
Compile a project and package it for redistribution

Usage: maestro bundle [OPTIONS] [CARGO_ARGS]...

Arguments:
  [CARGO_ARGS]...  Arguments to pass to cargo build

Options:
  -c, --compress <COMPRESS>  Compresses the bundle into an archive [possible values: zip, gzip, xz, bzip2, zstd, lzma]
  -a, --arch <ARCH>          Bundle for a target architecture; defaults to the host arch [possible values: linux, apple, all]
  -r, --runtime <RUNTIME>    Container runtime for multi-arch builds; only read if --arch is set [possible values: docker, podman, apptainer]
  -h, --help                 Print help

Bundles the current maestro project for redistribution.

build

λ maestro build --help
Build a project

Usage: maestro build [CARGO_ARGS]...

Arguments:
  [CARGO_ARGS]...  Arguments to pass to cargo build

Options:
  -h, --help  Print help

Builds the current maestro project.

run

λ maestro run --help
Run a binary or project

Usage: maestro run [OPTIONS] [BINARY] [CARGO_ARGS]... [ARGS]...

Arguments:
  [BINARY]         A binary to run; when unspecified, the current project will be run
  [CARGO_ARGS]...  Arguments to pass to cargo run
  [ARGS]...        Arguments to pass to the program

Options:
  -b, --background  Run detached from the current shell session
  -h, --help        Print help

Runs the current maestro project, or the target binary.

kill

λ maestro kill --help
Kill a running maestro process

Usage: maestro kill <TARGET>

Arguments:
  <TARGET>  The process to kill, by name or path

Options:
  -h, --help  Print help

Kills a running maestro process.

update-cache

λ maestro update-cache --help
Update the libmaestro cache

Usage: maestro update-cache

Options:
  -h, --help  Print help

Rebuilds the cached version of libmaestro. This is done automatically when running maestro run/build/bundle.

help

Displays the full help documentation for maestro-cli.

λ maestro help
Subcommands in the maestro CLI

Usage: maestro <COMMAND>

Commands:
  init          Initialize a new maestro project
  bundle        Compile a project and package it for redistribution
  build         Build a project
  run           Run a binary or project
  kill          Kill a running maestro process
  update-cache  Update the libmaestro cache
  help          Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

DBTLs and Development

finalflow

Design

The first proof of concept design sketch of a workflow executor. Designed for basic local execution and piping, with inbuilt parallelism primitives.

Build

finalflow was implemented as a single-crate Rust library.

Test

finalflow is able to execute simple (e.g., 2-step) processes consistently, but is riddled with small bugs and lacks proper error handling and stdout/stderr piping.

Learn

For future versions, parallelism should be outsourced to a dedicated framework (i.e., dagger) and support for additional execution environments should be expanded.

Implemented at this Git commit.

Atomicity and Extensibility

Design

finalflow was highly reliant on mutable global state, making it difficult to parallelize due to the locks required for safe, concurrent access. Additionally, it was not built to be extended to other execution platforms. This version is designed to make each process atomic, stateless, and configurable to a specific executor.

Build

finalflow was rewritten to improve its atomicity, making it solely reliant on a one-time session directory setup. Furthermore, execution was offloaded to an Executor trait, allowing extensibility for future execution platforms. This DBTL is related to this Git commit.

Test

This version of maestro is able to more reliably execute local processes, and all parallelization primitives have been stripped in favour of designing with dagger integration in mind. Execution is done by passing a process definition to a struct that implements Executor.

Learn

For future versions, execution should be extended to support additional platforms other than direct execution.

Implemented at this Git commit.

SLURM support

Design

maestro was initially only able to execute scripts directly. This DBTL cycle aimed to add support for SLURM execution and configuration.

Build

Slurm support is provided by a new SlurmExecutor struct that implements the Executor trait. This struct contains a slurm_config field whose fields wrap sbatch directives. Slurm support was added in this Git commit.

Test

This version of maestro was tested to successfully schedule and execute scripts via Slurm on the University of British Columbia’s high performance compute cluster, ARC Sockeye.

Learn

For future versions, execution should be dynamic based on user configuration rather than hardcoded. Additionally, a framework needs to be developed for passing arguments into workflows.

Implemented at this Git commit.

Reflection and runtime configuration

Design

Instead of hardcoding executor configurations, they will be configured at runtime via a Maestro.toml file that contains tables to define executors and input arguments.

Build

All executor fields are made deserializable via serde-derive, and TOML parsing is done via toml. The process! API is updated to leverage custom, user-defined executors, and arg! is provided to access user-provided arguments.

Test

Maestro.toml support was added in this Git commit. Various configurations were tested to ensure deserialization behaves as expected, including nested fields and tagged enumerations.

Learn

Runtime configuration provides far more transparent user configuration than hardcoding. The most important lesson from this DBTL cycle is that all configuration must be checked at program startup, ensuring that execution does not fail unexpectedly midway through a run.

Implemented at this Git commit.

Dr. Wong, who is familiar with bioinformatic workflows primarily from a biological perspective, observed that interacting with job schedulers poses a major challenge for novice bioinformaticians attempting to run complex computational workflows. He noted that maestro’s design — allowing workflows to be written independent of a specific executor and enabling execution environments to be configured through a config file — could significantly improve the accessibility of bioinformatics.


Dr. Donald Wong

Professor of bioinformatics

Library Internals

process!

process! is implemented as a function-like procedural macro. Broadly, these act like functions, but rather than taking in and outputting data, they take in and output raw code tokens. Effectively, procedural macros act as programmatic compile-time preprocessors which arbitrarily transform their inputs into new code which is executed. The input tokens to a process! block are first parsed into the following data structure by leveraging a custom parser built on top of the syn library:

struct ProcessDefinition {
    name: Option<Expr>,
    executor: LitStr,
    inputs: Punctuated<Ident, Comma>,
    args: Punctuated<Ident, Comma>,
    outputs: Punctuated<Ident, Comma>,
    dependencies: Punctuated<LitStr, Comma>,
    inline: bool,
    literal: LitStr,
}

The tokens then undergo various transformations. For instance, if the script is not inlined (i.e., the content is in a separate file), the external file is read and the path is replaced with its contents; if a name is missing, a random ID is generated. Then, two critical steps are executed:

Script validation

Scripts are passed to ShellCheck for static analysis. However, the script must first undergo various transforms to improve the quality of ShellCheck’s analysis: a shebang is inserted if missing, leading and trailing whitespace is cleaned, and a variable definition is added for each Rust injection (the inputs, outputs, and args fields). Once ShellCheck completes, its output is parsed to identify whether any errors were detected. If so, all error paths are transformed to point to the proper file/line/column (GCC format is used for integration with editor “path links”). For instance:

This script contains an error due to a missing line break; by clicking on the generated error message, the user is able to easily navigate to exactly where the error occurred.
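A minimal sketch of the preprocessing step described above (a hypothetical helper, not libmaestro’s actual code):

fn prepare_for_shellcheck(script: &str, injected_idents: &[&str]) -> String {
    let body = script.trim();
    // Keep an existing shebang on the first line, or inject a default one
    let (shebang, rest) = match body.split_once('\n') {
        Some((first, rest)) if first.starts_with("#!") => (first, rest),
        _ if body.starts_with("#!") => (body, ""),
        _ => ("#!/bin/bash", body),
    };
    let mut out = format!("{shebang}\n");
    // Define each Rust-injected variable (inputs/args/outputs) so
    // ShellCheck does not flag e.g. $test_fasta as unassigned
    for ident in injected_idents {
        out.push_str(&format!("{ident}=placeholder\n"));
    }
    out.push_str(rest);
    out
}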

Dependency analysis

The script is then analyzed to identify its dependencies. This is implemented via a custom parsing algorithm, which searches for binary identifiers at the start of lines, after | / && / ||, and within constructs such as {...} and (...). Keywords such as if / fi, for, case, etc. are explicitly ignored, as are shell builtins such as alias, cd, break, echo, etc. The full list of shell builtins was sourced from the Bash man pages (you can execute

man bash | col -b | less +$(man bash | col -b | grep -n "SHELL BUILTIN COMMANDS" | tail -1 | cut -d: -f1)

to jump to the relevant section). This information is then appended to the growing procinfo.toml file, including the analyzed dependencies, the specified executor, etc.
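As a rough illustration of the detection step itself (a naive sketch, far simpler than maestro’s actual Bash-aware parser):

fn naive_dependencies(script: &str) -> Vec<String> {
    // Keywords and builtins that should never be reported (the real
    // list is sourced from the Bash man pages, as described above)
    const IGNORED: &[&str] = &[
        "if", "then", "else", "fi", "for", "do", "done", "case", "esac",
        "while", "alias", "cd", "break", "echo",
    ];
    script
        .lines()
        // Treat |, &, ;, and grouping constructs as command boundaries
        .flat_map(|line| line.split(|c: char| "|&;(){}".contains(c)))
        // The first word of each segment is a candidate binary
        .filter_map(|segment| segment.split_whitespace().next())
        .filter(|word| {
            !IGNORED.contains(word)
                && !word.starts_with('#') // comments
                && !word.starts_with('$') // variable expansions
        })
        .map(str::to_string)
        .collect()
}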

Finally, various other transformations on the input tokens take place, and they are re-emitted to construct a maestro::Process object and execute it on the specified executor:

let executor_tokens = quote! {
    maestro::submit_request! {
        maestro::RequestedExecutor(#executor, file!(), line!(), column!())
    };
    maestro::config::MAESTRO_CONFIG.executors[#executor].exe(process)
};
quote! {{
    let process = maestro::Process::new(
        #name.to_string(),
        #container,
        vec![#(#input_pairs),*],
        vec![#(#arg_pairs),*],
        vec![#(#output_pairs),*],
        ::std::borrow::Cow::Borrowed(#process_lit),
    );
    #executor_tokens
}}

where quote! is a proc-macro for generating source code tokens from raw syntax tree elements (for instance, a Vec<LitStr>). This newly emitted source code is what the compiler sees, and what is executed at runtime.

Readers experienced in the arcane magiks of writing procedural macros (although I am slightly skeptical that such a reader exists, I would be pleasantly surprised to learn otherwise) may have noticed the maestro::submit_request! macro call embedded within the token expansion above. Every process!, arg!, and inputs! site includes such a call, submitting a RequestedExecutor/RequestedArg/RequestedInputs struct. Within libmaestro itself, a thread-safe global collection is initialized to store requested executors, args, and inputs, and each submit_request! site expands to define a function that submits the given struct into the global collection. Each such function is then registered via a link-section attribute on a function-pointer static, similar to the following:

#[used]
#[cfg_attr(target_os = "linux", link_section = ".init_array")]
#[cfg_attr(target_vendor = "apple", link_section = "__DATA,__mod_init_func,mod_init_funcs")]
#[cfg_attr(target_os = "windows", link_section = ".CRT$XCU")]
/* ... other platforms elided ... */
static __REQUEST: unsafe extern "C" fn() = __request;

which ensures the submission function runs before main. As such, right at the start of main, all arguments and executors which are expected to be defined in Maestro.toml have already been submitted into the relevant global collections.

As seen above, the #[maestro::main] annotation injects a maestro::initialize() call at the start of main. This function iterates over the “expected arg/executor” collections and short-circuits when it detects an item missing in Maestro.toml. By leveraging this maestro::submit_request! API, libmaestro ensures that no matter where in the program a configuration-dependent value is requested, it is always checked at startup.
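Conceptually, the startup check amounts to the following (a simplified sketch with hypothetical types, not libmaestro’s actual code):

struct RequestedArg {
    name: &'static str,
    file: &'static str,
    line: u32,
}

fn check_requested_args(
    requested: &[RequestedArg],
    configured: &std::collections::HashMap<String, String>,
) -> Result<(), String> {
    for request in requested {
        // Short-circuit on the first request with no matching
        // [args] entry in Maestro.toml
        if !configured.contains_key(request.name) {
            return Err(format!(
                "{}:{}: argument `{}` is not defined in Maestro.toml",
                request.file, request.line, request.name
            ));
        }
    }
    Ok(())
}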

maestro build

maestro-cli does not simply offload building to cargo build. Internally, maestro caches a built copy of libmaestro, specific to each version of libmaestro and rustc:

λ ls -l .maestro_cache
total 0
drwxr-xr-x@  4 seb-hyland  staff   128 Sep 30 19:25 maestro-0.2.10_rustc-1.90.0
drwxr-xr-x@  4 seb-hyland  staff   128 Oct  4 18:11 maestro-0.2.10_rustc-1.92.0-nightly
drwxr-xr-x@ 41 seb-hyland  staff  1312 Oct  4 18:11 vendor

Source code for a specific version of libmaestro and its dependencies is obtained via cargo vendor; then, dependency paths are normalized and libmaestro is built into a static library (libmaestro.rlib). This allows libmaestro to be rebuilt only when the libmaestro or compiler version changes (which is extremely infrequent), making builds significantly faster.
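Conceptually, the cache key is just this pair of versions (a sketch with a hypothetical helper; the directory names match the listing above):

fn cache_dir_name(maestro_version: &str, rustc_version: &str) -> String {
    // e.g. "maestro-0.2.10_rustc-1.90.0"
    format!("maestro-{maestro_version}_rustc-{rustc_version}")
}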

A side-by-side comparison of building the same project via `maestro build` and `cargo build --release`. Both are fresh builds (no Cargo build cache) and have all optimizations enabled (release profile). Building with `maestro` only takes 1.31s, while building with `cargo build` takes 18.26s.

Implications and Proof of Concept

As proof of maestro’s potential for building practical, real-world workflows, we rewrote our structural prediction, molecular dynamics, and bioinformatics workflows to leverage libmaestro. For each workflow, libmaestro’s self-documentation mechanisms (the Maestro.toml and procinfo.toml files) provide information on workflow arguments and processes.

Structural prediction and molecular dynamics

Our structural prediction workflow was originally implemented in Nextflow, and our molecular dynamics workflow as a simple shell script. Both were converted to run under maestro to improve their reliability, as well as make them available for other iGEM teams to use. The structural prediction workflow is available here, and the molecular dynamics workflow is available here.

Our molecular dynamics pipeline running on the ARC Sockeye HPC cluster. Three input files (1v9e_ph4, 1v9e_ph10, BtCAII_ph4) are being processed in parallel. The tleap and parmed steps execute directly, and the gromacs step is scheduled on Slurm. Video is sped up 8x.

Bioinformatics

We originally had a collection of shell scripts for interacting with EggNOG database files. We decided to combine these discrete one-off tools into a bundled workflow that can be re-used; it is available here.

Active Development

libmaestro is still under active development, with additional features being added consistently.

SSH execution

The primary focus of ongoing development is support for remote process execution over SSH. The Local and Slurm executors in Maestro.toml will support an ssh field, enabling users to specify a remote server on which to execute the process. All SSH connections will be initialized at program startup, thus supporting interactive (password/2FA) login systems.

Binding to libssh2

maestro binds to libssh2 by leveraging Rust’s ssh2 crate. This enables maestro to construct and hold persistent SSH connections, and to split each connection into channels on which commands or file transfers can be executed. Furthermore, ssh2 provides primitives for keyboard-interactive authentication; maestro leverages this mechanism to build a user prompting interface for session connection.

Abstraction

Currently, maestro uses std::process::Command to spawn processes and std::fs::copy/std::os::unix::fs::symlink for staging files. The SSH feature requires a rework of these mechanisms altogether; process execution and file transfers are hidden behind an enum implementation, allowing swapping between native mechanisms and SSH/SFTP. Additionally, file paths require a rework, so that paths to remote files can be supported alongside local files. This is roughly implemented as follows:

enum WorkflowPath {
    Local {
        path: PathBuf,
    },
    Remote {
        path: PathBuf,
        connection: &'static Session,
    },
}
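Dispatch between local and remote transports can then live behind ordinary methods on this enum; for instance (a sketch with a hypothetical method, assuming ssh2’s Session::sftp and Sftp::stat APIs):

impl WorkflowPath {
    /// Check whether the path exists, via the appropriate transport
    fn exists(&self) -> bool {
        match self {
            WorkflowPath::Local { path } => path.exists(),
            WorkflowPath::Remote { path, connection } => connection
                .sftp()
                .and_then(|sftp| sftp.stat(path))
                .is_ok(),
        }
    }
}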

As such, processes executed on remote hosts return a Vec<WorkflowPath> of Remote variants, while local processes return Local variants. This allows chained processes which run on the same remote server to avoid transferring large files to the local client, though files can still be manually moved by leveraging WorkflowPath’s abstractions:

impl WorkflowPath {
    pub fn copy(&mut self, target: Self) { /* ... */ }
}

Additional QoL Improvements

Additional quality of life improvements in libmaestro and maestro-cli are also being actively developed. Most notably, these include:

  1. Resuming workflows from their last successful process
  2. Monitoring active workflows and querying process information via maestro-cli
  3. Moving libmaestro caching (for maestro run/build/bundle) from a per-project basis to a per-user basis