A task orchestration framework for reproducible, distributable bioinformatic workflows.
Introduction
maestro is a framework for writing and distributing computational workflows in a configurable, reproducible, and scalable manner. It comprises two core components: the libmaestro library and the maestro-cli command-line interface. libmaestro enables developers to define atomic processes and compose them into complex workflows, while leveraging compile-time and startup-time analyses to validate the correctness of these workflows before they ever run. maestro-cli, on the other hand, provides modern tooling to developers and users alike, enabling fast builds, process control, and standardized distribution.
Key highlights
maestro powers meduCA’s key structural prediction and molecular dynamics pipelines
The maestro framework empowers bioinformaticians in iGEM and beyond to define more reliable, reproducible, and redistributable workflows
Once defined, workflows are easy to use for generalist and expert users alike
Configuring pre-defined workflows is simple; input files and execution parameters are specified in a single configuration file
maestro integrates extensively with commonly-used, existing technologies
Docker, Podman, and Apptainer are supported for containerized workflows
Slurm execution is supported, enabling workflows to execute on most academic high-performance compute clusters
In this document, the following definitions apply:
maestro: the maestro framework, comprising libmaestro, maestro-cli, and the maestro user experience philosophy
libmaestro: the library component of maestro, i.e., the Rust crate that enables developers to write workflows
maestro-cli: the command-line tool which comprises maestro’s run, build, and distribution system
compile-time: the time at which the developer compiles (“builds”) code into an executable program
runtime: the time at which the program is run by the user
program startup: the moment at which the user begins running a workflow (i.e., the start of runtime)
Preamble
Bioinformatics is defined as the “application of tools of computation and analysis to the capture and interpretation of biological data.” This field, though still somewhat nascent, has cemented itself as essential for the scalable management and analysis of data in modern biology. As such, informatic workflows are a critical facet of many a team’s iGEM project (including ours, with our phylogenetic analysis and modelling pipelines), and the development of improved technologies in this field is critical to the advancement of synthetic biology.
At their core, bioinformatic workflows are typically the summation of many smaller analyses, composed together to transform data from raw inputs into a meaningful analysis. This closely mirrors what is commonly referred to as the “Unix philosophy,” a set of maxims based on the experience of the leading developers of the Unix operating system. The first two are as follows:
Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features.”
Expect the output of every program to become the input to another, as yet unknown, program…
Bioinformaticians typically develop programs following this philosophy: as atomic, independent, composable components that can be joined together by making the outputs of one component the input to the next. Then, higher-level programs can build off these independent components; for instance, the TreeSAPP tool used for our team’s phylogenetic analysis is built off HMMER, Prodigal, RAxML-NG, and many other dependencies.
Lately, various tools with the explicit goal of assisting bioinformaticians in composing processes into workflows have emerged, with Nextflow and Snakemake leading the charge. Though our team has leveraged these tools in the past — especially in our modelling workflows — we experienced various shortcomings, most notably in parallel composition, correctness verification, and redistribution. As such, maestro was born, designed along the following guiding principles:
Verify correctness and fail early
Run every possible analysis on the defined workflow to ensure that it will execute properly at runtime
If issues are found, either in workflow definition or user configuration, fail the workflow early (at compile-time or program startup), rather than risking unpredictable behavior or errors later
Scale with complexity
Make it easy to define simple AND complex workflows
A workflow composed of 50 components should not require additional specialized knowledge to write compared to a workflow composed of 5 components
Workflows should be easily optimized for parallelism where possible
Enable redistribution
End-users should be able to re-use previously defined workflows without being required to fully understand how the workflow was defined
End-users should be able to easily configure workflows to execute on whatever environment they wish, with whatever input data they wish
Workflows should be easily distributable, and run in a reproducible manner across diverse environments
Improve the accessibility of bioinformatics
maestro should be usable by beginners and advanced users alike
It should be welcoming, integrate well with existing tooling (i.e., editor completion via the language server protocol, diagnostics, build/test infrastructure, etc.), and be easy to understand and set up
Distributed maestro pipelines should be easily usable by non-technical individuals, making it easier for researchers to integrate informatic analyses into their research
User experience
maestro’s user and developer experience closely follows these guiding principles. Every maestro project compiles into a platform-native executable, along with two user-facing configuration/information files, Maestro.toml and procinfo.toml.
Maestro.toml allows users to define process arguments and execution environments
This file is distributed alongside the executable, and allows the user to configure the workflow to their needs at runtime
procinfo.toml provides documentation for each process in a workflow
This file is auto-generated at compile time from the source code itself, providing a docstring, information about which executor is associated with each process, and the dependencies of each process
Furthermore, the following high-level features are provided by maestro:
Verify correctness and fail early
Developers are able to define processes in familiar shell syntax
These shell scripts are statically analyzed by ShellCheck at compile-time, and static analysis errors are converted directly into compiler errors
Attempted compilation of a shell script with two errors. The $test_dir variable is misspelt, and a missing line escape (\) is causing a redirection without a command. As such, compilation fails with the error forwarded to the developer.
To our knowledge, this feature is novel among workflow execution tools, and it has the potential to catch entire classes of shell bugs at compile-time
This feature is also dear to our team: we had a modelling analysis (running under Nextflow) crash 16 hours into a job due to a missing line escape, an issue that maestro would have caught at compile-time
User-defined arguments and inputs (in Maestro.toml) are verified to exist at program startup rather than process runtime, ensuring the user is immediately made aware of any configuration errors rather than having workflows fail much later
Ryan has extensively leveraged existing workflow execution tools — notably Nextflow — in his research and work. As such, we had a conversation to compare and contrast maestro’s features against currently established tools.
Ryan McLaughlin
PhD Candidate, Bioinformatics
In our discussion, Ryan noted that
[maestro] presents improvements over the top of the line.
Specifically, he was impressed with maestro’s compile-time script validation (noting that shell syntax errors are a common frustration when defining workflows) as well as support for easy configuration of execution environments. Furthermore, Ryan expressed that he wishes to leverage maestro in some of his research/work moving forward.
Scale with complexity
maestro provides a simple interface for enabling user configuration of complex, multi-step workflows
Workflows are self-documenting via procinfo.toml, enabling user transparency on process dependencies and execution configuration
User configuration is designed to be simple, intuitive, and non-repetitive
Maestro.toml allows inheritance for executor definitions, where “children” can override attributes of their parents
Enable redistribution
maestro-cli provides the maestro bundle command, allowing generation of bundled, self-contained workflows
Optionally, these bundles can be packaged directly into a single compressed archive
The --arch option enables developers to automatically build for multiple operating systems and architectures
A bundled maestro workflow built for multiple architectures, provided by the “maestro bundle” command in maestro-cli. Shown in its raw (maestro bundle --arch all) and compressed (maestro bundle --arch all --compress xz) formats.
Bundled workflows are platform-native executables, allowing them to be executed directly like any other application without requiring the user to install maestro.
Improve the accessibility of bioinformatics
maestro provides dual APIs in Rust (primary) and Python (simplified)
The Python API provides fewer correctness guarantees (due to the nature of the Python language), but improves the accessibility of developing bioinformatics workflows due to the language’s simplicity
The Rust API is more fully featured and geared toward advanced users, with improved correctness checks and bundling integration
Type hinting and LSP integration are provided for both the Rust and Python APIs
As such, any editor with language server support can provide code highlighting, completion, and diagnostics inline
maestro’s language server support in both Rust (rust-analyzer) and Python (ty)
Installation
libmaestro is distributed as a standard Rust library. To add it to your project, simply run
Users can interface with a maestro workflow via two files, Maestro.toml and procinfo.toml. Maestro.toml is used to configure the behaviour of workflows, while procinfo.toml provides workflow dependency and execution documentation.
Maestro.toml
[inputs]
This table is designed to allow passing sets of input paths into the program.
Example:
[inputs]
# Input PDB files to the molecular dynamics workflow
input_files = ["data/1v9e_ph4.pdb", "data/1v9e_ph10.pdb", "data/BtCAII_ph4.pdb"]
[executor]
Type: top-level table containing subtables, with each subtable representing an executor definition. The subtable name defines the executor name (e.g., [executor.default] defines an executor named “default”).
This table is designed to allow defining execution environments for processes. Each executor definition MAY either fully define an executor, OR inherit from another executor’s definition. An executor definition MAY be of type "Local" or "Slurm". The Local executor SHALL directly execute the process, whereas the Slurm executor SHALL schedule the process onto a compute node via Slurm. The following configuration options are available:
modules
Defines a list of modules to import before the process runs
Equivalent to module load {name}
cpus
Defines how many CPUs to request for the process
memory
Defines how much memory to request for the process
type must be "per_node" or "per_cpu"
amount must be a value in megabytes
gpus
Defines how many GPUs to request for the process
tasks
Advises the Slurm controller that the process will launch a maximum of n tasks
Equivalent to the ntasks Slurm directive
nodes
Defines how many nodes should be assigned to the process
partition
Defines which partition the process should spawn on
time
Defines the maximum time to allocate for the process
Valid fields to set are days, hours, mins, and secs
mins and secs should be valid (<60)
account
Defines the account that should be charged for resources associated with the process
mail_user
Defines an email address for mail notifications
mail_type
Defines what alerts should send an email
Values MUST be NONE, ALL, BEGIN, END, FAIL, REQUEUE, INVALID_DEPEND, STAGE_OUT, TIME_LIMIT_50, TIME_LIMIT_80, TIME_LIMIT_90, TIME_LIMIT, or ARRAY_TASKS
Chained inheritance is allowed; the latest item in an inheritance chain takes priority
For example, if slurm_inherited2 inherits from slurm_inherited1 which inherits from slurm_base, and all three define some value for cpus, a process which executes on slurm_inherited2 will prioritize the cpus value in slurm_inherited2
All inheritance MUST be of the same type; e.g., inheriting from a Local executor and setting cpus is not allowed (there is no cpus configuration option for Local executors, only Slurm executors)
procinfo.toml
This file describes all processes in the workflow. It contains one table per process, along with information concerning the process.
# Runs tleap to prepare input files for MD simulation
["src/main.rs:39:17"]
executor = "direct"
deps = ["tleap"]

# Executes parmed to convert from AMBER into a GROMACS-compatible format
["src/main.rs:68:17"]
executor = "direct"
deps = ["python", "py:parmed"]

# Executes the primary GROMACS molecular dynamics process
["src/main.rs:100:17"]
executor = "slurm"
deps = ["gmx"]
Documentation string (above table)
Additional information provided by the developer about the process
executor
Describes which executor is used for the process
This executor MUST be defined in Maestro.toml
deps
Defines the dependencies of the process
These dependencies MUST be available when the process runs
Developer Reference
libmaestro’s main mechanism for defining an atomic workflow component is the process! macro. These blocks use a specialized syntax whose arguments parametrize the process.
Formal Specification
Augmented Backus-Naur Form is a metalanguage — a language for formally specifying the syntax of other languages — outlined in IETF Request for Comment (RFC) 5234. libmaestro’s process! syntax is defined in ABNF as follows (note: %s syntax from RFC 7405 is used):
where the special rule EXPR represents a Rust expression, IDENT represents a Rust identifier, LITSTR represents a Rust string literal, and LITBOOL represents a Rust boolean literal. EXECUTOR and SCRIPT are required, and PARAM types may not be repeated, except for ARG (where max. one definition of each arg type inputs/args/outputs is permitted).
Informal Specification
Less formally, libmaestro exposes its primary process definition API via process! blocks. These blocks contain arguments that specify the process to execute, its inputs and outputs, and its execution environment.
process! blocks MAY begin with a documentation string:
process! {
    /// This is a docstring that describes this process
    /// Maybe I talk more about what it does
    /// ...so the user knows how they should configure its resources
which is included in procinfo.toml to provide information about the process to the user
# This is a docstring that describes this process
# Maybe I talk more about what it does
# ...so the user knows how they should configure its resources
["lib/examples/workflow.rs:31:17"]
Following this doc string, various fields are allowed. Each field MUST follow the format LHS = RHS, where LHS is the field name and RHS is a Rust expression (the constraints on which expressions are allowed depend on the field).
name
Purpose
This field sets the process name, which also becomes the name of the folder in which the process is run. If the process will run multiple times in the same workflow with different inputs, it is desirable to make the process name depend on its input in some way (so that each run creates a unique folder and does not overwrite past runs).
Any arbitrary Rust expression which produces a value that implements ToString.
Optional
Yes. If a name is not set, a randomly generated ID will be used.
executor
Purpose
This field determines which user-defined executor will be used to execute the process. This information is documented in procinfo.toml; for instance, a process that depends on an executor named default will list executor = "default" in its procinfo.toml entry.
inputs, args, and outputs
Purpose
These fields allow injection of Rust variables into the script. Inputs/outputs must be file paths, and their existence will be checked before/after the script runs.
An array, where each element is a Rust identifier that points to a value. For inputs/outputs, these values MUST implement AsRef<OsStr> so that they can be constructed into Path objects. For args, these values MUST implement ToString. The values are then injected into the script as shell variables, allowing the use of $<identifier> to refer to them (see the sketch below).
Optional
Yes.
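As a rough sketch of how these fields fit together (the variable and process names here are hypothetical, and the remaining process! fields are elided with ... as in the examples below — this is not a complete definition):

let reference = Path::new("data/reference.fasta"); // input: must implement AsRef<OsStr>
let threads = 8;                                   // arg: must implement ToString
let alignment = Path::new("aligned.bam");          // output: checked for existence after the script runs

let process_output = process! {
    name = "align_reads",
    executor = "default",
    inputs = [reference],
    args = [threads],
    outputs = [alignment],
    // within the script, these values are available as the shell variables
    // $reference, $threads, and $alignment
    ...
}?;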
dependencies
Purpose
To document the dependencies of the process. These dependencies are injected into the process’s entry in procinfo.toml (under the deps key).
An array, where each element MUST be a Rust string literal. maestro will attempt to auto-determine dependencies by parsing the script definition as Bash syntax; entries specified in this array will be appended to these auto-determined dependencies (for information on this parsing algorithm, see the rundown here). Array entries with the form "!<name>" will instead direct maestro to ignore the auto-determined dependency matching name (for example, "!cat" would cause an auto-detected cat dependency to be ignored). If any array entry matches "!", ALL auto-determined dependencies will be ignored.
Optional
Yes.
inline
Purpose
To specify whether the script is defined inline or in a separate file.
The script itself is given as a Rust string literal. If inline is true (the default behaviour), this string literal MUST contain the body of the script. A shebang MAY be included; if no shebang is detected, #!/bin/bash will be injected. If inline is false, this string literal MUST be a path to a script file; the contents of this file will be read and included into the workflow at compile-time. Leading/trailing whitespace in the script will be stripped, and leading whitespace on each line will be stripped.
Optional
No.
Output format
The output of the process! block is a list of paths (Vec<PathBuf>), which contains the canonicalized paths associated with each item defined in the process’s outputs block in order, then the process’s working directory. For instance, when executing the following:
let output1 = Path::new("out.txt");
let output2 = Path::new("err.txt");
let process_output = process! {
    name = "my_process",
    ...
    outputs = [output1, output2],
    ...
}?;
the variable process_output would be a vector of length 3, containing the canonicalized path associated with output1, the canonicalized path associated with output2, and the path to the process’s execution directory. For instance, it could be set to:
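(Illustrative values only — the actual locations depend on where the workflow’s session directory is created.)

// hypothetical paths; the working directory is created by maestro
vec![
    PathBuf::from("/home/user/maestro_work/my_process/out.txt"),
    PathBuf::from("/home/user/maestro_work/my_process/err.txt"),
    PathBuf::from("/home/user/maestro_work/my_process"),
]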
maestro also provides an efficient API to destructure these vectors via lossy conversion into arrays (the .into_array() method). For example, replacing the above with
let [process_out1, process_out2] = process! {
    name = "my_process",
    ...
    outputs = [output1, output2],
    ...
}?.into_array();
binds output1 from the process definition to process_out1, and output2 to process_out2. The third element of the vector (the process’s execution directory) will remain unbound and be discarded.
Additional APIs
arg! and inputs!
Used to parse arguments and input paths from Maestro.toml. The body MUST be a Rust string literal; this SHOULD match an entry in the [args] or [inputs] tables in Maestro.toml. The existence of matching entries in Maestro.toml is checked at program startup, thus ensuring that processes do not fail at execution time due to missing configuration. arg! yields the argument value as a &str; inputs! yields its paths as a &[&Path].
Example
let init_msg = arg!("init_msg");
let input_files = inputs!("input_files");
#[maestro::main]
This is an attribute macro which can be attached to functions; it MUST be attached to the program’s main function. This provides an ergonomic pattern to verify the Maestro.toml configuration and initialize the workflow’s session directory, as well as inject session teardown functionality. The following:
#[maestro::main]
fn main() {
    // code in main...
}
is effectively identical to
fn main() {
    maestro::initialize();
    let main_result = {
        // code in main...
    };
    maestro::deinitialize();
    main_result
}
dagger integration
libmaestro is explicitly designed to integrate with dagger. Each process! invocation is entirely stateless and self-contained, enabling processes to be parallelized and scaled at will by leveraging dagger’s primitives. Furthermore, maestro is designed to represent process I/O as function I/O, enabling seamless integration with dagger’s “parallelization based on data flow” model. As shown above, a process’s outputs can be destructured; the logical conclusion of this approach is that the outputs of one process can become the inputs to the next. For instance, in our molecular dynamics maestro workflow:
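(The snippet below is an illustrative sketch rather than the workflow’s exact source: the function names come from the description that follows, while their signatures and the intermediate file names are assumptions.)

// tleap produces AMBER topology/coordinate files from the input structure
let [prmtop, inpcrd] = tleap(&input_pdb)?.into_array();
// parmed converts them into a GROMACS-compatible format
let [gro, top] = parmed(&prmtop, &inpcrd)?.into_array();
// gromacs runs the molecular dynamics simulation itself
let [trajectory] = gromacs(&gro, &top)?.into_array();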
Here, tleap, parmed, and gromacs are all functions which wrap a process! invocation. The outputs of each process are destructured, then passed as inputs to the next, demonstrating the flow of information between processes in the workflow (this closely mirrors pipes on Unix systems). If, for instance, we wished to execute multiple processes on the outputs of tleap, we could leverage dagger!:
resulting in the following parallel process (this visualization is rendered directly by dagger):
For our molecular dynamics simulations, we were more interested in parallelizing the process on multiple input paths (.pdb files for analysis). Specifically, we wished to enable the user to input multiple paths:
and execute the full molecular dynamics pipeline on each file in parallel. This was implemented by leveraging dagger’s parallelize primitive:
let input_files = inputs!("input_files");
let workflow = |path: &&Path| -> NodeResult<PathBuf> {
    // molecular dynamics workflow defined here
};
// executes the `workflow` function on each input file in parallel
let process_results = parallelize(input_files, workflow)
    .into_iter()
    .map(|result| result.expect("No processes should panic!"));
Our molecular dynamics pipeline running on the ARC Sockeye HPC cluster. Three input files (1v9e_ph4, 1v9e_ph10, BtCAII_ph4) are being processed in parallel. The tleap and parmed steps execute directly, and the gromacs step is scheduled on Slurm. Video is sped up 8x.
Python API
A set of Python bindings to libmaestro is available, enabling users to build workflows directly in Python. This binding set is comprehensive, and largely provides users the same APIs as are available in Rust. However, most of the correctness checks and compile-time processes executed by libmaestro (e.g., checking process definitions for errors, verifying that all args/inputs/executor definitions are configured in Maestro.toml, generating procinfo.toml) are not available when using libmaestro from Python. This is simply due to constraints in the language (Python has no compile-time execution mechanisms that enable libmaestro to execute these processes).
Python bindings to libmaestro are generated using pyo3, and the stub file is generated using pyo3-stub-gen.
Stub file
The full Python API of libmaestro is available in the stub file here. This file is bundled as part of the Python package wheel, and informs your favourite editor’s code completion and diagnostics. The stub contains a full set of type annotations, enabling type validation through static analysis tools such as mypy, as well as language servers such as basedpyright and ty.
Core API
Processes can be defined via the constructor of the Process type.
maestro-cli
Installation
This will install the maestro binary and load it into your PATH
Commands
To see all available commands, run maestro help:
λ maestro help
Subcommands in the maestro CLI

Usage: maestro <COMMAND>

Commands:
  init          Initialize a new maestro project
  bundle        Compile a project and package it for redistribution
  build         Build a project
  run           Run a binary or project
  kill          Kill a running maestro process
  update-cache  Update the libmaestro cache
  help          Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version
init
λ maestro init --help
Initialize a new maestro project

Usage: maestro init [PATH]

Arguments:
  [PATH]  [default: .]

Options:
  -h, --help  Print help
Initializes a new maestro project, with all dependencies set up and libmaestro pre-built. src/main.rs contains a simple demo workflow pre-defined.
One positional argument to specify a target path at which the project should be initialized is supported
If unset, the current directory will be used (it must be empty)
bundle
λ maestro bundle --help
Compile a project and package it for redistribution

Usage: maestro bundle [OPTIONS] [CARGO_ARGS]...

Arguments:
  [CARGO_ARGS]...  Arguments to pass to cargo build

Options:
  -c, --compress <COMPRESS>  Compresses the bundle into an archive [possible values: zip, gzip, xz, bzip2, zstd, lzma]
  -a, --arch <ARCH>          Bundle for a target architecture; defaults to the host arch [possible values: linux, apple, all]
  -r, --runtime <RUNTIME>    Container runtime for multi-arch builds; only read if --arch is set [possible values: docker, podman, apptainer]
  -h, --help                 Print help
Bundles the current maestro project for redistribution.
--compress allows the user to compress the bundled folder into a single archive file; multiple compression algorithms are supported
--arch allows the user to build multiple executables for a target operating system: one executable will be built for the x86_64 architecture, and one for the aarch64 (ARM) architecture
--arch builds in a container to enable cross-compilation; --runtime sets which runtime should be used to spawn the container
build
λ maestro build --help
Build a project

Usage: maestro build [CARGO_ARGS]...

Arguments:
  [CARGO_ARGS]...  Arguments to pass to cargo build

Options:
  -h, --help  Print help
Builds the current maestro project.
run
λ maestro run --help
Run a binary or project

Usage: maestro run [OPTIONS] [BINARY] [CARGO_ARGS]... [ARGS]...

Arguments:
  [BINARY]         A binary to run; when unspecified, the current project will be run
  [CARGO_ARGS]...  Arguments to pass to cargo run
  [ARGS]...        Arguments to pass to the program

Options:
  -b, --background  Run detached from the current shell session
  -h, --help        Print help
Runs the current maestro project, or the target binary.
One positional argument which specifies the path to an already built binary is supported
Built executables can also be run directly (e.g., ./<bin_name>)
--background spawns the process in the background and detaches it, similar to nohup
kill
λ maestro kill --help
Kill a running maestro process

Usage: maestro kill <TARGET>

Arguments:
  <TARGET>  The process to kill, by name or path

Options:
  -h, --help  Print help
Kills a running maestro process.
One positional argument which specifies the process to kill is required
If a process is running with id elated-bat (meaning it is running at maestro_work/elated-bat), running maestro kill elated-bat OR maestro kill maestro_work/elated-bat will work
update-cache
λ maestro update-cache --help
Update the libmaestro cache

Usage: maestro update-cache

Options:
  -h, --help  Print help
Rebuilds the cached version of libmaestro. This is done automatically when running maestro run/build/bundle.
help
Displays the full help documentation for maestro-cli
λ maestro help
Subcommands in the maestro CLI

Usage: maestro <COMMAND>

Commands:
  init          Initialize a new maestro project
  bundle        Compile a project and package it for redistribution
  build         Build a project
  run           Run a binary or project
  kill          Kill a running maestro process
  update-cache  Update the libmaestro cache
  help          Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version
DBTLs and Development
finalflow
The first proof of concept design sketch of a workflow executor. Designed for basic local execution and piping, with inbuilt parallelism primitives.
finalflow was implemented as a single-crate Rust library.
finalflow was able to execute simple (e.g., 2-step) processes consistently, but was riddled with small bugs and lacked proper error handling and stdout/stderr piping.
For future versions, parallelism should be outsourced to a dedicated framework (i.e., dagger) and support for additional execution environments should be expanded.
finalflow was highly reliant on mutable global state, making it difficult to parallelize due to high reliance on locks for safe, concurrent access. Additionally, it was not built to be extended for other execution platforms. This version is designed to make each process atomic, stateless, and configurable to a specific executor.
finalflow was rewritten to improve its atomicity, making it solely reliant on a one-time session directory setup. Furthermore, execution was offloaded to an Executor trait, allowing extensibility for future execution platforms. This DBTL is related to this Git commit.
This version of maestro is able to more reliably execute local processes, and all parallelization primitives have been stripped in favour of designing with dagger integration in mind. Execution is done by passing a process definition to a struct that implements Executor.
For future versions, execution should be extended to support additional platforms other than direct execution.
maestro was initially only able to execute scripts directly. This DBTL cycle aimed to add support for SLURM execution and configuration.
Slurm support was implemented by defining a new SlurmExecutor struct and implementing the Executor trait for it. The SlurmExecutor struct contains a slurm_config field that holds fields wrapping sbatch directives. Slurm support was added in this Git commit.
This version of maestro was tested to successfully schedule and execute scripts via Slurm on the University of British Columbia’s high performance compute cluster, ARC Sockeye.
For future versions, execution should be dynamic based on user configuration rather than hardcoded. Additionally, a framework needs to be developed for passing arguments into workflows.
Instead of hardcoding executor configurations, they will be configured at runtime via a Maestro.toml file that contains tables to define executors and input arguments.
All executor fields are made deserializable via serde-derive, and TOML parsing is done via toml. The process! API is updated to leverage custom, user-defined executors, and arg! is provided to access user-provided arguments.
Maestro.toml support was added in this Git commit. Various configurations were tested to ensure all deserialization is parsed as expected, including nested fields and tagged enumerations.
The runtime-configuration format provides far more transparent user configuration than hardcoded configurations. The most important learning point from this DBTL cycle is that all configuration must be checked at program startup, thus ensuring that execution does not fail unexpectedly midway through program execution.
Dr. Wong, who is familiar with bioinformatic workflows primarily from a biological perspective, observed that interacting with job schedulers poses a major challenge for novice bioinformaticians attempting to run complex computational workflows. He noted that maestro’s design — allowing workflows to be written independent of a specific executor and enabling execution environments to be configured through a config file — could significantly improve the accessibility of bioinformatics.
Dr. Donald Wong
Professor of bioinformatics
Library Internals
process!
process! is implemented as a function-like procedural macro. Broadly, these act like functions, but rather than taking in and outputting data, they take in and output raw code tokens. Effectively, procedural macros act as programmatic compile-time preprocessors which arbitrarily transform their inputs into new code which is executed. The input tokens to a process! block are first parsed into the following data structure by leveraging a custom parser built on top of the syn library:
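(The definition below is an assumed reconstruction for illustration; the field names mirror the fields documented in the Developer Reference rather than maestro’s exact internals.)

// assumed shape of the parsed process definition, built with syn types
struct ProcessDefinition {
    doc: Vec<String>,               // docstring lines, forwarded to procinfo.toml
    name: Option<syn::Expr>,        // optional process name expression
    executor: syn::LitStr,          // name of an executor defined in Maestro.toml
    inputs: Vec<syn::Ident>,        // identifiers injected as input path variables
    args: Vec<syn::Ident>,          // identifiers injected as plain arguments
    outputs: Vec<syn::Ident>,       // identifiers injected as output path variables
    dependencies: Vec<syn::LitStr>, // extra or ignored dependencies to record
    inline: Option<syn::LitBool>,   // whether the script literal is a body or a file path
    script: syn::LitStr,            // the script literal itself
}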
The tokens then undergo various transformations. For instance, if the script is not inlined (i.e., the content is in a separate file), the external file is read and the path is replaced with its contents; if a name is missing, a random ID is generated. Then, two critical steps are executed:
Script validation
Scripts are passed to ShellCheck for static analysis. However, the script must first undergo various transforms to improve the quality of ShellCheck’s analysis. A shebang is inserted if it is missing; leading and trailing whitespace is cleaned, and a variable definition is added for each Rust injection (the inputs, outputs, and args fields). Once ShellCheck has completed, its output is taken and parsed to identify if any errors were detected. If so, all error paths are transformed to point to a proper file/line/column (GCC format is used for integration with editor “path links”). For instance:
This script contains an error due to a missing line break; by clicking on the generated error message, the user is able to easily navigate to exactly where the error occurred.
Dependency analysis
The script is then analyzed to identify its dependencies. This is implemented via a custom parsing algorithm, which searches for binary identifiers at the start of lines, after | / && / || and within constructs such as {...} and (...). Keywords such as if / fi, for, case, etc. are explicitly ignored, as well as shell builtins such as alias, cd, break, echo, etc. The full list of shell builtins was sourced from the Bash man pages (you can execute
man bash | col -b | less +$(man bash | col -b | grep -n "SHELL BUILTIN COMMANDS" | tail -1 | cut -d: -f1)
to jump to the relevant section). This information is then appended to the growing procinfo.toml file, including the analyzed dependencies, the specified executor, etc.
Finally, various other transformations on the input tokens take place, and they are re-emitted to construct a maestro::Process object and execute it on the specified executor:
let executor_tokens = quote! {
    maestro::submit_request! {
        maestro::RequestedExecutor(#executor, file!(), line!(), column!())
    };
    maestro::config::MAESTRO_CONFIG.executors[#executor].exe(process)
};
quote! {{
    let process = maestro::Process::new(
        #name.to_string(),
        #container,
        vec![#(#input_pairs),*],
        vec![#(#arg_pairs),*],
        vec![#(#output_pairs),*],
        ::std::borrow::Cow::Borrowed(#process_lit),
    );
    #executor_tokens
}}
where quote! is a proc-macro for generation of source code tokens from raw syntax tree elements (for instance a Vec<LitStr>). This newly output source code is what the compiler sees, and what is executed at runtime.
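As a small, self-contained illustration of that interpolation (not taken from maestro’s source), the #(#x),* pattern splices every element of a collection into the generated tokens:

use proc_macro2::Span;
use quote::quote;
use syn::LitStr;

fn render_deps() -> proc_macro2::TokenStream {
    let deps: Vec<LitStr> = vec![
        LitStr::new("tleap", Span::call_site()),
        LitStr::new("gmx", Span::call_site()),
    ];
    // expands to the tokens: vec!["tleap", "gmx"]
    quote! { vec![#(#deps),*] }
}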
The magic of link sections
Readers experienced in the arcane magiks of writing procedural macros (although I am slightly skeptical that such a reader exists, I would be pleasantly surprised to learn otherwise) may have curiously noticed the maestro::submit_request! macro call embedded within the token expansion above. Every process!, arg!, and inputs! site includes such a call, where a RequestedExecutor/RequestedArg/RequestedInputs struct is submitted. Within libmaestro itself, a thread-safe global collection is initialized to store requested executors, args, and inputs, and each submit_request! site expands to define a function that submits the given struct into the global collection. These function calls are then annotated with a link-section attribute, similar to the following:
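(The snippet below is a rough reconstruction of the general init-array trick rather than maestro’s literal expansion; the section name shown is Linux-specific — macOS, for instance, uses __DATA,__mod_init_func.)

extern "C" fn __submit_executor_request() {
    // push the RequestedExecutor into libmaestro's global collection here
}

// Place a pointer to the function in the loader's init-array section so it is
// invoked before main() runs; #[used] keeps the static from being optimized away.
#[used]
#[link_section = ".init_array"]
static __SUBMIT_EXECUTOR_REQUEST: extern "C" fn() = __submit_executor_request;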
which ensures the submission function runs before main. As such, right at the start of main, all arguments and executors which are expected to be defined in Maestro.toml have already been submitted into the relevant global collections.
As seen above, the #[maestro::main] annotation injects a maestro::initialize() call at the start of main. This function iterates over the “expected arg/executor” collections and short-circuits when it detects an item missing in Maestro.toml. By leveraging this maestro::submit_request! API, libmaestro ensures that no matter where in the program a configuration-dependent value is requested, it is always checked at startup.
maestro build
maestro-cli does not simply offload building to cargo build. Internally, maestro caches a built copy of libmaestro, specific to each version of libmaestro and rustc:
λ ls -l .maestro_cache
total 0
drwxr-xr-x@  4 seb-hyland  staff   128 Sep 30 19:25 maestro-0.2.10_rustc-1.90.0
drwxr-xr-x@  4 seb-hyland  staff   128 Oct  4 18:11 maestro-0.2.10_rustc-1.92.0-nightly
drwxr-xr-x@ 41 seb-hyland  staff  1312 Oct  4 18:11 vendor
Source code for a specific version of libmaestro and its dependencies is obtained via cargo vendor; then, dependency paths are normalized and libmaestro is built into a static library (libmaestro.rlib). This allows libmaestro to be rebuilt only when the libmaestro or compiler version changes (which is infrequent), making builds significantly faster.
A side-by-side comparison of building the same project via `maestro build` and `cargo build --release`. Both are fresh builds (no Cargo build cache) and have all optimizations enabled (release profile). Building with `maestro` only takes 1.31s, while building with `cargo build` takes 18.26s.
Implications and Proof of Concept
As a proof of maestro’s potential in building practical, real-world workflows, we rewrote our structural prediction, molecular dynamics, and bioinformatics workflows to leverage libmaestro. For each workflow, libmaestro’s self-documentation mechanisms (the Maestro.toml and procinfo.toml files) provide information on workflow arguments and processes.
Structural prediction and molecular dynamics
Our structural prediction workflow was originally implemented in Nextflow, and our molecular dynamics workflow as a simple shell script. Both were converted to run under maestro to improve their reliability, as well as to make them available for other iGEM teams to use. The structural prediction workflow is available here, and the molecular dynamics workflow is available here.
Our molecular dynamics pipeline running on the ARC Sockeye HPC cluster. Three input files (1v9e_ph4, 1v9e_ph10, BtCAII_ph4) are being processed in parallel. The tleap and parmed steps execute directly, and the gromacs step is scheduled on Slurm. Video is sped up 8x.
Bioinformatics
We originally had a collection of shell scripts for interacting with EggNOG database files. We decided to combine these discrete one-off tools into a bundled workflow that can be re-used; it is available here.
Active Development
libmaestro is still under active development, with additional features being added consistently.
SSH execution
Ongoing development is primarily focused on supporting remote process execution over SSH. The Local and Slurm executors in Maestro.toml will support an ssh field, enabling users to specify a remote server on which to execute the process. All SSH connections will be initialized at program startup, thus supporting interactive (password/2FA) login systems.
Binding to libssh2
maestro binds to libssh2 by leveraging Rust’s ssh2 crate. This enables maestro to construct and hold persistent SSH connections, and split the connection into channels on which commands or file transfers can be executed. Furthermore, ssh2 provides primitives for keyboard-interactive authentication; maestro leverages this mechanism, building a user prompting interface for session connection.
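To give a feel for these primitives, here is a minimal, generic ssh2 sketch (not maestro’s implementation; the host, user, and executed command are placeholders, and agent-based authentication stands in for the keyboard-interactive flow):

use ssh2::Session;
use std::io::Read;
use std::net::TcpStream;

fn run_remote(host: &str, user: &str) -> Result<String, Box<dyn std::error::Error>> {
    // establish the TCP connection and perform the SSH handshake
    let tcp = TcpStream::connect((host, 22))?;
    let mut sess = Session::new()?;
    sess.set_tcp_stream(tcp);
    sess.handshake()?;
    // authenticate; ssh2 also exposes userauth_keyboard_interactive for 2FA prompts
    sess.userauth_agent(user)?;

    // split the persistent session into a channel and execute a command on it
    let mut channel = sess.channel_session()?;
    channel.exec("squeue --me")?;
    let mut output = String::new();
    channel.read_to_string(&mut output)?;
    channel.wait_close()?;
    Ok(output)
}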
Abstraction
Currently, maestro uses std::process::Command to spawn processes and std::fs::copy/std::os::unix::fs::symlink for staging files. The SSH feature requires a rework of these mechanisms altogether; process execution and file transfers are hidden behind an enum implementation, allowing swapping between native mechanisms and SSH/SFTP. Additionally, file paths require a rework, so that paths to remote files can be supported alongside local files. This is roughly implemented as follows:
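(An assumed shape for illustration — the names and details below may differ from the actual implementation.)

use std::path::PathBuf;

enum WorkflowPath {
    /// A file on the machine driving the workflow
    Local(PathBuf),
    /// A file on a remote execution host, addressed by host name and remote path
    Remote { host: String, path: PathBuf },
}

impl WorkflowPath {
    /// Stage the file locally, transferring it only when it lives on a remote host.
    fn to_local(self) -> std::io::Result<PathBuf> {
        match self {
            WorkflowPath::Local(path) => Ok(path),
            WorkflowPath::Remote { .. } => todo!("download the file over SFTP"),
        }
    }
}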
As such, processes executed on remote hosts return a Vec<WorkflowPath::Remote>, while local processes return a Vec<WorkflowPath::Local>. This allows chained processes which run on the same remote server to avoid transferring large files to the local client, though files can still be manually moved by leveraging WorkflowPath’s abstractions: