Introduction to SEML: Slurm Experiment Management Library

Why SEML?

In a nutshell, SEML lets you leverage the massive parallelization of a compute cluster without writing boilerplate code or manually keeping track of experiments. Specifically, it lets you:

  • very easily define hyperparameter search spaces using YAML files,
  • run these hyperparameter configurations on a compute cluster using Slurm,
  • and track the experimental results using Sacred and MongoDB.
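
To make the idea of a hyperparameter search space concrete, here is a minimal sketch of how a grid of parameter values expands into individual experiment configurations. It is an illustration of the concept only, not SEML's implementation: the dictionary stands in for what would be written in a YAML file, and the parameter names and the `expand_grid` helper are hypothetical.

```python
from itertools import product

# Hypothetical search space; with SEML this would be written in a YAML file.
search_space = {
    "learning_rate": [0.01, 0.001],
    "hidden_units": [32, 64, 128],
}


def expand_grid(space):
    """Return one configuration dict per combination of parameter values."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


configs = expand_grid(search_space)
print(len(configs))  # 2 learning rates x 3 layer sizes = 6 configurations
print(configs[0])    # {'learning_rate': 0.01, 'hidden_units': 32}
```

Each resulting dictionary corresponds to one experiment that could be run as a separate job on the cluster.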

In addition, SEML offers many more features to make your life easier, such as

  • automatically saving and loading your source code for reproducibility,
  • collecting experiment results into a Pandas dataframe,
  • easy debugging on Slurm or locally,
  • automatically checking your experiment configurations,
  • extending Slurm with local workers,
  • and keeping track of resource usage (experiment runtime, RAM, etc.).

You can even get notified on Mattermost whenever an experiment starts, completes, or fails!

How does it work?

  • SEML takes a YAML file containing hyperparameters and metadata about a set of experiments.
  • SEML stores each individual experiment's data as an entry in a MongoDB database collection.
  • In general, each type of experiment gets its own database collection.
  • Each individual experiment is an entry in the respective collection.
  • A database entry is essentially a JSON dictionary containing (among other things):
    • the state of the experiment,
    • the experiment configuration (i.e., hyperparameters),
    • the generated results, and
    • the cached source code (by default).
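
As a rough sketch of what such an entry could look like, the dictionary below mirrors the four parts listed above. The field names and values are assumptions chosen for illustration (only `_id` is a standard MongoDB convention); the schema SEML actually uses may differ.

```python
import json

# Hypothetical database entry; field names and values are placeholders.
entry = {
    "_id": 1,                                     # MongoDB document identifier
    "status": "COMPLETED",                        # state of the experiment
    "config": {"learning_rate": 0.01,             # experiment configuration
               "hidden_units": 32},
    "result": {"test_accuracy": 0.93},            # generated results (placeholder value)
    "source_files": ["train.py"],                 # reference to the cached source code
}


def is_finished(experiment):
    """Return True if the (assumed) status field marks the run as completed."""
    return experiment.get("status") == "COMPLETED"


# The entry is plain JSON-compatible data, so it survives a serialization round trip.
serialized = json.dumps(entry, indent=2)
restored = json.loads(serialized)

print(is_finished(restored))
print(restored["config"])
```

Because every experiment is stored as one self-contained document like this, its state, configuration, and results can be queried together from the collection.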