ALPS Project: scheduler library

Running a simulation

Here we discuss how to run a Monte Carlo simulation based on the scheduler library. The source code and input files are provided in the example directory.

Preparing the job file

The first step is to prepare a job file, following the schema at http://xml.comp-phys.org.

The XML job file

The job file in the example looks as follows (without processing instructions):
<JOB>
  <OUTPUT file="parm.xml"/>
  <TASK status="new">
    <INPUT file="parm.task1.in.xml"/>
    <OUTPUT file="parm.task1.xml"/>
  </TASK>
  <TASK status="new">
    <INPUT file="parm.task2.in.xml"/>
    <OUTPUT file="parm.task2.xml"/>
  </TASK>
  <TASK status="new">
    <INPUT file="parm.task3.in.xml"/>
    <OUTPUT file="parm.task3.xml"/>
  </TASK>
</JOB>
The optional OUTPUT element gives a new name to the output job file written after checkpoints. The job file then contains a list of tasks. In each TASK element, the INPUT element specifies the input file for the task. Optionally, a new file name for the checkpoints can be given by an OUTPUT element.
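As an illustration of the structure described above, the following minimal Python sketch builds an equivalent job file with the standard xml.etree.ElementTree module. The helper name build_job and the file name pattern are assumptions made for this example; they are not part of the scheduler library.

```python
import xml.etree.ElementTree as ET


def build_job(prefix, num_tasks):
    """Return a <JOB> element listing num_tasks tasks named after prefix.

    build_job is a hypothetical helper for this sketch, not a library call.
    """
    job = ET.Element("JOB")
    # Optional: name of the job file written after checkpoints.
    ET.SubElement(job, "OUTPUT", file=f"{prefix}.xml")
    for i in range(1, num_tasks + 1):
        task = ET.SubElement(job, "TASK", status="new")
        # Input file for this task.
        ET.SubElement(task, "INPUT", file=f"{prefix}.task{i}.in.xml")
        # Optional: file name used when this task is checkpointed.
        ET.SubElement(task, "OUTPUT", file=f"{prefix}.task{i}.xml")
    return job


job = build_job("parm", 3)
if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
    ET.indent(job, space="  ")
print(ET.tostring(job, encoding="unicode"))
```

The printed document has the same elements and attributes as the example job file, without processing instructions.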

The XML task files

Currently the only task file format in use is the format for Monte Carlo simulations, given by the schema at http://xml.comp-phys.org:
<SIMULATION>
  <PARAMETERS>
    <PARAMETER name="L">100</PARAMETER>
    <PARAMETER name="SWEEPS">10000</PARAMETER>
    <PARAMETER name="T">0.5</PARAMETER>
    <PARAMETER name="THERMALIZATION">100</PARAMETER>
    <PARAMETER name="WORK_FACTOR">SWEEPS * L</PARAMETER>
  </PARAMETERS>
</SIMULATION>
Before a simulation starts, this file just lists all simulation parameters. Afterwards results and checkpoint information will be added. See the schema documentation for more details.
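To inspect the parameters of a task file outside the library, a short Python sketch such as the following may help. It assumes only the element names shown in the example above; the function name read_parameters is hypothetical. Values are kept as strings, since some of them (such as WORK_FACTOR here) are expressions rather than numbers.

```python
import xml.etree.ElementTree as ET

# The task file from the example above, inlined to keep the sketch self-contained.
TASK_XML = """<SIMULATION>
  <PARAMETERS>
    <PARAMETER name="L">100</PARAMETER>
    <PARAMETER name="SWEEPS">10000</PARAMETER>
    <PARAMETER name="T">0.5</PARAMETER>
    <PARAMETER name="THERMALIZATION">100</PARAMETER>
    <PARAMETER name="WORK_FACTOR">SWEEPS * L</PARAMETER>
  </PARAMETERS>
</SIMULATION>"""


def read_parameters(xml_text):
    """Return a dict mapping each PARAMETER name to its (string) value."""
    root = ET.fromstring(xml_text)
    return {p.get("name"): (p.text or "").strip() for p in root.iter("PARAMETER")}


params = read_parameters(TASK_XML)
for name, value in params.items():
    print(f"{name} = {value}")
```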

Two parameters have a special meaning:

Parameter    Default  Meaning
SEED         0        The random number seed used in the next Monte Carlo run created. After using a seed in the creation of a Monte Carlo run, this value gets incremented by one.
WORK_FACTOR  1        A factor by which the work that needs to be done for a simulation is multiplied in load balancing.
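The SEED bookkeeping described above can be pictured with the following Python sketch. It is only an illustration of the documented behavior (default 0, incremented by one after each run is created), not the library's implementation; the function name next_run_seed is hypothetical.

```python
def next_run_seed(params):
    """Return the seed for a newly created run and advance SEED by one.

    params is a plain dict of string-valued parameters; a missing SEED
    falls back to the documented default of 0.
    """
    seed = int(params.get("SEED", 0))
    params["SEED"] = str(seed + 1)
    return seed


params = {"L": "100", "T": "0.5"}  # no SEED given, so the default applies
seeds = [next_run_seed(params) for _ in range(3)]
print(seeds)            # seeds handed to three successive runs
print(params["SEED"])   # value that the next run would receive
```

Each Monte Carlo run created for the same simulation thus receives a different seed.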

Converting old-style parameter files to XML

Versions 1.0 - 1.5 of the old ALEA library, which forms the basis of the current library, had simple plain-text parameter files, consisting of a number of parameter assignments of the form:
MODEL="Ising";
SWEEPS=1000;
THERMALIZATION=100;
WORK_FACTOR=[L*SWEEPS];
{ L=10; T=0.1; }
{ L=20; T=0.05; }
where each group of assignments inside a block of curly braces {...} indicates a set of parameters for a single simulation. Assignments outside of a block of curly braces are valid globally for all simulations after the point of definition. Strings are given in double quotes, as in "Ising", and expressions in square brackets, as in [L*SWEEPS]. To ensure backwards compatibility, and also because this format is easier to enter than a set of XML files, we provide a tool, convert2xml, which can convert these parameter files into XML files. The syntax is:
convert2xml parameterfile [xmlfileprefix]
which converts parameterfile into a set of XML files whose names start with the prefix given as the optional second argument. The default for the second argument is the name of the parameter file.
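To make the scoping rule of the old-style format concrete, here is a simplified Python sketch that expands such a file into one parameter set per simulation. It is not the convert2xml implementation: the function name parse_parameter_file is hypothetical, values are kept as raw strings (quotes and square brackets included), and values containing semicolons or braces are not handled.

```python
import re

# One token per match: '{', '}', or a 'NAME=value;' assignment.
TOKEN = re.compile(r"\{|\}|(\w+)\s*=\s*([^;{}]+);")

TEXT = """MODEL="Ising"; SWEEPS=1000; THERMALIZATION=100; WORK_FACTOR=[L*SWEEPS];
{ L=10; T=0.1; }
{ L=20; T=0.05; }"""


def parse_parameter_file(text):
    """Return one dict per {...} block, merged with the globals defined before it."""
    global_params, current, simulations = {}, None, []
    for match in TOKEN.finditer(text):
        token = match.group(0)
        if token == "{":
            # A block starts from a snapshot of the globals seen so far.
            current = dict(global_params)
        elif token == "}":
            simulations.append(current)
            current = None
        else:
            name, value = match.group(1), match.group(2).strip()
            target = global_params if current is None else current
            target[name] = value
    return simulations


sims = parse_parameter_file(TEXT)
for number, sim in enumerate(sims, start=1):
    print(number, sim)
```

For the example file this yields two simulations, each carrying MODEL, SWEEPS, THERMALIZATION and WORK_FACTOR from the global assignments plus its own L and T.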

Running the simulation on a serial machine

The simulation is started by first creating the job files and then giving the name of the XML job file as an argument to the program. In our example, the program is called main and the sequence for running it is:
convert2xml parm job
main job.in.xml
The results will be stored in a file job.out.xml, which refers to the files job.task1.out.xml, job.task2.out.xml and job.task3.out.xml for the results of the three simulations.
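When scripting such a run, the two steps and the expected result files can be written down as in the following Python sketch. The function name serial_run_plan is hypothetical, and the file names simply follow the pattern given in the text above. The sketch only builds the argument lists; each of them could be passed to subprocess.run(cmd, check=True) on a machine where convert2xml and the simulation program are installed.

```python
def serial_run_plan(parameter_file, prefix, program, num_tasks):
    """Return (commands, outputs) for a serial run.

    commands: argument lists for convert2xml and the simulation program.
    outputs:  result files expected afterwards, following the naming
              pattern described in the text.
    """
    commands = [
        ["convert2xml", parameter_file, prefix],  # creates <prefix>.in.xml
        [program, f"{prefix}.in.xml"],            # runs the simulation
    ]
    outputs = [f"{prefix}.out.xml"]
    outputs += [f"{prefix}.task{i}.out.xml" for i in range(1, num_tasks + 1)]
    return commands, outputs


commands, outputs = serial_run_plan("parm", "job", "main", 3)
for cmd in commands:
    print(" ".join(cmd))
print(outputs)
```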

Command line options

The program takes a number of command line options to control the behavior of the scheduler:

Option                        Default   Description
-T or --time-limit timelimit  infinity  gives the time (in seconds) which the program should run before writing a final checkpoint and exiting.
--Tc checkpointtime           1800      gives the time (in seconds) after which the program should write a checkpoint.
--Tmin checkingtime           60        gives the minimum time (in seconds) which the scheduler waits before checking (again) whether a simulation is finished.
--Tmax checkingtime           900       gives the maximum time (in seconds) which the scheduler waits before checking (again) whether a simulation is finished.

Running the simulation on a parallel machine

Running the simulation on a parallel machine is as easy as running it on a single machine. We give the example using MPI. After starting the MPI environment (using, e.g., lamboot for LAM MPI), you run the program in parallel using mpirun. For example, to run it on four processes:
convert2xml parm job
mpirun -np 4 main job.in.xml

Command line options

In addition to the command line options for the sequential program there are two more for the parallel program:

Option            Default   Description
--Nmin numprocs   1         gives the minimum number of processes to assign to a simulation.
--Nmax numprocs   infinity  gives the maximum number of processes to assign to a simulation.

If there are more processors available than simulations, more than one Monte Carlo run will be started for each simulation.
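For scripted runs, the command line can be assembled from the options listed above, as in the following Python sketch. The option flags are the ones documented here; everything else is an assumption made for illustration: the function name scheduler_command, the Python keyword names, and in particular the placement of the options before the job file name, which this document does not specify.

```python
# Flags as documented above; the keys are hypothetical Python-side names.
OPTION_FLAGS = {
    "time_limit": "--time-limit",
    "Tc": "--Tc",
    "Tmin": "--Tmin",
    "Tmax": "--Tmax",
    "Nmin": "--Nmin",
    "Nmax": "--Nmax",
}


def scheduler_command(program, job_file, num_procs=None, **options):
    """Build the argument list for a serial or MPI run of the program."""
    cmd = [program]
    for key, value in options.items():
        if key not in OPTION_FLAGS:
            raise ValueError(f"unknown scheduler option: {key}")
        cmd += [OPTION_FLAGS[key], str(value)]
    # Assumption: options are placed before the job file name.
    cmd.append(job_file)
    if num_procs is not None:
        # Parallel run: prefix with mpirun as in the example above.
        cmd = ["mpirun", "-np", str(num_procs)] + cmd
    return cmd


parallel = scheduler_command("main", "job.in.xml", num_procs=4, Tc=600, Nmax=2)
serial = scheduler_command("main", "job.in.xml", time_limit=3600)
print(" ".join(parallel))
print(" ".join(serial))
```

Rejecting unknown keywords early keeps a misspelled option from being silently dropped.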

Extracting full output from the checkpoints

At the moment, the simulation output files (in this example called job.task1.xml, job.task2.xml and job.task3.xml) only contain the collected measurements from all runs. Details about the individual Monte Carlo runs for each simulation can be obtained by converting the checkpoint files to XML, again using the convert2xml tool, e.g.:
convert2xml job.task1.run1
This will produce a file job.task1.run1.xml, containing information extracted from this Monte Carlo run.

Compacting the simulation output

Once you are certain that you do not want to continue running a simulation, all the simulation-specific information about the configuration of the Monte Carlo simulation can be removed from the checkpoint files, whose names end with .run followed by a number. This is done using the compactrun tool:
compactrun job.task1.run1
Optionally, the compacted file can be given a new name:
compactrun job.task1.run1 compacted.task1.run1
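When many runs need compacting, the checkpoint files can be selected by their name pattern (.run followed by a number). The following Python sketch builds one compactrun command per matching file. The function name compactrun_commands is hypothetical, and the sketch only builds argument lists without executing anything.

```python
import re

# Checkpoint files of Monte Carlo runs end with ".run" followed by a number.
RUN_FILE = re.compile(r"\.run\d+$")


def compactrun_commands(filenames):
    """Return one ['compactrun', name] argument list per run checkpoint file."""
    return [["compactrun", name] for name in filenames if RUN_FILE.search(name)]


files = [
    "job.task1.run1",
    "job.task1.run1.xml",  # converted XML output, not a checkpoint
    "job.task1.xml",
    "job.task2.run12",
]
cmds = compactrun_commands(files)
for cmd in cmds:
    print(" ".join(cmd))
```

In practice the file list could come from os.listdir or glob.glob in the job directory.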

Converting legacy checkpoints

The old ALEA library (versions 1.0 - 1.5) and preliminary versions of this library used binary checkpoint files not only for the Monte Carlo runs, but also for simulations and for checkpoints of the scheduler. convert2xml can be used to read the checkpoint files of ALEA version 1.5 and later and convert them to XML.

copyright (c) 1994-2010 by Matthias Troyer

Distributed under the Boost Software License, Version 1.0. (See http://www.boost.org/LICENSE_1_0.txt)