Tzar overview: background and architecture

Background

Tzar is designed to facilitate reproducible, distributable scientific computing. It lets scientists develop computational models in their own choice of programming language with their existing tools, and then runs those models across distributed parallel computing resources in a way that makes it easy to track exactly which code and data were used for any given run of the model.

Tzar makes a few assumptions about the actual code to be executed. These are:

  • The model must be "embarrassingly parallelisable". This means that there is no interdependency between the tasks. Tzar does not currently provide services for models which require separate tasks to communicate with each other. There is also currently no support for data pipelines, that is, sets of tasks whose output is intended to be the input of another set of tasks. Such pipelines can be run using Tzar, but it doesn't provide any tools to make this easier.
  • Tzar can run arbitrary code, but in order to integrate with a particular language, a "Tzar runner" must be built. This is a small piece of Java code which plugs into Tzar and enables it to call out to the required language or framework. Tzar currently ships with Runners for Jython, Python, Java, and R. Writing a Runner is a relatively straightforward task for someone proficient in Java.
  • The code to be executed needs to be in a Subversion repository or on local disk. The Tzar framework can be easily extended to support different types of repository (eg GitHub, Mercurial). This again requires writing a fairly simple Java class to support the new type of repository.
  • If the model code to be executed has external dependencies (eg libraries, or external executables), these need to be either deployed to all nodes that are to run the model, or included in the source tree.
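
To make the first assumption concrete, the following is a minimal Python sketch of what "no interdependency between the tasks" means in practice. It is not Tzar's API: the function names (run_model, run_all) and the parameter names (growth_rate, years) are hypothetical and exist only for illustration. Each task reads only its own parameters and never communicates with another task, so the tasks can be handed to any number of workers in any order without changing the per-run results.

```python
from concurrent.futures import ThreadPoolExecutor


def run_model(params):
    """Hypothetical stand-in for one model run.

    The result depends only on this run's own parameters, never on
    the state or output of any other run.
    """
    result = params["growth_rate"] * params["years"]
    return {"run_id": params["run_id"], "result": result}


def run_all(param_sets, workers=4):
    """Run every parameter set independently on a pool of workers.

    Because the tasks share nothing, the pool is free to schedule
    them in any order; map() returns results in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_model, param_sets))


# Four independent runs with made-up parameter values.
param_sets = [
    {"run_id": i, "growth_rate": 0.5 * i, "years": 10} for i in range(4)
]
results = run_all(param_sets)
```

A model whose tasks needed to exchange intermediate values, or whose outputs fed a later stage, would not fit this pattern and would fall outside what Tzar currently assists with.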

How is Tzar useful?

Tzar is useful for researchers who have modelling software that they are developing or using which needs to be:

  • Run in a reproducible way. Often when researchers write modelling software, there are a lot of moving parts. The source code for the model may be changing, as well as the underlying data. There may also be large numbers of runs being executed, and keeping track of which code and which data were used for which runs can be a daunting prospect. Doing this in such a way that other researchers can reproduce the results can be even more difficult. Tzar makes a given set of results much more easily reproducible, simply by referencing the name of the runset and the database containing the results.
  • Run many different times with varying parameter values. Certain types of modelling require large numbers of code runs with parameters that vary between runs, potentially with some random parameters, or parameters which vary linearly (or through some other distribution) across runs. Tzar makes it easy to specify parameter sets, and different means for the parameters to vary across runs.
  • Easily scalable from a single machine up to large dedicated clusters. Tzar's architecture means that it's almost as simple to run Tzar on a cluster as it is to run it on a single machine, and no code changes are required to make this work. Users can develop and test their models on their local machine, then run the model across a handful of machines in the lab, then deploy it to large clusters of thousands of machines, such as those offered by NeCTAR or Amazon Web Services.
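
The first two points above can be illustrated with a small Python sketch. It does not use Tzar's own parameter format or API; the function names (linear_values, build_runset), the record fields, and the example values ("demo_runset", "r1234") are all hypothetical. The sketch builds a set of runs in which one parameter varies linearly across runs and another is drawn at random from a seeded generator, and stores the code revision and full parameter set alongside each run so that any run could be repeated later.

```python
import random


def linear_values(start, stop, n):
    """Return n evenly spaced values from start to stop inclusive."""
    if n == 1:
        return [start]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def build_runset(name, code_revision, base_params, n_runs, seed):
    """Build one record per run (hypothetical structure, not Tzar's).

    Each record carries everything needed to repeat the run: the
    runset name, the code revision, and the exact parameter values.
    """
    rng = random.Random(seed)  # seeded, so the random draws repeat
    rates = linear_values(0.0, 1.0, n_runs)
    runs = []
    for run_id in range(n_runs):
        params = dict(base_params)             # fixed parameters
        params["growth_rate"] = rates[run_id]  # varies linearly
        params["noise"] = rng.uniform(0.0, 0.1)  # varies randomly
        runs.append({
            "runset": name,
            "run_id": run_id,
            "code_revision": code_revision,
            "parameters": params,
        })
    return runs


# Made-up runset name, revision label, and base parameters.
runset = build_runset("demo_runset", "r1234", {"years": 10},
                      n_runs=5, seed=42)
```

Calling build_runset again with the same arguments yields identical records, which is the property that lets a set of results be reproduced from the runset name and the stored run details.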

See Getting Started for a description of how to get started using Tzar.

Architecture