The ruffus module is a lightweight way to add support for running computational pipelines.
Each stage or task in a computational pipeline is represented by a Python function. Each Python function can be called in parallel to run multiple jobs.
Decorator Examples
@follows
- Indicate task dependency
- mkdir prerequisite shorthand
@follows ( task1, 'task2' )
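As a rough illustration of what a dependency declaration implies, the sketch below orders tasks so that each one runs only after the tasks it follows. It uses Python's standard graphlib module (Python 3.9+) rather than ruffus itself, and the task names and the `dependencies` mapping are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency table: each task maps to the set of tasks it follows.
dependencies = {
    "task1": set(),
    "task2": {"task1"},
    "task3": {"task1", "task2"},
}

def run_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

order = run_order(dependencies)
```

Here `order` lists task1 first, because both other tasks depend on it. How ruffus resolves its own dependency graph internally is not described in this section.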
@parallel
- Parameters for parallel jobs
@parallel ( parameter_list )
@parallel ( parameter_generating_function )
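The idea of one job per parameter set can be sketched in plain Python. This is only an assumption-laden illustration: a thread pool stands in for whatever scheduling ruffus performs, and `add_task`, `run_jobs` and the generated values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def parameter_generating_function():
    """Yield one parameter tuple per job (hypothetical values)."""
    for i in range(4):
        yield ("job%d" % i, i, i + 1)

def add_task(name, a, b):
    """One job: receives the parameters of a single tuple."""
    return name, a + b

def run_jobs(task, parameter_sets, max_workers=2):
    """Call task(*params) once per parameter set, several at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task, *params) for params in parameter_sets]
        # Collect results in submission order, not completion order.
        return [future.result() for future in futures]

results = run_jobs(add_task, parameter_generating_function())
```

A parameter list and a parameter-generating function are interchangeable here, since both are simply iterated to obtain the arguments for each job.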
@files
- I/O parameters
- skips up-to-date jobs
@files( parameter_list )
@files( parameter_generating_function )
Simplified syntax for tasks with a single job:
@files ( input_file, output_file, other_params, ... )
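Skipping up-to-date jobs can be pictured as a timestamp comparison between input and output files. The sketch below assumes that rule (output missing or older than input means the job must run); the exact rule ruffus applies is not stated here, and `needs_update` is a hypothetical helper.

```python
import os
import tempfile

def needs_update(input_file, output_file):
    """Return True if output_file is missing or older than input_file."""
    if not os.path.exists(output_file):
        return True
    return os.path.getmtime(output_file) < os.path.getmtime(input_file)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "a.input")
    dst = os.path.join(tmp, "a.output")

    open(src, "w").close()
    os.utime(src, (1000, 1000))        # fix the input's modification time
    missing = needs_update(src, dst)   # output does not exist yet

    open(dst, "w").close()
    os.utime(dst, (2000, 2000))        # output is newer than the input
    fresh = needs_update(src, dst)

    os.utime(src, (3000, 3000))        # input modified after the output
    stale = needs_update(src, dst)
```

Explicit timestamps are set with `os.utime` so the example does not depend on how quickly the files are written.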
@files_re
- I/O file names via regular expressions
- start from lists of file names or glob results
- skips up-to-date jobs
@files_re ( glob_str, matching_regex, output_pattern, ... )
@files_re ( file_names, matching_regex, input_pattern, output_pattern, ... )
input_pattern and output_pattern are regex patterns used to create input and output file names from the starting list, which comes from either glob_str or file_names.
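To make the role of the patterns concrete, here is a minimal sketch using Python's `re` module directly. It assumes the patterns behave like `re.sub` replacement strings with back-references; `make_io_names` and the example file names are hypothetical and do not reproduce ruffus's implementation.

```python
import re

def make_io_names(file_names, matching_regex, input_pattern, output_pattern):
    """Build (input, output) name pairs for every file name that matches."""
    regex = re.compile(matching_regex)
    jobs = []
    for name in file_names:
        if regex.search(name) is None:
            continue  # non-matching names produce no job
        jobs.append((regex.sub(input_pattern, name),
                     regex.sub(output_pattern, name)))
    return jobs

jobs = make_io_names(
    ["a.fasta", "b.fasta", "notes.txt"],
    r"(.+)\.fasta$",   # matching_regex: capture the base name
    r"\1.fasta",       # input_pattern: keep the original name
    r"\1.blast",       # output_pattern: swap the extension
)
```

Each matching file yields one input/output pair, and names that do not match the regular expression are dropped.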
@check_if_uptodate
- Checks if task needs to be run
@check_if_uptodate ( is_task_up_to_date_function )
@posttask
- Calls function after task completes
- touch_file shorthand
@posttask ( signal_task_completion_function )
@posttask ( touch_file( 'task1.completed' ))
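The behaviour described above, a function called after the task completes with a shorthand for touching a flag file, can be sketched with an ordinary Python decorator. `run_after` and `touch` are hypothetical stand-ins written for this illustration, not the ruffus implementations.

```python
import functools
import os
import tempfile

def run_after(*callbacks):
    """Decorator: call each callback once the wrapped task returns normally."""
    def decorate(task):
        @functools.wraps(task)
        def wrapper(*args, **kwargs):
            result = task(*args, **kwargs)
            for callback in callbacks:
                callback()
            return result
        return wrapper
    return decorate

def touch(path):
    """Return a callback that creates (or updates) an empty flag file."""
    def _touch():
        with open(path, "a"):
            os.utime(path, None)
    return _touch

# Flag file in a temporary directory so the example leaves the cwd untouched.
flag = os.path.join(tempfile.mkdtemp(), "task1.completed")

@run_after(touch(flag))
def task1():
    return "done"

existed_before = os.path.exists(flag)
result = task1()
existed_after = os.path.exists(flag)
```

The flag file appears only after `task1` has returned, so its presence can serve as a completion signal.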
For a graphical flowchart in jpg, svg, dot, png, ps, gif formats:
pipeline_printout_graph ( open("flowchart.svg", "w"), "svg", list_of_target_tasks)
This requires dot to be installed.
For a text printout of all jobs:
pipeline_printout(sys.stdout, list_of_target_tasks)
To run the pipeline:
pipeline_run(list_of_target_tasks, [list_of_tasks_forced_to_rerun, multiprocess = N_PARALLEL_JOBS])
See the Full Tutorial for a more complete introduction to adding ruffus support to your own code.