GC3Libs is a Python package for controlling the life-cycle of a Grid or batch computational job.
GC3Libs provides services for submitting computational jobs to Grids and batch systems, controlling their execution, persisting job information, and retrieving the final output.
GC3Libs takes an application-oriented approach to batch computing. A generic Application class provides the basic operations for controlling remote computations, but different Application subclasses can expose adapted interfaces, focusing on the most relevant aspects of the application being represented.
Support for running a generic application with GC3Libs. The following parameters are required to create an Application instance:
Files that will be copied to the remote execution node before execution starts.
There are two possible ways of specifying the inputs parameter:
It can be a Python dictionary: keys are local file paths or URLs, values are remote file names.
It can be a Python list: each item in the list should be a pair (source, remote_file_name): the source can be a local file or a URL; remote_file_name is the path (relative to the execution directory) where source will be downloaded. If remote_file_name is an absolute path, an InvalidArgument error is raised.
A single string file_name is allowed instead of the pair and results in the local file file_name being copied to file_name on the remote host.
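The accepted forms of the inputs parameter can all be reduced to a single mapping. The following is a minimal sketch of that normalization, not the GC3Libs implementation: the helper name `normalize_inputs` is hypothetical, `ValueError` stands in for the InvalidArgument error, and remote names are assumed to be POSIX-style paths.

```python
import posixpath


def normalize_inputs(inputs):
    """Return a dict mapping each source (local path or URL) to a remote name.

    Hypothetical helper, for illustration only.
    """
    if isinstance(inputs, dict):
        pairs = list(inputs.items())
    else:
        pairs = []
        for item in inputs:
            if isinstance(item, str):
                # A bare string stands for the pair (file_name, file_name).
                pairs.append((item, item))
            else:
                source, remote_name = item
                pairs.append((source, remote_name))
    for _source, remote_name in pairs:
        # Remote names must be relative to the execution directory.
        if posixpath.isabs(remote_name):
            raise ValueError(
                "remote file name must be relative: %r" % (remote_name,))
    return dict(pairs)
```

With this sketch, `normalize_inputs(["b.txt"])` and `normalize_inputs({"b.txt": "b.txt"})` describe the same transfer.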
Files and directories that will be copied from the remote execution node back to the local computer (or a network-accessible server) after execution has completed. Directories are copied recursively.
There are three possible ways of specifying the outputs parameter:
It can be a Python dictionary: keys are remote file or directory paths (relative to the execution directory), values are corresponding local names.
It can be a Python list: each item in the list should be a pair (remote_file_name, destination): the destination can be a local file or a URL; remote_file_name is the path (relative to the execution directory) that will be uploaded to destination. If remote_file_name is an absolute path, an InvalidArgument error is raised.
A single string file_name is allowed instead of the pair and results in the remote file file_name being copied to file_name on the local host.
The constant gc3libs.ANY_OUTPUT which instructs GC3Libs to copy every file in the remote execution directory back to the local output path (as specified by the output_dir attribute).
Note that no error is raised if an output file is not present. Override the terminated() method to raise errors in reaction to this kind of failure.
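As an illustration of such an override, the sketch below checks for expected output files once the job has terminated. The class is a plain-Python stand-in, not the GC3Libs Application class; only the terminated() hook name and the output_dir attribute come from this documentation, and raising `RuntimeError` is an arbitrary choice for the example.

```python
import os


class OutputCheckingTask:
    """Stand-in for an Application subclass (NOT the real GC3Libs class)."""

    def __init__(self, outputs, output_dir):
        self.outputs = list(outputs)      # expected file names
        self.output_dir = output_dir      # where output was collected

    def terminated(self):
        # Collect every expected output file that did not arrive.
        missing = [
            name for name in self.outputs
            if not os.path.exists(os.path.join(self.output_dir, name))
        ]
        if missing:
            raise RuntimeError("missing output files: " + ", ".join(missing))
```

In real code the same check would live in a subclass of the Application class, using whatever error type fits the application.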
The following optional parameters may be additionally specified as keyword arguments and will be given special treatment by the Application class logic:
Any other keyword arguments will be set as instance attributes, but otherwise ignored by the Application constructor.
After successful construction, an Application object is guaranteed to have the following instance attributes:
A Run instance; its state attribute is initially set to NEW. (This attribute is actually inherited from the Task class.)
A name for applications of this class.
This string is used as a prefix for configuration items related to this application in configured resources. For example, if the application_name is foo, then the application interface code in GC3Pie might search for foo_cmd, foo_extra_args, etc. See qsub_sge() for an actual example.
Get an LSF bsub command-line invocation for submitting an instance of this application. Return a pair (cmd_argv, app_argv), where cmd_argv is a list containing the argv-vector of the command to run to submit an instance of this application to the LSF batch system, and app_argv is the argv-vector to use when invoking the application.
In the construction of the command-line invocation, one should assume that all the input files (as named in Application.inputs) have been copied to the current working directory, and that output files should be created in this same directory.
The default implementation just prefixes any output from the cmdline method with an LSF bsub invocation of the form bsub -cwd . -L /bin/sh + resource limits.
Override this method in application-specific classes to provide appropriate invocation templates and/or add resource-specific submission options.
Return list of command-line arguments for invoking the application.
This is exactly the argv-vector of the application process: the application command name is included as first item (index 0) of the list, further items are command-line arguments.
Hence, to get a UNIX shell command-line, just concatenate the elements of the list, separating them with spaces.
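Joining with plain spaces is only safe when no argument contains whitespace or shell metacharacters. The sketch below contrasts the plain join with a quoted join using the standard library's `shlex.quote`; the argv list is a made-up example, not the output of any real cmdline() call.

```python
import shlex

# Hypothetical argv-vector, as a cmdline() method might return it.
argv = ["grep", "-c", "total energy", "run 1.out"]

# Plain concatenation: argument boundaries are lost when an item has spaces.
naive = " ".join(argv)

# Quoting each item first keeps every argument intact for the shell.
quoted = " ".join(shlex.quote(arg) for arg in argv)

print(naive)
print(quoted)
```

Splitting `quoted` again with `shlex.split` yields the original list, while splitting `naive` does not.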
Return a list of compatible resources.
Calls the corresponding method of the controller.
Invocation of Core.fetch_output() on this object failed; ex is the Exception that describes the error.
If this method returns an exception object, that exception is raised as a result of the Core.fetch_output() call; otherwise, the return value is ignored and Core.fetch_output() returns None.
Default is to return ex unchanged; override in derived classes to change this behavior.
Similar to qsub_sge(), but for the PBS/TORQUE resource manager.
Get an SGE qsub command-line invocation for submitting an instance of this application.
Return a pair (cmd_argv, app_argv). Both cmd_argv and app_argv are argv-lists: the command name is included as first item (index 0) of the list, further items are command-line arguments; cmd_argv is the argv-list for the submission command (excluding the actual application command part); app_argv is the argv-list for invoking the application. By overriding this method, one can add further resource-specific options at the end of the cmd_argv argv-list.
In the construction of the command-line invocation, one should assume that all the input files (as named in Application.inputs) have been copied to the current working directory, and that output files should be created in this same directory.
The default implementation just prefixes any output from the cmdline method with an SGE qsub invocation of the form qsub -cwd -S /bin/sh + resource limits. Note that there is no generic way of requesting a certain number of cores in SGE: it all depends on the installed parallel environment, and these are totally under control of the local sysadmin; therefore, any request for cores is ignored and a warning is logged.
Override this method in application-specific classes to provide appropriate invocation templates and/or add different submission options.
Sort the given resources in order of preference.
By default, less-loaded resources come first; see _cmp_resources.
Get a SLURM sbatch command-line invocation for submitting an instance of this application.
Return a pair (cmd_argv, app_argv). Both cmd_argv and app_argv are argv-lists: the command name is included as first item (index 0) of the list, further items are command-line arguments; cmd_argv is the argv-list for the submission command (excluding the actual application command part); app_argv is the argv-list for invoking the application. By overriding this method, one can add further resource-specific options at the end of the cmd_argv argv-list.
In the construction of the command-line invocation, one should assume that all the input files (as named in Application.inputs) have been copied to the current working directory, and that output files should be created in this same directory.
Override this method in application-specific classes to provide appropriate invocation templates and/or add different submission options.
Invocation of Core.submit() on this object failed; exs is a list of Exception objects, one for each attempted submission.
If this method returns an exception object, that exception is raised as a result of the Core.submit() call; otherwise, the return value is ignored and Core.submit() returns None.
Default is to always return the first exception in the list (on the assumption that it is the root of all exceptions or that at least it refers to the preferred resource). Override in derived classes to change this behavior.
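A derived class might prefer a different exception than the first one. The sketch below shows one possible policy on a stand-in class (not the GC3Libs Application class): skip timeouts if anything more informative is available, and otherwise fall back to the documented default of returning the first exception. The policy itself is purely illustrative.

```python
class PickyApplication:
    """Stand-in class; only the submit_error() hook is modeled here."""

    def submit_error(self, exs):
        # Prefer the first exception that is not a plain timeout ...
        for ex in exs:
            if not isinstance(ex, TimeoutError):
                return ex
        # ... otherwise behave like the documented default.
        return exs[0]
```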
Handle exceptions that occurred during a Core.update_job_state call.
If this method returns an exception object, that exception is processed in Core.update_job_state() instead of the original one. Any other return value is ignored and Core.update_job_state proceeds as if no exception had happened.
Argument ex is the exception that was raised by the backend during job state update.
Default is to return ex unchanged; override in derived classes to change this behavior.
Return a string containing an xRSL sequence, suitable for submitting an instance of this application through ARC’s ngsub command.
The default implementation produces xRSL content based on the construction parameters; you should override this method to produce xRSL tailored to your application.
Warning
ARClib SWIG bindings cannot resolve the overloaded constructor if the xRSL string argument is a Python ‘unicode’ object; if you override this method, force the result to be a ‘str’!
A namespace for all constants and default values used in the GC3Libs package.
Only update the status of ARC resources once every this many seconds.
Consider a submitted job lost if it does not show up in the information system after this duration.
Proxy validity threshold in seconds. If the proxy expires before the threshold, it is marked for renewal.
A specialized dict-like object that keeps information about the execution state of an Application instance.
A Run object is guaranteed to have the following attributes:
- log: A gc3libs.utils.History instance, recording human-readable text messages on events in this job’s history.
- info: A simplified interface for reading/writing messages to Run.log. Reading from the info attribute returns the last message appended to log. Writing into info appends a message to log.
- timestamp: Dictionary, recording the most recent timestamp when a certain state was reached. Timestamps are given as UNIX epochs.
For properties state, signal and returncode, see the respective documentation.
Run objects support attribute lookup by both the [...] and the . syntax; see gc3libs.utils.Struct for examples.
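To illustrate what lookup by both syntaxes means, here is a minimal dict subclass with that behavior. It is a sketch under the assumption that a plain dictionary backs the object; the class name `AttrDict` is hypothetical and this is not the gc3libs.utils.Struct implementation.

```python
class AttrDict(dict):
    """Dictionary whose keys are also readable and writable as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        # Store every attribute assignment as a dictionary item.
        self[name] = value


run = AttrDict()
run.state = "NEW"          # write with the . syntax
run["exitcode"] = 0        # write with the [...] syntax
```

After these two assignments, `run["state"]` and `run.exitcode` both resolve to the stored values.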
Processor architectures, for use as values in the requested_architecture field of the Application class constructor.
The following values are currently defined:
The “exit code” part of a Run.returncode, see os.WEXITSTATUS. This is an 8-bit integer, whose meaning is entirely application-specific. (However, the value 255 is often used to mean that an error has occurred and the application could not end its execution normally.)
Return True if the Run state matches any of the given names.
In addition to the states from Run.State, the two additional names ok and failed are also accepted, with the following meaning:
A simplified interface for reading/writing entries into history.
Setting the info attribute appends a message to the log:
>>> j1 = Run()
>>> j1.info = 'a message'
>>> j1.info = 'a second message'
Getting the value of the info attribute returns the last message entered in the log:
>>> j1.info
u'a second message ...'
The returncode attribute of this job object encodes the Run termination status in a manner compatible with the POSIX termination status as implemented by os.WIFSIGNALED and os.WIFEXITED.
However, in contrast with POSIX usage, the exitcode and the signal parts can both be significant, for example when a Grid middleware error happens after the application has successfully completed its execution. In other words, os.WEXITSTATUS(returncode) is meaningful iff os.WTERMSIG(returncode) is 0 or one of the pseudo-signals listed in Run.Signals.
Run.exitcode and Run.signal are combined to form the 16-bit return code integer as follows (the convention appears to be obeyed on every known system):
Bit | Encodes... |
---|---|
0..6 | signal number |
7 | 1 if program dumped core |
8..15 | exit code |
Note: the “core dump bit” is always 0 here.
Setting the returncode property sets exitcode and signal; you can either assign a (signal, exitcode) pair to returncode, or set returncode to an integer from which the correct exitcode and signal attribute values are extracted:
>>> j = Run()
>>> j.returncode = (42, 56)
>>> j.signal
42
>>> j.exitcode
56
>>> j.returncode = 137
>>> j.signal
9
>>> j.exitcode
0
See also Run.exitcode and Run.signal.
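The packing and unpacking can be sketched with plain bit operations. This assumes the signal number occupies the low 7 bits (matching its 0..127 range), the core-dump bit is left at 0, and the exit code occupies the next-higher 8 bits; the function names are hypothetical and this is not the Run.returncode property itself.

```python
def encode_returncode(signal, exitcode):
    """Pack a (signal, exitcode) pair into one integer; core-dump bit stays 0."""
    return ((exitcode & 0xFF) << 8) | (signal & 0x7F)


def decode_returncode(returncode):
    """Split an integer return code into a (signal, exitcode) pair."""
    signal = returncode & 0x7F           # low 7 bits
    exitcode = (returncode >> 8) & 0xFF  # 8 bits above the core-dump bit
    return signal, exitcode
```

Under these assumptions, decoding 137 gives signal 9 and exit code 0, which agrees with the doctest above.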
Convert a shell exit code to a POSIX process return code.
A POSIX shell represents the return code of the last-run program within its exit code as follows:
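In shell terms, and assuming the convention used by bash and most POSIX shells, a program killed by signal K is reported as exit code 128+K, while a program that exited normally is reported with its own exit code. The sketch below applies that assumption and packs the result as a wait-style status (signal in the low bits, exit code shifted left by 8); the function name is hypothetical and the exact GC3Libs behavior may differ.

```python
def shell_exit_to_returncode(rc):
    """Convert a shell exit code to a wait-style return code (illustrative)."""
    if rc > 128:
        # Terminated by a signal: keep only the signal number.
        return (rc - 128) & 0x7F
    # Normal termination: exit code goes into the high byte, signal is 0.
    return (rc & 0xFF) << 8
```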
The “signal number” part of a Run.returncode, see os.WTERMSIG for details.
The “signal number” is a 7-bit integer value in the range 0..127; value 0 is used to mean that no signal has been received during the application runtime (i.e., the application terminated by calling exit()).
The value represents either a real UNIX system signal, or a “fake” one that GC3Libs uses to represent Grid middleware errors (see Run.Signals).
The state a Run is in.
The value of Run.state must always be a value from the Run.State enumeration, i.e., one of the following values.
Run.State value | purpose | can change to |
---|---|---|
NEW | Job has not yet been submitted/started (i.e., gsub not called) | SUBMITTED (by gsub) |
SUBMITTED | Job has been sent to execution resource | RUNNING, STOPPED |
STOPPED | Trap state: job needs manual intervention (either user- or sysadmin-level) to resume normal execution | TERMINATING (by gkill), SUBMITTED (by miracle) |
RUNNING | Job is executing on remote resource | TERMINATING |
TERMINATING | Job has finished execution on remote resource; output not yet retrieved | TERMINATED |
TERMINATED | Job execution is finished (correctly or not) and will not be resumed; output has been retrieved | None: final state |
When a Run object is first created, it is assigned the state NEW. After a successful invocation of Core.submit(), it is transitioned to state SUBMITTED. Further transitions to the RUNNING, STOPPED or TERMINATED states happen completely independently of the creator program; the Core.update_job_state() call provides updates on the status of a job.
The STOPPED state is a kind of generic “run time error” state: a job can get into the STOPPED state if its execution is stopped (e.g., a SIGSTOP is sent to the remote process) or delayed indefinitely (e.g., the remote batch system puts the job “on hold”). There is no way a job can get out of the STOPPED state automatically: all transitions from the STOPPED state require manual intervention, either by the submitting user (e.g., cancel the job), or by the remote systems administrator (e.g., by releasing the hold).
The TERMINATED state is the final state of a job: once a job reaches it, it cannot get back to any other state. Jobs reach TERMINATED state regardless of their exit code, or even if a system failure occurred during remote execution; actually, jobs can reach the TERMINATED status even if they didn’t run at all, for example, in case of a fatal failure during the submission step.
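The "can change to" column of the state table can be written down as a lookup structure. The sketch below encodes exactly the transitions listed in that table as plain strings; the names `TRANSITIONS` and `can_change` are hypothetical, and states mentioned elsewhere but absent from the table (such as UNKNOWN) are deliberately left out.

```python
# Allowed transitions, copied from the state table (illustrative only).
TRANSITIONS = {
    "NEW": {"SUBMITTED"},
    "SUBMITTED": {"RUNNING", "STOPPED"},
    "STOPPED": {"TERMINATING", "SUBMITTED"},
    "RUNNING": {"TERMINATING"},
    "TERMINATING": {"TERMINATED"},
    "TERMINATED": set(),   # final state
}


def can_change(current, new):
    """Return True if the table lists `new` as reachable from `current`."""
    return new in TRANSITIONS[current]
```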
Mix-in class implementing a facade for job control.
A Task can be described as an “active” job, in the sense that all job control is done through methods on the Task instance itself; contrast this with operating on Application objects through a Core or Engine instance.
The following pseudo-code is an example of the usage of the Task interface for controlling a job. Assume that GamessApplication is inheriting from Task (it actually is):
t = GamessApplication(input_file)
t.submit()
# ... do other stuff
t.update_state()
# ... take decisions based on t.execution.state
t.wait() # blocks until task is terminated
Each Task object has an execution attribute: it is initialized with a new instance of class Run, and at any given time it reflects the current status of the associated remote job. In particular, execution.state can be checked for the current task status.
After successful initialization, a Task instance will have the following attributes:
Use the given Grid interface for operations on the job associated with this task.
Remove any reference to the current grid interface. After this, calling any method other than attach() results in an exception TaskDetachedFromGridError being thrown.
Retrieve the outputs of the computational job associated with this task into directory output_dir, or, if that is None, into the directory whose path is stored in instance attribute .output_dir.
If the execution state is TERMINATING, transition the state to TERMINATED (which runs the appropriate hook).
See gc3libs.Core.fetch_output() for a full explanation.
Returns: | Path to the directory where the job output has been collected. |
---|
Release any remote resources associated with this task.
See gc3libs.Core.free() for a full explanation.
Terminate the computational job associated with this task.
See gc3libs.Core.kill() for a full explanation.
Called when the job state is (re)set to NEW.
Note that this is not called when the application object is created, but only if the state is reset to NEW after the job has already been submitted.
The default implementation does nothing; override in derived classes to implement additional behavior.
Download size bytes (at offset offset from the start) from the associated job standard output or error stream, and write them into a local file. Return a file-like object from which the downloaded contents can be read.
See gc3libs.Core.peek() for a full explanation.
Advance the associated job through all states of a regular lifecycle. In detail:
1. If execution.state is NEW, the associated job is started.
2. The state is updated until it reaches TERMINATED.
3. Output is collected and the final returncode is returned.
An exception TaskError is raised if the job hits state STOPPED or UNKNOWN during an update in phase 2.
When the job reaches TERMINATING state, the output is retrieved; if this operation is successful, state is advanced to TERMINATED.
Once the job reaches TERMINATED state, the return code (stored also in .returncode) is returned; if the job is not yet in TERMINATED state, calling progress returns None.
Raises : | exception UnexpectedStateError if the associated job goes into state STOPPED or UNKNOWN |
---|---|
Returns: | final returncode, or None if the execution state is not TERMINATED. |
Called when the job state transitions to RUNNING, i.e., the job has been successfully started on a (possibly) remote resource.
The default implementation does nothing; override in derived classes to implement additional behavior.
Called when the job state transitions to STOPPED, i.e., the job has been remotely suspended for an unknown reason and cannot automatically resume execution.
The default implementation does nothing; override in derived classes to implement additional behavior.
Start the computational job associated with this Task instance.
Called when the job state transitions to SUBMITTED, i.e., the job has been successfully sent to a (possibly) remote execution resource and is now waiting to be scheduled.
The default implementation does nothing; override in derived classes to implement additional behavior.
Called when the job state transitions to TERMINATED, i.e., the job has finished execution (with whatever exit status, see returncode) and the final output has been retrieved.
The location where the final output has been stored is available in attribute self.output_dir.
The default implementation does nothing; override in derived classes to implement additional behavior.
Called when the job state transitions to TERMINATING, i.e., the remote job has finished execution (with whatever exit status, see returncode) but output has not yet been retrieved.
The default implementation does nothing; override in derived classes to implement additional behavior.
Called when the job state transitions to UNKNOWN, i.e., the job has not been updated for a certain period of time and has therefore been placed in UNKNOWN state.
There are two ways of leaving this state: 1) at the next update cycle, the job status is updated from the remote server; 2) this method is overridden with Application-specific logic to deal with this case.
The default implementation does nothing; override in derived classes to implement additional behavior.
In-place update of the execution state of the computational job associated with this Task. After successful completion, .execution.state will contain the new state.
After the job has reached the TERMINATING state, the following attributes are also set:
The execution backend may set additional attributes; the exact name and format of these additional attributes is backend-specific. However, you can easily identify the backend-specific attributes because their name is prefixed with the (lowercased) backend name; for instance, the PbsLrms backend sets attributes pbs_queue, pbs_end_time, etc.
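Because of this naming rule, backend-specific attributes can be picked out by prefix. The sketch below does so for any dict-like object; the helper name `backend_attributes` is hypothetical, and the sample values are made up (only the pbs_queue and pbs_end_time names come from the text above).

```python
def backend_attributes(execution, backend_name):
    """Return the items whose key starts with the lowercased backend name."""
    prefix = backend_name.lower() + "_"
    return {
        key: value
        for key, value in execution.items()
        if key.startswith(prefix)
    }


# Made-up sample data standing in for a dict-like Run object.
execution = {"state": "TERMINATING", "pbs_queue": "short", "pbs_end_time": "12:00"}
pbs_only = backend_attributes(execution, "PBS")
```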
Block until the associated job has reached TERMINATED state, then return the job’s return code. Note that this does not automatically fetch the output.
Parameters: | interval (integer) – Poll the job state once every this many seconds |
---|
Configure the gc3.gc3libs logger.
Arguments level, format and datefmt set the corresponding arguments in the logging.basicConfig() call.
If a user configuration file exists in file NAME.log.conf in the Default.RCDIR directory (usually ~/.gc3), it is read and used for more advanced configuration; if it does not exist, then a sample one is created.
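For reference, the level, format and datefmt arguments have the same meaning as in the standard library's `logging.basicConfig()`. The sketch below calls the standard library directly rather than the GC3Libs function; the format strings are arbitrary example values.

```python
import logging

# Same three arguments that are forwarded to logging.basicConfig().
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

# Obtain the logger named in the text above and emit a test message.
log = logging.getLogger("gc3.gc3libs")
log.debug("logger configured")
```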
Create and return a gc3libs.core.Engine instance.
It accepts an optional list of configuration filenames. If a filename contains a ~ or a variable name, it is expanded automatically.
When called without arguments, it searches for a configuration file in ~/.gc3/gc3pie.conf (and uses it if found).
Any extra keyword argument is passed unchanged to the Configuration constructor.
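The filename handling described above (expansion of ~ and of variable names, plus the default location) can be sketched with the standard library alone. The helper name `expand_config_paths` is hypothetical, the use of `os.path.expanduser` and `os.path.expandvars` is an assumption about how such expansion is typically done, and the sketch does not check whether the files exist.

```python
import os


def expand_config_paths(filenames=None):
    """Expand ~ and $VARIABLE references in configuration file names."""
    if not filenames:
        # Default location named in the documentation.
        filenames = ["~/.gc3/gc3pie.conf"]
    return [
        os.path.expandvars(os.path.expanduser(name))
        for name in filenames
    ]
```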