=========
s3
=========

.. contents::

Overview
========

s3 is a connector to S3, Amazon's Simple Storage Service REST API.
Use it to upload, download, delete, and copy files in S3, test them for
existence, or update their metadata.

S3 files may have metadata in addition to their content. Metadata is a set
of key/value pairs. Metadata may be set when the file is uploaded, or it
can be updated later.

S3 files are stored in S3 buckets. Buckets can be created, listed,
configured, and deleted. The bucket configuration can be read and the
bucket contents can be listed.

In addition to the s3 Python module, this package contains a command line
tool also named s3. The tool imports the module and offers a command line
interface to some of the module's capability.

Installation
============

From PyPI::

    $ pip install s3

From source::

    $ hg clone ssh://hg@bitbucket.org/prometheus/s3
    $ pip install -e s3

The installation is successful if you can import s3 and run the command
line tool. The following commands must produce no errors::

    $ python -c 'import s3'
    $ s3 --help

API to remote storage
=====================

S3 Buckets
----------

Buckets store files. Buckets may be created and deleted. They may be
listed, configured, and loaded with files. The configuration can be read,
and the files in the bucket can be listed.

Bucket names must be unique across S3, so it is best to use a unique
prefix on all bucket names. S3 forbids underscores in bucket names, and
although it allows periods, these confound DNS and should be avoided. For
example, at Prometheus Research we prefix all of our bucket names with
**com-prometheus-**.

All the bucket configuration options work the same way: the caller
provides XML or JSON data, and perhaps headers or params as well. s3
accepts a python object for the data argument instead of a string; the
object is converted to XML or JSON as required. Likewise, s3 returns a
python dict instead of the XML or JSON string returned by S3. However,
that string is readily available if need be, because the response
returned by requests.request() is exposed to the caller.

S3 Filenames
------------

An S3 file name consists of a bucket and a key. This pair of strings
uniquely identifies the file within S3.

The S3Name class is instantiated with a key and a bucket; the key is
required and the bucket defaults to None.

The Storage class methods take a **remote_name** argument which can be
either a string, which is the key, or an instance of the S3Name class.
When no bucket is given (or the bucket is None), the default_bucket
established when the connection was instantiated is used. If no bucket is
given (or the bucket is None) and there is no default bucket, a
ValueError is raised. In other words, the S3Name class provides a means
of using a bucket other than the default_bucket.

S3 Directories
--------------

Although S3 storage is flat (buckets simply contain keys), S3 lets you
impose a directory tree structure on your bucket by using a delimiter in
your keys. For example, if you name a key 'a/b/f' and use '/' as the
delimiter, then S3 considers 'a' a directory, 'b' a sub-directory of
'a', and 'f' a file in 'b'.
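As a minimal sketch (``storage`` is assumed to be a Storage instance
configured as shown under Usage below; the bucket and file names are
hypothetical), the following writes a local file under a '/'-delimited
key, first to the default bucket and then to another bucket via
S3Name::

    import s3

    # the key 'reports/2016/jan.csv' makes 'reports' and 'reports/2016'
    # behave like directories when '/' is used as the delimiter
    storage.write('jan.csv', 'reports/2016/jan.csv')

    # same key, but stored in an explicitly named bucket
    other = s3.S3Name('reports/2016/jan.csv',
                      bucket='com-prometheus-other-bucket')
    storage.write('jan.csv', other)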
Headers and Metadata
--------------------

Additional HTTP headers may be sent using the methods which write data.
These methods accept an optional **headers** argument, a python dict. The
headers control various aspects of how the file may be handled. S3
supports a variety of headers; they are not discussed here. See Amazon's
S3 documentation for more information on S3 headers.

Headers whose keys begin with the special prefix **x-amz-meta-** are
considered metadata headers and are used to set the metadata attributes
of the file. The methods which read files also return the metadata, which
consists of only those response headers that begin with **x-amz-meta-**.

Python classes for S3 data
--------------------------

To facilitate the transfer of data between S3 and applications, various
classes are defined which correspond to data returned by S3. All
attributes of these classes are strings.

* S3Bucket

  * creation_date
  * name

* S3Key

  * e_tag
  * key
  * last_modified
  * owner
  * size
  * storage_class

* S3Owner

  * display_name
  * id

XML strings and Python objects
------------------------------

An XML string consists of a series of nested tags. An XML tag can be
represented in python as an entry in a dict. (An OrderedDict from the
collections module should be used when the order of the keys is
important.) The opening tag (everything between the '<' and the '>') is
the key, and everything between the opening tag and the closing tag is
the value of the key.

Since every value must be enclosed in a tag, not every python object can
represent XML in this way. In particular, lists may only contain dicts
which have a single key.

For example this XML::

    <a xmlns="foo">
        <b1>
            <c1>1</c1>
        </b1>
        <b2>
            <c2>2</c2>
        </b2>
    </a>

is equivalent to this object::

    {'a xmlns="foo"': [{'b1': {'c1': 1}},
                       {'b2': {'c2': 2}}]}
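As a sketch of this convention in use (bucket name is hypothetical, and
the nested-dict layout follows Amazon's VersioningConfiguration document
in the same style as the bucket_create default shown under Storage
Methods below), a versioning configuration could be written and read
back like this::

    # pass a python dict instead of raw XML; attributes such as xmlns
    # go inside the key string, as in the example object above
    storage.bucket_set_versioning(
        'com-prometheus-my-bucket',
        data={'VersioningConfiguration'
              ' xmlns="http://s3.amazonaws.com/doc/2006-03-01/"': {
                  'Status': 'Enabled'}})

    # the corresponding get method returns a python dict, not XML
    config = storage.bucket_get_versioning('com-prometheus-my-bucket')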
Storage Methods
---------------

The arguments **remote_source**, **remote_destination**, and
**remote_name** may be either a string or an S3Name instance.

**local_name** is a string and is the name of the file on the local
system. This string is passed directly to open().

**bucket** is a string and is the name of the bucket.

**headers** is a python dict used to encode additional request headers.

**params** is either a python dict used to encode the request
parameters, or a string containing all the text of the url query string
after the '?'.

**data** is a string or an object and is the body of the message. An
object will be converted to an XML or JSON string as appropriate.

All methods return on success or raise StorageError on failure. Upon
return, **storage.response** contains the raw response object which was
returned by the requests module. So, for example,
storage.response.headers contains the response headers returned by S3.
See http://docs.python-requests.org/en/latest/api/ for a description of
the response object.

See http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketOps.html
for a description of the available bucket operations and their
arguments.

**storage.bucket_create(bucket, headers={}, data=None)**
    Create a bucket named **bucket**. **headers** may be used to set
    either ACL or explicit access permissions. **data** may be used to
    override the default region. If data is None, data is set as
    follows::

        data = {
            'CreateBucketConfiguration'
            ' xmlns="http://s3.amazonaws.com/doc/2006-03-01/"': {
                'LocationConstraint': self.connection.region}}

**storage.bucket_delete(bucket)**
    Delete a bucket named **bucket**.

**storage.bucket_delete_cors(bucket)**
    Delete cors configuration of bucket named **bucket**.

**storage.bucket_delete_lifecycle(bucket)**
    Delete lifecycle configuration of bucket named **bucket**.

**storage.bucket_delete_policy(bucket)**
    Delete policy of bucket named **bucket**.

**storage.bucket_delete_tagging(bucket)**
    Delete tagging configuration of bucket named **bucket**.

**storage.bucket_delete_website(bucket)**
    Delete website configuration of bucket named **bucket**.

**exists = storage.bucket_exists(bucket)**
    Test if **bucket** exists in storage. exists - boolean.

**storage.bucket_get(bucket, params={})**
    Get the next block of keys from the bucket based on params.

**d = storage.bucket_get_acl(bucket)**
    Returns bucket acl configuration as a dict.

**d = storage.bucket_get_cors(bucket)**
    Returns bucket cors configuration as a dict.

**d = storage.bucket_get_lifecycle(bucket)**
    Returns bucket lifecycle configuration as a dict.

**d = storage.bucket_get_location(bucket)**
    Returns bucket location configuration as a dict.

**d = storage.bucket_get_logging(bucket)**
    Returns bucket logging configuration as a dict.

**d = storage.bucket_get_notification(bucket)**
    Returns bucket notification configuration as a dict.

**d = storage.bucket_get_policy(bucket)**
    Returns bucket policy as a dict.

**d = storage.bucket_get_request_payment(bucket)**
    Returns bucket requestPayment configuration as a dict.

**d = storage.bucket_get_tagging(bucket)**
    Returns bucket tagging configuration as a dict.

**d = storage.bucket_get_versioning(bucket)**
    Returns bucket versioning configuration as a dict.

**d = storage.bucket_get_versions(bucket, params={})**
    Returns bucket versions as a dict.

**d = storage.bucket_get_website(bucket)**
    Returns bucket website configuration as a dict.

**for bucket in storage.bucket_list():**
    Returns a generator which yields all the buckets for the
    authenticated user's account. Each bucket is returned as an S3Bucket
    instance.

**for key in storage.bucket_list_keys(bucket, delimiter=None, prefix=None, params={}):**
    Returns a generator which yields all the keys in the bucket.

    * bucket - the name of the bucket to list
    * delimiter - used to request common prefixes
    * prefix - used to filter the listing
    * params - additional parameters

    When delimiter is used, the keys (i.e. file names) are returned
    first, followed by the common prefixes (i.e. directory names). Each
    key is returned as an S3Key instance; each common prefix is returned
    as a string.

    As a convenience, the delimiter and prefix may be provided either as
    keyword arguments or as keys in params. If the arguments are
    provided, they are used to update params. In any case, params are
    passed to S3.

    See http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
    for a description of delimiter, prefix, and the other parameters.
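For example, a brief sketch of listing one "directory" of a hypothetical
bucket (``storage`` is assumed to be configured as shown under Usage
below, and S3Key is assumed to be importable from the s3 package like
the other classes)::

    import s3

    # with delimiter='/', keys under a/ come back first as S3Key
    # instances, then the common prefixes (sub-directories) as strings
    for item in storage.bucket_list_keys('com-prometheus-my-bucket',
                                         delimiter='/', prefix='a/'):
        if isinstance(item, s3.S3Key):
            print(item.key, item.size)
        else:
            print('directory:', item)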
**bucket_set_acl(bucket, headers={}, data='')**
    Configure bucket acl using xml data, or request headers.

**bucket_set_cors(bucket, data='')**
    Configure bucket cors with xml data.

**bucket_set_lifecycle(bucket, data='')**
    Configure bucket lifecycle with xml data.

**bucket_set_logging(bucket, data='')**
    Configure bucket logging with xml data.

**bucket_set_notification(bucket, data='')**
    Configure bucket notification with xml data.

**bucket_set_policy(bucket, data='')**
    Configure bucket policy using json data.

**bucket_set_request_payment(bucket, data='')**
    Configure bucket requestPayment with xml data.

**bucket_set_tagging(bucket, data='')**
    Configure bucket tagging with xml data.

**bucket_set_versioning(bucket, headers={}, data='')**
    Configure bucket versioning using xml data and request headers.

**bucket_set_website(bucket, data='')**
    Configure bucket website with xml data.

**storage.copy(remote_source, remote_destination, headers={})**
    Copy **remote_source** to **remote_destination**. The destination
    metadata is copied from **headers** when it contains metadata;
    otherwise it is copied from the source metadata.

**storage.delete(remote_name)**
    Delete **remote_name** from storage.

**exists, metadata = storage.exists(remote_name)**
    Test if **remote_name** exists in storage and retrieve its metadata
    if it does. exists - boolean, metadata - dict.

**metadata = storage.read(remote_name, local_name)**
    Download **remote_name** from storage, save it locally as
    **local_name**, and retrieve its metadata. metadata - dict.

**storage.update_metadata(remote_name, headers)**
    Update (replace) the metadata associated with **remote_name** with
    the metadata headers in **headers**.

**storage.write(local_name, remote_name, headers={})**
    Upload **local_name** to storage as **remote_name**, and set its
    metadata if any metadata headers are in **headers**.

StorageError
------------

There are two forms of exceptions.

The first form is when a request to S3 completes but fails. For example,
a read request may fail because the user does not have read permission.
In this case a StorageError is raised with:

* msg - The name of the method that was called (e.g. 'read', 'exists',
  etc.)
* exception - A detailed error message
* response - The raw response object returned by requests

The second form is when any other exception happens, for example a disk
or network error. In this case a StorageError is raised with:

* msg - A detailed error message
* exception - The exception object
* response - None
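As an illustration, here is a minimal sketch of handling both forms. It
assumes StorageError is accessible from the s3 package (as the examples
under Usage suggest) and that ``storage`` is already configured; the key
name is hypothetical. ::

    try:
        storage.read('no-such-key', 'local-copy')
    except s3.StorageError as e:
        if e.response is not None:
            # the request reached S3 but failed (e.g. missing key or no
            # read permission); the raw requests response is available
            print(e.msg, e.response.status_code, e.exception)
        else:
            # some other failure, e.g. a disk or network error
            print(e.msg, e.exception)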
Usage
=====

Configuration
-------------

First configure your yaml file.

- **access_key_id** and **secret_access_key** are generated by the S3
  account manager. They are effectively the username and password for
  the account.

- **default_bucket** is the name of the default bucket to use when
  referencing S3 files. Bucket names must be unique (on earth), so by
  convention we use a prefix on all our bucket names: com-prometheus-.
  (Note: Amazon forbids underscores in bucket names, and although it
  allows periods, periods will confound DNS, so it is best not to use
  periods in bucket names.)

- **endpoint** and **region** are the Amazon server url to connect to
  and its associated region. See
  http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region for
  a list of the available endpoints and their associated regions.

- **tls** - True => use https://, False => use http://. Default is True.

- **retry** contains values used to retry requests.request(). If a
  request fails with an error listed in `status_codes`, and the `limit`
  of tries has not been reached, then a retry message is logged, the
  program sleeps for `interval` seconds, and the request is sent again.
  Default is::

      retry:
          limit: 5
          interval: 2.5
          status_codes:
              - 104

  **limit** is the number of times to try to send the request; 0 means
  unlimited retries. **interval** is the number of seconds to wait
  between retries. **status_codes** is a list of request status codes
  (errors) to retry.

Here is an example s3.yaml::

    ---
    s3:
        access_key_id: "XXXXX"
        secret_access_key: "YYYYYYY"
        default_bucket: "ZZZZZZZ"
        endpoint: "s3-us-west-2.amazonaws.com"
        region: "us-west-2"

Next configure your S3 bucket permissions. You can use s3 to create,
configure, and manage your buckets (see the examples below), or you can
use Amazon's web interface:

- Log onto your Amazon account.
- Create a bucket or click on an existing bucket.
- Click on Properties.
- Click on Permissions.
- Click on Edit Bucket Policy.

Here is an example policy with the required permissions::

    {
        "Version": "2008-10-17",
        "Id": "Policyxxxxxxxxxxxxx",
        "Statement": [
            {
                "Sid": "Stmtxxxxxxxxxxxxx",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::xxxxxxxxxxxx:user/XXXXXXX"
                },
                "Action": [
                    "s3:AbortMultipartUpload",
                    "s3:GetObjectAcl",
                    "s3:GetObjectVersion",
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                    "s3:GetObject",
                    "s3:PutObjectAcl",
                    "s3:PutObjectVersionAcl",
                    "s3:ListMultipartUploadParts",
                    "s3:PutObject",
                    "s3:GetObjectVersionAcl"
                ],
                "Resource": [
                    "arn:aws:s3:::com.prometheus.cgtest-1/*",
                    "arn:aws:s3:::com.prometheus.cgtest-1"
                ]
            }
        ]
    }

Examples
--------

Once the yaml file is configured, you can instantiate an S3Connection
and use that connection to instantiate a Storage instance::

    import s3
    import yaml

    with open('s3.yaml', 'r') as fi:
        config = yaml.safe_load(fi)
    connection = s3.S3Connection(**config['s3'])
    storage = s3.Storage(connection)

Then you call methods on the Storage instance. The following code
creates a bucket called "com-prometheus-my-bucket" and asserts the
bucket exists. Then it deletes the bucket and asserts the bucket does
not exist. ::

    my_bucket_name = 'com-prometheus-my-bucket'
    storage.bucket_create(my_bucket_name)
    assert storage.bucket_exists(my_bucket_name)
    storage.bucket_delete(my_bucket_name)
    assert not storage.bucket_exists(my_bucket_name)

The following code lists all the buckets and all the keys in each
bucket. ::

    for bucket in storage.bucket_list():
        print(bucket.name, bucket.creation_date)
        for key in storage.bucket_list_keys(bucket.name):
            print('\t', key.key, key.size, key.last_modified,
                  key.owner.display_name)

The following code uses the default bucket and uploads a file named
"example" from the local filesystem as "example-in-s3" in s3. It then
checks that "example-in-s3" exists in storage, downloads the file as
"example-from-s3", compares the original with the downloaded copy to
ensure they are the same, deletes "example-in-s3", and finally checks
that it is no longer in storage. ::

    import subprocess

    try:
        storage.write("example", "example-in-s3")
        exists, metadata = storage.exists("example-in-s3")
        assert exists
        metadata = storage.read("example-in-s3", "example-from-s3")
        assert 0 == subprocess.call(['diff', "example", "example-from-s3"])
        storage.delete("example-in-s3")
        exists, metadata = storage.exists("example-in-s3")
        assert not exists
    except s3.StorageError as e:
        print('failed:', e)

The following code again uploads "example" as "example-in-s3". This time
it uses the bucket "my-other-bucket" explicitly, and it sets some
metadata and checks that the metadata is set correctly. Then it changes
the metadata and checks that as well. ::

    headers = {
        'x-amz-meta-state': 'unprocessed',
        }
    remote_name = s3.S3Name("example-in-s3", bucket="my-other-bucket")
    try:
        storage.write("example", remote_name, headers=headers)
        exists, metadata = storage.exists(remote_name)
        assert exists
        assert metadata == headers
        headers['x-amz-meta-state'] = 'processed'
        storage.update_metadata(remote_name, headers)
        metadata = storage.read(remote_name, "example-from-s3")
        assert metadata == headers
    except s3.StorageError as e:
        print('failed:', e)
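None of the examples above exercise storage.copy. As a further sketch
(not part of the original examples; file names are hypothetical), the
following duplicates an uploaded file under a new key and then removes
the original::

    try:
        storage.write("example", "example-in-s3")
        # no metadata headers are supplied, so the copy inherits the
        # metadata of the source
        storage.copy("example-in-s3", "example-in-s3-copy")
        storage.delete("example-in-s3")
        exists, metadata = storage.exists("example-in-s3-copy")
        assert exists
    except s3.StorageError as e:
        print('failed:', e)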
The following code configures "com-prometheus-my-bucket" with a policy
that restricts "myuser" to write-only: myuser can write files but cannot
read them back, delete them, or even list them. ::

    storage.bucket_set_policy("com-prometheus-my-bucket", data={
        "Version": "2008-10-17",
        "Id": "BucketUploadNoDelete",
        "Statement": [
            {
                "Sid": "Stmt01",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::123456789012:user/myuser"
                },
                "Action": [
                    "s3:AbortMultipartUpload",
                    "s3:ListMultipartUploadParts",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::com-prometheus-my-bucket/*",
                    "arn:aws:s3:::com-prometheus-my-bucket"
                ]
            }
        ]
    })

s3 Command Line Tool
====================

This package installs both the s3 Python module and the s3 command line
tool. The command line tool provides a convenient way to upload and
download files to and from S3 without writing python code. As of now the
tool supports the put, get, delete, and list commands; it does not
support all the features of the module API.

s3 expects to find ``s3.yaml`` in the current directory. If it is not
there, you must tell s3 where it is using the --config option. For
example::

    $ s3 --config /path/to/s3.yaml command [command arguments]

You must provide a command. Some commands have required arguments and/or
optional arguments; it depends upon the command. Use the --help option
to see a list of supported commands and their arguments::

    $ s3 --help
    usage: s3 [-h] [-c CONFIG] [-v] [-b BUCKET]
              {get,put,delete,list,create-bucket,delete-bucket,list-buckets}
              ...

    Commands operate on the default bucket unless the --bucket option is
    used.

        Create a bucket
            create-bucket [bucket_name]
            The default bucket_name is the default bucket.
        Delete a file from S3
            delete delete_file
        Delete a bucket
            delete-bucket [bucket_name]
            The default bucket_name is the default bucket.
        Get a file from S3
            get remote_src [local_dst]
        List all files or list a single file and its metadata.
            list [list_file]
        List all buckets or list a single bucket.
            list-buckets [bucket_name]
            If bucket_name is given but does not exist, this is printed:
                '%s NOT FOUND' % bucket_name
        Put a file to S3
            put local_src [remote_dst]

    arguments:
        bucket_name  The name of the bucket to use.
        delete_file  The remote file to delete.
        list_file    If present, the file to list (with its metadata),
                     otherwise list all files.
        local_dst    The name of the local file to create (or overwrite).
                     The default is the basename of the remote_src.
        local_src    The name of the local file to put.
        remote_dst   The name of the s3 file to create (or overwrite).
                     The default is the basename of the local_src.
        remote_src   The name of the file in S3 to get.

    positional arguments:
      {get,put,delete,list,create-bucket,delete-bucket,list-buckets}

    optional arguments:
      -h, --help            show this help message and exit
      -c CONFIG, --config CONFIG
                            CONFIG is the configuration file to use.
                            Default is s3.yaml
      -v, --verbose         Show results of commands.
      -b BUCKET, --bucket BUCKET
                            Use BUCKET instead of the default bucket.

See `s3 Command Line Tool`_ in the API Reference.

.. _`s3 Command Line Tool`: reference.html#module-bin_s3