=========
s3
=========

.. contents::

Overview
========

s3 is a connector to S3, Amazon's Simple Storage Service REST API.
Use it to upload, download, delete, and copy files in S3, test them for
existence, or update their metadata.

S3 files may have metadata in addition to their content. Metadata is a set
of key/value pairs. Metadata may be set when the file is uploaded, or it
can be updated later.

S3 files are stored in S3 buckets. Buckets can be created, listed,
configured, and deleted. The bucket configuration can be read and the
bucket contents can be listed.

In addition to the s3 Python module, this package contains a command line
tool also named s3. The tool imports the module and offers a command line
interface to some of the module's capability.

Installation
============

From PyPI::

    $ pip install s3

From source::

    $ hg clone ssh://hg@bitbucket.org/prometheus/s3
    $ pip install -e s3

The installation is successful if you can import s3 and run the command
line tool. The following commands must produce no errors::

    $ python -c 'import s3'
    $ s3 --help

API to remote storage
=====================

S3 Buckets
----------

Buckets store files. Buckets may be created and deleted. They may be
listed, configured, and loaded with files. The configuration can be read,
and the files in the bucket can be listed.

Bucket names must be unique across S3, so it is best to use a unique
prefix on all bucket names. S3 forbids underscores in bucket names, and
although it allows periods, these confound DNS and should be avoided. For
example, at Prometheus Research we prefix all of our bucket names with
**com-prometheus-**.

All the bucket configuration options work the same way: the caller
provides XML or JSON data, and perhaps headers or params as well. s3
accepts a python object for the data argument instead of a string; the
object is converted to XML or JSON as required. Likewise, s3 returns a
python dict instead of the XML or JSON string returned by S3. However,
that string is readily available if need be, because the response
returned by requests.request() is exposed to the caller.

S3 Filenames
------------

An S3 file name consists of a bucket and a key. This pair of strings
uniquely identifies the file within S3.

The S3Name class is instantiated with a key and a bucket; the key is
required and the bucket defaults to None.

The Storage class methods take a **remote_name** argument which can be
either a string, which is the key, or an instance of the S3Name class.
When no bucket is given (or the bucket is None), the default_bucket
established when the connection was instantiated is used. If no bucket is
given (or the bucket is None) and there is no default bucket, a
ValueError is raised. In other words, the S3Name class provides a means
of using a bucket other than the default_bucket.

S3 Directories
--------------

Although S3 storage is flat (buckets simply contain keys), S3 lets you
impose a directory tree structure on your bucket by using a delimiter in
your keys. For example, if you name a key 'a/b/f' and use '/' as the
delimiter, then S3 considers 'a' a directory, 'b' a sub-directory of
'a', and 'f' a file in 'b'.
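As a minimal sketch (``storage`` is assumed to be a Storage instance
configured as shown under Usage below; the bucket and file names are
hypothetical), the following writes a local file under a '/'-delimited
key, first to the default bucket and then to another bucket via
S3Name::

    import s3

    # the key 'reports/2016/jan.csv' makes 'reports' and 'reports/2016'
    # behave like directories when '/' is used as the delimiter
    storage.write('jan.csv', 'reports/2016/jan.csv')

    # same key, but stored in an explicitly named bucket
    other = s3.S3Name('reports/2016/jan.csv',
                      bucket='com-prometheus-other-bucket')
    storage.write('jan.csv', other)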
Headers and Metadata
--------------------

Additional HTTP headers may be sent using the methods which write data.
These methods accept an optional **headers** argument, a python dict. The
headers control various aspects of how the file may be handled. S3
supports a variety of headers; they are not discussed here. See Amazon's
S3 documentation for more information on S3 headers.

Headers whose keys begin with the special prefix **x-amz-meta-** are
considered metadata headers and are used to set the metadata attributes
of the file. The methods which read files also return the metadata, which
consists of only those response headers that begin with **x-amz-meta-**.

Python classes for S3 data
--------------------------

To facilitate the transfer of data between S3 and applications, various
classes are defined which correspond to data returned by S3. All
attributes of these classes are strings.

* S3Bucket

  * creation_date
  * name

* S3Key

  * e_tag
  * key
  * last_modified
  * owner
  * size
  * storage_class

* S3Owner

  * display_name
  * id

XML strings and Python objects
------------------------------

An XML string consists of a series of nested tags. An XML tag can be
represented in python as an entry in a dict. (An OrderedDict from the
collections module should be used when the order of the keys is
important.) The opening tag (everything between the '<' and the '>') is
the key, and everything between the opening tag and the closing tag is
the value of the key.

Since every value must be enclosed in a tag, not every python object can
represent XML in this way. In particular, lists may only contain dicts
which have a single key.

For example this XML::

    <a xmlns="foo">
        <b1>
            <c1>1</c1>
        </b1>
        <b2>
            <c2>2</c2>
        </b2>
    </a>

is equivalent to this object::

    {'a xmlns="foo"': [{'b1': {'c1': 1}},
                       {'b2': {'c2': 2}}]}
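As a sketch of this convention in use (bucket name is hypothetical, and
the nested-dict layout follows Amazon's VersioningConfiguration document
in the same style as the bucket_create default shown under Storage
Methods below), a versioning configuration could be written and read
back like this::

    # pass a python dict instead of raw XML; attributes such as xmlns
    # go inside the key string, as in the example object above
    storage.bucket_set_versioning(
        'com-prometheus-my-bucket',
        data={'VersioningConfiguration'
              ' xmlns="http://s3.amazonaws.com/doc/2006-03-01/"': {
                  'Status': 'Enabled'}})

    # the corresponding get method returns a python dict, not XML
    config = storage.bucket_get_versioning('com-prometheus-my-bucket')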
Storage Methods
---------------

The arguments **remote_source**, **remote_destination**, and
**remote_name** may be either a string or an S3Name instance.

**local_name** is a string and is the name of the file on the local
system. This string is passed directly to open().

**bucket** is a string and is the name of the bucket.

**headers** is a python dict used to encode additional request headers.

**params** is either a python dict used to encode the request
parameters, or a string containing all the text of the url query string
after the '?'.

**data** is a string or an object and is the body of the message. An
object will be converted to an XML or JSON string as appropriate.

All methods return on success or raise StorageError on failure. Upon
return, **storage.response** contains the raw response object which was
returned by the requests module. So, for example,
storage.response.headers contains the response headers returned by S3.
See http://docs.python-requests.org/en/latest/api/ for a description of
the response object.

See http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketOps.html
for a description of the available bucket operations and their
arguments.

**storage.bucket_create(bucket, headers={}, data=None)**
    Create a bucket named **bucket**. **headers** may be used to set
    either ACL or explicit access permissions. **data** may be used to
    override the default region. If data is None, data is set as
    follows::

        data = {
            'CreateBucketConfiguration'
            ' xmlns="http://s3.amazonaws.com/doc/2006-03-01/"': {
                'LocationConstraint': self.connection.region}}

**storage.bucket_delete(bucket)**
    Delete a bucket named **bucket**.

**storage.bucket_delete_cors(bucket)**
    Delete cors configuration of bucket named **bucket**.

**storage.bucket_delete_lifecycle(bucket)**
    Delete lifecycle configuration of bucket named **bucket**.

**storage.bucket_delete_policy(bucket)**
    Delete policy of bucket named **bucket**.

**storage.bucket_delete_tagging(bucket)**
    Delete tagging configuration of bucket named **bucket**.

**storage.bucket_delete_website(bucket)**
    Delete website configuration of bucket named **bucket**.

**exists = storage.bucket_exists(bucket)**
    Test if **bucket** exists in storage. exists - boolean.

**storage.bucket_get(bucket, params={})**
    Get the next block of keys from the bucket based on params.

**d = storage.bucket_get_acl(bucket)**
    Returns bucket acl configuration as a dict.

**d = storage.bucket_get_cors(bucket)**
    Returns bucket cors configuration as a dict.

**d = storage.bucket_get_lifecycle(bucket)**
    Returns bucket lifecycle configuration as a dict.

**d = storage.bucket_get_location(bucket)**
    Returns bucket location configuration as a dict.

**d = storage.bucket_get_logging(bucket)**
    Returns bucket logging configuration as a dict.

**d = storage.bucket_get_notification(bucket)**
    Returns bucket notification configuration as a dict.

**d = storage.bucket_get_policy(bucket)**
    Returns bucket policy as a dict.

**d = storage.bucket_get_request_payment(bucket)**
    Returns bucket requestPayment configuration as a dict.

**d = storage.bucket_get_tagging(bucket)**
    Returns bucket tagging configuration as a dict.

**d = storage.bucket_get_versioning(bucket)**
    Returns bucket versioning configuration as a dict.

**d = storage.bucket_get_versions(bucket, params={})**
    Returns bucket versions as a dict.

**d = storage.bucket_get_website(bucket)**
    Returns bucket website configuration as a dict.

**for bucket in storage.bucket_list():**
    Returns a generator which yields all the buckets for the
    authenticated user's account. Each bucket is returned as an S3Bucket
    instance.

**for key in storage.bucket_list_keys(bucket, delimiter=None, prefix=None, params={}):**
    Returns a generator which yields all the keys in the bucket.

    * bucket - the name of the bucket to list
    * delimiter - used to request common prefixes
    * prefix - used to filter the listing
    * params - additional parameters

    When delimiter is used, the keys (i.e. file names) are returned
    first, followed by the common prefixes (i.e. directory names). Each
    key is returned as an S3Key instance; each common prefix is returned
    as a string.

    As a convenience, the delimiter and prefix may be provided either as
    keyword arguments or as keys in params. If the arguments are
    provided, they are used to update params. In any case, params are
    passed to S3.

    See http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGET.html
    for a description of delimiter, prefix, and the other parameters.
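For example, a brief sketch of listing one "directory" of a hypothetical
bucket (``storage`` is assumed to be configured as shown under Usage
below, and S3Key is assumed to be importable from the s3 package like
the other classes)::

    import s3

    # with delimiter='/', keys under a/ come back first as S3Key
    # instances, then the common prefixes (sub-directories) as strings
    for item in storage.bucket_list_keys('com-prometheus-my-bucket',
                                         delimiter='/', prefix='a/'):
        if isinstance(item, s3.S3Key):
            print(item.key, item.size)
        else:
            print('directory:', item)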
**bucket_set_acl(bucket, headers={}, data='')**
    Configure bucket acl using xml data, or request headers.

**bucket_set_cors(bucket, data='')**
    Configure bucket cors with xml data.

**bucket_set_lifecycle(bucket, data='')**
    Configure bucket lifecycle with xml data.

**bucket_set_logging(bucket, data='')**
    Configure bucket logging with xml data.

**bucket_set_notification(bucket, data='')**
    Configure bucket notification with xml data.

**bucket_set_policy(bucket, data='')**
    Configure bucket policy using json data.

**bucket_set_request_payment(bucket, data='')**
    Configure bucket requestPayment with xml data.

**bucket_set_tagging(bucket, data='')**
    Configure bucket tagging with xml data.

**bucket_set_versioning(bucket, headers={}, data='')**
    Configure bucket versioning using xml data and request headers.

**bucket_set_website(bucket, data='')**
    Configure bucket website with xml data.

**storage.copy(remote_source, remote_destination, headers={})**
    Copy **remote_source** to **remote_destination**. The destination
    metadata is copied from **headers** when it contains metadata;
    otherwise it is copied from the source metadata.

**storage.delete(remote_name)**
    Delete **remote_name** from storage.

**exists, metadata = storage.exists(remote_name)**
    Test if **remote_name** exists in storage and retrieve its metadata
    if it does. exists - boolean, metadata - dict.

**metadata = storage.read(remote_name, local_name)**
    Download **remote_name** from storage, save it locally as
    **local_name**, and retrieve its metadata. metadata - dict.

**storage.update_metadata(remote_name, headers)**
    Update (replace) the metadata associated with **remote_name** with
    the metadata headers in **headers**.

**storage.write(local_name, remote_name, headers={})**
    Upload **local_name** to storage as **remote_name**, and set its
    metadata if any metadata headers are in **headers**.

StorageError
------------

There are two forms of exceptions.

The first form is when a request to S3 completes but fails. For example,
a read request may fail because the user does not have read permission.
In this case a StorageError is raised with:

* msg - The name of the method that was called (e.g. 'read', 'exists',
  etc.)
* exception - A detailed error message
* response - The raw response object returned by requests

The second form is when any other exception happens, for example a disk
or network error. In this case a StorageError is raised with:

* msg - A detailed error message
* exception - The exception object
* response - None
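As an illustration, here is a minimal sketch of handling both forms. It
assumes StorageError is accessible from the s3 package (as the examples
under Usage suggest) and that ``storage`` is already configured; the key
name is hypothetical. ::

    try:
        storage.read('no-such-key', 'local-copy')
    except s3.StorageError as e:
        if e.response is not None:
            # the request reached S3 but failed (e.g. missing key or no
            # read permission); the raw requests response is available
            print(e.msg, e.response.status_code, e.exception)
        else:
            # some other failure, e.g. a disk or network error
            print(e.msg, e.exception)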
Usage
=====

Configuration
-------------

First configure your yaml file.

- **access_key_id** and **secret_access_key** are generated by the S3
  account manager. They are effectively the username and password for
  the account.

- **default_bucket** is the name of the default bucket to use when
  referencing S3 files. Bucket names must be unique (on earth), so by
  convention we use a prefix on all our bucket names: com-prometheus-.
  (Note: Amazon forbids underscores in bucket names, and although it
  allows periods, periods will confound DNS, so it is best not to use
  periods in bucket names.)

- **endpoint** and **region** are the Amazon server url to connect to
  and its associated region. See
  http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region for
  a list of the available endpoints and their associated regions.

- **tls** - True => use https://, False => use http://. Default is True.

- **retry** contains values used to retry requests.request(). If a
  request fails with an error listed in `status_codes`, and the `limit`
  of tries has not been reached, then a retry message is logged, the
  program sleeps for `interval` seconds, and the request is sent again.
  Default is::

      retry:
          limit: 5
          interval: 2.5
          status_codes:
              - 104

  **limit** is the number of times to try to send the request; 0 means
  unlimited retries. **interval** is the number of seconds to wait
  between retries. **status_codes** is a list of request status codes
  (errors) to retry.

Here is an example s3.yaml::

    ---
    s3:
        access_key_id: "XXXXX"
        secret_access_key: "YYYYYYY"
        default_bucket: "ZZZZZZZ"
        endpoint: "s3-us-west-2.amazonaws.com"
        region: "us-west-2"

Next configure your S3 bucket permissions. You can use s3 to create,
configure, and manage your buckets (see the examples below), or you can
use Amazon's web interface:

- Log onto your Amazon account.
- Create a bucket or click on an existing bucket.
- Click on Properties.
- Click on Permissions.
- Click on Edit Bucket Policy.

Here is an example policy with the required permissions::

    {
        "Version": "2008-10-17",
        "Id": "Policyxxxxxxxxxxxxx",
        "Statement": [
            {
                "Sid": "Stmtxxxxxxxxxxxxx",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::xxxxxxxxxxxx:user/XXXXXXX"
                },
                "Action": [
                    "s3:AbortMultipartUpload",
                    "s3:GetObjectAcl",
                    "s3:GetObjectVersion",
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                    "s3:GetObject",
                    "s3:PutObjectAcl",
                    "s3:PutObjectVersionAcl",
                    "s3:ListMultipartUploadParts",
                    "s3:PutObject",
                    "s3:GetObjectVersionAcl"
                ],
                "Resource": [
                    "arn:aws:s3:::com.prometheus.cgtest-1/*",
                    "arn:aws:s3:::com.prometheus.cgtest-1"
                ]
            }
        ]
    }

Examples
--------

Once the yaml file is configured, you can instantiate an S3Connection
and use that connection to instantiate a Storage instance::

    import s3
    import yaml

    with open('s3.yaml', 'r') as fi:
        config = yaml.safe_load(fi)
    connection = s3.S3Connection(**config['s3'])
    storage = s3.Storage(connection)

Then you call methods on the Storage instance. The following code
creates a bucket called "com-prometheus-my-bucket" and asserts the
bucket exists. Then it deletes the bucket and asserts the bucket does
not exist. ::

    my_bucket_name = 'com-prometheus-my-bucket'
    storage.bucket_create(my_bucket_name)
    assert storage.bucket_exists(my_bucket_name)
    storage.bucket_delete(my_bucket_name)
    assert not storage.bucket_exists(my_bucket_name)

The following code lists all the buckets and all the keys in each
bucket. ::

    for bucket in storage.bucket_list():
        print(bucket.name, bucket.creation_date)
        for key in storage.bucket_list_keys(bucket.name):
            print('\t', key.key, key.size, key.last_modified,
                  key.owner.display_name)

The following code uses the default bucket and uploads a file named
"example" from the local filesystem as "example-in-s3" in s3. It then
checks that "example-in-s3" exists in storage, downloads the file as
"example-from-s3", compares the original with the downloaded copy to
ensure they are the same, deletes "example-in-s3", and finally checks
that it is no longer in storage. ::

    import subprocess

    try:
        storage.write("example", "example-in-s3")
        exists, metadata = storage.exists("example-in-s3")
        assert exists
        metadata = storage.read("example-in-s3", "example-from-s3")
        assert 0 == subprocess.call(['diff', "example", "example-from-s3"])
        storage.delete("example-in-s3")
        exists, metadata = storage.exists("example-in-s3")
        assert not exists
    except s3.StorageError as e:
        print('failed:', e)

The following code again uploads "example" as "example-in-s3". This time
it uses the bucket "my-other-bucket" explicitly, and it sets some
metadata and checks that the metadata is set correctly. Then it changes
the metadata and checks that as well. ::

    headers = {
        'x-amz-meta-state': 'unprocessed',
        }
    remote_name = s3.S3Name("example-in-s3", bucket="my-other-bucket")
    try:
        storage.write("example", remote_name, headers=headers)
        exists, metadata = storage.exists(remote_name)
        assert exists
        assert metadata == headers
        headers['x-amz-meta-state'] = 'processed'
        storage.update_metadata(remote_name, headers)
        metadata = storage.read(remote_name, "example-from-s3")
        assert metadata == headers
    except s3.StorageError as e:
        print('failed:', e)
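None of the examples above exercise storage.copy. As a further sketch
(not part of the original examples; file names are hypothetical), the
following duplicates an uploaded file under a new key and then removes
the original::

    try:
        storage.write("example", "example-in-s3")
        # no metadata headers are supplied, so the copy inherits the
        # metadata of the source
        storage.copy("example-in-s3", "example-in-s3-copy")
        storage.delete("example-in-s3")
        exists, metadata = storage.exists("example-in-s3-copy")
        assert exists
    except s3.StorageError as e:
        print('failed:', e)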
The following code configures "com-prometheus-my-bucket" with a policy
that restricts "myuser" to write-only: myuser can write files but cannot
read them back, delete them, or even list them. ::

    storage.bucket_set_policy("com-prometheus-my-bucket", data={
        "Version": "2008-10-17",
        "Id": "BucketUploadNoDelete",
        "Statement": [
            {
                "Sid": "Stmt01",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::123456789012:user/myuser"
                },
                "Action": [
                    "s3:AbortMultipartUpload",
                    "s3:ListMultipartUploadParts",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::com-prometheus-my-bucket/*",
                    "arn:aws:s3:::com-prometheus-my-bucket"
                ]
            }
        ]
    })

s3 Command Line Tool
====================

This package installs both the s3 Python module and the s3 command line
tool. The command line tool provides a convenient way to upload and
download files to and from S3 without writing python code. As of now the
tool supports the put, get, delete, and list commands; it does not
support all the features of the module API.

s3 expects to find ``s3.yaml`` in the current directory. If it is not
there, you must tell s3 where it is using the --config option. For
example::

    $ s3 --config /path/to/s3.yaml command [command arguments]

You must provide a command. Some commands have required arguments and/or
optional arguments; it depends upon the command. Use the --help option
to see a list of supported commands and their arguments::

    $ s3 --help
    usage: s3 [-h] [-c CONFIG] [-v] [-b BUCKET]
              {get,put,delete,list,create-bucket,delete-bucket,list-buckets}
              ...

    Commands operate on the default bucket unless the --bucket option is
    used.

        Create a bucket
            create-bucket [bucket_name]
            The default bucket_name is the default bucket.
        Delete a file from S3
            delete delete_file
        Delete a bucket
            delete-bucket [bucket_name]
            The default bucket_name is the default bucket.
        Get a file from S3
            get remote_src [local_dst]
        List all files or list a single file and its metadata.
            list [list_file]
        List all buckets or list a single bucket.
            list-buckets [bucket_name]
            If bucket_name is given but does not exist, this is printed:
                '%s NOT FOUND' % bucket_name
        Put a file to S3
            put local_src [remote_dst]

    arguments:
        bucket_name  The name of the bucket to use.
        delete_file  The remote file to delete.
        list_file    If present, the file to list (with its metadata),
                     otherwise list all files.
        local_dst    The name of the local file to create (or overwrite).
                     The default is the basename of the remote_src.
        local_src    The name of the local file to put.
        remote_dst   The name of the s3 file to create (or overwrite).
                     The default is the basename of the local_src.
        remote_src   The name of the file in S3 to get.

    positional arguments:
      {get,put,delete,list,create-bucket,delete-bucket,list-buckets}

    optional arguments:
      -h, --help            show this help message and exit
      -c CONFIG, --config CONFIG
                            CONFIG is the configuration file to use.
                            Default is s3.yaml
      -v, --verbose         Show results of commands.
      -b BUCKET, --bucket BUCKET
                            Use BUCKET instead of the default bucket.

See `s3 Command Line Tool`_ in the API Reference.

.. _`s3 Command Line Tool`: reference.html#module-bin_s3