InstallationΒΆ
Install the dataduct package using pip
pip install dataduct
Dependencies
dataduct currently has the following dependencies: - boto >= 2.32.0 - yaml
We have tried some older versions of boto with the problem being support some functionality around EMR that will be used in the later versions of dataduct.
Setup Configuration
Setup the configuration file to set the credentials and defaul values for various parameters passed to datapipeline. Copy the config template from https://github.com/coursera/dataduct/../example_config and write it to ~/.dataduct or /etc/.dataduct. You can also set an environment variable pointing to the config file location by setting the DATADUCT_PATH variable.
Config file template:
# Constants that are used across the dataduct library
ec2:
DEFAULT_ROLE: FILL_ME_IN
DEFAULT_RESOURCE_ROLE: FILL_ME_IN
DEFAULT_EC2_INSTANCE_TYPE: m1.large
ETL_AMI: FILL_ME_IN
KEY_PAIR: FILL_ME_IN
SECURITY_GROUP: FILL_ME_IN
emr:
DEFAULT_NUM_CORE_INSTANCES: 3
DEFAULT_CORE_INSTANCE_TYPE: m1.large
DEFAULT_TASK_INSTANCE_BID_PRICE: null # null if we want it to be None
DEFAULT_TASK_INSTANCE_TYPE: m1.large
DEFAULT_MASTER_INSTANCE_TYPE: m1.large
DEFAULT_CLUSTER_TIMEOUT: 6 Hours
DEFAULT_HADOOP_VERSION: null
DEFAULT_HIVE_VERSION: null
DEFAULT_PIG_VERSION: null
DEFAULT_CLUSTER_AMI: 2.4.7
redshift:
REDSHIFT_DATABASE_NAME: FILL_ME_IN
REDSHIFT_CLUSTER_ID: FILL_ME_IN
REDSHIFT_USERNAME: FILL_ME_IN
REDSHIFT_PASSWORD: FILL_ME_IN
mysql:
DATABASE_KEY:
HOST: FILL_ME_IN,
USERNAME: FILL_ME_IN,
PASSWORD: FILL_ME_IN
etl:
RETRY_DELAY: 10 Minutes
DEFAULT_MAX_RETRIES: 0
ETL_BUCKET: FILL_ME_IN
DATA_PIPELINE_TOPIC_ARN: FILL_ME_IN
DAILY_LOAD_TIME: 1 # run at 1AM UTC
bootstrap:
- type: transform
input_node: []
command: whoami >> ${OUTPUT1_STAGING_DIR}/output.txt
resource: FILL_ME_IN
name: bootstrap_transform