Creating an ETL

Dataduct makes it extremely easy to write ETL pipelines for AWS Data Pipeline. All of the details and logic are abstracted into YAML files, which are automatically translated into Data Pipeline definitions with the appropriate pipeline objects and configurations.

Writing a Dataduct YAML File

To learn about general YAML syntax, please see the YAML documentation. The structure of a Dataduct YAML file can be broken down into three parts:

  • Header information
  • Description
  • Pipeline steps

Example:

# HEADER INFORMATION
name : example_emr_streaming
frequency : one-time
load_time: 01:00  # Hour:Min in UTC
emr_cluster_config:
    num_instances: 1
    instance_size: m1.xlarge

# DESCRIPTION
description : Example for the emr_streaming step

# PIPELINE STEPS
steps:
-   type: extract-local
    path: examples/resources/word_data.txt

-   type: emr-streaming
    mapper: examples/scripts/word_mapper.py
    reducer: examples/scripts/word_reducer.py

-   type: transform
    script: examples/scripts/s3_profiler.py
    script_arguments:
    -   --input=INPUT1_STAGING_DIR
    -   --output=OUTPUT1_STAGING_DIR
    -   -f

Header Information

The header includes configuration information for Data Pipeline and the Elastic MapReduce resource.

The name field sets the overall pipeline name:

name : example_emr_streaming

The frequency represents how often the pipeline runs on a scheduled basis. The currently supported intervals are hourly, daily, and one-time:

frequency : one-time

The load_time is the time of day (in UTC) at which the pipeline is scheduled to run. It uses the HH:MM format, so 01:00 sets the pipeline to run at 1 AM UTC:

load_time: 01:00  # Hour:Min in UTC

If the pipeline includes an EMR-streaming step, the EMR cluster can be configured here. For example, you can configure the bootstrap, the number of instances, and the instance size:

emr_cluster_config:
    num_instances: 1
    instance_size: m1.xlarge
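
If the cluster also needs a bootstrap script, it can be referenced from the same block. A minimal sketch, assuming a bootstrap key that points to a script path (the key name and path shown here are illustrative, not confirmed by this document):

emr_cluster_config:
    num_instances: 1
    instance_size: m1.xlarge
    bootstrap: examples/scripts/emr_bootstrap.sh  # assumed key name and path, for illustration only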

Description

The description allows the creator of the YAML file to clearly explain the purpose of the pipeline.
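
For example, the header example above uses:

description : Example for the emr_streaming step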

Pipeline Steps

The pipeline steps are intentionally explicit and easy to understand, as they map directly onto Data Pipeline steps. Each step must have a type associated with it (for example, transform or emr-streaming) and should be given a name for clarity; a minimal named step is sketched below. The sections that follow list every supported step type:
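
A minimal sketch of a named step (the name is illustrative; the remaining fields depend on the step type):

steps:
-   name: extract_local_step  # optional, but recommended for clarity
    type: extract-local
    path: examples/resources/word_data.txt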

emr-streaming

The emr-streaming step runs on an EMR cluster configured in the header. You can specify the bootstrap, mapper, and reducer files.

-   type: emr-streaming
    mapper: examples/scripts/word_mapper.py
    reducer: examples/scripts/word_reducer.py

extract-local

The extract-local step extracts a local file (for example, a TSV file) and writes it to the output node. From there, the data can be loaded into Redshift or passed through further transformations.

-   name: extract_local_step
    type: extract-local
    path: examples/resources/word_data.txt

extract-rds

The extract-rds step extracts data from MySQL databases to S3. You can also specify the SQL statement that you would like to execute. The extraction looks up tables based on the host name and database name, which must be preconfigured in ~/.dataduct.

-   type: extract-rds
    host_name: maestro
    database: maestro
    sql: |
        SELECT *
        FROM networks_network;

extract-redshift

The extract-redshift step extracts data from AWS Redshift into S3 (the host and AWS details must be preconfigured in the ~/.dataduct file).

-   type: extract-redshift
    schema: dev
    table: categories

extract-s3

The extract-s3 step extracts files from a given S3 URI into the output S3 node.

-   type: extract-s3
    uri: s3://elasticmapreduce/samples/wordcount/wordSplitter.py

load-redshift

The load-redshift step loads data from the input nodes into the specified Redshift table. The host and AWS details must be preconfigured in the ~/.dataduct file. For example, the following steps upload a local file into dev.test_table:

-   type: extract-local
    path: examples/resources/test_table1.tsv

-   type: load-redshift
    schema: dev
    table: test_table

sql-command

The sql-command step executes a query against Redshift (the host and AWS details must be preconfigured in the ~/.dataduct file).

-   type: sql-command
    command: INSERT INTO dev.test_table VALUES (1, 'hello_etl');

transform

The transform step allows you to specify the input node, apply transformations, and write to a specified output node. The transformation can be in the form of a script or a UNIX command.

# Unix Example
-   type: transform
    command: cp -r $INPUT1_STAGING_DIR/* $OUTPUT1_STAGING_DIR
    input_node:
        step1_a: step2_a
        step1_b: step2_b
    output:
    -   "step2_a"
    -   "step2_b"

# Script Example
-   type: transform
    script: examples/scripts/s3_profiler.py
    input_node:
        step2_a: output1
    script_arguments:
    -   "-i=${INPUT1_STAGING_DIR}"
    -   "-o=${OUTPUT1_STAGING_DIR}"
    -   -f