Configuration quick reference

Setting configuration options

You can set an option by:

  • Passing it on the command line as a switch (like --some-option)

  • Passing it as a keyword argument to the runner constructor, if you are creating the runner programmatically

  • Putting it in one of the included config files under a runner name, like this:

    runners:
        local:
            python_bin: python2.7  # only used in local runner
        emr:
            python_bin: python2.6  # only used in Elastic MapReduce runner
    

    See Config file format and location for information on where to put config files.
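As a rough illustration of the layout above, the sketch below writes the same runners section as a nested Python dict and looks up the options listed under one runner name. The CONF dict and the options_for() helper are hypothetical names made up for this example; mrjob loads and applies its config files itself.

```python
# The config snippet above, written as the nested dict it corresponds to.
CONF = {
    'runners': {
        'local': {'python_bin': 'python2.7'},  # only used in local runner
        'emr': {'python_bin': 'python2.6'},    # only used in Elastic MapReduce runner
    }
}


def options_for(conf, runner_alias):
    """Return a copy of the options listed under one runner name.

    Hypothetical helper for illustration only; not part of mrjob.
    """
    return dict(conf.get('runners', {}).get(runner_alias, {}))


if __name__ == '__main__':
    print(options_for(CONF, 'local'))
    print(options_for(CONF, 'emr'))
```

Each runner only sees the options under its own name, so the same option can take a different value per runner.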

Options that can’t be set from mrjob.conf (all runners)

Some options make no sense to set in a config file.

These options can be set via command-line switches:

Config           Command line                Default                Type
conf_paths       -c, --conf-path, --no-conf  see find_mrjob_conf()  path list
no_output        --no-output                 False                  boolean
output_dir       --output-dir                (automatic)            string
step_output_dir  --step-output-dir           (automatic)            string

These options can be set by overriding attributes or methods in your job class:

Option                Attribute             Method                  Default
hadoop_input_format   HADOOP_INPUT_FORMAT   hadoop_input_format()   None
hadoop_output_format  HADOOP_OUTPUT_FORMAT  hadoop_output_format()  None
partitioner           PARTITIONER           partitioner()           None
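A minimal sketch of both approaches, assuming a job that wants Hadoop's NLineInputFormat and HashPartitioner (the Java class names are only example values). The try/except stand-in for MRJob is not part of mrjob; it is there only so the snippet runs where mrjob is not installed.

```python
try:
    from mrjob.job import MRJob
except ImportError:
    # Stand-in for illustration only, so the sketch runs without mrjob.
    class MRJob(object):
        HADOOP_INPUT_FORMAT = None
        HADOOP_OUTPUT_FORMAT = None
        PARTITIONER = None


class MRNLineJob(MRJob):
    # Set the options by overriding class attributes.
    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.lib.NLineInputFormat'
    PARTITIONER = 'org.apache.hadoop.mapred.lib.HashPartitioner'


class MRSequenceFileOutputJob(MRJob):
    # Or override the method, e.g. when the value is computed at run time.
    def hadoop_output_format(self):
        return 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
```

Attributes suit fixed values; the method form is useful when the value depends on something the job only knows at run time.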

These options can be set by overriding your job’s configure_options() to call the appropriate method:

Option            Method                    Default
extra_args        add_passthrough_option()  []
file_upload_args  add_file_option()         []
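The following sketch shows one way such an override might look, assuming the optparse-style keyword arguments used by the mrjob version this page describes. The switch names --min-length and --stop-words are hypothetical, and the stand-in MRJob class exists only so the snippet runs where mrjob is not installed.

```python
try:
    from mrjob.job import MRJob
except ImportError:
    # Stand-in for illustration only, so the sketch runs without mrjob.
    class MRJob(object):
        def configure_options(self):
            pass

        def add_passthrough_option(self, *args, **kwargs):
            pass

        def add_file_option(self, *args, **kwargs):
            pass


class MRWordFilter(MRJob):

    def configure_options(self):
        # Keep the options the base class already defines.
        super(MRWordFilter, self).configure_options()
        # Hypothetical switch whose value is passed through to the tasks.
        self.add_passthrough_option(
            '--min-length', type='int', default=3,
            help='Ignore words shorter than this')
        # Hypothetical switch naming a file to upload alongside the job.
        self.add_file_option(
            '--stop-words', help='Path to a file of words to skip')
```

Calling the base class's configure_options() first keeps the standard switches available alongside the job-specific ones.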

All of the above can be passed as keyword arguments to MRJobRunner.__init__() (this is what makes them runner options), but you usually don’t want to instantiate runners directly.

Other options for all runners

These options can be passed to any runner without an error, though some runners may ignore some options. See the text after the table for specifics.

Config                Command line                                 Default                                      Type
bootstrap             --bootstrap                                  []                                           string list
bootstrap_mrjob       --bootstrap-mrjob, --no-bootstrap-mrjob      (automatic)                                  boolean
check_input_paths     --check-input-paths, --no-check-input-paths  True                                         boolean
cleanup               --cleanup                                    'ALL'                                        string
cleanup_on_failure    --cleanup-on-failure                         'NONE'                                       string
cmdenv                --cmdenv                                     {}                                           environment variable dict
hadoop_extra_args     --hadoop-arg                                 []                                           string list
hadoop_streaming_jar  --hadoop-streaming-jar                       (automatic)                                  string
interpreter           --interpreter                                None                                         string
jobconf               --jobconf                                    {}                                           dict
label                 --label                                      script’s module name, or no_script           string
libjars               --libjar                                     []                                           string list
local_tmp_dir                                                      value of tempfile.gettempdir()               path
owner                 --owner                                      getpass.getuser(), or no_user if that fails  string
py_files              --py-file                                    []                                           path list
python_archives       --python-archive                             []                                           path list
python_bin            --python-bin                                 (automatic)                                  command
setup                 --setup                                      []                                           string list
setup_cmds            --setup-cmd                                  []                                           string list
setup_scripts         --setup-script                               []                                           path list
sh_bin                --sh-bin                                     sh -ex (with exceptions below)               command
spark_args            --spark-arg                                  []                                           string list
steps_interpreter     --steps-interpreter                          current Python interpreter                   command
steps_python_bin      --steps-python-bin                           (current Python interpreter)                 command
strict_protocols      --strict-protocols, --no-strict-protocols    True                                         boolean
task_python_bin       --task-python-bin                            same as python_bin                           command
upload_archives       --archive                                    []                                           path list
upload_dirs           --dir                                        []                                           path list
upload_files          --file                                       []                                           path list

LocalMRJobRunner takes no additional options, but:

  • bootstrap_mrjob is False by default
  • cmdenv uses the local system’s path separator rather than always using : (so ; on Windows, no change elsewhere)
  • python_bin defaults to the current Python interpreter

In addition, it ignores hadoop_input_format, hadoop_output_format, hadoop_streaming_jar, and jobconf.
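To make the cmdenv note for LocalMRJobRunner concrete, here is a small sketch of joining PATH-like environment variables with the local separator, os.pathsep. The merge_envs() helper and its rule (later values are prepended for variables whose names end in PATH) are assumptions made for illustration, not mrjob's actual code.

```python
import os


def merge_envs(*envs, sep=os.pathsep):
    """Merge environment dicts; later dicts take priority.

    Hypothetical helper: for variables whose names end in PATH, the later
    value is prepended to the earlier one, joined with `sep`, instead of
    replacing it.
    """
    merged = {}
    for env in envs:
        for key, value in env.items():
            if key.endswith('PATH') and key in merged:
                merged[key] = value + sep + merged[key]
            else:
                merged[key] = value
    return merged


if __name__ == '__main__':
    base = {'PYTHONPATH': '/opt/lib', 'TZ': 'UTC'}
    extra = {'PYTHONPATH': '/home/me/lib'}
    # Joined with ; on Windows and : elsewhere.
    print(merge_envs(base, extra))
    # Always joined with : regardless of platform.
    print(merge_envs(base, extra, sep=':'))
```

Passing sep=':' explicitly mimics the behavior of a runner that targets a Unix-like cluster no matter where the job is launched from.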

InlineMRJobRunner works like LocalMRJobRunner, except that it also ignores bootstrap_mrjob, cmdenv, python_bin, setup_cmds, setup_scripts, steps_python_bin, upload_archives, and upload_files.

Additional options for EMRJobRunner

Config                       Command line                                       Default                                    Type
additional_emr_info          --additional-emr-info                              None                                       special
applications                 --application                                      []                                         string list
aws_access_key_id                                                               None                                       string
aws_secret_access_key        --aws-secret-access-key                            None                                       string
aws_session_token                                                               None                                       string
bootstrap_actions            --bootstrap-actions                                []                                         string list
bootstrap_cmds               --bootstrap-cmd                                    []                                         string list
bootstrap_files              --bootstrap-file                                   []                                         path list
bootstrap_python             --bootstrap-python, --no-bootstrap-python          (automatic)                                boolean
bootstrap_python_packages    --bootstrap-python-package                         []                                         path list
bootstrap_scripts            --bootstrap-script                                 []                                         path list
bootstrap_spark              --bootstrap-spark, --no-bootstrap-spark            (automatic)                                boolean
check_cluster_every          --check-cluster-every                              30                                         string
cloud_fs_sync_secs           --cloud-fs-sync-secs                               5.0                                        string
cloud_log_dir                --cloud-log-dir                                    append logs to cloud_tmp_dir               string
cloud_tmp_dir                --cloud-tmp-dir                                    (automatic)                                string
cloud_upload_part_size       --cloud-upload-part-size                           100                                        integer
cluster_id                   --cluster-id                                       automatically create a cluster and use it  string
core_instance_bid_price      --core-instance-bid-price                          None                                       string
core_instance_type           --core-instance-type                               value of instance_type                     string
ec2_key_pair                 --ec2-key-pair                                     None                                       string
ec2_key_pair_file            --ec2-key-pair-file                                None                                       path
emr_action_on_failure        --emr-action-on-failure                            (automatic)                                string
emr_api_params               --emr-api-param, --no-emr-api-param                {}                                         dict
emr_configurations           --emr-configuration                                []                                         list of dicts
emr_endpoint                 --emr-endpoint                                     infer from region                          string
enable_emr_debugging         --enable-emr-debugging                             False                                      boolean
hadoop_streaming_jar_on_emr  --hadoop-streaming-jar-on-emr                      AWS default                                string
hadoop_version               --hadoop-version                                   None                                       string
iam_endpoint                 --iam-endpoint                                     (automatic)                                string
iam_instance_profile         --iam-instance-profile                             (automatic)                                string
iam_service_role             --iam-service-role                                 (automatic)                                string
image_version                --image-version                                    '4.8.2'                                    string
instance_type                --instance-type                                    (automatic)                                string
master_instance_bid_price    --master-instance-bid-price                        None                                       string
master_instance_type         --master-instance-type                             (automatic)                                string
max_hours_idle               --max-hours-idle                                   None                                       string
mins_to_end_of_hour          --mins-to-end-of-hour                              5.0                                        string
num_core_instances           --num-core-instances                               0                                          string
num_ec2_instances            --num-ec2-instances                                1                                          string
num_task_instances           --num-task-instances                               0                                          string
pool_clusters                --pool-clusters                                    False                                      string
pool_name                    --pool-name                                        'default'                                  string
pool_wait_minutes            --pool-wait-minutes                                0                                          string
region                       --region                                           'us-west-2'                                string
release_label                --release-label                                    None                                       string
s3_endpoint                  --s3-endpoint                                      (automatic)                                string
ssh_bin                      --ssh-bin                                          'ssh'                                      command
ssh_bind_ports               --ssh-bind-ports                                   range(40001, 40841)                        special
ssh_tunnel                   --ssh-tunnel, --no-ssh-tunnel                      False                                      boolean
ssh_tunnel_is_open           --ssh-tunnel-is-open                               False                                      boolean
subnet                       --subnet                                           None                                       string
tags                         --tag                                              {}                                         dict
task_instance_bid_price      --task-instance-bid-price                          None                                       string
task_instance_type           --task-instance-type                               value of core_instance_type                string
visible_to_all_users         --visible-to-all-users, --no-visible-to-all-users  True                                       boolean
zone                         --zone                                             AWS default                                string