Configuration quick reference

Options that can’t be set from mrjob.conf (all runners)

Option                Default                             Switches
conf_path             (automatic; see find_mrjob_conf())  -c, --conf-path, --no-conf
extra_args            []                                  (see add_passthrough_option())
file_upload_args      []                                  (see add_file_option())
hadoop_input_format   None                                (see hadoop_input_format())
hadoop_output_format  None                                (see hadoop_output_format())
output_dir            (automatic)                         -o, --output-dir
no_output             False                               --no-output
partitioner           None                                --partitioner (see also partitioner())

See mrjob.runner.MRJobRunner.__init__() for details.

Other options for all runners

Option                Default                       Combined by           Switches
base_tmp_dir          (automatic)                   combine_paths()       (set TMPDIR)
bootstrap_mrjob       True                          combine_values()      --bootstrap-mrjob, --no-bootstrap-mrjob
cleanup               'ALL'                         combine_values()      --cleanup
cleanup_on_failure    'NONE'                        combine_values()      --cleanup-on-failure
cmdenv                {}                            combine_envs()        --cmdenv
hadoop_extra_args     []                            combine_lists()       --hadoop-arg
hadoop_streaming_jar  (automatic)                   combine_values()      --hadoop-streaming-jar
interpreter           (value of python_bin)         combine_cmds()        --interpreter
jobconf               {}                            combine_dicts()       --jobconf (see also jobconf())
label                 (automatic)                   combine_values()      --label
owner                 (automatic)                   combine_values()      --owner
python_archives       []                            combine_path_lists()  --python-archive
python_bin            python                        combine_cmds()        --python-bin
setup_cmds            []                            combine_lists()       --setup-cmd
setup_scripts         []                            combine_path_lists()  --setup-script
steps_python_bin      (current Python interpreter)  combine_cmds()        --steps-python-bin
upload_archives       []                            combine_path_lists()  --archive
upload_files          []                            combine_path_lists()  --file

See mrjob.runner.MRJobRunner.__init__() for details.
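
The "Combined by" column names the function that merges an option when it is set in more than one place (for example in mrjob.conf and again on the command line). The sketch below imitates four of these functions in plain Python to illustrate the general idea. The sketch_* names are hypothetical stand-ins, not mrjob's own code, and the behavior shown (None means "unset", later values take precedence, variables ending in PATH are prepended rather than overwritten) is a simplified reading of the real functions, so check the mrjob.conf module documentation before relying on the details.

```python
# Simplified stand-ins for some of the combiner functions named in the table.
# The sketch_* names are hypothetical; the real functions may differ in detail.


def sketch_combine_values(*values):
    """Return the last value that is not None (or None if all are unset)."""
    for value in reversed(values):
        if value is not None:
            return value
    return None


def sketch_combine_lists(*seqs):
    """Concatenate the sequences in order, skipping any that are None."""
    result = []
    for seq in seqs:
        if seq is not None:
            result.extend(seq)
    return result


def sketch_combine_dicts(*dicts):
    """Merge dicts; keys in later dicts override keys in earlier ones."""
    result = {}
    for d in dicts:
        if d is not None:
            result.update(d)
    return result


def sketch_combine_envs(*envs):
    """Like sketch_combine_dicts, but a variable whose name ends in PATH is
    prepended to the earlier value (joined with ':') instead of replacing it."""
    result = {}
    for env in envs:
        for key, value in (env or {}).items():
            if key.endswith('PATH') and result.get(key):
                result[key] = value + ':' + result[key]
            else:
                result[key] = value
    return result


# First argument: a value as it might come from mrjob.conf.
# Second argument: a value as it might come from the command line.
cleanup = sketch_combine_values('ALL', None)
extra_args = sketch_combine_lists(['-a'], ['-b'])
jobconf = sketch_combine_dicts({'k1': 'conf', 'k2': 'conf'}, {'k2': 'cli'})
cmdenv = sketch_combine_envs({'PYTHONPATH': '/conf/lib', 'TZ': 'UTC'},
                             {'PYTHONPATH': '/cli/lib'})

print(cleanup)     # ALL
print(extra_args)  # ['-a', '-b']
print(jobconf)     # {'k1': 'conf', 'k2': 'cli'}
print(cmdenv)      # {'PYTHONPATH': '/cli/lib:/conf/lib', 'TZ': 'UTC'}
```

The placeholder values ('-a', 'k1', '/conf/lib' and so on) are illustrative only and are not real Hadoop arguments or paths.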

LocalMRJobRunner takes no additional options, but:

  • bootstrap_mrjob is False by default
  • cmdenv is combined with combine_local_envs()
  • python_bin defaults to the current Python interpreter

In addition, it ignores hadoop_input_format, hadoop_output_format, hadoop_streaming_jar, and jobconf.

InlineMRJobRunner works like LocalMRJobRunner, except that it also ignores bootstrap_mrjob, cmdenv, python_bin, setup_cmds, setup_scripts, steps_python_bin, upload_archives, and upload_files.
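
The note that LocalMRJobRunner combines cmdenv with combine_local_envs() mainly affects path-like variables. A minimal sketch of that difference follows, assuming the local variant joins variables whose names end in PATH with the platform's own separator (os.pathsep) rather than a fixed ':'. The function sketch_combine_local_envs is a hypothetical stand-in, not mrjob's implementation.

```python
import os


def sketch_combine_local_envs(*envs):
    """Hypothetical stand-in: merge environment dicts, prepending later
    *PATH values to earlier ones using the local path separator."""
    result = {}
    for env in envs:
        for key, value in (env or {}).items():
            if key.endswith('PATH') and result.get(key):
                result[key] = value + os.pathsep + result[key]
            else:
                result[key] = value
    return result


# Illustrative values: the first dict stands for mrjob.conf, the second for
# a --cmdenv switch on the command line.
local_env = sketch_combine_local_envs({'PYTHONPATH': 'conf_lib'},
                                      {'PYTHONPATH': 'cli_lib'})

# The separator is ':' on POSIX systems and ';' on Windows.
print(local_env['PYTHONPATH'])
```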

Additional options for EMRJobRunner

Option                         Default                       Combined by           Switches
additional_emr_info            None                          combine_values()      --additional-emr-info
ami_version                    None                          combine_values()      --ami-version
aws_access_key_id              (automatic)                   combine_values()      (set AWS_ACCESS_KEY_ID)
aws_availability_zone          (automatic)                   combine_values()      --aws-availability-zone
aws_region                     (automatic)                   combine_values()      --aws-region
aws_secret_access_key          (automatic)                   combine_values()      (set AWS_SECRET_ACCESS_KEY)
bootstrap_actions              []                            combine_lists()       --bootstrap-action
bootstrap_cmds                 []                            combine_lists()       --bootstrap-cmd
bootstrap_files                []                            combine_path_lists()  --bootstrap-file
bootstrap_python_packages      []                            combine_path_lists()  --bootstrap-python-package
bootstrap_scripts              []                            combine_lists()       --bootstrap-script
check_emr_status_every         30                            combine_values()      --check-emr-status-every
ec2_core_instance_bid_price    None                          combine_values()      --ec2-core-instance-bid-price
ec2_core_instance_type         'm1.small'                    combine_values()      --ec2-core-instance-type
ec2_instance_type              (effectively 'm1.small')      combine_values()      --ec2-instance-type
ec2_key_pair                   None                          combine_values()      --ec2-key-pair
ec2_key_pair_file              None                          combine_paths()       --ec2-key-pair-file
ec2_master_instance_bid_price  None                          combine_values()      --ec2-master-instance-bid-price
ec2_master_instance_type       'm1.small'                    combine_values()      --ec2-master-instance-type
ec2_slave_instance_type        (see ec2_core_instance_type)  combine_values()      --ec2-slave-instance-type
ec2_task_instance_bid_price    None                          combine_values()      --ec2-task-instance-bid-price
ec2_task_instance_type         (effectively 'm1.small')      combine_values()      --ec2-task-instance-type
emr_endpoint                   (automatic)                   combine_values()      --emr-endpoint
emr_job_flow_id                (create our own job flow)     combine_values()      --emr-job-flow-id
emr_job_flow_pool_name         'default'                     combine_values()      --pool-name
enable_emr_debugging           False                         combine_values()      --enable-emr-debugging, --disable-emr-debugging
hadoop_streaming_jar_on_emr    None                          combine_values()      --hadoop-streaming-jar-on-emr
hadoop_version                 '0.20'                        combine_values()      --hadoop-version
num_ec2_core_instances         None                          combine_values()      --num-ec2-core-instances
num_ec2_instances              1                             combine_values()      --num-ec2-instances
num_ec2_task_instances         None                          combine_values()      --num-ec2-task-instances
pool_emr_job_flows             False                         combine_values()      --pool-emr-job-flows, --no-pool-emr-job-flows
pool_wait_minutes              0                             combine_values()      --pool-wait-minutes
s3_endpoint                    (automatic)                   combine_paths()       --s3-endpoint
s3_log_uri                     (automatic)                   combine_paths()       --s3-log-uri
s3_scratch_uri                 (automatic)                   combine_values()      --s3-scratch-uri
s3_sync_wait_time              5.0                           combine_values()      --s3-sync-wait-time
ssh_bin                        ssh                           combine_cmds()        --ssh-bin
ssh_bind_ports                 range(40001, 40841)           combine_values()      --ssh-bind-ports
ssh_tunnel_is_open             False                         combine_values()      --ssh-tunnel-is-open, --ssh-tunnel-is-closed
ssh_tunnel_to_job_tracker      False                         combine_values()      --ssh-tunnel-to-job-tracker

See mrjob.emr.EMRJobRunner.__init__() for details.
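
As an illustration of how a few of the options above might be set in mrjob.conf, the sketch below builds a configuration as a Python dict and serializes it as JSON using only the standard library. It assumes the usual mrjob.conf layout of a top-level "runners" mapping keyed by runner alias (here "emr"), and that JSON is acceptable because it is a subset of YAML. The option names come from the tables on this page; the values are illustrative only, and the snippet prints the text rather than writing any file.

```python
import json

# Illustrative values only; option names are taken from the tables above.
conf = {
    'runners': {
        'emr': {
            'aws_region': 'us-west-1',       # assumed example region
            'ec2_instance_type': 'm1.small',
            'num_ec2_instances': 1,
            'check_emr_status_every': 30,
            'cmdenv': {'TZ': 'UTC'},         # an all-runners option
        },
    },
}

# Serialize with stable key order so the output is easy to review and diff.
conf_text = json.dumps(conf, indent=2, sort_keys=True)
print(conf_text)
```

Options listed under "Options that can’t be set from mrjob.conf" (such as output_dir or no_output) are deliberately left out of the dict.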

Additional options for HadoopJobRunner

Option            Default              Combined by       Switches
hadoop_bin        (automatic)          combine_cmds()    --hadoop-bin
hadoop_home       HADOOP_HOME          combine_values()  (set HADOOP_HOME)
hdfs_scratch_dir  tmp/mrjob (in HDFS)  combine_paths()   --hdfs-scratch-dir

See mrjob.hadoop.HadoopJobRunner.__init__() for details.