Config file format and location

We look for mrjob.conf in these locations:

  • The location specified by MRJOB_CONF
  • ~/.mrjob.conf
  • /etc/mrjob.conf
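
The lookup above can be pictured with a short Python sketch. It is not mrjob's own code: the function name find_conf_path_sketch is hypothetical, and the sketch assumes the locations are tried in the order listed and that the first existing file wins.

```python
import os


def find_conf_path_sketch(environ, exists=os.path.exists):
    """Return the first candidate config path that exists, or None.

    Hypothetical helper for illustration; assumes the documented
    locations are tried in the order they are listed.
    """
    candidates = []
    if environ.get('MRJOB_CONF'):
        candidates.append(environ['MRJOB_CONF'])
    candidates.append(os.path.expanduser('~/.mrjob.conf'))
    candidates.append('/etc/mrjob.conf')

    for path in candidates:
        if exists(path):
            return path
    return None


# Example: pretend only the system-wide file exists
fake_files = {'/etc/mrjob.conf'}
found = find_conf_path_sketch({}, exists=lambda p: p in fake_files)
print(found)
```

Passing exists as a parameter keeps the sketch testable without touching the real filesystem.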

If your mrjob.conf path is deprecated, use this table to fix it:

Old Location               New Location
~/.mrjob                   ~/.mrjob.conf
somewhere in PYTHONPATH    Specify in MRJOB_CONF

You can specify one or more configuration files with the --conf-path flag. See Options available to all runners for more information.

The point of mrjob.conf is to let you set up things you want every job to have access to so that you don’t have to think about it. For example:

  • libraries and source code you want to be available for your jobs
  • where temp directories and logs should go
  • security credentials

mrjob.conf is just a YAML- or JSON-encoded dictionary containing default values to pass in to the constructors of the various runner classes. Here’s a minimal mrjob.conf:

runners:
  emr:
    cmdenv:
      TZ: America/Los_Angeles

Now whenever you run mr_your_script.py -r emr, EMRJobRunner will automatically set TZ to America/Los_Angeles in your job’s environment when it runs on EMR.

If you don’t have the yaml module installed, you can use JSON in your mrjob.conf instead (JSON is a subset of YAML, so it’ll still work once you install yaml). Here’s how you’d render the above example in JSON:

{
  "runners": {
    "emr": {
      "cmdenv": {
        "TZ": "America/Los_Angeles"
      }
    }
  }
}
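
To see that the JSON form is just a nested dictionary, it can be loaded with Python's standard json module. This is only an illustration of the data structure, not a description of how mrjob itself reads the file.

```python
import json

# The JSON example from above, as a string
conf_text = '''
{
  "runners": {
    "emr": {
      "cmdenv": {
        "TZ": "America/Los_Angeles"
      }
    }
  }
}
'''

conf = json.loads(conf_text)

# Default values for the emr runner live under runners -> emr
emr_defaults = conf['runners']['emr']
print(emr_defaults['cmdenv']['TZ'])
```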

Precedence and combining options

Options specified on the command line take precedence over mrjob.conf. Usually this means the command-line value simply overrides the one in mrjob.conf. However, we know that cmdenv contains environment variables, so we merge the two sets of variables instead of discarding the ones from mrjob.conf. For example, if your mrjob.conf contained:

runners:
  emr:
    cmdenv:
      PATH: /usr/local/bin
      TZ: America/Los_Angeles

and you ran your job as:

mr_your_script.py -r emr --cmdenv TZ=Europe/Paris --cmdenv PATH=/usr/sbin

We’d automatically combine the two PATH values, and your job’s environment would be:

{'TZ': 'Europe/Paris', 'PATH': '/usr/sbin:/usr/local/bin'}

What’s going on here is that cmdenv is associated with combine_envs(). Each option is associated with a combiner function that merges its values in a way appropriate to that option.
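
The following is a minimal sketch of how a combiner like this could behave, written to reproduce the example above. The name combine_envs_sketch is hypothetical, and the rule it uses (variables whose names end in PATH are joined with a colon, later value first) is an assumption inferred from that example rather than a copy of mrjob's implementation.

```python
def combine_envs_sketch(*envs):
    """Merge environment dicts; later dicts take precedence.

    Assumption for illustration: variables whose names end in PATH are
    joined with ':' (later value first) instead of being overwritten.
    """
    result = {}
    for env in envs:
        for key, value in (env or {}).items():
            if key.endswith('PATH') and key in result:
                result[key] = value + ':' + result[key]
            else:
                result[key] = value
    return result


conf_env = {'PATH': '/usr/local/bin', 'TZ': 'America/Los_Angeles'}
cli_env = {'TZ': 'Europe/Paris', 'PATH': '/usr/sbin'}

# mrjob.conf values come first, command-line values last
combined = combine_envs_sketch(conf_env, cli_env)
print(combined)
```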

Combiner functions can also do useful things like expanding environment variables and globs in paths. For example, you could set:

runners:
  local:
    upload_files: &upload_files
    - $DATA_DIR/*.db
  hadoop:
    upload_files: *upload_files
  emr:
    upload_files: *upload_files

and every time you ran a job, every .db file in $DATA_DIR would automatically be uploaded into your job’s current working directory.

Also, if you specified additional files to upload with --file, those files would be uploaded in addition to the .db files, rather than instead of them.
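
As a rough sketch of this kind of path combining, the function below concatenates lists of paths and expands environment variables and globs using only the standard library. The name combine_path_lists_sketch is hypothetical, and keeping an entry unchanged when a glob matches nothing is an assumption made for the sketch.

```python
import glob
import os
import tempfile


def combine_path_lists_sketch(*path_lists):
    """Concatenate lists of paths, expanding ~, $VARS, and globs."""
    result = []
    for paths in path_lists:
        for path in paths or []:
            expanded = os.path.expanduser(os.path.expandvars(path))
            matches = sorted(glob.glob(expanded))
            # Assumption: keep the expanded entry when nothing matches
            result.extend(matches or [expanded])
    return result


# Demo: a throwaway directory stands in for $DATA_DIR
data_dir = tempfile.mkdtemp()
for name in ('a.db', 'b.db', 'notes.txt'):
    open(os.path.join(data_dir, name), 'w').close()
os.environ['DATA_DIR'] = data_dir

# First list plays the role of mrjob.conf, second the role of --file
uploads = combine_path_lists_sketch(['$DATA_DIR/*.db'], ['extra.txt'])
print([os.path.basename(p) for p in uploads])
```

The command-line entry is appended to the expanded .db files rather than replacing them, which mirrors the behavior described above.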

See Configuration quick reference for the entire dizzying array of configurable options.

Using multiple config files

If you have several standard configurations, you may want to have several config files “inherit” from a base config file. For example, you may have one set of AWS credentials, but two code bases with different default instance sizes. To accomplish this, use the include option:

~/.mrjob.very-large.conf:

include: ~/.mrjob.base.conf
runners:
    emr:
        num_ec2_core_instances: 20
        ec2_core_instance_type: m1.xlarge

~/.mrjob.very-small.conf:

include: $HOME/.mrjob.base.conf
runners:
    emr:
        num_ec2_core_instances: 2
        ec2_core_instance_type: m1.small

~/.mrjob.base.conf:

runners:
    emr:
        aws_access_key_id: HADOOPHADOOPBOBADOOP
        aws_region: us-west-1
        aws_secret_access_key: MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP

Options that are lists, commands, dictionaries, etc. combine the same way they do between the config files and the command line (with combiner functions).

You can use $ENVIRONMENT_VARIABLES and ~/file_in_your_home_dir inside include.

You can inherit from multiple config files by passing include a list instead of a string. Files later in the list take precedence over files earlier in the list. To continue the above examples, this config:

~/.mrjob.everything.conf:

include:
- ~/.mrjob.very-small.conf
- ~/.mrjob.very-large.conf

will be equivalent to this one:

~/.mrjob.everything-2.conf:

runners:
    emr:
        aws_access_key_id: HADOOPHADOOPBOBADOOP
        aws_region: us-west-1
        aws_secret_access_key: MEMIMOMADOOPBANANAFANAFOFADOOPHADOOP
        num_ec2_core_instances: 20
        ec2_core_instance_type: m1.xlarge

In this case, ~/.mrjob.very-large.conf has taken precedence over ~/.mrjob.very-small.conf.
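
The precedence rule can be sketched in plain Python by treating each config file as a dictionary. The function merge_confs_sketch is hypothetical and only handles simple scalar options, where a later value replaces an earlier one; it does not model the per-option combiner functions, and the credentials are left out for brevity.

```python
def merge_confs_sketch(*confs):
    """Merge config dicts left to right; later values win.

    Illustration only: handles scalar options per runner and ignores
    the per-option combiner functions.
    """
    merged = {}
    for conf in confs:
        for runner, opts in conf.get('runners', {}).items():
            merged.setdefault(runner, {}).update(opts)
    return {'runners': merged}


base = {'runners': {'emr': {'aws_region': 'us-west-1'}}}
very_small = {'runners': {'emr': {
    'num_ec2_core_instances': 2,
    'ec2_core_instance_type': 'm1.small',
}}}
very_large = {'runners': {'emr': {
    'num_ec2_core_instances': 20,
    'ec2_core_instance_type': 'm1.xlarge',
}}}

# Each file first inherits from the base file it includes
small = merge_confs_sketch(base, very_small)
large = merge_confs_sketch(base, very_large)

# The later entry in the include list (very-large) wins
everything = merge_confs_sketch(small, large)
print(everything['runners']['emr'])
```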
