What’s New

For a complete list of changes, see CHANGES.txt

0.5.10

Fixed an issue where bootstrapping mrjob on Dataproc or EMR could stall if mrjob was already installed.

The aws_security_token option has been renamed to aws_session_token. If you want to set it via environment variable, you still have to use $AWS_SECURITY_TOKEN because that’s what boto uses.

Added protocol support for rapidjson; see RapidJSONProtocol and RapidJSONValueProtocol. If ujson is not installed, rapidjson (when available) will be used as the default JSON implementation.
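
A minimal sketch of opting into the new protocol explicitly (the job class and mapper are just illustrations; the protocol names are the ones noted above):

from mrjob.job import MRJob
from mrjob.protocol import RapidJSONValueProtocol

class MRLineStats(MRJob):  # hypothetical job
    # write plain JSON values (no keys) encoded with rapidjson
    OUTPUT_PROTOCOL = RapidJSONValueProtocol

    def mapper(self, _, line):
        yield None, {'length': len(line)}

if __name__ == '__main__':
    MRLineStats.run()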

The master bootstrap script on EMR and Dataproc now has the correct file extension (.sh, not .py).

0.5.9

Fixed a bug that prevented setup scripts from working on EMR AMIs 5.2.0 and later. Our workaround should be completely transparent unless you use a custom shell binary; see sh_bin for details.

The EMR runner now correctly restarts the SSH tunnel to the job tracker/resource manager when a cluster it tries to run a job on auto-terminates. It also no longer requires a working SSH tunnel to fetch job progress (you still need working SSH; see ec2_key_pair_file).

The emr_applications option has been renamed to applications.

The terminate-idle-clusters utility is now slightly more robust in cases where your S3 temp directory is in a different region from your clusters.

Finally, there are a couple of changes that probably only matter if you’re trying to wrap your Hadoop tasks (mappers, reducers, etc.) in Docker:

  • You can set just the python binary for tasks with task_python_bin. This allows you to use a wrapper script in place of Python without perturbing setup scripts (see the config sketch after this list).
  • Local mode no longer relies on an absolute path to access the mrjob.cat utility it uses to handle compressed input files; copying the job’s working directory into Docker is enough.
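
For example, a hedged mrjob.conf sketch of the first point (the wrapper path is hypothetical):

runners:
  emr:
    # hypothetical wrapper that ultimately execs Python inside your container
    task_python_bin: /usr/local/bin/docker-python.sh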

0.5.8

You can now pass directories to jobs, either directly with the upload_dirs option, or through setup commands. For example:

--setup 'export PYTHONPATH=$PYTHONPATH:your-src-code/#'

mrjob will automatically tarball these directories and pass them to Hadoop as archives.
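
A mrjob.conf sketch of the first approach (the directory name is only an example):

runners:
  hadoop:
    upload_dirs:
    - your-src-code/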

For multi-step jobs, you can now specify where inter-step output goes with step_output_dir (--step-output-dir), which can be useful for debugging.

All job step types now take the jobconf keyword argument to set Hadoop properties for that step.
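
For example, a sketch that sets a Hadoop property on only the second step of a job (the job class, method names, and property value are illustrative):

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRTwoStepJob(MRJob):  # hypothetical two-step job
    def mapper_get_words(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer_count(self, word, counts):
        yield word, sum(counts)

    def reducer_total(self, word, counts):
        yield word, sum(counts)

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count),
            # per-step Hadoop properties via the jobconf keyword argument
            MRStep(reducer=self.reducer_total,
                   jobconf={'mapreduce.job.reduces': 1}),
        ]

if __name__ == '__main__':
    MRTwoStepJob.run()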

Jobs’ --help printout is now better-organized and less verbose.

Made several fixes to pre-filters (commands that pipe into streaming steps):

mrjob now respects sh_bin when it needs to wrap a command in sh before passing it to Hadoop (e.g. to support pipes).

On EMR, mrjob now fetches logs from task nodes when determining probable cause of error, not just core nodes (the ones that run tasks and host HDFS).

Several unused functions in mrjob.util are now deprecated:

bunzip2_stream() and gunzip_stream() have been moved from mrjob.util to mrjob.cat.

SSHFilesystem.ssh_slave_hosts() has been deprecated.

Option group attributes in MRJobs have been deprecated, as has the get_all_option_groups() method.

0.5.7

Cluster pooling

mrjob can now add up to 1,000 steps on pooled clusters on EMR (except on very old AMIs). mrjob now prints debug messages explaining why your job matched a particular pooled cluster when running in verbose mode (the -v option). Fixed a bug that caused pooling to fail when there was no need for a master bootstrap script (e.g. when running with --no-bootstrap-mrjob).

Other improvements

Log interpretation is much more efficient at determining a job’s probable cause of failure (this works with Spark as well).

When running custom JARs (see JarStep), mrjob now respects libjars and jobconf.

The hadoop_streaming_jar option now supports environment variables and ~.

The terminate-idle-clusters tool now works with all step types, including Spark. (It’s still recommended that you rely on the max_hours_idle option rather than this tool.)

mrjob now works in Anaconda3 Jupyter Notebook.

Bugfixes

Added several missing command-line switches, including --no-bootstrap-python on Dataproc. Made a major refactor that should prevent these kinds of issues in the future.

Fixed a bug that caused mrjob to crash when the ssh binary (see ssh_bin) was missing or not executable.

Fixed a bug that erroneously reported failed or just-started jobs as 100% complete.

Fixed a bug where timestamps were erroneously recognized as URIs. mrjob now only recognizes strings containing :// as URIs (see is_uri()).

Deprecation

The following are deprecated and will be removed in v0.6.0:

0.5.6

Fixed a critical bug that caused the Dataproc runner to always crash when determining the Hadoop version.

Log interpretation now prioritizes task errors (e.g. a traceback from your Python script) as probable cause of failure, even if they aren’t the most recent error. Log interpretation will now continue to download and parse task logs until it finds a non-empty stderr log.

Log interpretation also strips the “subprocess failed” Java stack trace that appears in task stderr logs from Hadoop 1.

0.5.5

Functionally equivalent to 0.5.4, except that it restores the deprecated ami_version option as an alias for image_version, making it easier to upgrade from earlier versions of mrjob.

Also slightly improves EMR cluster pooling with updated information on memory and CPU power of various EC2 instance types, and by treating application names (e.g. “Spark”) as case-insensitive.

0.5.4

Pooling and idle cluster self-termination

Warning

This release accidentally removed the ami_version option instead of merely deprecating it. If you are upgrading from an earlier version of mrjob, use version 0.5.5 or later.

This release resolves a long-standing EMR API race condition that made it difficult to use cluster pooling and idle cluster self-termination (see max_hours_idle) together. Now if your pooled job unknowingly runs on a cluster that was in the process of shutting down, it will detect that and re-launch the job on a different cluster.

This means pretty much everyone running jobs on EMR should now enable pooling, with a configuration like this:

runners:
  emr:
    max_hours_idle: 1
    pool_clusters: true

You may also run the terminate-idle-clusters script periodically, but (barring any bugs) this shouldn’t be necessary.

Generic EMR option names

Many options to the EMR runner have been made more generic, to make it easier to share code with the Dataproc runner (in most cases, the new names are also shorter and easier to remember):

old option name                  new option name
ami_version                      image_version
aws_availability_zone            zone
aws_region                       region
check_emr_status_every           check_cluster_every
ec2_core_instance_bid_price      core_instance_bid_price
ec2_core_instance_type           core_instance_type
ec2_instance_type                instance_type
ec2_master_instance_bid_price    master_instance_bid_price
ec2_master_instance_type         master_instance_type
ec2_slave_instance_type          core_instance_type
ec2_task_instance_bid_price      task_instance_bid_price
ec2_task_instance_type           task_instance_type
emr_tags                         tags
num_ec2_core_instances           num_core_instances
num_ec2_task_instances           num_task_instances
s3_log_uri                       cloud_log_dir
s3_sync_wait_time                cloud_fs_sync_secs
s3_tmp_dir                       cloud_tmp_dir
s3_upload_part_size              cloud_upload_part_size

The old option names and command-line switches are now deprecated but will continue to work until v0.6.0. (Exception: ami_version was accidentally removed; if you need it, use 0.5.5 or later.)

num_ec2_instances has simply been deprecated (it’s just num_core_instances plus one).

hadoop_streaming_jar_on_emr has also been deprecated; in its place, you can now pass a file:// URI to hadoop_streaming_jar to reference a path on the master node.
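
For example (the jar path below only illustrates the file:// form, not a guaranteed location):

runners:
  emr:
    hadoop_streaming_jar: file:///home/hadoop/contrib/streaming/hadoop-streaming.jar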

Log interpretation

Log interpretation (counters and probable cause of job failure) on Hadoop is more robust, handling a wider variety of log4j formats and recovering more gracefully from permissions errors. This includes fixing a crash that could happen on Python 3 when attempting to read data from HDFS.

Log interpretation used to be partially broken on EMR AMI 4.3.0 and later due to a permissions issue; this is now fixed.

pass_through_option()

You can now pass through existing command-line switches to your job; for example, you can tell a job which runner launched it. See pass_through_option() for details.

If you don’t do this, self.options.runner will now always be None in your job (it used to confusingly default to 'inline').
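
A minimal sketch, assuming the 0.5.x configure_options() / pass_through_option() API described above:

from mrjob.job import MRJob

class MRRunnerAware(MRJob):  # hypothetical job
    def configure_options(self):
        super(MRRunnerAware, self).configure_options()
        # forward the existing -r/--runner switch into self.options.runner
        self.pass_through_option('--runner')

    def mapper(self, _, line):
        # e.g. 'emr', 'hadoop', or 'inline' instead of always None
        yield self.options.runner, 1

    def reducer(self, runner, counts):
        yield runner, sum(counts)

if __name__ == '__main__':
    MRRunnerAware.run()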

Stop logging credentials

When mrjob is run in verbose mode (the -v option), the values of all runner options are debug-logged to stderr. This has been the case since the very early days of mrjob.

Unfortunately, this means that if you set your AWS credentials in mrjob.conf, they get logged as well, creating a surprising potential security vulnerability. (This doesn’t happen for AWS credentials set through environment variables.)

Starting in this version, the values of aws_secret_access_key and aws_security_token are shown as '...' if they are set, and all but the last four characters of aws_access_key_id are blanked out as well (e.g. '...YNDR').

Other improvements and bugfixes

The ssh tunnel to the resource manager on EMR (see ssh_tunnel) now connects to its correct internal IP; this resolves a firewall issue that existed on some VPC setups.

Uploaded files will no longer be given names starting with _ or ., since Hadoop’s input processing treats these files as “hidden”.

The EMR idle cluster self-termination script (see max_hours_idle) now only runs on the master node.

The audit-emr-usage command-line tool should no longer constantly trigger throttling warnings.

bootstrap_python no longer bothers trying to install Python 3 on EMR AMI 4.6.0 and later, since it is already installed.

The --ssh-bind-ports command-line switch was broken (starting in 0.4.5!), and is now fixed.

0.5.3

This release adds support for custom libjars (such as nicknack), allowing easy access to custom input and output formats. This works on Hadoop and EMR (including on a cluster that’s already running).

In addition, jobs can specify needed libjars by setting the LIBJARS attribute or overriding the libjars() method. For examples, see Input and output formats.
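
For instance, a sketch using the class attribute (the jar name is a placeholder):

from mrjob.job import MRJob

class MRCustomFormatJob(MRJob):  # hypothetical job
    # placeholder jar name; point this at the library you actually use,
    # or override libjars() to compute the list at runtime
    LIBJARS = ['nicknack-1.0.1.jar']

    def mapper(self, _, line):
        yield None, line

if __name__ == '__main__':
    MRCustomFormatJob.run()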

The Hadoop runner now tries even harder to find your log files without needing additional configuration (see hadoop_log_dirs).

The EMR runner now supports Amazon VPC subnets (see subnet), and, on 4.x AMIs, Application Configurations (see emr_configurations).
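
A hedged mrjob.conf sketch (the subnet ID and configuration values are placeholders; emr_configurations mirrors the EMR API’s Classification/Properties structure):

runners:
  emr:
    subnet: subnet-0123abcd
    emr_configurations:
    - Classification: mapred-site
      Properties:
        mapreduce.map.memory.mb: '1024'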

If your EMR cluster fails during bootstrapping, mrjob can now determine the probable cause of failure.

There are also some minor improvements to SSH tunneling and a handful of small bugfixes; see CHANGES.txt for details.

0.5.2

This release adds basic support for Google Cloud Dataproc, Google’s Hadoop service, which is roughly analogous to EMR. See Dataproc Quickstart. Some features are not yet implemented:

  • fetching counters
  • finding probable cause of errors
  • running Java JARs as steps

Added the emr_applications option, which helps you configure 4.x AMIs.

Fixed an EMR bug (introduced in v0.5.0) where we were waiting for steps to complete in the wrong order (in a multi-step job, we wouldn’t register that the first step had finished until the last one had).

Fixed a bug in SSH tunneling (introduced in v0.5.0) that made connections to the job tracker/resource manager on EMR time out when running on a 2.x AMI inside a VPC (Virtual Private Cloud).

Fixed a bug (introduced in v0.4.6) that kept mrjob from interpreting ~ (home directory) in includes in mrjob.conf.

It is now again possible to run tool modules deprecated in v0.5.0 directly (e.g. python -m mrjob.tools.emr.create_job_flow). This is still a deprecated feature; it’s recommended that you use the appropriate mrjob subcommand instead (e.g. mrjob create-cluster).

0.5.1

Fixes a bug in the previous release that broke SORT_VALUES and any other attempt by the job to set the partitioner. The --partitioner switch is now deprecated (the choice of partitioner is part of your job semantics).

Fixes a bug in the previous release that caused strict_protocols and check_input_paths to be ignored in mrjob.conf. (We would much prefer you fixed jobs that are using “loose protocols” rather than setting strict_protocols: false in your config file, but we didn’t break this on purpose, we promise!)

mrjob terminate-idle-clusters now correctly handles EMR debugging steps (see enable_emr_debugging) set up by boto 2.40.0.

Fixed a bug that could result in showing a blank probable cause of error for pre-YARN (Hadoop 1) jobs.

ssh_bind_ports now defaults to a range object (xrange on Python 2), so that when you run on emr in verbose mode (-r emr -v), debug logging devotes one line to the value of ssh_bind_ports rather than 840.

0.5.0

Python versions

mrjob now fully supports Python 3.3+ in a way that should be transparent to existing Python 2 users (you don’t have to suddenly start handling unicode instead of str). For more information, see Python 2 vs. Python 3.

If you run a job with Python 3, mrjob will automatically install Python 3 on Elastic MapReduce AMIs (see bootstrap_python).

When you run jobs on EMR in Python 2, mrjob attempts to match your minor version of Python as well (either python2.6 or python2.7); see python_bin for details.

Note

If you’re currently running Python 2.7, and using yum to install python libraries, you’ll want to use the Python 2.7 version of the package (e.g. python27-numpy rather than python-numpy).
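
For example, a bootstrap sketch along those lines (package names will vary with what your job needs):

runners:
  emr:
    bootstrap:
    - sudo yum install -y python27-numpy python27-scipy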

The mrjob command is now installed with Python-version-specific aliases (e.g. mrjob-3, mrjob-3.4), in case you install mrjob for multiple versions of Python.

Hadoop

mrjob should now work out of the box on almost any Hadoop setup. If hadoop is in your path, or you set any commonly-used $HADOOP_* environment variable, mrjob will find the Hadoop binary, the streaming jar, and your logs, without any help on your part (see hadoop_bin, hadoop_log_dirs, hadoop_streaming_jar).

mrjob has been updated to fully support Hadoop 2 (YARN), including many updates to HadoopFilesystem. Hadoop 1 is still supported, though anything prior to Hadoop 0.20.203 is not (mrjob is actually a few months older than Hadoop 0.20.203, so this used to matter).

3.x and 4.x AMIs

mrjob now fully supports the 3.x and 4.x Elastic MapReduce AMIs, including SSH tunneling to the resource manager, fetching counters, and finding probable cause of job failure.

The default ami_version (see image_version) is now 3.11.0. Our plan is to continue updating this to the latest (non-broken) 3.x AMI for each 0.5.x release of mrjob.

The default instance_type is now m1.medium (m1.small is too small for the 3.x and 4.x AMIs).

You can specify 4.x AMIs with either the new release_label option, or continue using ami_version; both work.

mrjob continues to support 2.x AMIs. However:

Warning

2.x AMIs are deprecated by AWS, and based on a very old version of Debian (squeeze), which breaks apt-get and exposes you to security holes.

Please, please switch if you haven’t already.

AWS Regions

The new default aws_region (see region) is us-west-2 (Oregon). This both matches the default in the EMR console and, according to Amazon, is carbon neutral.

An edge case that might affect you: EC2 key pairs (i.e. SSH credentials) are region-specific, so if you’ve set up SSH but not explicitly specified a region, you may get an error saying your key pair is invalid. The fix is simply to create new SSH keys for the us-west-2 (Oregon) region.

S3

mrjob is much smarter about the way it interacts with S3:
  • automatically creates temp bucket in the same region as jobs
  • connects to S3 buckets on the endpoint matching their region (no more 307 errors)
    • EMRJobRunner and S3Filesystem methods no longer take s3_conn args (passing around a single S3 connection no longer makes sense)
  • no longer uses the temp bucket’s location to choose where you run your job
  • rm() no longer has special logic for *_$folder$ keys
  • ls() recurses “subdirectories” even if you pass it a URI without a trailing slash

Log interpretation

The part of mrjob that fetches counters and tells you what probably caused your job to fail was basically unmaintainable and has been totally rewritten. Not only do we now have solid support across Hadoop and EMR AMI versions, but if we missed anything, it should be straightforward to add it.

One casualty of this change was the mrjob fetch-logs command, which means mrjob no longer offers a way to fetch or interpret logs from a past job. We do plan to re-introduce this functionality.

Protocols

Protocols are now strict by default (they simply raise an exception on unencodable data). “Loose” protocols can be re-enabled with the --no-strict-protocols switch; see strict_protocols for why this is a bad idea.

Protocols will now use the much faster ujson library, if installed, to encode and decode JSON. This is especially recommended for simple jobs that spend a significant fraction of their time encoding and decoding data.

Note

If you’re using EMR, try out this bootstrap recipe to install ujson.

mrjob will fall back to the simplejson library if ujson is not installed, and use the built-in json module if neither is installed.

You can now explicitly specify which JSON implementation you wish to use (e.g. StandardJSONProtocol, SimpleJSONProtocol, UltraJSONProtocol).
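
For example, a sketch that pins the internal protocol to the standard library’s json module (the job class and mapper are illustrative; the protocol name is one of those listed above):

from mrjob.job import MRJob
from mrjob.protocol import StandardJSONProtocol

class MRPortableJob(MRJob):  # hypothetical job
    # always use the built-in json module between steps, even if ujson
    # or simplejson is installed
    INTERNAL_PROTOCOL = StandardJSONProtocol

    def mapper(self, _, line):
        yield line[:1], 1

    def reducer(self, key, counts):
        yield key, sum(counts)

if __name__ == '__main__':
    MRPortableJob.run()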

Status messages

We’ve tried to cut the logging messages that your job prints as it runs down to the basics (either useful info, like where a temp directory is, or something that tells you why you’re waiting). If there are any messages you miss, try running your job with -v.

When a step in your job fails, mrjob no longer prints a useless stacktrace telling you where in the code the runner raised an exception about your step failing. This is thanks to StepFailedException, which you can also catch and interpret if you’re running jobs programmatically.

Deprecation

Many things that were deprecated in 0.4.6 have been removed:

mrjob.compat functions supports_combiners_in_hadoop_streaming(), supports_new_distributed_cache_options(), and uses_generic_jobconf(), which only existed to support very old versions of Hadoop, were removed without deprecation warnings (sorry!).

To avoid a similar wave of deprecation warnings in the future, the name of every part of mrjob that isn’t meant to be a stable interface provided by the library now starts with an underscore. You can still use these things (or copy them; it’s Open Source), but there’s no guarantee they’ll exist in the next release.

If you want to get ahead of the game, here is a list of things that are deprecated starting in mrjob 0.5.0 (do these after upgrading mrjob):

  • mrjob subcommands:
    • mrjob create-job-flow is now mrjob create-cluster
    • mrjob terminate-idle-job-flows is now mrjob terminate-idle-clusters
    • mrjob terminate-job-flow is now mrjob terminate-cluster

Other changes

  • mrjob now requires boto 2.35.0 or newer (chances are you’re already doing this). Later 0.5.x releases of mrjob may require newer versions of boto.
  • visible_to_all_users now defaults to True
  • HadoopFilesystem.rm() uses -skipTrash
  • new iam_endpoint option
  • custom hadoop_streaming_jars are properly uploaded
  • JOB cleanup on EMR is temporarily disabled
  • mrjob now follows symlinks when ls()ing the local filesystem (beware recursive symlinks!)
  • The interpreter option disables bootstrap_mrjob by default (interpreter is meant for non-Python jobs)
  • cluster pooling now respects ec2_key_pair
  • cluster self-termination (see max_hours_idle) now respects non-streaming jobs
  • LocalFilesystem now rejects URIs rather than interpreting them as local paths
  • local and inline runners no longer have a default hadoop_version, instead handling jobconf in a version-agnostic way
  • steps_python_bin now defaults to the current Python interpreter.
  • minor changes to mrjob.util:
    • file_ext() takes filename, not path
    • gunzip_stream() now yields chunks of bytes, not lines
    • moved random_identifier() method here from mrjob.aws
    • buffer_iterator_to_line_iterator() is now named to_lines(), and no longer appends a trailing newline to data.

0.4.6

include: in conf files can now use relative paths in a meaningful way. See Relative includes.

List and environment variable options loaded from included config files can be totally overridden using the !clear tag. See Clearing configs.

Options that take lists (e.g. setup) now treat scalar values as single-item lists. See this example.

Fixed a bug that kept the pool_wait_minutes option from being loaded from config files.

0.4.5

This release moves mrjob off the deprecated DescribeJobFlows EMR API call.

Warning

AWS again broke older versions of mrjob for at least some new accounts, by returning 400s for the deprecated DescribeJobFlows API call. If you have a newer AWS account (circa July 2015), you must use at least this version of mrjob.

The new API does not provide a way to tell when a job flow (now called a “cluster”) stopped provisioning instances and started bootstrapping, so the clock for our estimate of when we are close to the end of a billing hour now starts at cluster creation time; the estimates are thus more conservative.

Related to this change, terminate_idle_job_flows no longer considers job flows in the STARTING state idle; use report_long_jobs to catch jobs stuck in this state.

terminate_idle_job_flows performs much better on large numbers of job flows. Formerly, it collected all job flow information first, but now it terminates idle job flows as soon as it identifies them.

collect_emr_stats and job_flow_pool have not been ported to the new API and will be removed in v0.5.0.

Added an aws_security_token option to allow you to run mrjob on EMR using temporary AWS credentials.

Added an emr_tags (see tags) option to allow you to tag EMR job flows at creation time.

EMRJobRunner now has a get_ami_version() method.

The hadoop_version option no longer has any effect in EMR. This option only ever did anything on the 1.x AMIs, which mrjob no longer supports.

Added many missing switches to the EMR tools (accessible from the mrjob command). Formerly, you had to use a config file to get at these options.

You can now access the mrboss tool from the command line: mrjob boss <args>.

Previous 0.4.x releases have worked with boto as old as 2.2.0, but this one requires at least boto 2.6.0 (which is still more than two years old). In any case, it’s recommended that you just use the latest version of boto.

This branch has a number of additional deprecation warnings, to help prepare you for mrjob v0.5.0. Please heed them; a lot of deprecated things really are going to be completely removed.

0.4.4

mrjob now automatically creates and uses IAM objects as necessary to comply with new requirements from Amazon Web Services.

(You do not need to install the AWS CLI or run aws emr create-default-roles as Amazon’s documentation describes; mrjob takes care of this for you.)

Warning

The change that AWS made essentially broke all older versions of mrjob for all new accounts. If the first time your AWS account created an Elastic MapReduce cluster was on or after April 6, 2015, you should use at least this version of mrjob.

If you must use an old version of mrjob with a new AWS account, see this thread for a possible workaround.

--iam-job-flow-role has been renamed to --iam-instance-profile.

New --iam-service-role option.

0.4.3

This release contains many, many bugfixes, one of which probably affects you! See CHANGES.txt for details.

Added a new subcommand, mrjob collect-emr-active-stats, to collect stats about active jobflows and instance counts.

--iam-job-flow-role option allows setting of a specific IAM role to run this job flow.

You can now use --check-input-paths and --no-check-input-paths on EMR as well as Hadoop.

Files larger than 100MB will be uploaded to S3 using multipart upload if you have the filechunkio module installed. You can change the limit/part size with the --s3-upload-part-size option, or disable multipart upload by setting this option to 0.

You can now require protocols to be strict from mrjob.conf; this means unencodable input/output will result in an exception rather than the job quietly incrementing a counter. It is recommended you set this for all runners:

runners:
  emr:
    strict_protocols: true
  hadoop:
    strict_protocols: true
  inline:
    strict_protocols: true
  local:
    strict_protocols: true

You can use --no-strict-protocols to turn off strict protocols for a particular job.

Tests now support pytest and tox.

Support for Python 2.5 has been dropped.

0.4.2

JarSteps, previously experimental, are now fully integrated into multi-step jobs, and work with both the Hadoop and EMR runners. You can now use powerful Java libraries such as Mahout in your MRJobs. For more information, see Jar steps.

Many options for setting up your task’s environment (--python-archive, --setup-cmd and --setup-script) have been replaced by a powerful --setup option. See the Job Environment Setup Cookbook for examples.

Similarly, many options for bootstrapping nodes on EMR (--bootstrap-cmd, --bootstrap-file, --bootstrap-python-package and --bootstrap-script) have been replaced by a single --bootstrap option. See the EMR Bootstrapping Cookbook.

This release also contains many bugfixes, including problems with boto 2.10.0+, bz2 decompression, and Python 2.5.

0.4.1

The SORT_VALUES option enables secondary sort, ensuring that your reducer(s) receive values in sorted order. This allows you to do things with reducers that would otherwise involve storing all the values in memory, such as:

  • Receiving a grand total before any subtotals, so you can calculate percentages on the fly. See mr_next_word_stats.py for an example.
  • Running a window of fixed length over an arbitrary amount of sorted values (e.g. a 24-hour window over timestamped log data).
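
Enabling it is a single class attribute; a minimal sketch:

from mrjob.job import MRJob

class MRSortedValues(MRJob):  # hypothetical job
    # ensure each reducer call receives its values in sorted order
    SORT_VALUES = True

    def mapper(self, _, line):
        fields = line.split('\t')
        if len(fields) >= 2:
            yield fields[0], fields[1]

    def reducer(self, key, values):
        # values arrive sorted, so running windows or grand-total-first
        # layouts can be computed without buffering everything in memory
        for value in values:
            yield key, value

if __name__ == '__main__':
    MRSortedValues.run()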

The max_hours_idle option allows you to spin up EMR job flows that will terminate themselves after being idle for a certain amount of time, in a way that optimizes EMR/EC2’s full-hour billing model.

For development (not production), we now recommend always using job flow pooling, with max_hours_idle enabled. Update your mrjob.conf like this:

runners:
  emr:
    max_hours_idle: 0.25
    pool_emr_job_flows: true

Warning

If you enable pooling without max_hours_idle (or cronning terminate_idle_job_flows), pooled job flows will stay active forever, costing you money!

You can now use --no-check-input-paths with the Hadoop runner to allow jobs to run even if hadoop fs -ls can’t see their input files (see check_input_paths).

Two bits of straggling deprecated functionality were removed:

  • Built-in protocols must be instantiated to be used (formerly they had class methods).
  • Old locations for mrjob.conf are no longer supported.

This version also contains numerous bugfixes and natural extensions of existing functionality; many more things will now Just Work (see CHANGES.txt).

0.4.0

The default runner is now inline instead of local. This change will speed up debugging for many users. Use local if you need to simulate more features of Hadoop.

The EMR tools can now be accessed more easily via the mrjob command. Learn more here.

Job steps are much richer now:

  • You can now use mrjob to run jar steps other than Hadoop Streaming. More info
  • You can filter step input with UNIX commands. More info
  • In fact, you can use arbitrary UNIX commands as your whole step (mapper/reducer/combiner). More info

If you Ctrl+C from the command line, your job will be terminated if you give it time. If you’re running on EMR, that should prevent most accidental runaway jobs. More info

mrjob v0.4 requires boto 2.2.

We removed all deprecated functionality from v0.2:

  • --hadoop-*-format
  • --*-protocol switches
  • MRJob.DEFAULT_*_PROTOCOL
  • MRJob.get_default_opts()
  • MRJob.protocols()
  • PROTOCOL_DICT
  • IF_SUCCESSFUL
  • DEFAULT_CLEANUP
  • S3Filesystem.get_s3_folder_keys()

We love contributions, so we wrote some guidelines to help you help us. See you on Github!

0.3.5

The pool_wait_minutes (--pool-wait-minutes) option lets your job delay itself in case a job flow becomes available. Reference: Configuration quick reference

The JOB and JOB_FLOW cleanup options tell mrjob to clean up the job and/or the job flow on failure (including Ctrl+C). See CLEANUP_CHOICES for more information.

0.3.2

The EMR instance type/number options have changed to support spot instances:

  • core_instance_bid_price
  • core_instance_type
  • master_instance_bid_price
  • master_instance_type
  • slave_instance_type (alias for core_instance_type)
  • task_instance_bid_price
  • task_instance_type

There is also a new ami_version option to change the AMI your job flow uses for its nodes.

For more information, see mrjob.emr.EMRJobRunner.__init__().

The new report_long_jobs tool alerts on jobs that have run for more than X hours.

0.3

Features

Support for Combiners

You can now use combiners in your job. Like mapper() and reducer(), you can redefine combiner() in your subclass to add a single combiner step to run after your mapper but before your reducer. (MRWordFreqCount does this to improve performance.) combiner_init() and combiner_final() are similar to their mapper and reducer equivalents.

You can also add combiners to custom steps by adding keyword arguments to your call to steps().

More info: One-step jobs, Multi-step jobs
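
A minimal word-frequency sketch in that spirit (similar in shape to MRWordFreqCount):

import re

from mrjob.job import MRJob

WORD_RE = re.compile(r"[\w']+")

class MRWordFreq(MRJob):  # hypothetical job
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        # pre-sum counts on the map side to cut down shuffle volume
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordFreq.run()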

*_init(), *_final() for mappers, reducers, combiners

Mappers, reducers, and combiners have *_init() and *_final() methods that are run before and after the input is run through the main function (e.g. mapper_init() and mapper_final()).

More info: One-step jobs, Multi-step jobs
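
For example, a sketch that batches per-task counts in mapper_init() and flushes them in mapper_final():

from mrjob.job import MRJob

class MRBatchedWordCount(MRJob):  # hypothetical job
    def mapper_init(self):
        # per-task state, set up before any input is processed
        self.counts = {}

    def mapper(self, _, line):
        for word in line.split():
            self.counts[word] = self.counts.get(word, 0) + 1

    def mapper_final(self):
        # emit the accumulated counts once all input has been seen
        for word, count in self.counts.items():
            yield word, count

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRBatchedWordCount.run()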

Custom Option Parsers

It is now possible to define your own option types and actions using a custom OptionParser subclass.

More info: Custom option types

Job Flow Pooling

EMR jobs can pull job flows out of a “pool” of similarly configured job flows. This can make it easier to use a small set of job flows across multiple automated jobs, save time and money while debugging, and generally make your life simpler.

More info: Pooling Clusters

SSH Log Fetching

mrjob attempts to fetch counters and error logs for EMR jobs via SSH before trying to use S3. This method is faster, more reliable, and works with persistent job flows.

More info: Configuring SSH credentials

New EMR Tool: fetch_logs

If you want to fetch the counters or error logs for a job after the fact, you can use the new fetch_logs tool.

More info: mrjob.tools.emr.fetch_logs

New EMR Tool: mrboss

If you want to run a command on all nodes and inspect the output, perhaps to see what processes are running, you can use the new mrboss tool.

More info: mrjob.tools.emr.mrboss

Changes and Deprecations

Configuration

The search path order for mrjob.conf has changed. The new order is:

  • The location specified by MRJOB_CONF
  • ~/.mrjob.conf
  • ~/.mrjob (deprecated)
  • mrjob.conf in any directory in PYTHONPATH (deprecated)
  • /etc/mrjob.conf

If your mrjob.conf path is deprecated, use this table to fix it:

Old Location               New Location
~/.mrjob                   ~/.mrjob.conf
somewhere in PYTHONPATH    specify in MRJOB_CONF

More info: mrjob.conf

Defining Jobs (MRJob)

Mapper, combiner, and reducer methods no longer need to contain a yield statement if they emit no data.

The --hadoop-*-format switches are deprecated. Instead, set your job’s Hadoop formats with HADOOP_INPUT_FORMAT/HADOOP_OUTPUT_FORMAT or hadoop_input_format()/hadoop_output_format(). Hadoop formats can no longer be set from mrjob.conf.

In addition to --jobconf, you can now set jobconf values with the JOBCONF attribute or the jobconf() method. To read jobconf values back, use mrjob.compat.jobconf_from_env(), which ensures that the correct name is used depending on which version of Hadoop is active.
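
For instance, a sketch combining the JOBCONF attribute with jobconf_from_env() (the property names shown are the Hadoop 2 spellings; mrjob.compat translates them for the active Hadoop version):

from mrjob.compat import jobconf_from_env
from mrjob.job import MRJob

class MRInputFileCount(MRJob):  # hypothetical job
    # set a Hadoop property for the whole job via the JOBCONF attribute
    JOBCONF = {'mapreduce.job.reduces': 2}

    def mapper(self, _, line):
        # look up which input file this task is reading; jobconf_from_env()
        # uses the correct property name for the running Hadoop version
        input_file = jobconf_from_env('mapreduce.map.input.file')
        yield input_file, 1

    def reducer(self, input_file, counts):
        yield input_file, sum(counts)

if __name__ == '__main__':
    MRInputFileCount.run()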

You can now set the Hadoop partitioner class with --partitioner, the PARTITIONER attribute, or the partitioner() method.

More info: Hadoop configuration

Protocols

Protocols can now be anything with a read() and write() method. Unlike previous versions of mrjob, they can be instance methods rather than class methods. You should use instance methods when defining your own protocols.
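
A minimal sketch of such a protocol with instance methods (tab-separated keys and values are just an illustration):

class TabValueProtocol(object):
    """Hypothetical protocol: key and value as tab-separated strings."""

    def read(self, line):
        key, _, value = line.partition('\t')
        return key, value

    def write(self, key, value):
        return '%s\t%s' % (key, value)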

The --*protocol switches and DEFAULT_*PROTOCOL are deprecated. Instead, use the *_PROTOCOL attributes or redefine the *_protocol() methods.

Protocols now cache the decoded values of keys. Informal testing shows up to 30% speed improvements.

More info: Protocols

Running Jobs

All Modes

All runners are Hadoop-version aware and use the correct jobconf and combiner invocation styles. This change should decrease the number of warnings in Hadoop 0.20 environments.

All *_bin configuration options (hadoop_bin, python_bin, and ssh_bin) take lists instead of strings so you can add arguments (like ['python', '-v']). More info: Configuration quick reference

Cleanup options have been split into cleanup and cleanup_on_failure. There are more granular values for both of these options.

Most limitations have been lifted from passthrough options, including the former inability to use custom types and actions. More info: Custom option types

The job_name_prefix option is gone (was deprecated).

All URIs are passed through to Hadoop where possible. This should relax some requirements about what URIs you can use.

Steps with no mapper use cat instead of going through a no-op mapper.

Compressed files can be streamed with the cat() method.

EMR Mode

The default Hadoop version on EMR is now 0.20 (was 0.18).

The instance_type option only sets the instance type for slave nodes when there are multiple EC2 instances. This is because the master node can usually remain small without affecting the performance of the job.

Inline Mode

Inline mode now supports the cmdenv option.

Local Mode

Local mode now runs 2 mappers and 2 reducers in parallel by default.

There is preliminary support for simulating some jobconf variables. The current list of supported variables is:

  • mapreduce.job.cache.archives
  • mapreduce.job.cache.files
  • mapreduce.job.cache.local.archives
  • mapreduce.job.cache.local.files
  • mapreduce.job.id
  • mapreduce.job.local.dir
  • mapreduce.map.input.file
  • mapreduce.map.input.length
  • mapreduce.map.input.start
  • mapreduce.task.attempt.id
  • mapreduce.task.id
  • mapreduce.task.ismap
  • mapreduce.task.output.dir
  • mapreduce.task.partition

Other Stuff

boto 2.0+ is now required.

The Debian packaging has been removed from the repository.