For a complete list of changes, see CHANGES.txt
The default runner is now inline instead of local. This change will speed up debugging for many users. Use local if you need to simulate more features of Hadoop.
The EMR tools can now be accessed more easily via the mrjob command. Learn more in the mrjob command's documentation.
Job steps are much richer now.
If you Ctrl+C from the command line, your job will be terminated if you give it time. If you’re running on EMR, that should prevent most accidental runaway jobs.
mrjob v0.4 requires boto 2.2.
We removed all deprecated functionality from v0.2:
We love contributions, so we wrote some guidelines to help you help us. See you on GitHub!
The pool_wait_minutes (--pool-wait-minutes) option lets your job delay itself in case a job flow becomes available. Reference: Configuration quick reference
The JOB and JOB_FLOW cleanup options tell mrjob to clean up the job and/or the job flow on failure (including Ctrl+C). See CLEANUP_CHOICES for more information.
The EMR instance type/number options have changed to support spot instances.
There is also a new ami_version option to change the AMI your job flow uses for its nodes.
For more information, see mrjob.emr.EMRJobRunner.__init__().
The new report_long_jobs tool alerts on jobs that have run for more than X hours.
Support for Combiners
You can now use combiners in your job. Like mapper() and reducer(), you can redefine combiner() in your subclass to add a single combiner step to run after your mapper but before your reducer. (MRWordFreqCount does this to improve performance.) combiner_init() and combiner_final() are similar to their mapper and reducer equivalents.
You can also add combiners to custom steps by adding keyword arguments to your call to steps().
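As a standalone sketch (plain Python, not mrjob itself), this is the aggregation a word-count combiner performs on a single mapper's output before the shuffle:

```python
from collections import Counter

def mapper(line):
    # one (word, 1) pair per word, as a word-frequency mapper emits
    for word in line.split():
        yield word.lower(), 1

def combine(pairs):
    # runs on a single mapper's output, before the shuffle;
    # the reducer later sums these partial counts across mappers
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

split = ["the quick brown fox", "the lazy dog the end"]
pairs = [pair for line in split for pair in mapper(line)]
combined = combine(pairs)
# fewer pairs cross the network than the mapper emitted
```

Because summing is associative and commutative, running the combiner zero or more times doesn't change the reducer's final output, which is the property Hadoop requires of a combiner.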
*_init(), *_final() for mappers, reducers, combiners
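A minimal sketch of the call order these hooks follow (a plain Python driver, not mrjob's actual task runner): the init hook runs once before any input, and the final hook runs once after the last record and may emit accumulated output:

```python
class SumLengths(object):
    # hypothetical task using the mapper_init/mapper/mapper_final pattern
    def mapper_init(self):
        self.total = 0           # set up per-task state once

    def mapper(self, key, line):
        self.total += len(line)  # accumulate; emit nothing per record
        return []

    def mapper_final(self):
        # emit once, after all records in the split are processed
        return [('chars', self.total)]

def run_mapper(task, lines):
    # simplified driver showing when each hook is invoked
    task.mapper_init()
    output = []
    for line in lines:
        output.extend(task.mapper(None, line))
    output.extend(task.mapper_final())
    return output

result = run_mapper(SumLengths(), ['abc', 'defgh'])  # [('chars', 8)]
```

The same ordering applies to combiner_init()/combiner_final() and reducer_init()/reducer_final().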
Custom Option Parsers
It is now possible to define your own option types and actions using a custom OptionParser subclass.
More info: Custom option types
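mrjob's option handling is built on optparse, which lets you register new option types by subclassing Option. A sketch of the general optparse pattern, using a hypothetical comma-separated-list type (names here are illustrative, not mrjob's API):

```python
from copy import copy
from optparse import Option, OptionParser

def check_csv(option, opt, value):
    # turn "a,b,c" into ['a', 'b', 'c']
    return [item.strip() for item in value.split(',')]

class CSVOption(Option):
    # register the new type alongside optparse's built-in ones
    TYPES = Option.TYPES + ('csv',)
    TYPE_CHECKER = copy(Option.TYPE_CHECKER)
    TYPE_CHECKER['csv'] = check_csv

parser = OptionParser(option_class=CSVOption)
parser.add_option('--columns', type='csv', default=[])
options, args = parser.parse_args(['--columns', 'id, name, age'])
# options.columns == ['id', 'name', 'age']
```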
Job Flow Pooling
EMR jobs can pull job flows out of a “pool” of similarly configured job flows. This can make it easier to use a small set of job flows across multiple automated jobs, save time and money while debugging, and generally make your life simpler.
More info: Pooling Job Flows
SSH Log Fetching
mrjob attempts to fetch counters and error logs for EMR jobs via SSH before trying to use S3. This method is faster, more reliable, and works with persistent job flows.
More info: Configuring SSH credentials
New EMR Tool: fetch_logs
If you want to fetch the counters or error logs for a job after the fact, you can use the new fetch_logs tool.
More info: mrjob.tools.emr.fetch_logs
New EMR Tool: mrboss
If you want to run a command on all nodes and inspect the output, perhaps to see what processes are running, you can use the new mrboss tool.
More info: mrjob.tools.emr.mrboss
The search path order for mrjob.conf has changed. The new order is:
- The location specified by MRJOB_CONF
- ~/.mrjob (deprecated)
- mrjob.conf in any directory in PYTHONPATH (deprecated)
If your mrjob.conf path is deprecated, use this table to fix it:
Old Location                         New Location
~/.mrjob                             ~/.mrjob.conf
somewhere in PYTHONPATH              Specify its path in MRJOB_CONF
More info: mrjob.conf
Defining Jobs (MRJob)
Mapper, combiner, and reducer methods no longer need to contain a yield statement if they emit no data.
The --hadoop-*-format switches are deprecated. Instead, set your job’s Hadoop formats with HADOOP_INPUT_FORMAT/HADOOP_OUTPUT_FORMAT or hadoop_input_format()/hadoop_output_format(). Hadoop formats can no longer be set from mrjob.conf.
In addition to --jobconf, you can now set jobconf values with the JOBCONF attribute or the jobconf() method. To read jobconf values back, use mrjob.compat.get_jobconf_value(), which ensures that the correct name is used depending on which version of Hadoop is active.
More info: Hadoop Configuration
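Inside a streaming task, Hadoop exposes jobconf variables as environment variables with dots replaced by underscores. A simplified sketch of such a lookup (a hypothetical helper; mrjob's real get_jobconf_value() additionally translates names between Hadoop versions, which this sketch omits):

```python
import os

def lookup_jobconf(name, environ=None):
    # Hadoop streaming exports e.g. mapreduce.job.id to the
    # task's environment as the variable mapreduce_job_id
    if environ is None:
        environ = os.environ
    return environ.get(name.replace('.', '_'))

fake_env = {'mapreduce_job_id': 'job_201207231430_0001'}
lookup_jobconf('mapreduce.job.id', fake_env)  # 'job_201207231430_0001'
```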
Protocols can now be anything with a read() and write() method. Unlike previous versions of mrjob, they can be instance methods rather than class methods. You should use instance methods when defining your own protocols.
The --*protocol switches and DEFAULT_*PROTOCOL are deprecated. Instead, use the *_PROTOCOL attributes or redefine the *_protocol() methods.
Protocols now cache the decoded values of keys. Informal testing shows up to 30% speed improvements.
More info: Protocols
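Since a protocol is now just an object with read() and write() methods, you can define one as an ordinary class with instance methods. A minimal sketch of a JSON-over-tab protocol (illustrative only, not mrjob's built-in JSONProtocol):

```python
import json

class MyJSONProtocol(object):
    # read() turns one line of input into a (key, value) pair;
    # write() turns a (key, value) pair back into one line
    def read(self, line):
        raw_key, raw_value = line.split('\t', 1)
        return json.loads(raw_key), json.loads(raw_value)

    def write(self, key, value):
        return '%s\t%s' % (json.dumps(key), json.dumps(value))

protocol = MyJSONProtocol()
line = protocol.write('dog', {'legs': 4})  # '"dog"\t{"legs": 4}'
key, value = protocol.read(line)           # ('dog', {'legs': 4})
```

Because the methods are instance methods, a protocol object can carry state (such as the key cache mentioned above) without any class-method gymnastics.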
All runners are Hadoop-version aware and use the correct jobconf and combiner invocation styles. This change should decrease the number of warnings in Hadoop 0.20 environments.
All *_bin configuration options (hadoop_bin, python_bin, and ssh_bin) take lists instead of strings so you can add arguments (like ['python', '-v']). More info: Configuration quick reference
Cleanup options have been split into cleanup and cleanup_on_failure. There are more granular values for both of these options.
Most limitations have been lifted from passthrough options, including the former inability to use custom types and actions. More info: Custom option types
The job_name_prefix option is gone (was deprecated).
All URIs are passed through to Hadoop where possible. This should relax some requirements about what URIs you can use.
Steps with no mapper use cat instead of going through a no-op mapper.
Compressed files can be streamed with the cat() method.
The default Hadoop version on EMR is now 0.20 (was 0.18).
The ec2_instance_type option only sets the instance type for slave nodes when there are multiple EC2 instances. This is because the master node can usually remain small without affecting the performance of the job.
Inline mode now supports the cmdenv option.
Local mode now runs 2 mappers and 2 reducers in parallel by default.
There is preliminary support for simulating some jobconf variables. The current list of supported variables is:
boto 2.0+ is now required.
The Debian packaging has been removed from the repository.