Cookbook

Putting your source tree in the PYTHONPATH

If your job spans multiple files, you can create a tarball of your source tree and use python_archives to have it decompressed and added to the PYTHONPATH:

runners:
  emr:  # this will work for any runner
    python_archives:
    - my-src-tree.tar.gz

It will probably be convenient to have the tarball generated by your build process.
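
As one way to do that, a build step could create the archive with Python's standard tarfile module. This is only a minimal sketch: the helper name build_src_tarball is hypothetical (it is not part of mrjob), and it assumes your packages should sit at the root of the archive so that they are importable once the archive is unpacked and added to the PYTHONPATH.

```python
import os
import tarfile


def build_src_tarball(src_dir, out_path):
    """Pack the contents of src_dir into a gzipped tarball at out_path.

    Hypothetical helper for illustration; not part of mrjob.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        for name in sorted(os.listdir(src_dir)):
            # Store each entry relative to src_dir so it lands at the archive root.
            tar.add(os.path.join(src_dir, name), arcname=name)
    return out_path
```

Calling build_src_tarball("my-src-tree", "my-src-tree.tar.gz") from a build script would then produce the file named in the configuration above, assuming a local directory called my-src-tree.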

Increasing the task timeout

Warning

Some EMR AMIs appear not to support setting parameters like the timeout with jobconf at run time. On those AMIs, you must use Bootstrap-time configuration instead.

If your mappers or reducers take a long time to process a single step, you may want to increase the amount of time Hadoop lets them run before failing them as timeouts. You can do this with jobconf and the version-appropriate Hadoop configuration property. For example, this configuration sets the timeout to one hour (the value is in milliseconds):

runners:
    hadoop: # this will work for both hadoop and emr
        jobconf:
            # Hadoop 0.18
            mapred.task.timeout: 3600000
            # Hadoop 0.21+
            mapreduce.task.timeout: 3600000

mrjob will convert your jobconf options between Hadoop versions if necessary. In this example, either jobconf line could be removed and the timeout would still be changed when using either version of Hadoop.
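
The timeout value is given in milliseconds, so one hour is 60 × 60 × 1000 = 3600000. The sketch below shows that arithmetic for other durations; timeout_ms is a hypothetical helper name used only for illustration and is not part of mrjob or Hadoop.

```python
def timeout_ms(hours=0, minutes=0, seconds=0):
    """Convert a duration to milliseconds, the unit used by the task timeout.

    Hypothetical helper for illustration; not part of mrjob or Hadoop.
    """
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return int(total_seconds * 1000)


print(timeout_ms(hours=1))  # 3600000
```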

Writing compressed output

To save space, you can have Hadoop automatically save your job’s output as compressed files. This is done the same way as changing the task timeout: with jobconf and the appropriate Hadoop configuration properties. This example uses the Hadoop 0.21+ property names:

runners:
    hadoop: # this will work for both hadoop and emr
        jobconf:
            # "true" must be a string argument, not a boolean! (#323)
            mapreduce.output.compress: "true"
            mapreduce.output.compression.codec: org.apache.hadoop.io.compress.GzipCodec
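
With the gzip codec, the output files are ordinary gzip files, so they can be read back with Python's standard gzip module. This is a minimal sketch, assuming an output file has already been copied to the local filesystem and contains UTF-8 text; the helper name read_compressed_output and the file name used in the note below are illustrative assumptions, not part of mrjob.

```python
import gzip


def read_compressed_output(path):
    """Yield lines, without trailing newlines, from one gzip-compressed output file.

    Hypothetical helper for illustration; not part of mrjob.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")
```

For example, iterating over read_compressed_output("part-00000.gz") would yield each output line of that file as a string.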
