If your job spans multiple files, you can create a tarball of your source tree and use python_archives to have it decompressed and added to the PYTHONPATH:
    runners:
      emr:  # this will work for any runner
        python_archives:
        - my-src-tree.tar.gz
It will probably be convenient to have the tarball generated by your build process.
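If your build is driven from Python, the tarball can be produced with the standard-library tarfile module. This is a minimal sketch, where build_source_archive and the src_dir argument are hypothetical names, not part of mrjob:

```python
import os
import tarfile

def build_source_archive(src_dir, archive_name="my-src-tree.tar.gz"):
    """Pack the contents of src_dir into a gzipped tarball.

    Each entry is added at the top level of the archive, so that
    extracting it puts your modules directly on the PYTHONPATH.
    """
    with tarfile.open(archive_name, "w:gz") as tar:
        for entry in sorted(os.listdir(src_dir)):
            tar.add(os.path.join(src_dir, entry), arcname=entry)
    return archive_name
```

An equivalent shell command would be `tar -C src -czf my-src-tree.tar.gz .`.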
Some EMR AMIs appear not to support setting parameters such as the task timeout with jobconf at run time. On those AMIs you must instead set them with bootstrap-time configuration.
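Bootstrap-time configuration can be done through Amazon's configure-hadoop bootstrap action. The following is an unverified sketch, assuming a version of mrjob whose bootstrap_actions option accepts a script path followed by its arguments; check your mrjob and EMR documentation for the exact form:

```yaml
runners:
  emr:
    bootstrap_actions:
    # -m sets a mapred-site property when the cluster bootstraps
    - s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.task.timeout=3600000
```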
If your mappers or reducers take a long time to process a single input, you may want to increase the amount of time Hadoop allows them to run before failing them as timed out. You can do this with jobconf and the version-appropriate Hadoop configuration property (jobconf options are Hadoop configuration properties, not environment variables). For example, this configuration sets the timeout to one hour (3600000 milliseconds):
    runners:
      hadoop:  # this will work for both hadoop and emr
        jobconf:
          # Hadoop 0.18
          mapred.task.timeout: 3600000
          # Hadoop 0.21+
          mapreduce.task.timeout: 3600000
mrjob converts your jobconf options between Hadoop versions as necessary. In this example, either jobconf line could be removed, and the timeout would still be changed on either version of Hadoop.
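This version translation amounts to a table of equivalent option names. The sketch below illustrates the idea only; it is not mrjob's actual implementation, and the alias table covers just the options mentioned in this section:

```python
# Map Hadoop 0.21+ option names to their pre-0.21 equivalents.
# Illustrative only; covers just the options discussed above.
NEW_TO_OLD = {
    "mapreduce.task.timeout": "mapred.task.timeout",
    "mapreduce.output.compress": "mapred.output.compress",
}
OLD_TO_NEW = {old: new for new, old in NEW_TO_OLD.items()}

def translate_jobconf(jobconf, hadoop_version):
    """Rename jobconf keys to suit a Hadoop version string such as
    "0.18" or "0.21"; unrecognized keys pass through unchanged."""
    version = tuple(int(part) for part in hadoop_version.split(".")[:2])
    table = NEW_TO_OLD if version < (0, 21) else OLD_TO_NEW
    return {table.get(key, key): value for key, value in jobconf.items()}
```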
To save space, you can have Hadoop automatically save your job's output as compressed files. This works the same way as changing the task timeout: set the appropriate configuration properties with jobconf. This example uses the Hadoop 0.21+ option names:
    runners:
      hadoop:  # this will work for both hadoop and emr
        jobconf:
          # "true" must be a string argument, not a boolean! (#323)
          mapreduce.output.compress: "true"
          mapreduce.output.compression.codec: org.apache.hadoop.io.compress.GzipCodec
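Output written with GzipCodec is ordinary gzip data, so you can read the resulting part files with Python's standard gzip module. A minimal sketch, where read_compressed_output is a hypothetical helper name:

```python
import glob
import gzip
import os

def read_compressed_output(output_dir):
    """Yield one line at a time from gzip-compressed Hadoop output
    files (part-*.gz) in output_dir, in sorted filename order."""
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*.gz"))):
        with gzip.open(path, "rt") as lines:
            for line in lines:
                yield line.rstrip("\n")
```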