Welcome to hadoop-manager’s documentation!

Contents:

hadoop-manager

Python wrapper around Hadoop streaming jar.

class hdpmanager.HadoopManager(hadoop_home, hadoop_fs_default_name=None, hadoop_job_tracker=None)

HadoopManager is the central object for managing Hadoop jobs and HDFS.

To ensure proper temporary-directory cleanup, use HadoopManager in a ‘with’ statement:

with HadoopManager(...) as manager:
    pass
Parameters:
  • hadoop_home – home folder of the Hadoop package
  • hadoop_fs_default_name – default HDFS home, used when the provided paths are relative
  • hadoop_job_tracker – Hadoop job tracker host:port
create_job(**kwargs)

Create HadoopJob object

Parameters:
  • input_paths – list of input files for the mapper
  • input_jobs – list of jobs to run before this job; their output is used as this job’s input
  • output_path – path to the output directory; if not provided, a temporary directory is used
  • mapper – import path to the mapper class
  • reducer – import path to the reducer class
  • combiner – import path to the combiner class
  • root_package – import path to the subpackage in your app from which the mapper/reducer/combiner imports start
  • num_reducers – number of reducers
  • conf – object passed to the mapper, reducer and combiner; accessible as self.conf in job objects
  • serialization – dict configuring input, output and internal serialization; valid keys are input, output and inter, valid values are json, pickle and raw
  • job_env – dict defining the job environment; valid options are packages, package_data and requires. If packages is not provided, all packages returned by setuptools.find_packages in root_package are included
  • skip_missing_input_paths – skip input paths with no matching files
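A typical session might look like the following sketch. The paths and job class names (myapp.jobs.WordCountMapper and friends) are hypothetical, and the code needs a reachable Hadoop installation, so it is wrapped in a function and nothing runs on import:

```python
# Hypothetical end-to-end sketch of create_job/run/cat_output; all names
# under myapp.* are assumptions, not part of hdpmanager itself.
def run_word_count(hadoop_home, input_path):
    from hdpmanager import HadoopManager

    with HadoopManager(hadoop_home) as manager:
        job = manager.create_job(
            input_paths=[input_path],
            mapper='myapp.jobs.WordCountMapper',    # hypothetical class
            reducer='myapp.jobs.WordCountReducer',  # hypothetical class
            root_package='myapp',
            num_reducers=1,
            serialization={'input': 'raw', 'inter': 'json', 'output': 'json'},
        )
        job.run()                      # blocks until the job completes
        return list(job.cat_output())  # iterate over the job's output
```

Because output_path is omitted here, the output lands in a temporary directory that the ‘with’ block cleans up on exit.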
fs

HadoopFs object for managing hdfs

class hdpmanager.HadoopJob(hdp_manager, input_paths=None, input_jobs=None, output_path=None, mapper=None, reducer=None, combiner=None, num_reducers=None, serialization=None, conf=None, job_env=None, root_package=None, skip_missing_input_paths=False)

HadoopJob object for managing MapReduce jobs. Create it with the HadoopManager.create_job method.

cat_output()

Returns a generator over mapreduce output

get_output_path()

Returns the path to the output file. Useful when a temporary directory is used.

rm_output()

Remove output dir

run()

Run a mapreduce job

run_async()

Run a MapReduce job in the background. Returns a HadoopCmdPromise object.

class hdpmanager.HadoopFs(hadoop_manager)
cat(path, serializer='raw', tab_separated=False)

Returns a generator over files defined by path

Parameters:
  • path – path to the files
  • serializer – input serializer; options are json, pickle and raw (default)
  • tab_separated – whether the input is tab-separated
exists(path)

Check if file on the path exists

Parameters: path – path to the file
ls(path, recursive=False)

Lists files on the path

Parameters:
  • path – path to the file
  • recursive – list subdirectories recursively
rm(path)

Recursively remove all files on the path

Parameters: path – path to the files
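The HadoopFs methods above can be combined for routine housekeeping. A sketch under the assumption that `manager` is a live HadoopManager (it needs a running cluster, so it is wrapped in a function):

```python
# Hypothetical HDFS housekeeping sketch using HadoopManager.fs;
# `manager` and `path` are supplied by the caller.
def clean_old_output(manager, path):
    fs = manager.fs              # HadoopFs object
    if fs.exists(path):
        # List everything under `path`, including subdirectories.
        for entry in fs.ls(path, recursive=True):
            print(entry)
        fs.rm(path)              # recursively remove all files on the path
```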
class hdpmanager.Mapper(*args, **kwargs)
line_grep()

Override this method to return a compiled regex, a string, or a list of strings (matched with OR) that each mapped line must match.

map(line)

Override this method to map an input line. Output can be either returned or yielded as a (key, value) pair.

Parameters: line – one line of the input file serialized by the input serializer
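A minimal word-count mapper sketch. The try/except fallback is only there so the example runs where hdpmanager is not installed; in a real job you would subclass hdpmanager.Mapper directly:

```python
# Word-count mapper sketch; the fallback base class is illustration only.
try:
    from hdpmanager import Mapper
except ImportError:
    Mapper = object  # stand-in so the sketch runs without hdpmanager

class WordCountMapper(Mapper):
    def map(self, line):
        # Yield a (word, 1) pair for every whitespace-separated token.
        for word in line.split():
            yield word, 1
```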
class hdpmanager.Reducer(input_stream=sys.stdin, output_stream=sys.stdout, conf=None)
reduce(key, values)

Override this method to reduce the input. Output can be either returned or yielded as a (key, value) pair.

Parameters:
  • key – key returned by the mapper
  • values – generator over the values returned by the mapper for this key
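A matching word-count reducer sketch, with the same hedge: the fallback base class only exists so the example runs where hdpmanager is not installed:

```python
# Word-count reducer sketch; the fallback base class is illustration only.
try:
    from hdpmanager import Reducer
except ImportError:
    Reducer = object  # stand-in so the sketch runs without hdpmanager

class WordCountReducer(Reducer):
    def reduce(self, key, values):
        # `values` is a generator over the counts the mapper emitted for `key`.
        yield key, sum(values)
```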
class hdpmanager.Combiner(input_stream=sys.stdin, output_stream=sys.stdout, conf=None)
reduce(key, values)

Override this method to reduce the input. Output can be either returned or yielded as a (key, value) pair.

Parameters:
  • key – key returned by the mapper
  • values – list of values returned by the mapper
class hdpmanager.HadoopCmdPromise(subprocess)
join()

Block until command/job is completed

print_stdout()

Print command’s stdout

yield_stdout()

Yield command’s stdout
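HadoopCmdPromise is what run_async hands back, so the two fit together as in this sketch (wrapped in a function because it needs a real HadoopJob; `job` is supplied by the caller):

```python
# Hypothetical background-run sketch combining HadoopJob.run_async
# with the HadoopCmdPromise methods documented above.
def run_in_background(job):
    promise = job.run_async()  # returns a HadoopCmdPromise immediately
    # ... do other work while the job runs in the background ...
    promise.join()             # block until the job is completed
    promise.print_stdout()     # print the command's stdout
```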
