Welcome to hadoop-manager’s documentation!

Contents:

hadoop-manager

Python wrapper around Hadoop streaming jar.

class hdpmanager.HadoopManager(hadoop_home, hadoop_fs_default_name=None, hadoop_job_tracker=None)

HadoopManager is the central object for managing Hadoop jobs and HDFS.

To ensure proper temporary-directory cleanup, use HadoopManager in a ‘with’ statement:

with HadoopManager(...) as manager:
    pass
Parameters:
  • hadoop_home – home folder of the Hadoop package
  • hadoop_fs_default_name – default HDFS home, used when the provided paths are relative
  • hadoop_job_tracker – Hadoop job tracker host:port
create_job(**kwargs)

Create HadoopJob object

Parameters:
  • input_paths – list of input files for the mapper
  • input_jobs – list of jobs to run before this job; their output is used as this job’s input
  • output_path – path to the output directory; if not provided, a temporary directory is used
  • mapper – import path to the mapper class
  • reducer – import path to the reducer class
  • combiner – import path to the combiner class
  • root_package – import path to the subpackage in your app from which the mapper/reducer/combiner imports start
  • num_reducers – number of reducers
  • conf – object passed to the mapper, reducer and combiner; accessible as self.conf in job objects
  • serialization – dict configuring input, output and internal serialization; valid keys are input, output and inter, valid values are json, pickle and raw
  • job_env – dict defining the job environment; valid options are packages, package_data and requires. If packages is not provided, all packages returned by setuptools.find_packages in root_package are included
  • skip_missing_input_paths – skip input paths with no matching files
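A typical session might look like the following sketch. The paths and job class names (myapp.jobs.WordCountMapper and friends) are hypothetical, and the code needs a reachable Hadoop installation, so it is wrapped in a function and nothing runs on import:

```python
# Hypothetical end-to-end sketch of create_job/run/cat_output; all names
# under myapp.* are assumptions, not part of hdpmanager itself.
def run_word_count(hadoop_home, input_path):
    from hdpmanager import HadoopManager

    with HadoopManager(hadoop_home) as manager:
        job = manager.create_job(
            input_paths=[input_path],
            mapper='myapp.jobs.WordCountMapper',    # hypothetical class
            reducer='myapp.jobs.WordCountReducer',  # hypothetical class
            root_package='myapp',
            num_reducers=1,
            serialization={'input': 'raw', 'inter': 'json', 'output': 'json'},
        )
        job.run()                      # blocks until the job completes
        return list(job.cat_output())  # iterate over the job's output
```

Because output_path is omitted here, the output lands in a temporary directory that the ‘with’ block cleans up on exit.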
fs

HadoopFs object for managing hdfs

class hdpmanager.HadoopJob(hdp_manager, input_paths=None, input_jobs=None, output_path=None, mapper=None, reducer=None, combiner=None, num_reducers=None, serialization=None, conf=None, job_env=None, root_package=None, skip_missing_input_paths=False)

HadoopJob object for managing MapReduce jobs. Create it with the HadoopManager.create_job method.

cat_output()

Returns a generator over mapreduce output

get_output_path()

Returns the path to the output file. Useful when a temporary directory is used.

rm_output()

Remove output dir

run()

Run a mapreduce job

run_async()

Run a MapReduce job in the background. Returns a HadoopCmdPromise object.

class hdpmanager.HadoopFs(hadoop_manager)
cat(path, serializer='raw', tab_separated=False)

Returns a generator over files defined by path

Parameters:
  • path – path to the files
  • serializer – input serializer; options are json, pickle and raw (default)
  • tab_separated – whether the input is tab-separated
exists(path)

Check if file on the path exists

Parameters: path – path to the file
ls(path, recursive=False)

Lists files on the path

Parameters:
  • path – path to the file
  • recursive – list subdirectories recursively
rm(path)

Recursively remove all files on the path

Parameters: path – path to the files
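The HadoopFs methods above can be combined for routine housekeeping. A sketch under the assumption that `manager` is a live HadoopManager (it needs a running cluster, so it is wrapped in a function):

```python
# Hypothetical HDFS housekeeping sketch using HadoopManager.fs;
# `manager` and `path` are supplied by the caller.
def clean_old_output(manager, path):
    fs = manager.fs              # HadoopFs object
    if fs.exists(path):
        # List everything under `path`, including subdirectories.
        for entry in fs.ls(path, recursive=True):
            print(entry)
        fs.rm(path)              # recursively remove all files on the path
```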
class hdpmanager.Mapper(*args, **kwargs)
line_grep()

Override this method to return a compiled regex, a string, or a list of strings (matched with OR) that each mapped line must match.

map(line)

Override this method to map an input line. Output can be either returned or yielded as a (key, value) pair.

Parameters: line – one line of the input file serialized by the input serializer
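A minimal word-count mapper sketch. The try/except fallback is only there so the example runs where hdpmanager is not installed; in a real job you would subclass hdpmanager.Mapper directly:

```python
# Word-count mapper sketch; the fallback base class is illustration only.
try:
    from hdpmanager import Mapper
except ImportError:
    Mapper = object  # stand-in so the sketch runs without hdpmanager

class WordCountMapper(Mapper):
    def map(self, line):
        # Yield a (word, 1) pair for every whitespace-separated token.
        for word in line.split():
            yield word, 1
```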
class hdpmanager.Reducer(input_stream=sys.stdin, output_stream=sys.stdout, conf=None)
reduce(key, values)

Override this method to reduce the input. Output can be either returned or yielded as a (key, value) pair.

Parameters:
  • key – key returned by the mapper
  • values – generator over the values returned by the mapper for this key
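A matching word-count reducer sketch, with the same hedge: the fallback base class only exists so the example runs where hdpmanager is not installed:

```python
# Word-count reducer sketch; the fallback base class is illustration only.
try:
    from hdpmanager import Reducer
except ImportError:
    Reducer = object  # stand-in so the sketch runs without hdpmanager

class WordCountReducer(Reducer):
    def reduce(self, key, values):
        # `values` is a generator over the counts the mapper emitted for `key`.
        yield key, sum(values)
```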
class hdpmanager.Combiner(input_stream=sys.stdin, output_stream=sys.stdout, conf=None)
reduce(key, values)

Override this method to reduce the input. Output can be either returned or yielded as a (key, value) pair.

Parameters:
  • key – key returned by the mapper
  • values – list of values returned by the mapper
class hdpmanager.HadoopCmdPromise(subprocess)
join()

Block until command/job is completed

print_stdout()

Print command’s stdout

yield_stdout()

Yield command’s stdout
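HadoopCmdPromise is what run_async hands back, so the two fit together as in this sketch (wrapped in a function because it needs a real HadoopJob; `job` is supplied by the caller):

```python
# Hypothetical background-run sketch combining HadoopJob.run_async
# with the HadoopCmdPromise methods documented above.
def run_in_background(job):
    promise = job.run_async()  # returns a HadoopCmdPromise immediately
    # ... do other work while the job runs in the background ...
    promise.join()             # block until the job is completed
    promise.print_stdout()     # print the command's stdout
```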
