turntable package

Submodules

turntable.press module

The press module is used to create Record Collections.

class turntable.press.Record(**kwargs)

Bases: turntable.press.RecordSetter, turntable.press.SeriesLoader

Record is a container object with the special property “series”. Any property added to a Record is also added to its pandas.Series.

series : pandas.Series
container for parameters set to the instance

load : assigns the items of **kwargs to the class and to the series parameter
_set_attributes : assigns the items of a dictionary to the class and to the series parameter
runMethod : runs a method by a string call

Let's see how to add a property to a Record object:

>>> record = Record(first_item = 'one')
>>> record.second_item = 'two'
>>> print(record.series)
mint = True
class turntable.press.RecordPress(pickle=True, pickle_path='./tmp')

Bases: object

This class automatically serializes any attribute assigned to an instance and clears it from memory. When an attribute is accessed via the dot operator, it is read back from disk.

pickle : Boolean [True]
if False, the instance will behave as a normal class
pickle_path : string [‘./tmp’]
the path under which the files will be stored
clean_disk()
deletes all files stored by the instance
clean_memory()
sets the in memory attribute values to None reducing the memory footprint

This class can be subclassed for use elsewhere:

>>> import turntable
>>> from turntable.press import RecordPress
>>>
>>> class NewClass(RecordPress):
...
...     def __init__(self, pickle=True, pickle_path='./tmp'):
...         self.pickle = pickle
...         self.class_path = turntable.utils.path_to_filename(pickle_path + '/' + self.__class__.__name__)[0]
...         self.pickles = []
...
>>> newClass = NewClass()
>>> newClass.x = 10
>>> y = newClass.x
>>> newClass.clean_disk()
clean_disk()

Removes all files and folders under the class_path directory.

clean_memory()

Sets the values of all attributes listed in pickles to None. This is already the default behaviour, so this method is redundant.
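
The behaviour described for RecordPress (attributes written to disk on assignment and read back on access) can be illustrated with a minimal sketch. This is not the turntable implementation: the class name DiskBackedAttributes, the ".pkl" file naming, and the internals are hypothetical, and only the standard library is used.

```python
import os
import pickle
import shutil
import tempfile


class DiskBackedAttributes(object):
    """Hypothetical stand-in for the RecordPress idea, not the real class."""

    def __init__(self, pickle_path):
        # Bypass our own __setattr__ for internal bookkeeping.
        object.__setattr__(self, "class_path",
                           os.path.join(pickle_path, type(self).__name__))
        object.__setattr__(self, "pickles", [])
        os.makedirs(self.class_path)

    def __setattr__(self, name, value):
        # Serialize the value to disk and keep only its name in memory.
        with open(os.path.join(self.class_path, name + ".pkl"), "wb") as f:
            pickle.dump(value, f)
        if name not in self.pickles:
            self.pickles.append(name)

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for pickled attributes.
        path = os.path.join(self.class_path, name + ".pkl")
        if not os.path.exists(path):
            raise AttributeError(name)
        with open(path, "rb") as f:
            return pickle.load(f)

    def clean_disk(self):
        # Remove every file stored by this instance.
        shutil.rmtree(self.class_path, ignore_errors=True)
        del self.pickles[:]


tmp_root = tempfile.mkdtemp()
obj = DiskBackedAttributes(tmp_root)
obj.x = 10
y = obj.x          # read back from disk
obj.clean_disk()
shutil.rmtree(tmp_root, ignore_errors=True)
```

Routing writes through __setattr__ and reads through __getattr__ keeps the calling code identical to ordinary attribute access, which appears to be the point of the class.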

class turntable.press.RecordSetter(**kwargs)

RecordSetter provides a simple interface for initializing arguments passed in kwargs, and a run method for running a class method by name.

kwargs : name : value

RecordSetter is a general Python class that assigns each item of **kwargs as an attribute of itself.

>>> obj = RecordSetter(name = 'me')
>>> print(obj.name)
me
load(**kwargs)

Takes an instance of Record() and named arguments from **kwargs, and returns the record instance with the named arguments added to it.

**kwargs : named arguments
first_arg = 1, second_arg = ‘two’

record.first_arg -> 1
record.second_arg -> 'two'

>>> import turntable
>>> record = turntable.press.Record(first_arg = 1)
>>> record = record.load(second_arg = 'two')
>>> record.series
first_arg       1
second_arg    two
run_method(method)

Calls a specified method by name.
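
The two ideas in RecordSetter (keyword arguments becoming attributes, and a method being called by its string name) can be sketched as below. The class KwargsSetter and its greet method are hypothetical, and whether the real run_method forwards extra arguments is not stated in this reference, so the sketch calls the method with none.

```python
class KwargsSetter(object):
    """Hypothetical illustration of the RecordSetter pattern."""

    def __init__(self, **kwargs):
        # Each keyword argument becomes an instance attribute.
        for name, value in kwargs.items():
            setattr(self, name, value)

    def run_method(self, method):
        # Look the method up by its string name and call it.
        return getattr(self, method)()

    def greet(self):
        return "hello " + self.name


obj = KwargsSetter(name="me")
greeting = obj.run_method("greet")
```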

class turntable.press.SeriesLoader

Bases: object

SeriesLoader assigns given properties to a special pandas.Series property: self.series.

series : all attributes of the class are added to an internal pandas Series

This class can be subclassed for use elsewhere:

>>> series_loader = SeriesLoader()
>>> series_loader.one = 'one'
>>> print(series_loader.series)
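
The mirroring of attributes into a series can be sketched with an overridden __setattr__. This is an illustration only: the class AttributeMirror is hypothetical, and a plain dict stands in for pandas.Series so that the sketch has no third-party dependencies.

```python
class AttributeMirror(object):
    """Hypothetical sketch of the SeriesLoader idea (dict instead of Series)."""

    def __init__(self):
        # Create the container without triggering our own __setattr__.
        object.__setattr__(self, "series", {})

    def __setattr__(self, name, value):
        # Set the attribute normally, then mirror it into the container.
        object.__setattr__(self, name, value)
        self.series[name] = value


loader = AttributeMirror()
loader.one = "one"
```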
turntable.press.build_collection(df, **kwargs)

Generates a list of Record objects from a DataFrame. Each Record instance has a series attribute, a pandas.Series holding the same attributes as the DataFrame columns. Optional data can be passed in through kwargs; each keyword is added by name to every Record.

df : pandas.DataFrame
kwargs : alternate arguments to be saved by name to the series of each object

collection : list
list of Record objects where each Record represents one row from a dataframe

This is how we generate a Record Collection from a DataFrame.

>>> import pandas as pd
>>> import turntable
>>>
>>> df = pd.DataFrame({
...     'Artist': ['Michael Jackson', 'Pink Floyd', 'Whitney Houston', 'Meat Loaf',
...                'Eagles', 'Fleetwood Mac', 'Bee Gees', 'AC/DC'],
...     'Album': ['Thriller', 'The Dark Side of the Moon', 'The Bodyguard', 'Bat Out of Hell',
...               'Their Greatest Hits (1971-1975)', 'Rumours', 'Saturday Night Fever', 'Back in Black']})
>>> collection = turntable.press.build_collection(df, my_favorite_record = 'nevermind')
>>> record = collection[0]
>>> print(record.series)
turntable.press.collection_to_df(collection)

Converts a collection back into a pandas DataFrame

collection : list
list of Record objects where each Record represents one row from a dataframe
df : pandas.DataFrame
DataFrame of length=len(collection) where each row represents one Record
turntable.press.load_record(record, **kwargs)

Takes an instance of Record() and named arguments from **kwargs, and returns the record instance with the named arguments added to it.

record : Record()
either full or empty record object
**kwargs : named arguments
first_arg = 1, second_arg = ‘two’

record.first_arg -> 1
record.second_arg -> 'two'

>>> import turntable
>>> record = turntable.press.load_record(turntable.press.Record(), first_arg = 1, second_arg = 'two')
>>> record.series
first_arg       1
second_arg    two
turntable.press.spin_frame(df, method)

Runs the full turntable process on a pandas DataFrame

df : pandas.DataFrame
each row represents a record
method : def method(record)
function used to process each row
df : pandas.DataFrame
DataFrame processed by method
>>> import pandas as pd
>>> import turntable
>>>
>>> df = pd.DataFrame({'Artist':"""Michael Jackson, Pink Floyd, Whitney Houston, Meat Loaf, Eagles, Fleetwood Mac, Bee Gees, AC/DC""".split(', '), 'Album':"""Thriller, The Dark Side of the Moon, The Bodyguard, Bat Out of Hell, Their Greatest Hits (1971–1975), Rumours, Saturday Night Fever, Back in Black""".split(', ')})
>>>
>>> def method(record):
...     record.cost = 40
...     return record
...
>>> turntable.press.spin_frame(df, method)

turntable.spin module

The spin module contains tools to process Record Collections in either series or parallel.

Thanks to chriskiehl http://chriskiehl.com/article/parallelism-in-one-line/

turntable.spin.batch(collection, method, processes=None, batch_size=None, quiet=False, kwargs_to_dump=None, args=None, **kwargs)

Processes a collection in parallel batches; each batch is processed in series on a single process. Running batches in parallel can be more efficient than splitting a list across cores as in spin.parallel, because parallel processing has high IO requirements.

collection : list
i.e. list of Record objects

method : method to call on each Record
processes : int
number of processes to run on [defaults to number of cores on machine]
batch_size : int
length of each batch [defaults to number of elements / number of processes]
collection : list
list of Record objects after going through method called

Adding 2 to every number in a range:

>>> import turntable
>>> collection = range(100)
>>> def jam(record):
...     return record + 2
...
>>> collection = turntable.spin.batch(collection, jam)

Lambda functions do not work in parallel.
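
The batching strategy described for spin.batch can be sketched as follows. This is an illustration only, not the turntable implementation: it uses the thread-backed Pool from multiprocessing.dummy in place of worker processes, the default of four workers is arbitrary, and the names batch_sketch, split_into_batches and run_batch are hypothetical.

```python
from multiprocessing.dummy import Pool  # thread-backed Pool, same map() API


def split_into_batches(collection, batch_size):
    # Consecutive slices of at most batch_size items.
    return [collection[i:i + batch_size]
            for i in range(0, len(collection), batch_size)]


def run_batch(args):
    # Each worker processes its whole batch in series.
    method, batch = args
    return [method(item) for item in batch]


def batch_sketch(collection, method, processes=4, batch_size=None):
    collection = list(collection)
    if batch_size is None:
        # Mirrors the documented default: elements / processes.
        batch_size = max(1, len(collection) // processes)
    batches = split_into_batches(collection, batch_size)
    pool = Pool(processes)
    try:
        results = pool.map(run_batch, [(method, b) for b in batches])
    finally:
        pool.close()
        pool.join()
    # Flatten the per-batch results back into one list, preserving order.
    return [item for batch in results for item in batch]


def jam(record):
    return record + 2


result = batch_sketch(range(100), jam)
```

Handing each worker a whole batch means one task is dispatched per batch rather than one per item, which is the trade-off the description above refers to.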

turntable.spin.new_function_batch(sequence, method, *args, **kwargs)
turntable.spin.new_function_dumping(args_to_load_names, function, main_arg, *args, **kwargs)
turntable.spin.parallel(collection, method, processes=None, args=None, **kwargs)

Processes a collection in parallel.

collection : list
i.e. list of Record objects

method : method to call on each Record
processes : int
number of processes to run on [defaults to number of cores on machine]
batch_size : int
length of each batch [defaults to number of elements / number of processes]
collection : list
list of Record objects after going through method called

Adding 2 to every number in a range:

>>> import turntable
>>> collection = range(100)
>>> def jam(record):
...     return record + 2
...
>>> collection = turntable.spin.parallel(collection, jam)

Lambda functions do not work in parallel.

turntable.spin.process_dump(collection, function, kwargs_to_dump, processes=None, args=None, **kwargs)
turntable.spin.series(collection, method, prints=15, *args, **kwargs)

Processes a collection in series

collection : list
list of Record objects

method : method to call on each Record
prints : int
number of timer prints to the screen
collection : list
list of Record objects after going through method called

If more than one collection is given, the function is called with an argument list consisting of the corresponding item of each collection, substituting None for missing values when the collections do not all have the same length. If the function is None, the original collection is returned (or a list of tuples if multiple collections are given).

Adding 2 to every number in a range:

>>> import turntable
>>> collection = range(100)
>>> method = lambda x: x + 2
>>> collection = turntable.spin.series(collection, method)
turntable.spin.thread(function, sequence, cores=None, runSeries=False, quiet=False)

Sets up the thread pool with map for parallel processing.

turntable.utils module

The utils module provides a collection of methods used across the package or of general utility.

class turntable.utils.Timer(nLoops, numPrints=100, verbose=True)

Timer that calculates time remaining for a process and the percent complete

Todo

Ask for details about the usage

nLoops : integer
numPrints : integer (default is 100)
verbose : bool (default is True)

nLoops : integer
numPrints : integer
verbose : bool
if True, print values when loop is called

count : integer
elapsed : float

elapsed time
est_end : float
estimated end
ti : float
initial time
tf : float
current time

display_amt : integer

fin()
loop()

Tracks the time in a loop. The estimated time to completion can be calculated and, if verbose is set to True, the object prints the estimated time to completion and the percent complete. Call it in every loop iteration to keep track.
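
The bookkeeping behind such a timer can be sketched as below. The class LoopTimer is hypothetical and is not the turntable Timer: it reuses the attribute names listed above for readability, treats est_end as the estimated number of seconds remaining (an assumption, since the reference does not say whether it is a duration or a timestamp), and leaves out printing.

```python
import time


class LoopTimer(object):
    """Hypothetical sketch of a loop timer; not the turntable Timer."""

    def __init__(self, n_loops):
        self.n_loops = n_loops
        self.count = 0
        self.ti = time.time()   # initial time
        self.tf = self.ti       # current time
        self.elapsed = 0.0
        self.est_end = 0.0

    def loop(self):
        # Call once per iteration.
        self.count += 1
        self.tf = time.time()
        self.elapsed = self.tf - self.ti
        avg = self.elapsed / self.count               # mean seconds per loop
        self.est_end = avg * (self.n_loops - self.count)
        return 100.0 * self.count / self.n_loops      # percent complete


timer = LoopTimer(4)
percents = [timer.loop() for _ in range(4)]
```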

turntable.utils.Walk(root='.', recurse=True, pattern='*')

Generator for walking a directory tree. Starts at specified root folder, returning files that match our pattern. Optionally will also recurse through sub-folders.

root : string (default is ‘.’)
Path for the root folder to look in.
recurse : bool (default is True)
If True, will also look in the subfolders.
pattern : string (default is ‘*’, which matches all files)
The pattern to look for in the file names.
generator
Yields the paths of the matching files.
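
A generator with the behaviour described for Walk can be sketched with os.walk and fnmatch. This is an assumption about how such a function might be written, not the turntable source; the name walk_sketch and the temporary files used to exercise it are hypothetical.

```python
import fnmatch
import os
import shutil
import tempfile


def walk_sketch(root=".", recurse=True, pattern="*"):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(dirpath, name)
        if not recurse:
            break  # stop after the root folder


# Build a small throwaway tree to walk.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "sub"))
for rel in ("a.txt", "b.csv", os.path.join("sub", "c.txt")):
    open(os.path.join(root, rel), "w").close()

all_txt = sorted(os.path.relpath(p, root)
                 for p in walk_sketch(root, pattern="*.txt"))
top_txt = sorted(os.path.relpath(p, root)
                 for p in walk_sketch(root, recurse=False, pattern="*.txt"))
shutil.rmtree(root)
```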
turntable.utils.add_path_string(root_path='./results', path_string=None)
turntable.utils.batch_list(sequence, batch_size, mod=0, randomize=False)

Converts a list into a list of lists with equal batch_size.

sequence : list
list of items to be placed in batches
batch_size : int
length of each sub list
mod : int
remainder of list length divided by batch_size: mod = len(sequence) % batch_size
randomize : bool
should the initial sequence be randomized before being batched
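
Splitting a list into batches can be sketched as below. The function batch_list_sketch is hypothetical: it ignores the mod parameter, whose exact effect is not described here, and simply leaves any remainder in a shorter final batch, which may differ from what turntable does.

```python
import random


def batch_list_sketch(sequence, batch_size, randomize=False):
    items = list(sequence)
    if randomize:
        # Shuffle a copy so the caller's sequence is left untouched.
        random.shuffle(items)
    return [items[i:i + batch_size]
            for i in range(0, len(items), batch_size)]


batches = batch_list_sketch(range(7), 3)
shuffled = batch_list_sketch(range(7), 3, randomize=True)
```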
turntable.utils.catch(fcn, *args, **kwargs)
try:
    return fcn(*args, **kwargs)
except:
    print traceback
    if 'spit' in kwargs.keys():
        return kwargs['spit']

fcn : function
*args : unnamed parameters of fcn
**kwargs : named parameters of fcn

spit : value returned if fcn raises an exception

Returns the expected output of fcn, or prints the exception traceback.
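
A runnable variant of the pattern above might look like the following. It is a sketch, not the turntable source: catch_sketch is a hypothetical name, and unlike the pseudocode it removes spit from kwargs before calling fcn so that fcn does not receive an unexpected keyword argument.

```python
import traceback


def catch_sketch(fcn, *args, **kwargs):
    # 'spit' is the fallback value; take it out before calling fcn.
    has_spit = "spit" in kwargs
    spit = kwargs.pop("spit", None)
    try:
        return fcn(*args, **kwargs)
    except Exception:
        traceback.print_exc()   # report the failure on stderr
        if has_spit:
            return spit


ok = catch_sketch(int, "42")
fallback = catch_sketch(int, "not a number", spit=-1)
no_fallback = catch_sketch(int, "not a number")
```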

turntable.utils.create_dir(path, dir_dict={})

Tries to create a new directory at the given path. create_dir can also create subfolders according to the dictionary given as second argument.

path : string
string giving the path of the location to create the directory, either absolute or relative.
dir_dict : dictionary, optional
Dictionary describing the subfolders to create. Keys must be strings, and values either None or path dictionaries. The default is {}, which means that no subfolders will be created.
>>> from turntable import utils
>>> path = './project'
>>> dir_dict = {'dir1':None, 'dir2':{'subdir21':None}}
>>> utils.create_dir(path, dir_dict)

will create:

  • project/dir1
  • project/dir2/subdir21

in the current working directory.
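
The nested-dictionary convention can be sketched with a short recursive function. This is an illustration under assumptions, not the turntable source: create_dir_sketch is a hypothetical name, it uses None rather than {} as the default to avoid a shared mutable default argument, and it works in a temporary directory.

```python
import os
import shutil
import tempfile


def create_dir_sketch(path, dir_dict=None):
    if not os.path.isdir(path):
        os.makedirs(path)
    for name, sub_dict in (dir_dict or {}).items():
        # A value of None means "no further subfolders".
        create_dir_sketch(os.path.join(path, name), sub_dict)


base = tempfile.mkdtemp()
project = os.path.join(base, "project")
create_dir_sketch(project, {"dir1": None, "dir2": {"subdir21": None}})

# Collect every created folder, relative to the project root.
created = sorted(
    os.path.relpath(os.path.join(d, s), project)
    for d, subs, _ in os.walk(project) for s in subs
)
shutil.rmtree(base)
```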

turntable.utils.displayAll(elapsed, display_amt, est_end, nLoops, count, numPrints)

Displays time if verbose is true and count is within the display amount

turntable.utils.from_pickle(filename, clean_disk=False)
turntable.utils.path_to_filename(pathfile)

Takes a path filename string and returns the split between the path and the filename

if filename is not given, filename = ''
if path is not given, path = './'
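
The two defaults above can be sketched with os.path.split. The function path_to_filename_sketch is hypothetical; in particular, it treats the last path component as the filename, and the real function may decide differently (for example by looking for a file extension) and may format the returned path differently.

```python
import os


def path_to_filename_sketch(pathfile):
    # Split at the last path separator.
    path, filename = os.path.split(pathfile)
    if path == "":
        path = "./"   # documented default when no path is given
    return path, filename


with_both = path_to_filename_sketch("results/data.csv")
no_path = path_to_filename_sketch("data.csv")
no_file = path_to_filename_sketch("results/")
```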

turntable.utils.scan_path(root='.', recurse=False, pattern='*')

Runs a loop over the Walk Generator to find all file paths in the root directory with the given pattern. If recurse is True: matching paths are identified for all sub directories.

root : string (default is ‘.’)
Path for the root folder to look in.
recurse : bool (default is False)
If True, will also look in the subfolders.
pattern : string (default is ‘*’, which matches all files)
The pattern to look for in the file names.
path_list : list
list of all the matching file paths.
turntable.utils.timeUnit(elapsed, avg, est_end)

Calculates the unit of time to display.

turntable.utils.to_pickle(obj, filename, clean_memory=False)

http://stackoverflow.com/questions/7900944/read-write-classes-to-files-in-an-efficent-way
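
A pickle round trip of the kind to_pickle and from_pickle suggest can be sketched with the standard library. The functions below are hypothetical stand-ins: the meaning given to clean_disk (delete the file after loading) is an assumption, and the clean_memory parameter is left out because its behaviour is not described in this reference.

```python
import os
import pickle
import tempfile


def to_pickle_sketch(obj, filename):
    with open(filename, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def from_pickle_sketch(filename, clean_disk=False):
    with open(filename, "rb") as f:
        obj = pickle.load(f)
    if clean_disk:
        os.remove(filename)  # assumed meaning of clean_disk
    return obj


fd, filename = tempfile.mkstemp(suffix=".pkl")
os.close(fd)
to_pickle_sketch({"artist": "Pink Floyd"}, filename)
restored = from_pickle_sketch(filename, clean_disk=True)
```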

Module contents