VertexFrame add_columns

add_columns(self, func, schema, columns_accessed=None)
Add columns to current frame.

Parameters:
func : UDF
User-Defined Function (UDF) which takes the values in the row and produces a value, or collection of values, for the new cell(s).
schema : tuple | list of tuples
The schema for the results of the UDF, indicating the new column(s) to add. Each tuple provides the column name and data type, and is of the form (str, type).
columns_accessed : list (default=None)
List of columns which the UDF will access. Naming the needed column(s) up front gives a significant performance benefit, especially when the frame has many more columns than the UDF actually reads.
Assigns data to the new column(s) by evaluating a function for each row.
Notes
- The row UDF (‘func’) must return a value in the same format as specified by the schema. See Python User Functions.
- Unicode in column names is not supported and will likely cause the drop_frames() method (and others) to fail!
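Because non-ASCII column names can break later calls, it may help to check a schema before passing it to add_columns. The following is only a sketch: the helper name assert_ascii_column_names is hypothetical and is not part of the frame API, and the built-in int stands in for a frame data type such as ta.int32.

```python
def assert_ascii_column_names(schema):
    """Raise ValueError if any column name in the schema is not plain ASCII.

    `schema` is a single (name, type) tuple or a list of such tuples.
    Hypothetical helper; not part of the frame API.
    """
    # Normalize the single-tuple form to a one-element list.
    entries = [schema] if isinstance(schema, tuple) else list(schema)
    for name, _data_type in entries:
        try:
            name.encode("ascii")
        except (UnicodeEncodeError, UnicodeDecodeError):
            raise ValueError("column name %r is not plain ASCII" % (name,))
    return entries


# `int` is a stand-in for a frame data type such as ta.int32.
checked = assert_ascii_column_names(("adult_years", int))
```

The check returns the schema as a list of tuples, so the same value can be inspected or logged before the real call is made.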
Examples
Given our frame, let’s add a column holding the number of years each person has been over 18:
>>> frame.inspect()
[#]  name      age  tenure  phone
====================================
[0]  Fred       39      16  555-1234
[1]  Susan      33       3  555-0202
[2]  Thurston   65      26  555-4510
[3]  Judy       44      14  555-2183

>>> frame.add_columns(lambda row: row.age - 18, ('adult_years', ta.int32))
[===Job Progress===]

>>> frame.inspect()
[#]  name      age  tenure  phone     adult_years
=================================================
[0]  Fred       39      16  555-1234           21
[1]  Susan      33       3  555-0202           15
[2]  Thurston   65      26  555-4510           47
[3]  Judy       44      14  555-2183           26
Multiple columns can be added at the same time. Let’s add percentage of life and percentage of adult life in one call, which is more efficient.
>>> frame.add_columns(lambda row: [row.tenure / float(row.age), row.tenure / float(row.adult_years)],
...                   [("of_age", ta.float32), ("of_adult", ta.float32)])
[===Job Progress===]

>>> frame.inspect(round=2)
[#]  name      age  tenure  phone     adult_years  of_age  of_adult
===================================================================
[0]  Fred       39      16  555-1234           21    0.41      0.76
[1]  Susan      33       3  555-0202           15    0.09      0.20
[2]  Thurston   65      26  555-4510           47    0.40      0.55
[3]  Judy       44      14  555-2183           26    0.32      0.54
Note that the function returns a list, and therefore the schema also needs to be a list.
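The pairing of a list result with a list schema can be checked locally before a job is submitted. This is a minimal sketch under stated assumptions: check_udf_result is a hypothetical helper, not part of the frame API, and the built-in float stands in for ta.float32. The values used are Fred's from the frame above.

```python
def check_udf_result(result, schema):
    """Verify that a UDF result matches the shape of the schema.

    Hypothetical helper; not part of the frame API.
    A list schema expects a list or tuple with one value per column.
    A single-tuple schema expects one value, which is returned unchanged.
    """
    if isinstance(schema, tuple):
        return result
    if not isinstance(result, (list, tuple)):
        raise TypeError("schema is a list, so the UDF must return a list")
    if len(result) != len(schema):
        raise ValueError("expected %d values, got %d" % (len(schema), len(result)))
    return list(result)


# `float` is a stand-in for ta.float32.
schema = [("of_age", float), ("of_adult", float)]

# Fred: tenure=16, age=39, adult_years=21.
values = check_udf_result([16 / float(39), 16 / float(21)], schema)
```

Returning a bare number against this two-column schema raises TypeError in the sketch, which mirrors the mismatch the note above warns about.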
It is not necessary to use lambda syntax; any function will do, as long as it takes a single row argument. The function can also call other local functions.
Let’s add a column which holds the leading part of each person’s name, sized by their adult tenure percentage.
>>> def percentage_of_string(string, percentage):
...     '''returns a substring of the given string according to the given percentage'''
...     substring_len = int(percentage * len(string))
...     return string[:substring_len]

>>> def add_name_by_adult_tenure(row):
...     return percentage_of_string(row.name, row.of_adult)

>>> frame.add_columns(add_name_by_adult_tenure, ('tenured_name', unicode))
[===Job Progress===]

>>> frame
Frame "example_frame"
row_count = 4
schema = [name:unicode, age:int32, tenure:int32, phone:unicode, adult_years:int32, of_age:float32, of_adult:float32, tenured_name:unicode]
status = ACTIVE (last_read_date = -etc-)

>>> frame.inspect(columns=['name', 'of_adult', 'tenured_name'], round=2)
[#]  name      of_adult  tenured_name
=====================================
[0]  Fred          0.76  Fre
[1]  Susan         0.20  S
[2]  Thurston      0.55  Thur
[3]  Judy          0.54  Ju
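Since a row function only needs a single row argument, it can be tried out locally before running it against the frame. The sketch below assumes that the UDF reads columns by attribute access only, and uses a namedtuple called Row as a stand-in for the real row object; Row and the sample list are illustrative and not part of the frame API.

```python
from collections import namedtuple

# Stand-in for the row object handed to a UDF; only attribute access is modelled.
Row = namedtuple("Row", ["name", "age", "tenure", "adult_years", "of_adult"])


def percentage_of_string(string, percentage):
    """Return a leading substring of `string` sized by `percentage`."""
    substring_len = int(percentage * len(string))
    return string[:substring_len]


def add_name_by_adult_tenure(row):
    return percentage_of_string(row.name, row.of_adult)


# The four rows of the example frame, with of_adult = tenure / adult_years.
sample = [
    Row("Fred", 39, 16, 21, 16 / 21.0),
    Row("Susan", 33, 3, 15, 3 / 15.0),
    Row("Thurston", 65, 26, 47, 26 / 47.0),
    Row("Judy", 44, 14, 26, 14 / 26.0),
]

tenured_names = [add_name_by_adult_tenure(row) for row in sample]
```

The resulting list can be compared against the tenured_name column shown above.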
Optimization - If we know up front which columns our row function will access, we can tell add_columns to speed up the execution by working on only the limited feature set rather than the entire row.
Let’s add a name based on tenure percentage of age. We know we’re only going to use columns ‘name’ and ‘of_age’.
>>> frame.add_columns(lambda row: percentage_of_string(row.name, row.of_age),
...                   ('tenured_name_age', unicode),
...                   columns_accessed=['name', 'of_age'])
[===Job Progress===]

>>> frame.inspect(round=2)
[#]  name      age  tenure  phone     adult_years  of_age  of_adult
===================================================================
[0]  Fred       39      16  555-1234           21    0.41      0.76
[1]  Susan      33       3  555-0202           15    0.09      0.20
[2]  Thurston   65      26  555-4510           47    0.40      0.55
[3]  Judy       44      14  555-2183           26    0.32      0.54

[#]  tenured_name  tenured_name_age
===================================
[0]  Fre           F
[1]  S
[2]  Thur          Thu
[3]  Ju            J
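When it is not obvious which columns a row function reads, one way to find out is to run it once against a stand-in row that records every column lookup. This is only a sketch: RecordingRow is a hypothetical class, not part of the frame API, and it assumes the UDF reads columns by attribute access.

```python
def percentage_of_string(string, percentage):
    """Same helper as in the example above, repeated so this sketch runs alone."""
    return string[:int(percentage * len(string))]


class RecordingRow(object):
    """Wraps a dict of column values and records which columns are read.

    Hypothetical helper; not part of the frame API.
    """

    def __init__(self, values):
        self._values = values
        self.accessed = []

    def __getattr__(self, column):
        # Only reached for names not found on the instance, i.e. column lookups.
        values = self.__dict__["_values"]
        if column not in values:
            raise AttributeError(column)
        if column not in self.accessed:
            self.accessed.append(column)
        return values[column]


# One row of the example frame, with of_age rounded as displayed.
probe = RecordingRow({"name": "Thurston", "age": 65, "tenure": 26,
                      "phone": "555-4510", "of_age": 0.40})

udf = lambda row: percentage_of_string(row.name, row.of_age)
result = udf(probe)

# Candidate value for the columns_accessed argument.
columns_accessed = probe.accessed
```

The recorded list only reflects the code path taken for the probe row. A function that branches on the data may read other columns for other rows, so the list is worth reviewing by hand.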
More information on row UDFs can be found at Python User Functions.