.. bops documentation master file, created by
   sphinx-quickstart on Wed Oct 26 10:02:22 2011.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to bops's documentation!
================================

**bops** stands for *boolean array operations*. **bops** uses numpy to do boolean operations on numpy arrays for faster data selection, grouping and sorting. **bops** also has map reduce functionality for data grouping and aggregation, which lets you slice and dice data however you see fit.

Examples / Support
------------------

Full-length examples can be found `here <./examples.html>`_.

**Mailing List**

A mailing list has been created to support the use of this module. You can join and follow the discussion on `Google groups `_. Any errors, issues and enhancements can be discussed there.

Bops aims to be a top-notch data analysis module, but it can only become great with your help. Please chime in on the discussion. Your input is welcome, as are any suggested features, patches or fixes.

Bops Methodology
----------------

Bops' goal is to simplify selecting data (i.e. filtering with numpy boolean arrays) and grouping data (by using numpy arrays to aggregate the unique values being grouped on). Bops also attempts to simplify map reduce operations by using traditional python functions without the added complexity of a network protocol.

Bops focuses on four principal methods in data analysis:

* `Selection / Slicing <./bops.html#selection>`_

  Selecting and slicing data is a big part of analyzing only the data you want and removing the data you don't. This is really easy in bops by using the `select <./bops.html#bops.bop.select>`_ function.

* `Grouping <./bops.html#id5>`_

  Grouping data on similar attributes is sometimes crucial; without it, complex analysis becomes difficult. Luckily, bops has `this in mind <./bops.html#bops.bop.groupby>`_.

* `Ordering <./bops.html#id8>`_

  Many algorithms depend on the data being ordered, and sorting data can be a headache. Bops makes it `easy <./bops.html#bops.bop.orderby>`_.

* `MapReduce <./bops.html#id9>`_

  Bops was written to modularize the code that makes traditional, non-distributed MapReduce possible, without the headache of network protocols. There are some other python modules which shoot for a distributed MapReduce capability, but *bops* does not. MapReduce can simplify analysis code while producing greatly varying results simply by changing either the mapper or reducer functions. Bops tries to make MapReduce as `simple as possible <./bops.html#bops.bop.mapreduce>`_.

Four Data Analysis Principles
-----------------------------

Before any data manipulation can take place, the data needs to be initialized in a way that bops can understand. This is done via the `bop <./bops.html#bops.bop>`_ class.

The `bop <./bops.html#bops.bop>`_ class wraps the data in a *numpy records array*. This adds easy-to-use dot-syntax to the data for easy manipulation. Bops has functions to select, group, order, map and reduce data, giving you the flexibility to do what you wish.

.. note::

    A `bop <./bops.html#bops.bop>`_ takes a **list of iterables** as the first argument. This can be a list of tuples, lists, dicts, sets, numpy arrays and even strings. The list could go on, but those objects are *nearly* bound to work.

For these examples / explanations a fake sample data file will be used which contains census data. Every data point has a *country, a state or province, first name, last name, age and gender*.

A `bop <./bops.html#bops.bop>`_ is initialized like this:

.. code-block:: python

    from bops import bop
    import numpy as np

    # Open a data file, read its lines and split each one on commas.
    # This produces a list of lists, which is the data format that bops expects.
    # The file is closed automatically when the 'with' block ends.
    with open('some_data.csv', 'r') as fh:
        lines = [line.strip().split(',') for line in fh]

    # This initializes the data with 6 columns (assuming the CSV had 6 columns).
    # Column names are turned into variables so they MUST be compatible with
    # Python syntax (i.e. no special characters other than '_', no internal spaces, etc.).
    data = bop(lines, 'Country, Province_State, First_Name, Last_Name, Age, Gender')

    # This returns all the data in the first column, and so forth.
    col1 = data.country

    # Notice that the variable name above is lowercase: bops auto-lowers
    # all column names. Lookups are also lowered at runtime, so the
    # programmer can type the original capitalization and it won't matter.
    col1 = data.Country

.. warning::

    Column names are denoted by a comma-delimited string. They **MUST** meet Python's variable name requirements. Otherwise, you may not be able to get to the data you need.

    Column names cannot have internal spaces: 'First Name' will **NOT** work. However, " First_Name " will work, because bops trims column names before using them.

    Column names cannot contain any special characters other than an underscore, '_'. Again, they MUST meet Python's variable name specification.

Selection
+++++++++

Selecting and slicing data is a big part of analyzing only the data you want. This is really easy in bops by using the `select <./bops.html#bops.bop.select>`_ function.

Here's how you would do it with bops:

.. code-block:: python

    # Assuming your data is initialized as above...
    # The command below returns ONLY the rows where 'country' equals 'US'.
    united_states = data.select(data.country == 'US')

    # 'Filters' can be persisted like this. They are built from 'united_states'
    # so that they have one entry per row of the data they are applied to.
    male_filter = united_states.gender == 'M'
    female_filter = united_states.gender == 'F'

    # The above filters select only the rows where the gender column matches 'M' or 'F'.
    US_males = united_states.select(male_filter)
    US_females = united_states.select(female_filter)

    # Filters can also be combined simply by multiplying them.
    # Selecting females younger than 25 years old can be done like so.
    US_females_young = united_states.select(female_filter * (united_states.age < 25))

    # Notice the parentheses around the age filter...

Grouping
++++++++

Grouping data on similar attributes is sometimes crucial; without it, complex analysis becomes difficult. Luckily, bops has this in mind.

Grouping can be done like this:

.. code-block:: python

    # Grouping by country
    # By default, groupby returns a list of tuples
    countries = data.groupby('country')

    # Looping through the data can be done like so.
    for country, country_data in countries:
        # This prints the country name and the number of people listed for the country.
        print(country, len(country_data))

        # The variable 'country_data' is another 'bop' instance.
        males = country_data.select(country_data.gender == 'M')
        females = country_data.select(country_data.gender == 'F')

        # This prints the average age of males and females for the country
        # by calling numpy's mean function on the age data for each gender.
        print(np.mean(males.age), np.mean(females.age))

.. note::

    By default, `groupby <./bops.html#bops.bop.groupby>`_ returns a list of tuples. The *expand* option makes little difference when grouping by only one column; it simply removes the need for the items() call on the dict. However, with multiple columns grouped, it can be advantageous.

    When *expand* is **False**, a dict is returned instead. The dict key is a **tuple** of the grouped columns. Therefore, grouping on multiple columns is allowed.
    Each value of the dictionary is a `bop <./bops.html#bops.bop>`_ instance holding all the rows that match the group. For instance, if you group on 'country' and 'state' (`data.groupby('country','state')`), then a key of the dictionary will be something like `('US','CA',)`, and its value will be a `bop <./bops.html#bops.bop>`_ instance for all rows found with `country='US'` and `state='CA'`.

Grouping by One column:

.. code-block:: python

    # Grouping by country
    # By default, groupby returns a list of tuples
    countries = data.groupby('country')

    # Looping through the data can be done like so.
    for country, country_data in countries:
        # This prints the country name and the number of people listed for the country.
        print(country, len(country_data))

        # The variable 'country_data' is another 'bop' instance.
        males = country_data.select(country_data.gender == 'M')
        females = country_data.select(country_data.gender == 'F')

        # This prints the average age of males and females for the country
        # by calling numpy's mean function on the age data for each gender.
        print(np.mean(males.age), np.mean(females.age))

Grouping by Multiple columns:

.. code-block:: python

    # Grouping by country and state
    # By default, groupby returns a list of tuples
    countries_states = data.groupby('country', 'Province_State')

    # Looping through the data can be done like so.
    for country, state, country_state_data in countries_states:
        # This prints the country, the state and the number of people listed for that pair.
        print(country, state, len(country_state_data))

        # The variable 'country_state_data' is another 'bop' instance.
        males = country_state_data.select(country_state_data.gender == 'M')
        females = country_state_data.select(country_state_data.gender == 'F')

        # This prints the average age of males and females for the state
        # by calling numpy's mean function on the age data for each gender.
        print(np.mean(males.age), np.mean(females.age))

.. note::

    The 'groupby' function returns a list of tuples. Each tuple holds the values of the grouped columns, with the associated data as its last item. When the *expand* option is **False**, a dictionary is returned.

Ordering
++++++++

Many algorithms depend on the data being ordered, and sorting data can be a headache. Bops makes it easy.

Ordering is done in place by numpy like so.

.. code-block:: python

    # Ordering made simple.
    data.orderby(data.age)
    youngest = data.age[0]
    oldest = data.age[-1]

MapReduce
+++++++++

MapReduce is an old data analysis paradigm. However, Google wrote a paper on how they were using the method to reduce petabytes of data over several thousand commodity servers, and thus a trend was started. Although Google uses this method in a distributed fashion across many servers, most data sets do not require the complexity of such a solution. This includes engineering and scientific data in the hundreds of MB range. This is why bops was written: to provide a MapReduce tool to assist in the reduction of data that does not require a distributed solution.

Bops was written to modularize the code that makes traditional, non-distributed MapReduce possible, without the headache of network protocols. There are some other python modules which shoot for a distributed MapReduce capability, but *bops* does not. Bops attempts to fill a gap where distributed computing is not required.

With that disclaimer, here's how I decided to do it.

Bops uses functions called 'mappers' and 'reducers'. These are ordinary Python functions which follow a certain convention.

Mappers
'''''''

Mapper functions are simply functions which are called for every row in a dataset. Mappers get **ONE** argument, which is the row. Mappers return **TWO** values: a key and a value. The values from all rows for which the mapper returns the same key are then grouped together.

This is how you would 'map' by gender and age.

.. code-block:: python

    # The simple mapper function...
    def gender_age_mapper(row):
        return (row.gender, row.age), row

    # 'Map' the data by gender and age
    gender_age_groups = data.map(gender_age_mapper)

    # This will print the gender, age and number of people in the group
    for (gender, age), rows in gender_age_groups.items():
        print(gender, age, len(rows))

This can be simplified even further by creating a reduce function.

Reducers
''''''''

Like `mappers <./bops.html#mappers>`_, reducers are simple Python functions. However, unlike a mapper, which sees one row at a time, a reducer is called once per map result and is given the entire set of values mapped to that key.

For example, you can simplify the above mapreduce code like this:

.. code-block:: python

    # The simple mapper function...
    def gender_age_mapper(row):
        return (row.gender, row.age), row

    # 'Map' the data by gender and age, and reduce it with the built-in 'len' function.
    gender_age_groups = data.mapreduce(gender_age_mapper, len)

    # This will print the gender, age and number of people in the group
    for gender, age, count in gender_age_groups:
        print(gender, age, count)

Notice that the mapreduce function does NOT return a dict like `map <./bops.html#bops.bop.map>`_. Also, inside the for-loop 'len(rows)' is replaced with 'count'. This is because the reducer, 'len', has already been called on each mapped data set. In order for this to happen, the entire mapped data set for a key has to be passed to the reducer.

Here's another example which finds the total years in college for a gender age group.

.. code-block:: python

    # The simple mapper function...
    def gender_age_mapper(row):
        return (row.gender, row.age), row

    # Sums the years of higher education for each gender, age group
    def total_college_years(rows):
        return sum(row.college for row in rows)

    # 'Map' the data by gender and age, and reduce it with the 'total_college_years' function.
    gender_age_groups = data.mapreduce(gender_age_mapper, total_college_years)

    # This will print the gender, age and total college years for the group
    for gender, age, college_years in gender_age_groups:
        print(gender, age, college_years)

I have used the built-in 'len' and 'sum' functions several times to create histograms and statistics based on how the mapper function maps the data.

Here's a simpler and faster way to do the same as the code above.

.. code-block:: python

    # The simple mapper function, returning only the college years as the value...
    def gender_age_mapper(row):
        return (row.gender, row.age), row.college

    # 'Map' the data by gender and age, and reduce it with the built-in 'sum' function.
    gender_age_groups = data.mapreduce(gender_age_mapper, sum)

    # This will print the gender, age and total college years for the group
    for gender, age, college_years in gender_age_groups:
        print(gender, age, college_years)

This is simpler because there's no need to write an additional reducer function. I simply changed the mapper function to return `row.college` instead of the entire `row`. Then, with Python's 'sum' function as the reducer, the college years are summed. This solution is faster as well because the rows don't have to be iterated over inside a custom reducer.

Numpy with Sugar
----------------

Bops also has some syntactic sugar. After the data has been initialized you can forgo calling numpy functions by changing the variable name. For instance:

.. code-block:: python

    >>> from bops import bop
    >>> import numpy as np
    >>>
    >>> # Read file, query database, etc.
    >>> # ...
    >>>
    >>> data = bop(results, 'name,gender,age,years_in_college')
    >>>
    >>> # Normal function calls
    >>> oldest = np.max(data.age)
    >>>
    >>> # with sugar
    >>> sugar = data.age_max
    >>>
    >>> oldest == sugar
    True
    >>>

Bops intelligently figures out that 'age_max' is not a column, then tries to call `np.max` on `data.age`. This will work for all numpy functions that accept numpy arrays and don't have underscores in their name.
The function name *MUST* come at the end and be separated from the column name by an underscore ('_'). The column name itself may contain underscores. For example, this will still work:

.. code-block:: python

    >>> max_education = data.years_in_college_max

.. note::

    This only works if no extra `**kwargs` are needed.

You can find a list of numpy functions `by category here `_ and `alphabetically here `_. Some of the perhaps most useful are: *min*, *max*, *std*, *mean*, *size* and *histogram*.

Aliasing
--------

In addition to the Numpy sugar, you can do the same with the `alias <./bops.html#bops.bop.alias>`_ function. There are some numpy functions that contain underscores in the name (e.g. `float_`, `int_`, etc.), and aliasing is how bops mitigates this issue.

The alias function can be used to rename Numpy functions, as well as traditional Python functions, and then inform bops of the aliased name. For example:

.. code-block:: python

    # Takes a numpy array of 'F' and 'M' values and turns it into
    # a numpy array of 'Female'/'Male' strings
    def full_gender(array):
        gender = []
        for g in array:
            if g == 'F':
                gender.append('Female')
            else:
                gender.append('Male')
        return np.asarray(gender)

    # This aliases the function name so it can be used with the underscore shortcut functionality
    # NOTE: The keyword CANNOT have underscores in the name; however, the function name can.
    data.alias(fullgender=full_gender)

    # An example of this functionality
    full_genders = data.gender_fullgender

Bops does have some built-in aliases:

* **avg**: `np.mean`
* **len**: `np.size`
* **float**: `np.float_`
* **int**: `np.int_`
* **bool**: `np.bool_`
* **str**: `np.str_`
* **unicode**: `np.unicode_`
* **complex**: `np.complex_`

bop: With great power comes ... faster data analysis
----------------------------------------------------

This class provides tremendous power for data analysis. With a back end of Numpy, this class allows for lightning-fast data filtering and grouping. This class also has built-in MapReduce functionality!

.. autoclass:: bops.bop
    :members: select, groupby, orderby, map, reduce, complexreduce, mapreduce, mapreducebatch, alias

Exceptions
----------

.. note::

    These exceptions need more documentation!!

.. automodule:: bops.exceptions
    :members:

Legacy methods
--------------

These methods were written mainly because the author wasn't very familiar with numpy and didn't have any idea what he was doing. He also was tired of wrapping list after list after list in a numpy.array, so bops was started. He soon discovered the errors (yes, that is plural) of his ways and has now seen the light.

These legacy methods accept lists or numpy arrays, but always return a numpy array. **The main usefulness of these methods is that they accept pure python lists.**

.. automodule:: bops
    :members: eq, false, gt, gtoe, l2a, logand, logor, lt, ltoe, oband, obor, true

Indexes and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`