Basic concepts

In this section we describe some of the basic underlying concepts of Ibidas.

Representation

In Ibidas one works with data sources by packaging them within a ‘representation object’.

Such data sources can be python objects, files, databases, and so on. Once they are within a representation object they are all handled similarly.

We start with a simple example, in which we package an integer object:

>>> Rep(3)
Slices: | data
---------------
Type:   | int64
Dims:   |
Data:   |
        | 3

Tip

The concepts of slices, types and dims are explained in the following sections.

One can perform all kinds of operations on data that is in a representation object, e.g.:

>>> Rep([1,2,3]) + 3
Slices: | data
---------------
Type:   | int64
Dims:   | d1:3
Data:   |
        | 4
        | 5
        | 6
Summary:
  • A representor object encapsulates a data source.
  • Data sources can be python objects, but also files or databases.

Query execution

You might have noticed that executing the previous commands immediately resulted in a printout of the contents of the representation object. This is due to the IPython interpreter, which prints a representation of the result of every non-assignment operation. So, if we had instead executed:

>>> r = Rep([1,2,3]) + 3

no output would have been printed. More importantly however, Ibidas would also not have performed the requested addition operation. I.e. r would have only been a representation of the requested operations and data sources, and thus not the result of those operations.

Important

Operations are executed lazily, i.e. only when output is requested by the user.

The reason for this behaviour is that it allows optimizations which are otherwise not possible. For example, it enables the system to translate queries against a database (partially) into the query language SQL. This way, instead of moving unnecessary data to Ibidas, operations can be moved to the database.
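The idea of deferred execution can be illustrated with a minimal pure-Python sketch. The `LazyAdd` class below is hypothetical and is not how Ibidas is implemented; it only shows that building a query and evaluating it can be two separate steps.

```python
class LazyAdd:
    """Hypothetical stand-in for a lazily evaluated 'source + constant' query."""

    def __init__(self, source, constant):
        self.source = source
        self.constant = constant
        self.evaluations = 0  # counts how often the addition actually ran

    def __call__(self):
        # Work happens only here, when the result is requested.
        self.evaluations += 1
        return [value + self.constant for value in self.source]


query = LazyAdd([1, 2, 3], 3)       # nothing is computed yet
runs_before = query.evaluations     # still 0 at this point
result = query()                    # the addition runs now
print(result)                       # [4, 5, 6]
```

Because the query is only a description until it is called, a system has the opportunity to rewrite or relocate the work before any data is touched.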

Note

For brevity, we will sometimes print output in this tutorial after assignments (e.g. r = Rep(3)). In reality this does not happen. One can still get the output after such statements by simply executing r.

Often, one wants to just see a description of the contents of a representation object, not the actual data result itself. This can be done using the information attribute I:

>>> r = Rep(["geneA","geneB"])
>>> r.I
Slices: data
Types:  bytes[5]
Dims:   d1:2

Note that the data is not printed. Especially in case of slow operations or data sources this can be useful.

On the other hand, there are also situations in which one wants to force the query to execute, e.g. because its result is used multiple times (which would otherwise result in a part of the query being executed multiple times). Execution can be forced by using the Copy() command:

>>> r = Rep([1,2,3]) + 4
>>> r = r.Copy()

Getting data out of a representor object is simple: append () to a query and it returns the results as normal numpy or python objects. For example, for the gene list used earlier:

>>> r = Rep(["geneA","geneB"])
>>> r()
array(['geneA', 'geneB'], dtype=object)

As you can see, Ibidas has packaged the data in a numpy array.

Summary:
  • Operations are only executed when needed, to allow for optimizations
  • One can ask for a description of the representor contents using the I attribute.
  • Use the Copy command to force execution of a query.
  • One can get the data results by transforming the query into a function call by appending ()

Types

When one executes Rep without specifying a type, the type is detected automatically. For example, in the first example, the detected type was int64.

Note

Depending on the platform you use, the type can also be int32.

The type determines how operations on the representor are handled. For example, with an integer type, one can perform standard integer operations on the representor:

>>> r = Rep(3)
>>> (r + 3) * r
Slices: | data
---------------
Type:   | int64
Dims:   |
Data:   |
        | 18

Similarly, in case of the string type, the addition operation becomes concatenation:

>>> Rep(["geneA", "geneB"])  + "_geneC"
Slices: | data
---------------------
Type:   | bytes[11]
Dims:   | d1:2
Data:   |
        | geneA_geneC
        | geneB_geneC

One might have noticed that, although we now represent a list of things, the type still represents the type of the list elements.

This is because Rep (by default) unpacks the data. By unpacking, operations will not be performed at the list level, but instead at the list elements level. Unpacking/packing will be explained further in one of the next sections.

Summary:
  • A type is assigned automatically when packaging data using Rep
  • The type indicates at which data nesting level operations are executed.

Slices

Whereas lists in Ibidas are used to denote collections of data with the same type, tuples are used to describe data compositions of different types.

You might know such compositions as records, or simply as table rows.

So, let's load a simple table:

>>> data = [('gene1',0.5), ('gene2', 0.3), ('gene3', 0.8)]

>>> r = Rep(data)
Slices: | f0       | f1
---------------------------
Type:   | bytes[5] | real64
Dims:   | d1:3     | d1:3
Data:   |          |
        | gene1    | 0.5
        | gene2    | 0.3
        | gene3    | 0.8

If we compare the output to earlier output, we now see that there is more than one column in the data description.

These columns represent slices. Slices are one of the main concepts in Ibidas. They can be compared to columns/fields in a table, but are more general.

Note

Note that we use tuples to describe records, and a list to store the records of the table. In Ibidas, tuples are used to relate attributes that describe some common object, while lists are used to gather elements with the same type / role.

Selecting a slice can be done using simple attribute lookup:

>>> r.f0
Slices: | f0
------------------
Type:   | bytes[5]
Dims:   | d1:3
Data:   |
        | gene1
        | gene2
        | gene3

Each slice has a name (the first row), a type (second row) and a dimension (third row). Dimensions will be explained later. For now, it is important that each slice has a common type. This means that all data elements addressed by it can be handled in the same way. Slices could thus also be seen as a kind of cursor in your data structure. Performing operations on this cursor will perform the operations on a subset of your data. For example:

>>> r.f0 == "gene2"
Slices: | f0
---------------
Type:   | bool
Dims:   | d1:3
Data:   |
        | False
        | True
        | False
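As a rough analogy in plain Python (not Ibidas code), a slice behaves like one column extracted from a list of records, and an operation on the slice is applied to every element of that column. The variable names below are chosen for illustration only.

```python
data = [("gene1", 0.5), ("gene2", 0.3), ("gene3", 0.8)]

# Turn the list of records into one tuple per column ("slice").
f0, f1 = zip(*data)

# An operation on a slice is applied to each element it addresses.
is_gene2 = [name == "gene2" for name in f0]
print(is_gene2)  # [False, True, False]
```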

To select multiple slices, one can use the ibidas.representor.Representor.Get() function:

>>> r.Get("f1", "f0")
Slices: | f1     | f0
---------------------------
Type:   | real64 | bytes[5]
Dims:   | d1:3   | d1:3
Data:   |        |
        | 0.5    | gene1
        | 0.3    | gene2
        | 0.8    | gene3

>>> r.Get(1, 0)
Slices: | f1     | f0
---------------------------
Type:   | real64 | bytes[5]
Dims:   | d1:3   | d1:3


>>> r.Get(r.f1, r.f0)
Slices: | f1     | f0
---------------------------
Type:   | real64 | bytes[5]
Dims:   | d1:3   | d1:3


>>> r.Get(_.f1, _.f0)
Slices: | f1     | f0
---------------------------
Type:   | real64 | bytes[5]
Dims:   | d1:3   | d1:3

The last option shows the use of the context operator _, which addresses the enclosing data representor (in this case r). This allows one to refer easily to slices in longer commands. One can also combine this with other operations:

>>> r.Get(_.f1 + 3, _.f0)
Slices: | f1     | f0
---------------------------
Type:   | real64 | bytes[5]
Dims:   | d1:3   | d1:3
Data:   |        |
        | 3.5    | gene1
        | 3.3    | gene2
        | 3.8    | gene3

One can also use this function to combine slices, e.g.:

>>> r.Get(_.f0, Rep("cancer_genes"))
Slices: | f0       | data
---------------------------------
Type:   | bytes[5] | bytes[12]
Dims:   | d1:3     |
Data:   |          |
        | gene1    | cancer_genes
        | gene2    |
        | gene3    |

When loading data from databases or files, slice names are often assigned as given in the data source. When loading from Python data, however, slice names are assigned automatically and are thus not very informative. To rename slices, one can use the ibidas.representor.Representor.Rename() function:

>>> r.Rename("genes","scores")
Slices: | genes    | scores

>>> r.Rename(f1="scores")
Slices: | f0   | scores

As this functionality is used often, a shorter version is available:

>>> r/("genes","scores")
Slices: genes    scores

>>> r/{"f1":"scores"}
Slices: f0    scores

>>> r.Get(_.f0/"genes",
          Rep("cancer_genes")/"listname")
Slices: genes       listname
Summary:
  • Slices can be compared to columns/fields in a table, or to data cursors which indicate on which data elements operations will be applied.
  • A representor object is a collection of slices
  • Attribute lookup can be used to select a single slice.
  • More advanced selection can be performed using the Get function, allowing multiple slice selection, slice modifications and slice combination.

Dimensions

Up to now, our data model was very similar to ones used in other software. Dimensions allow Ibidas to handle more complex data structures.

Let's replace the first field of the table with a nested, variable-length list:

>>> data = [([1,2],0.5), ([3,4,5], 0.3), ([6,7,8,9], 0.8)]
>>> r = Rep(data)
Slices: | f0        | f1
----------------------------
Type:   | int64     | real64
Dims:   | d1:3<d2:~ | d1:3
Data:   |           |
        | [1 2]     | 0.5
        | [3 4 5]   | 0.3
        | [6 7 8 9] | 0.8

Compare this to the dims in the previous sections. Dimensions indicate which nesting arrays have been unpacked. So, in case of slice f0, we are working at the level of int64 (the type). These int64 elements are nested two levels deep in arrays, which are indicated by dimensions d1 and d2.

The dim names are accompanied by a shape attribute. The 3 means that the dimension has a fixed size of 3, while the ~ means that the dimension has a variable size. One might also encounter *, which means that the dimension has a fixed, but unspecified, size.
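To make the notation concrete, the following pure-Python sketch derives a dims string of the same form from nested lists. The function `describe_dims` is hypothetical and only mimics the notation; it is not the detection routine used by Ibidas, and it does not model the * shape.

```python
def describe_dims(values):
    """Return a dims string such as 'd1:3<d2:~' for nested lists."""
    parts = []
    level = [values]  # all lists found at the current nesting depth
    depth = 1
    while level and all(isinstance(v, list) for v in level):
        lengths = {len(v) for v in level}
        # One shared length means a fixed size; otherwise mark it variable.
        size = str(lengths.pop()) if len(lengths) == 1 else "~"
        parts.append(f"d{depth}:{size}")
        # Descend one nesting level.
        level = [item for v in level for item in v]
        depth += 1
    return "<".join(parts)


print(describe_dims([[1, 2], [3, 4, 5], [6, 7, 8, 9]]))  # d1:3<d2:~
print(describe_dims([0.5, 0.3, 0.8]))                    # d1:3
```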

Dimensions are used in operations to determine how data is mapped with respect to each other:

>>> r.f0 + r.f1
Slices: | result
-------------------------------
Type:   | real64
Dims:   | d1:3<d2:~
Data:   |
        | [ 1.5  2.5]
        | [ 3.3  4.3  5.3]
        | [ 6.8  7.8  8.8  9.8]

To perform an addition, both operands would need the same dimension normally. As this is not the case, we perform broadcasting. As you can see, elements from slice f1 are broadcasted along dimension d2 to enable the addition of the two slices.
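Written out by hand in plain Python, the same computation looks roughly as follows. This is only an illustration of what broadcasting along d2 means, assuming the data from the example above; Ibidas performs this mapping for you.

```python
f0 = [[1, 2], [3, 4, 5], [6, 7, 8, 9]]   # dims d1:3<d2:~
f1 = [0.5, 0.3, 0.8]                     # dims d1:3

# Pair each inner list with its score (matching d1), then repeat the
# score for every element of the inner list (broadcasting along d2).
result = [[x + score for x in row] for row, score in zip(f0, f1)]
```

The result keeps the nested shape of f0: one inner list per element of d1, each as long as the corresponding inner list of f0.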

In case of type autodetection, dimension names are also assigned automatically (starting from d1). Dimensions can, however, be renamed in a similar way to slices, using ibidas.representor.Representor.DimRename():

>>> r.DimRename("pathways","genes")
Slices: | f0                 | f1
-----------------------------------------
Type:   | int64              | real64
Dims:   | pathways:3<genes:~ | pathways:3
Data:   |                    |
        | [1 2]              | 0.5
        | [3 4 5]            | 0.3
        | [6 7 8 9]          | 0.8

When performed without keywords, new dim names are mapped to dimensions by ordering dimensions on their nesting depth. If there are multiple choices possible, an error will be returned. Similar to slice renaming, a shorthand is available using the % operator.

Summary:

  • The use of dimensions allows one to have slices with different dimensions within the same representor object
  • The use of broadcasting allows these slices to still interact (e.g. to perform a comparison or other operation)
  • Dimensions have a name and a shape
  • Dimensions are separated by the < symbol
  • Dimensions can be renamed using the DimRename function or the % shorthand operation.

Dimension navigation

Specifying how operations should be mapped to dimensions is done using two mechanisms. The first specifies at what type an operation should be executed (i.e. should an operation work on the level of the list or the level of the list element?). The second specifies across which dimension an operation should be executed (should we sum values along dimension d1 or d2?).

The first mechanism is accessible through pack and unpack operations. There are two types of these operations:

  • pack/unpack from tuple type to slices and vice versa
  • pack/unpack from array type to dimensions and vice versa

Packing

The two most basic pack operations are respectively Tuple and Array.

An example of the Array function:

>>> data = [([1,2],0.5), ([3,4,5], 0.3), ([6,7,8,9], 0.8)]
>>> r = Rep(data)

>>> r.Array()
Slices: | f0           | f1
-----------------------------------------
Type:   | [d2:~]:int64 | [d1:3]:real64
Dims:   | d1:3         |
Data:   |              |
        | [1, 2]       | [ 0.5  0.3  0.8]
        | [3, 4, 5]    |
        | [6, 7, 8, 9] |

The influence of the array operation is that the dimensions are moved into the type. So subsequent operations are performed at the level of the arrays:

>>> r.Array().Get(_.f0 + _.f1)
Slices: | result
-------------------------------
Type:   | [d2_d1:~]:real64
Dims:   | d1:3
Data:   |
        | [1 2 0.5 0.3 0.8]
        | [3 4 5 0.5 0.3 0.8]
        | [6 7 8 9 0.5 0.3 0.8]

Note how an addition performed on arrays concatenates them.
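Plain Python lists behave analogously, which may help as a mental model: once an inner list is treated as a single value, + joins lists instead of adding their elements. The sketch below is plain Python, not Ibidas code.

```python
f0 = [[1, 2], [3, 4, 5], [6, 7, 8, 9]]
f1 = [0.5, 0.3, 0.8]

# Packed: each inner list is one value, so "+" joins it with the list f1.
packed_sum = [row + f1 for row in f0]
print(packed_sum[0])  # [1, 2, 0.5, 0.3, 0.8]
```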

Arrays can also be packed with other aggregate operations. For example, the Set function:

>>> r.f0.Set()
Slices: | f0
---------------------------
Type:   | {sd2:~}<int64
Dims:   | d1:3
Data:   |
        | set([1, 2])
        | set([3, 4, 5])
        | set([8, 9, 6, 7])


>>> r.f0.Set() | set([1])
Slices: | f0
------------------------------
Type:   | {sd2_d124:~}<int64
Dims:   | d1:3
Data:   |
        | set([1, 2])
        | set([1, 3, 4, 5])
        | set([8, 9, 1, 6, 7])

On sets, an or operation will take the union of two sets.
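The built-in Python set type uses the same operator for union, so the example above corresponds roughly to the following plain-Python sketch (an analogy only, not Ibidas code):

```python
f0 = [[1, 2], [3, 4, 5], [6, 7, 8, 9]]

# Pack each inner list as a set, then take the union with {1}.
as_sets = [set(row) for row in f0]
with_one = [s | {1} for s in as_sets]
print(with_one[1] == {1, 3, 4, 5})  # True
```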

Performing the Tuple operation gives:

>>> r.Tuple()
Slices: | data
---------------------------------------------
Type:   | (f0=[d2:~]:int64, f1=real64)
Dims:   | d1:3
Data:   |
        | ([1, 2], 0.5)
        | ([3, 4, 5], 0.29999999999999999)
        | ([6, 7, 8, 9], 0.80000000000000004)

As you can see, slice types are combined into a single slice tuple type.

Summary:
  • Packing moves dimensions or slices into types
  • Dimensions can be packed using Array and Set
  • Slices can be packed using Tuple

Unpacking

The reverse operations for Array and Tuple are respectively Elems and Fields:

>>> r.Tuple().Fields()
Slices: | f0        | f1
----------------------------
Type:   | int64     | real64
Dims:   | d1:3<d2:~ | d1:3
Data:   |           |
        | [1 2]     | 0.5
        | [3 4 5]   | 0.3
        | [6 7 8 9] | 0.8

These are used less commonly, as by default data is unpacked by Rep.
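The round trip between slices and a tuple type can be pictured in plain Python as zipping columns into records and unzipping them again. This is an analogy only; the dictionary layout and variable names are assumptions made for the sketch.

```python
columns = {"f0": [[1, 2], [3, 4, 5], [6, 7, 8, 9]], "f1": [0.5, 0.3, 0.8]}

# "Pack": combine the columns into one tuple per record.
records = list(zip(columns["f0"], columns["f1"]))

# "Unpack": split the records back into separate columns.
f0, f1 = (list(col) for col in zip(*records))
print(records[0])  # ([1, 2], 0.5)
```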

Summary:
  • Unpacking unpacks types, moving the data structure into dimensions and slices
  • tuple/dictionary types can be unpacked using Fields
  • array/set types can be unpacked using Elems

Dimension selection

Operations such as Sum are 1-dimensional: that is, they operate across a single dimension, summing the elements. Normally, this is done on the last dimension:

>>> r.Sum()
Slices: | f0    | f1
------------------------
Type:   | int64 | real64
Dims:   | d1:3  |
Data:   |       |
        | 3     | 1.6
        | 12    |
        | 30    |

Note how for both slices the last dimension has been collapsed (summed over).

Of course, we cannot directly sum over dimension d1 in slice f0, as the elements do not line up. But in case of a matrix this is possible:

>>> data = [([6,2],0.5), ([3,4], 0.3), ([6,4], 0.8)]
>>> r = Rep(data)
>>> r.Sum(dim='d1')
Slices: | f0    | f1
------------------------
Type:   | int64 | real64
Dims:   | d2:2  |
Data:   |       |
        | 15    | 1.6
        | 10    |

Note how only dimension d2 remains; dimension d1 has been aggregated over by the Sum function. Other functions which can be used in this way include Max, Min, Argmax, Argmin, Any, All, CumSum, Mean, Sort, Argsort, Rank and Std.
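For the matrix-shaped slice f0, the two choices of dimension correspond to row sums and column sums. A plain-Python sketch of the difference (an illustration, not Ibidas code):

```python
f0 = [[6, 2], [3, 4], [6, 4]]   # dims d1:3<d2:2

# Summing over the last dimension (d2) collapses each inner list.
sum_d2 = [sum(row) for row in f0]
# Summing over d1 adds values that share a position along d2.
sum_d1 = [sum(col) for col in zip(*f0)]
print(sum_d2, sum_d1)  # [8, 7, 10] [15, 10]
```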

Broadcasting on dimension

We already saw some examples of broadcasting in action. A simple example is this one:

>>> Rep([1,2,3]) + 3
Slices: | data
---------------
Type:   | int64
Dims:   | d1:3
Data:   |
        | 4
        | 5
        | 6

The value 3 is repeated along dimension d1 to enable the (0-dimensional) addition operation. This ‘repeating’ is called ‘broadcasting’.

It also works with more complicated data structures. Assume that we want to normalize the arrays in r, such that the mean value is equal to 0.0.

We can do that as follows:

>>> m = r.Mean()
Slices: | f0     | f1
---------------------------------
Type:   | real64 | real64
Dims:   | d1:3   |
Data:   |        |
        | 4.0    | 0.533333333333
        | 3.5    |
        | 5.0    |

>>> r - m
Slices: | f0          | f1
----------------------------------------
Type:   | real64      | real64
Dims:   | d1:3<d2:2   | d1:3
Data:   |             |
        | [ 2. -2.]   | -0.0333333333333
        | [-0.5  0.5] | -0.233333333333
        | [ 1. -1.]   | 0.266666666667

The Mean calculates the average value across the last dimension, i.e. dimension d2 for slice f0, and dimension d1 for slice f1. Using broadcasting, we can directly subtract this from the whole dataset.
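Spelled out in plain Python for slice f0, the subtraction amounts to computing one mean per inner list and repeating it for every element of that list. This sketch only illustrates the arithmetic; it is not Ibidas code.

```python
f0 = [[6, 2], [3, 4], [6, 4]]   # dims d1:3<d2:2

# Mean across the last dimension (d2): one value per inner list.
means = [sum(row) / len(row) for row in f0]
# Broadcast each mean back along d2 and subtract it.
centered = [[x - m for x in row] for row, m in zip(f0, means)]
print(centered)  # [[2.0, -2.0], [-0.5, 0.5], [1.0, -1.0]]
```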

If we now want to normalize only across dimension d1, this can be simply accomplished using:

>>> r - r.Mean(dim='d1')
Slices: | f0                        | f1
------------------------------------------------------
Type:   | real64                    | real64
Dims:   | d1:3<d2:2                 | d1:3
Data:   |                           |
        | [ 1.         -1.33333333] | -0.0333333333333
        | [-2.          0.66666667] | -0.233333333333
        | [ 1.          0.66666667] | 0.266666666667

To also divide by the standard deviation, we simply add:

>>> (r - r.Mean(dim='d1')) / r.Std(dim='d1')

A simple shortcut is:

>>> Alg.scaling.Whiten(r,dim='d1')

Broadcasting works by matching dimensions in the operands to each other, and expanding an operand along any dimension that it lacks but the other operand has.

There are some noteworthy special cases.

First, the ordering of the dimensions. Suppose we have a dataset:

>>> data = [([6,2],[0.5]), ([3,4], [0.3,0.4]), ([6,4], [0.8,0.2])]
>>> r = Rep(data)
Slices: | f0        | f1
---------------------------------
Type:   | int64     | real64
Dims:   | d1:3<d2:2 | d1:3<d3:~
Data:   |           |
        | [6 2]     | [ 0.5]
        | [3 4]     | [ 0.3  0.4]
        | [6 4]     | [ 0.8  0.2]

What happens if we add f0 and f1?

Consider first the case in which the dimensions are available in both operands, but not in the same order. E.g. can we add something with dimensions a<b to a slice with dimensions b<a? Ibidas does not reorder dimensions in such cases. So, suppose we have:

>>> r.f0
Slices: | f0
-------------------
Type:   | int64
Dims:   | d1:3<d2:2
Data:   |
        | [6 2]
        | [3 4]
        | [6 4]


>>> r.f0.Transpose()
Slices: | f0
-------------------
Type:   | int64
Dims:   | d2:2<d1:3
Data:   |
        | [6 3 6]
        | [2 4 4]


>>> r.f0 + r.f0.Transpose()
Slices: | f0
---------------------------------
Type:   | int64
Dims:   | d2:2<d1:3<d2:2
Data:   |
        | [[12  8] [6 7] [12 10]]
        | [[8 4] [7 8] [10  8]]

(Note that the Transpose operation reverses the dimension ordering.)

As you can see, the output is b<a<b (with a = d1 and b = d2). The rules for this type of broadcasting are as follows:
  • one starts with the rightmost operand, and the last dimension
  • one maps this dimension where possible to the other operands, taking the first matching dimension up from the most nested dimension.
  • next, one maps the next dimension in the rightmost operand. However, one only looks further upward from the last matched dimension in the other operand.
  • if a dimension cannot be matched, it is broadcasted.

E.g. in this case one starts with dimension d1 in the right operand, which is matched to dimension d1 in the left operand. The next dimension in the right operand (d2) cannot be matched in the left operand, as it has no dimension above d1, so it is broadcast. The right operand is then finished, so we move on to the left operand; as its d2 dimension has not yet been matched, it is broadcast to the right operand.
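The outcome of these rules can be reproduced by hand with nested loops. The plain-Python sketch below rebuilds the result of adding the matrix-shaped f0 to its transpose, in both operand orders, with explicit indices; the variable names are illustrative, and Ibidas derives this mapping itself from the dimension names.

```python
f0 = [[6, 2], [3, 4], [6, 4]]           # dims d1:3<d2:2
f0_t = [list(col) for col in zip(*f0)]  # dims d2:2<d1:3

# f0 + f0_t: result dims d2<d1<d2. The outer d2 index (i) comes from
# f0_t, the inner d2 index (k) from f0; d1 (j) is shared by both.
left_first = [[[f0[j][k] + f0_t[i][j] for k in range(2)]
               for j in range(3)]
              for i in range(2)]

# f0_t + f0: result dims d1<d2<d1. The outer d1 index (i) comes from
# f0, the inner d1 index (k) from f0_t; d2 (j) is shared by both.
right_first = [[[f0_t[j][k] + f0[i][j] for k in range(3)]
                for j in range(2)]
               for i in range(3)]

print(left_first[0])   # [[12, 8], [6, 7], [12, 10]]
print(right_first[0])  # [[12, 9, 12], [4, 6, 6]]
```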

Due to the ordering dependence of operands, this means that one can influence the dimension ordering by rearranging operands. For example:

>>> r.f0.Transpose() + r.f0
Slices: | f0
------------------------------
Type:   | int64
Dims:   | d1:3<d2:2<d1:3
Data:   |
        | [[12  9 12] [4 6 6]]
        | [[9 6 9] [6 8 8]]
        | [[12  9 12] [6 8 8]]

Now the ordering is a<b<a. In reality, these situations in which the operand position matters do not occur that often.

Summary:
  • Broadcasting maps dimensions in operands to each other, repeating data across dimensions that are missing from one of the operands
  • Broadcasting does not reorder dimensions.
  • In some cases, the ordering of the operands can influence the dimension ordering.

Broadcasting on position

Ibidas normally matches operands on dimension identity. In case these dimensions do not match, they are broadcast. This means that if one has: