Basic concepts ============== In this section we describe some of the basic underlying concepts of Ibidas. Representation -------------- In Ibidas one works with data sources by packaging them within a 'representation object'. Such data sources can be python objects, files, databases, and so on. Once they are wihtin a representation object they are all handled similarly. We start with a simple example, in which we package an integer object:: >>> Rep(3) Slices: | data --------------- Type: | int64 Dims: | Data: | | 3 .. tip:: The concepts ``slices``, ``types`` and ``dims`` will be explained in the next sections One can perform all kind of operations with data that is in a representation object, e.g:: >>> Rep([1,2,3]) + 3 Slices: | data --------------- Type: | int64 Dims: | d1:3 Data: | | 4 | 5 | 6 Summary: * A representor object encapsulates a data source. * Data sources can be python objects, but also files or databases. Query execution --------------- You might have noted that executing the previous commands resulted immediatly in a printout of the contents of the representation object. This is due to the IPython interpreter, which will print a representation of the result of all non-assignment operations. So, if we instead would have executed:: >>> r = Rep([1,2,3]) + 3 no output would have been printed. More importantly however, Ibidas would also not have performed the requested addition operation. I.e. ``r`` would have only been a representation of the requested operations and data sources, and thus not the result of those operations. .. important:: Operations are executed lazily. I.e. only when output is requested by the user. The reason for this behaviour is that it allows optimizations which are otherwise not possible. For example, it enables the system to translate queries against a database (partially) into the query language SQL. This way, instead of moving unnecessary data to Ibidas, operations can be moved to the database. .. note:: For brevity, we will sometimes print output in this tutorial after assignments (e.g. ``r = Rep(3)``. In reality this does not happen. One can still get the output after such statements, by simply executing ``r`` Often, one wants to just see a description of the contents of a representation object, not the actual data result itself. This can be done using the information attribute ``I``:: >>> r = Rep(["geneA","geneB"]) >>> r.I Slices: data Types: bytes[5] Dims: d1:2 Note that the data is not printed. Especially in case of slow operations or data sources this can be useful. On the other hand, there are also situations in which one wants to force the query to execute, e.g. because its result is used multiple times (which would otherwise result in a part of the query being executed multiple times). Execution can be forced by using the ``Copy()`` command:: >>> r = Rep([1,2,3]) + 4 >>> r = r.Copy() Getting data out of a representor object is simple, one simply appends ``()`` to a query to let it return the results as normal ``numpy`` or python objects:: >>> r() array(['geneA', 'geneB'], dtype=object) As you can see, Ibidas has packaged the data in a numpy array. Summary: * Operations are only executed when needed, to allow for optimizations * One can ask for a description of the representor contents using the ``I`` attribute. * Use the ``Copy`` command to force execution of a query. * One can get the data results by transforming the query into a function call by appending ``()`` Types ----- When one executes ``Rep`` without specifying a type, the type is detected automatically. For example, in the first example, the detect type was ``int64``. .. note:: Depending on the platform you use, the type can also be ``int32``. The type determines how operations on the representor are handled. For example, with an integer type, one can perform standard integer operations on the representor:: >>> r = Rep(3) >>> (r + 3) * r Slices: | data --------------- Type: | int64 Dims: | Data: | | 18 Similarly, in case of the string type, the addition operation becomes concatenation:: >>> Rep(["geneA", "geneB"]) + "_geneC" Slices: | data --------------------- Type: | bytes[11] Dims: | d1:2 Data: | | geneA_geneC | geneB_geneC One might have noted that, although we now represent a list of thins, the type still represents the type of the list elements. This is because ``Rep`` (by default) **unpacks** the data. By unpacking, operations will not be performed at the *list* level, but instead at the *list elements* level. Unpacking/packing will be explained further in one of the next sections. Summary: * A type is assigned automatically when packaging data using ``rep`` * The type indicates at which data nesting level operations are executed. Slices ------ Whereas lists in Ibidas are used to denote collections of data with the same type, tuples are used to describe data compositions of different types. You might have know such compositions as *records*, or simply as table rows. So, lets load a simple table:: >>> data = [('gene1',0.5), ('gene2', 0.3), ('gene3', 0.8)] >>> r = Rep(data) Slices: | f0 | f1 --------------------------- Type: | bytes[5] | real64 Dims: | d1:3 | d1:3 Data: | | | gene1 | 0.5 | gene2 | 0.3 | gene3 | 0.8 If we compare the output to earlier output, we now see that there are more than one columns in the data description. These columns represented *slices*. Slices are one of the main concepts in Ibidas. They can be compared to columns/fields in a table, but are more general. .. note:: Note that we use tuples to describe records, and a list to store the records of the table. In Ibidas, tuples are used to relate attributes that describe some common object, while lists are used to gather elements with the same type / role. Selecting a slice can be done using simple attribute lookup:: >>> r.f0 Slices: | f0 ------------------ Type: | bytes[5] Dims: | d1:3 Data: | | gene1 | gene2 | gene3 Each slice has a name (the first row), a type (second row) and a dimension (third row). Dimensions will be explained later. For now, it is important that each slice has a common type. This means that all data elements adressed by it can be handled in the same way. Slices could thus also be seen as a kind of cursor in your data structure. Performing operations on this cursor will perform the operations on a subset of your data. For example:: >>> r.f0 == "gene2" Slices: | f0 --------------- Type: | bool Dims: | d1:3 Data: | | False | True | False To select multiple slices, one can use the :py:meth:`ibidas.representor.Representor.get` function:: >>> r.Get("f1", "f0") Slices: | f1 | f0 --------------------------- Type: | real64 | bytes[5] Dims: | d1:3 | d1:3 Data: | | | 0.5 | gene1 | 0.3 | gene2 | 0.8 | gene3 >>> r.Get(1, 0) Slices: | f1 | f0 --------------------------- Type: | real64 | bytes[5] Dims: | d1:3 | d1:3 >>> r.Get(r.f1, r.f0) Slices: | f1 | f0 --------------------------- Type: | real64 | bytes[5] Dims: | d1:3 | d1:3 >>> r.Get(_.f1, _.f0) Slices: | f1 | f0 --------------------------- Type: | real64 | bytes[5] Dims: | d1:3 | d1:3 The last option shows the use of the context operator ``_``, which adresses the enclosing data representor (in this case `r`). This allows one to refer easily to slices in longer commands. One can also combine this with other operations:: >>> r.Get(_.f1 + 3, _.f0) Slices: | f1 | f0 --------------------------- Type: | real64 | bytes[5] Dims: | d1:3 | d1:3 Data: | | | 3.5 | gene1 | 3.3 | gene2 | 3.8 | gene3 One can also use this function to combine slices, e.g:: >>> r.Get(_.f0, Rep("cancer_genes")) Slices: | f0 | data --------------------------------- Type: | bytes[5] | bytes[12] Dims: | d1:3 | Data: | | | gene1 | cancer_genes | gene2 | | gene3 | When loading data from databases or files, often slice names are assigned as given in the data source. In case of loading from Python data, slice names are however autoassigned, and thus not very informative. To rename slices, one can use the :py:meth:`ibidas.representor.Representor.rename` function:: >>> r.Rename("genes","scores") Slices: | genes | scores >>> r.Rename(f1="scores") Slices: | f0 | scores As this functionality is used often, a shorter version is available:: >>> r/("genes","scores") Slices: genes scores >>> r/{f1:"scores"} Slices: f0 scores >>> r.Get(_.f0/"genes", Rep("cancer_genes")/"listname") Slices: genes listname Summary: * Slices can be compared to columns/fields in a table, or to data cursors which indicate on which data elements operations will be applied. * A representor object is a collection of slices * Attribute lookup can be used to select a single slice. * More advanced selection can be performed using the ``get`` function, allowing multiple slice selection, slice modifications and slice combination. Dimensions ---------- Up to now, our data model was very similar to ones used in other software. *Dimensions* allow Ibidas to handle more complex data structures. Lets replace the first field of the table with a nested, variable length list:: >>> data = [([1,2],0.5), ([3,4,5], 0.3), ([6,7,8,9], 0.8)] >>> r = Rep(data) Slices: | f0 | f1 ---------------------------- Type: | int64 | real64 Dims: | d1:3>> r.f0 + r.f1 Slices: | result ------------------------------- Type: | real64 Dims: | d1:3>> r.DimRename("pathways","genes") Slices: | f0 | f1 ----------------------------------------- Type: | int64 | real64 Dims: | pathways:3>> data = [([1,2],0.5), ([3,4,5], 0.3), ([6,7,8,9], 0.8)] >>> r = Rep(data) >>> r.Array() Slices: | f0 | f1 ----------------------------------------- Type: | [d2:~]:int64 | [d1:3]:real64 Dims: | d1:3 | Data: | | | [1, 2] | [ 0.5 0.3 0.8] | [3, 4, 5] | | [6, 7, 8, 9] | The influence of the array operation is that the dimensions are moved into the type. So subsequent operations are performed at the level of the arrays:: >>> r.Array().Get(_.f0 + _.f1) Slices: | result ------------------------------- Type: | [d2_d1:~]:real64 Dims: | d1:3 Data: | | [1 2 0.5 0.3 0.8] | [3 4 5 0.5 0.3 0.8] | [6 7 8 9 0.5 0.3 0.8 Note how an addition performed on arrays concatenates them. Arrays can also be packed with other aggregate operations. For example, the ``Set`` function:: >>> r.f0.Set() Slices: | f0 --------------------------- Type: | {sd2:~}>> r.f0.Set() | set([1]) Slices: | f0 ------------------------------ Type: | {sd2_d124:~}>> r.Tuple().Fields() Slices: | f0 | f1 ---------------------------- Type: | int64 | real64 Dims: | d1:3>> r.Sum() Slices: | f0 | f1 ------------------------ Type: | int64 | real64 Dims: | d1:3 | Data: | | | 3 | 1.6 | 12 | | 30 | Note how for both slices the last dimension has been collapsed (summed over). Of course, we cannot directly sum over dimension ``d1`` in slice ``f0``, as the elements do not line up. But in case of a matrix this is possible:: >>> data = [([6,2],0.5), ([3,4], 0.3), ([6,4], 0.8)] >>> r = Rep(data) >>> r.Sum(dim='d1') Slices: | f0 | f1 ------------------------ Type: | int64 | real64 Dims: | d2:2 | Data: | | | 15 | 1.6 | 10 | Note how only dimension ``d2`` remains, dimension ``d1`` has been aggrated over by the ``Sum`` function. Other functions which can be used in this way include ``Max``, ``Min``, ``Argmax``, ``Argmin``, ``Any``, ``All``, ``CumSum``, ``Mean``, ``Sort``, ``Argsort``, ``Rank`` and ``Std``. Broadcasting on dimension ------------------------- We already saw some examples of broadcasting in action. A simple example is this one:: >>> Rep([1,2,3]) + 3 Slices: | data --------------- Type: | int64 Dims: | d1:3 Data: | | 4 | 5 | 6 The value 3 is repeated along dimension ``d1`` to enable the (0-dimensional) addition operation. This 'repeating' is called 'broadcasting'. It also works with more complicated data structures. Assume that we want to normalize the arrays in ``r``, such that he mean value is equal to 0.0. We can do that as follows:: >>> m = r.Mean() Slices: | f0 | f1 --------------------------------- Type: | real64 | real64 Dims: | d1:3 | Data: | | | 4.0 | 0.533333333333 | 3.5 | | 5.0 | >>> r - m Slices: | f0 | f1 ---------------------------------------- Type: | real64 | real64 Dims: | d1:3>> r - r.Mean(dim='d1') Slices: | f0 | f1 ------------------------------------------------------ Type: | real64 | real64 Dims: | d1:3>> (r - r.Mean(dim='d1')) / r.Std(dim='d1') A simple shortcut is:: >>> Alg.scaling.Whiten(r,dim='d1') Broadcasting works by matching dimensions in the operands to each other, expanding dimensions that are not available in either operand. There are some noteworthy special cases. First, the ordering of the dimensions. Suppose we have a dataset:: >>> data = [([6,2],[0.5]), ([3,4], [0.3,0.4]), ([6,4], [0.8,0.2])] >>> r = Rep(data) Slices: | f0 | f1 --------------------------------- Type: | int64 | real64 Dims: | d1:3>> r.f0 Slices: | f0 ------------------- Type: | int64 Dims: | d1:3>> r.f0.Transpose() Slices: | f0 ------------------- Type: | int64 Dims: | d2:2>> r.f0 + r.f0.Transpose() Slices: | f0 --------------------------------- Type: | int64 Dims: | d2:2>> r.f0.Transpose() + r.f0 Slices: | f0 ------------------------------ Type: | int64 Dims: | d1:3