.. code-block:: python

    import glypy
    from glypy.io import glycoct
    from glypy.structure import (glycan, monosaccharide, substituent)
    
    from glypy import plot
    from glypy.algorithms import subtree_search, database
    
    %matplotlib inline
    from matplotlib import rcParams
    rcParams["figure.figsize"] = 10, 8
    from IPython.display import display, Image

The Database Module
===================

When we are interested in finding out what kinds of structures we might
have in some unknown data, the first place we start is with "what do we
think is possible?". This usually involves reading articles and
hand-tuning a hypothesis to test. Alternatively, we can throw everything
possible at the data and to see what sticks. In both of these cases we
need to actually have a collection of structures that we can curate,
annotate, and test.

The ``database`` module is here for provide a convenient and performant
container for glycan structures, implemented on top of ``sqlite3``.
We've pre-package all of the definite structures in Glycome-DB, which
you can download from PyPI or with the ``glypy.io.glycomedb`` module,
directly from `www.glycome-db.org <www.glycome-db.org>`__.

Let's assume you've downloaded the database and it's in your current
directory

.. code-block:: python

    glycomedb = database.dbopen("./glycomedb.db")
    glycomedb




.. parsed-literal::

    <RecordDatabase 28431 records>



Depending upon the version of ``glypy`` you have, you may see fewer or
more records as the library matures to handle more irregular structures.
As of this moment, it doesn't attempt to represent structures with
``REP`` repeating elements or ``UND`` undefined subgraphs. This version
was created as strictly as possible, so if we didn't represent it
exactly, it wasn't included.

If you're familiar with SQL, this block of code explains the structure
of the table as it is stored on disk in an SQLite database file.

.. code-block:: sql

    CREATE TABLE GlycanRecord(
        glycan_id INTEGER UNIQUE PRIMARY KEY NOT NULL, 
        mass float NOT NULL,
        structure TEXT NOT NULL, glycoct TEXT, /* Special */
        is_n_glycan BOOLEAN,
        composition VARCHAR(120)
    );

The ``structure`` field is special, it contains all of the data about
this record as a ``Python`` object,
`pickled <https://docs.python.org/2/library/pickle.html>`__ and
serialized in the database. This means we can add anything we like to
the Python object without needing to map it to a column in the database.
However, if we can map something to a column in the table, we can search
against it quickly.

Searching the Database
----------------------

The database object, ``glycomedb`` here, wraps an SQLite connection. In
other words, it can do everything
`sqlite3.Connection <https://docs.python.org/2/library/sqlite3.html?highlight=sqlite3#connection-objects>`__
does, execute SQL commands to quickly search, filter, add, or remove
records.

Here we'll search for every entry in the database that has the N-Glycan
Core motif, which has been mapped to the ``is_n_glycan`` column.

.. code:: python

    import time
    start = time.time()
    i = len(list(glycomedb.from_sql(glycomedb.execute("SELECT * FROM GlycanRecord WHERE is_n_glycan=1;"))))
    end = time.time()
    i, end - start




.. parsed-literal::

    (5443, 33.504000186920166)



It took 33 seconds to search nearly 28,000 records and extract those
that match the criterion. This is good if we routinely want to select
something depending upon whether or not it's an N-Glycan.

Some structures in ``Glycome-DB`` are reduced, meaning they have an
alditol group on their reducing end. This isn't mapped to the table
schema above, but we can still search the database using pure
``Python``. The database class, ``RecordDatabase``, will by default
iterate over all of the records in the ``GlycanRecord`` table in the
order they were inserted. Each record is an object of the class
``GlycanRecord``, which holds a ``Glycan`` object in its ``structure``
attribute, and lots of other annotations to be shown later.

.. code:: python

    start = time.time()
    i = 0
    for rec in glycomedb:
        if rec.structure.reducing_end is not None:
            i += 1
    end = time.time()
    i, end - start




.. parsed-literal::

    (3040, 77.43799996376038)



Looping over all 28,000 records in Python was over 2x slower, over a
minute to run. Though this comparison is not totally fair, it is a good
argument for using SQL to search. The slowest step is constructing the
object from the database. This keeps only one object in memory time.

We can also access records by their database ID number, if we know it.
For example
`Glycome-DB:183 <http://www.glycome-db.org/database/showStructure.action?glycomeId=183>`__

.. code:: python

    import urllib2
    Image(urllib2.urlopen("http://www.glycome-db.org/getSugarImage.action?id=183&type=cfg").read())




.. image:: output_10_0.png



.. code:: python

    record183 = glycomedb[183]
    draw_tree, axes = plot.plot(record183, label=True, scale=0.15)
    
    print record183
    print record183.taxa


.. parsed-literal::

    <GlycanRecord 183 2005.72436779>
    RES
    1b:b-dglc-HEX-1:5
    2s:n-acetyl
    3b:b-dglc-HEX-1:5
    4s:n-acetyl
    5b:b-dman-HEX-1:5
    6b:a-dman-HEX-1:5
    7b:b-dglc-HEX-1:5
    8s:n-acetyl
    9b:b-dgal-HEX-1:5
    10b:b-dglc-HEX-1:5
    11s:n-acetyl
    12b:b-dgal-HEX-1:5
    13b:a-dman-HEX-1:5
    14b:b-dglc-HEX-1:5
    15s:n-acetyl
    16b:b-dgal-HEX-1:5
    LIN
    1:1d(2+1)2n
    2:1o(4+1)3d
    3:3d(2+1)4n
    4:3o(4+1)5d
    5:5o(3+1)6d
    6:5o(6+1)13d
    7:6o(2+1)10d
    8:6o(4+1)7d
    9:7d(2+1)8n
    10:7o(4+1)9d
    11:10d(2+1)11n
    12:10o(4+1)12d
    13:13o(2+1)14d
    14:14d(2+1)15n
    15:14o(4+1)16d
    
    [<Taxon tax_id=9031 name=None entries=None>, <Taxon tax_id=9913 name=None entries=None>, <Taxon tax_id=11033 name=None entries=None>, <Taxon tax_id=9606 name=None entries=None>, <Taxon tax_id=9940 name=None entries=None>]
    


.. image:: output_11_1.png


If you look at the picture, this is the same structure, but you also see
it's ``GlycoCT``\ representation when the object is printed out. Another
of the facets of the ``GlycanRecord`` class is the storage for
provenance information, like species the structure is associated with.
The data-dump from ``Glycome-DB`` only contains taxon id numbers. You
might recognize 9606 as Human, but the others are probably unfamiliar.

It's outside the scope of this project to automatically include that
sort of information, however the
`taxonomylite <http://pythonhosted.org//taxonomylite/>`__ package
(disclaimer: I am the an author and the maintainer) does this job
nicely. If we also have downloaded or built a ``Taxonomy`` database from
``taxonomylite``, we can put names to those numbers:

.. code:: python

    import taxonomylite
    taxa_db = taxonomylite.Taxonomy("taxonomy.db")
    [(taxa_db.tid_to_name(taxon.tax_id), taxon.tax_id) for taxon in record183.taxa]




.. parsed-literal::

    [(u'Gallus gallus', '9031'),
     (u'Bos taurus', '9913'),
     (u'Semliki Forest virus', '11033'),
     (u'Homo sapiens', '9606'),
     (u'Ovis aries', '9940')]



We could also mix the ``Taxonomy`` database directly into ``glycomedb``,
but that would quadruple the file size.

Because ``GlycanRecord`` objects are *just* ``Python`` objects, we can
add new attributes to them and save them to the database for later.

.. code:: python

    record183.fragments = list(record183.structure.fragments("AXBYCZ"))
    print len(record183.fragments)
    record183.update()


.. parsed-literal::

    220
    

Here we've added a new attribute, and called the record's ``update()``
method, which writes its current state to the database. If we go load
the record from disk again, the new attribute should still be present.

.. code:: python

    len(glycomedb[183].fragments)




.. parsed-literal::

    220



An Application
--------------

What if we wanted to do something like build a database of human
N-Glycan structures? We could do it easily in memory by doing something
like this:

.. code:: python

    human_n_glycans = []
    for row in glycomedb.execute("SELECT * FROM GlycanRecord WHERE is_n_glycan=1;"):
        record = glycomedb.record_type.from_sql(row, glycomedb) # Convert each raw row into GlycanRecord instance
        for taxon in record.taxa:
            if taxon.tax_id == "9606":
                human_n_glycans.append(record)
                break
    print len(human_n_glycans)


.. parsed-literal::

    888
    

1. This first reduces the number of records to search in ``Python`` by
   using SQL to quickly pull out all N-Glycans, then convert those rows
   of the database to python objects using ``from_sql()``.
2. For each record retrieved, test if any of its taxa are Human. If so,
   add them to the list and move on to the next record

So we have 888 records in memory. While we're at it, we'll say the
experiment we have in mind will be on permethylated, reduced glycans, so
let's reduce them and derivatize them.

.. code:: python

    from glypy.composition.composition_transform import derivatize
    
    for record in human_n_glycans:
        record.structure.set_reducing_end(True)
        derivatize(record.structure, "methyl")
    

The records are still in memory. We can write them to disk in
``glycomedb``, but that's probably not what we want, since we've
transformed these structures and we want to keep our reference database
clean. We can create a new database object to save them in easily
though.

First, let's remember that we really wanted to be able to tell easily if
a structure was "high mannose" or not. We'll say something is "high
mannose" if it has more than 5 Hexose in its composition. This
classification may be dubious, but for some applications, it may be
valid. We could make it a new attribute on the object, but that would
probably take too long to filter by. Let's try adding it to the new
database's table schema.

To do that, we first need to create a new record type, derived from
``GlycanRecord``

.. code:: python

    def is_high_mannose(record):
        return int(record.monosaccharides['Hex'] > 4)
    
    @database.column_data("is_high_mannose", "BOOLEAN NOT NULL", is_high_mannose)
    class IsHighMannoseGlycanRecord(database.GlycanRecord):
        pass
    
    experiment_db = database.dbopen("experiment.db", record_type=IsHighMannoseGlycanRecord, flag='w')
    experiment_db.load_data(human_n_glycans, set_id=False)
    experiment_db.apply_indices()
    print len(experiment_db)
    


.. parsed-literal::

    888
    

This new class is a straight copy of the ``GlycanRecord`` class's
internal logic, except that it now includes a new column in the mapped
SQL schema. The column, ``is_high_mannose``, is declared as a
``BOOLEAN`` data type, and it is mapped by this function:

.. code:: python

    def is_high_mannose(record):
        return int(record.monosaccharides['Hex'] > 4)

You might ask why the result is cast to an ``int`` instead of left as
``True`` or ``False``. The reason is that SQLite doesn't have a key word
literal for boolean values, and just treates 0 as ``False`` and any
other number as ``True``.

The resulting table schema looks like:

.. code:: sql

    /* Generaged by calling '\n'.join(IsHighMannoseGlycanRecord.sql_schema()) */
    CREATE TABLE GlycanRecord(
        glycan_id INTEGER UNIQUE PRIMARY KEY NOT NULL,
        mass float NOT NULL,
        structure TEXT NOT NULL, glycoct TEXT,
        is_high_mannose BOOLEAN, /* Newly created */
        is_n_glycan BOOLEAN,
        composition VARCHAR(120));

Notice the new column added. We can now quickly filter structures by the
``is_high_mannose`` criterion. After loading the data, we call
``apply_indices`` to index the database by mass. We could add more
indices if we wished to by executing raw SQL.

.. code:: python

    start = time.time()
    res = len(list(experiment_db.from_sql(experiment_db.execute("SELECT * FROM GlycanRecord WHERE is_high_mannose=1;"))))
    res, time.time() - start




.. parsed-literal::

    (595, 10.51199984550476)



.. code:: python

    i = 0
    start = time.time()
    for record in experiment_db:
        if is_high_mannose(record):
            i += 1
    i, time.time() - start




.. parsed-literal::

    (595, 105.88299989700317)



.. code:: python

    start = time.time()
    i = 0
    for record in human_n_glycans:
        if is_high_mannose(record):
            i += 1
    i, time.time() - start 




.. parsed-literal::

    (595, 108.72199988365173)



Using the precomputed SQL field, we can find and extract all 595 records
in 10 seconds. Doing it with a database iterator takes 10 times longer.
This is because it needs to construct every record and executes the test
over and over again.

Mass Searching
--------------

If we know we're looking for all the entries in our database which are
near a particular mass value, it's straight-forward to query

.. code:: python

    for match in (experiment_db.ppm_match_tolerance_search(2063.0773, 1e-5)):
        plot.plot(match, label=True, scale=0.135)



.. image:: output_29_0.png



.. image:: output_29_1.png



.. image:: output_29_2.png



.. image:: output_29_3.png



.. image:: output_29_4.png



.. image:: output_29_5.png



.. image:: output_29_6.png



.. image:: output_29_7.png



.. image:: output_29_8.png



.. image:: output_29_9.png



.. image:: output_29_10.png



.. image:: output_29_11.png



.. image:: output_29_12.png



.. image:: output_29_13.png



.. image:: output_29_14.png



.. image:: output_29_15.png



.. image:: output_29_16.png



.. image:: output_29_17.png



.. image:: output_29_18.png


It looks like there are several linkage variants of the many similar
topologies and composition in the database that match that mass.

If we had tandem mass spectra with the critical fragment, say a cross
ring cleavage along the central Mannose of the N-Glycan Core motif, we
could discern which broad topology we had. With MSn, determining linkage
would be doable too.