.. _indexes:

Indexes
=======

.. testsetup:: *

   import seqan

An index is the result of pre-processing a text to allow fast look-up
and search operations. The enhanced suffix array (ESA) is an index that allows
retrieval of substrings based on their content. We can build an ESA from a string
set easily.

..  doctest::

    >>> seqs = seqan.StringDNASet(('ACGT', 'AAAA', 'GGGG', 'AC'))
    >>> index = seqan.IndexStringDNASetESA(seqs)


.. _iterators:

Iterators
---------

Once we have the index we can find substrings in the string set using a top-down iterator.

..  doctest::

    >>> iterator = index.topdown()
    >>> if iterator.goDown('AC'):
    ...     print '"{0}" starts a suffix in our index'.format(iterator.representative)
    ...     for occ in iterator.occurrences:
    ...         print 'Found "{0}" in sequence {1} at position {2}'.format(iterator.representative, occ.i1, occ.i2)
    "AC" starts a suffix in our index
    Found "AC" in sequence 3 at position 0
    Found "AC" in sequence 0 at position 0

An iterator points to a particular location in an index. In the case of an ESA,
it identifies all the suffixes that start with a given representative substring.
Here we used the *occurrences* property of the top-down iterator to locate all the
occurrences of the substring AC in the original string set. We also used the
*representative* property to access the substring the iterator refers to. The *replength*
property is an efficient way to access the length of the representative substring.

..  doctest::

    >>> print '"{0}" has length {1}'.format(iterator.representative, iterator.repLength)
    "AC" has length 2

The *numOccurrences* property holds the number of occurrences of the substring in the
entire string set.

..  doctest::

    >>> print '"{0}" has {1} occurrences'.format(iterator.representative, iterator.numOccurrences)
    "AC" has 2 occurrences


The *numChildren* property holds the number of occurrences in the children of this iterator, that
is it shows how many of the representative substring for this node do not terminate here.

..  doctest::

    >>> print '"{0}" has {1} occurrence in child nodes'.format(iterator.representative, iterator.numChildren)
    "AC" has 1 occurrence in child nodes

Top-down iterators can go down and right to traverse the entire index from the root node.
SeqAn also provides a top-down-history iterator that can go up.

..  doctest::

    >>> historyiterator = index.topdownhistory()
    >>> if historyiterator.goDown('AC'):
    ...     print '"{0}" starts a suffix in our index'.format(historyiterator.representative)
    ...     if historyiterator.goUp():
    ...         print '"{0}" starts a suffix in our index'.format(historyiterator.representative)
    "AC" starts a suffix in our index
    "A" starts a suffix in our index


.. _traversal:

Traversal
---------

Top-down iterators can traverse the entire index in a top-down manner using recursion.

..  doctest::

    >>> def descend(it):
    ...     print 'Index has {0} occurrence(s) of representative "{1}"'.format(
    ...         it.numOccurrences, it.representative)
    ...     if it.goDown():
    ...         while True:
    ...             descend(it.copy())
    ...             if not it.goRight():
    ...                 break
    >>> descend(index.topdown())
    Index has 14 occurrence(s) of representative ""
    Index has 6 occurrence(s) of representative "A"
    Index has 3 occurrence(s) of representative "AA"
    Index has 2 occurrence(s) of representative "AAA"
    Index has 1 occurrence(s) of representative "AAAA"
    Index has 2 occurrence(s) of representative "AC"
    Index has 1 occurrence(s) of representative "ACGT"
    Index has 2 occurrence(s) of representative "C"
    Index has 1 occurrence(s) of representative "CGT"
    Index has 5 occurrence(s) of representative "G"
    Index has 3 occurrence(s) of representative "GG"
    Index has 2 occurrence(s) of representative "GGG"
    Index has 1 occurrence(s) of representative "GGGG"
    Index has 1 occurrence(s) of representative "GT"
    Index has 1 occurrence(s) of representative "T"

A similar descent with top-down-history iterators is best done using *goUp()* rather
than *copy()*. In fact this method seems to be slightly more efficient than using *copy()*
with top-down iterators.

..  doctest::

    >>> def descendhistory(it):
    ...     print 'Index has {0} occurrence(s) of representative "{1}"'.format(
    ...         it.numOccurrences, it.representative)
    ...     if it.goDown():
    ...         while True:
    ...             descendhistory(it)
    ...             if not it.goRight():
    ...                 break
    ...         it.goUp()
    >>> descend(index.topdown())
    Index has 14 occurrence(s) of representative ""
    Index has 6 occurrence(s) of representative "A"
    Index has 3 occurrence(s) of representative "AA"
    Index has 2 occurrence(s) of representative "AAA"
    Index has 1 occurrence(s) of representative "AAAA"
    Index has 2 occurrence(s) of representative "AC"
    Index has 1 occurrence(s) of representative "ACGT"
    Index has 2 occurrence(s) of representative "C"
    Index has 1 occurrence(s) of representative "CGT"
    Index has 5 occurrence(s) of representative "G"
    Index has 3 occurrence(s) of representative "GG"
    Index has 2 occurrence(s) of representative "GGG"
    Index has 1 occurrence(s) of representative "GGGG"
    Index has 1 occurrence(s) of representative "GT"
    Index has 1 occurrence(s) of representative "T"


Top-down-history iterators also provide *goBegin()* and *goNext()* methods that allow
the following depth-first method of traversal. This seems to be the most efficient
traversal method.

..  doctest::

    >>> it = index.topdownhistory()
    >>> it.goBegin()
    >>> while not it.atEnd:
    ...     print 'Index has {0} occurrence(s) of representative "{1}"'.format(
    ...         it.numOccurrences, it.representative)
    ...     it.goNext()
    Index has 1 occurrence(s) of representative "AAAA"
    Index has 2 occurrence(s) of representative "AAA"
    Index has 3 occurrence(s) of representative "AA"
    Index has 1 occurrence(s) of representative "ACGT"
    Index has 2 occurrence(s) of representative "AC"
    Index has 6 occurrence(s) of representative "A"
    Index has 1 occurrence(s) of representative "CGT"
    Index has 2 occurrence(s) of representative "C"
    Index has 1 occurrence(s) of representative "GGGG"
    Index has 2 occurrence(s) of representative "GGG"
    Index has 3 occurrence(s) of representative "GG"
    Index has 1 occurrence(s) of representative "GT"
    Index has 5 occurrence(s) of representative "G"
    Index has 1 occurrence(s) of representative "T"
    Index has 14 occurrence(s) of representative ""


Lazy initialisation
-------------------

Many of the seqan data structures are initialised lazily, that is they are not
constructed until they are first required. Sometimes you might want to ensure
the structures are initialised before saving or some other operation. This can
be achieved via the *traverse()* function.

..  doctest::

    >>> import seqan.traverse
    >>> seqan.traverse.depthfirsttraversal(index, lambda it: True)
    <...>


Vertexes
--------

At a low level in the SeqAn library each iterator refers to a vertex in the ESA. The values
of these vertexes can be accessed through the *value* property.

..  doctest::

    >>> print 'The ID of the vertex for "{0}" is {1}'.format(iterator.representative, iterator.value.id)
    The ID of the vertex for "AC" is 8

These values can be used to refer to vertexes in a stable way, providing the underlying ESA does
not change.

..  doctest::

    >>> it = index.topdown()
    >>> it.goDown('AC')
    True
    >>> print it.representative
    AC
    >>> vertex = it.value  # Save vertex for "AC"
    >>> it.goDown('GT')
    True
    >>> print it.representative
    ACGT
    >>> it.value = vertex  # Reset to vertex for "AC"
    >>> print it.representative
    AC