xml - utilities for XML parsing
This module is not intended for end users. It implements the abstract classes for all XML parsers, XML and IndexedXML, and some utility functions.
Dependencies
This module requires lxml and numpy.
-
class pyteomics.xml.ByteCountingXMLScanner(source, indexed_tags, block_size=1000000)
Bases: pyteomics.auxiliary._file_obj
Carry out the construction of a byte offset index for the source XML file, for each type of tag in indexed_tags.
Inherits from pyteomics.auxiliary._file_obj to support the object-oriented _keep_state() interface.
Methods
build_byte_index(*args, **kwargs): Builds a byte offset index for one or more types of tags.
scan(source, indexed_tags)
-
__init__(source, indexed_tags, block_size=1000000)
Parameters:
indexed_tags : iterable of bytes
The XML tags (without namespaces) to build indices for.
block_size : int, optional
The size of each chunk or “block” of the file to hold in memory as a partitioned string at any given time. Defaults to 1000000.
-
build_byte_index(*args, **kwargs)
Builds a byte offset index for one or more types of tags.
Parameters:
lookup_id_key_mapping : Mapping, optional
A mapping from tag name to the attribute used to look up the identity of each entity of that type. Defaults to ‘id’ for each type of tag.
Returns:
defaultdict(ByteEncodingOrderedDict)
Mapping from tag type to a ByteEncodingOrderedDict that maps each identifier to its byte offset.
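The idea of a per-tag byte offset index can be illustrated with a minimal standard-library sketch. It is not the pyteomics implementation: the helper name build_offset_index, the regular-expression approach, and the sample document are assumptions made for illustration, and the whole input is held in memory instead of being read in blocks.

```python
import re
from collections import OrderedDict, defaultdict


def build_offset_index(data, indexed_tags, id_attr=b"id"):
    """Map tag name -> {id value -> byte offset of the opening tag}.

    Hypothetical helper for illustration only; not part of pyteomics.
    """
    index = defaultdict(OrderedDict)
    for tag in indexed_tags:
        # Match an opening tag such as <spectrum ... id="..."> in raw bytes.
        pattern = re.compile(
            rb"<" + re.escape(tag) + rb"\s[^>]*?\b"
            + re.escape(id_attr) + rb'="([^"]*)"'
        )
        for match in pattern.finditer(data):
            # match.start() is the position of the '<' that opens the element.
            index[tag][match.group(1)] = match.start()
    return index


# Made-up sample document with two indexed elements.
sample = (
    b'<run><spectrum index="0" id="scan=1"><v>1</v></spectrum>'
    b'<spectrum index="1" id="scan=2"><v>2</v></spectrum></run>'
)
offsets = build_offset_index(sample, [b"spectrum"])
```

The result has the same shape as the documented return value: one ordered mapping per tag type, from identifier to byte offset.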
-
-
class pyteomics.xml.FlatTagSpecificXMLByteIndex(source, indexed_tags=None, keys=None)
Bases: pyteomics.xml.TagSpecificXMLByteIndex
An alternative interface on top of TagSpecificXMLByteIndex that assumes that identifiers across different tags are globally unique, as in MzIdentML.
Attributes
offsets : ByteEncodingOrderedDict
The mapping between ids and byte offsets.
Methods
build_index()
items()
keys()
-
__init__(source, indexed_tags=None, keys=None)
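As a rough illustration of what a flat index means, the sketch below merges a per-tag index of the form {tag: {id: offset}} into a single {id: offset} mapping. The function name flatten_index, the ordering by offset, the duplicate check, and the sample ids are all assumptions of this sketch, not documented behavior of the class.

```python
from collections import OrderedDict


def flatten_index(nested):
    """Merge {tag: {id: offset}} into one {id: offset} mapping.

    Hypothetical helper; entries are ordered by byte offset, which is a
    choice made for this sketch.
    """
    flat = OrderedDict()
    pairs = [item for by_id in nested.values() for item in by_id.items()]
    for elem_id, offset in sorted(pairs, key=lambda pair: pair[1]):
        if elem_id in flat:
            # A flat index only makes sense if ids are globally unique.
            raise ValueError("duplicate id across tags: %r" % (elem_id,))
        flat[elem_id] = offset
    return flat


# Made-up per-tag index for two tag types.
nested = {"Peptide": {"pep_1": 120, "pep_2": 480}, "DBSequence": {"db_1": 40}}
flat = flatten_index(nested)
```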
-
-
class pyteomics.xml.IndexedXML(*args, **kwargs)
Bases: pyteomics.xml.XML
Subclass of XML which uses an index of byte offsets for some elements for quick random access.
Methods
build_id_cache(*args, **kwargs): Construct a cache for each element in the document, indexed by id.
build_tree(*args, **kwargs): Build and store the ElementTree instance.
clear_id_cache(): Clear the element ID cache.
clear_tree(): Remove the saved ElementTree.
get_by_id(*args, **kwargs): Retrieve the requested entity by its id.
iterfind(*args, **kwargs): Parse the XML and yield info on elements with specified local name or by specified “XPath”.
next()
reset()
-
__init__(*args, **kwargs)
Create an XML parser object.
Parameters:
source : str or file
File name or file-like object corresponding to an XML file.
read_schema : bool, optional
Defines whether the schema file referenced in the file header should be used to extract information about value conversion. Default is True.
iterative : bool, optional
Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
use_index : bool, optional
Defines whether an index of byte offsets needs to be created for elements listed in indexed_tags. This is useful for random access to spectra in mzML or elements of mzIdentML files, or for iterative parsing of mzIdentML with retrieve_refs=True. If True, build_id_cache is ignored. If False, the object acts exactly like XML. Default is True.
indexed_tags : container of bytes, optional
If use_index is True, elements listed in this parameter will be indexed. Empty set by default.
-
build_id_cache(*args, **kwargs)
Construct a cache for each element in the document, indexed by id attribute.
-
build_tree(*args, **kwargs)
Build and store the ElementTree instance for the underlying file.
-
clear_id_cache()
Clear the element ID cache.
-
clear_tree()
Remove the saved ElementTree.
-
get_by_id(*args, **kwargs)
Retrieve the requested entity by its id. If the entity is a spectrum described in the offset index, it is retrieved by seeking directly to the starting position of the entry; otherwise, the method falls back to parsing from the start of the file.
Parameters:
elem_id : str
The id value of the entity to retrieve.
Returns:
dict
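The "seek to the starting position" step can be sketched with the standard library alone. This is an illustration of the general technique, not the library's code: read_element_at is a hypothetical helper, and it assumes that elements of the given tag are not nested inside each other, are not self-closing, and use no namespace prefix.

```python
import io
import xml.etree.ElementTree as ET


def read_element_at(fileobj, offset, tag, block_size=1024):
    """Seek to `offset` and parse the single `tag` element that starts there.

    Hypothetical helper for illustration only; not part of pyteomics.
    """
    closing = b"</" + tag + b">"
    fileobj.seek(offset)
    buffer = b""
    while True:
        chunk = fileobj.read(block_size)
        if not chunk:
            raise ValueError("closing tag not found")
        buffer += chunk
        # Search the accumulated buffer so a closing tag split across
        # two chunks is still found.
        end = buffer.find(closing)
        if end != -1:
            return ET.fromstring(buffer[: end + len(closing)])


# Made-up document; the offset would normally come from a prebuilt index.
data = (
    b'<run><spectrum id="a"><v>1</v></spectrum>'
    b'<spectrum id="b"><v>2</v></spectrum></run>'
)
offset = data.index(b'<spectrum id="b"')
elem = read_element_at(io.BytesIO(data), offset, b"spectrum")
```

Only the bytes from the offset to the matching closing tag are parsed, which is why an offset index makes random access cheap compared with parsing from the start of the file.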
-
iterfind(*args, **kwargs)
Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters:
path : str
Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition at the end, such as: "/path/to/element[some_value>1.5]". Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
**kwargs : passed to self._get_info_smart().
Returns:
out : iterator
-
-
class pyteomics.xml.TagSpecificXMLByteIndex(source, indexed_tags=None, keys=None)
Bases: object
Encapsulates the construction and querying of a byte offset index for a set of XML tags.
This type mimics an immutable Mapping.
Parameters:
indexed_tags : iterable of bytes
The tag names to include in the index.
Attributes
indexed_tags : iterable of bytes
The tag names to index, not including a namespace.
offsets : defaultdict(OrderedDict(str, int))
The hierarchy of byte offsets organized {"tag_type": {"id": byte_offset}}.
indexed_tag_keys : dict(str, str)
A mapping from tag name to unique identifier attribute.
Methods
build_index(): Perform the byte offset index building for source.
items()
keys()
-
class pyteomics.xml.XML(source, read_schema=True, iterative=True, build_id_cache=False, **kwargs)
Bases: pyteomics.auxiliary.FileReader
Base class for all format-specific XML parsers. The instances can be used as context managers and as iterators.
Methods
build_id_cache(*args, **kwargs): Construct a cache for each element in the document, indexed by id.
build_tree(*args, **kwargs): Build and store the ElementTree instance.
clear_id_cache(): Clear the element ID cache.
clear_tree(): Remove the saved ElementTree.
get_by_id(*args, **kwargs): Parse the file and return the element with id attribute equal to elem_id.
iterfind(*args, **kwargs): Parse the XML and yield info on elements with specified local name or by specified “XPath”.
next()
reset()
-
__init__(source, read_schema=True, iterative=True, build_id_cache=False, **kwargs)
Create an XML parser object.
Parameters:
source : str or file
File name or file-like object corresponding to an XML file.
read_schema : bool, optional
Defines whether the schema file referenced in the file header should be used to extract information about value conversion. Default is True.
iterative : bool, optional
Defines whether an ElementTree object should be constructed and stored on the instance or if iterative parsing should be used instead. Iterative parsing keeps the memory usage low for large XML files. Default is True.
build_id_cache : bool, optional
Defines whether a dictionary mapping IDs to XML tree elements should be built and stored on the instance. It is used in XML.get_by_id(), e.g. when using pyteomics.mzid.MzIdentML with retrieve_refs=True.
-
build_id_cache(*args, **kwargs)
Construct a cache for each element in the document, indexed by id attribute.
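Conceptually, an id cache is a dictionary from each id attribute value to the corresponding tree element. The following is a minimal sketch of that idea using the standard library's ElementTree; the helper name make_id_cache and the sample document are assumptions, and the actual method works on the parser's own stored tree.

```python
import xml.etree.ElementTree as ET


def make_id_cache(root):
    """Map each 'id' attribute value to its element.

    Hypothetical helper for illustration only; not part of pyteomics.
    """
    # root.iter() walks the whole tree, including nested elements.
    return {
        elem.get("id"): elem
        for elem in root.iter()
        if elem.get("id") is not None
    }


# Made-up document with ids at two nesting levels.
root = ET.fromstring(
    '<doc><Peptide id="pep_1"/><group><Peptide id="pep_2"/></group></doc>'
)
cache = make_id_cache(root)
```

With such a cache, a lookup by id is a single dictionary access instead of a new pass over the file.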
-
build_tree(*args, **kwargs)
Build and store the ElementTree instance for the underlying file.
-
get_by_id(*args, **kwargs)
Parse the file and return the element with id attribute equal to elem_id. Returns None if no such element is found.
Parameters:
elem_id : str
The value of the id attribute to match.
Returns:
out : dict or None
-
iterfind(*args, **kwargs)
Parse the XML and yield info on elements with specified local name or by specified “XPath”.
Parameters:
path : str
Element name or XPath-like expression. Only local names separated with slashes are accepted. An asterisk (*) means any element. You can specify a single condition at the end, such as: "/path/to/element[some_value>1.5]". Note: you can do much more powerful filtering using plain Python. The path can be absolute or “free”. Please don’t specify namespaces.
**kwargs : passed to self._get_info_smart().
Returns:
out : iterator
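To make the notions of matching by local name and filtering in plain Python concrete, here is a small standard-library sketch. It does not call pyteomics: iter_by_local_name is a hypothetical stand-in that yields attribute dictionaries, whereas the real method yields richer per-element info, and the sample document and its score attribute are made up.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iter_by_local_name(source, name):
    """Yield attribute dicts of elements whose local name equals `name`.

    Hypothetical helper for illustration only; not part of pyteomics.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == name:
            yield dict(elem.attrib)
            elem.clear()  # drop the element's content to keep memory low


doc = io.BytesIO(
    b'<root xmlns="http://example.org/ns">'
    b'<item id="a" score="0.5"/><item id="b" score="2.0"/>'
    b"</root>"
)
# Plain-Python filtering, analogous to a condition like [score>1.5] in a path.
high = [d for d in iter_by_local_name(doc, "item") if float(d["score"]) > 1.5]
```

Because the filter is an ordinary Python expression, it can combine several conditions, which a single trailing path condition cannot.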
-