XML Dump Iteration

These classes form the basis of iterative processing of XML dumps. Their datatypes are based on those found in mwtypes (http://pythonhosted.org/mwtypes).

class mwxml.Dump(site_info, items)

XML dump iterator. Holds the dump file's metadata and an iterator over its Page elements. Instances of this class can be iterated over directly. Usually, you’ll want to construct an instance using from_file().

Parameters:
site_info
: SiteInfo

The data from the <siteinfo> block

items
: iterable

An iterable of Page and/or LogItem in the order they appear in the XML

Example:
from mwxml import Dump

# Construct dump file iterator; keep the file open while iterating
with open("example/dump.xml") as f:
    dump = Dump.from_file(f)

    # Iterate through pages
    for page in dump.pages:
        # Iterate through a page's revisions
        for revision in page:
            print(revision.id)
Attributes:
site_info = Metadata from the <siteinfo> block : mwxml.SiteInfo

pages = An iterator of the mwxml.Page elements that appear in the dump : iterator

items = An iterator of the mwxml.Page and/or mwxml.LogItem elements that appear in the dump : iterator

log_items = An iterator of the mwxml.LogItem elements that appear in the dump : iterator
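To illustrate what processing a dump one page at a time looks like, here is a minimal sketch that uses only the standard library. It does not use mwxml and is not a description of how mwxml is implemented; the function name iter_pages and the sample XML are made up for illustration, and the sample omits the XML namespace that real dumps declare.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample; real dumps are far larger and declare an xmlns.
SAMPLE = """<mediawiki>
  <page>
    <title>Foo</title>
    <id>1</id>
    <revision><id>10</id></revision>
    <revision><id>11</id></revision>
  </page>
  <page>
    <title>Bar</title>
    <id>2</id>
    <revision><id>20</id></revision>
  </page>
</mediawiki>"""


def iter_pages(f):
    """Yield (page_id, title, revision_ids) for each <page>, one at a time."""
    for _event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "page":
            page_id = int(elem.findtext("id"))
            title = elem.findtext("title")
            rev_ids = [int(r.findtext("id")) for r in elem.findall("revision")]
            yield page_id, title, rev_ids
            # Drop the finished page's children so memory use stays bounded.
            elem.clear()


pages = list(iter_pages(io.StringIO(SAMPLE)))
for page_id, title, rev_ids in pages:
    print(page_id, title, rev_ids)
```

The generator yields each page as soon as its closing tag has been read, which is the property that makes iteration practical for dumps that do not fit in memory.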

classmethod from_file(f)

Constructs a Dump from a file pointer.

Parameters:
f
: file

A plain text file pointer containing XML to process
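Dumps are commonly distributed compressed, so it may help to see how a compressed file can be opened as a plain text file pointer. The sketch below uses only the standard library and writes its own tiny bz2 file first; the file name and XML content are made up. That the resulting handle is suitable for from_file() is an assumption based on the "plain text file pointer" wording above, and the call itself is not executed here.

```python
import bz2
import os
import tempfile

# Made-up stand-in for a real dump.
xml_text = "<mediawiki><siteinfo><sitename>Example</sitename></siteinfo></mediawiki>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dump.xml.bz2")

    # Create a small compressed file to read back.
    with bz2.open(path, "wt", encoding="utf-8") as out:
        out.write(xml_text)

    # Mode "rt" decompresses and decodes to str on the fly.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        # f is a text-mode file object; this is where one would
        # presumably call Dump.from_file(f) instead of reading directly.
        round_trip = f.read()

print(round_trip == xml_text)
```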

classmethod from_page_xml(page_xml)

Constructs a Dump from a <page> block.

Parameters:
page_xml
: str | file

Either a plain string or a file containing <page> block XML to process
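An argument typed as "str | file" has to be normalized somewhere before parsing. The following standard-library sketch shows one way that could be done; as_text_file and page_title are hypothetical helpers invented for this example, not part of mwxml, and the sample <page> block is made up.

```python
import io
import xml.etree.ElementTree as ET

# Made-up <page> block for illustration.
PAGE_XML = """<page>
  <title>Foo</title>
  <ns>0</ns>
  <id>1</id>
  <revision><id>10</id></revision>
</page>"""


def as_text_file(page_xml):
    """Hypothetical helper: wrap a str in StringIO, pass file objects through."""
    if isinstance(page_xml, str):
        return io.StringIO(page_xml)
    return page_xml


def page_title(page_xml):
    """Parse a <page> block given as str or file and return its title."""
    root = ET.parse(as_text_file(page_xml)).getroot()
    return root.findtext("title")


title_from_str = page_title(PAGE_XML)
title_from_file = page_title(io.StringIO(PAGE_XML))
print(title_from_str, title_from_file)
```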

class mwxml.SiteInfo(*args, **kwargs)

Represents the data from the <siteinfo> block of a MediaWiki XML dump.

name = The name of the site. : str | None
dbname = The database name of the site. : str | None
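base = The URL of the site's main page. : str | None
generator = The MediaWiki version string that generated the dump. : str | None
case = The title capitalization rule of the site (e.g. "first-letter"). : str | None
namespaces = The namespaces defined on the site. : list(mwxml.Namespace) | None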
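To show where these fields come from, here is a minimal standard-library sketch that reads a <siteinfo> block into a tuple with the same field names. SiteInfoSketch and parse_siteinfo are hypothetical names, not mwxml classes or functions, and every value in the sample XML is made up. Missing elements come back as None, mirroring the "| None" in the field types above.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Hypothetical stand-in for illustration only; not mwxml.SiteInfo.
SiteInfoSketch = namedtuple(
    "SiteInfoSketch",
    ["name", "dbname", "base", "generator", "case", "namespaces"],
)

# Made-up sample values.
SITEINFO_XML = """<siteinfo>
  <sitename>Wikipedia</sitename>
  <dbname>enwiki</dbname>
  <base>https://en.wikipedia.org/wiki/Main_Page</base>
  <generator>MediaWiki 1.25wmf</generator>
  <case>first-letter</case>
  <namespaces>
    <namespace key="0" case="first-letter" />
    <namespace key="1" case="first-letter">Talk</namespace>
  </namespaces>
</siteinfo>"""


def parse_siteinfo(xml_text):
    """Read the child elements of <siteinfo>; absent ones become None."""
    root = ET.fromstring(xml_text)
    # Each namespace is reduced to an (id, name) pair in this sketch.
    namespaces = [
        (int(ns.get("key")), ns.text or "") for ns in root.iter("namespace")
    ]
    return SiteInfoSketch(
        name=root.findtext("sitename"),
        dbname=root.findtext("dbname"),
        base=root.findtext("base"),
        generator=root.findtext("generator"),
        case=root.findtext("case"),
        namespaces=namespaces or None,
    )


info = parse_siteinfo(SITEINFO_XML)
empty = parse_siteinfo("<siteinfo/>")
print(info.name, info.dbname, info.namespaces)
```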
class mwxml.Page(*args, **kwargs)

Page metadata and a Revision iterator. Instances of this class can be iterated over directly. See mwtypes.Page for a description of fields.

Example:
page = mwxml.Page( ... )

for revision in page:
    print("{0} {1}".format(revision.id, page.id))
class mwxml.LogItem(*args, **kwargs)

Log item metadata. See mwtypes.LogItem for a description of fields.

Example:
dump = mwxml.Dump( ... )

for log_item in dump.log_items:
    print("{0} {1}".format(log_item.id, log_item.type))
class Deleted(*args, **kwargs)

Represents information about the deleted/suppressed status of a log item and its associated data.

Attributes:
Deleted.action = Is the action of this log item deleted/suppressed? : bool | None
Deleted.comment = Is the comment of this log item deleted/suppressed? : bool | None
Deleted.user = Is the user of this log item deleted/suppressed? : bool | None
Deleted.restricted = Is the log item restricted? : bool | None
classmethod from_int(integer)

Constructs a Deleted using the tinyint value of the log_deleted column of the logging MariaDB table.

  • DELETED_ACTION = 1
  • DELETED_COMMENT = 2
  • DELETED_USER = 4
  • DELETED_RESTRICTED = 8
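The four constants above are bit flags, so a single integer can encode any combination of them. The sketch below shows how such a value can be unpacked into four booleans with bitwise AND. DeletedSketch and deleted_from_int are hypothetical names used for illustration; this is not the library's from_int() implementation.

```python
from collections import namedtuple

# Bit values as listed above.
DELETED_ACTION = 1
DELETED_COMMENT = 2
DELETED_USER = 4
DELETED_RESTRICTED = 8

# Hypothetical stand-in for the Deleted class.
DeletedSketch = namedtuple(
    "DeletedSketch", ["action", "comment", "user", "restricted"]
)


def deleted_from_int(integer):
    """Test each bit of the integer and report it as a boolean."""
    return DeletedSketch(
        action=bool(integer & DELETED_ACTION),
        comment=bool(integer & DELETED_COMMENT),
        user=bool(integer & DELETED_USER),
        restricted=bool(integer & DELETED_RESTRICTED),
    )


# 5 == 1 + 4, so the action and user bits are set.
status = deleted_from_int(5)
print(status)
```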
class LogItem.Page(*args, **kwargs)

Log item page information

Attributes:
Page.namespace = namespace ID : int
Page.title = title : str
class mwxml.Revision(*args, **kwargs)

Revision metadata and text. See mwtypes.Revision for a description of fields.

class Deleted(*args, **kwargs)

Represents information about the deleted/suppressed status of a revision and its associated data.

Attributes:
Deleted.text = Is the text of this revision deleted/suppressed? : bool | None
Deleted.comment = Is the comment of this revision deleted/suppressed? : bool | None
Deleted.user = Is the user of this revision deleted/suppressed? : bool | None
Deleted.restricted = Is the revision restricted? : bool | None
classmethod from_int(integer)

Constructs a Deleted using the tinyint value of the rev_deleted column of the revision MariaDB table.

  • DELETED_TEXT = 1
  • DELETED_COMMENT = 2
  • DELETED_USER = 4
  • DELETED_RESTRICTED = 8
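Going the other way, the same flag values can be combined into the integer that would be stored in rev_deleted. The sketch below models the flags with the standard library's enum.IntFlag; the class name RevDeleted is made up for this example and is not part of mwxml.

```python
from enum import IntFlag


class RevDeleted(IntFlag):
    """Hypothetical flag enum using the bit values listed above."""
    TEXT = 1
    COMMENT = 2
    USER = 4
    RESTRICTED = 8


# Combine flags with bitwise OR, then convert to the stored integer.
hidden = RevDeleted.COMMENT | RevDeleted.USER
stored = int(hidden)

# Rebuild the flag set from the integer and test membership.
decoded = RevDeleted(stored)
user_hidden = RevDeleted.USER in decoded
text_hidden = RevDeleted.TEXT in decoded

print(stored, user_hidden, text_hidden)
```

IntFlag keeps the values interchangeable with plain integers, so a decoded value can still be compared against or written back as the raw column value.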
class Revision.User(*args, **kwargs)

Contributing user metadata.

Attributes:
User.id = Contributing user's identifier : int | None
User.text = Username or IP address of the user at the time of the edit : str | None
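Because id may be None and text may hold either a username or an IP address, code that consumes these fields usually has to branch on both. The sketch below is one way to do that with the standard library's ipaddress module. UserSketch and describe are hypothetical names, and the rule that an edit with no id and an IP-shaped text is anonymous is an assumption made for this example, not something this page states.

```python
import ipaddress
from collections import namedtuple

# Hypothetical stand-in with the two documented fields.
UserSketch = namedtuple("UserSketch", ["id", "text"])


def looks_like_ip(text):
    """Return True if text parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def describe(user):
    """Classify a user; the anonymous rule is an assumption of this sketch."""
    if user.id is not None:
        return "registered"
    if user.text is not None and looks_like_ip(user.text):
        return "anonymous"
    return "unknown"


for user in (
    UserSketch(42, "Alice"),
    UserSketch(None, "192.0.2.1"),
    UserSketch(None, None),
):
    print(user, describe(user))
```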
class mwxml.Namespace(*args, **kwargs)

See mwtypes.Namespace for a description of fields.