Package genshi :: Module input :: Class HTMLParser

Class HTMLParser

markupbase.ParserBase --+    
                        |    
    HTMLParser.HTMLParser --+
                            |
                   object --+
                            |
                           HTMLParser

Parser for HTML input based on the Python HTMLParser module.

This class provides the same interface for generating stream events as XMLParser, and attempts to automatically balance tags.

The parsing is initiated by iterating over the parser object:

>>> parser = HTMLParser(StringIO('<UL compact><LI>Foo</UL>'))
>>> for kind, data, pos in parser:
...     print('%s %s' % (kind, data))
START (QName('ul'), Attrs([(QName('compact'), u'compact')]))
START (QName('li'), Attrs())
TEXT Foo
END li
END ul
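
The output above shows an END event for `li` even though the input never closes that tag. As a rough illustration of how this kind of tag balancing can be layered on top of a standard HTML parser, here is a minimal sketch that uses the Python 3 `html.parser` module. It is not Genshi's implementation: `BalancingParser` and `balanced_events` are hypothetical names, and the events are plain tuples rather than Genshi's `QName`/`Attrs` objects.

```python
from html.parser import HTMLParser as StdHTMLParser


class BalancingParser(StdHTMLParser):
    """Hypothetical sketch: collect (kind, data) events and track open tags."""

    def __init__(self):
        super().__init__()
        self.events = []     # collected (kind, data) tuples
        self.open_tags = []  # stack of tag names that are still open

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        self.events.append(("START", (tag, attrs)))

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag without a matching start tag: ignore it
        # Close every tag opened after the matching start tag, innermost first.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.events.append(("END", open_tag))
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.events.append(("TEXT", data))


def balanced_events(html_text):
    """Parse html_text and return a balanced list of (kind, data) events."""
    parser = BalancingParser()
    parser.feed(html_text)
    parser.close()
    # Close anything that is still open at the end of the input.
    while parser.open_tags:
        parser.events.append(("END", parser.open_tags.pop()))
    return parser.events
```

Calling `balanced_events('<UL compact><LI>Foo</UL>')` yields the same sequence of event kinds as the session above: two START events, one TEXT event, then END events for `li` and `ul`.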

Instance Methods
 
__init__(self, source, filename=None, encoding='utf-8')
Initialize the parser for the given HTML input.
 
parse(self)
Generator that parses the HTML source, yielding markup events.
 
__iter__(self)
 
handle_starttag(self, tag, attrib)
 
handle_endtag(self, tag)
 
handle_data(self, text)
 
handle_charref(self, name)
 
handle_entityref(self, name)
 
handle_pi(self, data)
 
handle_comment(self, text)

Inherited from HTMLParser.HTMLParser: check_for_whole_start_tag, clear_cdata_mode, close, error, feed, get_starttag_text, goahead, handle_decl, handle_startendtag, parse_endtag, parse_pi, parse_starttag, reset, set_cdata_mode, unescape, unknown_decl

Inherited from markupbase.ParserBase: getpos, parse_comment, parse_declaration, parse_marked_section, updatepos

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Class Variables

Inherited from HTMLParser.HTMLParser: CDATA_CONTENT_ELEMENTS, entitydefs

Properties

Inherited from object: __class__

Method Details

__init__(self, source, filename=None, encoding='utf-8')
(Constructor)

 
Initialize the parser for the given HTML input.
Parameters:
  • source - the HTML text as a file-like object
  • filename - the name of the file, if known
  • encoding - encoding of the file; ignored if the input is unicode
Overrides: object.__init__
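
The note that the encoding is ignored for unicode input can be pictured with a small sketch. It assumes the source is a file-like object whose `read()` returns either bytes or text. `read_as_text` is a hypothetical helper, not part of Genshi, and the `'replace'` error policy is an assumption made for this example only.

```python
import io


def read_as_text(source, encoding="utf-8"):
    """Hypothetical helper: return the content of a file-like object as str."""
    data = source.read()
    if isinstance(data, bytes):
        # Byte input: decode with the given encoding. Replacing undecodable
        # bytes is an assumption of this sketch, not documented behavior.
        return data.decode(encoding, "replace")
    # Text input is already decoded, so the encoding argument is ignored.
    return data
```

For example, `read_as_text(io.BytesIO(b'caf\xe9'), 'latin-1')` and `read_as_text(io.StringIO('café'), 'latin-1')` both return `'café'`; in the second call the encoding plays no role.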

parse(self)

 
Generator that parses the HTML source, yielding markup events.
Returns:
a markup event stream
Raises:
  • ParseError - if the HTML text is not well formed
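
To illustrate what a generator that yields markup events can look like, the following sketch reads a file-like source in chunks, feeds each chunk to a standard-library parser, and yields the queued events lazily. It is only an approximation under stated assumptions: `iter_events`, `_QueueingParser`, and the `bufsize` parameter are hypothetical names, the events are simple `(kind, data, pos)` tuples, and the real method additionally balances tags and raises `ParseError` on malformed input.

```python
import io
from html.parser import HTMLParser as StdHTMLParser


class _QueueingParser(StdHTMLParser):
    """Hypothetical helper: queue one (kind, data, pos) tuple per callback."""

    def __init__(self):
        super().__init__()
        self.queue = []

    def handle_starttag(self, tag, attrs):
        self.queue.append(("START", tag, self.getpos()))

    def handle_endtag(self, tag):
        self.queue.append(("END", tag, self.getpos()))

    def handle_data(self, data):
        self.queue.append(("TEXT", data, self.getpos()))


def iter_events(source, bufsize=4096):
    """Yield (kind, data, pos) tuples while reading source chunk by chunk."""
    parser = _QueueingParser()
    while True:
        chunk = source.read(bufsize)
        if chunk:
            parser.feed(chunk)
        else:
            parser.close()  # flush whatever the parser still buffers
        # Hand out everything the parser produced for this chunk.
        while parser.queue:
            yield parser.queue.pop(0)
        if not chunk:
            break
```

A caller consumes it like the parser object itself, for example `for kind, data, pos in iter_events(io.StringIO('<p>Hi</p>')): ...`. With this sketch, text that spans a chunk boundary may arrive as more than one TEXT event.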

handle_starttag(self, tag, attrib)

 
Overrides: HTMLParser.HTMLParser.handle_starttag

handle_endtag(self, tag)

 
Overrides: HTMLParser.HTMLParser.handle_endtag

handle_data(self, text)

 
Overrides: HTMLParser.HTMLParser.handle_data

handle_charref(self, name)

 
Overrides: HTMLParser.HTMLParser.handle_charref

handle_entityref(self, name)

 
Overrides: HTMLParser.HTMLParser.handle_entityref
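
The two reference handlers receive only the reference name, for example `'65'` or `'x41'` for a character reference and `'amp'` for an entity reference. One plausible way to turn such names into text is sketched below. Both function names are hypothetical, and leaving unknown entities as literal `&name;` text is an assumption of this sketch, not documented behavior of the class.

```python
from html.entities import name2codepoint


def charref_to_text(name):
    """Hypothetical helper: resolve the name of a reference like &#65; or &#x41;."""
    if name.lower().startswith("x"):
        return chr(int(name[1:], 16))  # hexadecimal form, e.g. 'x41'
    return chr(int(name))              # decimal form, e.g. '65'


def entityref_to_text(name):
    """Hypothetical helper: resolve the name of a reference like &amp;."""
    try:
        return chr(name2codepoint[name])
    except KeyError:
        # Unknown entity: keep the reference as literal text (assumption).
        return "&%s;" % name
```

With these helpers, `charref_to_text('x41')` and `charref_to_text('65')` both return `'A'`, and `entityref_to_text('amp')` returns `'&'`.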

handle_pi(self, data)

 
Overrides: HTMLParser.HTMLParser.handle_pi

handle_comment(self, text)

 
Overrides: HTMLParser.HTMLParser.handle_comment