builds Package

BeautifulSoup Module

Beautiful Soup Elixir and Tonic “The Screen-Scraper’s Friend” v3.0.0 http://www.crummy.com/software/BeautifulSoup/

Beautiful Soup parses a (possibly invalid) XML or HTML document into a tree representation. It provides methods and Pythonic idioms that make it easy to navigate, search, and modify the tree.

A well-formed XML/HTML document yields a well-formed data structure. An ill-formed XML/HTML document yields a correspondingly ill-formed data structure. If your document is only locally well-formed, you can use this library to find and process the well-formed part of it.

Beautiful Soup works with Python 2.2 and up. It has no external dependencies, but you’ll have more success at converting data to UTF-8 if you also install these three packages:

Beautiful Soup defines classes for two main parsing strategies:

  • BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific language that kind of looks like XML.
  • BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid or invalid. This class has web browser-like heuristics for obtaining a sensible parse tree in the face of common HTML errors.

Beautiful Soup also defines a class (UnicodeDammit) for autodetecting the encoding of an HTML or XML document, and converting it to Unicode. Much of this code is taken from Mark Pilgrim’s Universal Feed Parser.

For more than you ever wanted to know about Beautiful Soup, see the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html

class cbtestlib.builds.BeautifulSoup.BeautifulSOAP(markup='', parseOnlyThese=None, fromEncoding=None, markupMassage=True, smartQuotesTo='xml', convertEntities=None, selfClosingTags=None)

Bases: cbtestlib.builds.BeautifulSoup.BeautifulStoneSoup

This class will push a tag with only a single string child into the tag’s parent as an attribute. The attribute’s name is the tag name, and the value is the string child. An example should give the flavor of the change:

<foo><bar>baz</bar></foo>
=>

<foo bar="baz"><bar>baz</bar></foo>

You can then access fooTag['bar'] instead of fooTag.barTag.string.

This is, of course, useful for scraping structures that tend to use subelements instead of attributes, such as SOAP messages. Note that it modifies its input, so don’t print the modified version out.
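The transformation described above can be approximated with the standard library alone. The following is a minimal sketch that uses xml.etree.ElementTree rather than BeautifulSOAP itself; the helper name promote_string_children is hypothetical, and leaving an already existing attribute untouched is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def promote_string_children(root):
    """Copy each child that holds only text into an attribute on its parent.

    Hypothetical helper: it mimics the effect described for BeautifulSOAP
    and is not the library's implementation.
    """
    for parent in root.iter():
        for child in parent:
            has_only_text = len(child) == 0 and child.text is not None
            # Assumption: an attribute that already exists is not overwritten.
            if has_only_text and child.tag not in parent.attrib:
                parent.set(child.tag, child.text)
    return root


root = promote_string_children(ET.fromstring("<foo><bar>baz</bar></foo>"))
print(root.attrib["bar"])     # the value is reachable as an attribute
print(root.find("bar").text)  # the original child element is still present
```

As in the class description, the tree is modified in place, so serializing it afterwards would show the duplicated value.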

I’m not sure how many people really want to use this class; let me know if you do. Mainly I like the name.

popTag()
class cbtestlib.builds.BeautifulSoup.BeautifulSoup(*args, **kwargs)

Bases: cbtestlib.builds.BeautifulSoup.BeautifulStoneSoup

This parser knows the following facts about HTML:

  • Some tags have no closing tag and should be interpreted as being closed as soon as they are encountered.

  • The text inside some tags (e.g. ‘script’) may contain tags which are not really part of the document and which should be parsed as text, not tags. If you want to parse the text as tags, you can always fetch it and parse it explicitly.

  • Tag nesting rules:

    Most tags can’t be nested at all. For instance, the occurrence of a <p> tag should implicitly close the previous <p> tag.

    <p>Para1<p>Para2

    should be transformed into:

    <p>Para1</p><p>Para2

    Some tags can be nested arbitrarily. For instance, the occurrence of a <blockquote> tag should _not_ implicitly close the previous <blockquote> tag.

    Alice said: <blockquote>Bob said: <blockquote>Blah

    should NOT be transformed into:

    Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

    Some tags can be nested, but the nesting is reset by the interposition of other tags. For instance, a <tr> tag should implicitly close the previous <tr> tag within the same <table>, but not close a <tr> tag in another table.

    <table><tr>Blah<tr>Blah

    should be transformed into:

    <table><tr>Blah</tr><tr>Blah

    but,

    <tr>Blah<table><tr>Blah

    should NOT be transformed into

    <tr>Blah<table></tr><tr>Blah

Differing assumptions about tag nesting rules are a major source of problems with the BeautifulSoup class. If BeautifulSoup is not treating as nestable a tag your page author treats as nestable, try ICantBelieveItsBeautifulSoup, MinimalSoup, or BeautifulStoneSoup before writing your own subclass.
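The three nesting rules above can be modelled with a stack of open tag names. This is a simplified sketch and not the parser's own code: the function open_tag and the small NESTABLE map are hypothetical, and the map only borrows the shape of NESTABLE_TAGS (tag name mapped to the parent tags that reset its nesting). It ignores end tags and the separate RESET_NESTING_TAGS map.

```python
# Illustrative subset: tag name -> parent tags that reset its nesting.
# Tags missing from the map are treated as not nestable.
NESTABLE = {"blockquote": [], "table": [], "tr": ["table"]}


def open_tag(stack, name, nestable=NESTABLE):
    """Push `name` on `stack` and return the tags that were implicitly closed."""
    triggers = nestable.get(name)  # None means "not nestable"
    closed = []
    # Walk from the innermost open tag outwards.
    for depth in range(len(stack) - 1, -1, -1):
        current = stack[depth]
        if triggers is None and current == name:
            # Not nestable: close back to and including the earlier tag.
            closed = stack[depth:][::-1]
            del stack[depth:]
            break
        if triggers is not None and current in triggers:
            # Nesting is reset here: close back to, but not including, this tag.
            closed = stack[depth + 1:][::-1]
            del stack[depth + 1:]
            break
    stack.append(name)
    return closed


paragraphs = ["p"]
print(open_tag(paragraphs, "p"))       # the earlier <p> is closed

rows = ["table", "tr"]
print(open_tag(rows, "tr"))            # the earlier <tr> in the same table is closed

inner_rows = ["tr", "table"]
print(open_tag(inner_rows, "tr"))      # the outer <tr> stays open
```

Adding a tag to the map with an empty list makes it freely nestable in this model, which is the kind of change the alternative parser classes make through their own NESTABLE_TAGS.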

CHARSET_RE = <_sre.SRE_Pattern object at 0x21d6490>
NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
NESTABLE_INLINE_TAGS = ['span', 'font', 'q', 'object', 'bdo', 'sub', 'sup', 'center']
NESTABLE_LIST_TAGS = {'dl': [], 'ol': [], 'dd': ['dl'], 'li': ['ul', 'ol'], 'ul': [], 'dt': ['dl']}
NESTABLE_TABLE_TAGS = {'tr': ['table', 'tbody', 'tfoot', 'thead'], 'tbody': ['table'], 'tfoot': ['table'], 'th': ['tr'], 'table': [], 'td': ['tr'], 'thead': ['table']}
NESTABLE_TAGS = {'ins': [], 'table': [], 'font': [], 'span': [], 'sub': [], 'bdo': [], 'tr': ['table', 'tbody', 'tfoot', 'thead'], 'tbody': ['table'], 'li': ['ul', 'ol'], 'tfoot': ['table'], 'th': ['tr'], 'sup': [], 'td': ['tr'], 'thead': ['table'], 'dl': [], 'blockquote': [], 'fieldset': [], 'dd': ['dl'], 'object': [], 'dt': ['dl'], 'ol': [], 'center': [], 'q': [], 'ul': [], 'del': [], 'div': []}
NON_NESTABLE_BLOCK_TAGS = ['address', 'form', 'p', 'pre']
QUOTE_TAGS = {'script': None}
RESET_NESTING_TAGS = {'pre': None, 'ins': None, 'table': [], 'noscript': None, 'p': None, 'tr': ['table', 'tbody', 'tfoot', 'thead'], 'tbody': ['table'], 'li': ['ul', 'ol'], 'tfoot': ['table'], 'th': ['tr'], 'td': ['tr'], 'thead': ['table'], 'dl': [], 'blockquote': None, 'fieldset': None, 'form': None, 'dd': ['dl'], 'address': None, 'dt': ['dl'], 'ol': [], 'ul': [], 'del': None, 'div': None}
SELF_CLOSING_TAGS = {'img': None, 'hr': None, 'frame': None, 'spacer': None, 'meta': None, 'link': None, 'br': None, 'input': None, 'base': None}
start_meta(attrs)

Beautiful Soup can detect a charset included in a META tag, try to convert the document to that charset, and re-parse the document from the beginning.
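As a rough illustration of the detection step, the sketch below pulls a charset out of the content value of a META tag and uses it to decode bytes. The pattern and the helper charset_from_content are assumptions made for this example; the page shows only the repr of CHARSET_RE, not its source.

```python
import re

# Assumed pattern for illustration; not copied from the module.
CHARSET_PATTERN = re.compile(r"(?:^|;)\s*charset=([^;]*)", re.IGNORECASE)


def charset_from_content(content):
    """Return the charset named in a META content value, or None."""
    match = CHARSET_PATTERN.search(content)
    return match.group(1).strip() if match else None


declared = charset_from_content("text/html; charset=ISO-8859-1")
print(declared)

# Once a charset is known, the raw bytes can be decoded with it.
text = b"caf\xe9".decode(declared)
print(text)
```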

class cbtestlib.builds.BeautifulSoup.BeautifulStoneSoup(markup='', parseOnlyThese=None, fromEncoding=None, markupMassage=True, smartQuotesTo='xml', convertEntities=None, selfClosingTags=None)

Bases: cbtestlib.builds.BeautifulSoup.Tag, sgmllib.SGMLParser

This class contains the basic parser and search code. It defines a parser that knows nothing about tag behavior except for the following:

You can’t close a tag without closing all the tags it encloses. That is, “<foo><bar></foo>” actually means “<foo><bar></bar></foo>”.

[Another possible explanation is “<foo><bar /></foo>”, but since this class defines no SELF_CLOSING_TAGS, it will never use that explanation.]

This class is useful for parsing XML or made-up markup languages, or when BeautifulSoup makes an assumption counter to what you were expecting.
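The single rule this parser applies can be pictured as popping a stack of open tags. The helper close_tag below is a hypothetical sketch of that idea, not the class's popTag method; ignoring an end tag that has no matching open tag is an assumption.

```python
def close_tag(stack, name):
    """Close `name` and every tag opened inside it; return what was closed."""
    if name not in stack:
        return []  # assumption: a stray end tag is ignored
    closed = []
    while stack:
        top = stack.pop()
        closed.append(top)
        if top == name:
            break
    return closed


# "<foo><bar></foo>": closing foo also closes the still-open bar.
open_tags = ["foo", "bar"]
print(close_tag(open_tags, "foo"))
```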

HTML_ENTITIES = 'html'
MARKUP_MASSAGE = [(<_sre.SRE_Pattern object at 0x1dd96f0>, <function <lambda> at 0x21a42a8>), (<_sre.SRE_Pattern object at 0x1f32320>, <function <lambda> at 0x21a4320>)]
NESTABLE_TAGS = {}
QUOTE_TAGS = {}
RESET_NESTING_TAGS = {}
ROOT_TAG_NAME = u'[document]'
SELF_CLOSING_TAGS = {}
XML_ENTITIES = 'xml'
XML_ENTITY_LIST = {'amp': True, 'lt': True, 'gt': True, 'apos': True, 'quot': True}
endData(containerClass=<class 'cbtestlib.builds.BeautifulSoup.NavigableString'>)
handle_charref(ref)

Handle character references as data.

handle_comment(text)

Handle comments as Comment objects.

handle_data(data)
handle_decl(data)

Handle DOCTYPEs and the like as Declaration objects.

handle_entityref(ref)

Handle entity references as data, possibly converting known HTML entity references to the corresponding Unicode characters.

handle_pi(text)

Handle a processing instruction as a ProcessingInstruction object, possibly one with a %SOUP-ENCODING% slot into which an encoding will be plugged later.

i = 'gt'
isSelfClosingTag(name)

Returns true iff the given string is the name of a self-closing tag according to this parser.

parse_declaration(i)

Treat a bogus SGML declaration as raw data. Treat a CDATA declaration as a CData object.

popTag()
pushTag(tag)
reset()
unknown_endtag(name)
unknown_starttag(name, attrs, selfClosing=0)
class cbtestlib.builds.BeautifulSoup.CData

Bases: cbtestlib.builds.BeautifulSoup.NavigableString

class cbtestlib.builds.BeautifulSoup.Comment

Bases: cbtestlib.builds.BeautifulSoup.NavigableString

class cbtestlib.builds.BeautifulSoup.Declaration

Bases: cbtestlib.builds.BeautifulSoup.NavigableString

class cbtestlib.builds.BeautifulSoup.ICantBelieveItsBeautifulSoup(*args, **kwargs)

Bases: cbtestlib.builds.BeautifulSoup.BeautifulSoup

The BeautifulSoup class is oriented towards skipping over common HTML errors like unclosed tags. However, sometimes it makes errors of its own. For instance, consider this fragment:

<b>Foo<b>Bar</b></b>

This is perfectly valid (if bizarre) HTML. However, the BeautifulSoup class will implicitly close the first b tag when it encounters the second ‘b’. It will think the author wrote “<b>Foo<b>Bar”, and didn’t close the first ‘b’ tag, because there’s no real-world reason to bold something that’s already bold. When it encounters ‘</b></b>’ it will close two more ‘b’ tags, for a grand total of three tags closed instead of two. This can throw off the rest of your document structure. The same is true of a number of other tags, listed below.

It’s much more common for someone to forget to close a ‘b’ tag than to actually use nested ‘b’ tags, and the BeautifulSoup class handles the common case. This class handles the not-so-common case: where you can’t believe someone wrote what they did, but it’s valid HTML and BeautifulSoup screwed up by assuming it wouldn’t be.

I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ['noscript']
I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = ['em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym', 'strong', 'cite', 'code', 'dfn', 'kbd', 'samp', 'strong', 'var', 'b', 'big']
NESTABLE_TAGS = {'em': [], 'code': [], 'kbd': [], 'ins': [], 'table': [], 'font': [], 'noscript': [], 'span': [], 'sub': [], 'bdo': [], 'tr': ['table', 'tbody', 'tfoot', 'thead'], 'tbody': ['table'], 'li': ['ul', 'ol'], 'dfn': [], 'tfoot': ['table'], 'th': ['tr'], 'sup': [], 'var': [], 'td': ['tr'], 'samp': [], 'cite': [], 'thead': ['table'], 'dl': [], 'blockquote': [], 'fieldset': [], 'acronym': [], 'big': [], 'dd': ['dl'], 'object': [], 'b': [], 'abbr': [], 'dt': ['dl'], 'strong': [], 'ol': [], 'center': [], 'i': [], 'q': [], 'ul': [], 'del': [], 'small': [], 'div': [], 'tt': []}
class cbtestlib.builds.BeautifulSoup.MinimalSoup(*args, **kwargs)

Bases: cbtestlib.builds.BeautifulSoup.BeautifulSoup

The MinimalSoup class is for parsing HTML that contains pathologically bad markup. It makes no assumptions about tag nesting, but it does know which tags are self-closing, that <script> tags contain Javascript and should not be parsed, that META tags may contain encoding information, and so on.

This also makes it better for subclassing than BeautifulStoneSoup or BeautifulSoup.

NESTABLE_TAGS = {}
RESET_NESTING_TAGS = {}
class cbtestlib.builds.BeautifulSoup.NavigableString

Bases: unicode, cbtestlib.builds.BeautifulSoup.PageElement

class cbtestlib.builds.BeautifulSoup.PageElement

Contains the navigational information for some part of the page (either a tag or a piece of text).

extract()

Destructively rips this element out of the tree.

findAllNext(name=None, attrs={}, text=None, limit=None, **kwargs)

Returns all items that match the given criteria and appear after this Tag in the document.

findAllPrevious(name=None, attrs={}, text=None, limit=None, **kwargs)

Returns all items that match the given criteria and appear before this Tag in the document.

findNext(name=None, attrs={}, text=None, **kwargs)

Returns the first item that matches the given criteria and appears after this Tag in the document.

findNextSibling(name=None, attrs={}, text=None, **kwargs)

Returns the closest sibling to this Tag that matches the given criteria and appears after this Tag in the document.

findNextSiblings(name=None, attrs={}, text=None, limit=None, **kwargs)

Returns the siblings of this Tag that match the given criteria and appear after this Tag in the document.

findParent(name=None, attrs={}, **kwargs)

Returns the closest parent of this Tag that matches the given criteria.

findParents(name=None, attrs={}, limit=None, **kwargs)

Returns the parents of this Tag that match the given criteria.

findPrevious(name=None, attrs={}, text=None, **kwargs)

Returns the first item that matches the given criteria and appears before this Tag in the document.

findPreviousSibling(name=None, attrs={}, text=None, **kwargs)

Returns the closest sibling to this Tag that matches the given criteria and appears before this Tag in the document.

findPreviousSiblings(name=None, attrs={}, text=None, limit=None, **kwargs)

Returns the siblings of this Tag that match the given criteria and appear before this Tag in the document.

insert(position, newChild)
nextGenerator()
nextSiblingGenerator()
parentGenerator()
previousGenerator()
previousSiblingGenerator()
replaceWith(replaceWith)
setup(parent=None, previous=None)

Sets up the initial relations between this element and other elements.

substituteEncoding(str, encoding=None)
toEncoding(s, encoding=None)

Encodes an object to a string in some encoding, or to Unicode.

class cbtestlib.builds.BeautifulSoup.ProcessingInstruction

Bases: cbtestlib.builds.BeautifulSoup.NavigableString

class cbtestlib.builds.BeautifulSoup.ResultSet(source)

Bases: list

A ResultSet is just a list that keeps track of the SoupStrainer that created it.

class cbtestlib.builds.BeautifulSoup.RobustHTMLParser(*args, **kwargs)

Bases: cbtestlib.builds.BeautifulSoup.BeautifulSoup

class cbtestlib.builds.BeautifulSoup.RobustInsanelyWackAssHTMLParser(*args, **kwargs)

Bases: cbtestlib.builds.BeautifulSoup.MinimalSoup

class cbtestlib.builds.BeautifulSoup.RobustWackAssHTMLParser(*args, **kwargs)

Bases: cbtestlib.builds.BeautifulSoup.ICantBelieveItsBeautifulSoup

class cbtestlib.builds.BeautifulSoup.RobustXMLParser(markup='', parseOnlyThese=None, fromEncoding=None, markupMassage=True, smartQuotesTo='xml', convertEntities=None, selfClosingTags=None)

Bases: cbtestlib.builds.BeautifulSoup.BeautifulStoneSoup

class cbtestlib.builds.BeautifulSoup.SimplifyingSOAPParser(markup='', parseOnlyThese=None, fromEncoding=None, markupMassage=True, smartQuotesTo='xml', convertEntities=None, selfClosingTags=None)

Bases: cbtestlib.builds.BeautifulSoup.BeautifulSOAP

class cbtestlib.builds.BeautifulSoup.SoupStrainer(name=None, attrs={}, text=None, **kwargs)

Encapsulates a number of ways of matching a markup element (tag or text).

search(markup)
searchTag(markupName=None, markupAttrs={})
exception cbtestlib.builds.BeautifulSoup.StopParsing

Bases: exceptions.Exception

class cbtestlib.builds.BeautifulSoup.Tag(parser, name, attrs=None, parent=None, previous=None)

Bases: cbtestlib.builds.BeautifulSoup.PageElement

Represents a found HTML tag with its attributes and contents.

append(tag)

Appends the given tag to the contents of this tag.

childGenerator()
find(name=None, attrs={}, recursive=True, text=None, **kwargs)

Return only the first child of this Tag matching the given criteria.

findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

Extracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have.

The value of a key-value pair in the ‘attrs’ map can be a string, a list of strings, a regular expression object, or a callable that takes a string and returns whether or not the string matches for some custom definition of ‘matches’. The same is true of the tag name.
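The four kinds of criteria can be pictured as one dispatch function. This is a sketch under assumptions, not the library's matching code: the name matches is hypothetical, and using search (rather than match) for regular expressions is an assumption.

```python
import re


def matches(value, criterion):
    """Return True if `value` satisfies a string, list, regex, or callable."""
    if value is None:
        return False  # a missing attribute never matches in this sketch
    if callable(criterion):
        return bool(criterion(value))
    if hasattr(criterion, "search"):
        # Compiled regular expression object.
        return criterion.search(value) is not None
    if isinstance(criterion, (list, tuple)):
        return value in criterion
    return value == criterion


# Attributes of one imaginary tag, checked against mixed criteria.
tag_attrs = {"class": "nav", "href": "/home"}
wanted = {"class": ["nav", "menu"], "href": re.compile(r"^/")}
print(all(matches(tag_attrs.get(key), rule) for key, rule in wanted.items()))
```

The same function could be applied to the tag name, since the description says the name accepts the same kinds of values.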

findAllChildren(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

Extracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have.

The value of a key-value pair in the ‘attrs’ map can be a string, a list of strings, a regular expression object, or a callable that takes a string and returns whether or not the string matches for some custom definition of ‘matches’. The same is true of the tag name.

findChild(name=None, attrs={}, recursive=True, text=None, **kwargs)

Return only the first child of this Tag matching the given criteria.

get(key, default=None)

Returns the value of the ‘key’ attribute for the tag, or the value given for ‘default’ if it doesn’t have that attribute.

has_key(key)
prettify(encoding='utf-8')
recursiveChildGenerator()
renderContents(encoding='utf-8', prettyPrint=False, indentLevel=0)

Renders the contents of this tag as a string in the given encoding. If encoding is None, returns a Unicode string.

class cbtestlib.builds.BeautifulSoup.UnicodeDammit(markup, overrideEncodings=[], smartQuotesTo='xml')

A class for detecting the encoding of a *ML document and converting it to a Unicode string. If the source encoding is windows-1252, can replace MS smart quotes with their HTML or XML equivalents.
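The smart-quote replacement can be sketched with two entries taken from the MS_CHARS table. The function replace_smart_quotes is hypothetical, and the mapping of 'xml' to numeric character references and of any other value to named entities is an assumption about how smartQuotesTo is interpreted. Keys are bytes here because the sketch targets Python 3.

```python
# Two entries from the MS_CHARS table: byte -> (entity name, hex code point).
MS_CHARS_SUBSET = {
    b"\x93": ("ldquo", "201C"),
    b"\x94": ("rdquo", "201D"),
}


def replace_smart_quotes(data, smart_quotes_to="xml"):
    """Replace windows-1252 smart-quote bytes with ASCII entity references."""
    for byte, (name, codepoint) in MS_CHARS_SUBSET.items():
        if smart_quotes_to == "xml":
            entity = "&#x%s;" % codepoint  # numeric reference, valid in XML
        else:
            entity = "&%s;" % name         # named entity, HTML only
        data = data.replace(byte, entity.encode("ascii"))
    return data


cleaned = replace_smart_quotes(b"\x93quoted\x94")
print(cleaned.decode("ascii"))
```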

CHARSET_ALIASES = {'x-sjis': 'shift-jis', 'macintosh': 'mac-roman'}
EBCDIC_TO_ASCII_MAP = None
MS_CHARS = {'\x81': ' ', '\x80': ('euro', '20AC'), '\x83': ('fnof', '192'), '\x82': ('sbquo', '201A'), '\x85': ('hellip', '2026'), '\x84': ('bdquo', '201E'), '\x87': ('Dagger', '2021'), '\x86': ('dagger', '2020'), '\x89': ('permil', '2030'), '\x88': ('circ', '2C6'), '\x8b': ('lsaquo', '2039'), '\x8a': ('Scaron', '160'), '\x8d': '?', '\x8c': ('OElig', '152'), '\x8f': '?', '\x8e': ('#x17D', '17D'), '\x91': ('lsquo', '2018'), '\x90': '?', '\x93': ('ldquo', '201C'), '\x92': ('rsquo', '2019'), '\x95': ('bull', '2022'), '\x94': ('rdquo', '201D'), '\x97': ('mdash', '2014'), '\x96': ('ndash', '2013'), '\x99': ('trade', '2122'), '\x98': ('tilde', '2DC'), '\x9b': ('rsaquo', '203A'), '\x9a': ('scaron', '161'), '\x9d': '?', '\x9c': ('oelig', '153'), '\x9f': ('Yuml', ''), '\x9e': ('#x17E', '17E')}
find_codec(charset)
cbtestlib.builds.BeautifulSoup.buildTagMap(default, *args)

Turns a list of maps, lists, or scalars into a single map. Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and RESET_NESTING_TAGS maps out of lists and partial maps.
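A minimal re-implementation of the idea, for illustration only: build_tag_map is a hypothetical stand-in written for Python 3, and its handling of each argument type follows the description above rather than the module's source.

```python
def build_tag_map(default, *portions):
    """Merge maps, lists, and scalars into one dict of tag name -> value."""
    built = {}
    for portion in portions:
        if hasattr(portion, "items"):
            # A partial map contributes its own values.
            built.update(portion)
        elif isinstance(portion, (list, tuple)):
            # Every name in a list gets the default value.
            for key in portion:
                built[key] = default
        else:
            # A single scalar name gets the default value.
            built[portion] = default
    return built


tag_map = build_tag_map(None, ["br", "hr"], {"tr": ["table"]}, "img")
print(tag_map)
```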

cbtestlib.builds.BeautifulSoup.isList(l)

Convenience method that works with all 2.x versions of Python to determine whether or not something is listlike.

cbtestlib.builds.BeautifulSoup.isString(s)

Convenience method that works with all 2.x versions of Python to determine whether or not something is stringlike.

build_query Module

class cbtestlib.builds.build_query.BuildQuery

Bases: object

create_build_info(build_id, build_decription)
create_change_info(build_id, build_decription)
find_build(builds, product, type, arch, version, toy='')
find_membase_build(builds, product, deliverable_type, os_architecture, build_version, is_amazon=False)
find_membase_build_with_version(builds, build_version)
find_membase_release_build(product, deliverable_type, os_architecture, build_version, is_amazon=False)
get_all_builds()
get_latest_builds()
get_sustaining_latest_builds()
parse_builds()
sort_builds_by_time(builds)
sort_builds_by_version(builds)
class cbtestlib.builds.build_query.MembaseBuild

Bases: object

class cbtestlib.builds.build_query.MembaseChange

Bases: object
