Beautiful Soup Elixir and Tonic “The Screen-Scraper’s Friend” v3.0.0 http://www.crummy.com/software/BeautifulSoup/
Beautiful Soup parses a (possibly invalid) XML or HTML document into a tree representation. It provides methods and Pythonic idioms that make it easy to navigate, search, and modify the tree.
A well-formed XML/HTML document yields a well-formed data structure. An ill-formed XML/HTML document yields a correspondingly ill-formed data structure. If your document is only locally well-formed, you can use this library to find and process the well-formed part of it.
Beautiful Soup works with Python 2.2 and up. It has no external dependencies, but you’ll have more success at converting data to UTF-8 if you also install these three packages:
Beautiful Soup defines classes for two main parsing strategies:
- BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific language that kind of looks like XML.
- BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid or invalid. This class has web browser-like heuristics for obtaining a sensible parse tree in the face of common HTML errors.
Beautiful Soup also defines a class (UnicodeDammit) for autodetecting the encoding of an HTML or XML document, and converting it to Unicode. Much of this code is taken from Mark Pilgrim’s Universal Feed Parser.
For more than you ever wanted to know about Beautiful Soup, see the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html
Bases: cbtestlib.builds.BeautifulSoup.BeautifulStoneSoup
This class will push a tag with only a single string child into the tag’s parent as an attribute. The attribute’s name is the tag name, and the value is the string child. An example should give the flavor of the change:
<foo><bar>baz</bar></foo>
=>
<foo bar="baz"><bar>baz</bar></foo>
You can then access fooTag['bar'] instead of fooTag.barTag.string.
This is, of course, useful for scraping structures that tend to use subelements instead of attributes, such as SOAP messages. Note that it modifies its input, so don’t print the modified version out.
I’m not sure how many people really want to use this class; let me know if you do. Mainly I like the name.
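The transformation described above can be sketched with the standard library alone. This is a minimal illustration, not Beautiful Soup's own code: it uses xml.etree.ElementTree in Python 3, and the helper name promote_string_children is hypothetical.

```python
import xml.etree.ElementTree as ET

def promote_string_children(element):
    """Copy each text-only child into an attribute of its parent (in place)."""
    for child in element:
        promote_string_children(child)  # handle deeper levels first
        has_only_text = len(child) == 0 and child.text is not None
        # Do not overwrite an attribute the parent already has.
        if has_only_text and child.tag not in element.attrib:
            element.set(child.tag, child.text)
    return element

foo = promote_string_children(ET.fromstring("<foo><bar>baz</bar></foo>"))
print(foo.get("bar"))
print(ET.tostring(foo, encoding="unicode"))
```

As in the class described above, the child element is kept and the tree is modified in place, so serializing the result shows the value twice.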
Bases: cbtestlib.builds.BeautifulSoup.BeautifulStoneSoup
This parser knows the following facts about HTML:
Some tags have no closing tag and should be interpreted as being closed as soon as they are encountered.
The text inside some tags (e.g. ‘script’) may contain tags which are not really part of the document and which should be parsed as text, not tags. If you want to parse the text as tags, you can always fetch it and parse it explicitly.
Tag nesting rules:
Most tags can’t be nested at all. For instance, the occurrence of a <p> tag should implicitly close the previous <p> tag.
<p>Para1<p>Para2
should be transformed into:
<p>Para1</p><p>Para2
Some tags can be nested arbitrarily. For instance, the occurrence of a <blockquote> tag should _not_ implicitly close the previous <blockquote> tag.
Alice said: <blockquote>Bob said: <blockquote>Blah
should NOT be transformed into:
Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah
Some tags can be nested, but the nesting is reset by the interposition of other tags. For instance, a <tr> tag should implicitly close the previous <tr> tag within the same <table>, but not close a <tr> tag in another table.
<table><tr>Blah<tr>Blah
should be transformed into:
<table><tr>Blah</tr><tr>Blah
but,
<tr>Blah<table><tr>Blah
should NOT be transformed into:
<tr>Blah<table></tr><tr>Blah
Differing assumptions about tag nesting rules are a major source of problems with the BeautifulSoup class. If BeautifulSoup is not treating as nestable a tag your page author treats as nestable, try ICantBelieveItsBeautifulSoup, MinimalSoup, or BeautifulStoneSoup before writing your own subclass.
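The three nesting rules above can be modeled with a small stack-based sketch. It assumes Python 3 and the standard html.parser module, and it is not the algorithm Beautiful Soup actually uses: the NESTABLE and RESET_BY tables are hypothetical and cover only the tags from the examples, and tag attributes are dropped for brevity.

```python
from html.parser import HTMLParser

# Hypothetical, heavily reduced rule tables for illustration only.
NESTABLE = {"blockquote", "table"}   # may be nested inside themselves
RESET_BY = {"tr": {"table"}}         # nesting of <tr> is reset by <table>

class ImplicitCloser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open tags
        self.out = []     # pieces of the normalized markup

    def _close_down_to(self, index):
        # Close every open tag from the top of the stack down to `index`.
        while len(self.stack) > index:
            self.out.append("</%s>" % self.stack.pop())

    def handle_starttag(self, tag, attrs):
        if tag not in NESTABLE:
            resetters = RESET_BY.get(tag, set())
            # Look upward for an open tag of the same name, but stop
            # if a tag that resets the nesting is found first.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i] in resetters:
                    break
                if self.stack[i] == tag:
                    self._close_down_to(i)
                    break
        self.stack.append(tag)
        self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            index = len(self.stack) - 1 - self.stack[::-1].index(tag)
            self._close_down_to(index)

    def handle_data(self, data):
        self.out.append(data)

def normalize(markup):
    parser = ImplicitCloser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)

print(normalize("<p>Para1<p>Para2"))
print(normalize("<table><tr>Blah<tr>Blah"))
print(normalize("<tr>Blah<table><tr>Blah"))
```

Tags left open at the end of the input stay open in the output, which matches the transformed examples above.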
Beautiful Soup can detect a charset included in a META tag, try to convert the document to that charset, and re-parse the document from the beginning.
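A rough sketch of the idea, assuming Python 3: look for a charset declaration in a META tag and decode the raw bytes with it. The regular expression and the helper name decode_with_meta_charset are illustrative assumptions; the library's own detection is more thorough.

```python
import re

# Simplified pattern: finds "charset=..." inside a <meta ...> tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def decode_with_meta_charset(raw, fallback="utf-8"):
    """Return (text, encoding_used) for a byte string of HTML."""
    match = META_CHARSET.search(raw)
    declared = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(declared), declared
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: fall back and replace bad bytes.
        return raw.decode(fallback, errors="replace"), fallback

raw = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><p>caf\xe9</p>'
text, encoding = decode_with_meta_charset(raw)
print(encoding)
print(text)
```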
Bases: cbtestlib.builds.BeautifulSoup.Tag, sgmllib.SGMLParser
This class contains the basic parser and search code. It defines a parser that knows nothing about tag behavior except for the following:
You can’t close a tag without closing all the tags it encloses. That is, “<foo><bar></foo>” actually means “<foo><bar></bar></foo>”.
[Another possible explanation is “<foo><bar /></foo>”, but since this class defines no SELF_CLOSING_TAGS, it will never use that explanation.]
This class is useful for parsing XML or made-up markup languages, or when BeautifulSoup makes an assumption counter to what you were expecting.
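The single rule this class knows (closing a tag closes everything it encloses) can be sketched as follows. This is an illustration in Python 3 built on the standard html.parser tokenizer, not the class's real implementation; the names CloseEnclosedTags and close_enclosed are hypothetical, and attributes are dropped to keep it short.

```python
from html.parser import HTMLParser

class CloseEnclosedTags(HTMLParser):
    """Close every still-open tag enclosed by a tag that is being closed."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        self.pieces.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignored in this sketch
        while True:
            closed = self.open_tags.pop()
            self.pieces.append("</%s>" % closed)
            if closed == tag:
                break

    def handle_data(self, data):
        self.pieces.append(data)

def close_enclosed(markup):
    parser = CloseEnclosedTags()
    parser.feed(markup)
    parser.close()
    return "".join(parser.pieces)

print(close_enclosed("<foo><bar></foo>"))
```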
Handle character references as data.
Handle comments as Comment objects.
Handle DOCTYPEs and the like as Declaration objects.
Handle entity references as data, possibly converting known HTML entity references to the corresponding Unicode characters.
Handle a processing instruction as a ProcessingInstruction object, possibly one with a %SOUP-ENCODING% slot into which an encoding will be plugged later.
Returns true iff the given string is the name of a self-closing tag according to this parser.
Treat a bogus SGML declaration as raw data. Treat a CDATA declaration as a CData object.
Bases: cbtestlib.builds.BeautifulSoup.BeautifulSoup
The BeautifulSoup class is oriented towards skipping over common HTML errors like unclosed tags. However, sometimes it makes errors of its own. For instance, consider this fragment:
<b>Foo<b>Bar</b></b>
This is perfectly valid (if bizarre) HTML. However, the BeautifulSoup class will implicitly close the first b tag when it encounters the second ‘b’. It will think the author wrote “<b>Foo<b>Bar”, and didn’t close the first ‘b’ tag, because there’s no real-world reason to bold something that’s already bold. When it encounters ‘</b></b>’ it will close two more ‘b’ tags, for a grand total of three tags closed instead of two. This can throw off the rest of your document structure. The same is true of a number of other tags, listed below.
It’s much more common for someone to forget to close a ‘b’ tag than to actually use nested ‘b’ tags, and the BeautifulSoup class handles the common case. This class handles the not-so-common case: where you can’t believe someone wrote what they did, but it’s valid HTML and BeautifulSoup screwed up by assuming it wouldn’t be.
Bases: cbtestlib.builds.BeautifulSoup.BeautifulSoup
The MinimalSoup class is for parsing HTML that contains pathologically bad markup. It makes no assumptions about tag nesting, but it does know which tags are self-closing, that <script> tags contain JavaScript and should not be parsed, that META tags may contain encoding information, and so on.
This also makes it better for subclassing than BeautifulStoneSoup or BeautifulSoup.
Bases: unicode, cbtestlib.builds.BeautifulSoup.PageElement
Contains the navigational information for some part of the page (either a tag or a piece of text).
Destructively rips this element out of the tree.
Returns all items that match the given criteria and appear after this Tag in the document.
Returns all items that match the given criteria and appear before this Tag in the document.
Returns the first item that matches the given criteria and appears after this Tag in the document.
Returns the closest sibling to this Tag that matches the given criteria and appears after this Tag in the document.
Returns the siblings of this Tag that match the given criteria and appear after this Tag in the document.
Returns the closest parent of this Tag that matches the given criteria.
Returns the parents of this Tag that match the given criteria.
Returns the first item that matches the given criteria and appears before this Tag in the document.
Returns the closest sibling to this Tag that matches the given criteria and appears before this Tag in the document.
Returns the siblings of this Tag that match the given criteria and appear before this Tag in the document.
Sets up the initial relations between this element and other elements.
Encodes an object to a string in some encoding, or to Unicode.
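The search methods described above all follow one pattern: walk the tree in some direction (parents, following siblings, and so on) and return either the first match or every match. A minimal sketch of that pattern, assuming Python 3; the Node class and the helpers find_first and find_all are hypothetical stand-ins, not the library's classes or method names.

```python
class Node:
    """Tiny stand-in for a page element with parent and sibling links."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def parents(self):
        # Yield ancestors from the closest one outward.
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def next_siblings(self):
        # Yield siblings that appear after this node, in document order.
        if self.parent is None:
            return
        siblings = self.parent.children
        for sibling in siblings[siblings.index(self) + 1:]:
            yield sibling

def find_all(candidates, name):
    return [node for node in candidates if node.name == name]

def find_first(candidates, name):
    return next((node for node in candidates if node.name == name), None)

root = Node("html")
body = Node("body", root)
p1 = Node("p", body)
div = Node("div", body)
p2 = Node("p", body)

print(find_first(p1.next_siblings(), "p") is p2)
print([node.name for node in p1.parents()])
```

The "closest" variants correspond to find_first and the plural variants to find_all, applied to the same generator.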
Bases: list
A ResultSet is just a list that keeps track of the SoupStrainer that created it.
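In other words, the class is a plain list with one extra attribute. A minimal sketch, assuming Python 3; the class name ResultSetSketch and the items parameter are illustrative and not taken from the library.

```python
class ResultSetSketch(list):
    """A list that remembers which search object produced it."""

    def __init__(self, source, items=()):
        super().__init__(items)
        self.source = source  # e.g. the strainer used for the search

results = ResultSetSketch("strainer-for-<p>", ["<p>one</p>", "<p>two</p>"])
print(len(results), results.source)
```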
Bases: cbtestlib.builds.BeautifulSoup.ICantBelieveItsBeautifulSoup
Encapsulates a number of ways of matching a markup element (tag or text).
Bases: exceptions.Exception
Bases: cbtestlib.builds.BeautifulSoup.PageElement
Represents a found HTML tag with its attributes and contents.
Appends the given tag to the contents of this tag.
Return only the first child of this Tag matching the given criteria.
Extracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have.
The value of a key-value pair in the ‘attrs’ map can be a string, a list of strings, a regular expression object, or a callable that takes a string and returns whether or not the string matches for some custom definition of ‘matches’. The same is true of the tag name.
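The four kinds of criteria listed above can be handled by one small dispatch function. This is a sketch in Python 3 under stated assumptions: the helper name matches is hypothetical, and the library's real matching logic handles more cases (such as None and boolean criteria).

```python
import re

def matches(value, criterion):
    """Test a tag name or attribute value against one criterion."""
    if callable(criterion):
        return bool(criterion(value))            # custom predicate
    if hasattr(criterion, "search"):
        return criterion.search(value) is not None   # compiled regex
    if isinstance(criterion, (list, tuple, set)):
        return value in criterion                # any of several strings
    return value == criterion                    # exact string

print(matches("b", "b"))
print(matches("b", ["a", "b"]))
print(matches("body", re.compile("^b")))
print(matches("a", lambda name: len(name) == 1))
```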
Returns the value of the ‘key’ attribute for the tag, or the value given for ‘default’ if it doesn’t have that attribute.
Renders the contents of this tag as a string in the given encoding. If encoding is None, returns a Unicode string.
A class for detecting the encoding of a *ML document and converting it to a Unicode string. If the source encoding is windows-1252, can replace MS smart quotes with their HTML or XML equivalents.
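The general approach can be sketched as "try candidate encodings in order, then optionally rewrite smart quotes". This is an illustration in Python 3, not the class's actual detection logic; the function names, the candidate list, and the four-entry SMART_QUOTES table are assumptions made for the example.

```python
# Hypothetical, partial table mapping smart quotes to HTML entities.
SMART_QUOTES = {
    "\u2018": "&lsquo;",
    "\u2019": "&rsquo;",
    "\u201c": "&ldquo;",
    "\u201d": "&rdquo;",
}

def to_unicode(raw, candidates=("utf-8", "windows-1252")):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode("ascii", errors="replace"), "ascii"

def replace_smart_quotes(text):
    return "".join(SMART_QUOTES.get(ch, ch) for ch in text)

text, encoding = to_unicode(b"\x93Hi\x94")
print(encoding)
print(replace_smart_quotes(text))
```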
Turns a list of maps, lists, or scalars into a single map. Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and NESTING_RESET_TAGS maps out of lists and partial maps.
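A sketch of that merging behavior, assuming Python 3: maps are copied as they are, while list entries and scalars become keys that point at a default value. The name build_tag_map and the default parameter are assumptions made for this illustration.

```python
def build_tag_map(default, *portions):
    """Merge maps, lists, and scalars into a single dictionary."""
    built = {}
    for portion in portions:
        if hasattr(portion, "items"):
            built.update(portion)            # a map: copy its pairs
        elif isinstance(portion, (list, tuple, set)):
            for key in portion:
                built[key] = default         # a list: each entry is a key
        else:
            built[portion] = default         # a scalar: a single key
    return built

tag_map = build_tag_map(None, ["br", "hr"], {"tr": ["table"]}, "img")
print(tag_map)
```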
Convenience method that works with all 2.x versions of Python to determine whether or not something is listlike.
Convenience method that works with all 2.x versions of Python to determine whether or not something is stringlike.
Bases: object