
Class HTMLSanitizer

object --+
         |
        HTMLSanitizer

A filter that removes potentially dangerous HTML tags and attributes from the stream.

>>> from genshi import HTML
>>> html = HTML('<div><script>alert(document.cookie)</script></div>')
>>> print(html | HTMLSanitizer())
<div/>

The default set of safe tags and attributes can be modified when the filter is instantiated. For example, to allow inline style attributes, the following instantiation would work:

>>> html = HTML('<div style="background: #000"></div>')
>>> sanitizer = HTMLSanitizer(safe_attrs=HTMLSanitizer.SAFE_ATTRS | set(['style']))
>>> print(html | sanitizer)
<div style="background: #000"/>

Note that even in this case, the filter does attempt to remove dangerous constructs from style attributes:

>>> html = HTML('<div style="background: url(javascript:void); color: #000"></div>')
>>> print(html | sanitizer)
<div style="color: #000"/>

This handles HTML entities, Unicode escapes in CSS and JavaScript text, as well as a lot of other things. However, the style tag is still excluded by default because it is very hard for such sanitizing to be completely safe, especially considering how much error recovery current web browsers perform.

It also does some basic filtering of CSS properties that may be used for typical phishing attacks. For more sophisticated filtering, this class provides a couple of hooks that can be overridden in sub-classes.

Warning: This special processing of CSS is currently applied only to style attributes, not to style elements.

Instance Methods

__init__(self, safe_tags=frozenset(['a', 'abbr', 'acronym', 'address', 'area', 'b', 'bi..., safe_attrs=frozenset(['abbr', 'accept', 'accept-charset', 'accesskey', 'a..., safe_schemes=frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto']), uri_attrs=frozenset(['action', 'background', 'dynsrc', 'href', 'lowsrc',...)
Create the sanitizer.

__call__(self, stream)
Apply the filter to the given stream.

is_safe_css(self, propname, value)
Determine whether the given CSS property declaration is to be considered safe for inclusion in the output.
Returns: bool

is_safe_elem(self, tag, attrs)
Determine whether the given element should be considered safe for inclusion in the output.
Returns: bool

is_safe_uri(self, uri)
Determine whether the given URI is to be considered safe for inclusion in the output.
Returns: bool

sanitize_css(self, text)
Remove potentially dangerous property declarations from CSS code.
Returns: list

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Class Variables
	SAFE_TAGS = `frozenset(['a', 'abbr', 'acronym', 'address', 'are...`
	SAFE_ATTRS = `frozenset(['abbr', 'accept', 'accept-charset', 'a...`
	SAFE_SCHEMES = `frozenset([None, 'file', 'ftp', 'http', 'https'...`
	URI_ATTRS = `frozenset(['action', 'background', 'dynsrc', 'href...`

Instance Variables
	safe_tags The set of tag names that are considered safe.
	safe_attrs The set of attribute names that are considered safe.
	uri_attrs The set of names of attributes that may contain URIs.
	safe_schemes The set of URI schemes that are considered safe.

Properties
Inherited from `object`: `__class__`

Method Details

__init__(self, safe_tags=frozenset(['a', 'abbr', 'acronym', 'address', 'area', 'b', 'bi..., safe_attrs=frozenset(['abbr', 'accept', 'accept-charset', 'accesskey', 'a..., safe_schemes=frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto']), uri_attrs=frozenset(['action', 'background', 'dynsrc', 'href', 'lowsrc',...)
(Constructor)

Create the sanitizer.

The exact set of allowed elements and attributes can be configured.

Parameters:

safe_tags - a set of tag names that are considered safe
safe_attrs - a set of attribute names that are considered safe
safe_schemes - a set of URI schemes that are considered safe
uri_attrs - a set of names of attributes that contain URIs

Overrides: object.__init__
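
Because the defaults are frozensets, a narrower or wider configuration can be derived with ordinary set operations before it is passed to the constructor. The following is a minimal sketch that only builds the sets. To stay runnable without Genshi installed, it copies the SAFE_SCHEMES and URI_ATTRS values listed on this page into local variables; the names web_only_schemes and extended_uri_attrs are hypothetical.

```python
# Local copies of the default values documented on this page. With Genshi
# available, HTMLSanitizer.SAFE_SCHEMES and HTMLSanitizer.URI_ATTRS would be
# used instead of these copies.
SAFE_SCHEMES = frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto'])
URI_ATTRS = frozenset(['action', 'background', 'dynsrc', 'href', 'lowsrc', 'src'])

# Narrow the scheme allowlist: drop local-file and FTP links.
# The None entry is left in the set unchanged.
web_only_schemes = SAFE_SCHEMES - frozenset(['file', 'ftp'])

# Widen the URI attribute set with one extra attribute (illustrative choice).
extended_uri_attrs = URI_ATTRS | frozenset(['cite'])
```

The resulting sets would then be passed as HTMLSanitizer(safe_schemes=web_only_schemes, uri_attrs=extended_uri_attrs).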

__call__(self, stream)
(Call operator)

Apply the filter to the given stream.

Parameters:

stream - the markup event stream to filter

is_safe_css(self, propname, value)

Determine whether the given CSS property declaration is to be considered safe for inclusion in the output.

Parameters:

propname - the CSS property name
value - the value of the property

Returns: bool

whether the property value should be considered safe

Since: version 0.6
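
Since this method is one of the hooks meant to be overridden, a subclass can add stricter rules and then defer to the inherited check. The sketch below shows only that override pattern. BaseSanitizer is a hypothetical stand-in so the example runs without Genshi, its accept-everything default is a placeholder and not Genshi's real behaviour, and the BLOCKED_PROPS list is an arbitrary illustration.

```python
class BaseSanitizer(object):
    """Hypothetical stand-in for HTMLSanitizer; only the hook is modelled."""

    def is_safe_css(self, propname, value):
        # Placeholder default that accepts everything.
        # This is NOT the default logic of the real class.
        return True


class StrictCSSSanitizer(BaseSanitizer):
    """Rejects a few layout properties, then defers to the base class."""

    # Illustrative list of properties to reject outright.
    BLOCKED_PROPS = frozenset(['position', 'z-index', 'float'])

    def is_safe_css(self, propname, value):
        if propname.strip().lower() in self.BLOCKED_PROPS:
            return False
        # Anything not blocked here is still subject to the inherited check.
        return super(StrictCSSSanitizer, self).is_safe_css(propname, value)
```

With Genshi installed, the same override would presumably be written against HTMLSanitizer itself in place of the stand-in.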

is_safe_elem(self, tag, attrs)

Determine whether the given element should be considered safe for inclusion in the output.

Parameters:

tag (QName) - the tag name of the element
attrs (Attrs) - the element attributes

Returns: bool

whether the element should be considered safe

Since: version 0.6

is_safe_uri(self, uri)

Determine whether the given URI is to be considered safe for inclusion in the output.

The default implementation checks whether the scheme of the URI is in the set of allowed URI schemes (safe_schemes).

>>> sanitizer = HTMLSanitizer()
>>> sanitizer.is_safe_uri('http://example.org/')
True
>>> sanitizer.is_safe_uri('javascript:alert(document.cookie)')
False

Parameters:

uri - the URI to check

Returns: bool

True if the URI can be considered safe, False otherwise

Since: version 0.4.3
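
The idea of a scheme allowlist can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: scheme_of and scheme_is_allowed are hypothetical helper names, and the set is a local copy of the SAFE_SCHEMES value listed on this page.

```python
# Local copy of the documented default; None stands for "no scheme".
SAFE_SCHEMES = frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto'])


def scheme_of(uri):
    """Return the lower-cased scheme of uri, or None if it has no colon."""
    if ':' not in uri:
        return None
    return uri.split(':', 1)[0].strip().lower()


def scheme_is_allowed(uri, safe_schemes=SAFE_SCHEMES):
    """Accept the URI only if its scheme is in the allowlist."""
    return scheme_of(uri) in safe_schemes
```

Because this is an allowlist, anything unrecognised is rejected. In this simplified version that includes a relative URI whose query string happens to contain a colon.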

sanitize_css(self, text)

Remove potentially dangerous property declarations from CSS code.

In particular, properties using the CSS url() function with a scheme that is not considered safe are removed:

>>> sanitizer = HTMLSanitizer()
>>> sanitizer.sanitize_css(u'''
...   background: url(javascript:alert("foo"));
...   color: #000;
... ''')
[u'color: #000']

Also, the proprietary Internet Explorer function expression() is always stripped:

>>> sanitizer.sanitize_css(u'''
...   background: #fff;
...   color: #000;
...   width: e/**/xpression(alert("foo"));
... ''')
[u'background: #fff', u'color: #000']

Parameters:

text - the CSS text; this is expected to be unicode and to not contain any character or numeric references

Returns: list

a list of declarations that are considered safe

Since: version 0.4.3
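
To make the url() rule concrete, here is a rough stand-alone sketch of filtering declarations by the scheme used inside url(). It is deliberately naive and is not how the library does it: it splits on semicolons, ignores comments, escapes and expression(), and the names URL_RE, scheme_of and filter_declarations are hypothetical.

```python
import re

# Local copy of the SAFE_SCHEMES value listed on this page.
SAFE_SCHEMES = frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto'])

# Captures the start of the argument of url(...), skipping an optional quote.
URL_RE = re.compile(r"""url\(\s*["']?([^"')\s]*)""", re.IGNORECASE)


def scheme_of(uri):
    """Return the lower-cased scheme of uri, or None if it has no colon."""
    if ':' not in uri:
        return None
    return uri.split(':', 1)[0].strip().lower()


def filter_declarations(text, safe_schemes=SAFE_SCHEMES):
    """Keep only declarations whose url() targets use an allowed scheme."""
    safe = []
    for decl in text.split(';'):  # naive: breaks on ';' inside values
        decl = decl.strip()
        if not decl:
            continue
        uris = URL_RE.findall(decl)
        # A declaration without url() has nothing to check and is kept.
        if all(scheme_of(uri) in safe_schemes for uri in uris):
            safe.append(decl)
    return safe
```

Like sanitize_css, the sketch returns a list of declarations rather than a reassembled string, so the caller decides how to join them.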

Class Variable Details

SAFE_TAGS

Value:

frozenset(['a',
           'abbr',
           'acronym',
           'address',
           'area',
           'b',
           'big',
           'blockquote',
...

SAFE_ATTRS

Value:

frozenset(['abbr',
           'accept',
           'accept-charset',
           'accesskey',
           'action',
           'align',
           'alt',
           'axis',
...

SAFE_SCHEMES

Value:

frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto'])

URI_ATTRS

Value:

frozenset(['action', 'background', 'dynsrc', 'href', 'lowsrc', 'src'])
