Sanitizing text fragments¶

bleach.clean() is Bleach’s HTML sanitization method.

Given a fragment of HTML, Bleach will parse it according to the HTML5 parsing algorithm and sanitize any disallowed tags or attributes. This algorithm also takes care of things like unclosed and (some) misnested tags.

Note

You may pass in a string or a unicode object, but Bleach will always return unicode.

bleach.clean(text, tags=[u'a', u'abbr', u'acronym', u'b', u'blockquote', u'code', u'em', u'i', u'li', u'ol', u'strong', u'ul'], attributes={u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}, styles=[], protocols=[u'http', u'https', u'mailto'], strip=False, strip_comments=True)[source]¶

Clean an HTML fragment of malicious content and return it

This is a security-focused function whose sole purpose is to remove malicious content from a string so that it can be displayed as content in a web page.

This function is not designed to transform content for use in non-web-page contexts.

Example:

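import bleach

# Sample untrusted input; any string from a user or third party would do
yucky_text = u'an <script>evil()</script> example'
better_text = bleach.clean(yucky_text)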

Note

If you’re cleaning a lot of text and passing the same argument values or you want more configurability, consider using a bleach.sanitizer.Cleaner instance.

Parameters:

text (str) – the text to clean
tags (list) – allowed list of tags; defaults to bleach.sanitizer.ALLOWED_TAGS
attributes (dict) – allowed attributes; can be a callable, list or dict; defaults to bleach.sanitizer.ALLOWED_ATTRIBUTES
styles (list) – allowed list of css styles; defaults to bleach.sanitizer.ALLOWED_STYLES
protocols (list) – allowed list of protocols for links; defaults to bleach.sanitizer.ALLOWED_PROTOCOLS
strip (bool) – whether or not to strip disallowed elements
strip_comments (bool) – whether or not to strip HTML comments

Returns:

cleaned text as unicode

Allowed tags (`tags`)¶

The tags kwarg specifies the allowed set of HTML tags. It should be a list, tuple, or other iterable. Any HTML tags not in this list will be escaped or stripped from the text.

For example:

>>> import bleach

>>> bleach.clean(
...     u'<b><i>an example</i></b>',
...     tags=['b'],
... )
u'<b>&lt;i&gt;an example&lt;/i&gt;</b>'

The default value is a relatively conservative list found in bleach.sanitizer.ALLOWED_TAGS.

bleach.sanitizer.ALLOWED_TAGS = [u'a', u'abbr', u'acronym', u'b', u'blockquote', u'code', u'em', u'i', u'li', u'ol', u'strong', u'ul']¶: List of allowed tags

Allowed Attributes (`attributes`)¶

The attributes kwarg lets you specify which attributes are allowed. The value can be a list, a callable or a map of tag name to list or callable.

The default value is also a conservative dict found in bleach.sanitizer.ALLOWED_ATTRIBUTES.

bleach.sanitizer.ALLOWED_ATTRIBUTES = {u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}¶: Map of allowed attributes by tag

Changed in version 2.0: Prior to 2.0, the attributes kwarg value could only be a list or a map.

As a list¶

The attributes value can be a list which specifies the list of attributes allowed for any tag.

For example:

>>> import bleach

>>> bleach.clean(
...     u'<p class="foo" style="color: red; font-weight: bold;">blah blah blah</p>',
...     tags=['p'],
...     attributes=['style'],
...     styles=['color'],
... )
u'<p style="color: red;">blah blah blah</p>'

As a dict¶

The attributes value can be a dict which maps tags to what attributes they can have.

You can also specify *, which will match any tag.

For example, this allows “href” and “rel” for “a” tags, “alt” for the “img” tag and “class” for any tag (including “a” and “img”):

>>> import bleach

>>> attrs = {
...     '*': ['class'],
...     'a': ['href', 'rel'],
...     'img': ['alt'],
... }

>>> bleach.clean(
...    u'<img alt="an example" width=500>',
...    tags=['img'],
...    attributes=attrs
... )
u'<img alt="an example">'

Using functions¶

You can also use a callable that takes the tag, attribute name and attribute value, and returns True to keep the attribute or False to drop it.

You can pass a callable as the attributes argument value and it’ll run for every tag/attr.

For example:

>>> import bleach

>>> def allow_h(tag, name, value):
...     return name[0] == 'h'

>>> bleach.clean(
...    u'<a href="http://example.com" title="link">link</a>',
...    tags=['a'],
...    attributes=allow_h,
... )
u'<a href="http://example.com">link</a>'

You can also pass a callable as a value in an attributes dict and it’ll run for attributes for specified tags:

>>> try:
...     from urllib.parse import urlparse  # Python 3
... except ImportError:
...     from urlparse import urlparse  # Python 2
>>> import bleach

>>> def allow_src(tag, name, value):
...     if name in ('alt', 'height', 'width'):
...         return True
...     if name == 'src':
...         p = urlparse(value)
...         return (not p.netloc) or p.netloc == 'mydomain.com'
...     return False

>>> bleach.clean(
...    u'<img src="http://example.com" alt="an example">',
...    tags=['img'],
...    attributes={
...        'img': allow_src
...    }
... )
u'<img alt="an example">'

Changed in version 2.0: In previous versions of Bleach, the callable took an attribute name and an attribute value. Now it takes a tag, an attribute name and an attribute value.
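As a further illustration of the three-argument callable form, here is a minimal sketch of a callable intended for “a” tags. The function name allow_link_attrs and the policy it encodes (keep title, and keep href only for https or scheme-less URLs) are assumptions made for this example; they are not part of Bleach.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def allow_link_attrs(tag, name, value):
    """Hypothetical policy: keep title; keep href only for https or scheme-less URLs."""
    if name == 'title':
        return True
    if name == 'href':
        scheme = urlparse(value).scheme
        return scheme in ('', 'https')
    return False


# The callable can be exercised directly, without running Bleach.
print(allow_link_attrs('a', 'href', 'https://example.com/'))  # True
print(allow_link_attrs('a', 'href', 'http://example.com/'))   # False
print(allow_link_attrs('a', 'onclick', 'alert(1)'))           # False
```

To apply it to “a” tags only, it would be passed as a value in the attributes dict, for example attributes={'a': allow_link_attrs}.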

Allowed styles (`styles`)¶

If you allow the style attribute, you will also need to specify which styles users are allowed to set, for example color and background-color.

The default value is an empty list. In other words, even if the style attribute is allowed, no style declaration names are allowed until you list them.

For example, to allow users to set the color and font-weight of text:

>>> import bleach

>>> tags = ['p', 'em', 'strong']
>>> attrs = {
...     '*': ['style']
... }
>>> styles = ['color', 'font-weight']

>>> bleach.clean(
...     u'<p style="font-weight: heavy;">my html</p>',
...     tags=tags,
...     attributes=attrs,
...     styles=styles
... )
u'<p style="font-weight: heavy;">my html</p>'

Default styles are stored in bleach.sanitizer.ALLOWED_STYLES.

bleach.sanitizer.ALLOWED_STYLES = []¶: List of allowed styles
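To make the idea of “allowed style declaration names” concrete, the sketch below filters a style value by property name. It is a simplified illustration for intuition only and is not Bleach’s implementation, which performs additional validation; the function name keep_allowed_declarations is hypothetical.

```python
def keep_allowed_declarations(style_value, allowed_styles):
    """Keep only declarations whose property name is in allowed_styles.

    Simplified illustration; not Bleach's own CSS handling.
    """
    kept = []
    for declaration in style_value.split(';'):
        if ':' not in declaration:
            continue  # skip empty or malformed pieces
        prop, value = declaration.split(':', 1)
        if prop.strip().lower() in allowed_styles:
            kept.append('{}: {};'.format(prop.strip(), value.strip()))
    return ' '.join(kept)


print(keep_allowed_declarations('color: red; font-weight: bold;', ['color']))
# color: red;
```

With an empty allowed list, as in the default, every declaration is dropped and the result is an empty string.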

Allowed protocols (`protocols`)¶

If you allow tags that have attributes containing a URI value (like the href attribute of an anchor tag), you may want to adapt the accepted protocols.

For example, this sets allowed protocols to http, https and smb:

>>> import bleach

>>> bleach.clean(
...     '<a href="smb://more_text">allowed protocol</a>',
...     protocols=['http', 'https', 'smb']
... )
u'<a href="smb://more_text">allowed protocol</a>'

This adds smb to the Bleach-specified set of allowed protocols:

>>> import bleach

>>> bleach.clean(
...     '<a href="smb://more_text">allowed protocol</a>',
...     protocols=bleach.ALLOWED_PROTOCOLS + ['smb']
... )
u'<a href="smb://more_text">allowed protocol</a>'

Default protocols are in bleach.sanitizer.ALLOWED_PROTOCOLS.

bleach.sanitizer.ALLOWED_PROTOCOLS = [u'http', u'https', u'mailto']¶: List of allowed protocols
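The effect of a protocol allow-list can be pictured with the standard library alone. The sketch below is for intuition only and is not how Bleach performs the check; the names DEFAULT_PROTOCOLS and scheme_is_allowed are hypothetical, and accepting scheme-less (relative) URLs is an assumption of this sketch.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2

# Copied from the default protocol list shown above.
DEFAULT_PROTOCOLS = ['http', 'https', 'mailto']


def scheme_is_allowed(url, protocols=DEFAULT_PROTOCOLS):
    """Return True if url has no scheme or its scheme is in protocols.

    Assumption of this sketch: scheme-less (relative) URLs are accepted.
    """
    scheme = urlsplit(url.strip()).scheme
    return scheme == '' or scheme in protocols


print(scheme_is_allowed('smb://more_text'))                               # False
print(scheme_is_allowed('smb://more_text', DEFAULT_PROTOCOLS + ['smb']))  # True
print(scheme_is_allowed('javascript:alert(1)'))                           # False
```

Building the extended list with + leaves the default list unchanged, which mirrors the second protocols example above.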

Stripping markup (`strip`)¶

By default, Bleach escapes tags that aren’t specified in the allowed tags list and invalid markup. For example:

>>> import bleach

>>> bleach.clean('<span>is not allowed</span>')
u'&lt;span&gt;is not allowed&lt;/span&gt;'

>>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'])
u'<b>&lt;span&gt;is not allowed&lt;/span&gt;</b>'

If you would rather Bleach stripped this markup entirely, you can pass strip=True:

>>> import bleach

>>> bleach.clean('<span>is not allowed</span>', strip=True)
u'is not allowed'

>>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'], strip=True)
u'<b>is not allowed</b>'

Stripping comments (`strip_comments`)¶

By default, Bleach will strip out HTML comments. To disable this behavior, set strip_comments=False:

>>> import bleach

>>> html = 'my<!-- commented --> html'

>>> bleach.clean(html)
u'my html'

>>> bleach.clean(html, strip_comments=False)
u'my<!-- commented --> html'

Using `bleach.sanitizer.Cleaner`¶

If you’re cleaning a lot of text or you need better control of things, you should create a bleach.sanitizer.Cleaner instance.

class bleach.sanitizer.Cleaner(tags=[u'a', u'abbr', u'acronym', u'b', u'blockquote', u'code', u'em', u'i', u'li', u'ol', u'strong', u'ul'], attributes={u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}, styles=[], protocols=[u'http', u'https', u'mailto'], strip=False, strip_comments=True, filters=None)[source]¶

Cleaner for cleaning HTML fragments of malicious content

This cleaner is security-focused: its sole purpose is to remove malicious content from a string so that it can be displayed as content in a web page.

This cleaner is not designed to transform content for use in non-web-page contexts.

To use:

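from bleach.sanitizer import Cleaner

# Create the Cleaner once and reuse it for every input
cleaner = Cleaner()

# Sample untrusted inputs
all_the_yucky_things = [u'<script>evil()</script>', u'<b onclick="evil()">bold</b>']

for text in all_the_yucky_things:
    sanitized = cleaner.clean(text)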

Initializes a Cleaner

Parameters:

tags (list) – allowed list of tags; defaults to bleach.sanitizer.ALLOWED_TAGS
attributes (dict) – allowed attributes; can be a callable, list or dict; defaults to bleach.sanitizer.ALLOWED_ATTRIBUTES
styles (list) – allowed list of css styles; defaults to bleach.sanitizer.ALLOWED_STYLES
protocols (list) – allowed list of protocols for links; defaults to bleach.sanitizer.ALLOWED_PROTOCOLS
strip (bool) – whether or not to strip disallowed elements
strip_comments (bool) – whether or not to strip HTML comments
filters (list) – list of html5lib Filter classes to pass streamed content through

See also

http://html5lib.readthedocs.io/en/latest/movingparts.html#filters

Warning

Using filters changes the output of bleach.sanitizer.Cleaner.clean. Make sure the way the filters change the output is secure.

clean(text)[source]¶

Cleans text and returns sanitized result as unicode

Parameters:

text (str) – text to be cleaned

Returns:

sanitized text as unicode

New in version 2.0.

html5lib Filters (`filters`)¶

Bleach sanitizing is implemented as an html5lib filter. As a consequence, the streamed content can be passed through additional filters that you specify, after the bleach.sanitizer.BleachSanitizerFilter filter has run.

This lets you add data, drop data and change data as it is being serialized back to a unicode string.

Documentation on html5lib Filters is here: http://html5lib.readthedocs.io/en/latest/movingparts.html#filters

Trivial Filter example:

>>> from bleach.sanitizer import Cleaner
>>> from html5lib.filters.base import Filter

>>> class MooFilter(Filter):
...     def __iter__(self):
...         for token in Filter.__iter__(self):
...             if token['type'] in ['StartTag', 'EmptyTag'] and token['data']:
...                 for attr, value in token['data'].items():
...                     token['data'][attr] = 'moo'
...             yield token
...
>>> ATTRS = {
...     'img': ['rel', 'src']
... }
...
>>> TAGS = ['img']
>>> cleaner = Cleaner(tags=TAGS, attributes=ATTRS, filters=[MooFilter])
>>> dirty = 'this is cute! <img src="http://example.com/puppy.jpg" rel="nofollow">'
>>> cleaner.clean(dirty)
u'this is cute! <img rel="moo" src="moo">'
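Filters can act on text tokens as well as on tags. The sketch below is a second, hypothetical filter (ShoutFilter is not part of Bleach or html5lib) that upper-cases text and leaves tags alone. It is run here against a hand-built token list rather than through Bleach, and the fallback Filter class is only a stand-in so the sketch runs when html5lib is not installed.

```python
try:
    from html5lib.filters.base import Filter
except ImportError:
    # Stand-in used only so this sketch runs without html5lib installed.
    class Filter(object):
        def __init__(self, source):
            self.source = source

        def __iter__(self):
            return iter(self.source)


class ShoutFilter(Filter):
    """Hypothetical filter that upper-cases text tokens and leaves tags alone."""

    def __iter__(self):
        for token in Filter.__iter__(self):
            if token['type'] == 'Characters':
                token['data'] = token['data'].upper()
            yield token


# Hand-built tokens in the general shape html5lib tree walkers emit.
tokens = [
    {'type': 'StartTag', 'name': 'b', 'namespace': None, 'data': {}},
    {'type': 'Characters', 'data': 'hello'},
    {'type': 'EndTag', 'name': 'b', 'namespace': None},
]
shouted = list(ShoutFilter(tokens))
print(shouted[1]['data'])  # HELLO
```

With Bleach and html5lib installed, the class would be passed in the same way as MooFilter above, for example Cleaner(filters=[ShoutFilter]).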

Warning

Filters change the output of cleaning. Make sure that whatever changes the filter is applying maintain the safety guarantees of the output.

New in version 2.0.

Using `bleach.sanitizer.BleachSanitizerFilter`¶

bleach.clean creates a bleach.sanitizer.Cleaner which creates a bleach.sanitizer.BleachSanitizerFilter which does the sanitizing work.

BleachSanitizerFilter is an html5lib filter and can be used anywhere you can use an html5lib filter.

class bleach.sanitizer.BleachSanitizerFilter(source, attributes={u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}, strip_disallowed_elements=False, strip_html_comments=True, **kwargs)[source]¶

html5lib Filter that sanitizes text

This filter can be used anywhere html5lib filters can be used.

Creates a BleachSanitizerFilter instance

Parameters:

source (Treewalker) – stream
tags (list) – allowed list of tags; defaults to bleach.sanitizer.ALLOWED_TAGS
attributes (dict) – allowed attributes; can be a callable, list or dict; defaults to bleach.sanitizer.ALLOWED_ATTRIBUTES
styles (list) – allowed list of css styles; defaults to bleach.sanitizer.ALLOWED_STYLES
protocols (list) – allowed list of protocols for links; defaults to bleach.sanitizer.ALLOWED_PROTOCOLS
strip_disallowed_elements (bool) – whether or not to strip disallowed elements
strip_html_comments (bool) – whether or not to strip HTML comments

New in version 2.0.
