Sanitizing text fragments

bleach.clean() is Bleach’s HTML sanitization method.

Given a fragment of HTML, Bleach will parse it according to the HTML5 parsing algorithm and sanitize any disallowed tags or attributes. This algorithm also takes care of things like unclosed and (some) misnested tags.

Note

You may pass in a string or a unicode object, but Bleach will always return unicode.

bleach.clean(text, tags=[u'a', u'abbr', u'acronym', u'b', u'blockquote', u'code', u'em', u'i', u'li', u'ol', u'strong', u'ul'], attributes={u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}, styles=[], protocols=[u'http', u'https', u'mailto'], strip=False, strip_comments=True)

Clean an HTML fragment of malicious content and return it

This is a security-focused function whose sole purpose is to remove malicious content from a string so that it can be displayed as content in a web page.

This function is not designed to transform content for use in non-web-page contexts.

Example:

import bleach

better_text = bleach.clean(yucky_text)

Note

If you’re cleaning a lot of text and passing the same argument values or you want more configurability, consider using a bleach.sanitizer.Cleaner instance.

Parameters:
  • text (str) – the text to clean
  • tags (list) – allowed list of tags; defaults to bleach.sanitizer.ALLOWED_TAGS
  • attributes (dict) – allowed attributes; can be a callable, list or dict; defaults to bleach.sanitizer.ALLOWED_ATTRIBUTES
  • styles (list) – allowed list of css styles; defaults to bleach.sanitizer.ALLOWED_STYLES
  • protocols (list) – allowed list of protocols for links; defaults to bleach.sanitizer.ALLOWED_PROTOCOLS
  • strip (bool) – whether or not to strip disallowed elements
  • strip_comments (bool) – whether or not to strip HTML comments
Returns:

cleaned text as unicode

Allowed tags (tags)

The tags kwarg specifies the allowed set of HTML tags. It should be a list, tuple, or other iterable. Any HTML tags not in this list will be escaped or stripped from the text.

For example:

>>> import bleach

>>> bleach.clean(
...     u'<b><i>an example</i></b>',
...     tags=['b'],
... )
u'<b>&lt;i&gt;an example&lt;/i&gt;</b>'

The default value is a relatively conservative list found in bleach.sanitizer.ALLOWED_TAGS.

bleach.sanitizer.ALLOWED_TAGS = [u'a', u'abbr', u'acronym', u'b', u'blockquote', u'code', u'em', u'i', u'li', u'ol', u'strong', u'ul']

List of allowed tags

Allowed Attributes (attributes)

The attributes kwarg lets you specify which attributes are allowed. The value can be a list, a callable or a map of tag name to list or callable.

The default value is also a conservative dict found in bleach.sanitizer.ALLOWED_ATTRIBUTES.

bleach.sanitizer.ALLOWED_ATTRIBUTES = {u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}

Map of allowed attributes by tag

Changed in version 2.0: Prior to 2.0, the attributes kwarg value could only be a list or a map.

As a list

The attributes value can be a list which specifies the list of attributes allowed for any tag.

For example:

>>> import bleach

>>> bleach.clean(
...     u'<p class="foo" style="color: red; font-weight: bold;">blah blah blah</p>',
...     tags=['p'],
...     attributes=['style'],
...     styles=['color'],
... )
u'<p style="color: red;">blah blah blah</p>'

As a dict

The attributes value can be a dict which maps tags to what attributes they can have.

You can also specify *, which will match any tag.

For example, this allows “href” and “rel” for “a” tags, “alt” for the “img” tag and “class” for any tag (including “a” and “img”):

>>> import bleach

>>> attrs = {
...     '*': ['class'],
...     'a': ['href', 'rel'],
...     'img': ['alt'],
... }

>>> bleach.clean(
...    u'<img alt="an example" width=500>',
...    tags=['img'],
...    attributes=attrs
... )
u'<img alt="an example">'
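The list, callable and dict forms of the attributes value can be pictured with a small resolver. The following is a minimal sketch of the rules described in this section, not Bleach’s internal code; is_attribute_allowed is a hypothetical helper, and checking the tag-specific entry before falling back to * is an assumption of the sketch.

```python
def is_attribute_allowed(attributes, tag, name, value):
    """Return True if attribute `name` should be kept on `tag`.

    Illustrative only: mirrors the list / callable / dict forms described
    in this section. It is not Bleach's implementation.
    """
    if callable(attributes):
        # A callable decides for every tag/attribute pair.
        return bool(attributes(tag, name, value))
    if isinstance(attributes, dict):
        # Assumption: try the tag-specific rule first, then the '*' rule.
        for key in (tag, '*'):
            if key not in attributes:
                continue
            rule = attributes[key]
            if callable(rule):
                if rule(tag, name, value):
                    return True
            elif name in rule:
                return True
        return False
    # A plain list applies to every tag.
    return name in attributes


attrs = {
    '*': ['class'],
    'a': ['href', 'rel'],
    'img': ['alt'],
}
keep_alt = is_attribute_allowed(attrs, 'img', 'alt', 'an example')
keep_width = is_attribute_allowed(attrs, 'img', 'width', '500')
```

With the attrs map from the example above, keep_alt is True and keep_width is False, which matches the width attribute being dropped from the img tag.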

Using functions

You can also use callables that take the tag, attribute name and attribute value and return True to keep the attribute or False to drop it.

You can pass a callable as the attributes argument value and it’ll run for every tag/attr.

For example:

>>> import bleach

>>> def allow_h(tag, name, value):
...     return name[0] == 'h'

>>> bleach.clean(
...    u'<a href="http://example.com" title="link">link</a>',
...    tags=['a'],
...    attributes=allow_h,
... )
u'<a href="http://example.com">link</a>'

You can also pass a callable as a value in an attributes dict and it’ll run for attributes for specified tags:

>>> try:
...     from urllib.parse import urlparse  # Python 3
... except ImportError:
...     from urlparse import urlparse  # Python 2
>>> import bleach

>>> def allow_src(tag, name, value):
...     if name in ('alt', 'height', 'width'):
...         return True
...     if name == 'src':
...         p = urlparse(value)
...         return (not p.netloc) or p.netloc == 'mydomain.com'
...     return False

>>> bleach.clean(
...    u'<img src="http://example.com" alt="an example">',
...    tags=['img'],
...    attributes={
...        'img': allow_src
...    }
... )
u'<img alt="an example">'

Changed in version 2.0: In previous versions of Bleach, the callable took an attribute name and an attribute value. Now it takes a tag, an attribute name and an attribute value.
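Because an attribute callable is a plain function of (tag, name, value), it can be exercised on its own before it is handed to Bleach. The sketch below is one possible callable; allow_data_attrs, MAX_VALUE_LENGTH and the 64-character limit are hypothetical choices made for illustration.

```python
MAX_VALUE_LENGTH = 64  # hypothetical limit chosen for this sketch


def allow_data_attrs(tag, name, value):
    """Keep `title` everywhere and any short `data-*` attribute.

    Takes the three arguments described above (tag, attribute name,
    attribute value) and returns True to keep the attribute.
    """
    if name == 'title':
        return True
    if name.startswith('data-'):
        # Reject suspiciously long values instead of trying to clean them.
        return len(value) <= MAX_VALUE_LENGTH
    return False


# The callable can be checked directly, without running a full clean.
kept = allow_data_attrs('p', 'data-id', '42')
dropped = allow_data_attrs('p', 'onclick', 'doEvil()')
```

A function like this can be passed as the attributes value itself, or as the value for a single tag in an attributes dict.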

Allowed styles (styles)

If you allow the style attribute, you will also need to specify which styles users are allowed to set, for example color and background-color.

The default value is an empty list. In other words, even when the style attribute is allowed, no style declarations are allowed until you list their names here.

For example, to allow users to set the color and font-weight of text:

>>> import bleach

>>> tags = ['p', 'em', 'strong']
>>> attrs = {
...     '*': ['style']
... }
>>> styles = ['color', 'font-weight']

>>> bleach.clean(
...     u'<p style="font-weight: heavy;">my html</p>',
...     tags=tags,
...     attributes=attrs,
...     styles=styles
... )
u'<p style="font-weight: heavy;">my html</p>'

Default styles are stored in bleach.sanitizer.ALLOWED_STYLES.

bleach.sanitizer.ALLOWED_STYLES = []

List of allowed styles
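The effect of the styles list can be pictured as filtering the declarations in a style value by property name. The following is a simplified sketch of that idea, not Bleach’s implementation; filter_declarations is a hypothetical helper, and splitting on semicolons is an assumption that breaks for values which themselves contain a semicolon.

```python
def filter_declarations(style, allowed_styles):
    """Keep only the declarations whose property name is allowed.

    Simplified on purpose: declarations are split on ';', so quoted
    strings or url(...) values containing ';' are not handled.
    """
    kept = []
    for declaration in style.split(';'):
        if ':' not in declaration:
            continue  # skip empty or malformed pieces
        prop, _, value = declaration.partition(':')
        prop = prop.strip().lower()
        if prop in allowed_styles:
            kept.append('{}: {};'.format(prop, value.strip()))
    return ' '.join(kept)


only_color = filter_declarations('color: red; font-weight: bold;', ['color'])
nothing = filter_declarations('color: red;', [])
```

Here only_color is 'color: red;', the same style value that survives in the list example earlier in this section, and nothing is an empty string because the default empty list allows no declarations.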

Allowed protocols (protocols)

If you allow tags that have attributes containing a URI value (like the href attribute of an anchor tag), you may want to adapt the accepted protocols.

For example, this sets allowed protocols to http, https and smb:

>>> import bleach

>>> bleach.clean(
...     '<a href="smb://more_text">allowed protocol</a>',
...     protocols=['http', 'https', 'smb']
... )
u'<a href="smb://more_text">allowed protocol</a>'

This adds smb to the Bleach-specified set of allowed protocols:

>>> import bleach

>>> bleach.clean(
...     '<a href="smb://more_text">allowed protocol</a>',
...     protocols=bleach.ALLOWED_PROTOCOLS + ['smb']
... )
u'<a href="smb://more_text">allowed protocol</a>'

Default protocols are in bleach.sanitizer.ALLOWED_PROTOCOLS.

bleach.sanitizer.ALLOWED_PROTOCOLS = [u'http', u'https', u'mailto']

List of allowed protocols
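The idea behind the protocols list can be sketched as a scheme check on the URI value. This is an illustration only, not how Bleach evaluates URIs: has_allowed_protocol and DEFAULT_PROTOCOLS are hypothetical names, treating a value without a scheme as a relative URL is an assumption, and a real sanitizer also has to cope with obfuscated schemes that this sketch ignores.

```python
from urllib.parse import urlparse

# Same values as the ALLOWED_PROTOCOLS default shown above.
DEFAULT_PROTOCOLS = ['http', 'https', 'mailto']


def has_allowed_protocol(uri, protocols=DEFAULT_PROTOCOLS):
    """Return True if the URI's scheme is in `protocols`."""
    scheme = urlparse(uri.strip()).scheme.lower()
    if not scheme:
        # Assumption of this sketch: no scheme means a relative URL.
        return True
    return scheme in protocols


default_smb = has_allowed_protocol('smb://more_text')
extended_smb = has_allowed_protocol(
    'smb://more_text',
    protocols=DEFAULT_PROTOCOLS + ['smb'],
)
```

With the default list default_smb is False, while extending the list with smb, as in the second example above, makes extended_smb True.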

Stripping markup (strip)

By default, Bleach escapes invalid markup and any tags that aren’t in the allowed tags list. For example:

>>> import bleach

>>> bleach.clean('<span>is not allowed</span>')
u'&lt;span&gt;is not allowed&lt;/span&gt;'

>>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'])
u'<b>&lt;span&gt;is not allowed&lt;/span&gt;</b>'

If you would rather Bleach stripped this markup entirely, you can pass strip=True:

>>> import bleach

>>> bleach.clean('<span>is not allowed</span>', strip=True)
u'is not allowed'

>>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'], strip=True)
u'<b>is not allowed</b>'

Stripping comments (strip_comments)

By default, Bleach will strip out HTML comments. To disable this behavior, set strip_comments=False:

>>> import bleach

>>> html = 'my<!-- commented --> html'

>>> bleach.clean(html)
u'my html'

>>> bleach.clean(html, strip_comments=False)
u'my<!-- commented --> html'

Using bleach.sanitizer.Cleaner

If you’re cleaning a lot of text or you need better control of things, you should create a bleach.sanitizer.Cleaner instance.

class bleach.sanitizer.Cleaner(tags=[u'a', u'abbr', u'acronym', u'b', u'blockquote', u'code', u'em', u'i', u'li', u'ol', u'strong', u'ul'], attributes={u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}, styles=[], protocols=[u'http', u'https', u'mailto'], strip=False, strip_comments=True, filters=None)

Cleaner for cleaning HTML fragments of malicious content

This cleaner is security-focused; its sole purpose is to remove malicious content from a string so that it can be displayed as content in a web page.

This cleaner is not designed to transform content for use in non-web-page contexts.

To use:

from bleach.sanitizer import Cleaner

cleaner = Cleaner()

for text in all_the_yucky_things:
    sanitized = cleaner.clean(text)

Initializes a Cleaner

Parameters:
  • tags (list) – allowed list of tags; defaults to bleach.sanitizer.ALLOWED_TAGS
  • attributes (dict) – allowed attributes; can be a callable, list or dict; defaults to bleach.sanitizer.ALLOWED_ATTRIBUTES
  • styles (list) – allowed list of css styles; defaults to bleach.sanitizer.ALLOWED_STYLES
  • protocols (list) – allowed list of protocols for links; defaults to bleach.sanitizer.ALLOWED_PROTOCOLS
  • strip (bool) – whether or not to strip disallowed elements
  • strip_comments (bool) – whether or not to strip HTML comments
  • filters (list) –

    list of html5lib Filter classes to pass streamed content through

    Warning

    Using filters changes the output of bleach.Cleaner.clean. Make sure the way the filters change the output is secure.

clean(text)

Cleans text and returns sanitized result as unicode

Parameters: text (str) – text to be cleaned
Returns: sanitized text as unicode

New in version 2.0.

html5lib Filters (filters)

Bleach sanitizing is implemented as an html5lib filter. As a consequence, the streamed content can be passed through additional filters that you specify after the bleach.sanitizer.BleachSanitizerFilter filter has run.

This lets you add, drop and change data as it is being serialized back to a unicode string.

Documentation on html5lib Filters is here: http://html5lib.readthedocs.io/en/latest/movingparts.html#filters

Trivial Filter example:

>>> from bleach.sanitizer import Cleaner
>>> from html5lib.filters.base import Filter

>>> class MooFilter(Filter):
...     def __iter__(self):
...         for token in Filter.__iter__(self):
...             if token['type'] in ['StartTag', 'EmptyTag'] and token['data']:
...                 for attr, value in token['data'].items():
...                     token['data'][attr] = 'moo'
...             yield token
...
>>> ATTRS = {
...     'img': ['rel', 'src']
... }
...
>>> TAGS = ['img']
>>> cleaner = Cleaner(tags=TAGS, attributes=ATTRS, filters=[MooFilter])
>>> dirty = 'this is cute! <img src="http://example.com/puppy.jpg" rel="nofollow">'
>>> cleaner.clean(dirty)
u'this is cute! <img rel="moo" src="moo">'

Warning

Filters change the output of cleaning. Make sure that whatever changes the filter is applying maintain the safety guarantees of the output.

New in version 2.0.
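A filter’s token handling can be tried out on a hand-built token stream before it is plugged into a Cleaner. The sketch below assumes simplified tokens and uses a stand-in base class so that it runs without html5lib; TokenFilter and ShoutFilter are hypothetical names, and in real use the filter would subclass html5lib.filters.base.Filter as MooFilter does above.

```python
class TokenFilter:
    """Stand-in for html5lib.filters.base.Filter (assumption of this sketch).

    It only mirrors the shape used here: it wraps a token source and
    iterates over it.
    """

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        return iter(self.source)


class ShoutFilter(TokenFilter):
    """Hypothetical filter that upper-cases text content only."""

    def __iter__(self):
        for token in TokenFilter.__iter__(self):
            if token['type'] == 'Characters':
                # Copy the token so the incoming stream is left untouched.
                token = dict(token, data=token['data'].upper())
            yield token


# Simplified tokens; real html5lib tokens carry more keys (e.g. namespace).
tokens = [
    {'type': 'StartTag', 'name': 'b', 'data': {}},
    {'type': 'Characters', 'data': 'hello'},
    {'type': 'EndTag', 'name': 'b'},
]
result = list(ShoutFilter(tokens))
```

Only the Characters token changes: its data becomes 'HELLO', while the tag tokens pass through unchanged, so the filter leaves the sanitized markup structure as it was.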

Using bleach.sanitizer.BleachSanitizerFilter

bleach.clean creates a bleach.sanitizer.Cleaner which creates a bleach.sanitizer.BleachSanitizerFilter which does the sanitizing work.

BleachSanitizerFilter is an html5lib filter and can be used anywhere you can use an html5lib filter.

class bleach.sanitizer.BleachSanitizerFilter(source, attributes={u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}, strip_disallowed_elements=False, strip_html_comments=True, **kwargs)

html5lib Filter that sanitizes text

This filter can be used anywhere html5lib filters can be used.

Creates a BleachSanitizerFilter instance

Parameters:
  • source (Treewalker) – stream
  • tags (list) – allowed list of tags; defaults to bleach.sanitizer.ALLOWED_TAGS
  • attributes (dict) – allowed attributes; can be a callable, list or dict; defaults to bleach.sanitizer.ALLOWED_ATTRIBUTES
  • styles (list) – allowed list of css styles; defaults to bleach.sanitizer.ALLOWED_STYLES
  • protocols (list) – allowed list of protocols for links; defaults to bleach.sanitizer.ALLOWED_PROTOCOLS
  • strip_disallowed_elements (bool) – whether or not to strip disallowed elements
  • strip_html_comments (bool) – whether or not to strip HTML comments

New in version 2.0.