Sanitizing text fragments
bleach.clean() is Bleach’s HTML sanitization method.
Given a fragment of HTML, Bleach will parse it according to the HTML5 parsing algorithm and sanitize any disallowed tags or attributes. This algorithm also takes care of things like unclosed and (some) misnested tags.
Note
You may pass in a string or a unicode object, but Bleach will always return unicode.
bleach.clean(text, tags=[u'a', u'abbr', u'acronym', u'b', u'blockquote', u'code', u'em', u'i', u'li', u'ol', u'strong', u'ul'], attributes={u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}, styles=[], protocols=[u'http', u'https', u'mailto'], strip=False, strip_comments=True)
Clean an HTML fragment of malicious content and return it.
This function is security-focused: its sole purpose is to remove malicious content from a string so that it can be displayed as content in a web page.
It is not designed to transform content for use in non-web-page contexts.
Example:
import bleach
better_text = bleach.clean(yucky_text)
Note
If you’re cleaning a lot of text and passing the same argument values or you want more configurability, consider using a bleach.sanitizer.Cleaner instance.
Parameters:
- text (str) – the text to clean
- tags (list) – allowed list of tags; defaults to bleach.sanitizer.ALLOWED_TAGS
- attributes (dict) – allowed attributes; can be a callable, list or dict; defaults to bleach.sanitizer.ALLOWED_ATTRIBUTES
- styles (list) – allowed list of css styles; defaults to bleach.sanitizer.ALLOWED_STYLES
- protocols (list) – allowed list of protocols for links; defaults to bleach.sanitizer.ALLOWED_PROTOCOLS
- strip (bool) – whether or not to strip disallowed elements
- strip_comments (bool) – whether or not to strip HTML comments
Returns: cleaned text as unicode
Allowed tags (tags)
The tags kwarg specifies the allowed set of HTML tags. It should be a list, tuple, or other iterable. Any HTML tags not in this list will be escaped or stripped from the text.
For example:
>>> import bleach
>>> bleach.clean(
... u'<b><i>an example</i></b>',
... tags=['b'],
... )
u'<b>&lt;i&gt;an example&lt;/i&gt;</b>'
The default value is a relatively conservative list found in bleach.sanitizer.ALLOWED_TAGS.
bleach.sanitizer.ALLOWED_TAGS = [u'a', u'abbr', u'acronym', u'b', u'blockquote', u'code', u'em', u'i', u'li', u'ol', u'strong', u'ul']
List of allowed tags
Allowed Attributes (attributes)
The attributes kwarg lets you specify which attributes are allowed. The value can be a list, a callable or a map of tag name to list or callable.
The default value is also a conservative dict found in bleach.sanitizer.ALLOWED_ATTRIBUTES.
bleach.sanitizer.ALLOWED_ATTRIBUTES = {u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}
Map of allowed attributes by tag
Changed in version 2.0: Prior to 2.0, the attributes kwarg value could only be a list or a map.
As a list
The attributes value can be a list which specifies the list of attributes allowed for any tag.
For example:
>>> import bleach
>>> bleach.clean(
... u'<p class="foo" style="color: red; font-weight: bold;">blah blah blah</p>',
... tags=['p'],
... attributes=['style'],
... styles=['color'],
... )
u'<p style="color: red;">blah blah blah</p>'
As a dict
The attributes value can be a dict which maps tags to what attributes they can have.
You can also specify *, which will match any tag.
For example, this allows “href” and “rel” for “a” tags, “alt” for the “img” tag and “class” for any tag (including “a” and “img”):
>>> import bleach
>>> attrs = {
... '*': ['class'],
... 'a': ['href', 'rel'],
... 'img': ['alt'],
... }
>>> bleach.clean(
... u'<img alt="an example" width=500>',
... tags=['img'],
... attributes=attrs
... )
u'<img alt="an example">'
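The lookup rules for the list, dict and callable forms can be modelled in a few lines of plain Python. This is a simplified sketch for illustration only, not Bleach’s own implementation: the helper name attribute_allowed is hypothetical, and it assumes a tag’s own entry is consulted first with the * entry used as a fallback.

```python
def attribute_allowed(spec, tag, name, value):
    """Return True if `name` would be kept on `tag` under `spec`.

    `spec` mirrors the shapes accepted by the `attributes` kwarg:
    a callable, a list, or a dict of tag name -> list or callable.
    """
    if callable(spec):
        # A bare callable decides for every tag/attribute pair.
        return bool(spec(tag, name, value))
    if isinstance(spec, dict):
        # Assumption: check the tag's own entry first, then the '*' entry.
        for key in (tag, '*'):
            if key not in spec:
                continue
            entry = spec[key]
            if callable(entry):
                if entry(tag, name, value):
                    return True
            elif name in entry:
                return True
        return False
    # Otherwise treat it as a flat list that applies to any tag.
    return name in spec


attrs = {
    '*': ['class'],
    'a': ['href', 'rel'],
    'img': ['alt'],
}

print(attribute_allowed(attrs, 'img', 'alt', 'an example'))  # True
print(attribute_allowed(attrs, 'img', 'width', '500'))       # False
print(attribute_allowed(attrs, 'img', 'class', 'thumb'))     # True
print(attribute_allowed(['style'], 'p', 'class', 'foo'))     # False
```

The second call shows why width is dropped from the img tag in the example above: it appears neither in the img entry nor in the * entry.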
Using functions
You can also use callables that take the tag, attribute name and attribute value and return True to keep the attribute or False to drop it.
You can pass a callable as the attributes argument value and it’ll run for every tag/attr.
For example:
>>> import bleach
>>> def allow_h(tag, name, value):
...     return name[0] == 'h'
>>> bleach.clean(
... u'<a href="http://example.com" title="link">link</a>',
... tags=['a'],
... attributes=allow_h,
... )
u'<a href="http://example.com">link</a>'
You can also pass a callable as a value in an attributes dict and it’ll run for attributes for specified tags:
>>> from urlparse import urlparse
>>> import bleach
>>> def allow_src(tag, name, value):
...     if name in ('alt', 'height', 'width'):
...         return True
...     if name == 'src':
...         p = urlparse(value)
...         return (not p.netloc) or p.netloc == 'mydomain.com'
...     return False
>>> bleach.clean(
... u'<img src="http://example.com" alt="an example">',
... tags=['img'],
... attributes={
... 'img': allow_src
... }
... )
u'<img alt="an example">'
Changed in version 2.0: In previous versions of Bleach, the callable took an attribute name and an attribute value. Now it takes a tag, an attribute name and an attribute value.
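Because an attribute callable is an ordinary function of (tag, name, value), it can be written and unit-tested on its own before being handed to Bleach. The sketch below is one possible shape, not a recommended policy: TRUSTED_HOSTS, allow_trusted_links and clean_links are hypothetical names, and the host list is made up for illustration. Scheme filtering is deliberately left to the protocols argument.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

TRUSTED_HOSTS = ('example.com', 'www.example.com')  # hypothetical allow-list


def allow_trusted_links(tag, name, value):
    """Attribute callable: keep `title`, and keep `href` only for trusted hosts."""
    if name == 'title':
        return True
    if name == 'href':
        parsed = urlparse(value)
        # Relative URLs have no netloc; absolute ones must be on the list.
        return (not parsed.netloc) or parsed.hostname in TRUSTED_HOSTS
    return False


def clean_links(html):
    """Run bleach.clean() with the callable attached to the `a` tag only."""
    import bleach  # imported lazily so the callable can be tested without Bleach
    return bleach.clean(html, tags=['a'], attributes={'a': allow_trusted_links})


print(allow_trusted_links('a', 'href', 'http://example.com/page'))  # True
print(allow_trusted_links('a', 'href', 'http://evil.test/'))        # False
print(allow_trusted_links('a', 'href', '/relative/path'))           # True
print(allow_trusted_links('a', 'onclick', 'alert(1)'))              # False
```

Keeping the decision logic in a plain function means the policy can be covered by ordinary tests, independent of the HTML parsing that happens inside bleach.clean().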
Allowed styles (styles)
If you allow the style attribute, you also need to specify which styles users may set, for example color and background-color.
The default value is an empty list. In other words, the style attribute will be allowed but no style declaration names will be allowed.
For example, to allow users to set the color and font-weight of text:
>>> import bleach
>>> tags = ['p', 'em', 'strong']
>>> attrs = {
... '*': ['style']
... }
>>> styles = ['color', 'font-weight']
>>> bleach.clean(
... u'<p style="font-weight: heavy;">my html</p>',
... tags=tags,
... attributes=attrs,
... styles=styles
... )
u'<p style="font-weight: heavy;">my html</p>'
Default styles are stored in bleach.sanitizer.ALLOWED_STYLES.
bleach.sanitizer.ALLOWED_STYLES = []
List of allowed styles
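To make the effect of the styles list concrete, here is a simplified, standard-library-only model of filtering a style value by declaration name. It is an illustration under stated assumptions, not Bleach’s CSS sanitizer: filter_style is a hypothetical helper, and real CSS sanitizing has more to deal with than this sketch covers (for instance values that contain url(...)).

```python
def filter_style(style_value, allowed_styles):
    """Keep only `name: value;` declarations whose name is in allowed_styles."""
    kept = []
    for declaration in style_value.split(';'):
        if ':' not in declaration:
            continue  # skip empty or malformed pieces
        name, value = declaration.split(':', 1)
        name = name.strip().lower()
        if name in allowed_styles:
            kept.append('%s: %s;' % (name, value.strip()))
    return ' '.join(kept)


print(filter_style('color: red; font-weight: bold;', ['color']))
# color: red;
print(repr(filter_style('color: red; font-weight: bold;', [])))
# ''
```

The second call mirrors the default: with an empty styles list, every declaration is dropped even when the style attribute itself is allowed.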
Allowed protocols (protocols)
If you allow tags that have attributes containing a URI value (like the href attribute of an anchor tag), you may want to adapt the accepted protocols.
For example, this sets allowed protocols to http, https and smb:
>>> import bleach
>>> bleach.clean(
... '<a href="smb://more_text">allowed protocol</a>',
... protocols=['http', 'https', 'smb']
... )
u'<a href="smb://more_text">allowed protocol</a>'
This adds smb to the Bleach-specified set of allowed protocols:
>>> import bleach
>>> bleach.clean(
... '<a href="smb://more_text">allowed protocol</a>',
... protocols=bleach.ALLOWED_PROTOCOLS + ['smb']
... )
u'<a href="smb://more_text">allowed protocol</a>'
Default protocols are in bleach.sanitizer.ALLOWED_PROTOCOLS.
bleach.sanitizer.ALLOWED_PROTOCOLS = [u'http', u'https', u'mailto']
List of allowed protocols
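The idea behind the protocols list can be sketched as a scheme check against an allow-list. This is a rough model only, assuming the URL’s scheme is what gets compared: scheme_allowed and DEFAULT_PROTOCOLS are hypothetical names, and a real sanitizer also has to cope with obfuscated values, which this sketch ignores.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Same values as the default list shown above.
DEFAULT_PROTOCOLS = ['http', 'https', 'mailto']


def scheme_allowed(url, protocols=DEFAULT_PROTOCOLS):
    """Return True if the URL is relative or its scheme is in `protocols`."""
    scheme = urlparse(url.strip()).scheme.lower()
    if not scheme:
        return True  # no scheme: treat as a relative URL
    return scheme in protocols


print(scheme_allowed('smb://more_text'))                               # False
print(scheme_allowed('smb://more_text', DEFAULT_PROTOCOLS + ['smb']))  # True
print(scheme_allowed('javascript:alert(1)'))                           # False
print(scheme_allowed('/about'))                                        # True
```

Extending the default list, as in the second call, corresponds to the ALLOWED_PROTOCOLS + ['smb'] pattern shown earlier.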
Stripping markup (strip)
By default, Bleach escapes invalid markup and any tags that aren’t in the allowed tags list. For example:
>>> import bleach
>>> bleach.clean('<span>is not allowed</span>')
u'&lt;span&gt;is not allowed&lt;/span&gt;'
>>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'])
u'<b>&lt;span&gt;is not allowed&lt;/span&gt;</b>'
If you would rather Bleach stripped this markup entirely, you can pass strip=True:
>>> import bleach
>>> bleach.clean('<span>is not allowed</span>', strip=True)
u'is not allowed'
>>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'], strip=True)
u'<b>is not allowed</b>'
Stripping comments (strip_comments)
By default, Bleach will strip out HTML comments. To disable this behavior, set strip_comments=False:
>>> import bleach
>>> html = 'my<!-- commented --> html'
>>> bleach.clean(html)
u'my html'
>>> bleach.clean(html, strip_comments=False)
u'my<!-- commented --> html'
Using bleach.sanitizer.Cleaner
If you’re cleaning a lot of text or you need better control of things, you should create a bleach.sanitizer.Cleaner instance.
class bleach.sanitizer.Cleaner(tags=[u'a', u'abbr', u'acronym', u'b', u'blockquote', u'code', u'em', u'i', u'li', u'ol', u'strong', u'ul'], attributes={u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}, styles=[], protocols=[u'http', u'https', u'mailto'], strip=False, strip_comments=True, filters=None)
Cleaner for cleaning HTML fragments of malicious content.
This cleaner is security-focused: its sole purpose is to remove malicious content from a string so that it can be displayed as content in a web page.
It is not designed to transform content for use in non-web-page contexts.
To use:
from bleach.sanitizer import Cleaner

cleaner = Cleaner()

for text in all_the_yucky_things:
    sanitized = cleaner.clean(text)
Initializes a Cleaner
Parameters:
- tags (list) – allowed list of tags; defaults to bleach.sanitizer.ALLOWED_TAGS
- attributes (dict) – allowed attributes; can be a callable, list or dict; defaults to bleach.sanitizer.ALLOWED_ATTRIBUTES
- styles (list) – allowed list of css styles; defaults to bleach.sanitizer.ALLOWED_STYLES
- protocols (list) – allowed list of protocols for links; defaults to bleach.sanitizer.ALLOWED_PROTOCOLS
- strip (bool) – whether or not to strip disallowed elements
- strip_comments (bool) – whether or not to strip HTML comments
- filters (list) – list of html5lib Filter classes to pass streamed content through
Warning
Using filters changes the output of bleach.sanitizer.Cleaner.clean. Make sure the way the filters change the output is secure.
New in version 2.0.
html5lib Filters (filters)
Bleach sanitizing is implemented as an html5lib filter. The consequence of this is that we can pass the streamed content through additional specified filters after the bleach.sanitizer.BleachSanitizerFilter filter has run.
This lets you add data, drop data and change data as it is being serialized back to a unicode string.
Documentation on html5lib Filters is here: http://html5lib.readthedocs.io/en/latest/movingparts.html#filters
Trivial Filter example:
>>> from bleach.sanitizer import Cleaner
>>> from html5lib.filters.base import Filter
>>> class MooFilter(Filter):
...     def __iter__(self):
...         for token in Filter.__iter__(self):
...             if token['type'] in ['StartTag', 'EmptyTag'] and token['data']:
...                 for attr, value in token['data'].items():
...                     token['data'][attr] = 'moo'
...             yield token
...
>>> ATTRS = {
... 'img': ['rel', 'src']
... }
...
>>> TAGS = ['img']
>>> cleaner = Cleaner(tags=TAGS, attributes=ATTRS, filters=[MooFilter])
>>> dirty = 'this is cute! <img src="http://example.com/puppy.jpg" rel="nofollow">'
>>> cleaner.clean(dirty)
u'this is cute! <img rel="moo" src="moo">'
Warning
Filters change the output of cleaning. Make sure that whatever changes the filter is applying maintain the safety guarantees of the output.
New in version 2.0.
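A filter is just a wrapper around a token stream, so its logic can be exercised on hand-built tokens. The sketch below is a hypothetical NofollowFilter that forces rel="nofollow" on anchor start tags. It rests on two labelled assumptions: that attribute keys in token['data'] are (namespace, name) tuples, and that the small stand-in base class (used only when html5lib is not installed) behaves like html5lib’s Filter for this purpose.

```python
from collections import OrderedDict

try:
    from html5lib.filters.base import Filter
except ImportError:
    # Minimal stand-in so the sketch runs without html5lib installed.
    class Filter(object):
        def __init__(self, source):
            self.source = source

        def __iter__(self):
            return iter(self.source)


class NofollowFilter(Filter):
    """Force rel="nofollow" on every <a> start tag in the token stream."""

    def __iter__(self):
        for token in Filter.__iter__(self):
            if token['type'] == 'StartTag' and token['name'] == 'a':
                # Assumption: attribute keys are (namespace, name) tuples.
                token['data'][(None, 'rel')] = 'nofollow'
            yield token


# Hand-built tokens, shaped the way this sketch assumes html5lib emits them.
tokens = [
    {'type': 'StartTag', 'name': 'a',
     'data': OrderedDict([((None, 'href'), 'http://example.com')])},
    {'type': 'Characters', 'data': 'link'},
    {'type': 'EndTag', 'name': 'a'},
]

filtered = list(NofollowFilter(tokens))
print(filtered[0]['data'][(None, 'rel')])  # nofollow
```

Following the MooFilter example, such a class would be passed as Cleaner(tags=['a'], filters=[NofollowFilter]). As the warning says, check that whatever a filter adds keeps the output safe.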
Using bleach.sanitizer.BleachSanitizerFilter
bleach.clean creates a bleach.sanitizer.Cleaner which creates a bleach.sanitizer.BleachSanitizerFilter which does the sanitizing work.
BleachSanitizerFilter is an html5lib filter and can be used anywhere you can use an html5lib filter.
class bleach.sanitizer.BleachSanitizerFilter(source, attributes={u'a': [u'href', u'title'], u'acronym': [u'title'], u'abbr': [u'title']}, strip_disallowed_elements=False, strip_html_comments=True, **kwargs)
html5lib Filter that sanitizes text.
This filter can be used anywhere html5lib filters can be used.
Creates a BleachSanitizerFilter instance
Parameters:
- source (Treewalker) – stream
- tags (list) – allowed list of tags; defaults to bleach.sanitizer.ALLOWED_TAGS
- attributes (dict) – allowed attributes; can be a callable, list or dict; defaults to bleach.sanitizer.ALLOWED_ATTRIBUTES
- styles (list) – allowed list of css styles; defaults to bleach.sanitizer.ALLOWED_STYLES
- protocols (list) – allowed list of protocols for links; defaults to bleach.sanitizer.ALLOWED_PROTOCOLS
- strip_disallowed_elements (bool) – whether or not to strip disallowed elements
- strip_html_comments (bool) – whether or not to strip HTML comments
New in version 2.0.