Sanitization
Most feeds embed HTML markup within feed elements. Some feeds even embed other types of markup, such as SVG or MathML. Since many feed aggregators use a web browser (or browser component) to display content, Universal Feed Parser sanitizes embedded markup to remove things that could pose security risks.
These elements are sanitized by default:
- entries[i].content
- entries[i].summary
- entries[i].title
- feed.info
- feed.rights
- feed.subtitle
- feed.title
Note
If the content is declared to be (or is determined to be) text/plain, it will not be sanitized. This is to avoid data loss. It is recommended that you check the content type in e.g. entries[i].summary_detail.type. If it is text/plain, then it has not been sanitized (and you should perform HTML escaping before rendering the content).
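The check described in the note can be sketched as follows. This is a minimal illustration using only the standard library; render_safe is a hypothetical helper, and its two arguments stand in for a value such as entries[i].summary and the matching entries[i].summary_detail.type.

```python
import html


def render_safe(value, content_type):
    """Return text that can be placed in an HTML page.

    Hypothetical helper: `value` stands in for e.g. entries[i].summary and
    `content_type` for entries[i].summary_detail.type.
    """
    if content_type == "text/plain":
        # Plain text was not sanitized, so escape <, >, & and quotes here.
        return html.escape(value)
    # Other content types have already been sanitized by the parser.
    return value


print(render_safe("1 < 2 & <b>bold</b>", "text/plain"))
```

Escaping at render time keeps the original plain-text value intact, which matches the stated goal of avoiding data loss.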
HTML Sanitization
The following HTML elements are allowed by default (all others are stripped):
|
|
|
The following HTML attributes are allowed by default (all others are stripped):
|
|
|
SVG Sanitization
The following SVG elements are allowed by default (all others are stripped):
|
|
|
The following SVG attributes are allowed by default (all others are stripped):
|
|
|
MathML Sanitization
The following MathML elements are allowed by default (all others are stripped):
|
|
|
The following MathML attributes are allowed by default (all others are stripped):
|
|
|
CSS Sanitization
The following CSS properties are allowed by default in style attributes (all others are stripped):
|
|
|
Note
Not all possible CSS values are allowed for these properties. The allowable values are restricted by a whitelist and a regular expression that allows color values and lengths. URIs are not allowed, to prevent platypus attacks. See the _HTMLSanitizer class for more details.
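To make the idea concrete, here is a rough sketch of whitelist-based filtering of a style attribute. It is not the code in _HTMLSanitizer: the property list, the keyword list, and the regular expression below are all assumptions chosen for illustration, and the real lists are longer.

```python
import re

# Hypothetical whitelists for illustration only.
ALLOWED_PROPERTIES = {"color", "font-size", "margin", "text-align"}
ALLOWED_KEYWORDS = {"auto", "bold", "center", "italic", "left", "none", "right"}

# Accept hex colors (#abc or #aabbcc) and simple lengths such as 12px or 1.5em.
VALUE_TOKEN = re.compile(
    r"^(#[0-9a-fA-F]{3}(?:[0-9a-fA-F]{3})?|-?\d*\.?\d+(?:px|em|pt|cm|in|%)?)$"
)


def sanitize_style(style):
    """Keep only whitelisted properties whose values pass the token check."""
    kept = []
    for declaration in style.split(";"):
        if ":" not in declaration:
            continue
        prop, value = declaration.split(":", 1)
        prop = prop.strip().lower()
        if prop not in ALLOWED_PROPERTIES:
            continue  # unknown property: drop the whole declaration
        tokens = value.strip().split()
        if tokens and all(
            t.lower() in ALLOWED_KEYWORDS or VALUE_TOKEN.match(t) for t in tokens
        ):
            kept.append(f"{prop}: {' '.join(tokens)};")
    return " ".join(kept)


print(sanitize_style("color: #ff0000; background: url(javascript:x); font-size: 12px"))
```

Because a value must match the pattern in full, anything containing url(...) or expression(...) is rejected without the filter having to recognize those constructs by name.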
Whitelist, Don’t Blacklist
I am often asked why Universal Feed Parser is so hard-assed about HTML and CSS sanitizing. To illustrate the problem, here is an incomplete list of potentially dangerous HTML tags and attributes:
- script, which can contain malicious script
- applet, embed, and object, which can automatically download and execute malicious code
- meta, which can contain malicious redirects
- onload, onunload, and all other on* attributes, which can contain malicious script
- style, link, and the style attribute, which can contain malicious script
style? Yes, style. CSS definitions can contain executable code.
Embedding Javascript in CSS
This sample is taken from http://feedparser.org/docs/examples/rss20.xml:
<description>Watch out for
<span style="background: url(javascript:window.location='http://example.org/')">
nasty tricks</span></description>
This sample is more advanced, and does not contain the keyword javascript: that many naive HTML sanitizers scan for:
<description>Watch out for
<span style="any: expression(window.location='http://example.org/')">
nasty tricks</span></description>
Internet Explorer for Windows will execute the Javascript in both of these examples.
Now consider that in HTML, attribute values may be entity-encoded in several different ways.
Embedding encoded Javascript in CSS
To a browser, this:
<span style="any: expression(window.location='http://example.org/')">
is the same as this (without the line breaks):
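<span style="any: &#101;&#120;&#112;&#114;&#101;
&#115;&#115;&#105;&#111;&#110;&#40;&#119;&#105;&#110;&#100;&#111;&#119;
&#46;&#108;&#111;&#99;&#97;&#116;&#105;&#111;&#110;&#61;&#39;&#104;
&#116;&#116;&#112;&#58;&#47;&#47;&#101;&#120;&#97;&#109;&#112;&#108;
&#101;&#46;&#111;&#114;&#103;&#47;&#39;&#41;">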
which is the same as this (without the line breaks):
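<span style="any: &#x65;&#x78;&#x70;&#x72;
&#x65;&#x73;&#x73;&#x69;&#x6f;&#x6e;&#x28;&#x77;&#x69;&#x6e;
&#x64;&#x6f;&#x77;&#x2e;&#x6c;&#x6f;&#x63;&#x61;&#x74;&#x69;
&#x6f;&#x6e;&#x3d;&#x27;&#x68;&#x74;&#x74;&#x70;&#x3a;&#x2f;
&#x2f;&#x65;&#x78;&#x61;&#x6d;&#x70;&#x6c;&#x65;&#x2e;&#x6f;
&#x72;&#x67;&#x2f;&#x27;&#x29;">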
And so on, plus several other variations, plus every combination of every variation.
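The equivalence between encoded and unencoded attribute values can be checked with the standard library. The sketch below encodes every character of the expression as a decimal and then as a hexadecimal character reference; these are just two of the possible variations, and html.unescape is used here as a stand-in for the decoding a browser performs.

```python
import html

plain = "expression(window.location='http://example.org/')"

# Encode every character as a decimal, then as a hexadecimal, character reference.
decimal = "".join(f"&#{ord(c)};" for c in plain)
hexadecimal = "".join(f"&#x{ord(c):x};" for c in plain)

# A naive scan of the raw attribute text finds nothing suspicious...
assert "expression" not in decimal
assert "expression" not in hexadecimal

# ...but once decoded, both forms are identical to the plain text.
assert html.unescape(decimal) == plain
assert html.unescape(hexadecimal) == plain
```

Mixing the two encodings character by character gives further variations, so the number of strings a blacklist would have to recognize grows very quickly.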
The more I investigate, the more cases I find where Internet Explorer for Windows will treat seemingly innocuous markup as code and blithely execute it. This is why Universal Feed Parser uses a whitelist and not a blacklist. I am reasonably confident that none of the elements or attributes on the whitelist are security risks. I am not at all confident about elements or attributes that I have not explicitly investigated. And I have no confidence at all in my ability to detect strings within attribute values that Internet Explorer for Windows will treat as executable code.
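As a rough sketch of the whitelist approach (not the library's own implementation), the following uses Python's html.parser to keep only explicitly allowed elements and attributes. The two sets are hypothetical and deliberately tiny. URL-valued attributes such as href are left out because they would need a separate scheme check.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, deliberately tiny whitelists for illustration only.
ALLOWED_ELEMENTS = {"b", "em", "i", "p", "span", "strong"}
ALLOWED_ATTRIBUTES = {"title"}


class WhitelistSanitizer(HTMLParser):
    """Rebuild markup from whitelisted tags and attributes; drop the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_ELEMENTS:
            return  # unknown element: drop the tag itself
        kept = "".join(
            f' {name}="{escape(value or "")}"'
            for name, value in attrs
            if name in ALLOWED_ATTRIBUTES
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_ELEMENTS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is kept but re-escaped, so it can never be read as markup.
        self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(sanitize('<span style="any: expression(x)" title="t">nasty</span>'))
```

The sanitizer never inspects the style value at all. The attribute is dropped simply because it is not on the list, so no entity-encoding trick inside it matters. Note that this sketch keeps the text inside a stripped element (for example, the body of a script element) as escaped text; a fuller version would discard it.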
See also
- How to consume RSS safely: explains the platypus attack.