.. _advanced.sanitization: Sanitization ============ Most feeds embed :abbr:`HTML (HyperText Markup Language)` markup within feed elements. Some feeds even embed other types of markup, such as :abbr:`SVG (Scalable Vector Graphics)` or :abbr:`MathML (Mathematical Markup Language)`. Since many feed aggregators use a web browser (or browser component) to display content, :program:`Universal Feed Parser` sanitizes embedded markup to remove things that could pose security risks. These elements are sanitized by default: * :ref:`reference.entry.content` * :ref:`reference.entry.summary` * :ref:`reference.entry.title` * :ref:`reference.feed.info` * :ref:`reference.feed.rights` * :ref:`reference.feed.subtitle` * :ref:`reference.feed.title` .. note:: If the content is declared to be (or is determined to be) :mimetype:`text/plain`, it will not be sanitized. This is to avoid data loss. It is recommended that you check the content type in e.g. :py:attr:`entries[i].summary_detail.type`. If it is :mimetype:`text/plain` then it has not been sanitized (and you should perform HTML escaping before rendering the content). .. _advanced.sanitization.html: :abbr:`HTML (HyperText Markup Language)` Sanitization ----------------------------------------------------- The following :abbr:`HTML (HyperText Markup Language)` elements are allowed by default (all others are stripped): .. hlist:: :columns: 3 * a * abbr * acronym * address * area * article * aside * audio * b * big * blockquote * br * button * canvas * caption * center * cite * code * col * colgroup * command * datagrid * datalist * dd * del * details * dfn * dialog * dir * div * dl * dt * em * event-source * fieldset * figure * font * footer * form * h1 * h2 * h3 * h4 * h5 * h6 * header * hr * i * img * input * ins * kbd * keygen * label * legend * li * m * map * menu * meter * multicol * nav * nextid * noscript * ol * optgroup * option * output * p * pre * progress * q * s * samp * section * select * small * sound * source * spacer * span * strike * strong * sub * sup * table * tbody * td * textarea * tfoot * th * thead * time * tr * tt * u * ul * var * video The following :abbr:`HTML (HyperText Markup Language)` attributes are allowed by default (all others are stripped): .. hlist:: :columns: 3 * abbr * accept * accept-charset * accesskey * action * align * alt * autocomplete * autofocus * autoplay * axis * background * balance * bgcolor * bgproperties * border * bordercolor * bordercolordark * bordercolorlight * bottompadding * cellpadding * cellspacing * ch * challenge * char * charoff * charset * checked * choff * cite * class * clear * color * cols * colspan * compact * contenteditable * coords * data * datafld * datapagesize * datasrc * datetime * default * delay * dir * disabled * draggable * dynsrc * enctype * end * face * for * form * frame * galleryimg * gutter * headers * height * hidden * hidefocus * high * href * hreflang * hspace * icon * id * inputmode * ismap * keytype * label * lang * leftspacing * list * longdesc * loop * loopcount * loopend * loopstart * low * lowsrc * max * maxlength * media * method * min * multiple * name * nohref * noshade * nowrap * open * optimum * pattern * ping * point-size * poster * pqg * preload * prompt * radiogroup * readonly * rel * repeat-max * repeat-min * replace * required * rev * rightspacing * rows * rowspan * rules * scope * selected * shape * size * span * src * start * step * summary * suppress * tabindex * target * template * title * toppadding * type * unselectable * urn * usemap * valign * value * variable * volume * vrml * vspace * width * wrap * xml:lang .. _advanced.sanitization.svg: :abbr:`SVG (Scalable Vector Graphics)` Sanitization --------------------------------------------------- The following SVG elements are allowed by default (all others are stripped): .. hlist:: :columns: 3 * a * animate * animateColor * animateMotion * animateTransform * circle * defs * desc * ellipse * font-face * font-face-name * font-face-src * foreignObject * g * glyph * hkern * line * linearGradient * marker * metadata * missing-glyph * mpath * path * polygon * polyline * radialGradient * rect * set * stop * svg * switch * text * title * tspan * use The following :abbr:`SVG (Scalable Vector Graphics)` attributes are allowed by default (all others are stripped): .. hlist:: :columns: 3 * accent-height * accumulate * additive * alphabetic * arabic-form * ascent * attributeName * attributeType * baseProfile * bbox * begin * by * calcMode * cap-height * class * color * color-rendering * content * cx * cy * d * descent * display * dur * dx * dy * end * fill * fill-opacity * fill-rule * font-family * font-size * font-stretch * font-style * font-variant * font-weight * from * fx * fy * g1 * g2 * glyph-name * gradientUnits * hanging * height * horiz-adv-x * horiz-origin-x * id * ideographic * k * keyPoints * keySplines * keyTimes * lang * marker-end * marker-mid * marker-start * markerHeight * markerUnits * markerWidth * mathematical * max * min * name * offset * opacity * orient * origin * overline-position * overline-thickness * panose-1 * path * pathLength * points * preserveAspectRatio * r * refX * refY * repeatCount * repeatDur * requiredExtensions * requiredFeatures * restart * rotate * rx * ry * slope * stemh * stemv * stop-color * stop-opacity * strikethrough-position * strikethrough-thickness * stroke * stroke-dasharray * stroke-dashoffset * stroke-linecap * stroke-linejoin * stroke-miterlimit * stroke-opacity * stroke-width * systemLanguage * target * text-anchor * to * transform * type * u1 * u2 * underline-position * underline-thickness * unicode * unicode-range * units-per-em * values * version * viewBox * visibility * width * widths * x * x-height * x1 * x2 * xlink:actuate * xlink:arcrole * xlink:href * xlink:role * xlink:show * xlink:title * xlink:type * xml:base * xml:lang * xml:space * xmlns * xmlns:xlink * y * y1 * y2 * zoomAndPan .. _advanced.sanitization.mathml: :abbr:`MathML (Mathematical Markup Language)` Sanitization ---------------------------------------------------------- The following :abbr:`MathML (Mathematical Markup Language)` elements are allowed by default (all others are stripped): .. hlist:: :columns: 3 * annotation * annotation-xml * maction * maligngroup * malignmark * math * menclose * merror * mfenced * mfrac * mglyph * mi * mlabeledtr * mlongdiv * mmultiscripts * mn * mo * mover * mpadded * mphantom * mprescripts * mroot * mrow * ms * mscarries * mscarry * msgroup * msline * mspace * msqrt * msrow * mstack * mstyle * msub * msubsup * msup * mtable * mtd * mtext * mtr * munder * munderover * none * semantics The following :abbr:`MathML (Mathematical Markup Language)` attributes are allowed by default (all others are stripped): .. hlist:: :columns: 3 * accent * accentunder * actiontype * align * alignmentscope * altimg * altimg-height * altimg-valign * altimg-width * alttext * bevelled * charalign * close * columnalign * columnlines * columnspacing * columnspan * columnwidth * crossout * decimalpoint * denomalign * depth * dir * display * displaystyle * edge * encoding * equalcolumns * equalrows * fence * fontstyle * fontweight * form * frame * framespacing * groupalign * height * href * id * indentalign * indentalignfirst * indentalignlast * indentshift * indentshiftfirst * indentshiftlast * indenttarget * infixlinebreakstyle * largeop * length * linebreak * linebreakmultchar * linebreakstyle * lineleading * linethickness * location * longdivstyle * lquote * lspace * mathbackground * mathcolor * mathsize * mathvariant * maxsize * minlabelspacing * minsize * movablelimits * notation * numalign * open * other * overflow * position * rowalign * rowlines * rowspacing * rowspan * rquote * rspace * scriptlevel * scriptminsize * scriptsizemultiplier * selection * separator * separators * shift * side * src * stackalign * stretchy * subscriptshift * superscriptshift * symmetric * voffset * width * xlink:href * xlink:show * xlink:type * xmlns * xmlns:xlink .. _advanced.sanitization.css: :abbr:`CSS (Cascading Style Sheets)` Sanitization ------------------------------------------------- The following :abbr:`CSS (Cascading Style Sheets)` properties are allowed by default in style attributes (all others are stripped): .. hlist:: :columns: 3 * azimuth * background-color * border-bottom-color * border-collapse * border-color * border-left-color * border-right-color * border-top-color * clear * color * cursor * direction * display * elevation * float * font * font-family * font-size * font-style * font-variant * font-weight * height * letter-spacing * line-height * overflow * pause * pause-after * pause-before * pitch * pitch-range * richness * speak * speak-header * speak-numeral * speak-punctuation * speech-rate * stress * text-align * text-decoration * text-indent * unicode-bidi * vertical-align * voice-family * volume * white-space * width .. note:: Not all possible CSS values are allowed for these properties. The allowable values are restricted by a whitelist and a regular expression that allows color values and lengths. :abbr:`URI (Uniform Resource Identifier)`\s are not allowed, to prevent `platypus attacks `_. See the _HTMLSanitizer class for more details. Whitelist, Don't Blacklist -------------------------- I am often asked why :program:`Universal Feed Parser` is so hard-assed about :abbr:`HTML (HyperText Markup Language)` and :abbr:`CSS (Cascading Style Sheets)` sanitizing. To illustrate the problem, here is an incomplete list of potentially dangerous :abbr:`HTML (HyperText Markup Language)` tags and attributes: * script, which can contain malicious script * applet, embed, and object, which can automatically download and execute malicious code * meta, which can contain malicious redirects * onload, onunload, and all other on* attributes, which can contain malicious script * style, link, and the style attribute, which can contain malicious script *style?* Yes, style. :abbr:`CSS (Cascading Style Sheets)` definitions can contain executable code. Embedding Javascript in :abbr:`CSS (Cascading Style Sheets)` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This sample is taken from `http://feedparser.org/docs/examples/rss20.xml `_: .. sourcecode:: html Watch out for <span style="background: url(javascript:window.location='http://example.org/')"> nasty tricks</span> This sample is more advanced, and does not contain the keyword javascript: that many naive :abbr:`HTML (HyperText Markup Language)` sanitizers scan for: .. sourcecode:: html Watch out for <span style="any: expression(window.location='http://example.org/')"> nasty tricks</span> Internet Explorer for Windows will execute the Javascript in both of these examples. Now consider that in :abbr:`HTML (HyperText Markup Language)`, attribute values may be entity-encoded in several different ways. Embedding encoded Javascript in :abbr:`CSS (Cascading Style Sheets)` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To a browser, this: .. sourcecode:: html is the same as this (without the line breaks): .. sourcecode:: html which is the same as this (without the line breaks): .. sourcecode:: html And so on, plus several other variations, plus every combination of every variation. The more I investigate, the more cases I find where Internet Explorer for Windows will treat seemingly innocuous markup as code and blithely execute it. This is why :program:`Universal Feed Parser` uses a whitelist and not a blacklist. I am reasonably confident that none of the elements or attributes on the whitelist are security risks. I am not at all confident about elements or attributes that I have not explicitly investigated. And I have no confidence at all in my ability to detect strings within attribute values that Internet Explorer for Windows will treat as executable code. .. seealso:: `How to consume RSS safely `_ Explains the platypus attack.