Sanitization¶

Most feeds embed HTML markup within feed elements. Some feeds even embed other types of markup, such as SVG or MathML. Since many feed aggregators use a web browser (or browser component) to display content, Universal Feed Parser sanitizes embedded markup to remove things that could pose security risks.

These elements are sanitized by default:

Note

If the content is declared to be (or is determined to be) text/plain, it will not be sanitized. This is to avoid data loss. It is recommended that you check the content type in e.g. entries[i].summary_detail.type. If it is text/plain then it has not been sanitized (and you should perform HTML escaping before rendering the content).

HTML Sanitization¶

The following HTML elements are allowed by default (all others are stripped):

a
abbr
acronym
address
area
article
aside
audio
b
big
blockquote
br
button
canvas
caption
center
cite
code
col
colgroup
command
datagrid
datalist
dd
del
details
dfn
dialog
dir
div
dl
dt
em

event-source
fieldset
figure
font
footer
form
h1
h2
h3
h4
h5
h6
header
hr
i
img
input
ins
kbd
keygen
label
legend
li
m
map
menu
meter
multicol
nav
nextid
noscript
ol
optgroup

option
output
p
pre
progress
q
s
samp
section
select
small
sound
source
spacer
span
strike
strong
sub
sup
table
tbody
td
textarea
tfoot
th
thead
time
tr
tt
u
ul
var
video

The following HTML attributes are allowed by default (all others are stripped):

abbr
accept
accept-charset
accesskey
action
align
alt
autocomplete
autofocus
autoplay
axis
background
balance
bgcolor
bgproperties
border
bordercolor
bordercolordark
bordercolorlight
bottompadding
cellpadding
cellspacing
ch
challenge
char
charoff
charset
checked
choff
cite
class
clear
color
cols
colspan
compact
contenteditable
coords
data
datafld
datapagesize
datasrc
datetime
default
delay
dir
disabled

draggable
dynsrc
enctype
end
face
for
form
frame
galleryimg
gutter
headers
height
hidden
hidefocus
high
href
hreflang
hspace
icon
id
inputmode
ismap
keytype
label
lang
leftspacing
list
longdesc
loop
loopcount
loopend
loopstart
low
lowsrc
max
maxlength
media
method
min
multiple
name
nohref
noshade
nowrap
open
optimum
pattern

ping
point-size
poster
pqg
preload
prompt
radiogroup
readonly
rel
repeat-max
repeat-min
replace
required
rev
rightspacing
rows
rowspan
rules
scope
selected
shape
size
span
src
start
step
summary
suppress
tabindex
target
template
title
toppadding
type
unselectable
urn
usemap
valign
value
variable
volume
vrml
vspace
width
wrap
xml:lang

SVG Sanitization¶

The following SVG elements are allowed by default (all others are stripped):

a
animate
animateColor
animateMotion
animateTransform
circle
defs
desc
ellipse
font-face
font-face-name
font-face-src

foreignObject
g
glyph
hkern
line
linearGradient
marker
metadata
missing-glyph
mpath
path
polygon

polyline
radialGradient
rect
set
stop
svg
switch
text
title
tspan
use

The following SVG attributes are allowed by default (all others are stripped):

accent-height
accumulate
additive
alphabetic
arabic-form
ascent
attributeName
attributeType
baseProfile
bbox
begin
by
calcMode
cap-height
class
color
color-rendering
content
cx
cy
d
descent
display
dur
dx
dy
end
fill
fill-opacity
fill-rule
font-family
font-size
font-stretch
font-style
font-variant
font-weight
from
fx
fy
g1
g2
glyph-name
gradientUnits
hanging
height
horiz-adv-x
horiz-origin-x

id
ideographic
k
keyPoints
keySplines
keyTimes
lang
marker-end
marker-mid
marker-start
markerHeight
markerUnits
markerWidth
mathematical
max
min
name
offset
opacity
orient
origin
overline-position
overline-thickness
panose-1
path
pathLength
points
preserveAspectRatio
r
refX
refY
repeatCount
repeatDur
requiredExtensions
requiredFeatures
restart
rotate
rx
ry
slope
stemh
stemv
stop-color
stop-opacity
strikethrough-position
strikethrough-thickness
stroke

stroke-dasharray
stroke-dashoffset
stroke-linecap
stroke-linejoin
stroke-miterlimit
stroke-opacity
stroke-width
systemLanguage
target
text-anchor
to
transform
type
u1
u2
underline-position
underline-thickness
unicode
unicode-range
units-per-em
values
version
viewBox
visibility
width
widths
x
x-height
x1
x2
xlink:actuate
xlink:arcrole
xlink:href
xlink:role
xlink:show
xlink:title
xlink:type
xml:base
xml:lang
xml:space
xmlns
xmlns:xlink
y
y1
y2
zoomAndPan

MathML Sanitization¶

The following MathML elements are allowed by default (all others are stripped):

annotation
annotation-xml
maction
maligngroup
malignmark
math
menclose
merror
mfenced
mfrac
mglyph
mi
mlabeledtr
mlongdiv
mmultiscripts

mn
mo
mover
mpadded
mphantom
mprescripts
mroot
mrow
ms
mscarries
mscarry
msgroup
msline
mspace
msqrt

msrow
mstack
mstyle
msub
msubsup
msup
mtable
mtd
mtext
mtr
munder
munderover
none
semantics

The following MathML attributes are allowed by default (all others are stripped):

accent
accentunder
actiontype
align
alignmentscope
altimg
altimg-height
altimg-valign
altimg-width
alttext
bevelled
charalign
close
columnalign
columnlines
columnspacing
columnspan
columnwidth
crossout
decimalpoint
denomalign
depth
dir
display
displaystyle
edge
encoding
equalcolumns
equalrows
fence
fontstyle
fontweight
form

frame
framespacing
groupalign
height
href
id
indentalign
indentalignfirst
indentalignlast
indentshift
indentshiftfirst
indentshiftlast
indenttarget
infixlinebreakstyle
largeop
length
linebreak
linebreakmultchar
linebreakstyle
lineleading
linethickness
location
longdivstyle
lquote
lspace
mathbackground
mathcolor
mathsize
mathvariant
maxsize
minlabelspacing
minsize
movablelimits

notation
numalign
open
other
overflow
position
rowalign
rowlines
rowspacing
rowspan
rquote
rspace
scriptlevel
scriptminsize
scriptsizemultiplier
selection
separator
separators
shift
side
src
stackalign
stretchy
subscriptshift
superscriptshift
symmetric
voffset
width
xlink:href
xlink:show
xlink:type
xmlns
xmlns:xlink

CSS Sanitization¶

The following CSS properties are allowed by default in style attributes (all others are stripped):

azimuth
background-color
border-bottom-color
border-collapse
border-color
border-left-color
border-right-color
border-top-color
clear
color
cursor
direction
display
elevation
float
font

font-family
font-size
font-style
font-variant
font-weight
height
letter-spacing
line-height
overflow
pause
pause-after
pause-before
pitch
pitch-range
richness

speak
speak-header
speak-numeral
speak-punctuation
speech-rate
stress
text-align
text-decoration
text-indent
unicode-bidi
vertical-align
voice-family
volume
white-space
width

Note

Not all possible CSS values are allowed for these properties. The allowable values are restricted by a whitelist and a regular expression that allows color values and lengths. URIs are not allowed, to prevent platypus attacks. See the _HTMLSanitizer class for more details.

Whitelist, Don’t Blacklist¶

I am often asked why Universal Feed Parser is so hard-assed about HTML and CSS sanitizing. To illustrate the problem, here is an incomplete list of potentially dangerous HTML tags and attributes:

script, which can contain malicious script
applet, embed, and object, which can automatically download and execute malicious code
meta, which can contain malicious redirects
onload, onunload, and all other on* attributes, which can contain malicious script
style, link, and the style attribute, which can contain malicious script

style? Yes, style. CSS definitions can contain executable code.

Embedding Javascript in CSS¶

This sample is taken from http://feedparser.org/docs/examples/rss20.xml:

<description>Watch out for
&lt;span style="background: url(javascript:window.location='http://example.org/')"&gt;
nasty tricks&lt;/span&gt;</description>

This sample is more advanced, and does not contain the keyword javascript: that many naive HTML sanitizers scan for:

<description>Watch out for
&lt;span style="any: expression(window.location='http://example.org/')"&gt;
nasty tricks&lt;/span&gt;</description>

Internet Explorer for Windows will execute the Javascript in both of these examples.

Now consider that in HTML, attribute values may be entity-encoded in several different ways.

Embedding encoded Javascript in CSS¶

To a browser, this:

<span style="any: expression(window.location='http://example.org/')">

is the same as this (without the line breaks):

<span style="&#97;&#110;&#121;&#58;&#32;&#101;&#120;&#112;&#114;&#101;
&#115;&#115;&#105;&#111;&#110;&#40;&#119;&#105;&#110;&#100;&#111;&#119;
&#46;&#108;&#111;&#99;&#97;&#116;&#105;&#111;&#110;&#61;&#39;&#104;
&#116;&#116;&#112;&#58;&#47;&#47;&#101;&#120;&#97;&#109;&#112;&#108;
&#101;&#46;&#111;&#114;&#103;&#47;&#39;&#41;">

which is the same as this (without the line breaks):

<span style="&#x61;&#x6e;&#x79;&#x3a;&#x20;&#x65;&#x78;&#x70;&#x72;
&#x65;&#x73;&#x73;&#x69;&#x6f;&#x6e;&#x28;&#x77;&#x69;&#x6e;
&#x64;&#x6f;&#x77;&#x2e;&#x6c;&#x6f;&#x63;&#x61;&#x74;&#x69;
&#x6f;&#x6e;&#x3d;&#x27;&#x68;&#x74;&#x74;&#x70;&#x3a;&#x2f;
&#x2f;&#x65;&#x78;&#x61;&#x6d;&#x70;&#x6c;&#x65;&#x2e;&#x6f;
&#x72;&#x67;&#x2f;&#x27;&#x29;">

And so on, plus several other variations, plus every combination of every variation.

The more I investigate, the more cases I find where Internet Explorer for Windows will treat seemingly innocuous markup as code and blithely execute it. This is why Universal Feed Parser uses a whitelist and not a blacklist. I am reasonably confident that none of the elements or attributes on the whitelist are security risks. I am not at all confident about elements or attributes that I have not explicitly investigated. And I have no confidence at all in my ability to detect strings within attribute values that Internet Explorer for Windows will treat as executable code.

Sanitization¶

HTML Sanitization¶

SVG Sanitization¶

MathML Sanitization¶

CSS Sanitization¶

Whitelist, Don’t Blacklist¶

Embedding Javascript in CSS¶

Embedding encoded Javascript in CSS¶

Table Of Contents

Previous topic

Next topic

This Page