python-webencodings
This is a Python implementation of the WHATWG Encoding standard.
- Latest documentation: http://packages.python.org/webencodings/
- Source code and issue tracker: https://github.com/gsnedders/python-webencodings
- PyPI releases: http://pypi.python.org/pypi/webencodings
- License: BSD
- Python 2.6+ and 3.3+
In order to be compatible with legacy web content
when interpreting something like Content-Type: text/html; charset=latin1,
tools need to use a particular set of aliases for encoding labels
as well as some overriding rules.
For example, US-ASCII and iso-8859-1 on the web are actually
aliases for windows-1252, and a UTF-8 or UTF-16 BOM takes precedence
over any other encoding declaration.
The Encoding standard defines all such details so that implementations do
not have to reverse-engineer each other.
This module has encoding labels and BOM detection, but the actual implementation for encoders and decoders is Python’s.
Byte order marks
When decoding, for compatibility with deployed content, a byte order mark (also known as BOM) is considered more authoritative than anything else. The corresponding U+FEFF code point is not part of the decoded output.
Encoding never prepends a BOM, but the output can start with a BOM if the input starts with a U+FEFF code point. In that case encoding then decoding will not round-trip.
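The BOM rule can be sketched with the standard library alone. This is an illustration of the behaviour described above, not the module's own code: sniff_bom is a hypothetical helper, and the label strings are chosen because they are also valid Python codec names.

```python
import codecs

# Hypothetical helper, not part of webencodings: map a leading BOM to a label.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
)


def sniff_bom(data, fallback_label):
    """Return (label, payload). A BOM wins over fallback_label and is stripped."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label, data[len(bom):]
    return fallback_label, data


# The UTF-8 BOM overrides the declared fallback and is not part of the text.
label, payload = sniff_bom(b"\xef\xbb\xbfhi", "windows-1252")
text = payload.decode(label)

# Text that starts with U+FEFF encodes to bytes that start with a BOM,
# so decoding those bytes again drops that code point.
original = u"\ufeffhi"
encoded = original.encode("utf-8")
label2, payload2 = sniff_bom(encoded, "windows-1252")
round_tripped = payload2.decode(label2)
```

Here round_tripped is shorter than original by one code point, which is the round-trip caveat mentioned above.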
Error handling
As in the stdlib, error handling for encoding defaults to strict:
raise an exception if there is an error.
For decoding, however, the default is replace, unlike the stdlib.
Invalid bytes are decoded as � (U+FFFD, the replacement character).
The reason is that when showing legacy content to the user,
it might be better to succeed in decoding only part of it rather than blow up.
This is of course not the case in all situations:
sometimes you want stuff to blow up so you can detect errors early.
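A short sketch of the two defaults, assuming the webencodings package is installed. The sample byte strings are made up for illustration.

```python
from webencodings import decode, encode

bad = b"caf\xe9"  # not valid UTF-8: a lone lead byte at the end

# Decoding defaults to errors='replace': no exception, U+FFFD instead.
text, encoding = decode(bad, "utf-8")

# Passing errors='strict' makes decoding raise, as the stdlib does by default.
try:
    decode(bad, "utf-8", errors="strict")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# Encoding defaults to errors='strict'.
# U+4E2D has no representation in windows-1252 (which is what latin1 means here).
try:
    encode(u"\u4e2d", "latin1")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# Opting in to a lenient handler when encoding.
lenient = encode(u"\u4e2d", "latin1", errors="replace")
```

Which default is preferable depends on the caller: a display path may keep replace, while a validation step would likely pass errors='strict' explicitly.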
API
- webencodings.lookup(label)
  Look for an encoding by its label. This is the spec’s “get an encoding” algorithm. Supported labels are listed there.
  Parameters: label – A string.
  Returns: An Encoding object, or None for an unknown label.
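As an illustration of label lookup, assuming the webencodings package is installed (the labels below are just examples):

```python
from webencodings import lookup

# Several legacy labels resolve to the same canonical encoding.
names = {}
for label in ("US-ASCII", "iso-8859-1", "latin1", " utf8\t"):
    names[label] = lookup(label).name

# A label that the standard does not define yields None rather than an error.
unknown = lookup("latin-1")
```

Matching is ASCII case-insensitive and ignores surrounding ASCII whitespace, which is why " utf8\t" still resolves.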
- class webencodings.Encoding
  Represents a character encoding such as UTF-8, that can be used for decoding or encoding.
  - name
    Canonical name of the encoding.
  - codec_info
    The actual implementation of the encoding, a stdlib CodecInfo object. See codecs.register().
- webencodings.UTF8 = <Encoding utf-8>
  The UTF-8 encoding. Should be used for new content and formats.
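A small sketch of the two Encoding attributes and the UTF8 constant, assuming the webencodings package is installed. The calls on codec_info are the ordinary stdlib CodecInfo functions, which return a (result, length consumed) pair.

```python
from webencodings import UTF8, lookup

encoding = lookup("latin1")
canonical = encoding.name  # canonical name from the standard

# codec_info is a stdlib codecs.CodecInfo; byte 0x80 is the euro sign
# in windows-1252.
text, consumed = encoding.codec_info.decode(b"\x80")

# The module-level UTF8 object is an Encoding like any other.
utf8_bytes, _ = UTF8.codec_info.encode(u"\u20ac")
```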
- webencodings.decode(input, fallback_encoding, errors=u'replace')
  Decode a single string.
  Parameters:
  - input – A byte string.
  - fallback_encoding – An Encoding object or a label string. The encoding to use if input does not have a BOM.
  - errors – Type of error handling. See codecs.register().
  Raises: LookupError for an unknown encoding label.
  Returns: An (output, encoding) tuple of a Unicode string and an Encoding.
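A usage sketch for decode, assuming the webencodings package is installed. The byte strings and the label "no-such-label" are invented for the example.

```python
from webencodings import decode

# No BOM: the fallback label is used (latin1 means windows-1252 on the web).
text, encoding = decode(b"\xe9t\xe9", "latin1")

# UTF-8 BOM: it overrides the fallback and is not part of the output.
bom_text, bom_encoding = decode(b"\xef\xbb\xbf\xc3\xa9", "latin1")

# Unknown labels raise LookupError before any decoding happens.
try:
    decode(b"", "no-such-label")
    raised = False
except LookupError:
    raised = True
```

The returned Encoding tells the caller which encoding was actually used, which may differ from the fallback.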
- webencodings.encode(input, encoding=<Encoding utf-8>, errors=u'strict')
  Encode a single string.
  Parameters:
  - input – A Unicode string.
  - encoding – An Encoding object or a label string.
  - errors – Type of error handling. See codecs.register().
  Raises: LookupError for an unknown encoding label.
  Returns: A byte string.
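A usage sketch for encode, assuming the webencodings package is installed. It shows the default encoding, both accepted forms of the encoding argument, and that no BOM is prepended.

```python
from webencodings import encode, lookup

default_bytes = encode(u"\xe9")                 # UTF-8 by default
legacy_bytes = encode(u"\xe9", "latin1")        # a label string
same_bytes = encode(u"\xe9", lookup("latin1"))  # an Encoding object
utf16_bytes = encode(u"\xe9", "utf-16")         # utf-16 means UTF-16LE, no BOM
```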
- webencodings.iter_decode(input, fallback_encoding, errors=u'replace')
  “Pull”-based decoder.
  Parameters:
  - input – An iterable of byte strings. The input is first consumed just enough to determine the encoding based on the presence of a BOM, then consumed on demand when the return value is.
  - fallback_encoding – An Encoding object or a label string. The encoding to use if input does not have a BOM.
  - errors – Type of error handling. See codecs.register().
  Raises: LookupError for an unknown encoding label.
  Returns: An (output, encoding) tuple. output is an iterable of Unicode strings, encoding is the Encoding that is being used.
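A usage sketch for iter_decode, assuming the webencodings package is installed. The chunks generator is a made-up stand-in for a network or file reader.

```python
from webencodings import iter_decode


def chunks():
    # Stand-in for a reader that yields byte strings of arbitrary size.
    yield b"\xef\xbb\xbfh"  # UTF-8 BOM followed by one ASCII letter
    yield b"\xc3"           # first half of a two-byte character
    yield b"\xa9"           # second half


output, encoding = iter_decode(chunks(), "latin1")
detected = encoding.name   # already known when iter_decode returns
text = u"".join(output)    # remaining chunks are pulled while iterating
```

Characters split across chunk boundaries are reassembled, so callers do not need to align chunks with character boundaries.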
- webencodings.iter_encode(input, encoding=<Encoding utf-8>, errors=u'strict')
  “Pull”-based encoder.
  Parameters:
  - input – An iterable of Unicode strings.
  - encoding – An Encoding object or a label string.
  - errors – Type of error handling. See codecs.register().
  Raises: LookupError for an unknown encoding label.
  Returns: An iterable of byte strings.
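A usage sketch for iter_encode, assuming the webencodings package is installed. The list of pieces is invented for the example; any iterable of Unicode strings should work.

```python
from webencodings import iter_encode

pieces = [u"caf", u"\xe9", u""]

# Joining the byte strings gives the same result as encoding the joined text.
utf8_bytes = b"".join(iter_encode(pieces))
legacy_bytes = b"".join(iter_encode(pieces, "latin1"))
```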
- class webencodings.IncrementalDecoder(fallback_encoding, errors=u'replace')
  “Push”-based decoder.
  Parameters:
  - fallback_encoding – An Encoding object or a label string. The encoding to use if input does not have a BOM.
  - errors – Type of error handling. See codecs.register().
  Raises: LookupError for an unknown encoding label.
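A usage sketch for the push-based decoder, assuming the webencodings package is installed. It relies on the decode(input, final=False) method and the encoding attribute found in the library's source; neither is listed on this page, so treat them as assumptions to check against the installed version.

```python
from webencodings import IncrementalDecoder

decoder = IncrementalDecoder("latin1")
parts = []

# The UTF-16LE BOM (FF FE) is split across the first two chunks;
# the decoder buffers input until it can tell whether a BOM is present.
for chunk in (b"\xff", b"\xfeh\x00", b"i\x00"):
    parts.append(decoder.decode(chunk))

# Signal the end of input so that any buffered bytes are flushed.
parts.append(decoder.decode(b"", final=True))

text = u"".join(parts)
detected = decoder.encoding.name
```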
- class webencodings.IncrementalEncoder(encoding=<Encoding utf-8>, errors=u'strict')
  “Push”-based encoder.
  Parameters:
  - encoding – An Encoding object or a label string.
  - errors – Type of error handling. See codecs.register().
  Raises: LookupError for an unknown encoding label.
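A usage sketch for the push-based encoder, assuming the webencodings package is installed. The encode(input, final=False) method comes from the library's source and is not listed on this page, so it is an assumption to verify against the installed version.

```python
from webencodings import IncrementalEncoder

# On the web the utf-16 label means UTF-16LE, and no BOM is written.
encoder = IncrementalEncoder("utf-16")
data = encoder.encode(u"h") + encoder.encode(u"\xe9", final=True)
```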
- webencodings.ascii_lower(string)
  Transform (only) ASCII letters to lower case: A-Z is mapped to a-z.
  Parameters: string – A Unicode string.
  Returns: A new Unicode string.
  This is used for ASCII case-insensitive matching of encoding labels. The same matching is also used, among other things, for CSS keywords.
  This is different from the lower() method of Unicode strings, which also affects non-ASCII characters, sometimes mapping them into the ASCII range:
  >>> keyword = u'Bac\N{KELVIN SIGN}ground'
  >>> assert keyword.lower() == u'background'
  >>> assert ascii_lower(keyword) != keyword.lower()
  >>> assert ascii_lower(keyword) == u'bac\N{KELVIN SIGN}ground'