Functions to handle conversion of byte str and unicode strings.
Changed in version kitchen: 0.2a2 ; API kitchen.text 2.0.0 Added getwriter()
Changed in version kitchen: 0.2.2 ; API kitchen.text 2.1.0 Added exception_to_unicode(), exception_to_bytes(), EXCEPTION_CONVERTERS, and BYTE_EXCEPTION_CONVERTERS
Changed in version kitchen: 1.0.1 ; API kitchen.text 2.1.1 Deprecated BYTE_EXCEPTION_CONVERTERS as we’ve simplified exception_to_unicode() and exception_to_bytes() to make it unnecessary
Python2 has two string types, str and unicode. unicode represents an abstract sequence of text characters. It can hold any character that is present in the unicode standard. str can hold any byte of data. The operating system and python work together to display these bytes as characters in many cases but you should always keep in mind that the information is really a sequence of bytes, not a sequence of characters. In python2 these types are interchangeable a large amount of the time. They are one of the few pairs of types that automatically convert when used in equality:
>>> # string is converted to unicode and then compared
>>> "I am a string" == u"I am a string"
True
>>> # Other types, like int, don't have this special treatment
>>> 5 == "5"
False
However, this automatic conversion tends to lull people into a false sense of security. As long as you’re dealing with ASCII characters the automatic conversion will save you from seeing any differences. Once you start using characters that are not in ASCII, you will start getting UnicodeError and UnicodeWarning as the automatic conversions between the types fail:
>>> "I am an ñ" == u"I am an ñ"
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
Why do these conversions fail? The reason is that the python2 unicode type represents an abstract sequence of unicode text known as code points. str, on the other hand, really represents a sequence of bytes. Those bytes are converted by your operating system to appear as characters on your screen using a particular encoding (usually with a default defined by the operating system and customizable by the individual user.) Although ASCII characters are fairly standard in what bytes represent each character, the bytes outside of the ASCII range are not. In general, each encoding will map a different character to a particular byte. Newer encodings map individual characters to multiple bytes (which the older encodings will instead treat as multiple characters). In the face of these differences, python refuses to guess at an encoding and instead issues a warning or exception and refuses to convert.
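To make the point about encodings concrete, here is a minimal sketch that uses only the standard str.decode() and unicode.encode() methods (nothing from kitchen). It shows the same two bytes turning into different text depending on the encoding that is assumed, and one character turning into different bytes. The variable names are only illustrative.

```python
# The same two bytes, read under two different encodings.
raw = b'\xc3\xb1'

as_utf8 = raw.decode('utf-8')      # one code point: U+00F1 (n with tilde)
as_latin1 = raw.decode('latin-1')  # two code points: U+00C3 and U+00B1

# Going the other way, one character becomes different bytes per encoding.
utf8_bytes = u'\xf1'.encode('utf-8')      # two bytes
latin1_bytes = u'\xf1'.encode('latin-1')  # one byte

# ASCII has no meaning for the byte 0xc3, so a strict decode refuses to guess.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```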
So what is the best method of dealing with this weltering babble of incoherent encodings? The basic strategy is to explicitly turn everything into unicode when it first enters your program. Then, when you send it to output, you can transform the unicode back into bytes. Doing this allows you to control the encodings that are used and avoid getting tracebacks due to UnicodeError. Using the functions defined in this module, that looks something like this:
 1 | >>> from kitchen.text.converters import to_unicode, to_bytes
 2 | >>> name = raw_input('Enter your name: ')
 3 | Enter your name: Toshio くらとみ
 4 | >>> name
 5 | 'Toshio \xe3\x81\x8f\xe3\x82\x89\xe3\x81\xa8\xe3\x81\xbf'
 6 | >>> type(name)
 7 | <type 'str'>
 8 | >>> unicode_name = to_unicode(name)
 9 | >>> type(unicode_name)
10 | <type 'unicode'>
11 | >>> unicode_name
12 | u'Toshio \u304f\u3089\u3068\u307f'
13 | >>> # Do a lot of other things before needing to save/output again:
14 | >>> output = open('datafile', 'w')
15 | >>> output.write(to_bytes(u'Name: %s\n' % unicode_name))
A few notes:
Looking at line 6, you’ll notice that the input we took from the user was a byte str. In general, anytime we’re getting a value from outside of python (the filesystem, reading data from the network, interacting with an external command, reading values from the environment) we are interacting with something that will want to give us a byte str. Some python standard library modules and third party libraries will automatically attempt to convert a byte str to unicode strings for you. This is both a boon and a curse. If the library can guess correctly about the encoding that the data is in, it will return unicode objects to you without you having to convert. However, if it can’t guess correctly, you may end up with one of several problems:
On line 8, we convert from a byte str to a unicode string. to_unicode() does this for us. It has some error handling and sane defaults that make this a nicer function to use than calling str.decode() directly:
All three of these can be overridden using different keyword arguments to the function. See the to_unicode() documentation for more information.
On line 15 we push the data back out to a file. Two things you should note here:
The default strategy of decoding to unicode strings when you take data in and encoding to a byte str when you send the data back out works great for most problems but there are a few times when you shouldn’t:
In each of these instances, there is a reason to keep around the byte str version of a value. Here are a few hints to keep your sanity in these situations:
Keep your unicode and str values separate. Just like the pain caused when you have to use someone else’s library that returns both unicode and str you can cause yourself pain if you have functions that can return both types or variables that could hold either type of value.
Name your variables so that you can tell whether you’re storing byte str or unicode string. One of the first things you end up having to do when debugging is determine what type of string you have in a variable and what type of string you are expecting. Naming your variables consistently so that you can tell which type they are supposed to hold will save you from at least one of those steps.
When you get values initially, make sure that you’re dealing with the type of value that you expect as you save it. You can use isinstance() or to_bytes() since to_bytes() doesn’t do any modifications of the string if it’s already a str. When using to_bytes() for this purpose you might want to use:
try:
    b_input = to_bytes(input_should_be_bytes_already, errors='strict', nonstring='strict')
except:
    handle_errors_somehow()
The reason is that the default of to_bytes() will take characters that are illegal in the chosen encoding and transform them to replacement characters. Since the point of keeping this data as a byte str is to keep the exact same bytes when you send it outside of your code, changing things to replacement characters should be raising red flags that something is wrong. Setting errors to strict will raise an exception which gives you an opportunity to fail gracefully.
Sometimes you will want to print out the values that you have in your byte str. When you do this you will need to make sure that you transform unicode to str before combining them. Also be sure that any other function calls (including gettext) are going to give you strings that are the same type. For instance:
print to_bytes(_('Username: %(user)s'), 'utf-8') % {'user': b_username}
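The difference between replacing and failing that the hints above describe can be seen with the standard codec error handlers alone. This is a sketch using plain unicode.encode() rather than to_bytes(); the assumption is only that the replace and strict values of errors name the standard codec error handlers.

```python
u_name = u'caf\xe9'

# 'replace' silently substitutes a question mark for the unencodable character,
# so the bytes no longer match the original text.
lossy = u_name.encode('ascii', 'replace')

# 'strict' raises instead, which gives the caller a chance to fail gracefully.
try:
    u_name.encode('ascii', 'strict')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```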
Even when you have a good conceptual understanding of how python2 treats unicode and str there are still some things that can surprise you. In most cases this is because, as noted earlier, python or one of the python libraries you depend on is trying to convert a value automatically and failing. Explicit conversion at the appropriate place usually solves that.
One common idiom for getting a simple, string representation of an object is to use:
str(obj)
Unfortunately, this is not safe. Sometimes str(obj) will return unicode. Sometimes it will return a byte str. Sometimes, it will attempt to convert from a unicode string to a byte str, fail, and throw a UnicodeError. To be safe from all of these, first decide whether you need unicode or str to be returned. Then use to_unicode() or to_bytes() to get the simple representation like this:
u_representation = to_unicode(obj, nonstring='simplerepr')
b_representation = to_bytes(obj, nonstring='simplerepr')
python has a builtin print() statement that outputs strings to the terminal. This originated in a time when python only dealt with byte str. When unicode strings came about, some enhancements were made to the print() statement so that it could print those as well. The enhancements make print() work most of the time. However, the times when it doesn’t work tend to make for cryptic debugging.
The basic issue is that print() has to figure out what encoding to use when it prints a unicode string to the terminal. When python is attached to your terminal (i.e., you’re running the interpreter or running a script that prints to the screen) python is able to take the encoding value from your locale settings LC_ALL or LC_CTYPE and print the characters allowed by that encoding. On most modern Unix systems, the encoding is utf-8 which means that you can print any unicode character without problem.
There are two common cases of things going wrong:
Someone has a locale set that does not accept all valid unicode characters. For instance:
$ LC_ALL=C python
>>> print u'\ufffd'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
This often happens when a script that you’ve written and debugged from the terminal is run from an automated environment like cron. It also occurs when you have written a script using a utf-8 aware locale and released it for consumption by people all over the internet. Inevitably, someone is running with a locale that can’t handle all unicode characters and you get a traceback reported.
You redirect output to a file. Python isn’t using the values in LC_ALL unconditionally to decide what encoding to use. Instead it is using the encoding set for the terminal you are printing to, which is set to accept different encodings by LC_ALL. If you redirect to a file, you are no longer printing to the terminal so LC_ALL won’t have any effect. At this point, python will decide it can’t find an encoding and fall back to ASCII, which will likely lead to UnicodeError being raised. You can see this in a short script:
#! /usr/bin/python -tt
print u'\ufffd'
And then look at the difference between running it normally and redirecting to a file:
$ ./test.py
�
$ ./test.py > t
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    print u'\ufffd'
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
The short answer to dealing with this is to always use bytes when writing output. You can do this by explicitly converting to bytes like this:
from kitchen.text.converters import to_bytes
u_string = u'\ufffd'
print to_bytes(u_string)
or you can wrap stdout and stderr with a StreamWriter. A StreamWriter is convenient in that you can assign it to sys.stdout or sys.stderr and then have output converted automatically. Its drawback is that it can still throw UnicodeError if the writer can’t encode all possible unicode codepoints. Kitchen provides an alternate version, retrieved with kitchen.text.converters.getwriter(), which will not traceback in its standard configuration.
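As a rough illustration of the StreamWriter behaviour described above, the following sketch uses the standard library's codecs.getwriter() (not kitchen's version) and an in-memory io.BytesIO that stands in for sys.stdout. The variable names are made up for the example.

```python
import codecs
import io

# A utf-8 StreamWriter wrapped around an in-memory byte stream.
buf = io.BytesIO()
utf8_writer = codecs.getwriter('utf-8')(buf)
utf8_writer.write(u'caf\xe9\n')
utf8_output = buf.getvalue()

# With the default 'strict' error handler an ASCII writer still raises.
strict_writer = codecs.getwriter('ascii')(io.BytesIO())
try:
    strict_writer.write(u'caf\xe9\n')
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# Passing errors='replace' trades the exception for a '?' placeholder.
buf2 = io.BytesIO()
lossy_writer = codecs.getwriter('ascii')(buf2, errors='replace')
lossy_writer.write(u'caf\xe9\n')
lossy_output = buf2.getvalue()
```

The strict case is the drawback mentioned above: wrapping the stream does not by itself remove the possibility of a UnicodeError.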
The hash() of the ASCII characters is the same for unicode and byte str. When you use them in dict keys, they evaluate to the same dictionary slot:
>>> u_string = u'a'
>>> b_string = 'a'
>>> hash(u_string), hash(b_string)
(12416037344, 12416037344)
>>> d = {}
>>> d[u_string] = 'unicode'
>>> d[b_string] = 'bytes'
>>> d
{u'a': 'bytes'}
When you deal with key values outside of ASCII, unicode and byte str evaluate unequally no matter what their character content or hash value:
>>> u_string = u'ñ'
>>> b_string = u_string.encode('utf-8')
>>> print u_string
ñ
>>> print b_string
ñ
>>> d = {}
>>> d[u_string] = 'unicode'
>>> d[b_string] = 'bytes'
>>> d
{u'\xf1': 'unicode', '\xc3\xb1': 'bytes'}
>>> b_string2 = '\xf1'
>>> hash(u_string), hash(b_string2)
(30848092528, 30848092528)
>>> d = {}
>>> d[u_string] = 'unicode'
>>> d[b_string2] = 'bytes'
>>> d
{u'\xf1': 'unicode', '\xf1': 'bytes'}
How do you work with this one? Remember rule #1: Keep your unicode and byte str values separate. That goes for keys in a dictionary just like anything else.
For any given dictionary, make sure that all your keys are either unicode or str. Do not mix the two. If you’re being given both unicode and str but you don’t need to preserve separate keys for each, I recommend using to_unicode() or to_bytes() to convert all keys to one type or the other like this:
>>> from kitchen.text.converters import to_unicode
>>> u_string = u'one'
>>> b_string = 'two'
>>> d = {}
>>> d[to_unicode(u_string)] = 1
>>> d[to_unicode(b_string)] = 2
>>> d
{u'two': 2, u'one': 1}
These issues also apply to using dicts with tuple keys that contain a mixture of unicode and str. Once again the best fix is to standardise on either str or unicode.
If you absolutely need to store values in a dictionary where the keys could be either unicode or str you can use StrictDict which has separate entries for all unicode and byte str and deals correctly with any tuple containing mixed unicode and byte str.
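When a whole mapping arrives with mixed key types, the same normalization can be applied in one pass. The helper below is a hypothetical sketch built on str.decode() only; normalize_keys is not part of kitchen, and the choice to raise KeyError on a collision is an assumption made for the example.

```python
def normalize_keys(mapping, encoding='utf-8'):
    """Return a copy of mapping whose byte str keys are decoded to text."""
    result = {}
    for key, value in mapping.items():
        if isinstance(key, bytes):
            key = key.decode(encoding)
        if key in result:
            # Two source keys collapsed into one; refuse to drop data silently.
            raise KeyError('duplicate key after normalization: %r' % (key,))
        result[key] = value
    return result


mixed = {b'two': 2, u'one': 1}
clean = normalize_keys(mixed)
non_ascii = normalize_keys({b'\xc3\xb1': 1})
```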
Convert an object into a unicode string
Returns: unicode string or the original object depending on the value of nonstring.
Usually this should be used on a byte str but it can take both byte str and unicode strings intelligently. Nonstring objects are handled in different ways depending on the setting of the nonstring parameter.
The default values of this function are set so as to always return a unicode string and never raise an error when converting from a byte str to a unicode string. However, when you do not pass validly encoded text (or a nonstring object), you may end up with output that you don’t expect. Be sure you understand the requirements of your data rather than just ignoring errors by passing it through this function.
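The kind of unexpected output meant here can be reproduced with the standard library alone. The sketch below decodes latin-1 bytes as utf-8 with the standard replace error handler; that this mirrors the defaults of the function documented here is an assumption, not something the example verifies.

```python
bad_utf8 = b'caf\xe9'   # latin-1 bytes for "cafe" with an acute accent; not valid utf-8

# No exception is raised, but the last character is now U+FFFD, not U+00E9.
replaced = bad_utf8.decode('utf-8', 'replace')

# A strict decode surfaces the problem instead of hiding it.
try:
    bad_utf8.decode('utf-8', 'strict')
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

# Decoding with the encoding the data was really written in recovers the text.
correct = bad_utf8.decode('latin-1')
```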
Changed in version 0.2.1a2: Deprecated non_string in favor of nonstring parameter and changed default value to simplerepr
Convert an object into a byte str
Returns: byte str or the original object depending on the value of nonstring.
Warning
If you pass a byte str into this function the byte str is returned unmodified. It is not re-encoded with the specified encoding. The easiest way to achieve that is:
to_bytes(to_unicode(text), encoding='utf-8')
The initial to_unicode() call will ensure text is a unicode string. Then, to_bytes() will turn that into a byte str with the specified encoding.
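The same decode-then-encode pattern can be sketched with the built-in methods, which makes the two steps explicit. Here the input is assumed to be latin-1; in real code the source encoding has to be known from somewhere.

```python
latin1_bytes = b'caf\xe9'

# Step 1: bytes -> text, using the encoding the data is actually in.
text = latin1_bytes.decode('latin-1')

# Step 2: text -> bytes, using the encoding you want to emit.
utf8_bytes = text.encode('utf-8')
```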
Usually, this should be used on a unicode string but it can take either a byte str or a unicode string intelligently. Nonstring objects are handled in different ways depending on the setting of the nonstring parameter.
The default values of this function are set so as to always return a byte str and never raise an error when converting from unicode to bytes. However, when you do not pass an encoding that can validly encode the object (or a non-string object), you may end up with output that you don’t expect. Be sure you understand the requirements of your data rather than just ignoring errors by passing it through this function.
Changed in version 0.2.1a2: Deprecated non_string in favor of nonstring parameter and changed default value to simplerepr
Return a codecs.StreamWriter that resists tracing back.
Parameters: encoding – Encoding to use for transforming unicode strings into byte str.
Return type: codecs.StreamWriter
Returns: StreamWriter that you can instantiate to wrap output streams to automatically translate unicode strings into encoding.
This is a reimplementation of codecs.getwriter() that returns a StreamWriter that resists issuing tracebacks. The StreamWriter that is returned uses kitchen.text.converters.to_bytes() to convert unicode strings into byte str. The departures from codecs.getwriter() are:
Example usage:
$ LC_ALL=C python
>>> import sys
>>> from kitchen.text.converters import getwriter
>>> UTF8Writer = getwriter('utf-8')
>>> unwrapped_stdout = sys.stdout
>>> sys.stdout = UTF8Writer(unwrapped_stdout)
>>> print 'caf\xc3\xa9'
café
>>> print u'caf\xe9'
café
>>> ASCIIWriter = getwriter('ascii')
>>> sys.stdout = ASCIIWriter(unwrapped_stdout)
>>> print 'caf\xc3\xa9'
café
>>> print u'caf\xe9'
caf?
See also
API docs for codecs.StreamWriter and codecs.getwriter() and Print Fails on the python wiki.
New in version kitchen: 0.2a2, API: kitchen.text 1.1.0
Deprecated
This function converts something to a byte str if it isn’t one. It’s used to call str() or unicode() on the object to get its simple representation without danger of getting a UnicodeError. You should be using to_unicode() or to_bytes() explicitly instead.
If you need unicode strings:
to_unicode(obj, nonstring='simplerepr')
If you need byte str:
to_bytes(obj, nonstring='simplerepr')
Deprecated
Convert unicode to an encoded utf-8 byte str. You should be using to_bytes() instead:
to_bytes(obj, encoding='utf-8', nonstring='passthru')
Take a unicode string and turn it into a byte str suitable for xml
Return type: byte str
Returns: representation of the unicode string as a valid XML byte str
XML files consist mainly of text encoded using a particular charset. XML also denies the use of certain bytes in the encoded text (example: ASCII Null). There are also special characters that must be escaped if they are present in the input (example: <). This function takes care of all of those issues for you.
There are a few different ways to use this function depending on your needs. The simplest invocation is like this:
unicode_to_xml(u'String with non-ASCII characters: <"á と">')
This will return the following to you, encoded in utf-8:
'String with non-ASCII characters: &lt;"á と"&gt;'
Pretty straightforward. Now, what if you need to encode your document in something other than utf-8? For instance, latin-1? Let’s see:
unicode_to_xml(u'String with non-ASCII characters: <"á と">', encoding='latin-1')
'String with non-ASCII characters: &lt;"á &#12392;"&gt;'
Because the と character is not available in the latin-1 charset, it is replaced with &#12392; in our output. This is an xml character reference which represents the character at unicode codepoint 12392, the と character.
When you want to reverse this, use xml_to_unicode() which will turn a byte str into a unicode string and replace the xml character references with the unicode characters.
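For readers who want to see the two mechanisms involved (escaping of special characters and xml character references), here is a sketch built from the standard library's xml.sax.saxutils.escape() and the xmlcharrefreplace error handler. It illustrates the idea only and is not presented as how unicode_to_xml() is implemented.

```python
from xml.sax.saxutils import escape

text = u'String with non-ASCII characters: <"\xe1 \u3068">'

# Escape the characters that are special in XML text (&, < and >).
escaped = escape(text)

# utf-8 can represent every character directly.
as_utf8 = escaped.encode('utf-8')

# latin-1 cannot represent U+3068, so it becomes a character reference.
as_latin1 = escaped.encode('latin-1', 'xmlcharrefreplace')
```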
XML also has the quirk of not allowing control characters in its output. The control_chars parameter allows us to specify what to do with those. For use cases that don’t need absolute character by character fidelity (example: holding strings that will just be used for display in a GUI app later), the default value of replace works well:
unicode_to_xml(u'String with disallowed control chars: \u0000\u0007')
'String with disallowed control chars: ??'
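A stand-alone sketch of the replace behaviour follows. It assumes the set of disallowed characters is the C0 control range excluded by the XML 1.0 Char production (everything below U+0020 except tab, newline and carriage return); replace_control_chars is a hypothetical helper, and the exact set that kitchen uses may differ.

```python
import re

# C0 control characters that XML 1.0 does not allow in a document.
_CONTROL_CHARS = re.compile(u'[\\x00-\\x08\\x0b\\x0c\\x0e-\\x1f]')


def replace_control_chars(text, replacement=u'?'):
    """Return text with every disallowed control character replaced."""
    return _CONTROL_CHARS.sub(replacement, text)


cleaned = replace_control_chars(u'String with disallowed control chars: \u0000\u0007')
untouched = replace_control_chars(u'tab\tand newline\n are fine')
```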
If you do need to be able to reproduce all of the characters at a later date (examples: if the string is a key value in a database or a path on a filesystem) you have many choices. Here are a few that rely on utf-7, a verbose encoding that encodes control characters (as well as non-ASCII unicode values) to characters from within the ASCII printable characters. The good thing about doing this is that the code is pretty simple. You just need to use utf-7 both when encoding the field for xml and when decoding it for use in your python program:
unicode_to_xml(u'String with unicode: と and control char: \u0007', encoding='utf7')
'String with unicode: +MGg and control char: +AAc-'
# [...]
xml_to_unicode('String with unicode: +MGg and control char: +AAc-', encoding='utf7')
u'String with unicode: と and control char: \u0007'
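The property that makes utf-7 useful here, a pure-ASCII encoding that round-trips control characters, can be checked with the built-in codec. This sketch uses unicode.encode() and str.decode() directly and leaves the XML escaping step out.

```python
original = u'String with unicode: \u3068 and control char: \u0007'

# utf-7 turns both the non-ASCII character and the control character
# into runs of printable ASCII.
encoded = original.encode('utf-7')

# Decoding with the same codec restores the exact original text.
decoded = encoded.decode('utf-7')
```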
As you can see, the utf-7 encoding will transform even characters that would be representable in utf-8. This can be a drawback if you want unicode characters in the file to be readable without being decoded first. You can work around this with increased complexity in your application code:
encoding = 'utf-8'
u_string = u'String with unicode: と and control char: \u0007'
try:
    # First attempt to encode to utf8
    data = unicode_to_xml(u_string, encoding=encoding, errors='strict')
except XmlEncodeError:
    # Fallback to utf-7
    encoding = 'utf-7'
    data = unicode_to_xml(u_string, encoding=encoding, errors='strict')
write_tag('<mytag encoding=%s>%s</mytag>' % (encoding, data))
# [...]
encoding = tag.attributes.encoding
u_string = xml_to_unicode(u_string, encoding=encoding)
Using code similar to that, you can have some fields encoded using your default encoding and fallback to utf-7 if there are control characters present.
Note
If your goal is to preserve the control characters, you cannot simply save the entire file as utf-7 and set the xml encoding parameter to utf-7. Because XML doesn’t allow control characters, you have to encode those separately from any encoding work that the XML parser itself knows about.
Transform a byte str from an xml file into a unicode string
Return type: unicode string
Returns: string decoded from byte_string
This function attempts to reverse what unicode_to_xml() does. It takes a byte str (presumably read in from an xml file) and expands all the html entities into unicode characters and decodes the byte str into a unicode string. One thing it cannot do is restore any control characters that were removed prior to inserting into the file. If you need to keep such characters you need to use xml_to_bytes() and bytes_to_xml() or use one of the strategies documented in unicode_to_xml() instead.
Make sure a byte str is validly encoded for xml output
Return type: byte str
Returns: representation of the byte str in the output encoding with any bytes that aren’t available in xml taken care of.
Use this when you have a byte str representing text that you need to make suitable for output to xml. There are several situations where this applies. For instance, if you need to transform some strings encoded in latin-1 to utf-8 for output:
utf8_string = byte_string_to_xml(latin1_string, input_encoding='latin-1')
If you already have strings in the proper encoding you may still want to use this function to remove control characters:
cleaned_string = byte_string_to_xml(string, input_encoding='utf-8', output_encoding='utf-8')
Transform a byte str from an xml file into unicode string
Returns: unicode string decoded from byte_string
This function attempts to reverse what unicode_to_xml() does. It takes a byte str (presumably read in from an xml file) and expands all the html entities into unicode characters and decodes the byte str into a unicode string. One thing it cannot do is restore any control characters that were removed prior to inserting into the file. If you need to keep such characters you need to use xml_to_bytes() and bytes_to_xml() or use one of the strategies documented in unicode_to_xml() instead.
Return a byte str encoded so it is valid inside of any xml file
Returns: byte str representation of the input. This will be encoded using base64.
This function is made especially to put binary information into xml documents.
This function is intended for encoding things that must be preserved byte-for-byte. If you want to encode a byte string that’s text and don’t mind losing the actual bytes you probably want to try byte_string_to_xml() or guess_encoding_to_xml() instead.
Note
Although the current implementation uses base64.b64encode() and there are no plans to change it, that isn’t guaranteed. If you want to make sure that you can encode and decode these messages it’s best to use xml_to_bytes() if you use this function to encode.
Decode a string encoded using bytes_to_xml()
Parameters: |
|
---|---|
Return type: | byte str |
Returns: | byte str that’s the decoded input |
If you’ve got fields in an xml document that were encoded with bytes_to_xml() then you want to use this function to decode them. It converts a base64 encoded string into a byte str.
Note
Although the current implementation uses base64.b64decode() and there are no plans to change it, that isn’t guaranteed. If you want to make sure that you can encode and decode these messages it’s best to use bytes_to_xml() if you use this function to decode.
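The byte-for-byte guarantee rests on base64 being a lossless round trip. The sketch below shows that with the standard base64 module directly; as the notes say, code that uses this module should still pair bytes_to_xml() with xml_to_bytes() rather than rely on the underlying encoding.

```python
import base64

# Arbitrary binary data, including bytes that are not legal in XML text.
payload = b'\x00\x07binary \xff data'

# The encoded form contains only ASCII letters, digits, '+', '/' and '='.
encoded = base64.b64encode(payload)

# Decoding gives back exactly the bytes that went in.
decoded = base64.b64decode(encoded)
```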
Return a byte str suitable for inclusion in xml
Deprecated: Use guess_encoding_to_xml() instead
Tuple of functions to try to use to convert an exception into a string representation. Its main use is to extract a string (unicode or str) from an exception object in exception_to_unicode() and exception_to_bytes(). The functions here will try the exception’s args[0] and the exception itself (roughly equivalent to str(exception)) to extract the message. This is only a default and can be easily overridden when calling those functions. There are several reasons you might wish to do that. If you have exceptions where the best string representing the exception is not returned by the default functions, you can add another function to extract from a different field:
from kitchen.text.converters import (EXCEPTION_CONVERTERS,
        exception_to_unicode)

class MyError(Exception):
    def __init__(self, message):
        self.value = message

c = [lambda e: e.value]
c.extend(EXCEPTION_CONVERTERS)
try:
    raise MyError('An Exception message')
except MyError, e:
    print exception_to_unicode(e, converters=c)
Another reason would be if you’re converting to a byte str and you know the str needs to be a non-utf-8 encoding. exception_to_bytes() defaults to utf-8 but if you convert into a byte str explicitly using a converter then you can choose a different encoding:
from kitchen.text.converters import (EXCEPTION_CONVERTERS,
        exception_to_bytes, to_bytes)

c = [lambda e: to_bytes(e.args[0], encoding='euc_jp'),
        lambda e: to_bytes(e, encoding='euc_jp')]
c.extend(EXCEPTION_CONVERTERS)
try:
    do_something()
except Exception, e:
    log = open('logfile.euc_jp', 'a')
    log.write('%s\n' % exception_to_bytes(e, converters=c))
    log.close()
Each function in this list should take the exception as its sole argument and return a string containing the message representing the exception. The functions may return the message as a byte str, a unicode string, or even an object if you trust the object to return a decent string representation. The exception_to_unicode() and exception_to_bytes() functions will make sure to convert the string to the proper type before returning.
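The idea of trying a list of converter functions in order can be sketched without kitchen. The helper below, first_message, is hypothetical: it returns the result of the first converter that does not raise, and the empty-string fallback when every converter fails is an assumption made for the example rather than documented behaviour.

```python
def first_message(exc, converters):
    """Return the result of the first converter that succeeds."""
    for convert in converters:
        try:
            return convert(exc)
        except Exception:
            # Missing attribute, empty args, encoding trouble: try the next one.
            continue
    return u''  # assumption: empty string when no converter works


class MyError(Exception):
    def __init__(self, message):
        Exception.__init__(self)
        self.value = message


# A custom extractor first, then a generic one that reads args[0].
converters = [lambda e: e.value, lambda e: e.args[0]]

custom = first_message(MyError(u'An Exception message'), converters)
plain = first_message(ValueError(u'plain message'), converters)
empty = first_message(ValueError(), converters)
```

Putting the most specific extractor first matches the examples above, where the custom lambda is placed ahead of the defaults before the list is extended.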
New in version 0.2.2.
Deprecated: Use EXCEPTION_CONVERTERS instead.
Tuple of functions to try to use to convert an exception into a string representation. This tuple is similar to the one in EXCEPTION_CONVERTERS but it’s used with exception_to_bytes() instead. Ideally, these functions should do their best to return the data as a byte str but the results will be run through to_bytes() before being returned.
New in version 0.2.2.
Changed in version 1.0.1: Deprecated as simplifications allow EXCEPTION_CONVERTERS to perform the same function.
Convert an exception object into a unicode representation
Returns: unicode string representation of the exception. The value extracted by the converters will be converted into unicode before being returned using the utf-8 encoding. If you know you need to use an alternate encoding, add a function that does that to the list of functions in converters.
New in version 0.2.2.
Convert an exception object into a str representation
Returns: byte str representation of the exception. The value extracted by the converters will be converted into str before being returned using the utf-8 encoding. If you know you need to use an alternate encoding, add a function that does that to the list of functions in converters.
New in version 0.2.2.
Changed in version 1.0.1: Code simplification allowed us to switch to using EXCEPTION_CONVERTERS as the default value of converters.