Making Sense of Python Unicode
June 7, 2009
There are many Unicode tutorials available for Python, but few offer an understanding of how Unicode works or how it's applied in the real world. This tutorial will teach you what you need to know about Unicode by example, as succinctly as possible.
Unicode support in Python 2.4 through 2.6 (if you're using Python 3.0, stop reading right now! The way all this works has changed!) is difficult to understand because it's counter-intuitive, full of gotchas, and Python will not hesitate to let you know whenever the slightest thing goes wrong. If you're an American programmer like me, with very little use for non-ASCII characters, it is very easy to write and test an online web service that fails in production the moment someone speaking a foreign language tries to use your application.
Note: We will not cover UTF-16 because it introduces advanced concepts like BOMs and endianness, and I have rarely had to use this crazy encoding in my particular line of work.
Getting Started
Open up a Python console. Make sure your terminal supports UTF-8:
Strings aren't a list of characters like a normal person would expect; they really just contain binary data:
>>> print "@ symbol as ASCII Hex: \x40"
@ symbol as ASCII Hex: @
>>> print "@ symbol as ASCII Octal: \100"
@ symbol as ASCII Octal: @
>>> print "null char \x00 is ok"
null char is ok
"Unicode strings" in Python are not binary data. They are a list of characters. Non-ASCII characters like Chinese characters, Western characters with accents, etc. can be represented as binary data in a variety of ways. To do anything with a Unicode string outside of Python, you have to turn it into a binary representation. Our terminal is set to 'UTF-8' so let's use that:
>>> print u"hello kitty".encode('utf-8')
hello kitty
>>> print u"hello \u0040 unicode \u5b57".encode('utf-8')
hello @ unicode 字
>>> u'\u5b57'.encode('utf-8')
'\xe5\xad\x97'
Python is smart enough to know that my terminal is configured for UTF-8 and can do this automatically.
>>> print u"\u5b57"
字
You can differentiate between these two types of strings.
>>> type(u"\u5b57") is unicode
True
>>> isinstance(u"\u5b57", unicode)
True
>>> isinstance(u"\u5b57", str)
False
>>> isinstance(u"\u5b57", basestring)
True
>>> isinstance('\xe5\xad\x97', basestring)
True
>>> isinstance(u"\u5b57".encode('utf-8'), str)
True
"Encodings" like UTF-8 were invented because Unicode has room for over a million characters (code points run up to U+10FFFF). Storing every character at a fixed width would therefore take 32 bits each rather than 8. The masses revolted at the idea of having to purchase larger hard drives to store their Word documents. UTF-8 is the most commonly used encoding because it can represent any Unicode symbol and is backwards compatible with ASCII.
>>> u'@'.encode('ascii') == '@'
True
>>> u'@'.encode('utf-8') == '@'
True
>>> hex(ord("@")) == '0x40'
True
>>> print '\x40'
@
>>> print u'\x40'
@
Unlike evil encodings like UTF-16.
>>> u'@'.encode('utf-16') == '@'
False
But UTF-8 has a dark side: a single character can take anywhere from one to four bytes to represent in binary.
>>> len(u'\u0040'.encode('utf-8'))
1
>>> len(u"\u5b57".encode('utf-8'))
3
>>> u'\u5b57'.encode('utf-8')
'\xe5\xad\x97'
>>> u'\u0040'.encode('utf-8')
'@' # the same byte as '\x40'
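To see the variable width concretely, here is a small sketch that encodes one character from each of the four UTF-8 length classes. The sample characters and the `utf8_len` helper name are my own picks for illustration, and the explicit u'' prefixes are there so it should run on Python 2.6+ as well as Python 3.3+.

```python
def utf8_len(char):
    """Return how many bytes one character occupies in UTF-8."""
    return len(char.encode('utf-8'))

samples = [
    (u'\u0040', 'at sign'),            # ASCII range
    (u'\u00e9', 'e with acute'),       # Latin-1 supplement
    (u'\u5b57', 'CJK ideograph'),      # elsewhere in the Basic Multilingual Plane
    (u'\U0001F600', 'emoji'),          # outside the Basic Multilingual Plane
]

widths = [utf8_len(char) for char, _ in samples]
for (_, label), width in zip(samples, widths):
    print('%-14s %d byte(s)' % (label, width))
```

ASCII stays at one byte, which is exactly why UTF-8 data that happens to be pure ASCII is indistinguishable from plain ASCII.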
If you want to know the real length of a string (or more correctly, the number of characters), you should decode your binary string into a unicode string:
>>> len(u'\u5b57')
1
>>> raw_utf8_data = '\xe5\xad\x97'
>>> len(raw_utf8_data)
3
>>> print raw_utf8_data
字
>>> raw_utf8_data.decode('utf-8')
u'\u5b57'
>>> len(raw_utf8_data.decode('utf-8'))
1
But UTF-8 isn't very efficient at storing Asian symbols, taking a whole three bytes. The Eastern masses revolted at the prospect of having to buy bigger hard drives and made their own encodings.
>>> u'\u5b57'.encode('big5')
'\xa6r' # two bytes: 0xa6 followed by 0x72, which is 'r'
>>> len(u'\u5b57'.encode('big5'))
2
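The size trade-off is easy to measure. The sketch below encodes the same character with a handful of codecs from Python's standard library and records how many bytes each one needs. The list of encodings is my own selection, and the `-le` variants are used so that no byte order mark inflates the count.

```python
char = u'\u5b57'  # the character used throughout this tutorial

encodings = ['utf-8', 'big5', 'gb2312', 'utf-16-le', 'utf-32-le']
sizes = dict((name, len(char.encode(name))) for name in encodings)

for name in encodings:
    print('%-10s %d bytes' % (name, sizes[name]))
```

The regional encodings win on size for this character, but each of them only covers its own slice of the world's scripts.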
Before Unicode and UTF-8 came about, most Western computers used the 'latin-1' (AKA ISO-8859-1) encoding by default. ASCII only defines 128 characters, and Latin-1 did a nice job filling in the other 128 slots left in each byte.
>>> u'\u00D8'.encode('latin-1')
'\xd8'
>>> u'\xD8'.encode('latin-1')
'\xd8'
>>> print u'\u00D8'
Ø
>>> '\xD8'.decode('latin-1')
u'\xd8' # The first 256 characters in the Unicode standard are the same as latin-1!
>>> print '\xD8'.decode('latin-1').encode('utf-8')
Ø
Our terminal is set to UTF-8, so our poor little terminal won't know what to do if we try to print raw latin-1 binary data:
>>> print '\xD8'
�
But if you go to your terminal settings and change your encoding from UTF-8 to Latin-1 (or ISO-8859-1), it will work!
>>> print '\xD8'
Ø
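The same mismatch can be reproduced without touching any terminal settings. This is a minimal sketch (variable names are mine) of what happens when bytes written in one encoding are read back as the other: one direction fails loudly, the other quietly produces the wrong characters.

```python
raw_latin1 = b'\xd8'                    # "O with stroke" as one Latin-1 byte
text = raw_latin1.decode('latin-1')     # bytes -> Unicode string
raw_utf8 = text.encode('utf-8')         # Unicode string -> two UTF-8 bytes

# Reading UTF-8 bytes as Latin-1 never fails; it just yields the wrong text.
mojibake = raw_utf8.decode('latin-1')

# Reading Latin-1 bytes as UTF-8 raises instead of guessing.
try:
    raw_latin1.decode('utf-8')
    strict_utf8_failed = False
except UnicodeDecodeError:
    strict_utf8_failed = True
```

The silent direction is the dangerous one, because every byte is a valid Latin-1 character and nothing ever complains.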
Sometimes when accepting data over the network, we'll run into invalid characters. In such cases you probably don't want your application to crash.
>>> happy_utf8_data = u"hello \u0040 unicode \u5b57".encode('utf-8')
>>> evil_utf8_data = happy_utf8_data + '\xff\xff\xff'
>>> print evil_utf8_data.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 19: unexpected code byte
>>> print evil_utf8_data.decode('utf-8', 'replace')
hello @ unicode 字���
>>> print evil_utf8_data.decode('utf-8', 'ignore')
hello @ unicode 字
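If you would rather log the problem than paper over it, the exception itself tells you where things went wrong. The sketch below reuses the data from the session above; `describe_bad_bytes` is a hypothetical helper name, and it relies only on the `start` and `reason` attributes that `UnicodeDecodeError` carries.

```python
happy_utf8_data = u"hello \u0040 unicode \u5b57".encode('utf-8')
evil_utf8_data = happy_utf8_data + b'\xff\xff\xff'

def describe_bad_bytes(data, encoding='utf-8'):
    """Return (offset, reason) for the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None

problem = describe_bad_bytes(evil_utf8_data)
print('first bad byte at offset %d (%s)' % problem)
```

The offset matches the position reported in the traceback above, so you can slice the raw data around it when debugging.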
Python + Unicode + I/O
Remember how Python was smart enough to automatically encode to UTF-8 when printing Unicode strings to our terminal? Try not to get spoiled because Python won't extend you that luxury in other areas:
>>> f = open('/tmp/lawl', "wb")
>>> f.write(u'\u5b57')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5b57' in position 0: ordinal not in range(128)
>>> f.close()
Python isn't a language that likes to make guesses, so it will assume all characters are ASCII unless otherwise specified. It's your job to explicitly say what you mean so Python will always do what you want (or crash and do nothing at all).
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
Rather than searching Google on how to override Python's default behavior (which might break other libraries), let's fix our code:
>>> f = open('/tmp/lawl', "wb")
>>> f.write(u'\u5b57'.encode('utf-8'))
>>> f.close()
>>> f = open('/tmp/lawl', "rb")
>>> raw_utf8_data = f.read()
>>> f.close()
>>> raw_utf8_data
'\xe5\xad\x97'
>>> raw_utf8_data.decode('utf-8')
u'\u5b57'
>>> f.close()
Python provides an easier way to accomplish the above:
>>> import codecs
>>> f = codecs.open('/tmp/lawl', 'wb', 'utf-8')
>>> f.write(u'\u5b57')
>>> f.close()
>>> f = codecs.open('/tmp/lawl', 'rb', 'utf-8')
>>> f.read()
u'\u5b57'
>>> f.close()
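An alternative worth knowing about is `io.open`, which exists from Python 2.6 onward and is the same function as the built-in `open` in Python 3. The sketch below is one way to do the same round trip with it; the temporary directory is only there so the example does not clobber anything.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'lawl')  # throwaway location

# Text mode plus an explicit encoding: write() takes Unicode strings.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\u5b57')

# Binary mode shows the bytes that actually landed on disk.
with io.open(path, 'rb') as f:
    raw = f.read()

# Text mode again: read() hands back a Unicode string.
with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()
```

The `with` blocks also take care of closing the file, so there are no stray close() calls to forget.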
But you better not write() binary strings with non-ASCII data!
>>> import codecs
>>> f = codecs.open('/tmp/lawl', 'wb', 'utf-8')
>>> f.write('\xe5\xad\x97')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/codecs.py", line 686, in write
return self.writer.write(data)
File "/usr/lib/python2.6/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
Latin-1, or more precisely its close relative Windows-1252, is generally the default encoding used by Windows machines in Western locales. If your boss sends you a text file with non-ASCII characters and importing it into your database causes Python to go insane, you might want to try the following:
happy_unicode_data = open('tps_report.txt').read().decode('latin-1')
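When you don't know where a file came from, one pragmatic approach is to try the strict encodings first and fall back to the permissive one. This is only a sketch of that idea: `decode_guess` and the order of the candidate list are assumptions of mine, not a standard recipe.

```python
def decode_guess(raw, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Try each encoding in order; return (text, encoding_that_worked)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 accepts every byte, so this is only reachable with a custom list
    raise ValueError('none of %r could decode the data' % (encodings,))

from_linux = decode_guess(u'caf\xe9'.encode('utf-8'))
from_windows = decode_guess(b'caf\xe9')
```

UTF-8 goes first because random Latin-1 text almost never happens to be valid UTF-8, whereas Latin-1 will happily "decode" anything you throw at it.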
UTF-8 is slowly becoming the de facto standard codec and should be used whenever possible because it is universal and able to store any written character. Most Linux distributions such as Ubuntu already use UTF-8 as their system-wide default.
Mixing Binary and Unicode Strings
American programmers: In your day to day grind, it's superfluous to put a 'u' in front of every single string.
>>> 'hello ' + 'there'
'hello there'
>>> 'hello ' + u'there'
u'hello there'
>>> 'hello ' + u'André'
u'hello Andr\xe9'
Python will turn your everyday binary strings into Unicode strings when necessary. But things get trickier if you put non-ASCII characters in byte strings.
>>> 'hello ' + 'André'
'hello Andr\xc3\xa9' # I copy/pasted 'é' as a UTF-8 character
>>> 'hello'.encode('utf-8') # this is incorrect, but python lets it slide because 'hello' is ASCII
'hello'
>>> ('hello ' + 'André').encode('utf-8') # but don't get spoiled!
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)
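A habit that avoids this whole class of errors is to decode bytes as soon as they enter your program, work in Unicode everywhere in between, and encode exactly once on the way out. The sketch below shows the shape of that; `to_text` and `greet` are hypothetical names, and UTF-8 is assumed as the incoming encoding.

```python
def to_text(value, encoding='utf-8'):
    """Decode byte strings on the way in; pass Unicode strings through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def greet(name):
    # Everything in here is Unicode, so concatenation cannot blow up.
    return u'hello ' + to_text(name)

greeting = greet(b'Andr\xc3\xa9')        # UTF-8 bytes from the outside world
same_greeting = greet(u'Andr\xe9')       # already a Unicode string
wire_bytes = greeting.encode('utf-8')    # encode once, at the output boundary
```

Both calls produce the same result, so callers no longer need to care which kind of string they were handed.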
It's safest to put your non-ASCII characters into Unicode strings. This way other Python libraries doing I/O don't have to guess which encoding your binary strings use:
File: test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
def say_hello(name):
    return "hello %s" % (name)
logging.warn(say_hello('Jill'))
logging.warn(say_hello(" ".join(['John', u'Doe'])))
logging.warn(say_hello(u'André'))
logging.warn(say_hello('André')) # stop being spoiled!
jart@compy:~$ python test.py
WARNING:root:hello Jill
WARNING:root:hello John Doe
WARNING:root:hello André
Traceback (most recent call last):
File "/usr/lib/python2.6/logging/__init__.py", line 773, in emit
stream.write(fs % msg.encode("UTF-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)
Using Unicode Characters in Python Source Code
File: test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# this will crash if Python isn't able to figure out what encodings
# your terminal supports
print u"hello @ unicode 字"
# this is safer if you KNOW your terminal will support UTF-8, or would
# rather just have it not crash and print gibberish
print u"hello @ unicode 字".encode('utf-8')
# same concept applies, but you lose the benefits of unicode strings
# most importantly Python will assume byte strings are ASCII the moment
# they hit any standard I/O
print "hello @ unicode 字"
jart@compy:~$ python test.py
hello @ unicode 字
hello @ unicode 字
hello @ unicode 字
jart@compy:~$ python test.py >/dev/null
Traceback (most recent call last):
File "test.py", line 6, in <module>
print u"hello @ unicode 字"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5b57' in position 16: ordinal not in range(128)
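The redirect fails because a pipe or file has no terminal encoding for Python to detect. One way around it is to wrap the byte stream in a writer that always encodes. The sketch below uses `codecs.getwriter` with an in-memory stream standing in for the redirected output; in a Python 2 script the stream you would wrap is `sys.stdout` itself. The `utf8_writer` name is mine.

```python
import codecs
import io

def utf8_writer(byte_stream):
    """Wrap a binary stream so that write() accepts Unicode strings."""
    return codecs.getwriter('utf-8')(byte_stream)

fake_stdout = io.BytesIO()            # stands in for a pipe or a file
out = utf8_writer(fake_stdout)
out.write(u'hello @ unicode \u5b57\n')

captured = fake_stdout.getvalue()     # the bytes a reader of the pipe would see
```

This trades automatic detection for predictability: the output is UTF-8 whether or not anyone is watching a terminal.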
Some editors will help you out and make sure your source code is encoded properly.
Python + Unicode + HTML
If you have a web application and accept user input from a form, you should specify "Content-Type: text/html; charset=UTF-8" in your HTTP response headers to let the browser know to send you UTF-8 data. To be extra sure you can add the following to your HTML:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
HTML defines Unicode entities in base 10 (decimal), while Python escapes use hex. If you use the Unicode Browser reference link below, you'll need to use a calculator to convert them to hex.
>>> print u"&#214; in HTML should be \u00D6 in Python"
&#214; in HTML should be Ö in Python
>>> hex(214) # not 0214: a leading zero makes it an octal literal
'0xd6'
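Python can do the calculator's job for you. Here is a minimal sketch that turns decimal character references back into characters and prints the matching Python escape. It only handles the decimal `&#NNN;` form, and `decode_numeric_entities` is a made-up helper, not a library function.

```python
import re

try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3 folded unichr into chr

def decode_numeric_entities(html):
    """Replace decimal references such as &#214; with the character itself."""
    def replace(match):
        return unichr(int(match.group(1)))
    return re.sub(u'&#([0-9]+);', replace, html)

decoded = decode_numeric_entities(u'&#214;sterreich')
python_escape = u'\\u%04X' % 214      # decimal 214 written as a Python escape
```

Named entities and hex references would need extra handling that this sketch leaves out.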
Bullet-Proof Type Casting of Strings
In large Python projects, you might write a function that can be passed a wide variety of types, including byte strings, unicode strings, objects, numbers, etc. Sometimes it simply isn't practical to be super-strict and you just want to turn an arbitrary Python term into a byte string or a unicode string, and have it just work. If you use Django, it ships with similar helpers.
File: smart_encoding.py
# Copyright (c) 2009 Lobstertech, Inc.
# Licensed under the LGPL
import types
def smart_unicode(s, encoding='utf-8', errors='strict'):
    if type(s) in (unicode, int, long, float, types.NoneType):
        return unicode(s)
    elif type(s) is str:
        return unicode(s, encoding, errors)
    elif hasattr(s, '__unicode__'):
        # unicode() rejects an encoding argument for non-string objects
        return unicode(s)
    else:
        return unicode(str(s), encoding, errors)
def smart_str(s, encoding='utf-8', errors='strict', from_encoding='utf-8'):
    if type(s) in (int, long, float, types.NoneType):
        return str(s)
    elif type(s) is str:
        if encoding != from_encoding:
            return s.decode(from_encoding, errors).encode(encoding, errors)
        else:
            return s
    elif type(s) is unicode:
        return s.encode(encoding, errors)
    elif hasattr(s, '__unicode__'):
        # checked before __str__, which every new-style object has
        return smart_str(unicode(s), encoding, errors, from_encoding)
    elif hasattr(s, '__str__'):
        return smart_str(str(s), encoding, errors, from_encoding)
    else:
        return smart_str(str(s), encoding, errors, from_encoding)
File: test_smart_encoding.py
# -*- coding: utf-8 -*-
# Copyright (c) 2009 Lobstertech, Inc.
# Licensed under the LGPL
#
# py.test -xl test_smart_encoding.py
import py.test
from smart_encoding import smart_unicode, smart_str
def test_smart_str():
    assert type(smart_str('hello world')) is str
    assert smart_str('hello world') == 'hello world'
    assert smart_str(u'hello world') == u'hello world'
    assert type(smart_str(u'hello world')) is str
    assert smart_str(u"\u96c6") == '\xe9\x9b\x86'
    assert smart_str(u"\u96c6", "big5") == '\xb6\xb0'
    py.test.raises(UnicodeDecodeError, lambda: smart_str('\xb6\xb0', "big5"))
    py.test.raises(UnicodeDecodeError, lambda: smart_str('\xb6\xb0', "ascii"))
    assert smart_str('\xb6\xb0', "big5", errors="ignore") == ''
    assert smart_str('\xb6\xb0', "big5", errors="replace") == '??'
    assert smart_str('hello \xb6\xb0', "ascii", errors="replace") == 'hello ??'
    assert smart_str('\xe9\x9b\x86 \xb6\xb0', "big5", errors="replace") == '\xb6\xb0 ??'
    assert smart_str('\xb6\xb0', "big5", from_encoding="big5") == '\xb6\xb0'
    assert smart_str('\xb6\xb0', "utf-8", from_encoding="big5") == '\xe9\x9b\x86'
def test_smart_unicode():
    assert type(smart_unicode('hello world')) is unicode
    assert smart_unicode('hello world') == u'hello world'
    assert smart_unicode(u'hello world') == u'hello world'
    assert type(smart_unicode(u'hello world')) is unicode
    assert smart_unicode(u"\u96c6") == u"\u96c6"
    assert smart_unicode(u"\u96c6", "big5") == u"\u96c6"
    assert smart_unicode("\xb6\xb0", "big5") == u"\u96c6"
    assert smart_unicode("hi \xa3", "latin-1") == u'hi \xa3'
    assert smart_unicode("hi \xa3", "latin-1") == u'hi \u00a3'
    py.test.raises(UnicodeDecodeError, lambda: smart_unicode("hi \xa3", "ascii"))
    assert smart_unicode("hi \xa3", "ascii", errors="ignore") == u"hi "
    assert smart_unicode("hi \xa3", "ascii", errors="replace") == u"hi \ufffd" # unicode question mark
def test_object():
    class Lawl(object): pass
    class Mog: pass
    assert type(smart_unicode(Lawl())) is unicode
    assert smart_unicode(Lawl()).startswith('<')
    assert type(smart_unicode(Mog())) is unicode
    assert smart_unicode(Mog()).startswith('<')
    class Hurt:
        def __str__(self):
            return '\xe9\x9b\x86'
    assert smart_unicode(Hurt()) == u"\u96c6"
    assert smart_str(Hurt()) == '\xe9\x9b\x86'
    class TheHurting:
        def __str__(self):
            return '\xb6\xb0'
    assert smart_unicode(TheHurting(), 'big5') == u"\u96c6"
Python + Unicode + Databases
PostgreSQL uses UTF-8 by default and has never given me any problems. MySQL uses Latin-1 and has a compulsive need to magically re-encode your data every possible chance it gets. Latin-1 is a very undesirable charset to be restricted to if you are writing a web application because you lose the ability to store non-Western characters and cute little symbols like: ☃
Worst of all, MySQL will turn any invalid characters into '?' marks, silently corrupting your data. When writing this article, I discovered that when I created this website in Django, even after setting UTF-8 as my default charset in Django, and even after I thought I had bullied MySQL into using UTF-8 by default, somehow all my tables used this weird Swedish variant of Latin-1 by default.
If you find yourself in a similar situation and switching to PostgreSQL is out of the question, you might want to try the following to get your MySQL database to properly handle UTF-8:
Back up your database, and to be safe, import the data into a standby database in case a disaster happens.
If you don't need to worry about migrating lots of data with latin1 characters and just want it to work now, do this:
When UTF8 and Latin-1 Get Mixed Together :(
If you need to migrate a large set of latin-1 data and your database contains both latin1 AND utf-8 data, you're in big trouble. If you do a SQL dump, read the file in Python, and decode it as utf8, you'll get unicode decode errors. If you decode it as latin1, you'll get corrupted characters. But you're in luck! I have a happy little script which might help you:
convert_mixed_utf_latin1.py
#!/usr/bin/env python
#
# This script will read 'backup.sql' and decode it as utf8.
# If it comes across any characters it can't decode, it
# will assume they are latin1.
#
# It will then output a file that is fully utf8 named 'backup.new.sql'
#
data = open('backup.sql').read()
final = []
while True:
    try:
        final.append(data.decode('utf8'))
        break
    except UnicodeDecodeError, exc:
        print "oh snap: %r -> %r" % (data[exc.start], data[exc.start].decode('latin1').encode('utf8'))
        # everything up to crazy character should be good
        final.append(data[:exc.start].decode('utf8'))
        # crazy character is probably latin1
        final.append(data[exc.start].decode('latin1'))
        # remove already encoded stuff
        data = data[exc.start+1:]
f = open('backup.new.sql', 'wb')
f.write("".join(final).encode('utf8'))
f.close()
That script is basically a super-enhanced version
jart@compy:~$ mysqldump --opt -u root happydb >backup.sql
jart@compy:~$ echo 'create database happydb_new;' | mysql -uroot
jart@compy:~$ convert_mixed_utf_latin1.py
jart@compy:~$ sed -i -e 's/latin1/utf8/' backup.new.sql
jart@compy:~$ mysql -uroot happydb_new <backup.new.sql
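The same "assume the bad bytes are latin1" idea can also be expressed as a custom codec error handler, which lets the decoder make a single pass over the data instead of restarting after every failure. This is a sketch under that assumption: it uses `codecs.register_error` from the standard library, and the handler name `latin1_fallback` is invented for the example.

```python
import codecs

def latin1_fallback(exc):
    """Error handler: reinterpret undecodable bytes as Latin-1."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the offset where decoding resumes.
    return bad.decode('latin-1'), exc.end

# 'latin1_fallback' is a name made up for this sketch
codecs.register_error('latin1_fallback', latin1_fallback)

# Simulate a dump where one row was stored as UTF-8 and another as Latin-1.
mixed = u'caf\xe9 '.encode('utf-8') + u'caf\xe9 au lait'.encode('latin-1')
repaired = mixed.decode('utf-8', 'latin1_fallback')
```

It shares the script's blind spot: Latin-1 bytes that happen to form a valid UTF-8 sequence are taken as UTF-8, so spot-check the output before loading it anywhere important.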
Hands On With Asian Spam
If your boss ever asks you to write a ticket system that automatically imports emails into a database, one of your most crucial responsibilities is making sure that what Asian spammers have to say is rendered properly. Here is a piece of spam I received recently. (Note: I saved this email to a file from Thunderbird. Because Thunderbird renders to UTF-8, had I viewed the source and copied/pasted the email, this would not work because I would have been copy/pasting UTF-8 characters rather than the goofy encoding this spammer is using.)
If we download the above file and try to view it in our UTF-8 terminal, we will see gibberish:
This is because the email is encoded in Content-Type: text/plain;charset="GB2312"
Python has tools for making sense out of these emails.
>>> import email
>>> msg = email.message_from_file(open('chinese_spam.txt'))
>>> text = msg.get_payload()
>>> msg.get_content_charset()
'gb2312'
>>> unicode_text = text.decode('gb2312')
>>> type(unicode_text) == unicode
True
>>> print unicode_text.encode('utf-8')
[...]
◆工作经验:
10多年高科技企业产品研发和研发管理工作经历,先后担任过项目经理、研究管理部经
理、开发部经理等职位,在长期的产品研发管理实践中积累了丰富的技术和管理经验。在华
[...]
But the subject header is also encoded!
>>> print msg['Subject']
=?GB2312?B?svrGt9HQt6K8sLy8yvXIy9SxusvQxLncwO28vMTc0bXBtw==?=
>>> msg['Subject'].decode(msg.get_content_charset()).encode('utf-8')
'=?GB2312?B?svrGt9HQt6K8sLy8yvXIy9SxusvQxLncwO28vMTc0bXBtw==?='
It seems our decoding powers no longer apply: the Content-Type header only describes the message body (payload) of the email. The subject is itself a header, so that charset doesn't apply to it. If you look closely, the subject appears to be encoded in Base64 and tells you which encoding to use. Thankfully Python comes to the rescue once again!
>>> import email.Header
>>> (text, encoding) = email.Header.decode_header(msg['Subject'])[0]
>>> print text
��Ʒ�з���������Ա���Ĺ�����ѵ�
>>> type(text) is unicode
False
>>> encoding
'gb2312'
>>> print text.decode(encoding).encode('utf-8')
产品研发及技术人员核心管理技能训练
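A header can be made of several encoded chunks, each with its own charset, so it helps to wrap the steps above in a function that handles every piece. Here is a minimal sketch using `decode_header` from the standard library's email package; `header_to_text` and its ASCII fallback are my own assumptions, and the subject is the one from the spam above.

```python
from email.header import decode_header

def header_to_text(raw_header, fallback='ascii'):
    """Decode an RFC 2047 header into a single Unicode string."""
    pieces = []
    for part, charset in decode_header(raw_header):
        if isinstance(part, bytes):
            # 'replace' keeps one bad byte from rejecting the whole header
            part = part.decode(charset or fallback, 'replace')
        pieces.append(part)
    return u''.join(pieces)

subject = header_to_text(
    '=?GB2312?B?svrGt9HQt6K8sLy8yvXIy9SxusvQxLncwO28vMTc0bXBtw==?=')
```

The result is a Unicode string, so it can be stored or re-encoded like any other text in your application.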
Don't forget that in practice (as discussed above), you'll probably want to specify use
Advanced: The full intricacies of parsing email are beyond the scope of this tutorial. Here are some quick tips that might help you move in the right direction a little quicker:
Comments
Anon E Mouse
on May 12, 2010
[Permalink]
Technical error:
A character in UTF-8 takes between one and four bytes, never five or six. Once upon a time, an old version of UTF-8 supported five and six byte sequences.
saperduper
on May 12, 2010
[Permalink]
excellent article! thanks!
wally_fish on May 12, 2010 [Permalink]
latin1 isn't limited to Windows machines. For Ubuntu<9.04, it was the default encoding, and lots of people using Western European languages (including me) have gobloads of textfiles in latin1.
So you *really* need to consider which bunch of files will have which encoding (except for XML files, which have a header that cElementTree evaluates), or everything will be messed up the moment that you add Mr. Westerståhl from Bölsö to your customer database. And most of the time, you'll have a metric ton of existing data in latin1 (or EUC-JP, GB-5, whatever) so that using UTF-8 only is not an option.
Deepak on May 13, 2010 [Permalink]
Is there a similar tutorial for Py 3+?
Alan Franzoni
on May 13, 2010
[Permalink]
"unicode strings" should be better called "unicode objects". They're really an abstraction internal to python matching a unicode code point. Real-world strings must be instances of 'str'. Sometimes the conversion is implicit.
Also, when reading source code files, python looks at the -*- coding: STH -*- declaration at the beginning of the file in order to understand which encoding your text file is using.
Deepak:
in python 3, all strings are unicode objects by default. There's a separate type, bytearray, to handle binary strings.
The same effect can be achieved in python >= 2.6 with the statement
from __future__ import unicode_literals
for each python file you want to treat as unicode-only.
jart on May 13, 2010 [Permalink]
@AlanFranzoni
> "unicode strings" should be better called "unicode objects".
Calling them objects might be superfluous because all values in Python are essentially objects in one way or another.
@AnonEMouse
Thanks :)
@Deepak
Maybe I'll write one, but I'm not sure the demand is out there.
@wally_fish
> And most of the time, you'll have a metric ton of existing data in latin1
If they're just a bunch of files, what's holding you back from writing a shell script to convert them using iconv? Do you have apps tied to them? Are they involved in production services running 24/7?
Character encodings can be soo crazy, I'm sorry to hear you've been bitten by their dark side.