Making Sense of Python Unicode
There are many Unicode tutorials available for Python, but few offer an understanding of how Unicode works or how it's applied in the real world. This tutorial will teach you what you need to know about Unicode by example, as succinctly as possible.
Table of Contents
- Introduction
- Getting Started
- Python + Unicode + I/O
- Mixing Binary and Unicode Strings
- Using Unicode Characters in Python Source Code
- Python + Unicode + HTML
- Bullet-Proof Type Casting of Strings
- Python + Unicode + Databases
- When UTF8 and Latin-1 Get Mixed Together :(
- Hands On With Asian Spam
- Reference Material
Introduction
Unicode support in Python 2.x is difficult to understand, counter-intuitive, and often full of surprises, because Python will not hesitate to let you know whenever the slightest thing goes wrong. If you're an American programmer like me, with very little use for non-ASCII characters, it is very easy to write and test an online web service that fails in production the moment someone speaking a foreign language tries to use your application.
Note: We will not cover UTF-16 because it introduces advanced concepts like BOMs and endianness, and I have rarely had to use this crazy encoding in my particular line of work.
Getting Started
Open up a Python console and make sure your terminal supports UTF-8.
Strings aren't a list of characters like a normal person would expect; they really just contain binary data:
>>> print "@ symbol as ASCII Hex: \x40"
@ symbol as ASCII Hex: @
>>> print "@ symbol as ASCII Octal: \100"
@ symbol as ASCII Octal: @
>>> print "null char \x00 is ok"
null char  is ok
"Unicode strings" in Python are not binary data. They are a list of characters.
Non-ASCII characters, like Chinese characters or Western characters with accents, can be represented as binary data in a variety of ways.
To print, store, or transmit Unicode strings, you have to turn them into a binary representation. Our terminal is set to 'UTF-8' so let's use that:
>>> print u"hello kitty".encode('utf-8')
hello kitty
>>> print u"hello \u0040 unicode \u5b57".encode('utf-8')
hello @ unicode 字
>>> u'\u5b57'.encode('utf-8')
'\xe5\xad\x97'
Python is smart enough to know that my terminal is configured for UTF-8 and can do this automatically.
>>> print u"\u5b57"
字
You can differentiate between these two types of strings.
>>> type(u"\u5b57") is unicode
True
>>> isinstance(u"\u5b57", unicode)
True
>>> isinstance(u"\u5b57", str)
False
>>> isinstance(u"\u5b57", basestring)
True
>>> isinstance('\xe5\xad\x97', basestring)
True
>>> isinstance(u"\u5b57".encode('utf-8'), str)
True
"Encodings" like UTF-8 were invented because Unicode reserves room for over a million characters (code points run up to U+10FFFF). A naive fixed-width representation would need 32 bits per character rather than 8. The masses revolted at the idea of having to purchase larger hard drives to store their Word documents.
UTF-8 is the most commonly used because it can represent any Unicode symbol and is backwards compatible with ASCII.
>>> u'@'.encode('ascii') == '@'
True
>>> u'@'.encode('utf-8') == '@'
True
>>> hex(ord("@")) == '0x40'
True
>>> print '\x40'
@
>>> print u'\x40'
@
Unlike evil encodings like UTF-16.
>>> u'@'.encode('utf-16') == '@'
False
But UTF-8 has a dark side: a single character can take anywhere from one to four bytes to represent in binary.
>>> len(u'\u0040'.encode('utf-8'))
1
>>> len(u"\u5b57".encode('utf-8'))
3
>>> u'\u5b57'.encode('utf-8')
'\xe5\xad\x97'
>>> u'\u0040'.encode('utf-8')
'@'
>>> '@' == '\x40'
True
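To see the variable width for yourself, here is a minimal sketch that encodes one character from each UTF-8 length class. It is written for Python 3, where byte strings are `bytes` and Unicode strings are `str`; the `u''` prefixes are optional there and kept only to match the sessions above. The sample characters are my own picks for illustration.

```python
# One sample character per UTF-8 length class (samples chosen for illustration).
samples = [
    u"\u0040",      # @ : plain ASCII
    u"\u00e9",      # e with acute accent: Latin-1 range
    u"\u5b57",      # the CJK ideograph used throughout this tutorial
    u"\U0001F600",  # an emoji outside the Basic Multilingual Plane
]

# Number of bytes each character needs once encoded as UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)
```

The takeaway is that `len()` on encoded data counts bytes, not characters, so the two numbers only agree for pure ASCII text.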
If you want to know the real length of a string (or more correctly, the number of characters), you should decode your binary string into a unicode string:
>>> len(u'\u5b57')
1
>>> raw_utf8_data = '\xe5\xad\x97'
>>> len(raw_utf8_data)
3
>>> print raw_utf8_data
字
>>> raw_utf8_data.decode('utf-8')
u'\u5b57'
>>> len(raw_utf8_data.decode('utf-8'))
1
But UTF-8 isn't very efficient at storing Asian symbols, taking a whole three bytes. The eastern masses revolted at the prospect of having to buy bigger hard drives and made their own encodings.
>>> u'\u5b57'.encode('big5')
'\xa6r'
>>> u'\u5b57'.encode('big5') == '\xa6\x72'
True
>>> len(u'\u5b57'.encode('big5'))
2
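The storage trade-off is easier to judge side by side. The sketch below is written for Python 3 (the `u''` prefix is kept only to match the sessions above). The helper name `encoded_sizes` and the particular list of encodings are my own choices for illustration, not anything from a library.

```python
def encoded_sizes(text, encodings=("utf-8", "big5", "gb2312", "utf-32-le")):
    """Return {encoding name: number of bytes needed to store text}.

    Hypothetical helper for illustration; raises UnicodeEncodeError if an
    encoding cannot represent the text at all.
    """
    return dict((name, len(text.encode(name))) for name in encodings)


sizes = encoded_sizes(u"\u5b57")
for name in sorted(sizes):
    print("%-10s %d bytes" % (name, sizes[name]))
```

The regional encodings win on size for this character, but unlike UTF-8 they cannot represent text outside their own repertoire.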
Before Unicode and UTF-8 came about, most Western computers used the 'latin-1' (AKA ISO-8859-1) encoding by default. ASCII only defines 128 characters, and Latin-1 did a nice job filling in the other 128 slots available in each byte.
>>> u'\u00D8'.encode('latin-1')
'\xd8'
>>> u'\xD8'.encode('latin-1')
'\xd8'
>>> print u'\u00D8'
Ø
>>> '\xD8'.decode('latin-1')
u'\xd8'
# The first 256 characters in the Unicode standard are the same as latin-1!
>>> print '\xD8'.decode('latin-1').encode('utf-8')
Ø
Our terminal is set to UTF-8, so our poor little terminal won't know what to do if we try to print raw latin-1 binary data:
>>> print '\xD8'
�
But if you go to your terminal settings and change your encoding from UTF-8 to Latin-1 (or ISO-8859-1), it will work!
>>> print '\xD8'
Ø
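The same mismatch can be reproduced without touching terminal settings. This is a small sketch for Python 3 (with `u''` and `b''` prefixes kept to mirror the sessions above) showing both directions of the mix-up; the variable names are mine.

```python
raw_latin1 = u"\u00d8".encode("latin-1")  # one byte: 0xD8
raw_utf8 = u"\u00d8".encode("utf-8")      # two bytes: 0xC3 0x98

# Latin-1 bytes read as UTF-8: 0xD8 announces a two-byte sequence that never
# arrives, so a strict decode refuses.
try:
    raw_latin1.decode("utf-8")
    latin1_as_utf8_ok = True
except UnicodeDecodeError:
    latin1_as_utf8_ok = False

# UTF-8 bytes read as Latin-1: every byte maps to *some* character, so the
# decode "succeeds" but yields two wrong characters instead of one right one.
mojibake = raw_utf8.decode("latin-1")

print(latin1_as_utf8_ok)
print([hex(ord(ch)) for ch in mojibake])
```

The second case is the nastier one in practice, because nothing crashes and the corruption only shows up when someone reads the text.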
Sometimes when accepting data over the network, we'll run into invalid characters. In such cases you probably don't want your application to crash.
>>> happy_utf8_data = u"hello \u0040 unicode \u5b57".encode('utf-8')
>>> evil_utf8_data = happy_utf8_data + '\xff\xff\xff'
>>> print evil_utf8_data.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 19: unexpected code byte
>>> print evil_utf8_data.decode('utf-8', 'replace')
hello @ unicode 字���
>>> print evil_utf8_data.decode('utf-8', 'ignore')
hello @ unicode 字
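In a long-running service you may want both behaviours: strict decoding when the data is clean, and a logged, lossy fallback when it is not. The following is one possible sketch for Python 3 (prefixes kept to match the sessions above). The function name `decode_network_data` is hypothetical, and whether 'replace' or 'ignore' is the right fallback depends on your application.

```python
import logging


def decode_network_data(data, encoding="utf-8"):
    """Decode bytes strictly; on failure, log it and fall back to 'replace'.

    Hypothetical helper for illustration, not part of any library.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        logging.warning("bad %s input at byte %d: %s",
                        encoding, exc.start, exc.reason)
        # Each undecodable byte becomes U+FFFD, the replacement character.
        return data.decode(encoding, "replace")


happy = u"hello \u0040 unicode \u5b57".encode("utf-8")
evil = happy + b"\xff\xff\xff"

decoded_happy = decode_network_data(happy)
decoded_evil = decode_network_data(evil)
```

Logging the failure keeps a record that bad input arrived, which silent use of 'ignore' would hide.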
Python + Unicode + I/O
Remember how Python was smart enough to automatically encode to UTF-8 when printing Unicode strings to our terminal? Try not to get spoiled, because Python won't extend you that luxury in other areas:
>>> f = open('/tmp/lawl', "wb")
>>> f.write(u'\u5b57')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5b57' in position 0: ordinal not in range(128)
>>> f.close()
Python isn't a language that likes to make guesses, therefore it will assume all characters are ASCII unless otherwise specified. It's your job to explicitly say what you mean so Python will always do what you want (or crash and do nothing at all.)
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
Rather than searching Google on how to override Python's default behavior (which might break other libraries,) let's fix our code:
>>> f = open('/tmp/lawl', "wb")
>>> f.write(u'\u5b57'.encode('utf-8'))
>>> f.close()
>>> f = open('/tmp/lawl', "rb")
>>> raw_utf8_data = f.read()
>>> f.close()
>>> raw_utf8_data
'\xe5\xad\x97'
>>> raw_utf8_data.decode('utf-8')
u'\u5b57'
Python provides an easier way to accomplish the above:
>>> import codecs
>>> f = codecs.open('/tmp/lawl', 'wb', 'utf-8')
>>> f.write(u'\u5b57')
>>> f.close()
>>> f = codecs.open('/tmp/lawl', 'rb', 'utf-8')
>>> f.read()
u'\u5b57'
>>> f.close()
But you'd better not write() binary strings with non-ASCII data!
>>> import codecs
>>> f = codecs.open('/tmp/lawl', 'wb', 'utf-8')
>>> f.write('\xe5\xad\x97')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/codecs.py", line 686, in write
    return self.writer.write(data)
  File "/usr/lib/python2.6/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
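The standard library's `io.open` offers the same convenience with an `encoding` keyword (on Python 3 the built-in `open` is the same function). Here is a sketch written for Python 3; the temporary directory and file name are assumptions made so the example does not touch /tmp/lawl.

```python
import io
import os
import shutil
import tempfile

# Scratch location chosen for this example only.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "lawl.txt")

# Text mode with an explicit encoding: write() takes Unicode strings and the
# file object encodes them for us.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"\u5b57")

# Binary mode shows what actually landed on disk.
with io.open(path, "rb") as f:
    raw_bytes = f.read()

# Text mode again: read() hands back a decoded Unicode string.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

shutil.rmtree(tmp_dir)
```

Naming the encoding at the point where the file is opened keeps the encode and decode steps out of the rest of the code.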
Latin-1, or its close cousin Windows-1252 ('cp1252'), is generally the default encoding used by Western Windows machines. If your boss sends you a text file with non-ASCII characters and importing it into your database causes Python to go insane, you might want to try the following:
happy_unicode_data = open('tps_report.txt').read().decode('latin-1')
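If the file came from Windows and contains curly quotes or long dashes, 'cp1252' is often the better guess, because Windows-1252 assigns printable characters to bytes that Latin-1 treats as control codes. A small sketch for Python 3 (the sample bytes are made up for illustration):

```python
# Bytes as a Western Windows program might write them (invented sample):
# 0x93 and 0x94 are curly double quotes in Windows-1252, 0x96 is an en dash.
windows_bytes = b"\x93TPS report\x94 \x96 due Friday"

as_latin1 = windows_bytes.decode("latin-1")
as_cp1252 = windows_bytes.decode("cp1252")

# Latin-1 never fails, but maps those bytes to invisible C1 control codes.
print([hex(ord(ch)) for ch in as_latin1[:1]])
# cp1252 maps them to the punctuation the author actually typed.
print([hex(ord(ch)) for ch in as_cp1252[:1]])
```

Both decodes succeed, so the only way to tell which was right is to look at the result.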
UTF-8 is becoming the de facto standard codec and should be used whenever possible because it is universal and able to store any written character. Most Linux distributions such as Ubuntu already use UTF-8 as their system-wide default.
Mixing Binary and Unicode Strings
When your strings contain only ASCII, it's unnecessary to put a 'u' in front of every single string.
>>> 'hello ' + 'there'
'hello there'
>>> 'hello ' + u'there'
u'hello there'
>>> 'hello ' + u'André'
u'hello Andr\xe9'
Python will turn your everyday binary strings into Unicode strings when necessary. But things get trickier if you put non-ASCII characters in byte strings.
>>> 'hello ' + 'André'
'hello Andr\xc3\xa9'
# I copy/pasted 'é' as a UTF-8 character
>>> 'hello'.encode('utf-8')  # this is incorrect, but python lets it slide because 'hello' is ASCII
'hello'
>>> ('hello ' + 'André').encode('utf-8')  # but don't get spoiled!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)
It's safest to put your non-ASCII characters into Unicode strings. This way, other Python libraries that do I/O don't have to guess which encoding your binary strings use:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging

def say_hello(name):
    return "hello %s" % (name)

logging.warn(say_hello('Jill'))
logging.warn(say_hello(" ".join(['John', u'Doe'])))
logging.warn(say_hello(u'André'))
logging.warn(say_hello('André'))  # stop being spoiled!

jart@compy:~$ python test.py
WARNING:root:hello Jill
WARNING:root:hello John Doe
WARNING:root:hello André
Traceback (most recent call last):
  File "/usr/lib/python2.6/logging/__init__.py", line 773, in emit
    stream.write(fs % msg.encode("UTF-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)
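One way to stop guessing is to decode byte strings once, at the edge of your code, and work with Unicode from there on. The sketch below is written for Python 3 (prefixes kept to mirror the code above). `ensure_text` is a hypothetical helper, and assuming UTF-8 for incoming bytes is exactly that: an assumption you should replace with whatever your data source really uses.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding it if it is a byte string.

    Hypothetical helper; the default encoding is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def say_hello(name):
    # The template is Unicode, and so is everything substituted into it.
    return u"hello %s" % ensure_text(name)


greetings = [
    say_hello(b"Jill"),
    say_hello(u"Andr\u00e9"),
    say_hello(b"Andr\xc3\xa9"),  # UTF-8 bytes, decoded at the boundary
]
```

With this in place, both spellings of the accented name produce the same Unicode result instead of one of them blowing up later inside a library.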
Using Unicode Characters in Python Source Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# this will crash if Python isn't able to figure out what encodings
# your terminal supports
print u"hello @ unicode 字"

# this is safer if you KNOW your terminal will support UTF-8, or would
# rather just have it not crash and print gibberish
print u"hello @ unicode 字".encode('utf-8')

# same concept applies, but you lose the benefits of unicode strings
# most importantly Python will assume byte strings are ASCII the moment
# they hit any standard I/O
print "hello @ unicode 字"

jart@compy:~$ python test.py
hello @ unicode 字
hello @ unicode 字
hello @ unicode 字
jart@compy:~$ python test.py >/dev/null
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print u"hello @ unicode 字"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5b57' in position 16: ordinal not in range(128)
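The redirected run fails because there is no terminal to ask for an encoding. One way around that is to write through a stream wrapper that always encodes to UTF-8. The sketch below, written for Python 3, uses `io.BytesIO` as a stand-in for the redirected output so the result can be inspected; wrapping the real standard output is left out on purpose because the details differ between Python versions.

```python
import io

# Stand-in for a byte-oriented destination such as a pipe or a file.
destination = io.BytesIO()

# The wrapper accepts Unicode strings and encodes them as UTF-8 on the way
# out. newline="\n" turns off platform-specific newline translation.
out = io.TextIOWrapper(destination, encoding="utf-8", newline="\n")
out.write(u"hello @ unicode \u5b57\n")
out.flush()

raw_output = destination.getvalue()
```

Because the encoding is fixed in one place, the same code behaves identically whether its output goes to a terminal, a file, or nowhere at all.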
Some editors will help you out and make sure your source code is encoded properly.
Python + Unicode + HTML
If you have a web application and accept user input from a form, you should send the HTTP header "Content-Type: text/html; charset=UTF-8" so the browser knows to send you UTF-8 data. To be extra sure, you can add the following to your <head> section:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
HTML numeric character references are usually written in base 10 (for example &#214;), while Python's \u escapes use hex. If you use the Unicode Browser reference link below, you'll need to use a calculator to convert them to hex.
>>> print u"&#214; in HTML should be \u00D6 in Python"
&#214; in HTML should be Ö in Python
>>> hex(214)
'0xd6'
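Python can do the calculator's job. The following sketch, written for Python 3, expands numeric character references in both their decimal and hexadecimal forms. The function name `expand_numeric_refs` is hypothetical and the regular expression only covers numeric references, not named ones like `&amp;`.

```python
import re

# Matches decimal (&#214;) and hexadecimal (&#xD6;) character references.
_NUMERIC_REF = re.compile(r"&#([xX][0-9a-fA-F]+|[0-9]+);")


def expand_numeric_refs(html_text):
    """Replace numeric character references with the characters they name.

    Hypothetical helper for illustration only.
    """
    def _replace(match):
        body = match.group(1)
        if body[0] in "xX":
            code_point = int(body[1:], 16)
        else:
            code_point = int(body, 10)
        return chr(code_point)

    return _NUMERIC_REF.sub(_replace, html_text)


expanded = expand_numeric_refs("&#214; and &#xD6;")
print(hex(214))
```

For real HTML, Python 3's standard library also ships `html.unescape`, which handles named entities as well.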
Bullet-Proof Type Casting of Strings
In large Python projects, you might write a function that can be passed a wide variety of types, including byte strings, unicode strings, objects, numbers, etc. Sometimes it simply isn't practical to be super-strict and you just want to turn an arbitrary Python term into a byte string or a unicode string, and have it just work. (If you use Django, similar methods can be found in django.utils.encoding.)
# Copyright (c) 2009 Lobstertech, Inc.
# Licensed MIT

import types

def smart_unicode(s, encoding='utf-8', errors='strict'):
    if type(s) in (unicode, int, long, float, types.NoneType):
        return unicode(s)
    elif type(s) is str:
        return unicode(s, encoding, errors)
    elif hasattr(s, '__unicode__'):
        # unicode(obj, encoding) only accepts byte strings, so let the
        # object's own __unicode__ method do the conversion
        return unicode(s)
    else:
        return unicode(str(s), encoding, errors)

def smart_str(s, encoding='utf-8', errors='strict', from_encoding='utf-8'):
    if type(s) in (int, long, float, types.NoneType):
        return str(s)
    elif type(s) is str:
        if encoding != from_encoding:
            # re-encode byte strings that arrived in a different encoding
            return s.decode(from_encoding, errors).encode(encoding, errors)
        else:
            return s
    elif type(s) is unicode:
        return s.encode(encoding, errors)
    elif hasattr(s, '__unicode__'):
        # checked before __str__, which nearly every object has
        return smart_str(unicode(s), encoding, errors, from_encoding)
    elif hasattr(s, '__str__'):
        return smart_str(str(s), encoding, errors, from_encoding)
    else:
        return smart_str(str(s), encoding, errors, from_encoding)
# -*- coding: utf-8 -*-
# Copyright (c) 2009 Lobstertech, Inc.
# Licensed MIT
#
# py.test -xl test_smart_encoding.py

import py.test
from smart_encoding import smart_unicode, smart_str

def test_smart_str():
    assert type(smart_str('hello world')) is str
    assert smart_str('hello world') == 'hello world'
    assert smart_str(u'hello world') == u'hello world'
    assert type(smart_str(u'hello world')) is str
    assert smart_str(u"\u96c6") == '\xe9\x9b\x86'
    assert smart_str(u"\u96c6", "big5") == '\xb6\xb0'
    py.test.raises(UnicodeDecodeError, lambda: smart_str('\xb6\xb0', "big5"))
    py.test.raises(UnicodeDecodeError, lambda: smart_str('\xb6\xb0', "ascii"))
    assert smart_str('\xb6\xb0', "big5", errors="ignore") == ''
    assert smart_str('\xb6\xb0', "big5", errors="replace") == '??'
    assert smart_str('hello \xb6\xb0', "ascii", errors="replace") == 'hello ??'
    assert smart_str('\xe9\x9b\x86 \xb6\xb0', "big5", errors="replace") == '\xb6\xb0 ??'
    assert smart_str('\xb6\xb0', "big5", from_encoding="big5") == '\xb6\xb0'
    assert smart_str('\xb6\xb0', "utf-8", from_encoding="big5") == '\xe9\x9b\x86'

def test_smart_unicode():
    assert type(smart_unicode('hello world')) is unicode
    assert smart_unicode('hello world') == u'hello world'
    assert smart_unicode(u'hello world') == u'hello world'
    assert type(smart_unicode(u'hello world')) is unicode
    assert smart_unicode(u"\u96c6") == u"\u96c6"
    assert smart_unicode(u"\u96c6", "big5") == u"\u96c6"
    assert smart_unicode("\xb6\xb0", "big5") == u"\u96c6"
    assert smart_unicode("hi \xa3", "latin-1") == u'hi \xa3'
    assert smart_unicode("hi \xa3", "latin-1") == u'hi \u00a3'
    py.test.raises(UnicodeDecodeError, lambda: smart_unicode("hi \xa3", "ascii"))
    assert smart_unicode("hi \xa3", "ascii", errors="ignore") == u"hi "
    assert smart_unicode("hi \xa3", "ascii", errors="replace") == u"hi \ufffd"  # unicode question mark

def test_object():
    class Lawl(object):
        pass
    class Mog:
        pass
    assert type(smart_unicode(Lawl())) is unicode
    assert smart_unicode(Lawl()).startswith('<')
    assert type(smart_unicode(Mog())) is unicode
    assert smart_unicode(Mog()).startswith('<')

    class Hurt:
        def __str__(self):
            return '\xe9\x9b\x86'
    assert smart_unicode(Hurt()) == u"\u96c6"
    assert smart_str(Hurt()) == '\xe9\x9b\x86'

    class TheHurting:
        def __str__(self):
            return '\xb6\xb0'
    assert smart_unicode(TheHurting(), 'big5') == u"\u96c6"
Python + Unicode + Databases
PostgreSQL uses UTF-8 by default and has never given me any problems. MySQL has historically defaulted to Latin-1 and has a compulsive need to magically re-encode your data every possible chance it gets. Latin-1 is a very undesirable charset to be restricted to if you are writing a web application, because you lose the ability to store non-Western characters and cute little symbols like: ☃ Worst of all, MySQL will turn any invalid characters into '?' marks, silently corrupting your data.
When writing this article, I discovered that when I created this website in Django, even after setting UTF-8 as my default charset in Django, and even after I thought I had bullied MySQL into using UTF-8 by default, somehow all my tables used this weird Swedish variant of Latin-1 by default.
If you find yourself in a similar situation and switching to PostgreSQL is out of the question, you might want to try the following to get your MySQL database to properly handle UTF-8:
Back up your database, and to be safe, import the backup into a standby database in case a disaster happens.
If you don't need to worry about migrating lots of data with latin1 characters and just want it to work now, do this:
- Run 'alter database charset = utf8;' while connected to each of your databases.
- Run 'alter table <table> charset = utf8;' on each of your tables, substituting the table name.
- But that's not all! Go into phpMyAdmin, go into every single table, and then into every single field, and change the 'collation' to 'utf8_unicode_ci'. (The '_ci' means MySQL will continue to do its case-insensitive thing.)
- There is also a setting somewhere in the system that defines the default charset for clients connecting to the database. You might not need to change this, as I believe most web frameworks like Django will set the encoding automatically each time they establish a connection.
- You might also want to edit my.cnf:
; /etc/my.cnf
[mysqld]
; ...
default-character-set=utf8
default-collation=utf8_general_ci
; ...
[client]
default-character-set=utf8
- If you're using PHP you're still probably not going to see UTF-8 characters no matter what you do. If you're using PEAR DB you can fix it with the following after connecting:
$DB->Query("SET CHARACTER SET UTF8");
$DB->Query("SET NAMES UTF8");
When UTF8 and Latin-1 Get Mixed Together :(
If you need to migrate a large set of latin-1 data and your database turns out to contain both latin1 AND utf-8 data, you're in big trouble. If you do a SQL dump, read the file in Python, and decode it as utf8, you'll get unicode decode errors. If you decode it as latin1, you'll get corrupted characters. But you're in luck! I have a happy little script which might help you:
#!/usr/bin/env python
#
# convert_mixed_utf_latin1.py
#
# This script will read 'backup.sql' and decode it as utf8.
# If it comes across any characters it can't decode, it
# will assume they are latin1.
#
# It will then output a file that is fully utf8 named 'backup.new.sql'
#

data = open('backup.sql').read()
final = []
while True:
    try:
        final.append(data.decode('utf8'))
        break
    except UnicodeDecodeError, exc:
        print "oh snap: %r -> %r" % (
            data[exc.start],
            data[exc.start].decode('latin1').encode('utf8'))
        # everything up to crazy character should be good
        final.append(data[:exc.start].decode('utf8'))
        # crazy character is probably latin1
        final.append(data[exc.start].decode('latin1'))
        # remove already encoded stuff
        data = data[exc.start+1:]

f = open('backup.new.sql', 'wb')
f.write("".join(final).encode('utf8'))
f.close()
That script is an enhanced version of decode() that falls back to latin1 when the normal decode() routine gets confused. Let's try it below, and also use sed to change the table charsets from latin1 to utf8 as well:
jart@compy:~$ mysqldump --opt -u root happydb >backup.sql
jart@compy:~$ echo 'create database happydb_new;' | mysql -uroot
jart@compy:~$ convert_mixed_utf_latin1.py
jart@compy:~$ sed -i -e 's/latin1/utf8/g' backup.new.sql
jart@compy:~$ mysql -uroot happydb_new <backup.new.sql
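The fallback logic is easier to test when it lives in a function that works on in-memory data instead of files. Here is a sketch of the same idea written for Python 3 (prefixes kept to match the rest of the tutorial). The function name `decode_mixed` and the sample bytes are mine, and the approach keeps the script's main limitation: a Latin-1 byte sequence that happens to be valid UTF-8 will be read as UTF-8.

```python
def decode_mixed(data):
    """Decode bytes as UTF-8, treating each undecodable byte as Latin-1.

    Hypothetical helper mirroring the conversion script's approach.
    """
    pieces = []
    while True:
        try:
            pieces.append(data.decode("utf-8"))
            return u"".join(pieces)
        except UnicodeDecodeError as exc:
            # Everything before the offending byte is valid UTF-8.
            pieces.append(data[:exc.start].decode("utf-8"))
            # The offending byte itself is assumed to be Latin-1.
            pieces.append(data[exc.start:exc.start + 1].decode("latin-1"))
            # Carry on with whatever follows it.
            data = data[exc.start + 1:]


# "café" stored as UTF-8 followed by "naïve" stored as Latin-1.
mixed = u"caf\u00e9 ".encode("utf-8") + u"na\u00efve".encode("latin-1")
repaired = decode_mixed(mixed)
```

Running it on a handful of rows you can check by eye is a cheap sanity test before trusting it with a full dump.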
Hands On With Asian Spam
If your boss ever asks you to write a ticket system that automatically imports emails into a database, one of your most crucial responsibilities is making sure that what Asian spammers have to say is rendered properly.
Here is a piece of spam I received recently. (Note: I saved this email to a file from Thunderbird. Because Thunderbird renders to UTF-8, had I viewed the source and copied/pasted the email, this would not work because I would have been copy/pasting UTF-8 characters rather than the goofy encoding this spammer is using.)
- chinese_spam.txt (12KB)
If we download the above file and try to view it in our UTF-8 terminal, we will see gibberish.
This is because the email is encoded in GB2312. You can tell by looking at the email headers:
Content-Type: text/plain;charset="GB2312"
Python has tools for making sense out of these emails.
>>> import email
>>> msg = email.message_from_file(open('chinese_spam.txt'))
>>> text = msg.get_payload()
>>> msg.get_content_charset()
'gb2312'
>>> unicode_text = text.decode('gb2312')
>>> type(unicode_text) == unicode
True
>>> print unicode_text.encode('utf-8')
[...]
◆工作经验:
10多年高科技企业产品研发和研发管理工作经历,先后担任过项目经理、研究管理部经
理、开发部经理等职位,在长期的产品研发管理实践中积累了丰富的技术和管理经验。在华
[...]
But the subject header is also encoded!
>>> print msg['Subject']
=?GB2312?B?svrGt9HQt6K8sLy8yvXIy9SxusvQxLncwO28vMTc0bXBtw==?=
>>> msg['Subject'].decode(msg.get_content_charset()).encode('utf-8')
'=?GB2312?B?svrGt9HQt6K8sLy8yvXIy9SxusvQxLncwO28vMTc0bXBtw==?='
It seems our decoding powers no longer apply, because the Content-Type charset was specified in a header and only covers the message portion (payload) of the email. The subject is itself a header, so that charset doesn't apply to it.
If you look closely, the subject appears to be encoded in Base64 and tells you which encoding to use. Thankfully Python comes to the rescue once again!
>>> import email.Header
>>> (text, encoding) = email.Header.decode_header(msg['Subject'])[0]
>>> print text
��Ʒ�з���������Ա���Ĺ�����ѵ�
>>> type(text) is unicode
False
>>> encoding
'gb2312'
>>> print text.decode(encoding).encode('utf-8')
产品研发及技术人员核心管理技能训练
Don't forget that in practice (as discussed above), you'll probably want to use decode(encoding, 'replace'), because every once in a while you'll receive legitimate emails that contain one or two invalid characters.
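A header can contain several encoded chunks, each with its own charset, so it helps to loop over everything `decode_header` returns. The sketch below is written for Python 3, where the module is spelled `email.header`. The function name `header_to_unicode` and the ASCII fallback for chunks without a declared charset are my own assumptions.

```python
from email.header import decode_header


def header_to_unicode(raw_header, fallback="ascii"):
    """Decode an RFC 2047 style header value into one Unicode string.

    Hypothetical helper; undecodable bytes become U+FFFD instead of
    raising, as suggested above for real-world mail.
    """
    parts = []
    for text, charset in decode_header(raw_header):
        if isinstance(text, bytes):
            text = text.decode(charset or fallback, "replace")
        parts.append(text)
    return u"".join(parts)


# The subject line from the spam message discussed above.
subject = "=?GB2312?B?svrGt9HQt6K8sLy8yvXIy9SxusvQxLncwO28vMTc0bXBtw==?="
decoded_subject = header_to_unicode(subject)
```

The same helper works for other headers such as From and To, which are encoded the same way.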
Advanced: The full intricacies of parsing email are beyond the scope of this tutorial. Here are some quick tips that might help you move in the right direction a little quicker:
- msg.get_payload(decode=1): Doesn't decode characters, but rather handles Content-Transfer-Encoding: quoted-printable, which is used in plain-text emails for basic formatting. If your emails contain a lot of crazy '=' signs, you might want to use this.
- for part in msg.walk(): Helps you deal with multi-part MIME messages. Each part has the same methods, like get_content_charset, that we used above.
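Put together, the two tips might look like the sketch below, written for Python 3. It walks every part of a message, undoes the transfer encoding, and then decodes the characters using each part's declared charset. The function name `text_parts`, the ASCII default, and the tiny hand-built sample message are all assumptions for illustration.

```python
import base64
import email


def text_parts(msg, default_charset="ascii"):
    """Return the decoded text of every text/* part in an email message.

    Hypothetical helper for illustration only.
    """
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips multipart containers and attachments
        # decode=True undoes base64 / quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or default_charset
        texts.append(payload.decode(charset, "replace"))
    return texts


# A minimal single-part message built by hand for the example.
body = base64.b64encode(u"\u4ea7\u54c1".encode("gb2312")).decode("ascii")
raw_message = (
    'Content-Type: text/plain; charset="GB2312"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n" + body + "\n"
)
parts = text_parts(email.message_from_string(raw_message))
```

Keeping the transfer-encoding step and the character-decoding step separate mirrors how the message itself is layered.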