Making Sense of Python Unicode

June 7, 2009

Author: jart
Article Tags: django, i18n, latin1, mysql, python, tutorial, unicode, utf8, web

There are many Unicode tutorials available for Python, but few offer an understanding of how Unicode works or how it's applied in the real world. This tutorial will teach you what you need to know about Unicode by example, as succinctly as possible.

Unicode support in Python 2.4 through 2.6 is difficult to understand because it's counter-intuitive, full of gotchas, and Python will not hesitate to let you know whenever the slightest thing goes wrong. (If you're using Python 3.0, stop reading right now! The way all this works has changed!) If you're an American programmer like me, with very little use for non-ASCII characters, it is very easy to write and test an online web service that fails in production the moment someone speaking a foreign language tries to use your application.

Note: We will not cover UTF-16 because it introduces advanced concepts like BOMs and endianness, and I have rarely had to use this crazy encoding in my particular line of work.

Getting Started

Open up a Python console, and make sure your terminal supports UTF-8.

Strings aren't a list of characters like a normal person would expect. They really just contain binary data:

>>> print "@ symbol as ASCII Hex: \x40"
@ symbol as ASCII Hex: @
>>> print "@ symbol as ASCII Octal: \100"
@ symbol as ASCII Octal: @
>>> print "null char \x00 is ok"
null char  is ok

"Unicode strings" in Python are not binary data. They are a list of characters.

Non-ASCII characters, like Chinese characters and Western characters with accents, can be represented as binary data in a variety of ways.

To send a Unicode string anywhere (a terminal, a file, the network), you have to turn it into a binary representation. Our terminal is set to 'UTF-8' so let's use that:

>>> print u"hello kitty".encode('utf-8')
hello kitty
>>> print u"hello \u0040 unicode \u5b57".encode('utf-8')
hello @ unicode 字
>>> u'\u5b57'.encode('utf-8')
'\xe5\xad\x97'

Python is smart enough to know that my terminal is configured for UTF-8 and can do this automatically.

>>> print u"\u5b57"
字

You can differentiate between these two types of strings.

>>> type(u"\u5b57") is unicode
True
>>> isinstance(u"\u5b57", unicode)
True
>>> isinstance(u"\u5b57", str)
False
>>> isinstance(u"\u5b57", basestring)
True
>>> isinstance('\xe5\xad\x97', basestring)
True
>>> isinstance(u"\u5b57".encode('utf-8'), str)
True

"Encodings" like UTF-8 were invented because Unicode has room for over a million characters (code points run from U+0000 to U+10FFFF). Stored at a fixed width, each character would need to take up 32 bits rather than 8 bits. The masses revolted at the idea of having to purchase larger hard drives to store their Word documents.

UTF-8 is the most commonly used because it can represent any Unicode symbol and is backwards compatible with ASCII.

>>> u'@'.encode('ascii') == '@'
True
>>> u'@'.encode('utf-8') == '@'
True
>>> hex(ord("@")) == '0x40'
True
>>> print '\x40'
@
>>> print u'\x40'
@

Unlike evil encodings like UTF-16.

>>> u'@'.encode('utf-16') == '@'
False

But UTF-8 has a dark side: a single character can take up anywhere from one to four bytes to represent in binary. (The original design allowed sequences of up to six bytes, but the standard now stops at four.)

>>> len(u'\u0040'.encode('utf-8'))
1
>>> len(u"\u5b57".encode('utf-8'))
3
>>> u'\u5b57'.encode('utf-8')
'\xe5\xad\x97'
>>> u'\u0040'.encode('utf-8')
'@' # the same byte as '\x40'
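To see the variable width for yourself, here is a small sketch that reports how many bytes UTF-8 needs for a few characters. The helper name utf8_width is made up for this example, and nothing here is special to one Python version: it should run the same on Python 2.6 and on Python 3.

```python
def utf8_width(char):
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(char.encode('utf-8'))


samples = [
    (u'\u0040', 'at sign'),        # plain ASCII
    (u'\u00e9', 'e with acute'),   # accented Western letter
    (u'\u5b57', 'CJK ideograph'),  # the character used throughout this article
    (u'\U0001F600', 'emoji'),      # a code point above U+FFFF
]

widths = [utf8_width(char) for char, _ in samples]
# widths is [1, 2, 3, 4]
```

Four bytes is the widest sequence that valid UTF-8 produces today, so code points above U+FFFF are as bad as it gets.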

If you want to know the real length of a string (or, more correctly, the number of characters), you should decode your binary string into a unicode string:

>>> len(u'\u5b57')
1
>>> raw_utf8_data = '\xe5\xad\x97'
>>> len(raw_utf8_data)
3
>>> print raw_utf8_data
字
>>> raw_utf8_data.decode('utf-8')
u'\u5b57'
>>> len(raw_utf8_data.decode('utf-8'))
1

But UTF-8 isn't very efficient at storing Asian symbols, taking a whole three bytes. The eastern masses revolted at the prospect of having to buy bigger hard drives and made their own encodings.

>>> u'\u5b57'.encode('big5')
'\xa6\x72'
>>> len(u'\u5b57'.encode('big5'))
2
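If you're curious how the trade-off plays out, a rough sketch like the one below compares the encoded size of the same text under a few codecs. The function name encoded_sizes and the choice of codecs are just assumptions for illustration; all three codecs ship with Python.

```python
def encoded_sizes(text, encodings=('utf-8', 'big5', 'gb2312')):
    """Map each encoding name to the number of bytes it needs for text."""
    return dict((name, len(text.encode(name))) for name in encodings)


sizes = encoded_sizes(u'\u5b57')
# sizes is {'utf-8': 3, 'big5': 2, 'gb2312': 2}

ascii_sizes = encoded_sizes(u'hello')
# plain ASCII costs one byte per character in all three
```

The catch is that the regional encodings only cover their own region: ask big5 to encode a character outside its repertoire and you get a UnicodeEncodeError.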

Before Unicode and UTF-8 came about, most western computers used the 'latin-1' (AKA ISO-8859-1) encoding by default. ASCII only defines 128 characters and Latin-1 did a nice job filling in the other 128 slots available in each byte.

>>> u'\u00D8'.encode('latin-1')
'\xD8'
>>> u'\xD8'.encode('latin-1')
'\xD8'
>>> print u'\u00D8'
Ø
>>> '\xD8'.decode('latin-1')
u'\xD8' # The first 256 characters in the Unicode standard are the same as latin-1!
>>> print '\xD8'.decode('latin-1').encode('utf-8')
Ø
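That claim about the first 256 characters is easy to check. The sketch below (variable names are my own, and the b''-free construction is only there so it runs on both Python 2.6 and Python 3) decodes every possible byte value as Latin-1 and compares the resulting code points with the byte values:

```python
# every possible single-byte value, 0x00 through 0xFF
all_bytes = bytes(bytearray(range(256)))

# decoding as Latin-1 never fails: each byte maps to exactly one character
text = all_bytes.decode('latin-1')

# the code point of each character equals the byte it came from
code_points = [ord(char) for char in text]
same_numbers = code_points == list(range(256))

# and encoding gets back the original bytes
round_trips = text.encode('latin-1') == all_bytes
```

This is also why decoding as Latin-1 never raises an error, even on data that is really something else. You just get the wrong characters.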

Our terminal is set to UTF-8, so our poor little terminal won't know what to do if we try to print raw latin-1 binary data:

>>> print '\xD8'

But if you go to your terminal settings, and change your encoding from UTF-8 to Latin-1 (or ISO-8859-1) it will work!

>>> print '\xD8'
Ø

Sometimes when accepting data over the network, we'll run into invalid characters. In such cases you probably don't want your application to crash.

>>> happy_utf8_data = u"hello \u0040 unicode \u5b57".encode('utf-8')
>>> evil_utf8_data = happy_utf8_data + '\xff\xff\xff'
>>> print evil_utf8_data.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 19: unexpected code byte
>>> print evil_utf8_data.decode('utf-8', 'replace')
hello @ unicode 字���
>>> print evil_utf8_data.decode('utf-8', 'ignore')
hello @ unicode 字
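If you'd like to know when data was damaged instead of quietly papering over it, one possible approach is to try a strict decode first and only fall back to 'replace' on failure. This is a minimal sketch; safe_decode is a hypothetical helper, not something from the standard library.

```python
def safe_decode(raw, encoding='utf-8'):
    """Decode raw bytes and report whether the data was clean.

    Returns a (text, was_clean) pair. Invalid bytes are replaced with
    U+FFFD, the Unicode replacement character.
    """
    try:
        return raw.decode(encoding), True
    except UnicodeDecodeError:
        return raw.decode(encoding, 'replace'), False


good_text, good_clean = safe_decode(b'hello \xe5\xad\x97')
bad_text, bad_clean = safe_decode(b'hello \xff')
# bad_text is u'hello \ufffd' and bad_clean is False
```

The flag gives you a hook for logging the offending input, which is handy when you're trying to work out which client is sending you junk.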

Python + Unicode + I/O

Remember how Python was smart enough to automatically encode to UTF-8 when printing Unicode strings to our terminal? Try not to get spoiled because Python won't extend you that luxury in other areas:

>>> f = open('/tmp/lawl', "wb")
>>> f.write(u'\u5b57')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5b57' in position 0: ordinal not in range(128)
>>> f.close()

Python isn't a language that likes to make guesses, therefore it will assume all characters are ASCII unless otherwise specified. It's your job to explicitly say what you mean so Python will always do what you want (or crash and do nothing at all.)

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

Rather than searching Google on how to override Python's default behavior (which might break other libraries,) let's fix our code:

>>> f = open('/tmp/lawl', "wb")
>>> f.write(u'\u5b57'.encode('utf-8'))
>>> f.close()
>>> f = open('/tmp/lawl', "rb")
>>> raw_utf8_data = f.read()
>>> f.close()
>>> raw_utf8_data
'\xe5\xad\x97'
>>> raw_utf8_data.decode('utf-8')
u'\u5b57'

Python provides an easier way to accomplish the above:

>>> import codecs
>>> f = codecs.open('/tmp/lawl', 'wb', 'utf-8')
>>> f.write(u'\u5b57')
>>> f.close()
>>> f = codecs.open('/tmp/lawl', 'rb', 'utf-8')
>>> f.read()
u'\u5b57'
>>> f.close()
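Here is the same round trip as a self-contained sketch, so you can see both views of the file at once: the raw bytes on disk and the decoded text. The function name write_then_read is made up, and it uses a temporary file instead of /tmp/lawl so it cleans up after itself.

```python
import codecs
import os
import tempfile


def write_then_read(text, encoding='utf-8'):
    """Write text to a scratch file; return (raw bytes on disk, decoded text)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # codecs.open encodes unicode strings for us on write
        f = codecs.open(path, 'wb', encoding)
        try:
            f.write(text)
        finally:
            f.close()

        # a plain binary open shows what actually landed on disk
        f = open(path, 'rb')
        try:
            raw = f.read()
        finally:
            f.close()

        # codecs.open decodes for us on read
        f = codecs.open(path, 'rb', encoding)
        try:
            decoded = f.read()
        finally:
            f.close()
    finally:
        os.remove(path)
    return raw, decoded


raw, decoded = write_then_read(u'\u5b57')
```

The try/finally blocks are only there so the file handles get closed even if something raises; on Python 2.6 and later a with statement would do the same job more tidily.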

But you'd better not write() binary strings with non-ASCII data!

>>> import codecs
>>> f = codecs.open('/tmp/lawl', 'wb', 'utf-8')
>>> f.write('\xe5\xad\x97')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/codecs.py", line 686, in write
    return self.writer.write(data)
  File "/usr/lib/python2.6/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

Latin-1 (or its close cousin Windows-1252) is generally the default encoding used by Windows machines in the West. If your boss sends you a text file with non-ASCII characters and importing it into your database causes Python to go insane, you might want to try the following:

happy_unicode_data = open('tps_report.txt').read().decode('latin-1')

UTF-8 is slowly becoming the de-facto standard codec and should be used whenever possible because it is universal and able to store any written character. Most Linux distributions such as Ubuntu already use UTF-8 as their system-wide default.

Mixing Binary and Unicode Strings

American programmers: In your day to day grind, it's superfluous to put a 'u' in front of every single string.

>>> 'hello ' + 'there'
'hello there'
>>> 'hello ' + u'there'
u'hello there'
>>> 'hello ' + u'André'
u'hello Andr\xe9'

Python will turn your everyday binary strings into Unicode strings when necessary. But things get trickier if you put non-ASCII characters in byte strings.

>>> 'hello ' + 'André'
'hello Andr\xc3\xa9' # I copy/pasted 'é' as a UTF-8 character
>>> 'hello'.encode('utf-8') # this is incorrect, but python lets it slide because 'hello' is ASCII
'hello'
>>> ('hello ' + 'André').encode('utf-8') # but don't get spoiled!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)

It's safest to put your non-ASCII characters into Unicode strings. This way, other Python libraries that do I/O don't have to guess which encoding your binary strings use:

File: test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging

def say_hello(name):
    return "hello %s" % (name)

logging.warn(say_hello('Jill'))
logging.warn(say_hello(" ".join(['John', u'Doe'])))
logging.warn(say_hello(u'André'))
logging.warn(say_hello('André')) # stop being spoiled!
jart@compy:~$ python test.py
WARNING:root:hello Jill
WARNING:root:hello John Doe
WARNING:root:hello André
Traceback (most recent call last):
  File "/usr/lib/python2.6/logging/__init__.py", line 773, in emit
    stream.write(fs % msg.encode("UTF-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)
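One way to keep that last call from blowing up is to decode byte strings at the edge of your function, so everything past that point is Unicode. The sketch below is a hypothetical rewrite of say_hello; the assumption that incoming byte strings are UTF-8 is mine, and you'd pass a different encoding if your data says otherwise. It relies on the bytes name, which exists from Python 2.6 onward.

```python
def say_hello(name, encoding='utf-8'):
    """Build a greeting, accepting either a unicode or a byte string."""
    # Assumption: byte strings handed to us are UTF-8 unless told otherwise.
    if isinstance(name, bytes):
        name = name.decode(encoding)
    return u"hello %s" % name


from_bytes = say_hello(b'Andr\xc3\xa9')   # UTF-8 bytes for 'André'
from_text = say_hello(u'Andr\xe9')        # already unicode
```

Both calls produce the same unicode string, so the logging module never has to guess.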

Using Unicode Characters in Python Source Code

File: test.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# this will crash if Python isn't able to figure out what encodings
# your terminal supports
print u"hello @ unicode 字"

# this is safer if you KNOW your terminal will support UTF-8, or would
# rather just have it not crash and print gibberish
print u"hello @ unicode 字".encode('utf-8')

# same concept applies, but you lose the benefits of unicode strings
# most importantly Python will assume byte strings are ASCII the moment
# they hit any standard I/O
print "hello @ unicode 字"
jart@compy:~$ python test.py
hello @ unicode 字
hello @ unicode 字
hello @ unicode 字
jart@compy:~$ python test.py >/dev/null
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print u"hello @ unicode 字"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5b57' in position 16: ordinal not in range(128)
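The crash happens because a redirected stdout has no encoding, so Python 2 falls back to ASCII. A possible workaround is to pick the encoding yourself: use whatever the stream reports, and fall back to UTF-8 when it reports nothing. This is only a sketch; encode_for_stream and the two fake stream classes are invented for the example.

```python
def encode_for_stream(text, stream, fallback='utf-8'):
    """Encode text for a stream, guessing UTF-8 when the stream won't say."""
    encoding = getattr(stream, 'encoding', None) or fallback
    # 'replace' swaps unencodable characters for '?' rather than raising
    return text.encode(encoding, 'replace')


class FakePipe(object):
    encoding = None      # stands in for stdout redirected to a file or pipe


class FakeAsciiTerminal(object):
    encoding = 'ascii'   # stands in for a terminal that only speaks ASCII


piped = encode_for_stream(u'hello \u5b57', FakePipe())
plain = encode_for_stream(u'hello \u5b57', FakeAsciiTerminal())
```

In a Python 2 script you would then write print encode_for_stream(text, sys.stdout) and get the same behavior whether or not the output is redirected.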

Some editors will help you out and make sure your source code is encoded properly.

[Image: Emacs Unicode Bad Save]

Python + Unicode + HTML

If you have a web application and accept user input from a form, you should send the HTTP header "Content-Type: text/html; charset=UTF-8" with your pages to let the browser know to send you UTF-8 data. To be extra sure you can add the following to your <head> section:

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />

HTML defines Unicode entities in base 10 (decimal), while Python escapes like \u00D6 are in hex. If you use the Unicode Browser reference link below, you'll need to use a calculator to convert them to hex.

>>> print u"&#0214; in HTML should be \u00D6 in Python"
&#0214; in HTML should be Ö in Python
>>> hex(214) # not hex(0214): a leading zero makes Python 2 read the number as octal
'0xd6'
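You can also let Python do the entity arithmetic. The sketch below leans on the 'xmlcharrefreplace' error handler, which is part of the standard codecs machinery, to go from characters to decimal entities, and a small regular expression to go back. The two function names are my own, and the reverse direction only handles decimal entities, not named ones like &amp;.

```python
import re

try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3 folded unichr into chr


def to_entities(text):
    """Encode text as ASCII bytes, using decimal entities for the rest."""
    return text.encode('ascii', 'xmlcharrefreplace')


def from_decimal_entities(markup):
    """Replace decimal entities such as &#214; with the character itself."""
    return re.sub(u'&#([0-9]+);',
                  lambda match: unichr(int(match.group(1))),
                  markup)


as_entity = to_entities(u'\u00d6')
as_text = from_decimal_entities(u'&#0214; and &#23383;')
```

Note that int('0214') treats the digits as plain decimal, so the leading zero in the entity is harmless here.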

Bullet-Proof Type Casting of Strings

In large Python projects, you might write a function that can be passed a wide variety of types, including byte strings, unicode strings, objects, numbers, etc. Sometimes it simply isn't practical to be super-strict and you just want to turn an arbitrary Python object into a byte string or a unicode string, and have it just work.

If you use Django, similar methods can be found in django.utils.encoding.

File: smart_encoding.py
# Copyright (c) 2009 Lobstertech, Inc.
# Licensed under the LGPL
import types

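def smart_unicode(s, encoding='utf-8', errors='strict'):
    if type(s) in (unicode, int, long, float, types.NoneType):
        return unicode(s)
    elif type(s) is str:
        # byte string: decode it using the encoding we were told about
        return unicode(s, encoding, errors)
    elif hasattr(s, '__unicode__'):
        # unicode(obj, encoding, errors) raises TypeError for objects that
        # aren't strings, so let the object's own __unicode__ do the work
        return unicode(s)
    else:
        return unicode(str(s), encoding, errors)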

def smart_str(s, encoding='utf-8', errors='strict', from_encoding='utf-8'):
    if type(s) in (int, long, float, types.NoneType):
        return str(s)
    elif type(s) is str:
        if encoding != from_encoding:
            return s.decode(from_encoding, errors).encode(encoding, errors)
        else:
            return s
    elif type(s) is unicode:
        return s.encode(encoding, errors)
    elif hasattr(s, '__str__'):
        return smart_str(str(s), encoding, errors, from_encoding)
    elif hasattr(s, '__unicode__'):
        return smart_str(unicode(s), encoding, errors, from_encoding)
    else:
        return smart_str(str(s), encoding, errors, from_encoding)
File: test_smart_encoding.py
# -*- coding: utf-8 -*-
# Copyright (c) 2009 Lobstertech, Inc.
# Licensed under the LGPL
#
# py.test -xl test_smart_encoding.py

import py.test
from smart_encoding import smart_unicode, smart_str

def test_smart_str():
    assert type(smart_str('hello world')) is str
    assert smart_str('hello world') == 'hello world'
    assert smart_str(u'hello world') == u'hello world'
    assert type(smart_str(u'hello world')) is str
    assert type(smart_str(u'hello world')) is str

    assert smart_str(u"\u96c6") == '\xe9\x9b\x86'
    assert smart_str(u"\u96c6", "big5") == '\xb6\xb0'
    py.test.raises(UnicodeDecodeError, lambda: smart_str('\xb6\xb0', "big5"))
    py.test.raises(UnicodeDecodeError, lambda: smart_str('\xb6\xb0', "ascii"))

    assert smart_str('\xb6\xb0', "big5", errors="ignore") == ''
    assert smart_str('\xb6\xb0', "big5", errors="replace") == '??'
    assert smart_str('hello \xb6\xb0', "ascii", errors="replace") == 'hello ??'
    assert smart_str('\xe9\x9b\x86 \xb6\xb0', "big5", errors="replace") == '\xb6\xb0 ??'
    assert smart_str('\xb6\xb0', "big5", from_encoding="big5") == '\xb6\xb0'
    assert smart_str('\xb6\xb0', "utf-8", from_encoding="big5") == '\xe9\x9b\x86'

def test_smart_unicode():
    assert type(smart_unicode('hello world')) is unicode
    assert smart_unicode('hello world') == u'hello world'
    assert smart_unicode(u'hello world') == u'hello world'
    assert type(smart_unicode(u'hello world')) is unicode
    assert type(smart_unicode(u'hello world')) is unicode

    assert smart_unicode(u"\u96c6") == u"\u96c6"
    assert smart_unicode(u"\u96c6", "big5") == u"\u96c6"
    assert smart_unicode("\xb6\xb0", "big5") == u"\u96c6"
    assert smart_unicode("hi \xa3", "latin-1") == u'hi \xa3'
    assert smart_unicode("hi \xa3", "latin-1") == u'hi \u00a3'
    py.test.raises(UnicodeDecodeError, lambda: smart_unicode("hi \xa3", "ascii"))
    assert smart_unicode("hi \xa3", "ascii", errors="ignore") == u"hi "
    assert smart_unicode("hi \xa3", "ascii", errors="replace") == u"hi \ufffd" # unicode question mark

def test_object():
    class Lawl(object): pass
    class Mog: pass
    assert type(smart_unicode(Lawl())) is unicode
    assert smart_unicode(Lawl()).startswith('<')
    assert type(smart_unicode(Mog())) is unicode
    assert smart_unicode(Mog()).startswith('<')
    
    class Hurt:
        def __str__(self):
            return '\xe9\x9b\x86'
    assert smart_unicode(Hurt()) == u"\u96c6"
    assert smart_str(Hurt()) == '\xe9\x9b\x86'

    class TheHurting:
        def __str__(self):
            return '\xb6\xb0'
    assert smart_unicode(TheHurting(), 'big5') == u"\u96c6"

Python + Unicode + Databases

PostgreSQL uses UTF-8 by default and has never given me any problems. MySQL uses Latin-1 by default and has a compulsive need to magically re-encode your data every possible chance it gets. Latin-1 is a very undesirable charset to be restricted to if you are writing a web application because you lose the ability to store non-western characters and cute little symbols. Worst of all, MySQL will quietly turn any characters it can't store into '?' marks, silently corrupting your data.

When writing this article, I discovered that when I created this website in Django, even after setting UTF-8 as my default charset in Django, and even after I thought I had bullied MySQL into using UTF-8 by default, somehow all my tables used this weird Swedish variant of Latin-1 by default.

If you find yourself in a similar situation and switching to PostgreSQL is out of the question, you might want to try the following to get your MySQL database to properly handle UTF-8:

Back up your database, and to be safe, import the data into a standby database in case a disaster happens.

If you don't need to worry about migrating lots of data with latin1 characters and just want it to work now, do this:

  • Run 'alter database charset = utf8;' on all your databases.
  • Run 'alter table TABLE_NAME charset = utf8;' on all your tables, filling in each table's name.
  • But that's not all! Go into phpMyAdmin, go into every single table, and then into every single field, and change the 'collation' to 'utf8_unicode_ci'. (The '_ci' means MySQL will continue to do its case-insensitive thing.)
  • There is also a setting somewhere in the system that defines the default charset for clients connecting to the database. You might not need to change this as I believe most web frameworks like Django will set the encoding explicitly each time they establish a connection.
  • You might also want to edit my.cnf:
    File: /etc/my.cnf
    [mysqld]
    ...
    default-character-set=utf8
    default-collation=utf8_general_ci
    ...
    [client]
    default-character-set=utf8
    
  • If you're using PHP you're still probably not going to see UTF-8 characters no matter what you do. If you're using PEAR DB you can fix it with the following after connecting:
    $DB->Query("SET CHARACTER SET UTF8");
    $DB->Query("SET NAMES UTF8");
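If clicking through every table in phpMyAdmin sounds like a bad afternoon, you could generate the statements instead. The sketch below only builds SQL strings; it doesn't connect to anything. The function name and the table names are hypothetical. It uses MySQL's ALTER TABLE ... CONVERT TO CHARACTER SET form, which changes the columns along with the table default. Be warned that CONVERT TO re-encodes the stored data on the assumption that it really is in the charset the column claims, so try it on your standby copy first.

```python
def utf8_migration_sql(database, tables, collation='utf8_unicode_ci'):
    """Build ALTER statements that move a database and its tables to utf8.

    Names are dropped into backticks as-is, so only feed this identifiers
    you trust (no escaping is attempted).
    """
    statements = [
        'ALTER DATABASE `%s` CHARACTER SET utf8 COLLATE %s;'
        % (database, collation)
    ]
    for table in tables:
        statements.append(
            'ALTER TABLE `%s` CONVERT TO CHARACTER SET utf8 COLLATE %s;'
            % (table, collation))
    return statements


# 'users' and 'tickets' are made-up table names for the example
statements = utf8_migration_sql('happydb', ['users', 'tickets'])
```

Write the result to a file, read it over, and only then pipe it into the mysql client.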
    

When UTF8 and Latin-1 Get Mixed Together :(

If your database contains both latin1 AND utf-8 data and you need to migrate a large set of it, you're in big trouble. If you do a SQL dump, read the file in Python, and decode it as utf8, you'll get unicode decode errors. If you decode it as latin1, you'll get corrupted characters. But you're in luck! I have a happy little script which might help you:

File: convert_mixed_utf_latin1.py
#!/usr/bin/env python
#
# This script will read 'backup.sql' and decode it as utf8.
# If it comes across any characters it can't decode, it
# will assume they are latin1.
#
# It will then output a file that is fully utf8 named 'backup.new.sql'
#

# read the dump as raw bytes; nothing has been decoded yet
data = open('backup.sql', 'rb').read()
final = []
while True:
    try:
        final.append(data.decode('utf8'))
        break
    except UnicodeDecodeError, exc:
        bad = data[exc.start]
        print "oh snap: %r -> %r" % (bad, bad.decode('latin1').encode('utf8'))
        # everything up to crazy character should be good
        final.append(data[:exc.start].decode('utf8'))
        # crazy character is probably latin1
        final.append(bad.decode('latin1'))
        # remove already decoded stuff
        data = data[exc.start+1:]
f = open('backup.new.sql', 'wb')
f.write("".join(final).encode('utf8'))
f.close()
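The same fallback idea can be packaged as a function, which makes it easy to try on small inputs before you trust it with a real dump. This is a sketch under the same assumption as the script: any byte that isn't valid UTF-8 is taken to be Latin-1. The name decode_mixed is made up, and the except ... as syntax needs Python 2.6 or newer.

```python
def decode_mixed(data):
    """Decode bytes as UTF-8, treating each undecodable byte as Latin-1."""
    pieces = []
    while True:
        try:
            pieces.append(data.decode('utf-8'))
            break
        except UnicodeDecodeError as exc:
            # everything before the bad byte decoded cleanly
            pieces.append(data[:exc.start].decode('utf-8'))
            # slice (rather than index) so we get bytes on Python 3 too
            pieces.append(data[exc.start:exc.start + 1].decode('latin-1'))
            # carry on with whatever follows the bad byte
            data = data[exc.start + 1:]
    return u''.join(pieces)


# 'café' in UTF-8 followed by a lone Latin-1 'é'
mixed = decode_mixed(b'caf\xc3\xa9 \xe9')
```

It can't be perfect: a run of Latin-1 bytes that happens to form a valid UTF-8 sequence will be read as UTF-8. In real text that's rare, but spot-check the output.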

That script is basically a super-enhanced version of decode() that falls back to latin1 when the normal decode() routine gets confused. Let's try it below, and also use sed to change the table charsets from latin1 to utf8 as well:

jart@compy:~$ mysqldump --opt -u root happydb >backup.sql
jart@compy:~$ echo 'create database happydb_new;' | mysql -uroot
jart@compy:~$ python convert_mixed_utf_latin1.py
jart@compy:~$ sed -i -e 's/latin1/utf8/' backup.new.sql
jart@compy:~$ mysql -uroot happydb_new <backup.new.sql

Hands On With Asian Spam

If your boss ever asks you to write a ticket system that automatically imports emails into a database, one of your most crucial responsibilities is making sure that what Asian spammers have to say is rendered properly.

Here is a piece of spam I received recently. (Note: I saved this email to a file from Thunderbird. Because Thunderbird renders to UTF-8, had I viewed the source and copied/pasted the email, this would not work because I would have been copy/pasting UTF-8 characters rather than the goofy encoding this spammer is using.)

[TXT] chinese_spam.txt (12KB)

If we download the above file and try to view it in our UTF-8 terminal, we will see gibberish:

[Image: Chinese Spam Viewed in UTF-8 Terminal]

This is because the email is encoded in GB2312. You can tell by looking at the email headers:

Content-Type: text/plain;charset="GB2312"

Python has tools for making sense out of these emails.

>>> import email
>>> msg = email.message_from_file(open('chinese_spam.txt'))
>>> text = msg.get_payload()
>>> msg.get_content_charset()
'gb2312'
>>> unicode_text = text.decode('gb2312')
>>> type(unicode_text) == unicode
True
>>> print unicode_text.encode('utf-8')
[...]
◆工作经验:
    10多年高科技企业产品研发和研发管理工作经历,先后担任过项目经理、研究管理部经
理、开发部经理等职位,在长期的产品研发管理实践中积累了丰富的技术和管理经验。在华
[...]

But the subject header is also encoded!

>>> print msg['Subject']
=?GB2312?B?svrGt9HQt6K8sLy8yvXIy9SxusvQxLncwO28vMTc0bXBtw==?=
>>> msg['Subject'].decode(msg.get_content_charset()).encode('utf-8')
'=?GB2312?B?svrGt9HQt6K8sLy8yvXIy9SxusvQxLncwO28vMTc0bXBtw==?='

It seems our decoding powers no longer apply, because the Content-Type header only describes the message portion (payload) of the email. The subject is a header itself, so that charset doesn't apply to it.

If you look closely, the subject appears to be encoded in Base64 and tells you which encoding to use. Thankfully Python comes to the rescue once again!

>>> import email.Header
>>> (text, encoding) = email.Header.decode_header(msg['Subject'])[0]
>>> print text
��Ʒ�з���������Ա���Ĺ�����ѵ�
>>> type(text) is unicode
False
>>> encoding
'gb2312'
>>> print text.decode(encoding).encode('utf-8')
产品研发及技术人员核心管理技能训练
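A subject line can contain several encoded chunks, each with its own charset, so grabbing only element [0] can lose text. Here is a hedged sketch of a helper that stitches all the pieces together. The name decode_subject is invented; it imports from email.header, the lowercase spelling available since Python 2.5, and treats chunks without a charset as ASCII, which is my assumption.

```python
from email.header import decode_header


def decode_subject(raw_subject, errors='replace'):
    """Turn a raw (possibly RFC 2047 encoded) header into a unicode string."""
    parts = []
    for text, charset in decode_header(raw_subject):
        # Python 2 always hands back byte strings here; Python 3 hands back
        # already-decoded text for headers with no encoded chunks
        if isinstance(text, bytes):
            text = text.decode(charset or 'ascii', errors)
        parts.append(text)
    return u''.join(parts)


# '5a2X' is the Base64 form of the three UTF-8 bytes for u'\u5b57'
subject = decode_subject('=?utf-8?b?5a2X?=')
```

Calling decode_subject(msg['Subject']) on the spam above should give you the same text as the manual steps, as a unicode string ready to encode for your terminal.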

Don't forget that in practice (as discussed above), you'll probably want to use decode(encoding, 'replace') because every once in a while you'll receive legitimate emails that contain one or two invalid characters.

Advanced: The full intricacies of parsing email are beyond the scope of this tutorial. Here are some quick tips that might help you move in the right direction a little quicker:

  • msg.get_payload(decode=1) doesn't decode characters. It undoes the Content-Transfer-Encoding (quoted-printable or base64), which is how emails squeeze non-ASCII bytes through mail systems that only like 7-bit text. You still get a byte string back that needs decode(). If your emails contain a lot of crazy '=' signs, you might want to use this.
  • for part in msg.walk(): helps you deal with multi-part MIME messages. Each part has the same methods, like get_content_charset, that we used above.
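Putting those two tips together, a rough sketch for pulling the readable text out of a message might look like this. The function name text_parts is made up, and treating a missing charset as ASCII is my assumption. An unknown charset name will raise LookupError, which a real importer would want to catch.

```python
import email


def text_parts(msg, errors='replace'):
    """Return the decoded body of every text/* part in a message."""
    bodies = []
    for part in msg.walk():
        if part.get_content_maintype() != 'text':
            continue  # skip multipart containers and attachments
        # step 1: undo base64 / quoted-printable, giving raw bytes
        raw = part.get_payload(decode=True)
        # step 2: turn those bytes into characters
        charset = part.get_content_charset() or 'ascii'
        bodies.append(raw.decode(charset, errors))
    return bodies


# a tiny hand-made message whose body is the Base64 of u'\u5b57' in UTF-8
sample = email.message_from_string(
    'Content-Type: text/plain; charset="utf-8"\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    '5a2X\n')
bodies = text_parts(sample)
```

For a multi-part message you get one entry per text part, typically the plain-text body followed by the HTML version.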

Reference information
