Unicode in Python 2

From Apertium

How to deal with Unicode in python2

There are a lot of Unicode-related bugs in various python2 scripts. However, python2 handles Unicode input/output just fine if you stick to certain safe methods.

Here's how to read unicode from the terminal, do something with the text, and print it back as unicode:

#!/usr/bin/python2
# coding=utf-8
# -*- encoding: utf-8 -*-
from __future__ import unicode_literals

import sys

for line in sys.stdin:
    # Turn the input str into a unicode object:
    uline = line.decode('utf-8')
    
    # Do something with the text, e.g. get the first char
    # (skip lines that hold nothing but the newline):
    if len(uline) > 1:
        firstchar = uline[0]

        # Turn the unicode object back into a str before outputting:
        print firstchar.encode('utf-8')

If you stick to the habit of always decoding what you read, and encoding what you print, you'll have no trouble with Unicode in python2.

This way you don't need any hacks (e.g. adding a sitecustomize.py that calls sys.setdefaultencoding('utf-8'), a practice which is discouraged).
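
The same decode-on-read, encode-on-write habit carries over to files. A minimal sketch, assuming the files are UTF-8: io.open (available since python 2.6) does the decoding and encoding for you, so the loop only ever sees unicode objects. The function name first_chars and the demo file names are made up for this example, and the sketch uses u'...' literals instead of the unicode_literals import.

```python
import io
import os
import tempfile


def first_chars(in_path, out_path):
    """Write the first character of every non-empty line of in_path to out_path."""
    # io.open decodes on read and encodes on write.
    with io.open(in_path, 'r', encoding='utf-8') as infile, \
         io.open(out_path, 'w', encoding='utf-8') as outfile:
        for uline in infile:
            uline = uline.rstrip(u'\n')
            if uline:
                outfile.write(uline[0] + u'\n')


# Hypothetical demo files in a temporary directory.
tmpdir = tempfile.mkdtemp()
in_path = os.path.join(tmpdir, 'in.txt')
out_path = os.path.join(tmpdir, 'out.txt')

with io.open(in_path, 'w', encoding='utf-8') as f:
    f.write(u'\u0161foo\nbar\n')  # u'\u0161' is the letter š

first_chars(in_path, out_path)

# Read the result back as raw bytes to see what actually landed on disk.
with io.open(out_path, 'rb') as f:
    result = f.read()
```

Here result holds the UTF-8 bytes of the first characters, one per line; no explicit .decode() or .encode() call is needed because io.open was told the encoding.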

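Note: The coding=utf-8 and encoding: utf-8 comment lines tell python2 that the source file itself is encoded as UTF-8, so that hardcoded non-ASCII characters are allowed in it. One of them is enough, as long as it is on the first or second line of the file. The import of unicode_literals makes every plain string literal in the file a unicode object instead of a str.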

Explanation

Try leaving out the .decode('utf-8') and .encode('utf-8') and see what happens when you give it e.g. "šfoo" as input. It'll try splitting in the middle of a letter, and give you output like � instead of š.

The reason for this is that the input to python is of type str, not type unicode. You can see what the str looks like in the python interpreter:

>>> "šfoo"
'\xc5\xa1foo'

The š is represented as two "characters" (really two bytes): \xc5\xa1. And here's what happens when you try to get the first char:

>>> "šfoo"[0]
'\xc5'

However, if you decode the str into a unicode object before doing [0], the string is split between the real letters.
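
A minimal sketch of that comparison, written with a b'...' literal so the bytes are explicit (in python2 a b'...' literal is an ordinary str). The variable names are only for illustration.

```python
raw = b'\xc5\xa1foo'            # the UTF-8 bytes python2 reads for "šfoo"
text = raw.decode('utf-8')      # a unicode object

first_byte = raw[:1]            # half of the letter š
first_char = text[0]            # the whole letter š

raw_length = len(raw)           # counts bytes
text_length = len(text)         # counts characters

# Encoding the single character gives back both of its bytes.
first_char_bytes = first_char.encode('utf-8')
```

The str is five bytes long while the unicode object is four characters long, which is why indexing the str cuts the š in half.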

If you only leave off the decode, not the encode, you'll get:

Traceback (most recent call last):
  File "foo.py", line 16, in <module>
    print mangled.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)

(I.e., the half-letter \xc5 could not be treated as ASCII.)
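
A decode error coming out of a call to encode looks odd at first. In python2, calling .encode() on a str makes python first decode that str implicitly with the default ascii codec, and that hidden step is what fails. A small sketch that triggers the same failure explicitly (the variable names are illustrative):

```python
half_letter = b'\xc5'  # the first byte of the UTF-8 encoding of š

try:
    # The explicit version of the hidden step python2 performs
    # when .encode() is called on a str.
    half_letter.decode('ascii')
except UnicodeDecodeError as err:
    message = str(err)
```

The text in message matches the last line of the traceback above.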

If you only leave off the encode, it might print fine to your terminal, but will give an error when you redirect to a file:

Traceback (most recent call last):
  File "foo.py", line 16, in <module>
    print mangled
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0161' in position 0: ordinal not in range(128)
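
The difference between the terminal and a file comes from sys.stdout.encoding: python2 sets it from the locale when stdout is a terminal, but leaves it as None when output is redirected, and print then falls back to the ascii codec. A sketch that reproduces the fallback by hand, with illustrative variable names:

```python
import sys

# None under python2 when stdout is redirected to a file or a pipe.
stdout_encoding = sys.stdout.encoding

letter = u'\u0161'  # š as a unicode object

try:
    # What print ends up doing when it has no better encoding to use.
    letter.encode('ascii')
except UnicodeEncodeError as err:
    message = str(err)

# The explicit encode works no matter what stdout is connected to.
encoded = letter.encode('utf-8')
```

Encoding explicitly before printing removes the dependence on where the output goes.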

The tracebacks themselves tell you what's missing: a UnicodeDecodeError points to a missing decode, a UnicodeEncodeError to a missing encode.