<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.apertium.org/w/index.php?action=history&amp;feed=atom&amp;title=Unicode_in_Python_2</id>
	<title>Unicode in Python 2 - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.apertium.org/w/index.php?action=history&amp;feed=atom&amp;title=Unicode_in_Python_2"/>
	<link rel="alternate" type="text/html" href="https://wiki.apertium.org/w/index.php?title=Unicode_in_Python_2&amp;action=history"/>
	<updated>2026-05-14T09:05:02Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.34.1</generator>
	<entry>
		<id>https://wiki.apertium.org/w/index.php?title=Unicode_in_Python_2&amp;diff=37204&amp;oldid=prev</id>
		<title>Unhammer: /* How to deal with Unicode in python2 */</title>
		<link rel="alternate" type="text/html" href="https://wiki.apertium.org/w/index.php?title=Unicode_in_Python_2&amp;diff=37204&amp;oldid=prev"/>
		<updated>2012-11-21T14:26:22Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;How to deal with Unicode in python2&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 14:26, 21 November 2012&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 8:&lt;/td&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 8:&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;div&gt;# coding=utf-8&lt;/div&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;div&gt;# coding=utf-8&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;div&gt;# -*- encoding: utf-8 -*-&lt;/div&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;div&gt;# -*- encoding: utf-8 -*-&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-empty diff-side-deleted&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-addedline diff-side-added&quot;&gt;&lt;div&gt;from __future__ import unicode_literals&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;br /&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;div&gt;import sys&lt;/div&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;div&gt;import sys&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 31:&lt;/td&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 32:&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;div&gt;[http://bugs.python.org/issue9549 discouraged].)&lt;/div&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;div&gt;[http://bugs.python.org/issue9549 discouraged].)&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;br /&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-deletedline diff-side-deleted&quot;&gt;&lt;div&gt;&#039;&#039;Note: The reason for including the coding=utf-8 and encoding:utf-8 comment lines is to tell python2 that if it sees a hardcoded, non-ascii character in the file, treat it as unicode.&#039;&#039;&lt;/div&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-addedline diff-side-added&quot;&gt;&lt;div&gt;&#039;&#039;Note: The reason for including the coding=utf-8 and encoding:utf-8 comment lines&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt;, and import unicode_literals,&lt;/ins&gt; is to tell python2 that if it sees a hardcoded, non-ascii character in the file, treat it as unicode.&#039;&#039;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;br /&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;div&gt;==Explanation==&lt;/div&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;div&gt;==Explanation==&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Unhammer</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.apertium.org/w/index.php?title=Unicode_in_Python_2&amp;diff=29641&amp;oldid=prev</id>
		<title>Unhammer: /* How to deal with Unicode in python2 */</title>
		<link rel="alternate" type="text/html" href="https://wiki.apertium.org/w/index.php?title=Unicode_in_Python_2&amp;diff=29641&amp;oldid=prev"/>
		<updated>2011-12-01T11:27:20Z</updated>

		<summary type="html">&lt;p&gt;&lt;span dir=&quot;auto&quot;&gt;&lt;span class=&quot;autocomment&quot;&gt;How to deal with Unicode in python2&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;table class=&quot;diff diff-contentalign-left&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #222; text-align: center;&quot;&gt;Revision as of 11:27, 1 December 2011&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 31:&lt;/td&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 31:&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;div&gt;[http://bugs.python.org/issue9549 discouraged].)&lt;/div&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;div&gt;[http://bugs.python.org/issue9549 discouraged].)&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;br /&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-deletedline diff-side-deleted&quot;&gt;&lt;div&gt;&#039;&#039;Note: The reason for including the coding=utf-8 and encoding:utf-8&lt;/div&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-addedline diff-side-added&quot;&gt;&lt;div&gt;&#039;&#039;Note: The reason for including the coding=utf-8 and encoding:utf-8&lt;ins class=&quot;diffchange diffchange-inline&quot;&gt; comment lines is to tell python2 that if it sees a hardcoded, non-ascii character in the file, treat it as unicode.&#039;&#039;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-deletedline diff-side-deleted&quot;&gt;&lt;div&gt;comment lines is to tell python2 that if it sees a hardcoded,&lt;/div&gt;&lt;/td&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-empty diff-side-added&quot;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-deletedline diff-side-deleted&quot;&gt;&lt;div&gt;non-ascii character in the file, treat it as unicode.&#039;&#039;&lt;/div&gt;&lt;/td&gt;
  &lt;td colspan=&quot;2&quot; class=&quot;diff-empty diff-side-added&quot;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;br /&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;br /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-deleted&quot;&gt;&lt;div&gt;==Explanation==&lt;/div&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;
  &lt;td class=&quot;diff-context diff-side-added&quot;&gt;&lt;div&gt;==Explanation==&lt;/div&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Unhammer</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.apertium.org/w/index.php?title=Unicode_in_Python_2&amp;diff=29640&amp;oldid=prev</id>
		<title>Unhammer: Created page with &#039;==How to deal with Unicode in python2== There are a lot of bugs in various of the python2 scripts to do with unicode. However, python2 handles unicode input/output just fine if y…&#039;</title>
		<link rel="alternate" type="text/html" href="https://wiki.apertium.org/w/index.php?title=Unicode_in_Python_2&amp;diff=29640&amp;oldid=prev"/>
		<updated>2011-12-01T11:26:54Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;#039;==How to deal with Unicode in python2== There are a lot of bugs in various of the python2 scripts to do with unicode. However, python2 handles unicode input/output just fine if y…&amp;#039;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;==How to deal with Unicode in python2==&lt;br /&gt;
Several of the python2 scripts have bugs to do with Unicode. However, python2 handles Unicode input/output just fine if you stick to a few safe habits.&lt;br /&gt;
&lt;br /&gt;
Here&amp;#039;s how to read unicode from the terminal, do something with the text, and print it back as unicode:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#!/usr/bin/python2&lt;br /&gt;
# coding=utf-8&lt;br /&gt;
# -*- encoding: utf-8 -*-&lt;br /&gt;
&lt;br /&gt;
import sys&lt;br /&gt;
&lt;br /&gt;
for line in sys.stdin:&lt;br /&gt;
    # Turn the input str into a unicode object:&lt;br /&gt;
    uline = line.decode(&amp;#039;utf-8&amp;#039;)&lt;br /&gt;
    &lt;br /&gt;
    # Do something with the text, e.g. get the first char&lt;br /&gt;
    # (the line still ends in a newline, hence the 1):&lt;br /&gt;
    if len(uline) &amp;gt; 1:&lt;br /&gt;
        firstchar = uline[0]&lt;br /&gt;
    &lt;br /&gt;
        # Turn the unicode object into a str before outputting;&lt;br /&gt;
        # printing inside the if avoids a NameError on empty lines:&lt;br /&gt;
        print firstchar.encode(&amp;#039;utf-8&amp;#039;)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If you stick to the habit of always decoding what you read, and&lt;br /&gt;
encoding what you print, you&amp;#039;ll have no trouble with Unicode in&lt;br /&gt;
python2. &lt;br /&gt;
&lt;br /&gt;
This way you don&amp;#039;t need any hacks (e.g. adding a sitecustomize.py&lt;br /&gt;
saying sys.setdefaultencoding(&amp;#039;utf-8&amp;#039;), a practice which is&lt;br /&gt;
[http://bugs.python.org/issue9549 discouraged].)&lt;br /&gt;
&lt;br /&gt;
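A minimal sketch of the decode-early, encode-late habit as a function, so that it can be tried without a terminal. The name first_char_utf8 is made up for this example, and the b prefix on literals assumes Python 2.6 or later:&lt;br /&gt;
&lt;br /&gt;
```python
# coding=utf-8

def first_char_utf8(raw_line):
    """Return the first character of a UTF-8 encoded line, as UTF-8 bytes."""
    # Decode as soon as the bytes come in:
    uline = raw_line.decode("utf-8").rstrip(u"\n")
    # Work on the unicode object; slicing also copes with an empty line:
    firstchar = uline[:1]
    # Encode only when the result goes out again:
    return firstchar.encode("utf-8")

# b"\xc5\xa1foo\n" is what stdin delivers for the line šfoo
result = first_char_utf8(b"\xc5\xa1foo\n")
```
&lt;br /&gt;
Here result holds the two bytes \xc5\xa1, i.e. a complete š.&lt;br /&gt;
&lt;br /&gt;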
&amp;#039;&amp;#039;Note: The reason for including the coding=utf-8 and encoding:utf-8&lt;br /&gt;
comment lines is to tell python2 that the source file itself is UTF-8 encoded.&lt;br /&gt;
Without such a line, python2 rejects any hardcoded, non-ascii character in the file with a SyntaxError. Only one of the two lines is needed.&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
==Explanation==&lt;br /&gt;
&lt;br /&gt;
Try leaving out the &amp;lt;code&amp;gt;.decode(&amp;#039;utf-8&amp;#039;)&amp;lt;/code&amp;gt; and&lt;br /&gt;
&amp;lt;code&amp;gt;.encode(&amp;#039;utf-8&amp;#039;)&amp;lt;/code&amp;gt; and see what happens when you give it&lt;br /&gt;
e.g. &amp;quot;šfoo&amp;quot; as input. It&amp;#039;ll try splitting in the middle of a letter,&lt;br /&gt;
and give you output like � instead of š.&lt;br /&gt;
&lt;br /&gt;
The reason for this is that the input to python is of type str, not&lt;br /&gt;
type unicode. You can see what the str looks like in the python interpreter:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt;&amp;gt;&amp;gt; &amp;quot;šfoo&amp;quot;&lt;br /&gt;
&amp;#039;\xc5\xa1foo&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The š is represented as two bytes, which a str treats as two &amp;quot;characters&amp;quot;: \xc5\xa1. And here&amp;#039;s what happens when you try to get the first char:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&amp;gt;&amp;gt;&amp;gt; &amp;quot;šfoo&amp;quot;[0]&lt;br /&gt;
&amp;#039;\xc5&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
However, if you decode the str to unicode before doing &amp;lt;code&amp;gt;[0]&amp;lt;/code&amp;gt;, the string is split between the real letters.&lt;br /&gt;
&lt;br /&gt;
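The difference can be checked side by side. This sketch assumes the input arrived as UTF-8 bytes, and it uses the slice [0:1] rather than [0] on the bytes so that it also runs unchanged under Python 3; the variable names are made up:&lt;br /&gt;
&lt;br /&gt;
```python
# coding=utf-8
raw = b"\xc5\xa1foo"          # the UTF-8 bytes for šfoo, as read from stdin
text = raw.decode("utf-8")    # the same text as a unicode object

first_byte = raw[0:1]         # b"\xc5": only half of š
first_char = text[0]          # u"\u0161": the whole letter š

byte_count = len(raw)         # 5, because š takes two bytes
char_count = len(text)        # 4, one per letter
```
&lt;br /&gt;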
If you only leave off the decode, not the encode, you&amp;#039;ll get:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Traceback (most recent call last):&lt;br /&gt;
  File &amp;quot;foo.py&amp;quot;, line 16, in &amp;lt;module&amp;gt;&lt;br /&gt;
    print mangled.encode(&amp;#039;utf-8&amp;#039;)&lt;br /&gt;
UnicodeDecodeError: &amp;#039;ascii&amp;#039; codec can&amp;#039;t decode byte 0xc5 in position 0: ordinal not in range(128)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
(I.e. to encode a str, python2 first has to turn it into unicode using the default ascii codec, and the half-letter \xc5 is not valid ascii.)&lt;br /&gt;
&lt;br /&gt;
If you only leave off the encode, it might print fine to your&lt;br /&gt;
terminal, but will give an error when you redirect to a file:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Traceback (most recent call last):&lt;br /&gt;
  File &amp;quot;foo.py&amp;quot;, line 16, in &amp;lt;module&amp;gt;&lt;br /&gt;
    print mangled&lt;br /&gt;
UnicodeEncodeError: &amp;#039;ascii&amp;#039; codec can&amp;#039;t encode character u&amp;#039;\u0161&amp;#039; in position 0: ordinal not in range(128)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The tracebacks in themselves say what&amp;#039;s missing (decode vs encode).&lt;br /&gt;
&lt;br /&gt;
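Both errors can be reproduced on purpose by spelling out the ascii conversion that python2 otherwise attempts implicitly. This is an illustrative sketch, not part of foo.py, and the variable names are made up:&lt;br /&gt;
&lt;br /&gt;
```python
# coding=utf-8
raw = b"\xc5\xa1foo"       # undecoded input bytes
text = u"\u0161foo"        # decoded text (šfoo)

try:
    # What python2 does behind the scenes when .encode() is called on a str:
    raw.decode("ascii")
    decode_error = None
except UnicodeDecodeError as err:
    decode_error = err

try:
    # What python2 does when printing unicode to a redirected stdout:
    text.encode("ascii")
    encode_error = None
except UnicodeEncodeError as err:
    encode_error = err
```
&lt;br /&gt;
In both cases the offending position is 0, matching the tracebacks above.&lt;br /&gt;
&lt;br /&gt;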
[[Category:Development]]&lt;/div&gt;</summary>
		<author><name>Unhammer</name></author>
		
	</entry>
</feed>