Encodings and Unicode¶
Character encodings remain a fractious and often exasperating part of IT.
If you are deal with more than ASCII characters with any regularity whatsoever, for the love of God and all that is holy, use Python 3. It has greatly superior support for Unicode characters, and will generally make your life much eaiser.
say()
and fmt()
try to avoid encoding gotchas by working with
Unicode strings.
In Python 3, all strings are Unicode strings, and all files and I/O streams are
inherently smart enough to read and write to reasonable encodings needed to
store Unicode on disk. But in Python 2, there is a choice between str
and
unicode
, and most files are not smart enough to use rational encodings.
Indeed, files that appear to have an encoding
attribute will not let you set
that attribute, and they will not enforce that encoding when doing file IO.
!@#$%^&!!!
So if you must use Python 2:
- Seriously reconsider what led you to this point. Work to upgrade to modern Python (e.g. 3.7+) ASAP.
- Use
unicode
strings whenever possible. - Seriously consider
from __future__ import unicode_literals
to automate the upleveling of string literals to Unicode. - If you use the basie
str
type, include only ASCII characters, not encoded bytes from UTF-8 or whatever. If you don’t do this, any trouble results be on your head. - If
say
opens a file for you, it will do it with thecodecs
module with a default encoding of UTF-8. If you havesay
write to a file that you open, you must usecodecs.open()
,io.open()
, or a similar mechanism that supports proper encoding. Else errors will result.
say
has a long history of trying to make Python 2 automatically “do the
right thing” even when basic Python 2 facilities do not. We have discovered,
like so many others before us, that was a fool’s errand. Python 2 is simply
ill-prepared for day-in, day-out use of Unicode characters that are all around
us in the modern global world. While say
continues some of this with respect
to the default standard output (stdout
) stream, many of the previous
back-bends to support auto-encoding have been withdrawn. If you choose to use
Python 2, you are responsible for opening files in a responsible,
encoding-friendly way.:
from __future__ import unicode_literals
from codecs import open
contents = u'Contains\u2012Unicode!'
with open('outfile.txt', 'w', encoding='utf-8') as f:
say(contents, file=f)