[OS X TeX] Input encoding question
koch at math.uoregon.edu
Fri Feb 20 01:06:54 EST 2009
Axel and Nathan,
On Feb 19, 2009, at 8:55 PM, Axel E. Retif wrote:
> On 19 Feb, 2009, at 22:38, Nathan Paxton wrote:
>> If that's the case (which I don't doubt), then will TexShop ever
>> move to UTF 8 as its default encoding? Is there a reason that it
>> retains applemac as its default?
> Good question. I think Richard Koch will tell us soon.
Most encodings represent files as a sequence of bytes, one byte per
character. For these encodings, any sequence of bytes is legal. So if
you save a file in Mac OS Roman, and read it back as Latin 1, the
computer will happily do so, but some characters (with codes above
127) won't be what you expect.
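
For illustration only, here is a minimal Python sketch of that round trip. It is not part of TeXShop. The sample word is a made-up example, and the codec names "mac_roman" and "latin-1" are Python's names for Mac OS Roman and Latin 1.

```python
# Hypothetical sample text; any string with a non-ASCII character would do.
original = "café"

# Save as Mac OS Roman: one byte per character.
raw = original.encode("mac_roman")

# Read the same bytes back as Latin 1. No error is raised, because
# every byte value is legal in a one-byte-per-character encoding.
misread = raw.decode("latin-1")

same_length = len(misread) == len(original)
ascii_survives = misread[:3] == original[:3]
accent_survives = misread[3] == original[3]
print(same_length, ascii_survives, accent_survives)  # True True False
```

The ASCII letters come back unchanged; only the accented character, whose code is above 127, turns into something else.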
Since text files don't contain a field listing their encoding, it
isn't possible to look at a file and know for certain which encoding
was used to save it. Some editors use heuristics to try to guess the
encoding, but TeXShop doesn't do that --- I don't like programs to
try to guess what I want behind my back.
You can change the encoding of files in TeXShop as follows (please
back up your files before using this technique):
Choose "Open...". When the open dialog appears, select the file's
current encoding in the pulldown menu at the bottom of the open dialog
before clicking the "Open" button. TeXShop will read the file using
that encoding. Internally TeXShop represents the file using unicode.
Choose "Save As..." In the save dialog, select the new encoding in the
pulldown menu at the bottom of the save dialog before clicking the
"Save" button.
Let me explain why it would be a bad idea to make unicode the default.
The unicode standard does not fix the format of unicode files on disk.
Instead there are several possible ways to encode such files,
including UTF-8 and UTF-16.
Suppose a file just contains standard ascii characters, but it is
saved in UTF-16. Then each ascii character is represented by two
bytes, so the file on disk is not a standard ascii file. If you try to
typeset such a file with standard tex or latex, it won't typeset. Many
editors will reject the file, so if you send it to a friend, the
friend may be totally confused. If your friend sends you a standard
ascii file but your standard encoding is UTF-16, you'll just get
garbage when you open it.
So UTF-16 would cause no end of trouble.
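
As a rough sketch of the problem (in Python, with made-up sample strings, and using the little-endian UTF-16 variant without a byte order mark as an assumption), compare the bytes on disk:

```python
# Hypothetical one-line TeX source containing only ASCII characters.
text = r"\section{Intro}"

ascii_bytes = text.encode("ascii")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

# Each ASCII character now takes two bytes, and every second byte is zero.
doubled = len(utf16_bytes) == 2 * len(ascii_bytes)
zero_padded = all(b == 0 for b in utf16_bytes[1::2])
print(doubled, zero_padded)

# The reverse mistake: plain ASCII bytes read as UTF-16.
# Pairs of ASCII bytes are fused into single unrelated characters.
garbage = b"hello!".decode("utf-16-le")
print(len(garbage), garbage.isascii())
```

The interleaved zero bytes are what a standard tex or latex run would choke on, and the fused characters are the garbage seen when an ASCII file is opened as UTF-16.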
UTF-8 is a different unicode encoding. It represents the 128 standard
ASCII characters (codes 0 through 127) in standard ascii, but all other
characters are "coded" in the file as a sequence of two or more bytes,
each with a value greater than 127.
A random sequence of bytes will usually not be a legal UTF-8 file. The
bytes have to satisfy certain rules before they can be decoded.
Suppose you set UTF-8 as your default encoding. If you only use
standard ASCII characters, and if your friend only uses standard
ASCII characters, you'll be in good shape. But suppose your friend's
source contains a few unusual characters, and your friend didn't use
UTF-8. Then your friend's file will almost surely not be legal UTF-8
and if you try to open it, the computer will report an illegal file
and give you nothing.
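
A small Python sketch of this failure, under the assumption that the friend saved the file in Mac OS Roman (the sample word is hypothetical):

```python
# UTF-8 codes a non-ASCII character as several bytes, all above 127.
print("é".encode("utf-8"))  # b'\xc3\xa9'

# Hypothetical file from a friend, saved as Mac OS Roman instead.
friend_bytes = "café".encode("mac_roman")

# A strict UTF-8 decoder rejects the lone high byte.
try:
    friend_bytes.decode("utf-8")
    opened = True
except UnicodeDecodeError:
    opened = False
print(opened)  # False

# A file with only ASCII characters is the same in both encodings.
ascii_ok = b"plain ascii".decode("utf-8") == b"plain ascii".decode("mac_roman")
print(ascii_ok)  # True
```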
With the current default encoding or Latin 1 or most other encodings,
files always open and ascii always works great, and the only trouble
you'll run into is that a few characters may not be what you expect.
But with any unicode encoding, your TeX friends will be confused when
they get files from you, and you'll be confused when you try to open
files from them.
By the way, when TeXShop is told to open a file with UTF-8, but finds
that the file isn't UTF-8, it doesn't give up. Instead, it uses Mac OS
Roman as a fallback. Nevertheless, it is a bad idea to foist unicode
on beginners. Unicode is fine if you know what you are doing.
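
The fallback idea can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not TeXShop's actual code; the function name read_with_fallback is made up for the example.

```python
def read_with_fallback(data):
    """Decode bytes as UTF-8; if they are not legal UTF-8, use Mac OS Roman.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Mac OS Roman assigns a character to every byte value,
        # so this second attempt does not fail.
        return data.decode("mac_roman"), "mac_roman"


# Hypothetical inputs: the same word saved in two different encodings.
print(read_with_fallback("café".encode("utf-8")))
print(read_with_fallback("café".encode("mac_roman")))
```

Returning the encoding name alongside the text lets the caller tell the user which interpretation was applied, rather than guessing silently.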