[OS X TeX] Input encoding question

Richard Koch koch at math.uoregon.edu
Fri Feb 20 01:06:54 EST 2009

Axel and Nathan,

On Feb 19, 2009, at 8:55 PM, Axel E. Retif wrote:

> On  19 Feb, 2009, at 22:38, Nathan Paxton wrote:
>> 	If that's the case (which I don't doubt), then will TexShop ever  
>> move to UTF 8 as its default encoding? Is there a reason that it  
>> retains applemac as its default?
> Good question. I think Richard Koch will tell us soon.

Most encodings represent files as a sequence of bytes, one byte per  
character. For these encodings, any sequence of bytes is legal. So if  
you save a file in Mac OS Roman, and read it back as Latin 1, the  
computer will happily do so, but some characters (with codes above  
127) won't be what you expect.
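
To make that concrete, here is a small Python sketch (an illustration only, nothing TeXShop itself runs). The sample word is an assumption chosen for the demonstration; the codec names are Python's standard ones.

```python
# Save text in Mac OS Roman, then read the same bytes back as Latin 1.
text = "Straße"
raw = text.encode("mac_roman")     # one byte per character
misread = raw.decode("latin-1")    # no error: every byte is a legal Latin 1 character

print(len(raw))    # 6
print(misread)     # Stra§e
```

The ASCII letters survive; only the character with a code above 127 changes.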

Since text files don't contain a field listing their encoding, it  
isn't possible to look at a file and know for certain which encoding  
was used to save it. Some editors use heuristics to try to guess the  
encoding, but TeXShop doesn't do that  --- I don't like programs to  
try to guess what I want behind my back.

You can change the encoding of files in TeXShop as follows (please  
back up your files before using this technique):

Choose "Open...". When the open dialog appears, select the file's  
current encoding in the pulldown menu at the bottom of the open dialog  
before clicking the "Open" button. TeXShop will read the file using  
that encoding. Internally TeXShop represents the file using unicode.

Choose "Save As..." In the save dialog, select the new encoding in the  
pulldown menu at the bottom of the save dialog before clicking the  
"Save" button.

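The same open-then-save-as conversion can be sketched outside TeXShop in a few lines of Python. This is only an assumed equivalent, not TeXShop's code; the function name `reencode` and the file names in the comment are hypothetical. As above, back up your files first.

```python
from pathlib import Path


def reencode(src_path, dst_path, src_encoding, dst_encoding):
    """Read src_path using src_encoding and write it to dst_path in dst_encoding."""
    data = Path(src_path).read_bytes()
    text = data.decode(src_encoding)      # in memory the text is Unicode
    # Working on raw bytes leaves the line endings exactly as they were.
    Path(dst_path).write_bytes(text.encode(dst_encoding))


# Hypothetical usage:
# reencode("paper.tex", "paper-utf8.tex", "mac_roman", "utf-8")
```

Writing to a second file rather than overwriting the first keeps the original intact if the source encoding was guessed wrong.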

Let me explain why it would be a bad idea to make unicode the default  
TeXShop encoding.

The unicode standard does not fix the format of unicode files on disk.  
Instead there are several possible ways to encode such files,  
including UTF-8 and UTF-16.

Suppose a file just contains standard ascii characters, but it is  
saved in UTF-16. Then each ascii character is represented by two  
bytes, so the file on disk is not a standard ascii file. If you try to  
typeset such a file with standard tex or latex, it won't typeset. Many  
editors will reject the file, so if you send it to a friend, the  
friend may be totally confused. If your friend sends you a standard  
ascii file but your standard encoding is UTF-16, you'll just get  
garbage when you open it.
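
A short Python sketch shows both directions of that problem. The sample strings are assumptions picked for the illustration, and the little-endian variant without a byte order mark is just one of the UTF-16 forms.

```python
source = "\\documentclass{article}"
as_ascii = source.encode("ascii")
as_utf16 = source.encode("utf-16-le")    # little-endian, no byte order mark

print(len(as_ascii), len(as_utf16))      # 23 46
print(as_utf16[:6])                      # b'\\\x00d\x00o\x00'

# The other direction: a plain ASCII file read as if it were UTF-16.
garbage = b"\\begin{document}".decode("utf-16-le")
print(len(garbage))                      # 8: each pair of ASCII bytes became one character
```

Every second byte of the UTF-16 file is zero, which is what trips up standard tex and many editors.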

So UTF-16 would cause no end of trouble.

UTF-8 is a different Unicode encoding. It represents the 128 standard  
ASCII characters (codes 0 through 127) exactly as ASCII does, but all  
other characters are "coded" in the file as sequences of two or more  
bytes, each greater than 127.

A random sequence of bytes will usually not be a legal UTF-8 file. The  
bytes have to satisfy certain rules before they can be decoded.
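
As a rough illustration in Python: ASCII text is unchanged by UTF-8, a non-ASCII character becomes several high bytes, and the same bytes in the wrong order are rejected. The helper name `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print("abc".encode("utf-8"))          # b'abc'
print("ß".encode("utf-8"))            # b'\xc3\x9f'
print(is_valid_utf8(b"\xc3\x9f"))     # True
print(is_valid_utf8(b"\x9f\xc3"))     # False: a continuation byte cannot come first
```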

Suppose you set UTF-8 as your default encoding. If both you and your  
friend use only the 128 standard ASCII characters, you'll be in good  
shape. But suppose your friend's  
source contains a few unusual characters, and your friend didn't use  
UTF-8. Then your friend's file will almost surely not be legal UTF-8  
and if you try to open it, the computer will report an illegal file  
and give you nothing.

With the current default encoding or Latin 1 or most other encodings,  
files always open and ascii always works great, and the only trouble  
you'll run into is that a few characters may not be what you expect.
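
One way to see the difference is to count how many of the 256 possible byte values can be opened on their own under each encoding. This is a Python sketch using Latin 1 as the single-byte example; the helper name `opens` is hypothetical.

```python
def opens(data: bytes, encoding: str) -> bool:
    """Return True if data decodes without error under the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


single_bytes = [bytes([value]) for value in range(256)]

latin1_ok = sum(opens(b, "latin-1") for b in single_bytes)
utf8_ok = sum(opens(b, "utf-8") for b in single_bytes)

print(latin1_ok)   # 256: every byte is a legal Latin 1 character
print(utf8_ok)     # 128: only the ASCII bytes stand alone in UTF-8
```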

But with any unicode encoding, your TeX friends will be confused when  
they get files from you, and you'll be confused when you try to open  
their files.


By the way, when TeXShop is told to open a file with UTF-8, but finds  
that the file isn't UTF-8, it doesn't give up. Instead, it uses Mac OS  
Roman as a fallback. Nevertheless, it is a bad idea to foist unicode  
on beginners. Unicode is fine if you know what you are doing.
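
The fallback idea can be sketched in Python as follows. This is an assumed outline of the behavior described above, not TeXShop's actual code; the function name `decode_with_fallback` is hypothetical.

```python
def decode_with_fallback(data: bytes, preferred="utf-8", fallback="mac_roman"):
    """Try the preferred encoding first; on failure, use the fallback.

    Returns the decoded text and the name of the encoding that worked.
    """
    try:
        return data.decode(preferred), preferred
    except UnicodeDecodeError:
        return data.decode(fallback), fallback


print(decode_with_fallback("Straße".encode("utf-8")))       # ('Straße', 'utf-8')
print(decode_with_fallback("Straße".encode("mac_roman")))   # ('Straße', 'mac_roman')
```

Returning the encoding that was actually used lets the caller tell the user when the fallback kicked in.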

Dick Koch
