[OS X TeX] Re: Some Encoding & Keyboard Questions

Bastian Philipps bph at gmx.info
Fri Feb 3 11:19:50 EST 2006


Herbert Schulz said the following on 3.2.2006 15:54 Uhr:
> [...] 2) If I have my default file encoding set to UTF-8 how does
> TeXShop know that a certain file is not in UTF-8 when it reads it? If
> I open a MacOSRoman (my actual default - just because) file a dialog
> box comes up saying it isn't UTF-8 and will be read in as MacOSRoman.
> Is there some sort of BOM at the start of a UTF-8 file that
> distinguishes it from other (indistinguishable by TeXShop) formats?

If I may be so frank, I will just quote Richard Koch from a private mail
to me concerning the handling of text encoding in TeXShop:

Richard Koch said the following on 16.1.2006 20:44 Uhr:
> Bastian,

> I suspect that TeXShop is working correctly. Let me explain how TeXShop
> knows that a file is opened with the wrong encoding.
> 
> Two kinds of encoding are available in TeXShop and other programs. The
> first kind is an encoding with 256 possible characters. Each byte in such
> a file is a legal character, but the encoding determines which unicode
> character corresponds to each byte.
> Thus 0xa4 means one thing if you use MacOSRoman (the section sign) and
> another thing if you use ISO Latin 9 (the Euro sign), but it is a legal
> entry in both encodings.
> 
> If you create a file with ISO Latin 9 and save it, and then load it as
> MacOSRoman, some of the characters will be wrong. But TeXShop won't know
> that because the file is legal in both encodings.
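
A minimal Python sketch of this point, not TeXShop code. It assumes Python's `mac_roman` and `iso8859_15` codecs as stand-ins for MacOSRoman and ISO Latin 9, and uses the byte 0xa4 as an illustration because those two codecs map it to different characters:

```python
# Illustration only: Python codec names stand in for the encodings discussed.
raw = bytes([0xA4])

# Every byte is legal in a 256-character encoding, so both decodes succeed.
as_mac = raw.decode("mac_roman")       # section sign, U+00A7
as_latin9 = raw.decode("iso8859_15")   # Euro sign, U+20AC

print(repr(as_mac), repr(as_latin9))

# No exception was raised either way, so a program has no signal that the
# wrong encoding was picked; only the displayed character differs.
```

Neither call fails, which is why an editor cannot detect the mismatch by itself.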
> 
> However, UTF-8 is different. It is a file format in which the standard
> 128 ASCII characters are encoded as usual, but the remaining Unicode
> characters are coded in a special way which takes 2 or more bytes.
> Moreover, a random stream of bytes will usually not be a legal UTF-8 file.
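
The same properties can be checked in a few lines of Python. This is only a sketch; `is_valid_utf8` is a hypothetical helper name, and the sample strings are made up for illustration:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

ascii_bytes = "TeX".encode("utf-8")      # ASCII is encoded as usual: 1 byte each
e_acute = "\u00e9".encode("utf-8")       # e with acute accent: 2 bytes
euro = "\u20ac".encode("utf-8")          # Euro sign: 3 bytes

# The same accented letter saved in a single-byte encoding is a lone 0xe9,
# which is the start of a multi-byte UTF-8 sequence with nothing after it.
latin9_bytes = "\u00e9".encode("iso8859_15")

print(len(ascii_bytes), len(e_acute), len(euro))
print(is_valid_utf8(e_acute), is_valid_utf8(latin9_bytes))
```

Because multi-byte sequences must follow a fixed pattern, most byte streams from other encodings fail this check as soon as they contain a non-ASCII character.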
> 
> Here is how TeXShop works: Internally it uses Unicode. When it comes
> time to write out the file, the internal representation is converted to
> a string using an encoding. (This is necessary even if the encoding is a
> Unicode encoding, because the Unicode standard doesn't specify a
> particular way of writing Unicode to disk. So UTF-8 is one possible
> Unicode encoding, but not the only one.)
> 
> What happens if there is a Unicode character in the text which is not
> available in the particular encoding chosen? Apple's routines contain a
> parameter which indicates whether this should create an error or if
> instead the character should just be ignored or converted to something
> else. I choose "ignore or convert to something else." So if you type,
> say, a Euro symbol, but the encoding doesn't support it, then TeXShop
> will still write out the file.
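
As an analogy only: Python's `str.encode` has an `errors` argument that plays a similar role to the parameter described above. This sketch is not Apple's API, and it assumes Python's `latin_1` codec as an example of an encoding without a Euro sign:

```python
text = "Price: 5 \u20ac"   # ends with a Euro sign

# Strict mode: an unencodable character is reported as an error.
try:
    text.encode("latin_1", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Lossy modes: the file can still be written, but the character is
# either replaced by a substitute or silently dropped.
replaced = text.encode("latin_1", errors="replace")
ignored = text.encode("latin_1", errors="ignore")

print(strict_failed, replaced, ignored)
```

The lossy modes correspond to the "ignore or convert to something else" choice: saving always succeeds, at the cost of losing the unsupported character.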
> 
> There is somewhat similar code when you read text from disk. Apple's
> routines require that an encoding be specified, and then the file is
> converted into Apple's internal Unicode form and displayed in the editor.
> 
> But this time there is another problem. Suppose the encoding is UTF-8
> Unicode, and the file isn't legal UTF-8. Then when Apple's code reads the
> file, it suddenly says "wait, this doesn't make sense." In that case, it
> stops reading and reports an error to TeXShop. TeXShop then puts up the
> dialog you have reported and reads the file again in MacOSRoman. (Every
> file is a legal MacOSRoman file.)
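
The read-then-fall-back behaviour described above can be sketched in Python as follows. `read_with_fallback` is a hypothetical name and this is not TeXShop's actual code; it assumes Python's `mac_roman` codec, which accepts every byte value:

```python
def read_with_fallback(data: bytes, preferred: str = "utf-8",
                       fallback: str = "mac_roman"):
    """Decode with the preferred encoding; on failure, re-read with the fallback.

    Returns the decoded text and the name of the encoding that was used,
    so the caller can warn the user when the fallback was needed.
    """
    try:
        return data.decode(preferred), preferred
    except UnicodeDecodeError:
        # The fallback is a 256-character encoding, so this cannot fail.
        return data.decode(fallback), fallback

word = "caf\u00e9"
from_utf8 = read_with_fallback(word.encode("utf-8"))
from_mac = read_with_fallback(word.encode("mac_roman"))

print(from_utf8, from_mac)
```

No byte order mark is involved in this approach: the decision rests entirely on whether the bytes happen to be well-formed UTF-8.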
> 
> Now I think you understand. If you write out a file as ISO Latin 9 and
> read it back as, say, ISO Latin 1, the file will be legal ISO Latin 1, so
> TeXShop doesn't report a problem. But some of the characters will be wrong.

Greetings
Bastian

------------------------- Info --------------------------
Mac-TeX Website: http://www.esm.psu.edu/mac-tex/
          & FAQ: http://latex.yauh.de/faq/
TeX FAQ: http://www.tex.ac.uk/faq
List Archive: http://tug.org/pipermail/macostex-archives/



