[OS X TeX] converting ligatures into text
Maarten Sneep
maarten.sneep at xs4all.nl
Fri Apr 22 09:45:16 EDT 2005
On 22 apr 2005, at 15:37, Lawrence Paulson wrote:
> I have to extract text from a large number of PDF documents produced
> using TeX. Because (I presume) of TeX's non-standard font encodings,
> cut and paste often goes wrong. In particular, ligatures get garbled:
> I get di±cult instead of difficult.
What tool do you use to extract the text? Copy & paste from Acrobat?
An alternative is pdftotext (part of xpdf; you can compile it yourself
or get an installer from http://www.bluem.net/downloads/pdftotext_en/).
> Does anybody know of a program (or of a definitive set of replacements
> that could be given to Perl) for cleaning up such text?
That would depend on the font encodings used in the PDFs and on the
encoding you expect for the text file you create. I think this is a
tough one to answer in general.
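
As a rough illustration of what such a replacement table might look
like, here is a minimal sketch in Python (not Perl, and not a
definitive set). The first table covers the standard Unicode ligature
characters, which some extractors emit. The second table is an
assumption: only the "±" for "ffi" pair comes from the example in this
thread, and the other entries are guesses for one particular font
remapping that may not match your PDFs. The names expand_ligatures,
UNICODE_LIGATURES and ASSUMED_GARBLED are made up for this example.

```python
# Standard Unicode ligature code points (Alphabetic Presentation Forms).
UNICODE_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# ASSUMPTION: garbled stand-ins for ligatures. Only "±" -> "ffi" is
# attested in this thread; the rest are guesses for one font encoding
# and must be checked against your own documents.
ASSUMED_GARBLED = {
    "\u00ae": "ff",   # registered sign
    "\u00af": "fi",   # macron
    "\u00b0": "fl",   # degree sign
    "\u00b1": "ffi",  # plus-minus sign (the example from this thread)
    "\u00b2": "ffl",  # superscript two
}


def expand_ligatures(text, extra=None):
    """Replace ligature characters in text with plain letter sequences.

    extra is an optional dict of additional single-character
    replacements, applied together with the Unicode ligature table.
    """
    table = dict(UNICODE_LIGATURES)
    if extra:
        table.update(extra)
    # str.translate accepts multi-character replacement strings.
    return text.translate(str.maketrans(table))


if __name__ == "__main__":
    print(expand_ligatures("di\u00b1cult", extra=ASSUMED_GARBLED))
```

Note that applying the second table blindly also rewrites any genuine
"±" or "°" in the text, so it is only safe when you know those
characters cannot occur legitimately. That is one reason a single
definitive table is hard to give.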
Maarten