[OS X TeX] Counting words in a latex file
Peter Dyballa
Peter_Dyballa at Web.DE
Fri Jun 2 05:00:02 EDT 2006
On 01.06.2006 at 23:54, Jan Erik Moström wrote:
> Is there a simple way of doing a word count in a latex file,
> preferably one that
> can exclude figures/tables etc?
A very exact method is to use the PDF output. Xpdf comes with a
program, pdftotext (plus some other very useful tools), that extracts text
from a PDF file and is clever enough to undo hyphenation. I ran a
few tests this morning with heavily hyphenated multi-column text, and I
have to admit it counts as well as I do!
pdftotext file.pdf - | egrep -o '\w\w\w+' | iconv -f ISO-8859-15 -t UTF-8 | wc
From the text that pdftotext extracts from the PDF file, egrep picks out
alphanumeric sequences (i.e. words) at least three characters
long (in this particular case). These are then converted from the
encoding used in (La)TeX to that of my runtime environment, i.e. from
ISO Latin-9 (ISO 8859-15, the one with €) to UTF-8. This conversion
is needed because wc would otherwise miscount characters in the text,
since it assumes a UTF-8 (multi-byte) encoding, as set in my environment.
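For repeated use, the pipeline above can be wrapped in a small shell function. This is only a sketch: the name count_words is made up here, and it converts to UTF-8 before matching (the reverse of the order above), on the assumption that grep runs in a UTF-8 locale and that the input really is ISO 8859-15.

```shell
# count_words (hypothetical helper): read ISO 8859-15 text on stdin and
# print the number of words that are at least three characters long.
count_words() {
    iconv -f ISO-8859-15 -t UTF-8 |   # recode to the locale's encoding first
        grep -oE '\w\w\w+' |          # -o prints each matching word on its own line
        wc -l                         # one line per word, so count lines
}

# Typical use (assumes pdftotext from Xpdf is installed):
#   pdftotext file.pdf - | count_words
```

Counting lines with wc -l rather than using plain wc gives a single number, which is easier to reuse in scripts.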
According to its man page, detex ignores text in array, eqnarray,
equation, figure, mathmatica, picture, table, and verbatim
environments. It should also be possible to pass other
environments to ignore as a comma-separated list after the -e option
switch. I am not sure that this actually works as described ...
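A detex-based count could look roughly like the sketch below. The function name count_tex_words is invented for illustration, and I have not verified the -e behaviour beyond what the man page says; as far as I read it, a list given with -e replaces the default list rather than extending it, so the defaults you still want have to be repeated.

```shell
# count_tex_words (hypothetical helper): strip TeX markup with detex and
# count the remaining words.
#   $1  the .tex file
#   $2  optional comma-separated list of environments to ignore
count_tex_words() {
    if [ -n "$2" ]; then
        # Assumption: -e replaces detex's built-in list of ignored environments.
        detex -e "$2" "$1" | wc -w
    else
        # No list given: rely on detex's default ignored environments.
        detex "$1" | wc -w
    fi
}

# Typical use (assumes detex is installed):
#   count_tex_words paper.tex
#   count_tex_words paper.tex figure,table,verbatim,lstlisting
```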
--
Greetings
Pete
Some day we may discover how to make magnets that can point in any
direction.