[OS X TeX] Counting words in a latex file

Peter Dyballa Peter_Dyballa at Web.DE
Fri Jun 2 05:00:02 EDT 2006


On 01.06.2006 at 23:54, Jan Erik Moström wrote:

> Is there a simple way of doing a word count in a latex file,  
> preferable one that
> can exclude figures/tables etc?

A very exact method is to work from the PDF output. Xpdf comes with a
program, pdftotext (plus some other very useful tools), that extracts
text from a PDF file and is clever enough to undo hyphenation. I made a
few tests this morning with highly hyphenated multi-column text and I
have to admit it counts as well as I do!

	pdftotext file.pdf - | egrep -o '\w\w\w+' | iconv -f ISO-8859-15 -t UTF-8 | wc

egrep takes the text that pdftotext extracts from the PDF file and
picks out the alphanumeric sequences (in plain terms, words) that are
at least three characters long (in this particular case). These are
then converted from the encoding used in (La)TeX to that of my runtime
environment, i.e. from ISO Latin-9 (ISO 8859-15, the one with €) to
UTF-8. This conversion is needed because wc assumes the UTF-8
(multi-byte) encoding set in my environment and would otherwise
miscount the characters in the text.
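
The extraction step can be tried on its own, without a PDF. What
follows is only a sketch: the function name count_long_words and its
optional minimum-length argument are made up for illustration, and it
uses the POSIX class [[:alnum:]] where the command above uses \w (which
also matches underscores). It relies on grep's -o option, which prints
each match on a line of its own, so that counting lines counts words.

```shell
# count_long_words [MIN]: read text on stdin and print how many
# alphanumeric sequences of at least MIN characters it contains.
# MIN defaults to 3, as in the pipeline above. (Hypothetical helper.)
count_long_words() {
    min="${1:-3}"
    # -o puts every match on its own line; wc -l then counts the matches.
    # tr strips the padding that some wc implementations add.
    grep -oE "[[:alnum:]]{${min},}" | wc -l | tr -d ' '
}
```

Assuming pdftotext is installed and its output is already in the
encoding of your locale (for instance via its -enc option, if your
build supports UTF-8 there), it could be used as
`pdftotext -enc UTF-8 file.pdf - | count_long_words`.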

For excluding figures and tables, detex works on the LaTeX source
itself. According to its man page, detex ignores text in array,
eqnarray, equation, figure, mathmatica, picture, table, and verbatim
environments. Other environments to ignore can be given as a
comma-separated list after the -e option switch. I am not sure that
this actually works as described ...
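
A minimal sketch of how detex and wc might be combined, under the same
caveat that I have not verified how -e behaves. The function name
tex_word_count is invented for this example, and so is the environment
name "myenv" in the usage line below. As far as I read the man page,
a list given with -e replaces the default list rather than extending
it, so the defaults would have to be repeated if they should still be
ignored.

```shell
# tex_word_count FILE [ENVLIST]: print the number of words detex leaves
# over from FILE. ENVLIST is an optional comma-separated list of
# environments passed to detex -e. (Hypothetical helper.)
tex_word_count() {
    file="$1"
    envs="${2:-}"
    if ! command -v detex >/dev/null 2>&1; then
        echo "tex_word_count: detex not found" >&2
        return 127
    fi
    if [ -n "$envs" ]; then
        # Assumption: -e takes the list of environments to ignore.
        detex -e "$envs" "$file" | wc -w | tr -d ' '
    else
        # Without -e, detex uses its built-in list of environments.
        detex "$file" | wc -w | tr -d ' '
    fi
}
```

For example: `tex_word_count file.tex figure,table,myenv`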

--
Greetings

   Pete

Some day we may discover how to make magnets that can point in any  
direction.
