[OS X TeX] OT: Scanners, OCR, searchable pdf files, Acrobat Pro 9

Sun Aug 2 20:35:39 EDT 2009

On Sun, Aug 2, 2009 at 7:00 PM, Claus
Gerhardt<gerhardt at math.uni-heidelberg.de> wrote:
> my handwritten lecture notes and also printed documents, and in the process
> I also experimented with the OCR capabilities of Acrobat Pro 9.
> The S1500M lived up to the high expectations I had because of the rave
> reviews: it works quickly and the results are excellent; several scanning
> modes can be predefined including duplex scanning, resolutions, OCR, ...
> OCR of course works only for printed documents. For scans of handwritten
> notes I usually add a front page (using pdflatex and pdfselect) with some
> keywords. In Acrobat Pro 9 one can also add Bookmarks for easier navigation.
> I have many pdf files of scanned old journal articles which are not
> searchable, but in Acrobat Pro 9 these files can be made searchable after
> specifying a language. A 42-page mathematical paper was OCRed in less than
> a minute. A check (including Spotlight) worked fine.
> In short the S1500M is a beauty.

A few remarks.
1) When scanning, one must correctly identify the type of document.
E.g. when scanning a mathematical book or handwriting, one should
(usually) identify the document as text or black/white and select the
b/w threshold properly. This usually gives crisp, clear text while
removing traces of the lines or blocks ruled on the paper. It also
makes the pdf much smaller. My experiments with two adjacent pages of
an old book with yellowish paper show that scanning in color produces
up to 5MB, grayscale 2MB, and b/w, with much better quality, only
around 80KB. All these MBs go to reproducing paper color, speckles
and shadows.

2) AP 9 is much better than AP 8, which in turn was much better with
scanned documents than AP 7. One of the problems AP can address arises
when one scans a page from a journal or book that never lies
properly on the glass.

Document > Optimize Scanned Document

works like a charm, removing speckles and rotating pages automatically.
Cropping works really well too.

3) OCR becomes a problem with documents which contain plenty of
formulae or are in non-Latin scripts. AP 7 was really bad with those,
AP 8 was better, and I have had no need to experiment extensively with
AP 9. The good news is that ABBYY FineReader works great with those
and it is built into DjVu Document Express. The bad news is that it is
Windows-only and is *extremely* expensive. Long ago DjVu produced much
smaller documents than AP 7, and there are many OCRed old journals in
DjVu format (see http://numdam.org and http://projecteuclid.org, for
example), but DjVu Document Express was neglected for many years and
AP has made really big progress.
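
As a rough illustration of the b/w threshold in remark 1), here is a
minimal Python sketch. It is not what the scanner software or AP does
internally; the function names, the pixel values and the default
threshold of 128 are all assumptions made up for the example. It only
shows why a well-chosen threshold drops paper tint and faint ruling,
and why 1 bit per pixel needs far less raw data than grayscale or
color (before any compression).

```python
def to_black_and_white(gray, threshold=128):
    """Map 8-bit grayscale pixels (0 = black, 255 = white) to 1-bit
    values: 1 = white, 0 = black. The threshold is a free parameter."""
    return [[1 if px >= threshold else 0 for px in row] for row in gray]


def raw_size_bytes(width, height, bits_per_pixel):
    """Uncompressed size of a width x height image, rounded up to bytes."""
    return (width * height * bits_per_pixel + 7) // 8


# Hypothetical 2x4 patch: yellowish paper (~200-210), a faint ruled
# line (~170) in the second row, and one column of ink (~35-40).
patch = [
    [210, 205, 40, 200],
    [170, 172, 35, 168],
]

# Threshold below the ruling: paper and ruling turn white, ink stays black.
bw = to_black_and_white(patch, threshold=128)

# Threshold above the ruling: the ruled line survives as black.
bw_too_high = to_black_and_white(patch, threshold=180)

# Raw (uncompressed) sizes of a 1000 x 1000 pixel scan.
color_bytes = raw_size_bytes(1000, 1000, 24)
gray_bytes = raw_size_bytes(1000, 1000, 8)
bw_bytes = raw_size_bytes(1000, 1000, 1)
```

The sizes quoted in remark 1) are for compressed pdf files, so the
ratios there differ from these raw ones; the sketch only shows the
direction of the effect.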

I am familiar with this as I digitized many of my articles myself,
spanning 1969 to 1990.

Victor

-- 
========================
Victor Ivrii, Professor, Department of Mathematics, University of Toronto
http://www.math.toronto.edu/ivrii