The larger file sizes in your PDFs have less to do with the OCR than
with the way you are scanning the images and the compression
techniques that are then being applied.
For example, I downloaded one of the AMPICO bulletins and noticed that
the image was scanned in standard RGB with JPEG compression applied, even
though it really only uses two colors (three if you count white ;).
However, the application of the JPEG algorithm increased the number of
unique colors in the image, so that it now takes up THREE TIMES (or
more) the space in the PDF. Had you scanned to a lossless format such as
TIFF, or used Flate/ZIP compression, the color data (and thus the total
image size) could have been significantly reduced. And since you can't
"undo" the artifacting that JPEG introduces, you'd need to rescan from
scratch to fix this.
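
To make the point concrete, here is a rough Python sketch. It is only an
illustration under stated assumptions: the "page" is synthetic, zlib
stands in for the Flate filter, and small random noise is a crude
stand-in for JPEG artifacts (no JPEG codec is actually run). It shows
why an image with only a few distinct colors compresses so well
losslessly, and how quickly that advantage disappears once the pixel
values are perturbed.

```python
import random
import zlib

WIDTH, HEIGHT = 300, 200  # hypothetical page size in pixels


def make_two_ink_page(width, height):
    """Build a synthetic RGB page: white paper with black and red stripes."""
    white, black, red = (255, 255, 255), (0, 0, 0), (200, 0, 0)
    rows = []
    for y in range(height):
        if y % 20 < 4:
            color = black
        elif y % 20 < 6:
            color = red
        else:
            color = white
        rows.append([color] * width)
    return rows


def add_noise(rows, amplitude, seed=0):
    """Perturb every channel slightly (assumed stand-in for lossy artifacts)."""
    rng = random.Random(seed)
    return [
        [
            tuple(min(255, max(0, c + rng.randint(-amplitude, amplitude)))
                  for c in px)
            for px in row
        ]
        for row in rows
    ]


def to_bytes(rows):
    """Flatten rows of (r, g, b) tuples into raw bytes."""
    return bytes(c for row in rows for px in row for c in px)


def unique_colors(rows):
    """Count the distinct (r, g, b) values in the image."""
    return len({px for row in rows for px in row})


clean = make_two_ink_page(WIDTH, HEIGHT)
noisy = add_noise(clean, amplitude=3)

clean_size = len(zlib.compress(to_bytes(clean), 9))
noisy_size = len(zlib.compress(to_bytes(noisy), 9))

print("unique colors:", unique_colors(clean), "->", unique_colors(noisy))
print("Flate size in bytes:", clean_size, "->", noisy_size)
```

The exact numbers depend on the synthetic page and the noise level, so
treat them as directional only. Real scans would need to be measured
with the actual files.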
I also looked at some of the Player Piano scores and noticed that rather
than a single image, the displayed page is actually composed of MULTIPLE
image "pieces" along with a "mask". This sometimes happens when you (or
your software) apply fancy techniques such as "segmentation" to an
image that doesn't call for them. In doing so, you actually make file
size optimization WORSE, because each piece is compressed on its own and
the compressor has less data to work with (more data == better
compression).
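
A small Python sketch of the "more data == better compression" point,
again under assumptions: the page is synthetic (one row pattern repeated
down the page), zlib stands in for Flate, and the "pieces" are simple
horizontal strips rather than whatever a real segmentation tool would
produce. Each strip is compressed separately, the way separate image
streams in a PDF would be, and the total is compared with compressing
the page as one stream.

```python
import random
import zlib

ROW_BYTES, ROWS, STRIPS = 200, 400, 16  # hypothetical page layout

rng = random.Random(1)
# One row of arbitrary bytes, repeated for every row of the page.
pattern = bytes(rng.randrange(256) for _ in range(ROW_BYTES))
page = pattern * ROWS

# Compress the whole page as a single stream.
whole_size = len(zlib.compress(page, 9))

# Cut the page into equal horizontal strips and compress each on its own.
strip_len = ROW_BYTES * (ROWS // STRIPS)
strips = [page[i:i + strip_len] for i in range(0, len(page), strip_len)]
pieces_size = sum(len(zlib.compress(s, 9)) for s in strips)

print("one stream:", whole_size, "bytes")
print(len(strips), "separate streams:", pieces_size, "bytes in total")
```

Every strip has to describe the repeating pattern from scratch and
carries its own stream overhead, which is why the separately compressed
pieces add up to more than the single stream in this toy case.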
And that's just my looking at two random files from your collection...
Leonard Rosenthol
PDF Standards Evangelist
Adobe Systems
-----Original Message-----
From: digital-text@yahoogroups.com [mailto:digital-text@yahoogroups.com]
On Behalf Of Terry Smythe
Sent: Saturday, November 17, 2007 1:46 PM
To: digital-text@yahoogroups.com
Subject: [digital-text] OCR vs Graphic
The recent discussion about OCR raises an issue that some may
consider important enough for discussion, relative to the choice
of OCR vs graphic scanning for archival purposes.
For the past year or so, I have been archiving 80-100 year old
original literature by scanning full pages (books and pamphlets)
into graphic images, then creating PDF files from within Acrobat
Pro 8. See:
http://members.shaw.ca/paud122/docs.htm
These PDF files provide accurate representation of the original
documents. Using AP8's internal OCR, the files are
searchable, with extracts possible. These are my desirable
attributes.
The good news is the accurate visual representation of the
actual document, and reasonably quick creation. The bad news
is large file sizes.
I have just OCR scanned an original 1927 catalog of piano rolls,
106 pages. The good news is small file size. The bad news
is that it is not an accurate visual representation of the
original document, and required substantial proof-reading and
editing. See:
http://mmd.foxtail.com/Smythe/US_Music_Roll_Catalog_May_1927.pdf
For this latest adventure, I used ABBYY FineReader Pro 8,
through an HP 4670 "See Thru" scanner, set to 600dpi B&W for all
pages but the covers. The original is 7" x 4.5". The 80
year old newsprint type paper is now yellowed, good physical
condition, but poor contrast. Print quality is generally
marginal with character size at about 6 point Times Roman.
File size emerged at 1.6 megs.
It only took about 45 minutes to do the scanning, 2 opposing
pages simultaneously. AFR8 automatically splits the 2 pages
into single pages. OCR reading took about 3-4 minutes for
all 106 pages. Then proof-reading and error correction took
about 5 hours.
This catalog has within it some 12-14 foreign languages, all of
which are within AFR8's OCR recognition features.
Unfortunately, the more languages are selected for recognition,
the more likely it is that certain characters will emerge
with super/sub scripts that are simply not there in the
original.
Furthermore, the small character size, coupled with poorly
formed characters often with many voids and blended characters,
and substandard printing quality with numerous tiny ink
splashes, all combined to frustrate accurate OCR recognition.
First time through, with 14 foreign languages selected, AFR8
took about 4-5 minutes for recognition. But transfer into AP8
proved to be impossible as AP8 encountered special characters
that it did not like, creating a fatal error, refusing to accept
these special characters, shutting down prematurely.
Reducing the foreign languages down to English, French and
German still produced special characters that AP8 did not like.
Ended up selecting only English and it was finally accepted by
AP8.
The final result is not an accurate representation of the
original document. The only real benefits appear to be the
small file size, and excellent quality image.
Conversely, taking the same 600dpi B&W raw images in AFR8 and
sending them directly into AP8 occurred very quickly, with a
companion AP8 OCR processing in similar time. Final file size
emerged at 2.2 megs. However, image quality is significantly
poorer. See:
http://mmd.foxtail.com/Smythe/US_Music_Roll_Catalog_May_1927_graphic.pdf
I've been tinkering with OCR software for the past 10 years or so,
seeing it get progressively better. So far, AFR8 is about the
best OCR software I've encountered. But it is still far from
perfect. I am not comfortable with this adventure, and will
likely continue with scanning in graphic mode into AP8.
I thought perhaps others in this group may find this little
adventure of interest. Remember, I am not a professional in
this world. Just doing what I can out of my home as a
volunteer with modest equipment and software. I would love
to have access to a BookEye scanner and companion software, but
that is 'way beyond my reach. Thoughts of others?
Regards,
Terry
Terry Smythe 204-832-3982 (land line)
55 Rowand Avenue 204-981-3229 (cell)
Winnipeg, MB, Canada R3J 2N6 smythe@...
Preserving a unique slice of our Musical Heritage
http://members.shaw.ca/smythe/rebirth.htm