Double Key Entry: Accuracy and costs?
There are essentially two methods to digitize ink-on-paper texts:
manual key entry, and scanning/OCR.
David Sewell, in a prior message to this group,
noted that double key entry, combined with software "diffing" and
software+human reconciliation of differences, is still "the gold
standard for capturing text from hardcopy" with regard to accuracy.
For those not familiar with "double key entry", the text is literally
retyped (or rekeyed) into a text editor, character-by-character, by
two different people. The resulting two digital texts are digitally
compared ("diffed"), and any differences are reconciled by both
software and humans.
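To make the "diffing" step concrete, here is a minimal sketch in Python using the standard difflib module. It assumes the two keyings are plain strings and compares them word by word; the function name diff_keyings and the sample sentences are made up for illustration, and a real workflow would presumably add a reconciliation step for each reported disagreement.

```python
import difflib

def diff_keyings(first: str, second: str):
    """Return (word_index_in_first, words_in_first, words_in_second)
    for every place where the two keyings disagree."""
    words_a = first.split()
    words_b = second.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            disagreements.append(
                (i1, " ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
            )
    return disagreements

# Hypothetical sample data: each typist makes one (different) mistake.
keying_a = "It was the best of times, it was teh worst of times."
keying_b = "It was the best of times, it was the worst of tirnes."

for index, from_a, from_b in diff_keyings(keying_a, keying_b):
    print(f"word {index}: typist A has {from_a!r}, typist B has {from_b!r}")
```

Note that the diff only reports where the two keyings disagree. It cannot say which typist is right, and it will not catch a spot where both typists made the identical mistake.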
Obviously double key entry will lead to very high accuracy when done
right. It can also better handle unusual situations that strain the
capability of OCR engines.
I have a few questions to ask of everyone. Feel free to contribute by
answering any of the following, or to answer other important questions
I do not ask below:
1) What is the typical accuracy of double key entry as explained above?
(Does "triple key entry" offer any benefits when very high accuracy
text is desired, or is that unnecessary?)
2) Of the errors that remain after double key entry, what are they?
Lee Passey suggested to me that oftentimes they would be common
transpositions of letters, which are the bane of most typists. For
example, "teh" instead of "the".
3) If someone hired a commercial company to do double key entry, what
is the typical going rate? (Let's only consider the rekeying stage,
not the digital comparison/processing stage.)
I would presume a lot of rekeying is done in countries with low
labor rates, such as India.
4) Does anyone see a role, as David Sewell suggested in his message
(see the above link), of adding OCR to the mix so as to reduce
costs for a given level of accuracy?
For example, let's consider scanning the text pages, having a single
key entry done, also producing a high-grade OCR version (maybe
itself produced by a combination of OCR engines as David mentioned),
and then "diffing" those two.
- Jon Jermey wrote:
> Accuracy of proofreading can be improved at a minimal cost by an
> intelligent reading of the text. For instance, if 'Mr Comerford' is
> mentioned on page 1 and 'Mr Cornerford' on pages 2 to 10, an intelligent
> English-speaking proofreader can use their knowledge of English
> orthography to decide that 'Comerford' is correct and 'Cornerford' is
> incorrect. If they are properly equipped with software they can then do
> a global change from 'Cornerford' to 'Comerford' and correct what may be
> dozens of errors without even needing to see them. Other global changes
> -- 'tlie' to 'the', for instance -- are even more obvious. In fact I
> have set up a Word macro which globally corrects about thirty common
> errors of this kind, and I run this whenever I start proofing a new text.
The 'Comerford' example is very specific to a particular text. A person
unfamiliar with that text would be at a loss to know which is correct.
But for the 'Mr' I would have guessed it to be the name of a small
village. While changing 'tlie' to 'the' may be the most common
correction, who's to say that the error was not from transposing the
letters in 'tile'? Failing to see these situations does not exactly
produce an intelligent reading. The search can be automated, but each
instance should be given a reality check.
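As a rough sketch of what "automate the search, check each instance" could look like, the Python below lists every whole-word hit of a suspect string with some surrounding context, and then replaces only the hits a human has approved. The function names, the sample sentence and the context width are all assumptions made for illustration; this is not the Word macro mentioned above.

```python
import re

def find_candidates(text: str, suspect: str, context: int = 20):
    """Return (offset, snippet) for each whole-word occurrence of `suspect`,
    so a human can review the hit in context before anything is changed."""
    pattern = re.compile(r"\b" + re.escape(suspect) + r"\b")
    hits = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        hits.append((match.start(), text[start:end]))
    return hits

def apply_approved(text: str, suspect: str, replacement: str, approved_offsets):
    """Replace only the occurrences whose offsets the reviewer approved."""
    pattern = re.compile(r"\b" + re.escape(suspect) + r"\b")

    def choose(match):
        # Offsets refer to positions in the original, unmodified text.
        return replacement if match.start() in approved_offsets else match.group(0)

    return pattern.sub(choose, text)

# Hypothetical sample: two hits should be 'the', the middle one is really 'tile'.
sample = "He laid tlie last tlie on tlie roof."
for offset, snippet in find_candidates(sample, "tlie"):
    print(offset, repr(snippet))

print(apply_approved(sample, "tlie", "the", {8, 26}))
```

The middle hit is deliberately left alone here so that it can be corrected to 'tile' by hand, which is exactly the case a blind global change would get wrong.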
> This is one of my objections to DP - that the same error needs to be
> manually corrected each time, no matter how obvious the alteration is.
That strikes me as a lesser evil. It weighs greater accuracy against
the acceleration of an otherwise tedious task.