Who would have thought it?
- Another reason for putting large numbers of human heads together to solve a massive problem.
Published in the NY Times.
Deciphering Old Texts,
One Woozy, Curvy Word at a Time
By GUY GUGLIOTTA
Published: March 28, 2011
In the old days, anybody interested in seeing a Mets game during a trip
to New York would have to call the team, or write away, or wait to get
to the city and visit the box office. No more. Now, all it takes is to
find an online ticket distributor. Sign in, click “Mets,” pick the date
But before taking the money, the Web site might first present the reader
with two sets of wavy, distorted letters and ask for a transcription.
These things are called Captchas, and only humans can read them.
Captchas ensure that robots do not hack secure Web sites.
What Web readers do not know, however, is that they have also been
enlisted in a project to transform an old book, magazine, newspaper or
pamphlet into an accurate, searchable and easily sortable computer text.
One of the wavy words quite likely came from a digitized image from an
old, musty text, and while the original page has already been scanned
into an online database, the scanning programs made a lot of mistakes.
Mets fans and other Web site users are correcting them. Buy a ticket to
the ballgame, help preserve history.
The set of software tools that accomplishes this feat is called
reCaptcha and was developed by a team of researchers led by Luis von
Ahn, a computer scientist at Carnegie Mellon University.
Its pilot project was to clean up the digitized archive of The New York
Times. Today it has become the principal method used by Google
to authenticate text in Google Books, its vast project to digitize and
disseminate rare and out-of-print texts on the Internet.
Digitization is normally a three-stage process: create a photographic
image of the text, also known as a bitmap; encode the text in a compact,
easily handled and searchable form using optical character recognition
software, commonly called O.C.R.; and, finally, correct the mistakes.
Today’s technology makes the first two steps relatively straightforward.
The third, however, can be extremely difficult. For vintage
19th-century texts in English, O.C.R. programs mess up or miss 10
percent to 30 percent of the words. Only humans can fix the errors. The
standard method, called key and verify, uses two transcribers to type
the text independently and compares the results. This is time-consuming
and extremely expensive.
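
As a rough illustration of key and verify, the comparison step might look like the sketch below. The function name and the assumption that both typists produce the same number of words are simplifications made here for illustration, not details from the article.

```python
def key_and_verify(first, second):
    """Compare two independent transcriptions word by word.

    Returns (position, word_a, word_b) tuples wherever the typists
    disagree. Assumes both texts split into the same number of words,
    which a real workflow could not take for granted.
    """
    words_a = first.split()
    words_b = second.split()
    disagreements = []
    for i, (a, b) in enumerate(zip(words_a, words_b)):
        if a != b:
            disagreements.append((i, a, b))
    return disagreements


# Each disagreement would go back to a person for a final decision.
diffs = key_and_verify("the quick brown fox", "the quiek brown fox")
```

Every word is typed twice even when both typists agree, which is where the time and expense come from.
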
But in 2006, Dr. von Ahn’s team figured out a way around this obstacle.
The ubiquitous Captchas, familiar to even the most casual Web user, were
the perfect tools. Captchas, short for “completely automated public
Turing test to tell computers and humans apart,” are impossible for
machines to decipher, but easy for humans. (The test is named for the
British computer pioneer Alan Turing.)
Dr. von Ahn’s group estimated that humans around the world decode at
least 200 million Captchas per day, at 10 seconds per Captcha. This
works out to about 500,000 hours per day — a lot of applied brainpower
being spent on what Dr. von Ahn regards as a fundamentally mindless
“So we asked, ‘Can we do something useful with this time?’ ” Dr. von Ahn
recalled in a telephone interview. Instead of making Captchas out of
random words printed in a woozy way, why not ask Web users to translate
problem words from archival texts?
By Dr. von Ahn’s estimate, reCaptcha is being used by 70 percent to 90
percent of Web sites that have Captchas — including Ticketmaster, Facebook
and local bank branches.
Google bought Dr. von Ahn’s start-up in 2009 — he will not say how much
it paid — and put it to work on Google Books. He says “several million”
words are being translated every day.
The Times, published since 1851, had already optically transcribed its
archive when it contacted Dr. von Ahn. Robert Larson, the
company’s vice president for search products, said the paper had
“looked at various ways” to edit the text, “but Luis’s method was faster
Page images, particularly those printed before 1900, are loaded with
smudges, stains, watermarks and crooked type, all of which give O.C.R.’s
the fits. To fix the errors, Dr. von Ahn uses a number of programs,
which when applied in the proper sequence magically transform troubled
passages into easy-to-read prose.
The first step is done in-house. Two different O.C.R. programs scan the
photographic image. Both will make mistakes, but not necessarily the same ones.
ReCaptcha flags as “suspicious” any word that is deciphered differently
by the two programs or that does not appear in an English dictionary.
The dictionary catches words that are misspelled the same way by both
O.C.R.’s. Other programs examine the words on either side of the suspect
word and make another guess based on that analysis.
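
The flagging rule described above can be sketched in a few lines. This is only an illustration: the function name, the pre-aligned word lists and the tiny dictionary are assumptions, and the context-analysis programs are not modeled at all.

```python
def flag_suspicious(ocr_a, ocr_b, dictionary):
    """Return the positions of words that need a human reader.

    ocr_a, ocr_b: word lists from two O.C.R. programs for the same line,
    assumed to be already aligned to the same length (a simplification).
    dictionary: a set of known lowercase English words.
    """
    suspicious = []
    for i, (a, b) in enumerate(zip(ocr_a, ocr_b)):
        disagree = a != b
        # Strip surrounding punctuation before the dictionary lookup.
        unknown = a.lower().strip(".,;:!?\"'") not in dictionary
        if disagree or unknown:
            suspicious.append(i)
    return suspicious


words = {"this", "aged", "portion", "of", "society"}
first_pass = ["This", "aged", "portion", "of", "sociely"]
second_pass = ["This", "aged", "pntkm", "of", "sociely"]
flagged = flag_suspicious(first_pass, second_pass, words)
```

Here position 2 is flagged because the two programs disagree, and position 4 because both produce the same misspelling, which only the dictionary check can catch.
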
Then each suspicious word is turned into a Captcha. It is crucial to
understand that the Captcha is a distorted version of the word as
printed in the original photographic image. It is not made from the
O.C.R.’s imagined translation, which is often unintelligible. The
unknown word is then paired with a second Captcha word whose correct
translation is already known. This is the “control.”
Several Web users seeking entry to secure sites are then given both
words and asked to decipher them separately.
A correct answer for the control word proves that the user is a human
and not a machine. Answers for the unknown word are compared with the
O.C.R. guesses and the context analysis. If the system is satisfied that
the answer is correct, then the game is over.
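
A minimal sketch of that check follows. The article does not say how many matching answers are enough, so the `min_votes` threshold and the rule that one human answer backed by an O.C.R. guess suffices are hypothetical choices, as are the function names.

```python
from collections import Counter


def passes_control(answer, control_word):
    """A user counts as human only if the control word is typed correctly."""
    return answer.strip().lower() == control_word.lower()


def resolve_unknown(responses, control_word, ocr_guesses, min_votes=2):
    """Decide on a transcription for the unknown word.

    responses: (control_answer, unknown_answer) pairs from different users.
    Returns the accepted word, or None if the word is still unresolved.
    min_votes and the O.C.R. tie-in are assumptions made for illustration.
    """
    votes = Counter(
        unknown.strip().lower()
        for control, unknown in responses
        if passes_control(control, control_word)
    )
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    guesses = {g.lower() for g in ocr_guesses}
    # Accept on enough human agreement, or on a human answer
    # that matches one of the O.C.R. guesses.
    if count >= min_votes or word in guesses:
        return word
    return None


answers = [("overlooks", "morning"), ("overlooks", "morning"), ("xxxx", "moming")]
accepted = resolve_unknown(answers, "overlooks", ["rnorning", "moming"])
```

The third answer is ignored because that user failed the control word, so the two remaining votes settle on "morning".
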
Dr. von Ahn acknowledged that some words cannot be transcribed, usually
because the original text is torn or damaged in some other way. If
enough users fail to identify an unknown, the word is deemed to be
indecipherable and is marked as such.
ReCaptcha also fails badly on cursive, Dr. von Ahn said, adding that
“nobody reads handwriting anymore.” And reCaptcha so far translates only
English words, even though many reCaptcha Web sites have overseas
clients whose users are not necessarily English speakers.
With all these constraints, reCaptcha nevertheless achieves an accuracy
rate above 99 percent, which compares favorably with professional human
transcribers. And Dr. von Ahn is convinced that performance will improve
with experience, of which there will be no shortage.
“We’ll be going for a long time,” he said. “There’s a lot of printed
material out there.”